Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 20

Visual & Visual-Inertial Odometry

Your drone has no GPS and must navigate a 200-meter loop entirely from its own sensors. A camera gives you rich scene detail but accumulates small pose errors that compound into a trajectory that misses the start by meters. Strap on an IMU, fuse the two, and suddenly drift shrinks by an order of magnitude and you recover metric scale. This lesson builds the full VO pipeline — feature tracking, relative-pose chaining, drift math — then adds the IMU: what it measures, why it drifts catastrophically alone, and how preintegration fuses hundreds of IMU readings between keyframes into a single constraint. MIT 16.485 by Luca Carlone, Lecture 20.

Prerequisites: VNAV L6 (feature detection & tracking) · VNAV L7 (two-view geometry, essential matrix) · VNAV L8 (RANSAC) · VNAV L10 (on-manifold optimization).
10 Chapters · 5 Live Canvases · Hand-Derived Numbers

Chapter 0: No GPS — Navigate from Your Own Eyes

It's 2019. A DJI drone is sent to inspect a collapsed building. GPS is blocked by steel and concrete. The drone has one camera pointing forward and one small chip — an IMU — that measures acceleration and rotation at 200 Hz. That's it. How does it know where it is?

The same problem appears everywhere: a self-driving car in an underground garage, a robot in a warehouse without WiFi, a Mars rover 20 light-minutes from the nearest uplink. The answer in all of these cases is odometry — building a trajectory estimate step-by-step from your own sensor readings, without any external reference.

Visual odometry (VO) does this from a camera: track features across frames, recover the relative pose between each pair of frames via the epipolar constraint (L7), and chain those relative poses into a growing trajectory. Visual-inertial odometry (VIO) fuses the camera with an IMU for robustness and metric scale.

The fundamental tension: VO gives you a trajectory, but errors in each relative pose accumulate. Over a 200 m loop, a 1% relative error per step compounds into a trajectory that misses the start position by about 2 m, even though every single relative-pose estimate was "pretty good." This is drift, and it's the central challenge of odometry.

Why not just use GPS?

GPS needs line-of-sight to 4+ satellites and updates at only 1–10 Hz with 2–5 m accuracy. Indoors it fails completely. A camera at 30 Hz with sub-pixel feature tracking gives you millimeter-scale relative pose changes between frames. An IMU at 200 Hz gives you continuous high-rate motion updates. Both are self-contained. Together they outperform GPS in short-range accuracy — if you can stop the drift.

The pipeline at a glance

Image stream: camera at 30 Hz, typically grayscale 640×480
Feature track (L6): KLT/Harris, the same 3D points seen across frames
Relative pose (L7+L8): 5-pt + RANSAC gives Rk, tk from frame k to k+1 (up to scale)
↓ chain SE(3)
Trajectory: T0, T1, ..., TN, the drone's path through the world
↻ drift accumulates
You run visual odometry on a 300-frame flight. At each frame the estimated position has a 0.5% relative error. After 300 frames, roughly how far off is the estimated final position if the total path length is 100 m?

Chapter 1: The VO Pipeline — Front-End and Back-End

A complete VO system has two conceptually separate parts. The front-end touches raw pixels: it detects features, tracks them across frames, and computes noisy relative-pose measurements. The back-end takes those measurements and finds the globally consistent trajectory via optimization (or filtering). The split matters because the front-end is fast but noisy; the back-end is slow but accurate.

Front-end: feature tracking and relative pose

At time k the camera captures frame Ik. The front-end detects corners with Harris or Shi-Tomasi (L6), then tracks them into frame Ik+1 with KLT optical flow. You now have a set of point correspondences {(pi, p'i)} — the same 3D point seen from two camera positions. Plug those into the 5-point algorithm (or 8-point with normalization) to estimate the essential matrix E = [t]× R (L7). RANSAC (L8) removes outliers. Decompose E into (R, t) — four hypotheses, cheirality picks the right one.
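As a sanity check on the geometry above, the following sketch builds E = [t]× R from a known relative pose and confirms that synthetic correspondences satisfy the epipolar constraint. The pose values, the helper names (`skew`, `rot_y`), and the convention X2 = R·X1 + t are assumptions made for illustration; a real front-end would estimate E from tracked pixels with a 5-point solver inside RANSAC rather than build it from ground truth.

```python
import numpy as np

def skew(v):
    """Return the cross-product matrix [v]x, so that skew(v) @ u == np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rot_y(angle_rad):
    """Rotation about the camera y axis (hypothetical helper for this demo)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Assumed ground-truth relative pose: points map as X2 = R @ X1 + t.
R = rot_y(np.deg2rad(5.0))
t = np.array([0.2, 0.0, 1.0])
t_unit = t / np.linalg.norm(t)        # only the direction of t is recoverable
E = skew(t_unit) @ R

# Synthetic 3D points in front of camera 1 (fixed seed for repeatability).
rng = np.random.default_rng(0)
X1 = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 10.0], size=(50, 3))
X2 = X1 @ R.T + t

# Normalized image coordinates (x, y, 1) in each view.
x1 = X1 / X1[:, 2:3]
x2 = X2 / X2[:, 2:3]

# Epipolar residual x2^T E x1 for every correspondence; zero up to round-off.
residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
max_residual = float(np.max(np.abs(residuals)))
print("max |x2^T E x1| =", max_residual)
```

Note that E was built from the unit-length translation while the points were generated with the full-length one, and the residuals still vanish: the constraint carries no information about the length of t.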

The relative pose is only determined up to scale. From a single camera with no prior knowledge of scene depth, you cannot tell if the camera moved 1 cm past a small cube or 1 m past a large room — both produce identical pixel correspondences. This is the monocular scale ambiguity, and it means the translation t from the 5-point algorithm has unit length by convention. Chapter 3 will show the exact math.

Back-end: trajectory building

The back-end chains relative poses. Given relative transform Tk,k+1 (pose of frame k+1 expressed in frame k), the absolute pose of frame k+1 in the world frame is:

Tw,k+1 = Tw,k · Tk,k+1

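where Tw,k is the pose of camera frame k expressed in the world frame; it maps points from camera-k coordinates into world coordinates (camera-to-world, not world-to-camera). This is SE(3) matrix multiplication, the on-manifold composition from L10.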

Worked example: chaining two relative poses

Suppose the drone starts at the world origin T0 = I4. Between frames 0 and 1 it moves forward 1 m and turns 5° right. Between frames 1 and 2 it moves forward 1 m with no turn. In the (R | t) notation:

python
import numpy as np
from scipy.spatial.transform import Rotation

# Relative pose: frame 0 → 1 (5° yaw right, 1 m forward)
R01 = Rotation.from_euler('z', -5, degrees=True).as_matrix()
t01 = np.array([1.0, 0.0, 0.0])  # 1 m forward in frame-0 coords

# Relative pose: frame 1 → 2 (no turn, 1 m forward)
R12 = np.eye(3)
t12 = np.array([1.0, 0.0, 0.0])

# Chain: world pose of frame 2
# T_world_frame2 = T_world_frame0 @ T_01 @ T_12
# Starting at identity (T_world_frame0 = I)

# Position of frame 1 in world
p1 = t01  # = [1, 0, 0]

# Position of frame 2 in world = p1 + R01 @ t12
p2 = p1 + R01 @ t12
print("Frame 1 position:", np.round(p1, 4))   # [1.0, 0.0, 0.0]
print("Frame 2 position:", np.round(p2, 4))
# [1 + cos(-5°), 0 + sin(-5°), 0]
# = [1 + 0.9962, 0 + (-0.0872), 0]
# = [1.9962, -0.0872, 0.0]
# The drone veered slightly right on the second step — the 5° turn propagated!

This is the essence of the back-end: each step rotates the new relative translation by the accumulated world rotation before adding it to the position, so previous rotations shape all future translations. An error in an early rotation therefore displaces every later position.
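To see how a single early rotation error displaces every later position, the sketch below chains planar (x, y, heading) poses with the same composition rule. The planar simplification, the `compose` and `run` helpers, and the 1° error are assumptions chosen for illustration.

```python
import math

def compose(pose, rel):
    """Chain a planar pose (x, y, heading) with a relative motion (dx, dy, dheading).

    The relative translation is expressed in the current body frame, so it is
    rotated by the current heading before being added to the position.
    """
    x, y, th = pose
    dx, dy, dth = rel
    return (x + math.cos(th) * dx - math.sin(th) * dy,
            y + math.sin(th) * dx + math.cos(th) * dy,
            th + dth)

def run(first_turn_deg, n_steps=20, step_m=1.0):
    """Fly n_steps straight segments; only the first step contains a turn."""
    pose = (0.0, 0.0, 0.0)
    for k in range(n_steps):
        turn = math.radians(first_turn_deg) if k == 0 else 0.0
        pose = compose(pose, (step_m, 0.0, turn))
    return pose

true_pose = run(first_turn_deg=0.0)   # ground truth: perfectly straight flight
est_pose = run(first_turn_deg=1.0)    # estimate: 1 degree rotation error in step 0

end_error_m = math.hypot(est_pose[0] - true_pose[0], est_pose[1] - true_pose[1])
print(f"end-point error after 20 m: {end_error_m:.3f} m")
```

Every later step is individually perfect here, yet the end point is off by about 0.33 m, because the one wrong heading rotates all 19 translations that follow it.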

VO Pipeline Visualized — Frames → Features → Poses → Trajectory

Animated illustration of the VO front-end and back-end. Watch features appear in each frame, get matched, yield a relative pose arrow, and chain into a growing trajectory.

In the VO back-end, you have the world pose Tw,k and the new relative measurement Tk,k+1. What is the correct formula for Tw,k+1?

Chapter 2: Drift — Why Small Errors Become Big Problems

Every relative-pose estimate carries a small error: the features matched had sub-pixel noise, RANSAC didn't find every inlier, the essential matrix decomposition had numerical error. In isolation, each error is tiny. But the back-end composes SE(3) transforms, so every error is carried into all later poses and is never corrected; a rotation error in particular bends the whole remaining trajectory.

The drift math

Suppose each step introduces a position error of ε meters (absolute) along the direction of motion. After N steps of length d, the accumulated error depends on how the per-step errors correlate. In the best case (independent errors that partially cancel) it grows as O(ε√N). In the worst case (all errors in the same direction, e.g., a systematic heading bias) it grows as O(εN). In practice VO drift is typically 0.1–2% of total path length: the error grows roughly linearly with distance traveled.

drift ≈ εrel × total path length

where εrel is the fractional per-step error (typically 0.5–1% for a good front-end).
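The two growth rates can be checked numerically. The sketch below sums per-step errors along a 1D path, once with a constant offset and once with zero-mean Gaussian errors; the step count, error size, trial count, and the `final_error` helper are assumed values and names, not part of the lesson.

```python
import math
import random

def final_error(n_steps, eps, systematic):
    """Sum per-step position errors along a 1D path.

    systematic=True : every step is off by +eps (for example a constant scale bias).
    systematic=False: every step is off by a zero-mean Gaussian with std eps.
    """
    if systematic:
        return sum(eps for _ in range(n_steps))
    return sum(random.gauss(0.0, eps) for _ in range(n_steps))

random.seed(0)
N, EPS, TRIALS = 400, 0.01, 2000     # 400 steps, 1 cm error per step (assumed)

worst_case = final_error(N, EPS, systematic=True)
rms_random = math.sqrt(
    sum(final_error(N, EPS, systematic=False) ** 2 for _ in range(TRIALS)) / TRIALS
)

print(f"systematic: {worst_case:.2f} m   (eps * N       = {EPS * N:.2f} m)")
print(f"random RMS: {rms_random:.2f} m   (eps * sqrt(N) = {EPS * math.sqrt(N):.2f} m)")
```

With these numbers the systematic case lands at ε·N = 4 m while the random case stays near ε·√N = 0.2 m, which is why removing systematic errors (calibration, bias estimation) matters more than reducing noise.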

Worked numbers

Path = 200 m loop. Per-step error εrel = 1%. Expected drift ≈ 2 m. The drone started at a door, flew 200 m around a room, and estimates it's 2 m from where it started. The loop doesn't close. That's drift.

Drift has no absolute reference to correct against. VO only knows relative measurements — "I moved this much from my last position." There's nothing to anchor the trajectory globally. The only fix is a loop closure (recognizing a previously visited place and correcting the trajectory to make the loop consistent) or an external reference like GPS. This is exactly why SLAM (L13) exists: it detects loop closures and eliminates accumulated drift.

Keyframes reduce drift (a little)

Running the optimization back-end over a sliding window of recent keyframes (selected frames that maximize information gain) helps. Bundle adjustment over N keyframes re-estimates all poses and map points jointly, which spreads and reduces error compared to purely sequential pose chaining. But it doesn't eliminate drift — it only slows its growth. The trajectory still drifts because the window doesn't extend back to the start.

Drift Accumulation — Adjust Error Per Step and Steps

Grey = ground truth circular path. Warm = VO estimated path. Watch the estimated path peel away and fail to close the loop as error or step count increases.

A VO system has 0.5% per-step position error. Over a 400-meter path, what is the approximate drift at the end?

Chapter 3: Scale Ambiguity — the Monocular Blind Spot

The 5-point algorithm recovers R and t from pixel correspondences. R is determined fully. But t is determined only up to a positive scale factor: if (R, t) is consistent with the observed correspondences, so is (R, λt) for any λ > 0. You can see this directly from the essential matrix: scaling t scales E, since [λt]×R = λ[t]×R = λE, and the epipolar constraint p'ᵀ E p = 0 is homogeneous, so λE satisfies it whenever E does. The constraint never pins the scale.

E = [t]× R   →   [λt]× R = λE satisfies the same constraint for any λ > 0

This is not a numerical issue — it's a fundamental geometric one. A single camera is a projective device: it collapses 3D onto 2D and throws away all absolute depth information. Without knowing how big the scene is, you can't know how far you moved.
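A minimal sketch of this ambiguity, assuming a unit-focal-length pinhole camera with no rotation and made-up scene coordinates: scaling both the scene and the camera translation by the same factor leaves every pixel unchanged. The `project` and `observe` helpers are hypothetical names.

```python
def project(point, cam_pos):
    """Pinhole projection (focal length 1) for a camera at cam_pos looking along +z,
    with no rotation. Returns normalized image coordinates (u, v)."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    return (x / z, y / z)

def observe(points, baseline):
    """Pixels seen from frame 0 (at the origin) and frame 1 (moved by baseline)."""
    origin = (0.0, 0.0, 0.0)
    return [(project(p, origin), project(p, baseline)) for p in points]

# A small scene with a short move, and the same scene blown up 100x (assumed values).
small_scene = [(0.1, 0.05, 0.5), (-0.2, 0.1, 0.8), (0.05, -0.15, 0.6)]
small_move = (0.01, 0.0, 0.002)                        # about 1 cm
LAMBDA = 100.0
big_scene = [tuple(LAMBDA * c for c in p) for p in small_scene]
big_move = tuple(LAMBDA * c for c in small_move)       # about 1 m

pixels_small = observe(small_scene, small_move)
pixels_big = observe(big_scene, big_move)

# Compare every pixel coordinate in both frames across the two worlds.
max_diff = max(
    abs(a - b)
    for obs_s, obs_b in zip(pixels_small, pixels_big)
    for px_s, px_b in zip(obs_s, obs_b)
    for a, b in zip(px_s, px_b)
)
print("largest pixel difference between the two worlds:", max_diff)
```

The two image pairs agree to round-off, so no algorithm that only sees pixels can tell the 1 cm move from the 1 m move.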

The scale factor propagates

In sequential VO, you can chain the up-to-scale relative translations if you assume a fixed scale: set |t0,1| = 1 (the baseline unit) and express all subsequent translations in the same unit. This gives an internally consistent but arbitrarily scaled trajectory. If you flew 10 m total but your unit baseline was actually 0.1 m, your trajectory reports 100 "units" but doesn't know those units are 0.1 m each. The shape is correct; the size is unknown.

Monocular VO cannot recover metric scale. The trajectory is only determined up to a global scale factor. If someone tells you your trajectory is "100 units long," you still don't know if you flew 1 m, 10 m, or 100 m. Everything in the map — feature depths, trajectory length — is off by the same unknown constant. Adding an IMU, a stereo baseline, or any known-size object in the scene immediately fixes this, because those provide absolute metric measurements.

How scale is recovered in practice

| Method | How it fixes scale | Notes |
|---|---|---|
| Stereo camera | Known baseline b between cameras; triangulation gives metric depth | Most reliable; b must be calibrated |
| IMU (VIO) | Accelerometer integrates to metric displacement between frames | Chapters 6–7; preferred for drones |
| Known object | ArUco marker, road lane width, etc. | Fragile if object absent |
| GPS fused | GPS gives absolute position; scale follows from two GPS fixes | Fails indoors |
| Wheel odometry | Encoder counts × wheel circumference = metric translation | Ground robots only; slippage error |

Scale Ambiguity — Up-to-Scale vs Metric Trajectory

Drag the scale slider. The teal path is ground truth (metric). The warm path is monocular VO — same shape, unknown scale. The green path (IMU-fixed) snaps to metric. Watch how the scale factor λ stretches or compresses the VO path.

A monocular VO system estimates a trajectory that is 47 "units" long. The true path was 9.4 m long. What is the unknown scale factor λ (units per meter)?

Chapter 4: The IMU — Fast, Metric, and Drifting

An Inertial Measurement Unit (IMU) is a tiny chip containing two sensors: an accelerometer and a gyroscope. Together they measure the complete 6-DoF motion of the body they're attached to, at very high rate (100–400 Hz typically). On a phone, a quadrotor, a car — every device with autonomous motion has at least one.

What the accelerometer measures

The accelerometer measures specific force: the sum of all forces except gravity, divided by mass. In other words, it measures acceleration relative to free-fall. If the IMU is at rest on a table, the accelerometer reads [0, 0, 9.81] m/s² upward (the table pushes back against gravity). If it's in free fall, it reads zero. To get true linear acceleration you subtract this gravity reaction:

atrue = ameasured − Rᵀ gworld

where R is the current orientation (Rᵀ rotates gworld from the world frame into the body frame) and gworld = [0, 0, 9.81]ᵀ m/s². Note the sign: here gworld points up, matching the at-rest reading, so it is subtracted. Chapter 5 uses the opposite convention, g = [0, 0, −9.81]ᵀ pointing down, and adds it.
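The subtraction can be sketched for a stationary but tilted IMU. The 30° roll, the `rot_x` helper, and the simulated reading are assumptions for illustration; R is taken as the body-to-world rotation, so its transpose maps the world-frame vector into the body frame.

```python
import numpy as np

G_WORLD = np.array([0.0, 0.0, 9.81])   # upward gravity reaction, world frame (z up)

def rot_x(angle_rad):
    """Body-to-world rotation for a roll about the x axis (hypothetical helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def true_acceleration(a_measured, R_world_body):
    """Remove the gravity reaction from a body-frame accelerometer reading."""
    return a_measured - R_world_body.T @ G_WORLD

# IMU at rest but rolled 30 degrees: it only feels the table pushing up.
R = rot_x(np.deg2rad(30.0))
a_meas = R.T @ G_WORLD                  # simulated reading, split across body y and z
a_true = true_acceleration(a_meas, R)

# Using the wrong orientation (assuming the IMU is level) leaves a phantom acceleration.
a_wrong = true_acceleration(a_meas, np.eye(3))

print("measured        :", np.round(a_meas, 3))
print("true accel      :", np.round(a_true, 3))
print("with level guess:", np.round(a_wrong, 3))
```

With the correct orientation the result is zero, as it should be for a stationary sensor. With the level guess, about 4.9 m/s² of gravity shows up as sideways acceleration, which is why orientation accuracy matters so much in the chapters that follow.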

What the gyroscope measures

The gyroscope measures angular velocity ω in the body frame — the instantaneous rotation rate around each body axis. Integrate ω over time to get orientation change. Compose with the current orientation to update it:

Rk+1 = Rk · Exp(ωk Δt)

This is the on-manifold SO(3) update from L10: take a step ωΔt in the tangent space, exponentiate it onto the manifold, compose with the current rotation.
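A minimal sketch of this update, using Rodrigues' formula for Exp. The 200 Hz rate, the constant yaw rate, and the `so3_exp` name are assumptions; the gyro is taken as bias-free and noise-free here.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues' formula: map a rotation vector phi (axis * angle, rad) to a rotation matrix."""
    angle = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if angle < 1e-10:
        return np.eye(3) + K             # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * (K @ K))

DT = 1.0 / 200.0                          # 200 Hz gyro (assumed rate)
omega = np.array([0.0, 0.0, np.pi / 2])   # constant 90 deg/s yaw rate, body frame

R = np.eye(3)
for _ in range(200):                      # integrate for 1 second
    R = R @ so3_exp(omega * DT)           # R_{k+1} = R_k * Exp(omega * dt)

yaw_deg = float(np.degrees(np.arctan2(R[1, 0], R[0, 0])))
print(f"yaw after 1 s: {yaw_deg:.3f} deg")
```

Because each increment is exponentiated onto SO(3) before being composed, R stays a valid rotation matrix after every step; adding ωΔt to the matrix entries directly would not preserve that.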

Bias and noise — the killers

Every IMU reading is corrupted by two things:

| Corruption | Symbol | What it does | Effect on integration |
|---|---|---|---|
| Additive noise | na, ng | Zero-mean white noise at each sample | Random walk: grows as √T |
| Bias | ba, bg | Slowly drifting DC offset | Constant drift: grows linearly with T |

The bias is insidious. If the accelerometer has a bias of ba = 0.1 m/s², then integrating once to get velocity adds a velocity error of ba·Δt at every step, which sums to ba·T after a total time T. Integrating again to get position gives a position error of ½ba·T². Over 10 seconds at 0.1 m/s² bias, the position error is ½ × 0.1 × 10² = 5 m. This is why IMU-only dead reckoning fails within seconds.

Double integration amplifies bias quadratically. Accelerometer bias ba causes velocity error that grows as ba·T and position error that grows as ½ba·T². With a typical MEMS bias of 0.05 m/s², in 60 seconds the position error is ½×0.05×3600 = 90 m. This isn't a corner case; it's physics. The IMU cannot be used alone for positioning beyond a few seconds.
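The ½·ba·T² figures can be reproduced by integrating the bias sample by sample. This is a sketch under the assumption of a 200 Hz IMU and a perfectly constant bias; `position_error_from_bias` is a hypothetical helper name.

```python
def position_error_from_bias(bias, duration_s, rate_hz=200.0):
    """Double-integrate a constant accelerometer bias with the per-sample update
    p += v*dt + 0.5*b*dt^2, v += b*dt, and return the final position error (m)."""
    dt = 1.0 / rate_hz
    v_err, p_err = 0.0, 0.0
    for _ in range(int(round(duration_s * rate_hz))):
        p_err += v_err * dt + 0.5 * bias * dt * dt
        v_err += bias * dt
    return p_err

for bias, duration in [(0.1, 10.0), (0.05, 10.0), (0.05, 60.0)]:
    numeric = position_error_from_bias(bias, duration)
    closed_form = 0.5 * bias * duration ** 2
    print(f"b={bias} m/s^2, T={duration} s: "
          f"numeric {numeric:.3f} m, closed form {closed_form:.3f} m")
```

The numeric and closed-form values agree (5 m, 2.5 m, and 90 m for the three cases), and the jump from 2.5 m to 90 m when T goes from 10 s to 60 s shows the quadratic growth directly: six times the duration, thirty-six times the error.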
An IMU accelerometer has a bias ba = 0.05 m/s². You double-integrate the IMU readings to estimate position for 10 seconds with no correction. What is the approximate position error due to this bias alone?

Chapter 5: IMU Dead-Reckoning — Fast but Blind

Dead reckoning means estimating your position by integrating known velocities or accelerations forward from a known starting point — no landmarks, no GPS, just pure integration. With an IMU, you integrate twice: acceleration to velocity, velocity to position. It's conceptually simple and blazing fast, but the errors are catastrophic on their own.

The three integration equations

At time step k, with Δt between samples, the IMU propagates the state (R, v, p) — orientation, velocity, position — as:

Rk+1 = Rk · Exp((ωk − bg) Δt)
vk+1 = vk + (Rk(ak − ba) + g) Δt
pk+1 = pk + vk Δt + ½ (Rk(ak − ba) + g) Δt²

where g = [0, 0, −9.81]ᵀ m/s² in world frame. The key insight: every line depends on Rk, which accumulates its own errors. Orientation error pollutes velocity, which pollutes position.
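The three equations translate almost line for line into a propagation step. The sketch below is one possible transcription, not a reference implementation: the function names, the 200 Hz rate, and the test scenario (a level vehicle at constant 5 m/s whose accelerometer carries an uncorrected 0.1 m/s² forward bias) are assumptions.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])          # gravity in the world frame, z up

def so3_exp(phi):
    """Rodrigues' formula: rotation vector (rad) to rotation matrix."""
    angle = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if angle < 1e-10:
        return np.eye(3) + K
    return (np.eye(3) + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * (K @ K))

def propagate(R, v, p, gyro, accel, b_g, b_a, dt):
    """One IMU step: update position and velocity with the old R, then rotate."""
    a_world = R @ (accel - b_a) + G
    p_next = p + v * dt + 0.5 * a_world * dt**2
    v_next = v + a_world * dt
    R_next = R @ so3_exp((gyro - b_g) * dt)
    return R_next, v_next, p_next

DT = 1.0 / 200.0                          # assumed 200 Hz IMU
N_STEPS = 2000                            # 10 seconds
true_bias = np.array([0.1, 0.0, 0.0])     # actual sensor bias, unknown to the estimator
gyro_reading = np.zeros(3)
accel_reading = np.array([0.0, 0.0, 9.81]) + true_bias   # level, constant speed

R, v, p = np.eye(3), np.array([5.0, 0.0, 0.0]), np.zeros(3)
for _ in range(N_STEPS):
    # The estimator believes the biases are zero, so the bias goes uncorrected.
    R, v, p = propagate(R, v, p, gyro_reading, accel_reading,
                        b_g=np.zeros(3), b_a=np.zeros(3), dt=DT)

print("estimated position after 10 s:", np.round(p, 3))
print("estimated velocity after 10 s:", np.round(v, 3))
```

The estimate ends at 55 m along x with a velocity of 6 m/s, although the vehicle really travelled 50 m at 5 m/s. Passing `b_a=true_bias` instead removes the error, which is the effect that joint bias estimation in VIO is after.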

Worked drift example: 1D car

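Car drives straight for 10 s at 5 m/s (should end at 50 m). At constant speed an ideal accelerometer would read 0 along the direction of travel, but this one has a bias ba = 0.1 m/s²:

velocity error after 10 s = ba·T = 0.1 × 10 = 1 m/s
position error after 10 s = ½ba·T² = ½ × 0.1 × 10² = 5 m

The car ends up at an estimated 55 m instead of the true 50 m: 10% error after just 10 seconds, with only a 0.1 m/s² bias. And real IMU bias is often larger.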

Bias is the enemy, not noise. Random noise averages out (sort of): after N steps its contribution to velocity grows only as √N, and its contribution to position as N^(3/2), which is still slower than the N² growth caused by a constant bias. Bias never averages out. It's a systematic offset that compounds every integration step. VIO estimates bias jointly with the trajectory, and that's the key innovation.

Orientation-error coupling

Now add a gyroscope bias bg = 0.01 rad/s. After 10 s, the orientation error is 0.1 rad ≈ 5.7°. In the velocity integration, Rk rotates the accelerometer reading into world frame. A 5.7° orientation error means gravity (9.81 m/s²) leaks into the horizontal acceleration at sin(5.7°) × 9.81 ≈ 0.97 m/s². This is about 10× larger than the original accelerometer bias! Orientation error catastrophically amplifies position drift.
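The leak can be integrated to see what it does to position. The sketch below assumes a 200 Hz IMU, a single tilt axis, and no other error source; under a small-angle approximation the leaked acceleration is g·bg·t, so the position error should follow roughly g·bg·T³/6.

```python
import math

G = 9.81            # m/s^2
B_GYRO = 0.01       # rad/s gyro bias (value from the text)
DT = 1.0 / 200.0    # assumed 200 Hz IMU
N_STEPS = 2000      # 10 seconds

tilt, v_err, p_err = 0.0, 0.0, 0.0
for _ in range(N_STEPS):
    leak = G * math.sin(tilt)            # gravity misread as horizontal acceleration
    p_err += v_err * DT + 0.5 * leak * DT * DT
    v_err += leak * DT
    tilt += B_GYRO * DT                  # orientation error grows linearly with time

final_leak = G * math.sin(tilt)
small_angle_estimate = G * B_GYRO * 10.0 ** 3 / 6.0

print(f"tilt error after 10 s : {math.degrees(tilt):.2f} deg")
print(f"leak after 10 s       : {final_leak:.3f} m/s^2")
print(f"position error        : {p_err:.1f} m")
print(f"g * b_g * T^3 / 6     : {small_angle_estimate:.1f} m")
```

The position error comes out at roughly 16 m, about three times the 5 m that the 0.1 m/s² accelerometer bias produced over the same 10 s, and it grows with the cube of time rather than the square.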

IMU Dead-Reckoning — Watch Position Diverge

Press Go to integrate a noisy IMU. Grey = true trajectory (straight line). Red = IMU-only estimate. Toggle bias and noise levels to see how quickly position drifts.

Why does gyroscope bias cause position error that grows faster than the bias × time would suggest?

Chapter 6: Why Fuse Vision + IMU — Complementary Failure Modes

The camera and the IMU fail in completely different ways. This is the key observation that makes VIO so powerful: their weaknesses are each other's strengths.

| Property | Camera alone (VO) | IMU alone | VIO (fused) |
|---|---|---|---|
| Update rate | ~30 Hz (slow) | 100–400 Hz (fast) | IMU rate, corrected by camera |
| Metric scale | Unknown (up-to-scale) | Metric (m/s² is metric) | Metric ✓ |
| Gravity direction | Unknown | Measures g directly | Known ✓ |
| Short-term accuracy | Good (feature-rich scenes) | Very good (sub-ms) | Excellent ✓ |
| Long-term drift | ~0.5–2% of path | Diverges in seconds | ~0.1–0.5% of path |
| Motion blur | Fails (features lost) | Unaffected | IMU bridges dropout ✓ |
| Low texture | Fails (no features) | Unaffected | IMU bridges ✓ |
| Initialization | Instant (any scene) | Needs initial alignment | Requires motion excitation |

The IMU provides three critical "free gifts" to the camera:
1. Metric scale — the IMU measures in real SI units, pinning the scale of the camera's up-to-scale trajectory.
2. Gravity direction — knowing g immediately constrains 2 of the 3 orientation degrees of freedom (roll and pitch are fully observable from the accelerometer at rest).
3. Motion bridging — when the camera fails (blur, darkness, textureless walls), the IMU keeps propagating the state at high rate, providing a good initial guess for when the camera recovers.
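The second gift can be made concrete: at rest, the accelerometer reading alone gives roll and pitch. The sketch below assumes a z-up world, a reading of +g along body z when level, and yaw-pitch-roll (ZYX) Euler angles; other conventions change the signs. Both helper names are hypothetical.

```python
import math

def roll_pitch_from_accel(ax, ay, az):
    """Roll and pitch (rad) from a stationary body-frame accelerometer reading.

    Yaw does not appear anywhere: rotating about the gravity axis leaves the
    reading unchanged, so heading cannot be recovered this way.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return roll, pitch

def accel_at_rest(roll, pitch, g=9.81):
    """Simulated stationary reading for a given attitude (same conventions)."""
    return (-g * math.sin(pitch),
            g * math.sin(roll) * math.cos(pitch),
            g * math.cos(roll) * math.cos(pitch))

true_roll, true_pitch = math.radians(12.0), math.radians(-7.0)
est_roll, est_pitch = roll_pitch_from_accel(*accel_at_rest(true_roll, true_pitch))

print(f"roll  {math.degrees(est_roll):.2f} deg, pitch {math.degrees(est_pitch):.2f} deg")
```

This only holds while the sensor is not accelerating; in motion the reading mixes gravity with true acceleration, which is one reason the estimate is refined jointly with the camera.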

The camera provides three critical gifts to the IMU

1. Bias correction: the optimizer estimates the IMU biases ba, bg jointly with the trajectory. When the camera pins the trajectory accurately, the optimizer can infer what bias would explain the discrepancy between visual and inertial motion, and correct it.
2. Long-term stability: the camera's feature matches tie the trajectory to fixed scene points, which prevents unbounded growth of IMU error.
3. Orientation observability: yaw is not observable from the accelerometer (rotating about the gravity axis leaves its reading unchanged), and integrating the gyroscope lets yaw drift with the gyro bias. Camera features give yaw information via the homography/essential matrix.

Loosely coupled vs. tightly coupled

Loosely coupled VIO: run VO separately on the camera (output: relative pose Tk,k+1 with covariance). Run IMU integration separately (output: a predicted relative pose Tk,k+1 from the IMU). Fuse the two estimates in an EKF or factor graph. Simple but suboptimal: errors in VO are opaque to the IMU subsystem.

Tightly coupled VIO: the optimizer jointly minimizes reprojection errors (from camera feature tracks) and IMU preintegration residuals in a single factor graph over poses, velocities, biases, and 3D landmark positions. This is harder to implement but produces the best accuracy. MSCKF, OKVIS, ORB-VIO, and VINS-Mono are all tightly coupled.

Why can't a monocular camera alone (without IMU or stereo) determine whether a drone moved 1 m forward or 2 m forward, even with perfect feature tracking?

Chapter 7: IMU Preintegration — One Constraint from Many Readings

Between two camera keyframes there are typically many IMU readings: if the camera runs at 30 Hz and the IMU at 200 Hz, that's about 6–7 IMU samples per camera frame, and keyframes are usually several frames apart, so the count between keyframes is larger still. The back-end optimizer works in terms of keyframe poses. How do you incorporate all those IMU readings efficiently?

Naive approach: re-integrate all IMU readings from the first keyframe whenever you want to compute the constraint between keyframes i and j. Problem: if the optimizer changes the estimate at keyframe i (as it will, repeatedly), you'd have to redo the entire integration chain — an O(N) cost per optimization step over N IMU samples.

Preintegration: recognize that the relative motion between two keyframes can be summarized as a single measurement — the preintegrated IMU measurement — that depends only on the IMU readings between those keyframes, not on the absolute pose at the start. When the optimizer changes the keyframe poses, the preintegrated constraint stays fixed; only the residual changes. This is O(1) per optimization step.

The key insight of preintegration: Factor out the global rotation. The IMU integration from keyframe i to j can be written as:
ΔRij, Δvij, Δpij — the relative rotation, velocity change, and position change in the local body frame at keyframe i. These three quantities depend only on the raw IMU readings between i and j (and the bias estimates), not on the absolute world-frame poses. They can be computed once and stored.

What the preintegrated measurement looks like

For IMU readings ωk (gyro) and ak (accel) at times tk between keyframes i and j:

ΔRij = ∏_{k=i}^{j−1} Exp((ωk − bg) Δt)
Δvij = ∑_{k=i}^{j−1} ΔRik (ak − ba) Δt
Δpij = ∑_{k=i}^{j−1} [ Δvik Δt + ½ ΔRik (ak − ba) Δt² ]

These are computed once in preprocessing. The residuals in the factor graph are then:

rR = Log(ΔRijᵀ · Riᵀ Rj)
rv = Riᵀ(vj − vi − g Δtij) − Δvij
rp = Riᵀ(pj − pi − vi Δtij − ½ g Δtij²) − Δpij
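The sums and residuals above can be sketched as follows. This is an illustrative transcription under stated assumptions, not the code of any particular VIO system: the function names, the synthetic IMU samples, and the start state are made up, the biases are held fixed, and `so3_log` is not valid for rotations near 180°. State j is generated by dead-reckoning from state i with the Chapter 5 equations, so all three residuals should vanish.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])           # gravity in the world frame, z up

def so3_exp(phi):
    """Rodrigues' formula: rotation vector (rad) to rotation matrix."""
    angle = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if angle < 1e-10:
        return np.eye(3) + K
    return (np.eye(3) + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * (K @ K))

def so3_log(R):
    """Rotation matrix to rotation vector (not valid near 180 degrees)."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]]) / 2.0
    return w if angle < 1e-8 else w * angle / np.sin(angle)

def preintegrate(gyro, accel, b_g, b_a, dt):
    """Accumulate dR, dv, dp from raw samples only; no keyframe state is used."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_corr = dR @ (a - b_a)
        dp = dp + dv * dt + 0.5 * a_corr * dt**2
        dv = dv + a_corr * dt
        dR = dR @ so3_exp((w - b_g) * dt)
    return dR, dv, dp

def residuals(R_i, v_i, p_i, R_j, v_j, p_j, dR, dv, dp, T):
    """Compare keyframe states i and j against the preintegrated measurement."""
    r_R = so3_log(dR.T @ R_i.T @ R_j)
    r_v = R_i.T @ (v_j - v_i - G * T) - dv
    r_p = R_i.T @ (p_j - p_i - v_i * T - 0.5 * G * T**2) - dp
    return r_R, r_v, r_p

# Synthetic IMU samples between two keyframes (assumed values, fixed seed).
rng = np.random.default_rng(1)
DT, N = 1.0 / 200.0, 40
gyro = rng.normal(0.0, 0.5, size=(N, 3))
accel = rng.normal(0.0, 1.0, size=(N, 3)) + np.array([0.0, 0.0, 9.81])
b_g, b_a = np.array([0.01, -0.02, 0.005]), np.array([0.05, 0.0, -0.03])

# Arbitrary state at keyframe i, propagated to keyframe j by dead reckoning.
R_i = so3_exp(np.array([0.3, -0.2, 1.0]))
v_i, p_i = np.array([1.0, 0.5, -0.2]), np.array([4.0, -3.0, 2.0])
R_j, v_j, p_j = R_i, v_i, p_i
for w, a in zip(gyro, accel):
    a_world = R_j @ (a - b_a) + G
    p_j = p_j + v_j * DT + 0.5 * a_world * DT**2
    v_j = v_j + a_world * DT
    R_j = R_j @ so3_exp((w - b_g) * DT)

dR, dv, dp = preintegrate(gyro, accel, b_g, b_a, DT)
r_R, r_v, r_p = residuals(R_i, v_i, p_i, R_j, v_j, p_j, dR, dv, dp, N * DT)
print("residual norms:", np.linalg.norm(r_R), np.linalg.norm(r_v), np.linalg.norm(r_p))
```

`preintegrate` never sees R_i, v_i, or p_i, which is the whole point: if the optimizer moves keyframe i, only the cheap `residuals` call has to be repeated.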

Bias updates without re-integration

When the optimizer updates the bias estimate (ba, bg), you'd normally need to redo the preintegration. But using the first-order Jacobians ∂ΔR⁄∂bg, ∂Δv⁄∂ba, ∂Δp⁄∂ba (plus the cross terms ∂Δv⁄∂bg and ∂Δp⁄∂bg, since Δv and Δp depend on the gyro bias through ΔR), all computed during preintegration, you can correct the preintegrated quantities cheaply with a linear approximation. No full re-integration is needed unless the bias changes dramatically.
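A sketch of the correction for the accelerometer bias only, with the gyro bias held fixed. The Jacobian recursions below follow from differentiating the Δv and Δp sums with respect to ba; the function name and the synthetic data are assumptions. For ba the correction happens to be exact, because Δv and Δp are linear in ba. For the gyro bias the same idea is only a first-order approximation, which this sketch does not cover.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues' formula: rotation vector (rad) to rotation matrix."""
    angle = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if angle < 1e-10:
        return np.eye(3) + K
    return (np.eye(3) + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * (K @ K))

def preintegrate_with_jacobians(gyro, accel, b_g, b_a, dt):
    """Return dv, dp and their Jacobians with respect to the accelerometer bias."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    J_v = np.zeros((3, 3))                # d(dv)/d(b_a)
    J_p = np.zeros((3, 3))                # d(dp)/d(b_a)
    for w, a in zip(gyro, accel):
        a_corr = dR @ (a - b_a)
        # Jacobians first, so that they use the pre-update J_v and dR.
        J_p = J_p + J_v * dt - 0.5 * dR * dt**2
        J_v = J_v - dR * dt
        dp = dp + dv * dt + 0.5 * a_corr * dt**2
        dv = dv + a_corr * dt
        dR = dR @ so3_exp((w - b_g) * dt)
    return dv, dp, J_v, J_p

# Synthetic IMU samples (assumed values, fixed seed).
rng = np.random.default_rng(2)
DT, N = 1.0 / 200.0, 40
gyro = rng.normal(0.0, 0.5, size=(N, 3))
accel = rng.normal(0.0, 1.0, size=(N, 3)) + np.array([0.0, 0.0, 9.81])
b_g = np.zeros(3)

b_a_old = np.zeros(3)
delta = np.array([0.02, -0.01, 0.03])     # bias update proposed by the optimizer

# Preintegrate once at the old bias and keep the Jacobians.
dv_old, dp_old, J_v, J_p = preintegrate_with_jacobians(gyro, accel, b_g, b_a_old, DT)

# Cheap correction versus full re-integration at the new bias.
dv_corrected = dv_old + J_v @ delta
dp_corrected = dp_old + J_p @ delta
dv_reint, dp_reint, _, _ = preintegrate_with_jacobians(gyro, accel, b_g, b_a_old + delta, DT)

print("dv mismatch:", np.linalg.norm(dv_corrected - dv_reint))
print("dp mismatch:", np.linalg.norm(dp_corrected - dp_reint))
```

The corrected and re-integrated values agree to round-off, while the correction costs two 3×3 matrix-vector products instead of a pass over all N samples.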

Preintegration is the backbone of modern VIO. Systems like VINS-Mono (2018), OKVIS (2015), and ORB-VIO all use it. It converts a sequence of IMU readings into a single 9-DoF relative motion constraint (R, v, p) with an associated covariance matrix, which plugs directly into the factor graph optimizer from L10.
Why does IMU preintegration need to be redone (or corrected with first-order Jacobians) when the bias estimate bg changes?

Chapter 8: Showcase — VIO Simulator

Watch ground truth, VO-only, and VIO trajectories side by side. Toggle vision dropouts and IMU noise to see them cover each other's weaknesses.

VIO Simulator — Ground Truth vs VO-only vs VIO

Teal = ground truth. Warm orange = VO-only (drifts). Green = VIO (stays close). Enable dropouts to see the camera fail; the IMU bridges the gap. Enable IMU noise to see the IMU fail; the camera corrects it.


Chapter 9: Connections & Cheat Sheet

You've now built the complete story from raw pixels and IMU readings to a fused VIO system. Here's how everything fits together.

Core concepts — quick reference

| Concept | One-liner | Key equation / number |
|---|---|---|
| VO pipeline | Track features → estimate relative pose → chain SE(3) → trajectory | Tw,k+1 = Tw,k · Tk,k+1 |
| Drift | Relative errors compound; no absolute reference in pure VO | ≈ εrel × total path length |
| Monocular scale | (R, λt) fits the same correspondences for any λ > 0; the camera is projective | ‖t‖ fixed to 1 by convention; scale unknown |
| IMU: accel | Measures specific force = true accel − gravity (in body frame) | atrue = ameas − Rᵀ gworld, with gworld = [0, 0, 9.81]ᵀ m/s² |
| IMU: gyro | Measures angular velocity ω in body frame | Rk+1 = Rk·Exp(ωΔt) |
| Bias | Systematic DC offset; position error ≈ ½b·T² | 10 s, 0.05 m/s² → 2.5 m error |
| VIO complementarity | Camera: scale-free, slow; IMU: metric, fast, drifts | Together: metric, fast, low drift |
| IMU preintegration | Summarize N IMU readings → one (R,v,p) constraint | ΔRij, Δvij, Δpij in body frame of keyframe i |
| Tight vs loose | Tight: joint optimize reprojection + IMU residuals; loose: fuse VO output + IMU separately | Tight is more accurate |
| Keyframes | Selected frames maximizing info gain; back-end only optimizes over these | Reduce compute; marginalize old ones |

Where VO/VIO fits in the VNAV curriculum

L6 Features: Harris, SIFT, KLT; the front-end input
L7 Two-View Geometry: essential matrix, scale ambiguity; relative pose estimation
L8 RANSAC: robust outlier rejection; VO front-end robustness
L9–L10 NLLS & Manifold: Gauss-Newton, on-manifold; the back-end optimizer
L11 VO / VIO (this lesson): full pipeline (feature → pose → chain); drift; scale; IMU; preintegration; fusion
L12 Place Recognition: bag-of-words, DBoW2; detect loop closures
L13 SLAM (coming): loop closure → pose graph → global correction; the cure for drift
The big picture: VO gives you a local, drifting trajectory. VIO gives you a metric, low-drift trajectory. SLAM (L13) adds loop closure to give you a globally consistent, drift-corrected map. Each layer builds on the previous — and VIO's IMU preintegration factors plug directly into the SLAM factor graph.

Open-source VIO systems to study

| System | Back-end | Coupling | Notable for |
|---|---|---|---|
| MSCKF | Extended Kalman Filter | Tightly coupled | EKF-based; fast; no map points in state |
| OKVIS | Nonlinear optimization | Tightly coupled | First sliding-window VIO optimizer |
| VINS-Mono | Factor graph (Ceres) | Tightly coupled | Loop closure built in; popular for drones |
| ORB-SLAM3 | Factor graph | Tightly coupled | Full SLAM with IMU; monocular/stereo/fisheye |
| OpenVINS | MSCKF variant | Tightly coupled | Open-source, modular, research-friendly |

"Every measurement tells you something about the world. Every integration multiplies your ignorance. The art of state estimation is to measure as often as possible and integrate as rarely as possible." — paraphrasing Stergios Roumeliotis