Your drone has no GPS and must navigate a 200-meter loop entirely from its own sensors. A camera gives you rich scene detail but accumulates small pose errors that compound into a trajectory that misses the start by meters. Strap on an IMU, fuse the two, and suddenly drift shrinks by an order of magnitude and you recover metric scale. This lesson builds the full VO pipeline — feature tracking, relative-pose chaining, drift math — then adds the IMU: what it measures, why it drifts catastrophically alone, and how preintegration fuses hundreds of IMU readings between keyframes into a single constraint. MIT 16.485 by Luca Carlone, Lecture 20.
It's 2019. A DJI drone is sent to inspect a collapsed building. GPS is blocked by steel and concrete. The drone has one camera pointing forward and one small chip — an IMU — that measures acceleration and rotation at 200 Hz. That's it. How does it know where it is?
The same problem appears everywhere: a self-driving car in an underground garage, a robot in a warehouse without WiFi, a Mars rover 20 light-minutes from the nearest uplink. The answer in all of these cases is odometry — building a trajectory estimate step-by-step from your own sensor readings, without any external reference.
Visual odometry (VO) does this from a camera: track features across frames, recover the relative pose between each pair of frames via the epipolar constraint (L7), and chain those relative poses into a growing trajectory. Visual-inertial odometry (VIO) fuses the camera with an IMU for robustness and metric scale.
GPS needs line-of-sight to 4+ satellites and updates at only 1–10 Hz with 2–5 m accuracy. Indoors it fails completely. A camera at 30 Hz with sub-pixel feature tracking gives you millimeter-scale relative pose changes between frames. An IMU at 200 Hz gives you continuous high-rate motion updates. Both are self-contained. Together they outperform GPS in short-range accuracy — if you can stop the drift.
A complete VO system has two conceptually separate parts. The front-end touches raw pixels: it detects features, tracks them across frames, and computes noisy relative-pose measurements. The back-end takes those measurements and finds the globally consistent trajectory via optimization (or filtering). The split matters because the front-end is fast but noisy; the back-end is slow but accurate.
At time k the camera captures frame Ik. The front-end detects corners with Harris or Shi-Tomasi (L6), then tracks them into frame Ik+1 with KLT optical flow. You now have a set of point correspondences {(pi, p'i)} — the same 3D point seen from two camera positions. Plug those into the 5-point algorithm (or 8-point with normalization) to estimate the essential matrix E = [t]× R (L7). RANSAC (L8) removes outliers. Decompose E into (R, t) — four hypotheses, cheirality picks the right one.
The back-end chains relative poses. Given relative transform Tk,k+1 (pose of frame k+1 expressed in frame k), the absolute pose of frame k+1 in the world frame is:
where Tw,k is the world-to-camera transform at frame k. This is SE(3) matrix multiplication — the on-manifold composition from L10.
Suppose the drone starts at the world origin T0 = I4. Between frames 0 and 1 it moves forward 1 m and turns 5° right. Between frames 1 and 2 it moves forward 1 m with no turn. In the (R | t) notation:
python import numpy as np from scipy.spatial.transform import Rotation # Relative pose: frame 0 → 1 (5° yaw right, 1 m forward) R01 = Rotation.from_euler('z', -5, degrees=True).as_matrix() t01 = np.array([1.0, 0.0, 0.0]) # 1 m forward in frame-0 coords # Relative pose: frame 1 → 2 (no turn, 1 m forward) R12 = np.eye(3) t12 = np.array([1.0, 0.0, 0.0]) # Chain: world pose of frame 2 # T_world_frame2 = T_world_frame0 @ T_01 @ T_12 # Starting at identity (T_world_frame0 = I) # Position of frame 1 in world p1 = t01 # = [1, 0, 0] # Position of frame 2 in world = p1 + R01 @ t12 p2 = p1 + R01 @ t12 print("Frame 1 position:", np.round(p1, 4)) # [1.0, 0.0, 0.0] print("Frame 2 position:", np.round(p2, 4)) # [1 + cos(-5°), 0 + sin(-5°), 0] # = [1 + 0.9962, 0 + (-0.0872), 0] # = [1.9962, -0.0872, 0.0] # The drone veered slightly right on the second step — the 5° turn propagated!
This is the essence of the back-end: each step applies the new relative rotation Rk before adding the translation, so previous rotations shape all future translations. Errors in early rotations deviate all later positions.
Animated illustration of the VO front-end and back-end. Watch features appear in each frame, get matched, yield a relative pose arrow, and chain into a growing trajectory.
Every relative-pose estimate carries a small error: the features matched had sub-pixel noise, RANSAC didn't find every inlier, the essential matrix decomposition had numerical error. In isolation, each error is tiny. But the back-end multiplies SE(3) transforms — and multiplication compounds errors multiplicatively, not additively.
Suppose each step introduces a position error of ε meters (absolute) along the direction of motion. After N steps of length d, the position error grows. In the best case (errors partially cancel): O(ε√N). In the worst case (all errors in the same direction, e.g., systematic heading bias): O(εN). In practice VO drift is typically 0.1–2% of total path length — the error grows roughly linearly with distance traveled.
where εrel is the fractional per-step error (typically 0.5–1% for a good front-end).
Path = 200 m loop. Per-step error εrel = 1%. Expected drift ≈ 2 m. The drone started at a door, flew 200 m around a room, and estimates it's 2 m from where it started. The loop doesn't close. That's drift.
Running the optimization back-end over a sliding window of recent keyframes (selected frames that maximize information gain) helps. Bundle adjustment over N keyframes re-estimates all poses and map points jointly, which spreads and reduces error compared to purely sequential pose chaining. But it doesn't eliminate drift — it only slows its growth. The trajectory still drifts because the window doesn't extend back to the start.
Grey = ground truth circular path. Warm = VO estimated path. Watch the estimated path peel away and fail to close the loop as error or step count increases.
The 5-point algorithm recovers R and t from pixel correspondences. R is determined fully. But t is determined only up to a positive scale factor: if (R, λt) is consistent with the observed correspondences for any λ > 0, so is (R, t). You can see this directly from the essential matrix: E = [t]×R = [λt]×R for any λ — the homogeneous constraint never pins the scale.
This is not a numerical issue — it's a fundamental geometric one. A single camera is a projective device: it collapses 3D onto 2D and throws away all absolute depth information. Without knowing how big the scene is, you can't know how far you moved.
In sequential VO, you can chain the up-to-scale relative translations if you assume a fixed scale: set |t0,1| = 1 (the baseline unit) and express all subsequent translations in the same unit. This gives a metrically consistent but incorrectly scaled trajectory. If you flew 10 m total but your unit baseline was actually 0.1 m, your trajectory reports 10 "units" but doesn't know those units are 0.1 m each. The shape is correct; the size is unknown.
| Method | How it fixes scale | Notes |
|---|---|---|
| Stereo camera | Known baseline b between cameras; triangulation gives metric depth | Most reliable; b must be calibrated |
| IMU (VIO) | Accelerometer integrates to metric displacement between frames | Chapter 6–7; preferred for drones |
| Known object | ArUco marker, road lane width, etc. | Fragile if object absent |
| GPS fused | GPS gives absolute position; scale follows from two GPS fixes | Fails indoors |
| Wheel odometry | Encoder counts × wheel circumference = metric translation | Ground robots only; slippage error |
Drag the scale slider. The teal path is ground truth (metric). The warm path is monocular VO — same shape, unknown scale. The green path (IMU-fixed) snaps to metric. Watch how the scale factor λ stretches or compresses the VO path.
An Inertial Measurement Unit (IMU) is a tiny chip containing two sensors: an accelerometer and a gyroscope. Together they measure the complete 6-DoF motion of the body they're attached to, at very high rate (100–400 Hz typically). On a phone, a quadrotor, a car — every device with autonomous motion has at least one.
The accelerometer measures specific force: the sum of all forces except gravity, divided by mass. In other words, it measures acceleration relative to free-fall. If the IMU is at rest on a table, the accelerometer reads g = [0, 0, 9.81] m/s2 upward (the table pushes back against gravity). If it's in free fall, it reads zero. To get true linear acceleration you subtract gravity:
where R is the current orientation (to rotate g from world to body frame) and gworld = [0, 0, 9.81]T m/s2.
The gyroscope measures angular velocity ω in the body frame — the instantaneous rotation rate around each body axis. Integrate ω over time to get orientation change. Compose with the current orientation to update it:
This is the on-manifold SO(3) update from L10: take a step ωΔt in the tangent space, exponentiate it onto the manifold, compose with the current rotation.
Every IMU reading is corrupted by two things:
| Corruption | Symbol | What it does | Effect on integration |
|---|---|---|---|
| Additive noise | na, ng | Zero-mean white noise at each sample | Random walk — grows as √T |
| Bias | ba, bg | Slowly drifting DC offset | Constant drift — grows linearly with T |
The bias is insidious. If the accelerometer has a bias of ba = 0.1 m/s2, then integrating once to get velocity gives a velocity error of baΔt per step. Integrating again to get position gives a position error of ½baΔt2. Over 10 seconds at 0.1 m/s2 bias, the position error is 5 m. This is why IMU-only dead reckoning fails within seconds.
Dead reckoning means estimating your position by integrating known velocities or accelerations forward from a known starting point — no landmarks, no GPS, just pure integration. With an IMU, you integrate twice: acceleration to velocity, velocity to position. It's conceptually simple and blazing fast, but the errors are catastrophic on their own.
At time step k, with Δt between samples, the IMU propagates the state (R, v, p) — orientation, velocity, position — as:
where g = [0, 0, −9.81]T m/s2 in world frame. The key insight: every line depends on Rk, which accumulates its own errors. Orientation error pollutes velocity, which pollutes position.
Car drives straight for 10 s at 5 m/s (should end at 50 m). Accelerometer reads 0 (constant speed). But bias ba = 0.1 m/s2:
The car ends up at an estimated 55 m instead of the true 50 m — 10% error after just 10 seconds, with only a 0.1 m/s2 bias. And real IMU bias is often larger.
Now add a gyroscope bias bg = 0.01 rad/s. After 10 s, the orientation error is 0.1 rad ≈ 5.7°. In the velocity integration, Rk rotates the accelerometer reading into world frame. A 5.7° orientation error means gravity (9.81 m/s2) leaks into the horizontal acceleration at sin(5.7°) × 9.81 ≈ 0.97 m/s2. This is 10× larger than the original accelerometer bias! Orientation error catastrophically amplifies position drift.
Press Go to integrate a noisy IMU. Grey = true trajectory (straight line). Red = IMU-only estimate. Toggle bias and noise levels to see how quickly position drifts.
The camera and the IMU fail in completely different ways. This is the key observation that makes VIO so powerful: their weaknesses are each other's strengths.
| Property | Camera alone (VO) | IMU alone | VIO (fused) |
|---|---|---|---|
| Update rate | ~30 Hz (slow) | 100–400 Hz (fast) | IMU rate, corrected by camera |
| Metric scale | Unknown (up-to-scale) | Metric (m/s² is metric) | Metric ✓ |
| Gravity direction | Unknown | Measures g directly | Known ✓ |
| Short-term accuracy | Good (feature-rich scenes) | Very good (sub-ms) | Excellent ✓ |
| Long-term drift | ~0.5–2% of path | Diverges in seconds | ~0.1–0.5% of path |
| Motion blur | Fails (features lost) | Unaffected | IMU bridges dropout ✓ |
| Low texture | Fails (no features) | Unaffected | IMU bridges ✓ |
| Initialization | Instant (any scene) | Needs initial alignment | Requires motion excitation |
1. Bias correction: the optimizer estimates IMU bias ba, bg jointly with the trajectory. When the camera pins the trajectory accurately, the optimizer can infer what bias would explain the discrepancy between visual and inertial motion — and correct it. 2. Long-term stability: the camera's feature matches provide absolute constraints that prevent unbounded growth of IMU error. 3. Orientation observability: yaw is not observable from the IMU alone (spinning in place looks the same as being still to an accelerometer). Camera features give yaw information via the homography/essential matrix.
Loosely coupled VIO: run VO separately on the camera (output: relative pose Tk,k+1 with covariance). Run IMU integration separately (output: predicted Tk,k+1IMU). Fuse the two estimates in an EKF or factor graph. Simple but suboptimal — errors in VO are opaque to the IMU subsystem.
Tightly coupled VIO: the optimizer jointly minimizes reprojection errors (from camera feature tracks) and IMU preintegration residuals in a single factor graph over poses, velocities, biases, and 3D landmark positions. This is harder to implement but produces the best accuracy. MSCKF and OKVIS, ORB-VIO, VINS-Mono are all tightly coupled.
Between two camera keyframes there are typically many IMU readings — if the camera runs at 30 Hz and the IMU at 200 Hz, that's about 6–7 IMU samples per camera interval. The back-end optimizer works in terms of keyframe poses. How do you incorporate all those IMU readings efficiently?
Naive approach: re-integrate all IMU readings from the first keyframe whenever you want to compute the constraint between keyframes i and j. Problem: if the optimizer changes the estimate at keyframe i (as it will, repeatedly), you'd have to redo the entire integration chain — an O(N) cost per optimization step over N IMU samples.
Preintegration: recognize that the relative motion between two keyframes can be summarized as a single measurement — the preintegrated IMU measurement — that depends only on the IMU readings between those keyframes, not on the absolute pose at the start. When the optimizer changes the keyframe poses, the preintegrated constraint stays fixed; only the residual changes. This is O(1) per optimization step.
For IMU readings ωk (gyro) and ak (accel) at times tk between keyframes i and j:
These are computed once in preprocessing. The residuals in the factor graph are then:
When the optimizer updates the bias estimate (ba, bg), you'd normally need to redo the preintegration. But using the first-order Jacobians ∂ΔR⁄∂bg, ∂Δv⁄∂ba, ∂Δp⁄∂ba (computed during preintegration), you can correct the preintegrated quantities cheaply with a linear approximation — no full re-integration needed unless the bias changes dramatically.
Watch ground truth, VO-only, and VIO trajectories side by side. Toggle vision dropouts and IMU noise to see them cover each other's weaknesses.
Teal = ground truth. Warm orange = VO-only (drifts). Green = VIO (stays close). Enable dropouts to see the camera fail; the IMU bridges the gap. Enable IMU noise to see the IMU fail; the camera corrects it.
You've now built the complete story from raw pixels and IMU readings to a fused VIO system. Here's how everything fits together.
| Concept | One-liner | Key equation / number |
|---|---|---|
| VO pipeline | Track features → estimate relative pose → chain SE(3) → trajectory | Tw,k+1 = Tw,k · Tk,k+1 |
| Drift | Relative errors compound — no absolute reference in pure VO | ≈ εrel × total path length |
| Monocular scale | E = [λt]×R for any λ — the camera is projective | |t| fixed to 1 by convention; scale unknown |
| IMU: accel | Measures specific force = true accel − gravity (in body frame) | atrue = ameas − RTg |
| IMU: gyro | Measures angular velocity ω in body frame | Rk+1 = Rk·Exp(ωΔt) |
| Bias | Systematic DC offset; position error ≈ ½b·T² | 10 s, 0.05 m/s² → 2.5 m error |
| VIO complementarity | Camera: scale-free, slow; IMU: metric, fast, drifts | Together: metric, fast, low drift |
| IMU preintegration | Summarize N IMU readings → one (R,v,p) constraint | ΔRij, Δvij, Δpij in body frame of keyframe i |
| Tight vs loose | Tight: joint optimize reprojection + IMU residuals; loose: fuse VO output + IMU separately | Tight is more accurate |
| Keyframes | Selected frames maximizing info gain; back-end only optimizes over these | Reduce compute; marginalize old ones |
| System | Back-end | Coupling | Notable for |
|---|---|---|---|
| MSCKF | Extended Kalman Filter | Tightly coupled | EKF-based; fast; no map points in state |
| OKVIS | Nonlinear optimization | Tightly coupled | First sliding-window VIO optimizer |
| VINS-Mono | Factor graph (Ceres) | Tightly coupled | Loop closure built in; popular for drones |
| ORB-SLAM3 | Factor graph | Tightly coupled | Full SLAM with IMU; monocular/stereo/fisheye |
| OpenVINS | MSCKF variant | Tightly coupled | Open-source, modular, research-friendly |
"Every measurement tells you something about the world. Every integration multiplies your ignorance. The art of state estimation is to measure as often as possible and integrate as rarely as possible." — paraphrasing Stergios Roumeliotis