Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 20

Visual & Visual-Inertial Odometry

Your drone has no GPS and must navigate a 200-meter loop entirely from its own sensors. A camera gives you rich scene detail but accumulates small pose errors that compound into a trajectory that misses the start by meters. Strap on an IMU, fuse the two, and suddenly drift shrinks by an order of magnitude and you recover metric scale. This lesson builds the full VO pipeline — feature tracking, relative-pose chaining, drift math — then adds the IMU: what it measures, why it drifts catastrophically alone, and how preintegration fuses hundreds of IMU readings between keyframes into a single constraint. MIT 16.485 by Luca Carlone, Lecture 20.

Prerequisites: VNAV L6 (feature detection & tracking) · VNAV L7 (two-view geometry, essential matrix) · VNAV L8 (RANSAC) · VNAV L10 (on-manifold optimization).

Chapter 0: No GPS — Navigate from Your Own Eyes

It's 2019. A DJI drone is sent to inspect a collapsed building. GPS is blocked by steel and concrete. The drone has one camera pointing forward and one small chip — an IMU — that measures acceleration and rotation at 200 Hz. That's it. How does it know where it is?

The same problem appears everywhere: a self-driving car in an underground garage, a robot in a warehouse without WiFi, a Mars rover 20 light-minutes from the nearest uplink. The answer in all of these cases is odometry — building a trajectory estimate step-by-step from your own sensor readings, without any external reference.

Visual odometry (VO) does this from a camera: track features across frames, recover the relative pose between each pair of frames via the epipolar constraint (L7), and chain those relative poses into a growing trajectory. Visual-inertial odometry (VIO) fuses the camera with an IMU for robustness and metric scale.

The fundamental tension: VO gives you a trajectory, but errors in each relative pose accumulate. After 100 frames a 1% per-frame error has compounded into a trajectory that misses the start position by multiple meters — even though every single relative-pose estimate was "pretty good." This is drift, and it's the central challenge of odometry.

Why not just use GPS?

GPS needs line-of-sight to 4+ satellites and updates at only 1–10 Hz with 2–5 m accuracy. Indoors it fails completely. A camera at 30 Hz with sub-pixel feature tracking gives you millimeter-scale relative pose changes between frames. An IMU at 200 Hz gives you continuous high-rate motion updates. Both are self-contained. Together they outperform GPS in short-range accuracy — if you can stop the drift.

The pipeline at a glance

Image stream

Camera at 30 Hz, typically grayscale 640×480

↓

Feature track (L6)

KLT/Harris: same 3D points seen across frames

↓

Relative pose (L7+L8)

5-pt + RANSAC: R_k, t_k from frame k to k+1 (up to scale)

↓ chain SE(3)

Trajectory

T₀, T₁, ..., T_N — the drone's path through the world

↻ drift accumulates

You run visual odometry on a 300-frame flight. At each frame the estimated position has a 0.5% relative error. After 300 frames, roughly how far off is the estimated final position if the total path length is 100 m?

About 0.5 m: errors average out over many steps
About 50 cm to 1.5 m: errors compound and do not average to zero
Exactly 0 m: if every relative estimate is unbiased, the trajectory is correct
Less than 1 cm: 0.5% per frame is negligible

Chapter 1: The VO Pipeline — Front-End and Back-End

A complete VO system has two conceptually separate parts. The front-end touches raw pixels: it detects features, tracks them across frames, and computes noisy relative-pose measurements. The back-end takes those measurements and finds the globally consistent trajectory via optimization (or filtering). The split matters because the front-end is fast but noisy; the back-end is slow but accurate.

Front-end: feature tracking and relative pose

At time k the camera captures frame I_k. The front-end detects corners with Harris or Shi-Tomasi (L6), then tracks them into frame I_k+1 with KLT optical flow. You now have a set of point correspondences {(p_i, p'_i)} — the same 3D point seen from two camera positions. Plug those into the 5-point algorithm (or 8-point with normalization) to estimate the essential matrix E = [t]_× R (L7). RANSAC (L8) removes outliers. Decompose E into (R, t) — four hypotheses, cheirality picks the right one.
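As a rough illustration of the estimation step, the sketch below runs the linear 8-point solve on noise-free synthetic correspondences in normalized image coordinates. It deliberately omits the coordinate normalization and the RANSAC loop that a real front-end needs. The helper names (`skew`, `eight_point`) and the synthetic scene are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def eight_point(x1, x2):
    """Linear estimate of E from N >= 8 normalized correspondences (N x 3 each)."""
    # Each pair gives one row of the homogeneous system A e = 0, because
    # x2^T E x1 = kron(x2, x1) . vec(E) with E flattened row-major.
    A = np.array([np.kron(b, a) for a, b in zip(x1, x2)])
    E = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Project onto the set of essential matrices: singular values (1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Hypothetical ground truth: a small rotation plus a unit-length translation.
angle = np.deg2rad(5.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.2, 0.1])
t_true /= np.linalg.norm(t_true)
E_true = skew(t_true) @ R_true

rng = np.random.default_rng(0)
X1 = rng.uniform([-2, -2, 4], [2, 2, 8], size=(20, 3))  # points in camera-1 frame
X2 = X1 @ R_true.T + t_true                             # same points in camera-2 frame
x1 = X1 / X1[:, 2:3]                                    # normalized image coordinates
x2 = X2 / X2[:, 2:3]

E_est = eight_point(x1, x2)
residuals = np.einsum('ni,ij,nj->n', x2, E_est, x1)     # x2^T E x1 per correspondence
sign_gap = min(np.linalg.norm(E_est - E_true), np.linalg.norm(E_est + E_true))
print("max epipolar residual:", np.max(np.abs(residuals)))
print("distance to +/- E_true:", sign_gap)
```

The estimate is compared against both `E_true` and `-E_true` because the linear solve fixes E only up to sign and scale, which is the same ambiguity that leaves t with unit length.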

The relative pose is only determined up to scale. From a single camera with no prior knowledge of scene depth, you cannot tell if the camera moved 1 cm past a small cube or 1 m past a large room — both produce identical pixel correspondences. This is the monocular scale ambiguity, and it means the translation t from the 5-point algorithm has unit length by convention. Chapter 3 will show the exact math.

Back-end: trajectory building

The back-end chains relative poses. Given relative transform T_k,k+1 (pose of frame k+1 expressed in frame k), the absolute pose of frame k+1 in the world frame is:

T_w,k+1 = T_w,k · T_k,k+1

where T_w,k is the pose of camera frame k expressed in the world frame, so it maps points from camera-k coordinates into world coordinates. This is SE(3) matrix multiplication, the on-manifold composition from L10.

Worked example: chaining two relative poses

Suppose the drone starts at the world origin T₀ = I₄. Between frames 0 and 1 it moves forward 1 m and turns 5° right. Between frames 1 and 2 it moves forward 1 m with no turn. In the (R | t) notation:

python
import numpy as np
from scipy.spatial.transform import Rotation

# Relative pose: frame 0 → 1 (5° yaw right, 1 m forward)
R01 = Rotation.from_euler('z', -5, degrees=True).as_matrix()
t01 = np.array([1.0, 0.0, 0.0])  # 1 m forward in frame-0 coords

# Relative pose: frame 1 → 2 (no turn, 1 m forward)
R12 = np.eye(3)
t12 = np.array([1.0, 0.0, 0.0])

# Chain: world pose of frame 2
# T_world_frame2 = T_world_frame0 @ T_01 @ T_12
# Starting at identity (T_world_frame0 = I)

# Position of frame 1 in world
p1 = t01  # = [1, 0, 0]

# Position of frame 2 in world = p1 + R01 @ t12
p2 = p1 + R01 @ t12
print("Frame 1 position:", np.round(p1, 4))   # [1.0, 0.0, 0.0]
print("Frame 2 position:", np.round(p2, 4))
# [1 + cos(-5°), 0 + sin(-5°), 0]
# = [1 + 0.9962, 0 + (-0.0872), 0]
# = [1.9962, -0.0872, 0.0]
# The drone veered slightly right on the second step — the 5° turn propagated!

This is the essence of the back-end: each step rotates the new relative translation by the orientation accumulated so far before adding it, so earlier rotations shape all later translations. An error in an early rotation therefore displaces every later position.
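The same chaining can be written with full 4×4 homogeneous matrices, which is closer to how T_w,k+1 = T_w,k · T_k,k+1 is usually coded. This is a minimal sketch: the helper `make_T` is a name introduced here for illustration, and it reuses the yaw-only motion of the worked example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def make_T(yaw_deg, t):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler('z', yaw_deg, degrees=True).as_matrix()
    T[:3, 3] = t
    return T

T_01 = make_T(-5.0, [1.0, 0.0, 0.0])   # 5 deg right, 1 m forward
T_12 = make_T(0.0, [1.0, 0.0, 0.0])    # no turn, 1 m forward

T_w0 = np.eye(4)                       # start at the world origin
T_w1 = T_w0 @ T_01                     # right-multiply by each new relative pose
T_w2 = T_w1 @ T_12

p2 = T_w2[:3, 3]                       # position of frame 2 in the world
print("Frame 2 position:", np.round(p2, 4))
```

The translation column of `T_w2` should match the p2 computed by hand in the worked example, since the matrix product performs the same "rotate, then add" step internally.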

VO Pipeline Visualized — Frames → Features → Poses → Trajectory

Animated illustration of the VO front-end and back-end. Watch features appear in each frame, get matched, yield a relative pose arrow, and chain into a growing trajectory.

In the VO back-end, you have the world pose T_w,k and the new relative measurement T_k,k+1. What is the correct formula for T_w,k+1?

T_w,k + T_k,k+1 (matrix addition)
T_w,k · T_k,k+1 (SE(3) composition)
T_k,k+1 · T_w,k (wrong order)
The inverse of T_k,k+1 composed with T_w,k

Chapter 2: Drift — Why Small Errors Become Big Problems

Every relative-pose estimate carries a small error: the features matched had sub-pixel noise, RANSAC didn't find every inlier, the essential matrix decomposition had numerical error. In isolation, each error is tiny. But the back-end composes SE(3) transforms, so every new pose inherits all the errors made before it, and a small rotation error early on is scaled by the length of everything that follows.

The drift math

Suppose each step of length d introduces a position error of ε meters (absolute) along the direction of motion. How the total error grows over N steps depends on how the per-step errors are correlated. In the best case (independent, zero-mean errors that partially cancel) it grows as O(ε√N). In the worst case (all errors in the same direction, e.g., systematic heading bias) it grows as O(εN). In practice VO drift is typically 0.1–2% of total path length, so the error grows roughly linearly with distance traveled.

drift ≈ ε_rel × total path length

where ε_rel is the fractional per-step error (typically 0.5–1% for a good front-end).
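To get a feel for these growth rates, here is a toy 2D simulation that dead-reckons a closed loop with a per-step heading error. It is only a sketch under simplifying assumptions: planar motion, perfect step lengths, and heading errors that are either a constant bias or independent Gaussian noise. The function name and the error magnitudes are invented for illustration.

```python
import numpy as np

def loop_closure_error(n_steps, path_length, heading_bias_deg=0.0,
                       heading_noise_deg=0.0, seed=0):
    """Dead-reckon a closed polygonal loop; return the end-to-start gap in meters."""
    rng = np.random.default_rng(seed)
    step = path_length / n_steps
    true_turn = 2.0 * np.pi / n_steps            # turn per step that closes the loop
    errors = np.deg2rad(heading_bias_deg
                        + heading_noise_deg * rng.standard_normal(n_steps))
    headings = np.cumsum(true_turn + errors)     # estimated heading after each step
    end = step * np.array([np.cos(headings).sum(), np.sin(headings).sum()])
    return float(np.linalg.norm(end))            # the true loop ends at the origin

perfect = loop_closure_error(200, 200.0)
biased = loop_closure_error(200, 200.0, heading_bias_deg=0.02)
noisy = loop_closure_error(200, 200.0, heading_noise_deg=0.02)
print(f"perfect: {perfect:.3f} m  biased: {biased:.3f} m  noisy: {noisy:.3f} m")
```

With these made-up numbers, a systematic 0.02° heading bias per step leaves a gap of roughly 2 m on the 200 m loop, about 1% of the path, which is the linear-in-distance regime described above. Zero-mean noise of the same size usually leaves a smaller gap because the errors partially cancel.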

Worked numbers

Path = 200 m loop. Per-step error ε_rel = 1%. Expected drift ≈ 2 m. The drone started at a door, flew 200 m around a room, and estimates it's 2 m from where it started. The loop doesn't close. That's drift.

Drift has no absolute reference to correct against. VO only knows relative measurements — "I moved this much from my last position." There's nothing to anchor the trajectory globally. The only fix is a loop closure (recognizing a previously visited place and correcting the trajectory to make the loop consistent) or an external reference like GPS. This is exactly why SLAM (L13) exists: it detects loop closures and eliminates accumulated drift.

Keyframes reduce drift (a little)

Running the optimization back-end over a sliding window of recent keyframes (selected frames that maximize information gain) helps. Bundle adjustment over N keyframes re-estimates all poses and map points jointly, which spreads and reduces error compared to purely sequential pose chaining. But it doesn't eliminate drift — it only slows its growth. The trajectory still drifts because the window doesn't extend back to the start.

Drift Accumulation — Adjust Error Per Step and Steps

Grey = ground truth circular path. Warm = VO estimated path. Watch the estimated path peel away and fail to close the loop as error or step count increases.

Per-step error (ε) 0.02

Steps (N) 80

A VO system has 0.5% per-step position error. Over a 400-meter path, what is the approximate drift at the end?

0.02 m (errors cancel out over many steps)
0.5 m (error grows as sqrt of steps)
2 m (error grows proportionally to distance traveled)
20 m (errors compound exponentially)

Chapter 3: Scale Ambiguity — the Monocular Blind Spot

The 5-point algorithm recovers R and t from pixel correspondences. R is determined fully. But t is determined only up to a positive scale factor: if (R, t) is consistent with the observed correspondences, so is (R, λt) for any λ > 0. You can see this directly from the essential matrix: [λt]_×R = λ[t]_×R = λE, and the epipolar constraint p'^T E p = 0 is homogeneous in E, so E and λE are satisfied by exactly the same correspondences. The constraint never pins the scale.

E = [t]_× R → also consistent with [λt]_× R for any λ > 0

This is not a numerical issue — it's a fundamental geometric one. A single camera is a projective device: it collapses 3D onto 2D and throws away all absolute depth information. Without knowing how big the scene is, you can't know how far you moved.
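The ambiguity can be checked numerically: scaling the scene and the camera translation by the same factor leaves every pixel where it was. The sketch below assumes an ideal pinhole camera with normalized coordinates. The scene, the motion, and the factor `lam` are arbitrary values chosen for the example.

```python
import numpy as np

def project(X):
    """Pinhole projection to normalized image coordinates (divide by depth)."""
    return X[:, :2] / X[:, 2:3]

rng = np.random.default_rng(1)
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])    # rotation about the camera y-axis
t = np.array([0.5, 0.0, 0.1])
X = rng.uniform([-1, -1, 3], [1, 1, 6], size=(50, 3))   # scene in camera-1 frame

lam = 7.0                                               # arbitrary scale factor
pix_small = project(X @ R.T + t)                        # small scene, small motion
pix_large = project((lam * X) @ R.T + lam * t)          # both scaled by lam
same_view_1 = np.allclose(project(X), project(lam * X))
same_view_2 = np.allclose(pix_small, pix_large)
print(same_view_1, same_view_2)
```

Both checks should hold: the two cameras see identical images in the small and the large world, so no algorithm that only looks at pixels can tell them apart.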

The scale factor propagates

In sequential VO, you can chain the up-to-scale relative translations if you fix a common scale: set |t_0,1| = 1 (the baseline unit) and express all subsequent translations in the same unit. This gives an internally consistent but arbitrarily scaled trajectory. If you flew 10 m total and your unit baseline was actually 0.1 m, your trajectory reports 100 "units" but doesn't know those units are 0.1 m each. The shape is correct; the size is unknown.

Monocular VO cannot recover metric scale. The trajectory is only determined up to a global scale factor. If someone tells you your trajectory is "100 units long," you still don't know if you flew 1 m, 10 m, or 100 m. Everything in the map — feature depths, trajectory length — is off by the same unknown constant. Adding an IMU, a stereo baseline, or any known-size object in the scene immediately fixes this, because those provide absolute metric measurements.

How scale is recovered in practice

Method	How it fixes scale	Notes
Stereo camera	Known baseline b between cameras; triangulation gives metric depth	Most reliable; b must be calibrated
IMU (VIO)	Accelerometer integrates to metric displacement between frames	Chapter 6–7; preferred for drones
Known object	ArUco marker, road lane width, etc.	Fragile if object absent
GPS fused	GPS gives absolute position; scale follows from two GPS fixes	Fails indoors
Wheel odometry	Encoder counts × wheel circumference = metric translation	Ground robots only; slippage error
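Whatever the metric source, the last step is the same: find the single factor that maps VO units to meters. One simple way to do that, sketched here as an assumption and not as the method any particular system uses, is a least-squares fit between VO displacements and the same displacements measured metrically. The data below are made up.

```python
import numpy as np

def estimate_scale(vo_steps, metric_steps):
    """Least-squares s minimizing sum ||s * vo_step - metric_step||^2."""
    vo = np.asarray(vo_steps, dtype=float)
    metric = np.asarray(metric_steps, dtype=float)
    return float(np.sum(vo * metric) / np.sum(vo * vo))

# Hypothetical data: VO displacements in arbitrary units, and the same
# displacements in meters from a metric sensor (stereo, IMU, encoders).
vo_steps = np.array([[1.0, 0.0, 0.0], [0.9, 0.3, 0.0], [0.8, 0.5, 0.1]])
metric_steps = 0.2 * vo_steps                 # true scale: 0.2 m per VO unit
scale = estimate_scale(vo_steps, metric_steps)
print(f"{scale:.3f} m per VO unit")
```

A scale of 0.2 m per unit is the same situation as a 47-unit trajectory that was really 9.4 m long.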

Scale Ambiguity — Up-to-Scale vs Metric Trajectory

Drag the scale slider. The teal path is ground truth (metric). The warm path is monocular VO — same shape, unknown scale. The green path (IMU-fixed) snaps to metric. Watch how the scale factor λ stretches or compresses the VO path.

VO scale factor λ 0.6

A monocular VO system estimates a trajectory that is 47 "units" long. The true path was 9.4 m long. What is the unknown scale factor λ (units per meter)?

λ = 0.2 (the trajectory is underscaled)
λ = 9.4 (units equal centimeters)
λ = 5 (the VO unit is 20 cm)
The question has no answer: scale is truly unknowable

Chapter 4: The IMU — Fast, Metric, and Drifting

An Inertial Measurement Unit (IMU) is a tiny chip containing two sensors: an accelerometer and a gyroscope. Together they measure the complete 6-DoF motion of the body they're attached to, at very high rate (100–400 Hz typically). On a phone, a quadrotor, a car — every device with autonomous motion has at least one.

What the accelerometer measures

The accelerometer measures specific force: the sum of all forces except gravity, divided by mass. In other words, it measures acceleration relative to free-fall. If the IMU is at rest on a table, the accelerometer reads g = [0, 0, 9.81] m/s² upward (the table pushes back against gravity). If it's in free fall, it reads zero. To get true linear acceleration you subtract gravity:

a_true = a_measured − R^T g_world

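where R is the current orientation (R^T rotates g from world to body frame), so a_true is expressed in the body frame, and g_world = [0, 0, 9.81]^T m/s². Note the sign convention: here g_world points up, matching the at-rest reading. Chapter 5 instead uses the gravity vector g = [0, 0, −9.81]^T m/s² and adds it, which is the same correction written in the world frame.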
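A quick numerical check of the gravity correction, covering the two cases from the text (at rest and in free fall). This is a sketch: `true_acceleration` and the 30° tilt are assumptions for the example, and R is taken to be the body-to-world rotation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

g_world = np.array([0.0, 0.0, 9.81])     # points up, as in the formula above

def true_acceleration(a_measured, R_wb):
    """Body-frame linear acceleration: a_true = a_measured - R^T g_world."""
    return a_measured - R_wb.T @ g_world

# IMU at rest but tilted 30 deg in roll: it senses only the table's reaction.
R_wb = Rotation.from_euler('x', 30, degrees=True).as_matrix()   # body-to-world
a_measured = R_wb.T @ g_world
a_rest = true_acceleration(a_measured, R_wb)

# Free fall: the sensor reads zero and the true acceleration is gravity.
a_fall_world = R_wb @ true_acceleration(np.zeros(3), R_wb)
print("at rest:", np.round(a_rest, 6), " free fall (world):", np.round(a_fall_world, 3))
```

Note that the at-rest reading is no longer [0, 0, 9.81] once the sensor is tilted, which is why the correction needs the current orientation.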

What the gyroscope measures

The gyroscope measures angular velocity ω in the body frame — the instantaneous rotation rate around each body axis. Integrate ω over time to get orientation change. Compose with the current orientation to update it:

R_k+1 = R_k · Exp(ω_k Δt)

This is the on-manifold SO(3) update from L10: take a step ωΔt in the tangent space, exponentiate it onto the manifold, compose with the current rotation.
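The update can be sketched with SciPy's `Rotation` class, where `Rotation.from_rotvec` plays the role of Exp and `*` composes rotations. The helper name, the 200 Hz rate, and the constant yaw rate are assumptions for the example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def integrate_gyro(R0, omegas, dt):
    """Apply R_{k+1} = R_k * Exp(omega_k * dt) for each body-frame sample."""
    R = R0
    for omega in omegas:
        R = R * Rotation.from_rotvec(np.asarray(omega) * dt)
    return R

dt = 1.0 / 200.0                                  # 200 Hz gyro
omegas = np.tile([0.0, 0.0, 0.5], (400, 1))       # 0.5 rad/s yaw rate for 2 s
R_end = integrate_gyro(Rotation.identity(), omegas, dt)
rotvec_end = R_end.as_rotvec()                    # axis-angle of the final rotation
print("accumulated rotation (rad):", np.round(rotvec_end, 6))
```

For a constant rate about one axis the result is simply rate × time, here about 1 rad of yaw. The increment is multiplied on the right because ω is measured in the body frame.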

Bias and noise — the killers

Every IMU reading is corrupted by two things:

Corruption	Symbol	What it does	Effect on integration
Additive noise	n_a, n_g	Zero-mean white noise at each sample	Random walk — grows as √T
Bias	b_a, b_g	Slowly drifting DC offset	Constant drift — grows linearly with T

The bias is insidious. If the accelerometer has a bias of b_a = 0.1 m/s², then integrating once to get velocity adds a velocity error of b_aΔt per step, or b_aT after time T. Integrating again to get position gives a position error of ½b_aT². Over 10 seconds at 0.1 m/s² bias, the position error is 5 m. This is why IMU-only dead reckoning fails within seconds.

Double integration amplifies bias quadratically. Accelerometer bias b_a causes velocity error that grows as b_a·T and position error that grows as ½b_a·T². With a typical MEMS bias of 0.05 m/s², in 60 seconds the position error is ½×0.05×3600 = 90 m. This isn't a corner case — it's physics. The IMU cannot be used alone for positioning beyond a few seconds.
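The quadratic growth is easy to reproduce by integrating a constant bias sample by sample and comparing with ½b·T². This is a minimal sketch: the function name and the 200 Hz rate are assumptions, and the bias is treated as perfectly constant.

```python
def bias_position_error(bias, duration, rate_hz=200.0):
    """Double-integrate a constant accelerometer bias sample by sample."""
    dt = 1.0 / rate_hz
    v_err, p_err = 0.0, 0.0
    for _ in range(int(round(duration * rate_hz))):
        p_err += v_err * dt + 0.5 * bias * dt**2   # position uses the old velocity
        v_err += bias * dt
    return v_err, p_err

for bias, duration in [(0.1, 10.0), (0.05, 10.0), (0.05, 60.0)]:
    v_err, p_err = bias_position_error(bias, duration)
    closed_form = 0.5 * bias * duration**2
    print(f"b={bias} m/s^2, T={duration:.0f} s: v_err={v_err:.2f} m/s, "
          f"p_err={p_err:.2f} m (closed form {closed_form:.2f} m)")
```

The three cases correspond to the 5 m, 2.5 m, and 90 m figures quoted in this chapter.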

An IMU accelerometer has a bias b_a = 0.05 m/s². You double-integrate the IMU readings to estimate position for 10 seconds with no correction. What is the approximate position error due to this bias alone?

0.5 m (0.05 × 10, forgetting the second integration)
2.5 m (½ × 0.05 × 100 = 2.5 m)
0.05 m (bias is directly the position error)
0.005 m (bias is very small, negligible)

Chapter 5: IMU Dead-Reckoning — Fast but Blind

Dead reckoning means estimating your position by integrating known velocities or accelerations forward from a known starting point — no landmarks, no GPS, just pure integration. With an IMU, you integrate twice: acceleration to velocity, velocity to position. It's conceptually simple and blazing fast, but the errors are catastrophic on their own.

The three integration equations

At time step k, with Δt between samples, the IMU propagates the state (R, v, p) — orientation, velocity, position — as:

R_k+1 = R_k · Exp((ω_k − b_g) Δt)

v_k+1 = v_k + (R_k(a_k − b_a) + g) Δt

p_k+1 = p_k + v_k Δt + ½ (R_k(a_k − b_a) + g) Δt²

where g = [0, 0, −9.81]^T m/s² in world frame. The key insight: every line depends on R_k, which accumulates its own errors. Orientation error pollutes velocity, which pollutes position.
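The three equations translate almost line for line into code. The sketch below assumes SciPy's `Rotation` for R and a level, stationary IMU as test input. `propagate` and the 0.1 m/s² x-bias are names and values chosen for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY = np.array([0.0, 0.0, -9.81])    # world frame, pointing down

def propagate(R, v, p, omega, acc, b_g, b_a, dt):
    """One dead-reckoning step for the state (R, v, p)."""
    a_world = R.apply(acc - b_a) + GRAVITY            # R_k (a_k - b_a) + g
    p_next = p + v * dt + 0.5 * a_world * dt**2
    v_next = v + a_world * dt
    R_next = R * Rotation.from_rotvec((omega - b_g) * dt)
    return R_next, v_next, p_next

# Stationary, level IMU: it reads +9.81 on z and should not move.
dt, zero = 1.0 / 200.0, np.zeros(3)
reading = np.array([0.0, 0.0, 9.81])
state_ok = (Rotation.identity(), zero, zero)
state_biased = (Rotation.identity(), zero, zero)
for _ in range(2000):                                 # 10 s at 200 Hz
    state_ok = propagate(*state_ok, zero, reading, zero, zero, dt)
    # Same sensor, but with an uncorrected 0.1 m/s^2 bias on the x-axis.
    state_biased = propagate(*state_biased, zero, reading + [0.1, 0, 0], zero, zero, dt)
p_ok, p_biased = state_ok[2], state_biased[2]
print("clean:", np.round(p_ok, 3), " biased:", np.round(p_biased, 3))
```

With clean data the estimate stays at the origin. With the uncorrected bias it slides about 5 m along x in 10 s, even though the sensor never moved.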

Worked drift example: 1D car

Car drives straight for 10 s at 5 m/s (should end at 50 m). Accelerometer reads 0 (constant speed). But bias b_a = 0.1 m/s²:

After 1 s: velocity error = 0.1 m/s (estimate 5.1 m/s instead of 5); position error = ½×0.1×1 = 0.05 m
After 5 s: velocity error = 0.5 m/s; position error = ½×0.1×25 = 1.25 m
After 10 s: velocity error = 1.0 m/s; position error = ½×0.1×100 = 5 m

The car ends up at an estimated 55 m instead of the true 50 m — 10% error after just 10 seconds, with only a 0.1 m/s² bias. And real IMU bias is often larger.

Bias is the enemy, not noise. Zero-mean noise partially averages out: after N steps its contribution to velocity grows only as √N, and its contribution to position as N^(3/2). Bias never averages out. It's a systematic offset that is integrated in full at every step, giving the b·T and ½b·T² terms above. VIO estimates bias jointly with the trajectory, and that's the key innovation.

Orientation-error coupling

Now add a gyroscope bias b_g = 0.01 rad/s. After 10 s, the orientation error is 0.1 rad ≈ 5.7°. In the velocity integration, R_k rotates the accelerometer reading into world frame. A 0.1 rad orientation error means gravity (9.81 m/s²) leaks into the horizontal acceleration at sin(0.1) × 9.81 ≈ 0.98 m/s². This is about 10× larger than the original accelerometer bias! And because the tilt itself grows with time, the leaked acceleration grows roughly linearly and the resulting position error grows as about g·b_g·T³/6, roughly 16 m after 10 s. Orientation error catastrophically amplifies position drift.
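The gravity-leak arithmetic can be tabulated for a few integration times. This is a back-of-the-envelope sketch that assumes the tilt error is about a horizontal axis and ignores every other error source.

```python
import numpy as np

g = 9.81        # m/s^2
b_g = 0.01      # rad/s gyro bias
b_a = 0.1       # m/s^2 accelerometer bias, for comparison

for T in (1.0, 5.0, 10.0):
    tilt = b_g * T                       # orientation error in rad
    leak = g * np.sin(tilt)              # gravity misread as horizontal acceleration
    print(f"T={T:4.1f} s  tilt={np.degrees(tilt):.2f} deg  "
          f"leak={leak:.3f} m/s^2  ratio to b_a={leak / b_a:.1f}")
leak_10s = g * np.sin(b_g * 10.0)
```

After 10 s the leaked acceleration is close to 1 m/s², roughly ten times the accelerometer bias used in the car example.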

IMU Dead-Reckoning — Watch Position Diverge

Press Go to integrate a noisy IMU. Grey = true trajectory (straight line). Red = IMU-only estimate. Toggle bias and noise levels to see how quickly position drifts.

Accel bias (m/s²) 0.05

Accel noise σ 0.02

Why does gyroscope bias cause position error that grows faster than the bias × time would suggest?

It doesn't: gyroscope bias only affects orientation and has no impact on position
Because the gyroscope reads faster than the accelerometer, so errors add up more quickly
Orientation error causes gravity to leak into the acceleration estimate, amplifying position drift by up to g × sin(θ_err)
Because integration is a nonlinear operation that squares the error

Chapter 6: Why Fuse Vision + IMU — Complementary Failure Modes

The camera and the IMU fail in completely different ways. This is the key observation that makes VIO so powerful: their weaknesses are each other's strengths.

Property	Camera alone (VO)	IMU alone	VIO (fused)
Update rate	~30 Hz (slow)	100–400 Hz (fast)	IMU rate, corrected by camera
Metric scale	Unknown (up-to-scale)	Metric (m/s² is metric)	Metric ✓
Gravity direction	Unknown	Measures g directly	Known ✓
Short-term accuracy	Good (feature-rich scenes)	Very good (over sub-second intervals)	Excellent ✓
Long-term drift	~0.5–2% of path	Diverges in seconds	~0.1–0.5% of path
Motion blur	Fails (features lost)	Unaffected	IMU bridges dropout ✓
Low texture	Fails (no features)	Unaffected	IMU bridges ✓
Initialization	Instant (any scene)	Needs initial alignment	Requires motion excitation

The IMU provides three critical "free gifts" to the camera:
1. Metric scale — the IMU measures in real SI units, pinning the scale of the camera's up-to-scale trajectory.
2. Gravity direction — knowing g immediately constrains 2 of the 3 orientation degrees of freedom (roll and pitch are fully observable from the accelerometer at rest).
3. Motion bridging — when the camera fails (blur, darkness, textureless walls), the IMU keeps propagating the state at high rate, providing a good initial guess for when the camera recovers.

The camera provides three critical gifts to the IMU:
1. Bias correction: the optimizer estimates IMU bias b_a, b_g jointly with the trajectory. When the camera pins the trajectory accurately, the optimizer can infer what bias would explain the discrepancy between visual and inertial motion, and correct it.
2. Long-term stability: the camera's feature matches tie each pose to fixed scene landmarks, and those constraints prevent unbounded growth of IMU error.
3. Orientation observability: yaw is not observable from the accelerometer (rotating about the gravity axis leaves the measured gravity unchanged), and integrating the gyroscope alone lets yaw drift. Camera features constrain yaw relative to the scene via the homography/essential matrix.

Loosely coupled vs. tightly coupled

Loosely coupled VIO: run VO separately on the camera (output: relative pose T_k,k+1 with covariance). Run IMU integration separately (output: predicted T_k,k+1^IMU). Fuse the two estimates in an EKF or factor graph. Simple but suboptimal — errors in VO are opaque to the IMU subsystem.

Tightly coupled VIO: the optimizer jointly minimizes reprojection errors (from camera feature tracks) and IMU preintegration residuals in a single factor graph over poses, velocities, biases, and 3D landmark positions. This is harder to implement but produces the best accuracy. MSCKF, OKVIS, ORB-VIO, and VINS-Mono are all tightly coupled.
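For contrast, the fusion step of a loosely coupled design can be as small as an information-weighted average of the two subsystems' outputs. The sketch below fuses only a relative translation. It assumes both estimates are Gaussian and independent, and that the VO translation has already been scaled to meters. The function name and all numbers are hypothetical.

```python
import numpy as np

def fuse(x_a, P_a, x_b, P_b):
    """Information-weighted fusion of two independent Gaussian estimates."""
    info_a, info_b = np.linalg.inv(P_a), np.linalg.inv(P_b)
    P = np.linalg.inv(info_a + info_b)          # fused covariance
    x = P @ (info_a @ x_a + info_b @ x_b)       # fused mean
    return x, P

# Hypothetical relative translations (m) for one frame pair.
t_vo = np.array([0.10, 0.00, 0.02])
P_vo = np.diag(np.square([0.01, 0.01, 0.04]))   # VO: weak along the optical axis
t_imu = np.array([0.12, 0.01, 0.02])
P_imu = np.diag(np.square([0.03, 0.03, 0.03]))  # IMU: uniform but noisier

t_fused, P_fused = fuse(t_vo, P_vo, t_imu, P_imu)
print("fused translation:", np.round(t_fused, 4))
print("fused sigma:", np.round(np.sqrt(np.diag(P_fused)), 4))
```

The fused estimate leans toward whichever source is more certain on each axis. What this scheme cannot do is use the raw feature tracks to correct IMU bias, which is what tight coupling adds.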

Why can't a monocular camera alone (without IMU or stereo) determine whether a drone moved 1 m forward or 2 m forward, even with perfect feature tracking?

Because the essential matrix requires at least 8 points to be solved
Because the epipolar constraint is homogeneous in E=[t]×R: multiplying t by any λ > 0 scales E by λ, and λE is satisfied by exactly the same correspondences
Because the camera has limited pixel resolution and can't see sub-pixel motion
Because the IMU hasn't been initialized yet

Chapter 7: IMU Preintegration — One Constraint from Many Readings

Between two camera keyframes there are typically many IMU readings. If the camera runs at 30 Hz and the IMU at 200 Hz, that's about 6–7 IMU samples per camera frame, and keyframes are usually spaced several frames apart, so one keyframe interval can hold tens to hundreds of samples. The back-end optimizer works in terms of keyframe poses. How do you incorporate all those IMU readings efficiently?

Naive approach: re-integrate all IMU readings from the first keyframe whenever you want to compute the constraint between keyframes i and j. Problem: if the optimizer changes the estimate at keyframe i (as it will, repeatedly), you'd have to redo the entire integration chain — an O(N) cost per optimization step over N IMU samples.

Preintegration: recognize that the relative motion between two keyframes can be summarized as a single measurement — the preintegrated IMU measurement — that depends only on the IMU readings between those keyframes, not on the absolute pose at the start. When the optimizer changes the keyframe poses, the preintegrated constraint stays fixed; only the residual changes. This is O(1) per optimization step.

The key insight of preintegration: factor out the global rotation. The IMU integration from keyframe i to j can be summarized by three quantities, ΔR_ij, Δv_ij, and Δp_ij: the relative rotation, velocity change, and position change expressed in the local body frame at keyframe i. These three quantities depend only on the raw IMU readings between i and j (and the bias estimates), not on the absolute world-frame poses. They can be computed once and stored.

What the preintegrated measurement looks like

For IMU readings ω_k (gyro) and a_k (accel) at times t_k between keyframes i and j:

ΔR_ij = ∏_{k=i}^{j−1} Exp((ω_k − b_g) Δt)

Δv_ij = ∑_{k=i}^{j−1} ΔR_ik (a_k − b_a) Δt

Δp_ij = ∑_{k=i}^{j−1} [ Δv_ik Δt + ½ ΔR_ik (a_k − b_a) Δt² ]

These are computed once in preprocessing. The residuals in the factor graph are then:

r_R = Log(ΔR_ij^T · R_i^T R_j)

r_v = R_i^T(v_j − v_i − g Δt_ij) − Δv_ij

r_p = R_i^T(p_j − p_i − v_i Δt_ij − ½g Δt_ij²) − Δp_ij
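The sums and residuals above can be checked end to end on synthetic data: dead-reckon keyframe j from keyframe i with the Chapter 5 equations, preintegrate the same samples, and confirm that all three residuals vanish. This is a sketch only. It ignores noise, covariance propagation, and bias Jacobians, and the function names and random IMU samples are assumptions for the example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

G = np.array([0.0, 0.0, -9.81])          # world gravity, as in Chapter 5

def preintegrate(omegas, accs, b_g, b_a, dt):
    """Accumulate (dR, dv, dp) in the body frame of keyframe i."""
    dR, dv, dp = Rotation.identity(), np.zeros(3), np.zeros(3)
    for omega, acc in zip(omegas, accs):
        a_i = dR.apply(acc - b_a)                    # dR_ik (a_k - b_a)
        dp = dp + dv * dt + 0.5 * a_i * dt**2
        dv = dv + a_i * dt
        dR = dR * Rotation.from_rotvec((omega - b_g) * dt)
    return dR, dv, dp

def residuals(R_i, v_i, p_i, R_j, v_j, p_j, dR, dv, dp, T):
    """Rotation, velocity and position residuals between keyframes i and j."""
    r_R = (dR.inv() * R_i.inv() * R_j).as_rotvec()   # Log(dR^T R_i^T R_j)
    r_v = R_i.inv().apply(v_j - v_i - G * T) - dv
    r_p = R_i.inv().apply(p_j - p_i - v_i * T - 0.5 * G * T**2) - dp
    return r_R, r_v, r_p

# Synthetic IMU samples between two keyframes (values are arbitrary).
rng = np.random.default_rng(0)
dt, n = 1.0 / 200.0, 40
omegas = 0.5 * rng.standard_normal((n, 3))
accs = np.array([0.0, 0.0, 9.81]) + rng.standard_normal((n, 3))
b_g, b_a = np.array([0.01, -0.02, 0.005]), np.array([0.05, 0.0, -0.03])

# Keyframe i state, then dead-reckon to keyframe j in the world frame.
R_i = Rotation.from_rotvec([0.3, -0.2, 0.5])
v_i, p_i = np.array([1.0, 0.5, 0.0]), np.array([2.0, -1.0, 3.0])
R, v, p = R_i, v_i.copy(), p_i.copy()
for omega, acc in zip(omegas, accs):
    a_w = R.apply(acc - b_a) + G
    p = p + v * dt + 0.5 * a_w * dt**2
    v = v + a_w * dt
    R = R * Rotation.from_rotvec((omega - b_g) * dt)

dR, dv, dp = preintegrate(omegas, accs, b_g, b_a, dt)
r_R, r_v, r_p = residuals(R_i, v_i, p_i, R, v, p, dR, dv, dp, n * dt)
print(np.linalg.norm(r_R), np.linalg.norm(r_v), np.linalg.norm(r_p))
```

Notice that `preintegrate` never touches R_i, v_i, or p_i. If the optimizer moves keyframe i, only the call to `residuals` has to be repeated.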

Bias updates without re-integration

When the optimizer updates the bias estimate (b_a, b_g), you'd normally need to redo the preintegration. But using the first-order Jacobians ∂ΔR⁄∂b_g, ∂Δv⁄∂b_a, ∂Δp⁄∂b_a (computed during preintegration), you can correct the preintegrated quantities cheaply with a linear approximation — no full re-integration needed unless the bias changes dramatically.
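As a small illustration of the idea, the sketch below accumulates ∂Δv⁄∂b_a alongside Δv and uses it to apply a bias update without re-integrating. It covers only the accelerometer-bias term, where Δv is exactly linear in b_a so the correction matches a full re-integration. The gyro-bias terms are only first-order accurate and are left out. Names and values are assumptions for the example.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate_velocity(omegas, accs, b_g, b_a, dt):
    """Return dv_ij and its Jacobian with respect to the accelerometer bias."""
    dR, dv, J_ba = Rotation.identity(), np.zeros(3), np.zeros((3, 3))
    for omega, acc in zip(omegas, accs):
        dv = dv + dR.apply(acc - b_a) * dt
        J_ba = J_ba - dR.as_matrix() * dt            # d(dv)/d(b_a) = -sum dR_ik dt
        dR = dR * Rotation.from_rotvec((omega - b_g) * dt)
    return dv, J_ba

rng = np.random.default_rng(0)
dt, n = 1.0 / 200.0, 40
omegas = 0.5 * rng.standard_normal((n, 3))
accs = np.array([0.0, 0.0, 9.81]) + rng.standard_normal((n, 3))
b_g, b_a = np.zeros(3), np.array([0.05, 0.0, -0.03])

dv, J_ba = preintegrate_velocity(omegas, accs, b_g, b_a, dt)
delta = np.array([0.01, -0.02, 0.005])               # hypothetical bias update
dv_corrected = dv + J_ba @ delta                     # cheap linear correction
dv_redone, _ = preintegrate_velocity(omegas, accs, b_g, b_a + delta, dt)
gap = np.linalg.norm(dv_corrected - dv_redone)
print("correction vs re-integration gap:", gap)
```

The correction costs one 3×3 matrix-vector product, whereas re-integration costs a pass over every IMU sample in the interval.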

Preintegration is the backbone of modern VIO. Systems like VINS-Mono (2018), OKVIS (2015), and ORB-VIO all use it. It converts a sequence of IMU readings into a single 9-DoF relative motion constraint (R, v, p) with an associated covariance matrix, which plugs directly into the factor graph optimizer from L10.

Why does IMU preintegration need to be redone (or corrected with first-order Jacobians) when the bias estimate b_g changes?

Because the preintegrated ΔR, Δv, Δp depend on b_g through the subtraction (ω_k−b_g) in each integration step
It doesn't: preintegration is fully independent of bias by definition
Because ΔR_ij = ∏ Exp((ω_k−b_g)Δt) is a product that multiplies through all b_g corrections
Only the noise term changes: bias has no effect on the mean estimate

Chapter 8: Showcase — VIO Simulator

Watch ground truth, VO-only, and VIO trajectories side by side. Toggle vision dropouts and IMU noise to see them cover each other's weaknesses.

VIO Simulator — Ground Truth vs VO-only vs VIO

Teal = ground truth. Warm orange = VO-only (drifts). Green = VIO (stays close). Enable dropouts to see the camera fail; the IMU bridges the gap. Enable IMU noise to see the IMU fail; the camera corrects it.

Vision dropouts High IMU noise

VO error ε 0.015

Chapter 9: Connections & Cheat Sheet

You've now built the complete story from raw pixels and IMU readings to a fused VIO system. Here's how everything fits together.

Core concepts — quick reference

Concept	One-liner	Key equation / number
VO pipeline	Track features → estimate relative pose → chain SE(3) → trajectory	T_w,k+1 = T_w,k · T_k,k+1
Drift	Relative errors compound — no absolute reference in pure VO	≈ ε_rel × total path length
Monocular scale	[λt]×R = λE satisfies the same epipolar constraint for any λ > 0; the camera is projective	|t| fixed to 1 by convention; scale unknown
IMU: accel	Measures specific force = true accel − gravity (in body frame)	a_true = a_meas − R^Tg
IMU: gyro	Measures angular velocity ω in body frame	R_k+1 = R_k·Exp(ωΔt)
Bias	Systematic DC offset; position error ≈ ½b·T²	10 s, 0.05 m/s² → 2.5 m error
VIO complementarity	Camera: scale-free, slow; IMU: metric, fast, drifts	Together: metric, fast, low drift
IMU preintegration	Summarize N IMU readings → one (R,v,p) constraint	ΔR_ij, Δv_ij, Δp_ij in body frame of keyframe i
Tight vs loose	Tight: joint optimize reprojection + IMU residuals; loose: fuse VO output + IMU separately	Tight is more accurate
Keyframes	Selected frames maximizing info gain; back-end only optimizes over these	Reduce compute; marginalize old ones

Where VO/VIO fits in the VNAV curriculum

L6: Features (L6)

Harris, SIFT, KLT — the front-end input

↓

L7: Two-View Geometry (L7)

Essential matrix, scale ambiguity — relative pose estimation

↓

L8: RANSAC (L8)

Robust outlier rejection — VO front-end robustness

↓

L9–L10: NLLS & Manifold (L9–L10)

Gauss-Newton, on-manifold — the back-end optimizer

↓

L11: VO / VIO (this lesson)

Full pipeline: feature→pose→chain; drift; scale; IMU; preintegration; fusion

↓

L12: Place Recognition

Bag-of-words, DBoW2 — detect loop closures

↓

L13: SLAM (coming)

Loop closure → pose graph → global correction — the cure for drift

The big picture: VO gives you a local, drifting trajectory. VIO gives you a metric, low-drift trajectory. SLAM (L13) adds loop closure to give you a globally consistent, drift-corrected map. Each layer builds on the previous — and VIO's IMU preintegration factors plug directly into the SLAM factor graph.

Open-source VIO systems to study

System	Back-end	Coupling	Notable for
MSCKF	Extended Kalman Filter	Tightly coupled	EKF-based; fast; no map points in state
OKVIS	Nonlinear optimization	Tightly coupled	First sliding-window VIO optimizer
VINS-Mono	Factor graph (Ceres)	Tightly coupled	Loop closure built in; popular for drones
ORB-SLAM3	Factor graph	Tightly coupled	Full SLAM with IMU; monocular/stereo/fisheye
OpenVINS	MSCKF variant	Tightly coupled	Open-source, modular, research-friendly

"Every measurement tells you something about the world. Every integration multiplies your ignorance. The art of state estimation is to measure as often as possible and integrate as rarely as possible." — paraphrasing Stergios Roumeliotis