Workbook — Visual Navigation · MIT 16.485

Visual Navigation Workbook

Every number a VNAV engineer must derive cold: SE(3) composition, rotation angles from trace, camera projection, focal length in pixels, epipolar constraint, RANSAC iteration count, Gauss-Newton steps, VO drift budgets, BoW similarity, and SLAM loop-closure arithmetic. Concrete numeric questions, fully worked derivations, verifiable in-browser.

Prerequisites: Basic linear algebra (matrix multiply, determinant). Trig & calculus basics. Familiarity with the pinhole camera model helps but is not required.
10
Chapters
50
Exercises
5
Exercise Types
Mastery
0 / 50 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: SE(3) & Lie Groups

SE(3) is the group of rigid-body transforms in 3D space. Every robot pose — every camera frame — lives here. Composing transforms is matrix multiplication; updating them optimally requires the Lie algebra so(3), where small corrections live as skew-symmetric matrices before being mapped back to SO(3) via the matrix exponential.

SE(3) element:
T = [[R, t], [0, 1]] ∈ SE(3),   R ∈ SO(3), t ∈ ℝ3

Composition: TAC = TAB · TBC

Angle from trace: θ = arccos((tr(R) − 1) ⁄ 2)

Rodrigues exp map: exp(φ̂) = I + sin(θ)φ̂⁄θ + (1 − cos(θ))(φ̂⁄θ)2
Key fact: Rotation matrices cannot be added — R1 + R2 is not a rotation. But they compose: R1R2 ∈ SO(3). Updates live in the tangent space so(3) and are applied via R ← R · exp(δ̂), which keeps R on the manifold.
Exercise 0.1: Compose Two SE(3) Transforms Derive

Frame A is the world origin. TAB translates by tAB = [3, 0, 0]T with R = I. TBC translates by tBC = [0, 2, 0]T with R = I. What is the y-component of tAC in the composed transform TAC = TAB · TBC?

meters
Show derivation
TAC = TAB · TBC
tAC = RAB · tBC + tAB
= I · [0, 2, 0]T + [3, 0, 0]T = [3, 2, 0]T

The SE(3) composition rule is tAC = RABtBC + tAB. Since RAB = I here, the translation from B to C is expressed unchanged in A's frame. The y-component is 2.

Exercise 0.2: Rotation Angle from Trace Derive

A rotation matrix R has trace tr(R) = 1.0. What is the rotation angle θ in degrees? (Use the formula θ = arccos((tr(R) − 1) ⁄ 2) and round to the nearest degree.)

degrees
Show derivation
θ = arccos((1.0 − 1) ⁄ 2) = arccos(0) = 90°

The trace of a 3D rotation by θ about any axis is 1 + 2cos(θ). Solving: cos(θ) = (tr(R) − 1)/2 = 0/2 = 0, so θ = 90°. Sanity check: tr(I) = 3 → θ = arccos(1) = 0°. tr(R180) = −1 → θ = arccos(−1) = 180°.

Exercise 0.3: Rotate a Point with SO(3) Derive

Rz(90°) rotates 90° around the z-axis: it maps x→y and y→−x. Apply Rz(90°) to the point p = [1, 0, 0]T. What is the y-component of Rz(90°)p?

(y-component)
Show derivation
Rz(90°) = [[0, −1, 0], [1, 0, 0], [0, 0, 1]]
Rz(90°) · [1, 0, 0]T = [0·1 + (−1)·0, 1·1 + 0·0, 0] = [0, 1, 0]T

Rz(θ) = [[cosθ, −sinθ, 0],[sinθ, cosθ, 0],[0,0,1]]. At 90°: cos=0, sin=1. Column 1 of Rz is [0, 1, 0]T, so p=[1,0,0] maps directly to it. The y-component is 1.

Exercise 0.4: exp Map Angle Recovery Derive

A Lie algebra element φ = [0, 0, 0.5236]T (axis-angle vector; the z-component is π⁄6 ≈ 0.5236 rad). The rotation angle is ||φ||. What is the rotation angle in degrees? (Round to the nearest integer.)

degrees
Show derivation
||φ|| = √(02 + 02 + 0.52362) = 0.5236 rad
0.5236 × (180⁄π) ≈ 0.5236 × 57.296 ≈ 30°

The Lie algebra element φ ∈ ℝ3 encodes the rotation axis as its direction and the rotation angle as its magnitude. Here φ = [0,0,π⁄6]T, so the rotation is 30° around the z-axis. exp(φ̂) = Rz(30°).

Exercise 0.5: SE(3) Inverse Translation Derive

T = [[R, t],[0,1]] with R = I and t = [5, 3, −2]T. The inverse of an SE(3) element is T−1 = [[RT, −RTt],[0,1]]. What is the x-component of the translation in T−1?

(x-component)
Show derivation
T−1 translation = −RTt = −I · [5, 3, −2]T = [−5, −3, 2]T

With R = I, RT = I, so −RTt = −t. The inverse translation is simply [−5,−3,2]T and x = −5. For non-identity R, the inverse rotation RT also rotates the translation vector, so order matters.

Chapter 1: Camera Projection

A pinhole camera collapses 3D space to 2D pixels. The forward model is p = K[R|t]P: a 3D world point P (homogeneous 4-vector) multiplied by the full projection matrix P = K[R|t] gives a 2D pixel (homogeneous 3-vector). Dividing by w recovers (u, v). Every calibration pipeline boils down to estimating K.

Intrinsic matrix K:
K = [[fx, s, cx], [0, fy, cy], [0, 0, 1]]

Focal length in pixels: fx = sx · f, where sx is pixels/meter and f is the physical focal length.

Radial distortion: r2 = ud2 + vd2; uu = ud(1 + k1r2 + k2r4)
Depth is lost. A pixel (u,v) corresponds to a full ray, not a point. All 3D points on the ray d·K−1[u,v,1]T project to the same pixel. Recovering depth requires stereo, SfM, or additional sensors.
Exercise 1.1: Project a 3D Point Derive

Camera with K = diag(800, 800, 1) and principal point (320, 240). A 3D point in camera coordinates is Pc = (0.3, 0.1, 4.0) m. What is the pixel u-coordinate?
Formula: u = fx · X⁄Z + cx

pixels
Show derivation
u = fx · X⁄Z + cx = 800 · 0.3⁄4.0 + 320
= 800 · 0.075 + 320 = 60 + 320 = 380 px

The projection u = fx(X/Z) + cx maps a camera-frame X coordinate to a pixel column. X/Z = 0.075 is the tangent angle; multiplying by fx=800 converts it from radians to pixels; adding cx=320 shifts from image center to image corner. The v-coordinate would be 800×(0.1/4.0)+240 = 260 px.

Exercise 1.2: Focal Length in Pixels Derive

A camera sensor has 4000 pixels across a 6 mm wide sensor (sx = 4000 ⁄ 0.006 pixels/meter). The physical focal length is f = 8 mm = 0.008 m. What is fx in pixels?

pixels
Show derivation
sx = 4000 ⁄ 0.006 ≈ 666,667 px⁄m
fx = sx · f = 666,667 × 0.008 = 5,333 px

Physical focal length × sensor density = focal length in pixels. A 50mm lens on a 35mm full-frame sensor (36mm wide, 8000px) would give fx = (8000/0.036)×0.05 ≈ 11,111 px. Focal length in pixels captures the combined effect of lens choice and sensor resolution.

Exercise 1.3: Back-Project a Pixel to a Ray Derive

Camera K = diag(600, 600, 1) with cx=320, cy=240. Pixel (u, v) = (440, 240). Back-project to normalized camera coordinates: xn = (u − cx) ⁄ fx. What is xn?

(unitless)
Show derivation
xn = (u − cx) ⁄ fx = (440 − 320) ⁄ 600 = 120 ⁄ 600 = 0.2

The back-projected ray direction in camera coordinates is [xn, yn, 1]T = [(u−cx)/fx, (v−cy)/fy, 1]T = [0.2, 0, 1]T. At depth Z=5m, the 3D point is [1.0, 0, 5]T. The ray direction is the unit vector K−1[u,v,1]T normalized.

Exercise 1.4: Radial Distortion Shift Derive

A distorted point in normalized coords: (ud, vd) = (0.3, 0.4). The radial distance r2 = ud2 + vd2. With k1 = 0.1 and k2 = 0, the undistorted uu = ud(1 + k1r2). What is uu? (Round to 3 decimal places.)

(unitless)
Show derivation
r2 = 0.32 + 0.42 = 0.09 + 0.16 = 0.25
uu = 0.3 × (1 + 0.1 × 0.25) = 0.3 × 1.025 = 0.3075

Radial distortion moves points outward (k1>0, barrel distortion) or inward (k1<0, pincushion). Here k1=0.1 gives a 2.5% outward push at r=0.5. Real wide-angle cameras can have k1 ≈ −0.3 (strong pincushion). Calibration tools (OpenCV, Kalibr) estimate k1, k2, k3 simultaneously with fx, fy, cx, cy.

Exercise 1.5: Depth Ambiguity — Scale Factor Derive

Point A at depth Z=2m projects to pixel u=380 with fx=800, cx=320 (X/Z = (380−320)/800 = 0.075, so X=0.15m). Point B is on the same ray at depth Z=6m. What is the X-coordinate of B (in meters)?

meters
Show derivation
Ray direction: xn = 0.075 (fixed by pixel)
X = xn · Z = 0.075 × 6 = 0.45 m

Both points project to the same pixel because X/Z = 0.075 is the same for both. The ray passes through A=(0.15, ?, 2) and B=(0.45, ?, 6) — three times farther gives three times larger X. This is the fundamental depth ambiguity: you know the ratio X/Z but not X or Z individually from a single frame.

Chapter 2: Features & Matching

Feature detection finds repeatable, distinctive keypoints. Harris corners use the structure tensor M — the sum of outer products of image gradients — to classify flat regions, edges, and corners by its eigenvalues. The ratio test rejects ambiguous matches, and descriptors (SIFT, ORB) encode local appearance for cross-frame comparison.

Structure tensor: M = ∑window[[Ix2, IxIy], [IxIy, Iy2]]

Harris response: R = det(M) − k · tr(M)2,   k ≈ 0.04–0.06

Lowe ratio test: accept if d1⁄d2 < 0.8 (d1 = best-match distance, d2 = 2nd-best)
Eigenvalue classification: λ1≈λ2≈0 → flat (no gradient). λ1≫λ2≈0 → edge. λ1≈λ2≫0 → corner. Harris encodes this without computing eigenvalues: det(M)−k·tr(M)2 is large only for corners.
Exercise 2.1: Structure Tensor Eigenvalue Classification Trace
A 3×3 window has Ix=5 at every pixel, Iy=0 at every pixel (strong horizontal gradient, no vertical gradient). What does the structure tensor M look like, and what is this region?
Show explanation

M = ∑[[Ix2, IxIy],[IxIy, Iy2]]. With Ix=5 and Iy=0 at every pixel in a 3×3 window (9 pixels): M = 9×[[25,0],[0,0]] = [[225,0],[0,0]]. The eigenvalues are 225 and 0: large in one direction (strong gradient perpendicular to the edge), zero in the other (no variation along the edge). This is a vertical edge (λ1≫λ2=0). Option C is qualitatively correct: one large eigenvalue, one zero.

Exercise 2.2: Harris Response Sign Derive

A window has structure tensor M with det(M) = 1200 and tr(M) = 50. Using k = 0.04, compute R = det(M) − k · tr(M)2. What is R?

(Harris R)
Show derivation
R = 1200 − 0.04 × 502
= 1200 − 0.04 × 2500
= 1200 − 100 = 100

R > 0 indicates a corner. R < 0 indicates an edge (det<k·tr2). R ≈ 0 indicates a flat region. For the threshold, practitioners keep keypoints with R > α · max(R) across the image, where α ≈ 0.01.

Exercise 2.3: Lowe Ratio Test Decision Derive

A SIFT descriptor from image A is compared to all descriptors in image B. The nearest-neighbor distance is d1 = 0.62 and the second-nearest distance is d2 = 0.90. Lowe's ratio threshold is 0.8. Is this match accepted or rejected?

d1 / d2 = 0.62 / 0.90 ≈ 0.689. With threshold 0.8:
Show explanation

Lowe's test: accept if d1/d2 < 0.8. Here 0.689 < 0.8, so the match is accepted. The intuition: if the best match is much closer than the second-best, it's probably correct. If the ratio is close to 1 (two equally plausible matches), the correspondence is ambiguous (e.g., repeated textures) and should be rejected. Lowe showed that 0.8 keeps 95% of correct matches and rejects 90% of incorrect ones.

Exercise 2.4: Corner vs. Flat — Determinant Sign Derive

Structure tensor M = [[4, 2], [2, 9]]. det(M) = 4×9 − 2×2 = ?

(det)
Show derivation
det(M) = 4 × 9 − 2 × 2 = 36 − 4 = 32

det(M) = λ1λ2 and tr(M) = λ12. With det=32>0 and tr=13, both eigenvalues are positive — this is a corner region. The Harris response is R = 32 − 0.04×132 = 32 − 6.76 = 25.24 > 0, confirming a corner.

Exercise 2.5: FAST Corners — Minimum Arc Trace
FAST corner detection checks a circle of 16 pixels around a candidate point. It requires a contiguous arc of N pixels all brighter (or all darker) than center + threshold. What is N in the original FAST-9 variant?
Show explanation

FAST-9 requires 9 consecutive pixels in the 16-pixel Bresenham circle all brighter or all darker than the center by a threshold. A 9-pixel arc is a little over half the circle, making it geometrically impossible for an edge (which would have brighter pixels on one side but at most 8 consecutive bright pixels). FAST-12 (12 pixels) is more conservative; ORB uses FAST-9 as its detector.

Chapter 3: Two-View Geometry

Two calibrated views of the same scene are linked by the essential matrix E = [t]×R. Every correct correspondence (y1, y2) satisfies the epipolar constraint y2TEy1 = 0. Triangulation recovers 3D depth, but depth uncertainty grows as Z2⁄baseline.

Epipolar constraint:2TEỹ1 = 0

Essential matrix: E = [t]×R, rank 2, scale-ambiguous

Triangulation depth error: σZ ≈ (Z2 / (f · b)) · σpx
Scale ambiguity: monocular SfM recovers R and the unit direction of t, but not ||t||. Multiplying all 3D points by λ and t by λ produces the same image observations. Absolute scale requires IMU, GPS, or a known-size object.
Exercise 3.1: Epipolar Constraint Check Derive

t = [1, 0, 0]T, R = I. Then E = [t]×R = [t]× = [[0,0,0],[0,0,−1],[0,1,0]]. A candidate match: ỹ1 = [0, 0, 1]T, ỹ2 = [−0.2, 0, 1]T. Compute ỹ2TEỹ1. What is the result?

(should be 0)
Show derivation
Eỹ1 = [[0,0,0],[0,0,−1],[0,1,0]] · [0,0,1]T
= [0, −1, 0]T
2T(Eỹ1) = [−0.2, 0, 1] · [0, −1, 0]T
= (−0.2)×0 + 0×(−1) + 1×0 = 0 ✓

The epipolar constraint is satisfied: this match is geometrically consistent with the given camera motion. A horizontal baseline (t along x) produces horizontal epipolar lines — y2 must have the same v as y1 (both have normalized v=0 here).

Exercise 3.2: Triangulate Depth from Two Views Derive

Two cameras: camera 1 at origin, camera 2 at t=[b, 0, 0]T with b=0.12m (12cm baseline), R=I. A 3D point is at X=0 (centred), Y=0, Z. In camera 1 normalized coords: y1 = [0, 0, 1]T. In camera 2: y2 = [−0.04, 0, 1]T. The horizontal disparity is −b/Z → Z = b/|disparity| = 0.12/0.04. What is Z?

meters
Show derivation
Z = b / |xn2 − xn1| = 0.12 / |−0.04 − 0| = 0.12 / 0.04 = 3.0 m

With a pure horizontal baseline and no rotation, the depth Z = baseline / disparity (in normalized coords). In pixel coords: Z = f · b / (u1 − u2). This is the stereo depth formula. Doubling the baseline halves depth error; doubling the depth quadruples it.

Exercise 3.3: Depth Error Scales as Z² Derive

Stereo depth error: σZ = (Z2 / (f · b)) · σpx. Given f=600px, b=0.1m, σpx=1 pixel. At Z=3m: σZ = (32/(600×0.1))×1. What is σZ in meters?

meters
Show derivation
σZ = (32 / (600 × 0.1)) × 1
= (9 / 60) × 1 = 0.15 m

At Z=6m (double): σZ = 36/60 = 0.6m — quadrupled, as expected from the Z2 scaling. At Z=10m with the same setup: σZ = 100/60 ≈ 1.67m, which is why stereo SLAM loses depth accuracy fast at range. LiDAR has roughly linear depth error, giving it a huge advantage beyond ~10m.

Exercise 3.4: Essential Matrix DOF Trace
The Essential matrix E encodes the relative pose between two calibrated cameras. How many degrees of freedom does E have (after removing the overall scale), and why does the 5-point algorithm use exactly 5 correspondences?
Show explanation

Relative pose has 6 DOF (3 rotation + 3 translation). But only the direction of translation is observable from image correspondences (multiply all 3D points by λ and t by λ → same images). So translation has 2 DOF (unit sphere). Total: 3 + 2 = 5 DOF. Each correspondence provides one scalar constraint (y2TEy1=0). Five constraints = 5 equations for 5 unknowns. The 8-point algorithm uses 8 correspondences and fits a general 3×3 rank-2 matrix (8 DOF before enforcing rank-2), which is over-parameterized but simpler to implement.

Exercise 3.5: Disparity at Double the Baseline Derive

A stereo pair with f=500px and baseline b=0.08m observes a point at depth Z=4m. Disparity d = f·b⁄Z. Now the baseline is doubled to 0.16m. By what factor does the disparity change?

× (factor)
Show derivation
d1 = 500 × 0.08 / 4 = 10 px
d2 = 500 × 0.16 / 4 = 20 px
Factor = 20 / 10 = 2×

Disparity is proportional to baseline: doubling b doubles d. Larger disparity means the depth estimate is more precise (same sub-pixel error over a larger signal). The ZED stereo camera (12cm baseline) and the Intel D435 (5cm baseline) were designed with this tradeoff in mind.

Chapter 4: RANSAC

RANSAC tolerates high outlier rates by repeatedly sampling a minimal subset, fitting a model, and counting inliers. The required iteration count N guarantees that at least one all-inlier sample is drawn with probability p, given inlier fraction w and minimal sample size s.

Iteration count: N = log(1 − p) ⁄ log(1 − ws)

Minimal sample sizes: line: s=2; homography: s=4; fundamental/essential matrix: s=8 (or 5); plane: s=3

Inlier test: point i is inlier if ||residuali|| < ε
Practical tip: the 5-point essential-matrix solver (s=5) drastically reduces N vs. the 8-point (s=8). At w=0.5: N5pt = log(0.01)/log(1−0.03125) ≈ 145; N8pt = log(0.01)/log(1−0.00391) ≈ 1177. Use the minimal solver.
Exercise 4.1: RANSAC Iteration Count — Line Fit Derive

You fit a 2D line (s=2). Inlier fraction w=0.7. Desired confidence p=0.99. Use N = log(1−p) / log(1−ws). Compute N (round up to integer). Note: log(0.01)/log(1−0.49) ≈ log(0.01)/log(0.51). Use ln(0.01)≈−4.605 and ln(0.51)≈−0.673.

iterations
Show derivation
ws = 0.72 = 0.49
1 − ws = 0.51
N = ln(1−0.99) / ln(0.51)
= ln(0.01) / ln(0.51)
= −4.605 / −0.673 ≈ 6.84 → round up to 7

With 70% inliers and s=2, you only need 7 iterations to have 99% confidence of hitting an all-inlier pair. The simplicity of the line fit (small s) compensates for uncertainty. Compare to the homography (s=4): w=0.7, N = ln(0.01)/ln(1−0.74) = −4.605/ln(0.7599) ≈ −4.605/−0.2744 ≈ 17 iterations.

Exercise 4.2: RANSAC with Low Inlier Fraction Derive

8-point essential matrix (s=8). w=0.5 (50% outliers). p=0.99. Compute N. Use w8=0.58=0.00391, so 1−ws=0.99609. ln(0.01)/ln(0.99609). Use ln(0.99609)≈−0.003918.

iterations
Show derivation
N = ln(0.01) / ln(0.99609)
= −4.605 / −0.003918 ≈ 1176 → 1177 iterations

Compare to the 5-point algorithm at w=0.5: w5=0.03125, 1−ws=0.96875, N = ln(0.01)/ln(0.96875) = −4.605/−0.03175 ≈ 145 iterations. The 5-point solver reduces iterations by 8× at 50% inliers. With 60% outliers (w=0.4): 8-point needs N≈17,000 iterations — effectively unusable. 5-point needs ≈600.

Exercise 4.3: Inlier Count Derive

A RANSAC iteration proposes a homography H. You test all N=120 correspondences: residuals below ε=2px count as inliers. You find 78 inliers. What is the estimated inlier fraction w (as a percentage, rounded to integer)?

%
Show derivation
w = 78 / 120 = 0.65 = 65%

This inlier fraction can be used to dynamically update N during the RANSAC loop. Adaptive RANSAC starts with N=∞ and, after each iteration, updates N = log(1−p)/log(1−ws) using the current best w. If w improves, N shrinks. PROSAC (Progressive Sample Consensus) additionally sorts correspondences by quality to try high-quality pairs first.

Exercise 4.4: Minimal Sample Size for Homography Trace
Why does homography estimation require exactly 4 point correspondences (s=4) as the minimal sample?
Show explanation

A 3×3 homography matrix H has 9 entries but is defined up to scale (multiply H by any λ ≠ 0 and it represents the same mapping). So it has 8 DOF. Each correspondence (x',y') ↔ (x,y) contributes 2 scalar equations (one for x', one for y' after the homogeneous divide). Four correspondences give 8 equations for 8 unknowns: exactly determined. With 3 correspondences you have 6 equations for 8 unknowns: a 2-parameter family of solutions.

Exercise 4.5: Required Iterations at p=0.95 vs. p=0.99 Derive

Homography RANSAC, s=4, w=0.6. At p=0.99: N99=ln(0.01)/ln(1−0.64). 0.64=0.1296; ln(0.8704)≈−0.1386; N99≈ln(0.01)/(−0.1386)=33.2 → 34. What is N at p=0.95? (ln(0.05)≈−2.996)

iterations
Show derivation
N95 = ln(0.05) / ln(0.8704)
= −2.996 / −0.1386 ≈ 21.6 → 22 iterations

Going from 95% to 99% confidence requires 34 vs. 22 iterations — only 55% more work for a significantly stronger guarantee. In practice, 500–2000 iterations are used for robustness; the formula gives the theoretical minimum. Note that increasing w from 0.6 to 0.8 would reduce N99 from 34 to just 7 iterations.

Chapter 5: Nonlinear Least Squares

Camera bundle adjustment, SLAM graph optimization, and IMU preintegration all reduce to NLS: minimize ∑||ri(x)||2. The Gauss-Newton step linearizes each residual with its Jacobian and solves the resulting linear system. Levenberg-Marquardt blends Gauss-Newton with gradient descent via a damping parameter λ.

Normal equations (linear LS): (ATA)x̂ = ATb

Gauss-Newton step: (JTJ)δ* = −JTr(x̄);   x̄ ← x̄ + δ*

Levenberg-Marquardt: (JTJ + λI)δ* = −JTr(x̄)
LM interpolates: when λ→0, LM = Gauss-Newton (fast near solution). When λ→∞, LM ≈ gradient descent with step 1/λ (robust far from solution). Increase λ if cost increases; decrease if cost decreases. This automatic trust-region control makes LM the workhorse of bundle adjustment.
Exercise 5.1: Normal Equations — 1D Linear Fit Derive

Two data points: (x=1, y=2) and (x=3, y=4). Fit y = ax. Model matrix A = [[1],[3]], b = [[2],[4]]. Solve (ATA)â = ATb. ATA = 12+32 = 10; ATb = 1×2+3×4 = 14. What is â = ATb / ATA?

(slope a)
Show derivation
ATA = 12 + 32 = 10
ATb = 1 × 2 + 3 × 4 = 14
â = 14 / 10 = 1.4

The residuals are r = b − Aâ = [2−1.4, 4−4.2]T = [0.6, −0.2]T. Cost = 0.36 + 0.04 = 0.40. Verify: d(cost)/da = 2AT(Aa−b) = 2(10×1.4−14) = 0 ✓. The data doesn't pass through the origin perfectly (truth would be a=1 for y=x), so the least-squares fit balances both residuals.

Exercise 5.2: One Gauss-Newton Step Derive

Scalar problem: minimize f(x) = r(x)2 where r(x) = x2 − 3. Current estimate: x̄ = 2. r(x̄) = 4−3 = 1. J = dr/dx = 2x̄ = 4. Gauss-Newton step: δ = −(JTJ)−1JTr = −r/J. What is the new estimate x̄ + δ?

(new x)
Show derivation
δ = −r(x̄) / J = −1 / 4 = −0.25
xnew = x̄ + δ = 2 + (−0.25) = 1.75

The true solution is x* = √3 ≈ 1.732. After one step we're at 1.75 — already very close. A second GN step: r(1.75) = 1.752−3 = 0.0625; J=3.5; δ=−0.0179; x=1.732 — converged. GN converges quadratically near the solution for well-conditioned problems.

Exercise 5.3: LM Damping Effect on Step Size Derive

Scalar problem: JTJ = 4, JTr = −2 (so GN step δGN = −JTr / JTJ = 2/4 = 0.5). Now apply LM with λ=4: (JTJ + λ)δ = −JTr = 2. What is δLM?

(step size)
Show derivation
(JTJ + λ)δ = 2
(4 + 4)δ = 2
δLM = 2 / 8 = 0.25

Damping λ=4 (equal to JTJ) halves the step from 0.5 to 0.25. As λ→∞, δ→0 (infinitesimal gradient-descent steps). As λ→0, δ→0.5 (pure Gauss-Newton). Typical LM implementations start with λ=10−4·max(diag(JTJ)) and multiply by 10 on failure, divide by 10 on success.

Exercise 5.4: Residual After GN Step Derive

Before GN step: cost f = ∑r2 = 9 (r1=2, r2=−2, r3=1 → cost=4+4+1=9). After one Gauss-Newton step with J = [[2,0],[−1,1],[0,1]], JTJ = [[5,−1],[−1,2]], JTr = [[2],[−1]], the step δ satisfies (JTJ)δ=−JTr = [[−2],[1]]. det(JTJ)=10−1=9; δ1 = (−2×2−1×(−1))⁄9 = (−4+1)⁄9 = −3⁄9 = −1⁄3. What is δ2? (Use Cramer's rule: δ2=(5×1−(−1)×(−2))⁄9)

2)
Show derivation
δ2 = (5 × 1 − (−1) × (−2)) / 9
= (5 − 2) / 9 = 3 / 9 = 1⁄3 ≈ 0.333

Cramer's rule: for [[a,b],[c,d]]δ=v, δ1=(dv1−bv2)/det, δ2=(av2−cv1)/det. With v=[[−2],[1]] and det=9: δ2=(5×1−(−1)×(−2))/9=3/9=1/3. In practice, Cholesky factorization solves JTJδ=−JTr in O(n3/3) for dense systems or sparse Cholesky for large SLAM graphs.

Exercise 5.5: Information Matrix Interpretation Trace
In weighted least squares, the information matrix is Ω = diag(1/σ12, 1/σ22, ...). If sensor A has σ=0.01m and sensor B has σ=1m, and both measure the same quantity, the WLS solution weights sensor A by how much more than sensor B?
Show explanation

Information = 1/σ2. Ratio = (1/0.0001) / (1/1) = 1/0.0001 = 10,000. A sensor with 100× better precision gets 10,000× more weight in the cost function. This means a single good measurement can dominate thousands of poor ones. In SLAM this matters: a well-calibrated camera (sub-pixel, σ≈1px) should heavily outweigh a poor depth prior.

Chapter 6: Manifold Optimization

Naive Gauss-Newton on SE(3) adds a δ vector to R and t, but R + δR is not a rotation. The fix: parameterize the update in the Lie algebra so(3), apply a retraction R ← R·exp(δ̂), and define boxminus on the manifold to compute differences. This keeps all iterates on SO(3) throughout optimization.

Retraction: R ← R · exp(δ̂) (stays on SO(3))

Boxminus: R1 ⊟ R2 = log(R2TR1) ∈ so(3) (relative rotation as tangent vector)

Boxplus: R ⊞ δ = R · exp(δ̂)
Why not add R + δR? R + δR violates RTR = I after even one step — you immediately leave the manifold. Projecting back (re-orthogonalizing) is expensive and introduces approximation errors. The exponential map is exact: it defines a path that stays on SO(3).
Exercise 6.1: Direct Addition Leaves SO(3) Trace
R = I (identity rotation). You add δ = [[0, −ε, 0], [ε, 0, 0], [0, 0, 0]] (a skew-symmetric perturbation, ε=0.1). What is (I + δ)T(I + δ)?
Show explanation

(I+δ)T(I+δ) = I + δ + δT + δTδ. Since δ is skew-symmetric: δ + δT = 0. So (I+δ)T(I+δ) = I + δTδ. With ε=0.1: δTδ has entries of order ε2=0.01. So the result is I + O(ε2) ≠ I. For large enough ε this is a significant violation. The exponential map avoids this: exp(δ)Texp(δ) = I exactly.

Exercise 6.2: Boxminus — Angle Between Two Rotations Derive

R1 = Rz(30°) and R2 = Rz(50°). The boxminus R1 ⊟ R2 = log(R2TR1) = log(Rz(−50°)Rz(30°)) = log(Rz(−20°)). What is ||R1 ⊟ R2|| in degrees?

degrees
Show derivation
R2TR1 = Rz(−50°)Rz(30°) = Rz(−20°)
log(Rz(−20°)) is a vector of magnitude 20° (converted to rad: −20×π⁄180 ≈ −0.349)
||R1 ⊟ R2|| = 20°

Boxminus gives the "angular distance" between two rotations: the rotation that takes R2 to R1. This is the correct Riemannian geodesic distance on SO(3). The Euclidean distance ||R1−R2||F is NOT a good rotation distance — it depends on the embedding in ℝ9 rather than the manifold geometry.

Exercise 6.3: Retraction vs. Addition — Which Stays on SO(3)? Trace
You have R ∈ SO(3) and a tangent update δ ∈ ℝ3. Which update rule guarantees the result is still in SO(3)?
Show explanation

Only R·exp(δ̂) is guaranteed to be in SO(3) for any δ. exp(δ̂) ∈ SO(3) because δ̂ is skew-symmetric (the matrix exponential of a skew-symmetric matrix is orthogonal). The product of two SO(3) elements is in SO(3) (group closure). Options A and C break orthogonality for finite δ. Option D (Gram-Schmidt re-orthogonalization) works but is not the natural manifold retraction.

Exercise 6.4: SE(3) Perturbation Dimension Derive

SE(3) is a 6-dimensional manifold (3 for rotation, 3 for translation). Its Lie algebra se(3) has tangent vectors ξ = [ρ, φ]T ∈ ℝ6. How many scalar parameters does a single SE(3) Gauss-Newton update δ have?

parameters
Show derivation
dim(SE(3)) = dim(SO(3)) + dim(ℝ3) = 3 + 3 = 6

Each camera pose update δ ∈ ℝ6: three components for the rotation update (φ, applied via exp(φ̂)) and three for the translation update (ρ). In a factor graph with N camera poses and M 3D points, the full update vector δ has 6N + 3M components. A typical SLAM graph with 100 keyframes and 1000 landmarks has a 1500-dimensional update vector solved at each LM iteration.

Exercise 6.5: Rodrigues at 180° Derive

Rodrigues formula: exp(φ̂) = I + sin(θ)/θ · φ̂ + (1−cos(θ))/θ2 · φ̂2, where φ = [0, 0, π]T (180° around z). Then sin(π)=0 and (1−cos(π))/π2 = 2/π2. φ̂2 = [[−1,0,0],[0,−1,0],[0,0,0]] (skew of z-unit).
Result: R = I + 2/π2×π2×diag(−1,−1,0) = I + diag(−2,−2,0). What is R[0][0] (top-left entry)?

(R[0][0])
Show derivation
For φ = [0,0,π], θ = π: sin(π)=0, cos(π)=−1
(1−cosθ)/θ2 = 2/π2
(φ/θ)̂2 = [0,0,1]̂2 = [[−1,0,0],[0,−1,0],[0,0,0]]
R = I + 0 + (2/π2)×π2×[[−1,0,0],[0,−1,0],[0,0,0]]
= I + [[−2,0,0],[0,−2,0],[0,0,0]] = [[−1,0,0],[0,−1,0],[0,0,1]]

R[0][0] = −1. This is Rz(180°): x→−x, y→−y, z→z. Sanity check: tr(R) = −1+−1+1 = −1, and arccos((−1−1)/2)=arccos(−1)=180° ✓. At 180° the formula is well-defined (sin=0 term vanishes cleanly).

Chapter 7: Visual Odometry & Visual-Inertial Odometry

Visual odometry (VO) accumulates per-frame pose increments. Each 2% relative error compounds: after 50 steps, total drift can exceed 100% of the path length. VIO fuses camera and IMU: the IMU constrains short-term motion; the camera corrects long-term drift. Without loop closure, both still drift — but VIO drifts much slower.

Drift after N steps at ε% per step: accumulated drift ≈ N × ε% × step_length

Monocular scale factor: unobservable from images alone; set by first stereo frame, IMU, or known-size object

IMU dead-reckoning: position error ≈ ½ · abias · t2 (accelerometer bias → quadratic position drift)
Drift is the enemy of navigation. A 1% drift per meter of travel is excellent for VO. At 100m of travel, the absolute position error is ~1m. After 1km, ~10m. SLAM closes loops to bound drift. Without loop closures, long hallways and featureless environments remain a fundamental challenge.
Exercise 7.1: VO Drift Over N Steps Derive

A VO system has 0.5% relative translation error per step. Each step moves 0.3m. After 200 steps, what is the accumulated position drift in meters? (Assume errors add independently: drift ≈ N × ε × dstep.)

meters
Show derivation
drift = N × ε × dstep
= 200 × 0.005 × 0.3 = 0.3 m

200 steps × 0.3m = 60m total path. Drift = 0.5% × 60m = 0.30m. This is 0.5% of path length, consistent with the per-step error. State-of-the-art VO systems like VINS-Mono achieve ~0.5% drift over hundreds of meters. Loop closure can reduce accumulated drift to zero at the cost of a global optimization.

Exercise 7.2: Monocular Scale Initialization Trace
A monocular camera observes a scene for 10 frames. Without additional information, what can be recovered and what cannot?
Show explanation

Monocular SfM recovers structure and motion up to a global scale. If the camera moves 1m or 1km, the image observations are identical after scaling all 3D points and the translation by the same factor. Scale can be set by: (a) a stereo frame (known baseline), (b) IMU integration (metric acceleration), (c) a known-size object, or (d) GPS. Rotation is scale-independent and fully recoverable from pure image correspondences.

Exercise 7.3: IMU Dead-Reckoning Position Drift Derive

An accelerometer has a bias b = 0.01 m/s2 (typical MEMS). The robot dead-reckons for t = 10 seconds with no camera correction. Position error ≈ ½ · b · t2. What is the position drift in meters?

meters
Show derivation
drift = ½ × 0.01 × 102 = ½ × 0.01 × 100 = 0.5 m

0.5m of drift in 10 seconds is typical for consumer-grade IMU dead-reckoning. At t=30s: drift = ½×0.01×900 = 4.5m. This is why VIO fuses camera and IMU: the camera corrects IMU drift every few milliseconds, keeping overall drift manageable. High-end IMUs (tactical grade) have b≈0.0001 m/s2, giving only 0.005m drift in 10s.

Exercise 7.4: Keyframe Selection Rule Trace
A VO system processes frames at 30Hz. It only creates a new keyframe when the tracked features fall below 80% retention OR the camera has moved more than 0.5m. Why is this better than keyframing every frame?
Show explanation

Two motivations: (1) computational efficiency — the bundle adjustment graph grows linearly with keyframe count; selecting keyframes with sufficient baseline keeps the graph manageable. (2) geometric quality — two near-identical views produce near-zero baseline, making triangulation numerically ill-conditioned (depth uncertainty diverges). The 0.5m threshold ensures a meaningful baseline for each new keyframe. ORB-SLAM3 uses a similar strategy, marginalizing redundant keyframes to bound graph size.

Exercise 7.5: VIO vs. VO Drift Comparison Derive

VO system: 1.5% drift per meter. VIO system: 0.3% drift per meter. Both traverse a 500m corridor. What is the absolute position error (in meters) of each system at the end?

m (VIO error only)
Show derivation
VO error = 1.5% × 500 = 7.5 m
VIO error = 0.3% × 500 = 1.5 m

VIO reduces drift by 5× over VO alone. At 500m the difference is 6m — enough to miss a doorway or hit a wall. State-of-the-art VIO systems (VINS-Fusion, Kimera) achieve ~0.1–0.5% drift over hundreds of meters in indoor environments. Outdoor long-range navigation still requires GNSS or loop closures.

Chapter 8: Place Recognition & BoW

Place recognition matches a query image to a database of past frames. Bag-of-Visual-Words (BoW) maps each local descriptor to its nearest cluster center (visual word), building a histogram over K words. TF-IDF reweights the histogram to downweight ubiquitous words and amplify distinctive ones. Cosine similarity ranks candidates; geometric RANSAC verification confirms the loop closure.

TF-IDF weight for word k in image i:
wik = tfik × idfk = (nik⁄ni) × log(N⁄nk)

Cosine similarity: s(i,j) = (wi·wj) / (||wi|| · ||wj||)

Geometric verification: run RANSAC on top-K candidates to confirm inlier count > threshold
Two-stage retrieval: BoW retrieves top-K candidates in O(log N) per query using an inverted index. Geometric verification then runs RANSAC on only the top K (typically 5-10) candidates. Most false positives from BoW are eliminated by geometry; the overall pipeline is fast even for large databases.
Exercise 8.1: IDF Computation Derive

Database: N=1000 images. Visual word k appears in nk=10 images. Compute idfk = log(N⁄nk) (natural log). Use ln(100) ≈ 4.605.

(IDF value)
Show derivation
idfk = ln(1000 ⁄ 10) = ln(100) ≈ 4.605

A word appearing in 1% of images has IDF ≈ 4.6. A word appearing in 50% of images: idf = ln(2) ≈ 0.69. A word appearing in every image: idf = ln(1) = 0 (completely suppressed). The IDF is the primary discriminability signal: rare visual words distinguish places far better than common ones (which appear everywhere from generic textures like sky or asphalt).

Exercise 8.2: TF-IDF Weight for a Specific Word Derive

Query image has 500 features total. Visual word k = "red octagon" appears 25 times in this image (tfk = 25/500 = 0.05). The database IDF for word k is 4.605 (from Ex 8.1). What is the TF-IDF weight wk = tfk × idfk?

(TF-IDF weight)
Show derivation
tfk = 25 / 500 = 0.05
wk = tfk × idfk = 0.05 × 4.605 = 0.230

This visual word contributes weight 0.230 to the image's TF-IDF vector. A common word appearing 50 times but with idf=0.69 would only contribute 0.050×0.69=0.069 — 3.3× less influential despite being twice as frequent. TF-IDF correctly prioritizes distinctive rare words over common ones.

Exercise 8.3: Cosine Similarity of Two BoW Vectors Derive

Query wq = [0.3, 0.4] (2D example, already normalized: ||wq||=0.5). Database image wd = [0.4, 0.3] (||wd||=0.5). Cosine similarity = (wq·wd) / (||wq||·||wd||). What is the cosine similarity?

(similarity, 0-1)
Show derivation
wq·wd = 0.3×0.4 + 0.4×0.3 = 0.12 + 0.12 = 0.24
||wq|| = √(0.09+0.16) = √0.25 = 0.5
||wd|| = 0.5
cos sim = 0.24 / (0.5×0.5) = 0.24 / 0.25 = 0.96

Cosine similarity 0.96 is very high — the two images share most of their visual content. A threshold of 0.7–0.8 is typically used to identify loop-closure candidates. Identical images have similarity 1.0; completely different scenes have similarity ≈ 0 (orthogonal BoW histograms in high-dimensional vocabulary space).

Exercise 8.4: Geometric Verification Threshold Trace
BoW retrieval returns 5 candidate images for the query. Geometric verification runs RANSAC on each. Results: candidates get inlier counts [42, 38, 8, 5, 3]. With a threshold of 20 inliers to confirm a loop closure, which candidates are accepted?
Show explanation

Candidates 1 and 2 pass geometric verification (42 and 38 > 20 threshold). Candidates 3-5 are rejected as false positives from BoW (8, 5, 3 inliers — these are coincidental visual similarity, not actual loop closures). In practice, systems often require temporal consistency too: the loop-closure candidate must be consistent across multiple consecutive frames, reducing false positives further.

Exercise 8.5: Vocabulary Tree Branching Factor Derive

A vocabulary tree with branching factor b=10 and depth d=6 levels. Total number of leaf nodes (visual words) = bd. What is the vocabulary size?

words
Show derivation
leaves = bd = 106 = 1,000,000 words

DBoW2 (ORB-SLAM2) uses exactly these parameters: 10-ary tree with 6 levels = 106 words. Quantizing a descriptor takes only 6 comparisons (one per level), not 106. The tree structure gives O(b·d) = O(60) assignment time vs. O(106) for flat k-means. This is why 1M-word vocabularies are practical in real-time SLAM.

Chapter 9: SLAM & Robustness — Capstone

SLAM fuses odometry, landmarks, and loop closures in a factor graph. A single outlier loop closure can fold the entire map because squared cost grows unboundedly. Robust costs (Huber, Cauchy, truncated LS) cap the influence of high-residual measurements. IRLS re-weights each measurement by its robust weight at each iteration, effectively down-weighting outliers to near-zero influence.

Squared: ρ(u) = u2⁄2
Huber (k=1): ρ(u) = u2⁄2 if |u|≤1;   |u|−½ if |u|>1
Truncated LS: ρ(u) = min(u2⁄2, c2⁄2)
IRLS weight: wi = ρ′(ui) / ui; solve (JTWJ)δ = −JTWr
Capstone insight: robust estimation is not about rejecting outliers by hand — it is about choosing a cost function whose influence function saturates. IRLS makes this a drop-in replacement for Gauss-Newton: same structure, just multiply each residual by its robust weight. The entire SLAM machinery still applies.
Exercise 9.1: Loop-Closure Error Redistribution Derive

A SLAM trajectory has 10 poses. VO accumulated 2.0m drift between pose 1 and pose 10. A loop closure is detected: pose 10 should be 0.5m from pose 1 (instead of the VO-predicted 2.5m). The correction of 2.0m is distributed evenly across all 9 inter-pose segments. How much does each segment shift (in meters)?

m/segment
Show derivation
correction per segment = 2.0 m / 9 segments ≈ 0.222 m

This is the simplest pose-graph correction: uniform distribution of loop-closure error across the path. A full back-end optimizer (g2o, GTSAM) distributes the correction proportionally to odometry uncertainty at each segment — high-uncertainty segments absorb more of the correction. But even the naive equal-distribution approach dramatically reduces drift vs. no loop closure.

Exercise 9.2: Squared vs. Huber Cost at Large Residual Derive

Whitened residual u = 10 (outlier). k=1 for Huber. Compute: ρsq(10) = 102⁄2 = 50. ρHuber(10) = 10−½ = 9.5 (since |10|>1, use |u|−½). What is the ratio ρsq⁄ρHuber? (Round to nearest integer.)

× larger
Show derivation
ρsq(10) = 100/2 = 50
ρHuber(10) = 10 − 0.5 = 9.5
ratio = 50 / 9.5 ≈ 5.3 → ≈ 5

The squared cost is ~5× larger for u=10. At u=100: ρsq=5000 vs. ρHuber=99.5, a ratio of ~50. The Huber loss only grows linearly for large residuals, preventing a single outlier from dominating the graph. From the VNAV lesson's table: at u=10, Squared=50, Huber=9.5, Cauchy≈2.3, TLS=0.5 (fully saturated).

Exercise 9.3: Truncated LS — Full Saturation Derive

Truncated LS: ρTLS(u) = min(u2⁄2, c2⁄2) with c=1. At u=0.5 (inlier): ρTLS = min(0.125, 0.5) = 0.125. At u=5 (outlier): ρTLS = min(12.5, 0.5) = 0.5. What is the IRLS weight w = ρ′(u)/u for the outlier at u=5? (ρ′TLS(u) = 0 when |u|>c.)

(IRLS weight)
Show derivation
For |u| > c: ρTLS(u) = c2⁄2 (constant)
ρ′TLS(u) = 0 for |u| > c
w(u) = ρ′(u) / u = 0 / 5 = 0

TLS completely suppresses outliers: once the residual exceeds the threshold c, the IRLS weight is exactly zero. This is the harshest robust cost — outliers are entirely ignored. The downside: TLS has flat regions with zero gradient, making it non-convex and prone to local minima. GNC (Graduated Non-Convexity) solves this by starting with a convex surrogate and annealing toward TLS.

Exercise 9.4: IRLS Reweight — One Step Derive

Two measurements. Measurement 1 (inlier): u1=0.8, Huber k=1 → ρ′(u)=u=0.8 → w1=0.8/0.8=1.0. Measurement 2 (outlier): u2=8.0 → ρ′(u)=1 (linear branch) → w2=1/8.0 = 0.125. The IRLS weighted least-squares step uses W=diag(w1, w2). After reweighting, by what factor is the inlier weighted MORE than the outlier?

× more weight
Show derivation
w1 = 1.0;   w2 = 1/8 = 0.125
ratio = w1 / w2 = 1.0 / 0.125 = 8

The inlier has 8× more influence than the outlier after one IRLS step. In subsequent iterations, if the outlier residual grows further, its weight decreases further. IRLS converges when the weights stop changing — typically 5–15 iterations for well-conditioned problems. Cauchy weights decrease even faster: w(u)=1/(1+u2/c2); at u=8, c=1: w=1/65≈0.015 — 67× smaller than the inlier.

Exercise 9.5: Cost Comparison — Inlier vs. Outlier Budget Derive

A factor graph has 100 odometry factors (inliers, u≈1 each) and 1 loop-closure outlier (u=10). With squared cost: total inlier cost = 100×0.5=50. Outlier cost = 50. With Huber (k=1): total inlier cost = 100×0.5=50. Outlier Huber cost = 9.5. What fraction of total robust cost does the outlier contribute? (Round to 1 decimal place.)

% of total cost
Show derivation
Inlier total (Huber): 100 × 0.5 = 50
Outlier (Huber): 9.5
Total = 59.5
Outlier fraction = 9.5 / 59.5 ≈ 0.160 = 16.0%

With squared cost: outlier fraction = 50/(50+50)=50% — the single outlier equals ALL odometry combined. With Huber: 16%. With Cauchy (c=1, u=10): ρ(10)=ln(1+100)/2≈2.3; fraction=2.3/52.3≈4.4%. With TLS (c=1): ρ(10)=0.5; fraction=0.5/50.5≈1.0%. The robust cost hierarchy correctly assigns less and less budget to the outlier.