Visual Navigation Workbook

Chapter 0: SE(3) & Lie Groups

SE(3) is the group of rigid-body transforms in 3D space. Every robot pose and every camera frame lives here. Composing transforms is matrix multiplication; optimizing over them uses the Lie algebra. For the rotation part, small corrections live in so(3) as skew-symmetric matrices and are mapped back to SO(3) via the matrix exponential.

SE(3) element:
T = [[R, t], [0, 1]] ∈ SE(3), R ∈ SO(3), t ∈ ℝ³

Composition: T_AC = T_AB · T_BC

Angle from trace: θ = arccos((tr(R) − 1) ⁄ 2)

Rodrigues exp map: exp(φ̂) = I + sin(θ)φ̂⁄θ + (1 − cos(θ))(φ̂⁄θ)², with θ = ||φ||

Key fact: Rotation matrices cannot be added — R₁ + R₂ is not a rotation. But they compose: R₁R₂ ∈ SO(3). Updates live in the tangent space so(3) and are applied via R ← R · exp(δ̂), which keeps R on the manifold.

Exercise 0.1: Compose Two SE(3) Transforms Derive

Frame A is the world origin. T_AB translates by t_AB = [3, 0, 0]^T with R = I. T_BC translates by t_BC = [0, 2, 0]^T with R = I. What is the y-component of t_AC in the composed transform T_AC = T_AB · T_BC?

meters

Show derivation

T_AC = T_AB · T_BC
t_AC = R_AB · t_BC + t_AB
= I · [0, 2, 0]^T + [3, 0, 0]^T = [3, 2, 0]^T

The SE(3) composition rule is t_AC = R_AB · t_BC + t_AB. Since R_AB = I here, the translation from B to C is expressed unchanged in A's frame. The y-component is 2.
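The composition rule above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a pose is stored as an `(R, t)` pair with `R` a list of three rows; the helper names `mat_mul`, `mat_vec`, and `compose` are hypothetical and not part of the workbook.

```python
def mat_mul(A, B):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(R, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def compose(T_ab, T_bc):
    """Compose two poses given as (R, t) pairs: T_ac = T_ab * T_bc."""
    R_ab, t_ab = T_ab
    R_bc, t_bc = T_bc
    R_ac = mat_mul(R_ab, R_bc)
    # t_ac = R_ab * t_bc + t_ab
    t_ac = [r + t for r, t in zip(mat_vec(R_ab, t_bc), t_ab)]
    return R_ac, t_ac

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_ac, t_ac = compose((I3, [3.0, 0.0, 0.0]), (I3, [0.0, 2.0, 0.0]))
print(t_ac)  # [3.0, 2.0, 0.0]
```

With a non-identity R_AB the same function rotates t_BC into frame A before adding t_AB, which is why the order of composition matters.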

Exercise 0.2: Rotation Angle from Trace Derive

A rotation matrix R has trace tr(R) = 1.0. What is the rotation angle θ in degrees? (Use the formula θ = arccos((tr(R) − 1) ⁄ 2) and round to the nearest degree.)

degrees

Show derivation

θ = arccos((1.0 − 1) ⁄ 2) = arccos(0) = 90°

The trace of a 3D rotation by θ about any axis is 1 + 2cos(θ). Solving: cos(θ) = (tr(R) − 1)/2 = 0/2 = 0, so θ = 90°. Sanity check: tr(I) = 3 → θ = arccos(1) = 0°. tr(R₁₈₀) = −1 → θ = arccos(−1) = 180°.

Exercise 0.3: Rotate a Point with SO(3) Derive

R_z(90°) rotates 90° around the z-axis: it maps x→y and y→−x. Apply R_z(90°) to the point p = [1, 0, 0]^T. What is the y-component of R_z(90°)p?

(y-component)

Show derivation

R_z(90°) = [[0, −1, 0], [1, 0, 0], [0, 0, 1]]
R_z(90°) · [1, 0, 0]^T = [0·1 + (−1)·0, 1·1 + 0·0, 0] = [0, 1, 0]^T

R_z(θ) = [[cosθ, −sinθ, 0],[sinθ, cosθ, 0],[0,0,1]]. At 90°: cos=0, sin=1. Column 1 of R_z is [0, 1, 0]^T, so p=[1,0,0] maps directly to it. The y-component is 1.

Exercise 0.4: exp Map Angle Recovery Derive

A Lie algebra element φ = [0, 0, 0.5236]^T (axis-angle vector; the z-component is π⁄6 ≈ 0.5236 rad). The rotation angle is ||φ||. What is the rotation angle in degrees? (Round to the nearest integer.)

degrees

Show derivation

||φ|| = √(0² + 0² + 0.5236²) = 0.5236 rad
0.5236 × (180⁄π) ≈ 0.5236 × 57.296 ≈ 30°

The Lie algebra element φ ∈ ℝ³ encodes the rotation axis as its direction and the rotation angle as its magnitude. Here φ = [0,0,π⁄6]^T, so the rotation is 30° around the z-axis. exp(φ̂) = R_z(30°).
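As a rough check of the exp map and the trace formula together, the sketch below builds a rotation from an axis-angle vector with Rodrigues' formula and recovers the angle from the trace. The function names (`hat`, `exp_so3`, `angle_from_trace`) and the small-angle cutoff are illustrative assumptions.

```python
import math

def hat(phi):
    """Skew-symmetric matrix of a 3-vector."""
    x, y, z = phi
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(phi):
    """Rodrigues' formula: I + sin(theta) K + (1 - cos(theta)) K^2, K = hat(phi / theta)."""
    theta = math.sqrt(sum(c * c for c in phi))
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if theta < 1e-12:
        # Assumed cutoff: fall back to the first-order term I + hat(phi).
        K = hat(phi)
        return [[eye[i][j] + K[i][j] for j in range(3)] for i in range(3)]
    K = hat([c / theta for c in phi])
    K2 = mat_mul(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[eye[i][j] + s * K[i][j] + c * K2[i][j] for j in range(3)]
            for i in range(3)]

def angle_from_trace(R):
    """theta = arccos((tr(R) - 1) / 2), clamped against rounding error."""
    tr = R[0][0] + R[1][1] + R[2][2]
    return math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0)))

R = exp_so3([0.0, 0.0, math.pi / 6])
angle_deg = math.degrees(angle_from_trace(R))  # close to 30
```

The clamp inside `angle_from_trace` guards against traces that drift slightly outside [−1, 3] through floating-point rounding.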

Exercise 0.5: SE(3) Inverse Translation Derive

T = [[R, t],[0,1]] with R = I and t = [5, 3, −2]^T. The inverse of an SE(3) element is T⁻¹ = [[R^T, −R^Tt],[0,1]]. What is the x-component of the translation in T⁻¹?

(x-component)

Show derivation

T⁻¹ translation = −R^Tt = −I · [5, 3, −2]^T = [−5, −3, 2]^T

With R = I, R^T = I, so −R^Tt = −t. The inverse translation is simply [−5,−3,2]^T and x = −5. For non-identity R, the inverse rotation R^T also rotates the translation vector, so order matters.

Chapter 1: Camera Projection

A pinhole camera collapses 3D space to 2D pixels. The forward model is p = K[R|t]P: a 3D world point P (homogeneous 4-vector) multiplied by the 3×4 projection matrix M = K[R|t] gives a 2D pixel p (homogeneous 3-vector). Dividing by the third component w recovers (u, v). Intrinsic calibration boils down to estimating K together with the lens distortion coefficients.

Intrinsic matrix K:
K = [[f_x, s, c_x], [0, f_y, c_y], [0, 0, 1]]

Focal length in pixels: f_x = s_x · f, where s_x is pixels/meter and f is the physical focal length.

Radial distortion: r² = u_d² + v_d²; u_u = u_d(1 + k₁r² + k₂r⁴)

Depth is lost. A pixel (u,v) corresponds to a full ray, not a point. All 3D points on the ray d·K⁻¹[u,v,1]^T project to the same pixel. Recovering depth requires stereo, SfM, or additional sensors.

Exercise 1.1: Project a 3D Point Derive

Camera with f_x = f_y = 800 px and principal point (c_x, c_y) = (320, 240), i.e. K = [[800, 0, 320], [0, 800, 240], [0, 0, 1]]. A 3D point in camera coordinates is P^c = (0.3, 0.1, 4.0) m. What is the pixel u-coordinate?
Formula: u = f_x · X⁄Z + c_x

pixels

Show derivation

u = f_x · X⁄Z + c_x = 800 · 0.3⁄4.0 + 320
= 800 · 0.075 + 320 = 60 + 320 = 380 px

The projection u = f_x(X/Z) + c_x maps a camera-frame X coordinate to a pixel column. X/Z = 0.075 is the tangent of the horizontal viewing angle (a unitless ratio, not an angle in radians); multiplying by f_x=800 converts it to a pixel offset from the principal point; adding c_x=320 moves the origin from the principal point to the image corner. The v-coordinate would be 800×(0.1/4.0)+240 = 260 px.

Exercise 1.2: Focal Length in Pixels Derive

A camera sensor has 4000 pixels across a 6 mm wide sensor (s_x = 4000 ⁄ 0.006 pixels/meter). The physical focal length is f = 8 mm = 0.008 m. What is f_x in pixels?

pixels

Show derivation

s_x = 4000 ⁄ 0.006 ≈ 666,667 px⁄m
f_x = s_x · f = 666,667 × 0.008 = 5,333 px

Physical focal length × sensor density = focal length in pixels. A 50mm lens on a 35mm full-frame sensor (36mm wide, 8000px) would give f_x = (8000/0.036)×0.05 ≈ 11,111 px. Focal length in pixels captures the combined effect of lens choice and sensor resolution.

Exercise 1.3: Back-Project a Pixel to a Ray Derive

Camera with f_x = f_y = 600 px, c_x=320, c_y=240. Pixel (u, v) = (440, 240). Back-project to normalized camera coordinates: x_n = (u − c_x) ⁄ f_x. What is x_n?

(unitless)

Show derivation

x_n = (u − c_x) ⁄ f_x = (440 − 320) ⁄ 600 = 120 ⁄ 600 = 0.2

The back-projected ray direction in camera coordinates is [x_n, y_n, 1]^T = [(u−c_x)/f_x, (v−c_y)/f_y, 1]^T = [0.2, 0, 1]^T. At depth Z=5m, the 3D point is [1.0, 0, 5]^T. The ray direction is the unit vector K⁻¹[u,v,1]^T normalized.
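Projection and back-projection are inverses once a depth is supplied. The following sketch assumes zero skew and no distortion; `project` and `back_project` are hypothetical helper names.

```python
def project(P_c, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = P_c
    return fx * X / Z + cx, fy * Y / Z + cy

def back_project(u, v, fx, fy, cx, cy, depth=1.0):
    """Point on the pixel's viewing ray at the given depth Z (camera frame)."""
    x_n = (u - cx) / fx  # normalized image coordinates
    y_n = (v - cy) / fy
    return [x_n * depth, y_n * depth, depth]

u, v = project((0.3, 0.1, 4.0), 800.0, 800.0, 320.0, 240.0)      # about (380, 260)
point = back_project(440.0, 240.0, 600.0, 600.0, 320.0, 240.0, depth=5.0)
# point is about [1.0, 0.0, 5.0]; changing depth slides it along the same ray
```

Calling `back_project` with different depths and re-projecting each result returns the same pixel, which is the depth ambiguity in code form.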

Exercise 1.4: Radial Distortion Shift Derive

A distorted point in normalized coords: (u_d, v_d) = (0.3, 0.4). The radial distance r² = u_d² + v_d². With k₁ = 0.1 and k₂ = 0, the undistorted u_u = u_d(1 + k₁r²). What is u_u? (Round to 3 decimal places.)

(unitless)

Show derivation

r² = 0.3² + 0.4² = 0.09 + 0.16 = 0.25
u_u = 0.3 × (1 + 0.1 × 0.25) = 0.3 × 1.025 = 0.3075 (0.308 to three decimal places)

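In this distorted-to-undistorted model, k₁>0 moves points outward when undistorting, which corrects barrel distortion; k₁<0 moves them inward, which corrects pincushion. Here k₁=0.1 gives a 2.5% outward push at r=0.5. Sign conventions vary between tools: OpenCV applies the polynomial in the opposite direction (undistorted to distorted), and in that convention wide-angle lenses with strong barrel distortion typically have negative k₁ (for example k₁ ≈ −0.3). Calibration tools (OpenCV, Kalibr) estimate k₁, k₂, k₃ simultaneously with f_x, f_y, c_x, c_y.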
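The polynomial used in this exercise can be written as a one-step function. This sketch follows the workbook's direction (distorted in, undistorted out) and is not a drop-in for any particular library's convention; `undistort_radial` is a hypothetical name.

```python
def undistort_radial(u_d, v_d, k1, k2=0.0):
    """Apply u_u = u_d * (1 + k1*r^2 + k2*r^4) in normalized coordinates."""
    r2 = u_d * u_d + v_d * v_d
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return u_d * scale, v_d * scale

u_u, v_u = undistort_radial(0.3, 0.4, k1=0.1)  # about (0.3075, 0.41)
```

Both coordinates are scaled by the same factor, so the point moves along the line through the distortion center.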

Exercise 1.5: Depth Ambiguity — Scale Factor Derive

Point A at depth Z=2m projects to pixel u=380 with f_x=800, c_x=320 (X/Z = (380−320)/800 = 0.075, so X=0.15m). Point B is on the same ray at depth Z=6m. What is the X-coordinate of B (in meters)?

meters

Show derivation

Ray direction: x_n = 0.075 (fixed by pixel)
X = x_n · Z = 0.075 × 6 = 0.45 m

Both points project to the same pixel because X/Z = 0.075 is the same for both. The ray passes through A=(0.15, ?, 2) and B=(0.45, ?, 6) — three times farther gives three times larger X. This is the fundamental depth ambiguity: you know the ratio X/Z but not X or Z individually from a single frame.

Chapter 3: Two-View Geometry

Two calibrated views of the same scene are linked by the essential matrix E = [t]_×R. Every correct correspondence (y₁, y₂) satisfies the epipolar constraint y₂^TEy₁ = 0. Triangulation recovers 3D depth, but depth uncertainty grows as Z²⁄baseline.

Epipolar constraint: ỹ₂^TEỹ₁ = 0

Essential matrix: E = [t]_×R, rank 2, scale-ambiguous

Triangulation depth error: σ_Z ≈ (Z² / (f · b)) · σ_px

Scale ambiguity: monocular SfM recovers R and the unit direction of t, but not ||t||. Multiplying all 3D points by λ and t by λ produces the same image observations. Absolute scale requires IMU, GPS, or a known-size object.

Exercise 3.1: Epipolar Constraint Check Derive

t = [1, 0, 0]^T, R = I. Then E = [t]_×R = [t]_× = [[0,0,0],[0,0,−1],[0,1,0]]. A candidate match: ỹ₁ = [0, 0, 1]^T, ỹ₂ = [−0.2, 0, 1]^T. Compute ỹ₂^TEỹ₁. What is the result?

(should be 0)

Show derivation

Eỹ₁ = [[0,0,0],[0,0,−1],[0,1,0]] · [0,0,1]^T
= [0, −1, 0]^T
ỹ₂^T(Eỹ₁) = [−0.2, 0, 1] · [0, −1, 0]^T
= (−0.2)×0 + 0×(−1) + 1×0 = 0 ✓

The epipolar constraint is satisfied: this match is geometrically consistent with the given camera motion. A horizontal baseline (t along x) produces horizontal epipolar lines — y₂ must have the same v as y₁ (both have normalized v=0 here).
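A small sketch of the same check, assuming E is built as [t]_× R and points are given in normalized homogeneous coordinates. The helper names are illustrative.

```python
def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def epipolar_residual(y2, E, y1):
    """Evaluate y2^T E y1; zero for a geometrically consistent match."""
    Ey1 = [sum(E[i][k] * y1[k] for k in range(3)) for i in range(3)]
    return sum(a * b for a, b in zip(y2, Ey1))

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
E = mat_mul(skew([1.0, 0.0, 0.0]), I3)

good = epipolar_residual([-0.2, 0.0, 1.0], E, [0.0, 0.0, 1.0])  # same row: residual 0
bad = epipolar_residual([-0.2, 0.1, 1.0], E, [0.0, 0.0, 1.0])   # shifted row: nonzero
```

In a RANSAC inlier test the raw residual is usually replaced by a distance in pixels (for example the Sampson distance), since y₂ᵀEy₁ itself has no geometric units.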

Exercise 3.2: Triangulate Depth from Two Views Derive

Two cameras: camera 1 at origin, camera 2 at t=[b, 0, 0]^T with b=0.12m (12cm baseline), R=I. A 3D point is at X=0 (centred), Y=0, Z. In camera 1 normalized coords: y₁ = [0, 0, 1]^T. In camera 2: y₂ = [−0.04, 0, 1]^T. The horizontal disparity is −b/Z → Z = b/|disparity| = 0.12/0.04. What is Z?

meters

Show derivation

Z = b / |x_n2 − x_n1| = 0.12 / |−0.04 − 0| = 0.12 / 0.04 = 3.0 m

With a pure horizontal baseline and no rotation, the depth Z = baseline / disparity (in normalized coords). In pixel coords: Z = f · b / (u₁ − u₂). This is the stereo depth formula. Doubling the baseline halves depth error; doubling the depth quadruples it.

Exercise 3.3: Depth Error Scales as Z² Derive

Stereo depth error: σ_Z = (Z² / (f · b)) · σ_px. Given f=600px, b=0.1m, σ_px=1 pixel. At Z=3m: σ_Z = (3²/(600×0.1))×1. What is σ_Z in meters?

meters

Show derivation

σ_Z = (3² / (600 × 0.1)) × 1
= (9 / 60) × 1 = 0.15 m

At Z=6m (double): σ_Z = 36/60 = 0.6m — quadrupled, as expected from the Z² scaling. At Z=10m with the same setup: σ_Z = 100/60 ≈ 1.67m, which is why stereo SLAM loses depth accuracy fast at range. LiDAR has roughly linear depth error, giving it a huge advantage beyond ~10m.
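The Z² scaling is easy to tabulate. The sketch below assumes a rectified stereo pair and the first-order error model quoted above; the function names are hypothetical.

```python
def stereo_depth(f_px, baseline_m, disparity_px):
    """Depth from disparity for a rectified pair: Z = f * b / d."""
    return f_px * baseline_m / disparity_px

def depth_sigma(z_m, f_px, baseline_m, sigma_px=1.0):
    """First-order depth uncertainty: sigma_Z = Z^2 / (f * b) * sigma_px."""
    return z_m * z_m / (f_px * baseline_m) * sigma_px

# Error at 3 m, 6 m and 10 m for f = 600 px, b = 0.1 m, sigma_px = 1 px
sigmas = [depth_sigma(z, 600.0, 0.1) for z in (3.0, 6.0, 10.0)]
```

The three values come out near 0.15 m, 0.6 m and 1.67 m, matching the numbers in the explanation.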

Exercise 3.4: Essential Matrix DOF Trace

The Essential matrix E encodes the relative pose between two calibrated cameras. How many degrees of freedom does E have (after removing the overall scale), and why does the 5-point algorithm use exactly 5 correspondences?

A. E has 9 DOF because it is a 3×3 matrix; 5 points over-constrain it.
B. E has 5 DOF (3 for R, 2 for unit translation direction; scale is unobservable); 5 correspondences each contribute 1 scalar constraint, exactly determining E.
C. E has 6 DOF (full relative pose); 5 points under-constrain it.
D. E has 8 DOF; the 8-point algorithm is always preferable.

Show explanation

Relative pose has 6 DOF (3 rotation + 3 translation). But only the direction of translation is observable from image correspondences (multiply all 3D points by λ and t by λ → same images). So translation has 2 DOF (unit sphere). Total: 3 + 2 = 5 DOF. Each correspondence provides one scalar constraint (y₂^TEy₁=0). Five constraints = 5 equations for 5 unknowns (the minimal solver returns up to 10 candidate solutions, which are disambiguated with additional points). The 8-point algorithm uses 8 correspondences and fits a general 3×3 matrix up to scale (8 DOF), enforcing the essential-matrix structure (rank 2, two equal singular values) only afterwards; it is over-parameterized but simpler to implement.

Exercise 3.5: Disparity at Double the Baseline Derive

A stereo pair with f=500px and baseline b=0.08m observes a point at depth Z=4m. Disparity d = f·b⁄Z. Now the baseline is doubled to 0.16m. By what factor does the disparity change?

× (factor)

Show derivation

d₁ = 500 × 0.08 / 4 = 10 px
d₂ = 500 × 0.16 / 4 = 20 px
Factor = 20 / 10 = 2×

Disparity is proportional to baseline: doubling b doubles d. Larger disparity means the depth estimate is more precise (same sub-pixel error over a larger signal). The ZED stereo camera (12cm baseline) and the Intel D435 (5cm baseline) were designed with this tradeoff in mind.

Chapter 4: RANSAC

RANSAC tolerates high outlier rates by repeatedly sampling a minimal subset, fitting a model, and counting inliers. The required iteration count N guarantees that at least one all-inlier sample is drawn with probability p, given inlier fraction w and minimal sample size s.

Iteration count: N = log(1 − p) ⁄ log(1 − w^s)

Minimal sample sizes: line: s=2; homography: s=4; fundamental/essential matrix: s=8 (or 5); plane: s=3

Inlier test: point i is inlier if ||residual_i|| < ε

Practical tip: the 5-point essential-matrix solver (s=5) drastically reduces N vs. the 8-point (s=8). At w=0.5: N_5pt = log(0.01)/log(1−0.03125) ≈ 145; N_8pt = log(0.01)/log(1−0.00391) ≈ 1177. Use the minimal solver.
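The iteration formula is a one-liner. The sketch below rounds up to the next integer, which is an assumption about how N is used in practice; `ransac_iterations` is a hypothetical name.

```python
import math

def ransac_iterations(p, w, s):
    """Smallest integer N with 1 - (1 - w**s)**N >= p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))

n_line = ransac_iterations(0.99, 0.7, 2)    # 7
n_8pt = ransac_iterations(0.99, 0.5, 8)     # 1177
n_5pt = ransac_iterations(0.99, 0.5, 5)     # close to the unrounded 145 quoted above
```

Because w is raised to the power s, the sample size dominates the count: at the same inlier fraction, each extra point in the minimal sample multiplies the work by roughly 1/w.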

Exercise 4.1: RANSAC Iteration Count — Line Fit Derive

You fit a 2D line (s=2). Inlier fraction w=0.7. Desired confidence p=0.99. Use N = log(1−p) / log(1−w^s). Compute N (round up to integer). Note: log(0.01)/log(1−0.49) ≈ log(0.01)/log(0.51). Use ln(0.01)≈−4.605 and ln(0.51)≈−0.673.

iterations

Show derivation

w^s = 0.7² = 0.49
1 − w^s = 0.51
N = ln(1−0.99) / ln(0.51)
= ln(0.01) / ln(0.51)
= −4.605 / −0.673 ≈ 6.84 → round up to 7

With 70% inliers and s=2, you only need 7 iterations to have 99% confidence of hitting an all-inlier pair. The simplicity of the line fit (small s) compensates for uncertainty. Compare to the homography (s=4): w=0.7, N = ln(0.01)/ln(1−0.7⁴) = −4.605/ln(0.7599) ≈ −4.605/−0.2744 ≈ 17 iterations.

Exercise 4.2: RANSAC with Low Inlier Fraction Derive

8-point essential matrix (s=8). w=0.5 (50% outliers). p=0.99. Compute N. Use w⁸=0.5⁸≈0.003906, so 1−w^s≈0.996094. N = ln(0.01)/ln(0.996094). Use ln(0.996094)≈−0.003914.

iterations

Show derivation

N = ln(0.01) / ln(0.996094)
= −4.605 / −0.003914 ≈ 1176.6 → round up to 1177 iterations

Compare to the 5-point algorithm at w=0.5: w⁵=0.03125, 1−w^s=0.96875, N = ln(0.01)/ln(0.96875) = −4.605/−0.03175 ≈ 145 iterations. The 5-point solver reduces iterations by 8× at 50% inliers. With 60% outliers (w=0.4): the 8-point solver needs N≈7,000 iterations (w⁸≈0.000655), while the 5-point solver needs ≈450 (w⁵=0.01024), roughly 16× fewer.

Exercise 4.3: Inlier Count Derive

A RANSAC iteration proposes a homography H. You test all N=120 correspondences: residuals below ε=2px count as inliers. You find 78 inliers. What is the estimated inlier fraction w (as a percentage, rounded to integer)?

Show derivation

w = 78 / 120 = 0.65 = 65%

This inlier fraction can be used to dynamically update N during the RANSAC loop. Adaptive RANSAC starts with N=∞ and, after each iteration, updates N = log(1−p)/log(1−w^s) using the current best w. If w improves, N shrinks. PROSAC (Progressive Sample Consensus) additionally sorts correspondences by quality to try high-quality pairs first.

Exercise 4.4: Minimal Sample Size for Homography Trace

Why does homography estimation require exactly 4 point correspondences (s=4) as the minimal sample?

A. Because 4 points are needed to make the least-squares system full-rank
B. A homography has 8 DOF (3×3 matrix, scale-ambiguous → 8 free parameters); each 2D-to-2D correspondence supplies 2 equations; 4 × 2 = 8, exactly determined.
C. 4 is the minimum to avoid degenerate configurations (3 collinear points)
D. Because we need to estimate the rotation (3 DOF) and translation (2 DOF) plus scale (1 DOF) plus 2 more

Show explanation

A 3×3 homography matrix H has 9 entries but is defined up to scale (multiply H by any λ ≠ 0 and it represents the same mapping). So it has 8 DOF. Each correspondence (x',y') ↔ (x,y) contributes 2 scalar equations (one for x', one for y' after the homogeneous divide). Four correspondences give 8 equations for 8 unknowns: exactly determined. With 3 correspondences you have 6 equations for 8 unknowns: a 2-parameter family of solutions.

Exercise 4.5: Required Iterations at p=0.95 vs. p=0.99 Derive

Homography RANSAC, s=4, w=0.6. At p=0.99: N₉₉=ln(0.01)/ln(1−0.6⁴). 0.6⁴=0.1296; ln(0.8704)≈−0.1388; N₉₉≈ln(0.01)/(−0.1388)=33.2 → 34. What is N at p=0.95? (ln(0.05)≈−2.996)

iterations

Show derivation

N₉₅ = ln(0.05) / ln(0.8704)
= −2.996 / −0.1388 ≈ 21.6 → 22 iterations

Going from 95% to 99% confidence requires 34 vs. 22 iterations: only 55% more work for a significantly stronger guarantee. In practice, 500–2000 iterations are used for robustness; the formula gives the theoretical minimum. Note that increasing w from 0.6 to 0.8 would reduce N₉₉ from 34 to just 9 iterations (0.8⁴=0.4096, ln(0.5904)≈−0.527, N≈8.7).

Chapter 5: Nonlinear Least Squares

Camera bundle adjustment, SLAM graph optimization, and IMU preintegration all reduce to NLS: minimize ∑||r_i(x)||². The Gauss-Newton step linearizes each residual with its Jacobian and solves the resulting linear system. Levenberg-Marquardt blends Gauss-Newton with gradient descent via a damping parameter λ.

Normal equations (linear LS): (A^TA)x̂ = A^Tb

Gauss-Newton step: (J^TJ)δ* = −J^Tr(x̄); x̄ ← x̄ + δ*

Levenberg-Marquardt: (J^TJ + λI)δ* = −J^Tr(x̄)

LM interpolates: when λ→0, LM = Gauss-Newton (fast near solution). When λ→∞, LM ≈ gradient descent with step 1/λ (robust far from solution). Increase λ if cost increases; decrease if cost decreases. This automatic trust-region control makes LM the workhorse of bundle adjustment.

Exercise 5.1: Normal Equations — 1D Linear Fit Derive

Two data points: (x=1, y=2) and (x=3, y=4). Fit y = ax. Model matrix A = [[1],[3]], b = [[2],[4]]. Solve (A^TA)â = A^Tb. A^TA = 1²+3² = 10; A^Tb = 1×2+3×4 = 14. What is â = A^Tb / A^TA?

(slope a)

Show derivation

A^TA = 1² + 3² = 10
A^Tb = 1 × 2 + 3 × 4 = 14
â = 14 / 10 = 1.4

The residuals are r = b − Aâ = [2−1.4, 4−4.2]^T = [0.6, −0.2]^T. Cost = 0.36 + 0.04 = 0.40. Verify: d(cost)/da = 2A^T(Aa−b) = 2(10×1.4−14) = 0 ✓. The data lie on y = x + 1, which has a nonzero intercept, so no line through the origin fits both points exactly; the least-squares slope balances the two residuals.

Exercise 5.2: One Gauss-Newton Step Derive

Scalar problem: minimize f(x) = r(x)² where r(x) = x² − 3. Current estimate: x̄ = 2. r(x̄) = 4−3 = 1. J = dr/dx = 2x̄ = 4. Gauss-Newton step: δ = −(J^TJ)⁻¹J^Tr = −r/J. What is the new estimate x̄ + δ?

(new x)

Show derivation

δ = −r(x̄) / J = −1 / 4 = −0.25
x_new = x̄ + δ = 2 + (−0.25) = 1.75

The true solution is x* = √3 ≈ 1.732. After one step we're at 1.75, already very close. A second GN step: r(1.75) = 1.75²−3 = 0.0625; J=3.5; δ=−0.0179; x=1.732, converged to three decimals. GN converges quadratically near the solution when the residual at the optimum is zero, as it is here (in this scalar case GN coincides with Newton's method applied to r(x) = 0).
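The iteration above can be reproduced with a short loop. This is a scalar-only sketch in which the residual and its derivative are passed in as functions; `gauss_newton_scalar` is a hypothetical name.

```python
import math

def gauss_newton_scalar(r, J, x0, steps):
    """Scalar Gauss-Newton: x <- x - r(x) / J(x). Returns all iterates."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - r(x) / J(x)
        history.append(x)
    return history

history = gauss_newton_scalar(lambda x: x * x - 3.0, lambda x: 2.0 * x, 2.0, 3)
errors = [abs(x - math.sqrt(3.0)) for x in history]
# history[1] is 1.75; the error roughly squares at each step
```

Printing `errors` shows the number of correct digits roughly doubling per iteration, which is the quadratic convergence mentioned in the explanation.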

Exercise 5.3: LM Damping Effect on Step Size Derive

Scalar problem: J^TJ = 4, J^Tr = −2 (so GN step δ_GN = −J^Tr / J^TJ = 2/4 = 0.5). Now apply LM with λ=4: (J^TJ + λ)δ = −J^Tr = 2. What is δ_LM?

(step size)

Show derivation

(J^TJ + λ)δ = 2
(4 + 4)δ = 2
δ_LM = 2 / 8 = 0.25

Damping λ=4 (equal to J^TJ) halves the step from 0.5 to 0.25. As λ→∞, δ→0 (infinitesimal gradient-descent steps). As λ→0, δ→0.5 (pure Gauss-Newton). Typical LM implementations start with λ=10⁻⁴·max(diag(J^TJ)) and multiply by 10 on failure, divide by 10 on success.
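The effect of λ on the step can be seen by sweeping it in the scalar case. A minimal sketch, with `lm_step_scalar` as a hypothetical name:

```python
def lm_step_scalar(JtJ, Jtr, lam):
    """Solve (JtJ + lam) * delta = -Jtr for a scalar problem."""
    return -Jtr / (JtJ + lam)

steps = {lam: lm_step_scalar(4.0, -2.0, lam) for lam in (0.0, 4.0, 400.0)}
# lam = 0 gives the Gauss-Newton step 0.5, lam = 4 gives 0.25,
# and a large lam shrinks the step towards zero
```

For vector problems the same idea applies with λI (or λ·diag(JᵀJ), a common variant) added to the normal-equation matrix before solving.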

Exercise 5.4: Residual After GN Step Derive

Before GN step: cost f = ∑r² = 2 (r₁=1, r₂=0, r₃=−1 → cost=1+0+1=2). For a Gauss-Newton step with J = [[2,0],[−1,1],[0,1]], J^TJ = [[5,−1],[−1,2]], J^Tr = [[2],[−1]], the step δ satisfies (J^TJ)δ=−J^Tr = [[−2],[1]]. det(J^TJ)=10−1=9; δ₁ = (2×(−2)−(−1)×1)⁄9 = (−4+1)⁄9 = −3⁄9 = −1⁄3. What is δ₂? (Use Cramer's rule: δ₂=(5×1−(−1)×(−2))⁄9)

(δ₂)

Show derivation

δ₂ = (5 × 1 − (−1) × (−2)) / 9
= (5 − 2) / 9 = 3 / 9 = 1⁄3 ≈ 0.333

Cramer's rule: for [[a,b],[c,d]]δ=v, δ₁=(dv₁−bv₂)/det, δ₂=(av₂−cv₁)/det. With v=[[−2],[1]] and det=9: δ₂=(5×1−(−1)×(−2))/9=3/9=1/3. In practice, Cholesky factorization solves J^TJδ=−J^Tr in O(n³/3) for dense systems or sparse Cholesky for large SLAM graphs.
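Cramer's rule for a 2×2 system is short enough to write out directly. The sketch below solves the normal equations from this exercise; `solve_2x2` is a hypothetical helper and is only sensible for tiny, well-conditioned systems.

```python
def solve_2x2(M, v):
    """Solve M * delta = v for a 2x2 matrix M using Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

JtJ = [[5.0, -1.0], [-1.0, 2.0]]
rhs = [-2.0, 1.0]                # this is -J^T r
delta = solve_2x2(JtJ, rhs)      # about [-1/3, 1/3]
```

Substituting back, 5·(−1/3) − 1·(1/3) = −2 and −1·(−1/3) + 2·(1/3) = 1, so both equations are satisfied.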

Exercise 5.5: Information Matrix Interpretation Trace

In weighted least squares, the information matrix is Ω = diag(1/σ₁², 1/σ₂², ...). If sensor A has σ=0.01m and sensor B has σ=1m, and both measure the same quantity, the WLS solution weights sensor A by how much more than sensor B?

A. 100× more (linear in σ ratio)
B. 10× more (σ ratio)
C. 10,000× more (information = 1/σ²: ratio = (1/0.01²)/(1/1²) = 10,000)
D. Equal weight: both measure the same thing

Show explanation

Information = 1/σ². Ratio = (1/0.0001) / (1/1) = 1/0.0001 = 10,000. A sensor with 100× better precision gets 10,000× more weight in the cost function. This means a single good measurement can dominate thousands of poor ones. In SLAM this matters: a well-calibrated camera (sub-pixel, σ≈1px) should heavily outweigh a poor depth prior.

Chapter 6: Manifold Optimization

Naive Gauss-Newton on SE(3) adds a δ vector to R and t, but R + δR is not a rotation. The fix: parameterize the update in the Lie algebra so(3), apply a retraction R ← R·exp(δ̂), and define boxminus on the manifold to compute differences. This keeps all iterates on SO(3) throughout optimization.

Retraction: R ← R · exp(δ̂) (stays on SO(3))

Boxminus: R₁ ⊟ R₂ = log(R₂^TR₁) ∈ so(3) (relative rotation as tangent vector)

Boxplus: R ⊞ δ = R · exp(δ̂)

Why not add R + δR? R + δR violates R^TR = I after even one step — you immediately leave the manifold. Projecting back (re-orthogonalizing) is expensive and introduces approximation errors. The exponential map is exact: it defines a path that stays on SO(3).

Exercise 6.1: Direct Addition Leaves SO(3) Trace

R = I (identity rotation). You add δ = [[0, −ε, 0], [ε, 0, 0], [0, 0, 0]] (a skew-symmetric perturbation, ε=0.1). What is (I + δ)^T(I + δ)?

A. Exactly I: adding a skew-symmetric matrix preserves orthogonality
B. Not I: it equals I + δ + δ^T + δ^Tδ = I + δ^Tδ ≠ I, because δ + δ^T = 0 but δ^Tδ ≠ 0
C. I + ε²I, so it is close to I for small ε

Show explanation

(I+δ)^T(I+δ) = I + δ + δ^T + δ^Tδ. Since δ is skew-symmetric: δ + δ^T = 0. So (I+δ)^T(I+δ) = I + δ^Tδ. With ε=0.1: δ^Tδ = diag(ε², ε², 0) = diag(0.01, 0.01, 0). So the result is I + O(ε²) ≠ I, and the violation grows quadratically with ε. The exponential map avoids this: exp(δ)^Texp(δ) = I exactly.
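The difference between adding and retracting can be measured numerically. This sketch uses the closed form exp(δ) = R_z(ε) for a perturbation about the z-axis; `orthogonality_defect` is a hypothetical helper that reports the largest entry of MᵀM − I.

```python
import math

def gram(M):
    """Return M^T M for a 3x3 matrix stored as a list of rows."""
    return [[sum(M[k][i] * M[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def orthogonality_defect(M):
    """Largest absolute entry of M^T M - I; zero for an orthogonal matrix."""
    G = gram(M)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0))
               for i in range(3) for j in range(3))

eps = 0.1
added = [[1.0, -eps, 0.0], [eps, 1.0, 0.0], [0.0, 0.0, 1.0]]   # I + delta
c, s = math.cos(eps), math.sin(eps)
retracted = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]       # exp(delta) = R_z(eps)

defect_added = orthogonality_defect(added)          # about eps**2 = 0.01
defect_retracted = orthogonality_defect(retracted)  # at rounding-error level
```

Repeating the additive update over many iterations compounds the defect, while the retraction stays orthogonal up to floating-point rounding.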

Exercise 6.2: Boxminus — Angle Between Two Rotations Derive

R₁ = R_z(30°) and R₂ = R_z(50°). The boxminus R₁ ⊟ R₂ = log(R₂^TR₁) = log(R_z(−50°)R_z(30°)) = log(R_z(−20°)). What is ||R₁ ⊟ R₂|| in degrees?

degrees

Show derivation

R₂^TR₁ = R_z(−50°)R_z(30°) = R_z(−20°)
log(R_z(−20°)) is a rotation vector along the z-axis of magnitude 20° (z-component in radians: −20×π⁄180 ≈ −0.349)
||R₁ ⊟ R₂|| = 20°

Boxminus gives the "angular distance" between two rotations: the rotation that takes R₂ to R₁. This is the correct Riemannian geodesic distance on SO(3). The Euclidean distance ||R₁−R₂||_F is NOT a good rotation distance — it depends on the embedding in ℝ⁹ rather than the manifold geometry.

Exercise 6.3: Retraction vs. Addition — Which Stays on SO(3)? Trace

You have R ∈ SO(3) and a tangent update δ ∈ ℝ³. Which update rule guarantees the result is still in SO(3)?

A. R_new = R + δ̂ (add the skew-symmetric matrix directly)
B. R_new = R · exp(δ̂) (retraction via matrix exponential)
C. R_new = R · (I + δ̂) (first-order approximation)
D. R_new = R + δ, then normalize columns (project back)

Show explanation

Only R·exp(δ̂) is guaranteed to be in SO(3) for any δ. exp(δ̂) ∈ SO(3) because δ̂ is skew-symmetric (the matrix exponential of a skew-symmetric matrix is orthogonal with determinant +1). The product of two SO(3) elements is in SO(3) (group closure). Options A and C break orthogonality for finite δ. Option D restores unit-length columns, but normalizing alone does not make them mutually orthogonal; a full re-orthogonalization (Gram-Schmidt or an SVD projection) does return to SO(3), but it is not the natural manifold retraction.

Exercise 6.4: SE(3) Perturbation Dimension Derive

SE(3) is a 6-dimensional manifold (3 for rotation, 3 for translation). Its Lie algebra se(3) has tangent vectors ξ = [ρ, φ]^T ∈ ℝ⁶. How many scalar parameters does a single SE(3) Gauss-Newton update δ have?

parameters

Show derivation

dim(SE(3)) = dim(SO(3)) + dim(ℝ³) = 3 + 3 = 6

Each camera pose update δ ∈ ℝ⁶: three components for the rotation update (φ, applied via exp(φ̂)) and three for the translation update (ρ). In a factor graph with N camera poses and M 3D points, the full update vector δ has 6N + 3M components. A typical SLAM graph with 100 keyframes and 1000 landmarks has a 3600-dimensional update vector (6×100 + 3×1000) solved at each LM iteration.

Exercise 6.5: Rodrigues at 180° Derive

Rodrigues formula: exp(φ̂) = I + sin(θ)/θ · φ̂ + (1−cos(θ))/θ² · φ̂², where φ = [0, 0, π]^T (180° around z), so θ = π. Then sin(π)=0 and (1−cos(π))/π² = 2/π². φ̂² = π² · [[−1,0,0],[0,−1,0],[0,0,0]] (π² times the squared skew matrix of the z unit vector).
Result: R = I + (2/π²)×π²×diag(−1,−1,0) = I + diag(−2,−2,0). What is R[0][0] (top-left entry)?

(R[0][0])

Show derivation

For φ = [0,0,π], θ = π: sin(π)=0, cos(π)=−1
(1−cosθ)/θ² = 2/π²
(φ/θ)̂² = [0,0,1]̂² = [[−1,0,0],[0,−1,0],[0,0,0]]
R = I + 0 + (2/π²)×π²×[[−1,0,0],[0,−1,0],[0,0,0]]
= I + [[−2,0,0],[0,−2,0],[0,0,0]] = [[−1,0,0],[0,−1,0],[0,0,1]]

R[0][0] = −1. This is R_z(180°): x→−x, y→−y, z→z. Sanity check: tr(R) = −1+−1+1 = −1, and arccos((−1−1)/2)=arccos(−1)=180° ✓. At 180° the formula is well-defined (sin=0 term vanishes cleanly).

Chapter 7: Visual Odometry & Visual-Inertial Odometry

Visual odometry (VO) accumulates per-frame pose increments, so per-step errors add up. At 2% relative error per step, 50 steps give a drift of roughly one full step length (still 2% of the path), and uncorrected heading errors can make position drift grow faster than linearly. VIO fuses camera and IMU: the IMU constrains short-term motion; the camera corrects long-term drift. Without loop closure, both still drift, but VIO drifts much more slowly.

Drift after N steps at ε% per step: accumulated drift ≈ N × ε% × step_length

Monocular scale factor: unobservable from images alone; set by first stereo frame, IMU, or known-size object

IMU dead-reckoning: position error ≈ ½ · a_bias · t² (accelerometer bias → quadratic position drift)

Drift is the enemy of navigation. A 1% drift per meter of travel is excellent for VO. At 100m of travel, the absolute position error is ~1m. After 1km, ~10m. SLAM closes loops to bound drift. Without loop closures, long hallways and featureless environments remain a fundamental challenge.

Exercise 7.1: VO Drift Over N Steps Derive

A VO system has 0.5% relative translation error per step. Each step moves 0.3m. After 200 steps, what is the accumulated position drift in meters? (Assume the per-step errors add linearly, the worst case where they all point the same way: drift ≈ N × ε × d_step.)

meters

Show derivation

drift = N × ε × d_step
= 200 × 0.005 × 0.3 = 0.3 m

200 steps × 0.3m = 60m total path. Drift = 0.5% × 60m = 0.30m. This is 0.5% of path length, consistent with the per-step error. State-of-the-art VO systems like VINS-Mono achieve ~0.5% drift over hundreds of meters. Loop closure can reduce accumulated drift to near zero at the revisited location, at the cost of a global optimization.

Exercise 7.2: Monocular Scale Initialization Trace

A monocular camera observes a scene for 10 frames. Without additional information, what can be recovered and what cannot?

A. Nothing can be recovered: monocular SfM is mathematically impossible
B. Metric trajectory (position in meters) and metric 3D map
C. Rotation R between frames, and relative translation direction; 3D structure up to an unknown scale factor λ (the scene could be 1m or 100m away and produce the same images)
D. Only rotation; translation is unobservable monocularly

Show explanation

Monocular SfM recovers structure and motion up to a global scale. If the camera moves 1m or 1km, the image observations are identical after scaling all 3D points and the translation by the same factor. Scale can be set by: (a) a stereo frame (known baseline), (b) IMU integration (metric acceleration), (c) a known-size object, or (d) GPS. Rotation is scale-independent and fully recoverable from pure image correspondences.

Exercise 7.3: IMU Dead-Reckoning Position Drift Derive

An accelerometer has a bias b = 0.01 m/s² (typical MEMS). The robot dead-reckons for t = 10 seconds with no camera correction. Position error ≈ ½ · b · t². What is the position drift in meters?

meters

Show derivation

drift = ½ × 0.01 × 10² = ½ × 0.01 × 100 = 0.5 m

0.5m of drift in 10 seconds is typical for consumer-grade IMU dead-reckoning. At t=30s: drift = ½×0.01×900 = 4.5m. This is why VIO fuses camera and IMU: the camera corrects IMU drift at every frame (every few tens of milliseconds at typical frame rates), keeping overall drift manageable. High-end IMUs (tactical grade) have b≈0.0001 m/s², giving only 0.005m drift in 10s.
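The ½·b·t² rule comes from integrating a constant acceleration error twice. The sketch below does that integration numerically with a simple Euler scheme and compares it with the closed form; the 1 ms step size and the function name are assumptions for illustration.

```python
def dead_reckon_bias_drift(bias, duration, dt=0.001):
    """Integrate a constant accelerometer bias twice (Euler steps of size dt)."""
    steps = int(round(duration / dt))
    velocity = 0.0
    position = 0.0
    for _ in range(steps):
        velocity += bias * dt      # bias integrates into a velocity error
        position += velocity * dt  # velocity error integrates into position error
    return position

numeric = dead_reckon_bias_drift(0.01, 10.0)
closed_form = 0.5 * 0.01 * 10.0 ** 2   # 0.5 m
```

The two values agree to within the discretization error of the integrator, and tripling the duration multiplies both by about nine.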

Exercise 7.4: Keyframe Selection Rule Trace

A VO system processes frames at 30Hz. It only creates a new keyframe when the tracked features fall below 80% retention OR the camera has moved more than 0.5m. Why is this better than keyframing every frame?

A. Processing every frame reduces drift because more pose estimates are available
B. Keyframe selection controls the graph size (fewer nodes → faster optimization) while ensuring enough baseline between keyframes for good triangulation. Near-identical frames contribute little new information but add significant compute.
C. Keyframe selection helps with loop closure detection by spreading keyframes further apart

Show explanation

Two motivations: (1) computational efficiency: the bundle adjustment graph grows linearly with keyframe count; selecting keyframes with sufficient baseline keeps the graph manageable. (2) geometric quality: two near-identical views produce near-zero baseline, making triangulation numerically ill-conditioned (depth uncertainty diverges). The 0.5m threshold ensures a meaningful baseline for each new keyframe. ORB-SLAM3 uses a similar strategy, culling redundant keyframes to bound graph size.

Exercise 7.5: VIO vs. VO Drift Comparison Derive

VO system: 1.5% drift per meter. VIO system: 0.3% drift per meter. Both traverse a 500m corridor. What is the absolute position error (in meters) of each system at the end?

m (VIO error only)

Show derivation

VO error = 1.5% × 500 = 7.5 m
VIO error = 0.3% × 500 = 1.5 m

VIO reduces drift by 5× over VO alone. At 500m the difference is 6m — enough to miss a doorway or hit a wall. State-of-the-art VIO systems (VINS-Fusion, Kimera) achieve ~0.1–0.5% drift over hundreds of meters in indoor environments. Outdoor long-range navigation still requires GNSS or loop closures.

Chapter 8: Place Recognition & BoW

Place recognition matches a query image to a database of past frames. Bag-of-Visual-Words (BoW) maps each local descriptor to its nearest cluster center (visual word), building a histogram over K words. TF-IDF reweights the histogram to downweight ubiquitous words and amplify distinctive ones. Cosine similarity ranks candidates; geometric RANSAC verification confirms the loop closure.

TF-IDF weight for word k in image i:
w_ik = tf_ik × idf_k = (n_ik⁄n_i) × log(N⁄n_k)

Cosine similarity: s(i,j) = (w_i·w_j) / (||w_i|| · ||w_j||)

Geometric verification: run RANSAC on top-K candidates to confirm inlier count > threshold

Two-stage retrieval: BoW retrieves top-K candidates in O(log N) per query using an inverted index. Geometric verification then runs RANSAC on only the top K (typically 5-10) candidates. Most false positives from BoW are eliminated by geometry; the overall pipeline is fast even for large databases.

Exercise 8.1: IDF Computation Derive

Database: N=1000 images. Visual word k appears in n_k=10 images. Compute idf_k = log(N⁄n_k) (natural log). Use ln(100) ≈ 4.605.

(IDF value)

Show derivation

idf_k = ln(1000 ⁄ 10) = ln(100) ≈ 4.605

A word appearing in 1% of images has IDF ≈ 4.6. A word appearing in 50% of images: idf = ln(2) ≈ 0.69. A word appearing in every image: idf = ln(1) = 0 (completely suppressed). The IDF is the primary discriminability signal: rare visual words distinguish places far better than common ones (which appear everywhere from generic textures like sky or asphalt).

Exercise 8.2: TF-IDF Weight for a Specific Word Derive

Query image has 500 features total. Visual word k = "red octagon" appears 25 times in this image (tf_k = 25/500 = 0.05). The database IDF for word k is 4.605 (from Ex 8.1). What is the TF-IDF weight w_k = tf_k × idf_k?

(TF-IDF weight)

Show derivation

tf_k = 25 / 500 = 0.05
w_k = tf_k × idf_k = 0.05 × 4.605 = 0.230

This visual word contributes weight 0.230 to the image's TF-IDF vector. A common word appearing 50 times (tf = 50/500 = 0.10) but with idf=0.69 would only contribute 0.10×0.69=0.069, making it 3.3× less influential despite being twice as frequent. TF-IDF correctly prioritizes distinctive rare words over common ones.

Exercise 8.3: Cosine Similarity of Two BoW Vectors Derive

Query w_q = [0.3, 0.4] (2D example, not unit-normalized: ||w_q||=0.5). Database image w_d = [0.4, 0.3] (||w_d||=0.5). Cosine similarity = (w_q·w_d) / (||w_q||·||w_d||). What is the cosine similarity?

(similarity, 0-1)

Show derivation

w_q·w_d = 0.3×0.4 + 0.4×0.3 = 0.12 + 0.12 = 0.24
||w_q|| = √(0.09+0.16) = √0.25 = 0.5
||w_d|| = 0.5
cos sim = 0.24 / (0.5×0.5) = 0.24 / 0.25 = 0.96

Cosine similarity 0.96 is very high — the two images share most of their visual content. A threshold of 0.7–0.8 is typically used to identify loop-closure candidates. Identical images have similarity 1.0; completely different scenes have similarity ≈ 0 (orthogonal BoW histograms in high-dimensional vocabulary space).
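The same computation in plain Python, as a sketch. The function name is hypothetical; returning 0 for an all-zero vector is a convention chosen here, not something the exercise specifies.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for empty histograms
    return dot / (norm_a * norm_b)

w_q = [0.3, 0.4]
w_d = [0.4, 0.3]
print(round(cosine_similarity(w_q, w_d), 2))
```

Because the norms are divided out, the vectors do not need to be normalized beforehand.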

Exercise 8.4: Geometric Verification Threshold Trace

BoW retrieval returns 5 candidate images for the query. Geometric verification runs RANSAC on each. Results: candidates get inlier counts [42, 38, 8, 5, 3]. With a threshold of 20 inliers to confirm a loop closure, which candidates are accepted?

All 5 candidates (BoW already verified them)
Candidates 1 and 2 only (42 and 38 inliers > 20 threshold)
Only candidate 1 (highest inlier count is sufficient)
None — BoW similarity must also exceed 0.9

Show explanation

Candidates 1 and 2 pass geometric verification (42 and 38 > 20 threshold). Candidates 3-5 are rejected as false positives from BoW (8, 5, 3 inliers — these are coincidental visual similarity, not actual loop closures). In practice, systems often require temporal consistency too: the loop-closure candidate must be consistent across multiple consecutive frames, reducing false positives further.
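The acceptance rule is a strict threshold on the inlier count. A minimal Python sketch follows; the function name and the default of 20 are taken from this exercise, not from any SLAM library.

```python
def verified_candidates(inlier_counts, min_inliers=20):
    """Return 1-based indices of candidates with more than min_inliers inliers."""
    return [i for i, n in enumerate(inlier_counts, start=1) if n > min_inliers]

print(verified_candidates([42, 38, 8, 5, 3]))
```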

Exercise 8.5: Vocabulary Tree Branching Factor Derive

A vocabulary tree with branching factor b=10 and depth d=6 levels. Total number of leaf nodes (visual words) = b^d. What is the vocabulary size?

words

Show derivation

leaves = b^d = 10⁶ = 1,000,000 words

DBoW2 (ORB-SLAM2) uses exactly these parameters: 10-ary tree with 6 levels = 10⁶ words. Quantizing a descriptor descends 6 levels and compares against the 10 children at each one, so it takes about 60 descriptor comparisons, not 10⁶. The tree structure gives O(b·d) = O(60) assignment time vs. O(10⁶) for a flat nearest-word search. This is why 1M-word vocabularies are practical in real-time SLAM.
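The vocabulary size and the O(b·d) assignment cost can be tabulated with a small Python sketch. Both helper names are hypothetical, and the comparison count assumes one distance computation per child at every level.

```python
def vocabulary_size(branching: int, depth: int) -> int:
    """Number of leaves (visual words) in a full tree: b ** d."""
    return branching ** depth

def comparisons_per_descriptor(branching: int, depth: int) -> int:
    """Descending the tree compares against `branching` children per level."""
    return branching * depth

b, d = 10, 6
print(vocabulary_size(b, d), comparisons_per_descriptor(b, d))
```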

Chapter 9: SLAM & Robustness — Capstone

SLAM fuses odometry, landmarks, and loop closures in a factor graph. A single outlier loop closure can fold the entire map because squared cost grows unboundedly. Robust costs (Huber, Cauchy, truncated LS) cap the influence of high-residual measurements. IRLS re-weights each measurement by its robust weight at each iteration, effectively down-weighting outliers to near-zero influence.

Squared: ρ(u) = u²⁄2
Huber (k=1): ρ(u) = u²⁄2 if |u|≤1; |u|−½ if |u|>1
Truncated LS: ρ(u) = min(u²⁄2, c²⁄2)
IRLS weight: w_i = ρ′(u_i) / u_i; solve (J^TWJ)δ = −J^TWr

Capstone insight: robust estimation is not about rejecting outliers by hand; it is about choosing a cost function whose influence function saturates. IRLS makes this a drop-in replacement for Gauss-Newton: same structure, with each residual's squared term scaled by its robust weight (the W in J^TWJ). The entire SLAM machinery still applies.
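To make the IRLS update concrete, here is a minimal Python sketch on the simplest possible problem: estimating one scalar from several measurements, so that J = 1 and the normal equations collapse to a weighted average of residuals. The data values are hypothetical, residuals are assumed already whitened (unit sigma), and the Huber threshold is k = 1 as in the formulas above.

```python
def huber_weight(u, k=1.0):
    """IRLS weight rho'(u)/u for Huber: 1 inside the threshold, k/|u| outside."""
    return 1.0 if abs(u) <= k else k / abs(u)

def irls_mean(measurements, iterations=20):
    """Estimate a scalar with Huber-weighted Gauss-Newton steps."""
    x = sum(measurements) / len(measurements)  # plain least-squares start
    for _ in range(iterations):
        residuals = [x - z for z in measurements]  # r_i, with Jacobian J_i = 1
        weights = [huber_weight(r) for r in residuals]
        # Scalar form of (J^T W J) delta = -J^T W r
        delta = -sum(w * r for w, r in zip(weights, residuals)) / sum(weights)
        x += delta
    return x

# Hypothetical data: four inliers near 1.0 and one gross outlier.
data = [0.9, 1.0, 1.1, 1.0, 11.0]
plain_mean = sum(data) / len(data)
robust = irls_mean(data)
print(plain_mean, round(robust, 4))
```

The plain mean is dragged to about 3.0 by the outlier, while the Huber estimate settles near 1.25: the outlier still pulls with a bounded force of 1, but it can no longer dominate the inliers.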

Exercise 9.1: Loop-Closure Error Redistribution Derive

A SLAM trajectory has 10 poses. VO accumulated 2.0m drift between pose 1 and pose 10. A loop closure is detected: pose 10 should be 0.5m from pose 1 (instead of the VO-predicted 2.5m). The correction of 2.0m is distributed evenly across all 9 inter-pose segments. How much does each segment shift (in meters)?

m/segment

Show derivation

correction per segment = 2.0 m / 9 segments ≈ 0.222 m

This is the simplest pose-graph correction: uniform distribution of loop-closure error across the path. A full back-end optimizer (g2o, GTSAM) distributes the correction proportionally to odometry uncertainty at each segment — high-uncertainty segments absorb more of the correction. But even the naive equal-distribution approach dramatically reduces drift vs. no loop closure.
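The uniform scheme can be written as a cumulative shift per pose. The sketch below is a one-dimensional Python illustration with a hypothetical function name; a real back end would optimize full SE(3) poses instead.

```python
def distribute_correction(total_correction_m: float, num_poses: int):
    """Spread a loop-closure correction evenly over the inter-pose segments.

    Returns the cumulative shift applied at each pose; the first pose stays fixed.
    """
    num_segments = num_poses - 1
    per_segment = total_correction_m / num_segments
    return [i * per_segment for i in range(num_poses)]

shifts = distribute_correction(2.0, 10)
print(round(shifts[1], 3), round(shifts[-1], 3))
```

Each pose moves by 0.222 m more than its predecessor, so the last pose absorbs the full 2.0 m.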

Exercise 9.2: Squared vs. Huber Cost at Large Residual Derive

Whitened residual u = 10 (outlier). k=1 for Huber. Compute: ρ_sq(10) = 10²⁄2 = 50. ρ_Huber(10) = 10−½ = 9.5 (since |10|>1, use |u|−½). What is the ratio ρ_sq⁄ρ_Huber? (Round to nearest integer.)

× larger

Show derivation

ρ_sq(10) = 100/2 = 50
ρ_Huber(10) = 10 − 0.5 = 9.5
ratio = 50 / 9.5 ≈ 5.3 → ≈ 5

The squared cost is ~5× larger for u=10. At u=100: ρ_sq=5000 vs. ρ_Huber=99.5, a ratio of ~50. The Huber loss only grows linearly for large residuals, preventing a single outlier from dominating the graph. From the VNAV lesson's table: at u=10, Squared=50, Huber=9.5, Cauchy≈2.3, TLS=0.5 (fully saturated).
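The four costs in that comparison can be evaluated side by side with a short Python sketch. The Cauchy form used here, ρ(u) = (c²⁄2)·ln(1 + u²⁄c²), is an assumption chosen because it reproduces the ≈2.3 figure quoted above; the function names are hypothetical.

```python
import math

def rho_squared(u):
    return 0.5 * u * u

def rho_huber(u, k=1.0):
    """Quadratic for |u| <= k, linear beyond; k(|u| - k/2) reduces to |u| - 1/2 at k=1."""
    a = abs(u)
    return 0.5 * a * a if a <= k else k * (a - 0.5 * k)

def rho_cauchy(u, c=1.0):
    """Assumed form: (c^2 / 2) * ln(1 + u^2 / c^2)."""
    return 0.5 * c * c * math.log(1.0 + (u / c) ** 2)

def rho_tls(u, c=1.0):
    return min(0.5 * u * u, 0.5 * c * c)

for u in (10.0, 100.0):
    print(u, rho_squared(u), rho_huber(u), round(rho_cauchy(u), 2), rho_tls(u))
```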

Exercise 9.3: Truncated LS — Full Saturation Derive

Truncated LS: ρ_TLS(u) = min(u²⁄2, c²⁄2) with c=1. At u=0.5 (inlier): ρ_TLS = min(0.125, 0.5) = 0.125. At u=5 (outlier): ρ_TLS = min(12.5, 0.5) = 0.5. What is the IRLS weight w = ρ′(u)/u for the outlier at u=5? (ρ′_TLS(u) = 0 when |u|>c.)

(IRLS weight)

Show derivation

For |u| > c: ρ_TLS(u) = c²⁄2 (constant)
ρ′_TLS(u) = 0 for |u| > c
w(u) = ρ′(u) / u = 0 / 5 = 0

TLS completely suppresses outliers: once the residual exceeds the threshold c, the IRLS weight is exactly zero. This is the harshest robust cost — outliers are entirely ignored. The downside: TLS has flat regions with zero gradient, making it non-convex and prone to local minima. GNC (Graduated Non-Convexity) solves this by starting with a convex surrogate and annealing toward TLS.
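The hard cutoff is easy to see in code. This Python sketch evaluates the TLS cost and IRLS weight at the two residuals from the exercise; the function names are hypothetical, and the weight of 1 inside the threshold follows from ρ′(u) = u there.

```python
def tls_cost(u, c=1.0):
    """Truncated least squares: quadratic up to c, constant beyond."""
    return min(0.5 * u * u, 0.5 * c * c)

def tls_weight(u, c=1.0):
    """IRLS weight rho'(u)/u: 1 inside the threshold, 0 beyond it."""
    return 1.0 if abs(u) <= c else 0.0

for u in (0.5, 5.0):
    print(u, tls_cost(u), tls_weight(u))
```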

Exercise 9.4: IRLS Reweight — One Step Derive

Two measurements. Measurement 1 (inlier): u₁=0.8, Huber k=1 → ρ′(u)=u=0.8 → w₁=0.8/0.8=1.0. Measurement 2 (outlier): u₂=8.0 → ρ′(u)=1 (linear branch) → w₂=1/8.0 = 0.125. The IRLS weighted least-squares step uses W=diag(w₁, w₂). After reweighting, by what factor is the inlier weighted MORE than the outlier?

× more weight

Show derivation

w₁ = 1.0; w₂ = 1/8 = 0.125
ratio = w₁ / w₂ = 1.0 / 0.125 = 8

The inlier has 8× more influence than the outlier after one IRLS step. In subsequent iterations, if the outlier residual grows further, its weight decreases further. IRLS converges when the weights stop changing, typically 5–15 iterations for well-conditioned problems. Cauchy weights decrease even faster: w(u)=1/(1+u²/c²); at u=8, c=1: w=1/65≈0.015, about 65× smaller than a unit inlier weight.
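The Huber and Cauchy weights for the two measurements can be compared with a minimal Python sketch. The function names are hypothetical; the formulas are the ones stated in this exercise.

```python
def huber_weight(u, k=1.0):
    """rho'(u)/u for Huber: 1 on the quadratic branch, k/|u| on the linear branch."""
    return 1.0 if abs(u) <= k else k / abs(u)

def cauchy_weight(u, c=1.0):
    """rho'(u)/u for Cauchy: 1 / (1 + u^2 / c^2)."""
    return 1.0 / (1.0 + (u / c) ** 2)

for u in (0.8, 8.0):
    print(u, huber_weight(u), round(cauchy_weight(u), 4))
```

Note that Cauchy also down-weights the inlier slightly (about 0.61 at u = 0.8), whereas Huber leaves every residual inside the threshold at full weight.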

Exercise 9.5: Cost Comparison — Inlier vs. Outlier Budget Derive

A factor graph has 100 odometry factors (inliers, u≈1 each) and 1 loop-closure outlier (u=10). With squared cost: total inlier cost = 100×0.5=50. Outlier cost = 50. With Huber (k=1): total inlier cost = 100×0.5=50. Outlier Huber cost = 9.5. What fraction of total robust cost does the outlier contribute? (Round to 1 decimal place.)

% of total cost

Show derivation

Inlier total (Huber): 100 × 0.5 = 50
Outlier (Huber): 9.5
Total = 59.5
Outlier fraction = 9.5 / 59.5 ≈ 0.160 = 16.0%

With squared cost: outlier fraction = 50/(50+50)=50%; the single outlier equals ALL odometry combined. With Huber: 16%. With Cauchy (c=1, u=10): ρ(10)=ln(1+100)/2≈2.3; holding the inlier total at 50 for comparison, fraction=2.3/52.3≈4.4% (Cauchy also lowers each u=1 inlier cost to ln(2)/2≈0.35, which gives 2.3/37.0≈6.2% if applied consistently). With TLS (c=1): ρ(10)=0.5; fraction=0.5/50.5≈1.0%. The robust cost hierarchy assigns less and less budget to the outlier.
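The budget comparison can be reproduced with a short Python sketch. One assumption differs from the hand calculation: the sketch applies each cost to the inliers as well as the outlier, so the Cauchy figure comes out near 6.2% rather than the 4.4% obtained by keeping the inlier total at 50. Function names and the Cauchy form (c²⁄2)·ln(1 + u²⁄c²) are assumptions.

```python
import math

def rho_squared(u):
    return 0.5 * u * u

def rho_huber(u, k=1.0):
    a = abs(u)
    return 0.5 * a * a if a <= k else k * (a - 0.5 * k)

def rho_cauchy(u, c=1.0):
    return 0.5 * c * c * math.log(1.0 + (u / c) ** 2)

def rho_tls(u, c=1.0):
    return min(0.5 * u * u, 0.5 * c * c)

def outlier_fraction(rho, num_inliers=100, u_inlier=1.0, u_outlier=10.0):
    """Share of the total cost contributed by the single outlier."""
    inlier_total = num_inliers * rho(u_inlier)
    outlier = rho(u_outlier)
    return outlier / (inlier_total + outlier)

for name, rho in [("squared", rho_squared), ("huber", rho_huber),
                  ("cauchy", rho_cauchy), ("tls", rho_tls)]:
    print(f"{name:8s} {100.0 * outlier_fraction(rho):5.1f}%")
```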
