Every number a VNAV engineer must derive cold: SE(3) composition, rotation angles from trace, camera projection, focal length in pixels, epipolar constraint, RANSAC iteration count, Gauss-Newton steps, VO drift budgets, BoW similarity, and SLAM loop-closure arithmetic. Concrete numeric questions, fully worked derivations, verifiable in-browser.
SE(3) is the group of rigid-body transforms in 3D space. Every robot pose — every camera frame — lives here. Composing transforms is matrix multiplication; updating them optimally requires the Lie algebra so(3), where small corrections live as skew-symmetric matrices before being mapped back to SO(3) via the matrix exponential.
Frame A is the world origin. TAB translates by tAB = [3, 0, 0]T with R = I. TBC translates by tBC = [0, 2, 0]T with R = I. What is the y-component of tAC in the composed transform TAC = TAB · TBC?
The SE(3) composition rule is tAC = RABtBC + tAB. Since RAB = I here, the translation from B to C is expressed unchanged in A's frame. The y-component is 2.
A rotation matrix R has trace tr(R) = 1.0. What is the rotation angle θ in degrees? (Use the formula θ = arccos((tr(R) − 1) ⁄ 2) and round to the nearest degree.)
The trace of a 3D rotation by θ about any axis is 1 + 2cos(θ). Solving: cos(θ) = (tr(R) − 1)/2 = 0/2 = 0, so θ = 90°. Sanity check: tr(I) = 3 → θ = arccos(1) = 0°. tr(R180) = −1 → θ = arccos(−1) = 180°.
Rz(90°) rotates 90° around the z-axis: it maps x→y and y→−x. Apply Rz(90°) to the point p = [1, 0, 0]T. What is the y-component of Rz(90°)p?
Rz(θ) = [[cosθ, −sinθ, 0],[sinθ, cosθ, 0],[0,0,1]]. At 90°: cos=0, sin=1. Column 1 of Rz is [0, 1, 0]T, so p=[1,0,0] maps directly to it. The y-component is 1.
A Lie algebra element φ = [0, 0, 0.5236]T (axis-angle vector; the z-component is π⁄6 ≈ 0.5236 rad). The rotation angle is ||φ||. What is the rotation angle in degrees? (Round to the nearest integer.)
The Lie algebra element φ ∈ ℝ3 encodes the rotation axis as its direction and the rotation angle as its magnitude. Here φ = [0,0,π⁄6]T, so the rotation is 30° around the z-axis. exp(φ̂) = Rz(30°).
T = [[R, t],[0,1]] with R = I and t = [5, 3, −2]T. The inverse of an SE(3) element is T−1 = [[RT, −RTt],[0,1]]. What is the x-component of the translation in T−1?
With R = I, RT = I, so −RTt = −t. The inverse translation is simply [−5,−3,2]T and x = −5. For non-identity R, the inverse rotation RT also rotates the translation vector, so order matters.
A pinhole camera collapses 3D space to 2D pixels. The forward model is p = K[R|t]P: a 3D world point P (homogeneous 4-vector) multiplied by the full projection matrix P = K[R|t] gives a 2D pixel (homogeneous 3-vector). Dividing by w recovers (u, v). Every calibration pipeline boils down to estimating K.
Camera with K = diag(800, 800, 1) and principal point (320, 240). A 3D point in camera coordinates is Pc = (0.3, 0.1, 4.0) m. What is the pixel u-coordinate?
Formula: u = fx · X⁄Z + cx
The projection u = fx(X/Z) + cx maps a camera-frame X coordinate to a pixel column. X/Z = 0.075 is the tangent angle; multiplying by fx=800 converts it from radians to pixels; adding cx=320 shifts from image center to image corner. The v-coordinate would be 800×(0.1/4.0)+240 = 260 px.
A camera sensor has 4000 pixels across a 6 mm wide sensor (sx = 4000 ⁄ 0.006 pixels/meter). The physical focal length is f = 8 mm = 0.008 m. What is fx in pixels?
Physical focal length × sensor density = focal length in pixels. A 50mm lens on a 35mm full-frame sensor (36mm wide, 8000px) would give fx = (8000/0.036)×0.05 ≈ 11,111 px. Focal length in pixels captures the combined effect of lens choice and sensor resolution.
Camera K = diag(600, 600, 1) with cx=320, cy=240. Pixel (u, v) = (440, 240). Back-project to normalized camera coordinates: xn = (u − cx) ⁄ fx. What is xn?
The back-projected ray direction in camera coordinates is [xn, yn, 1]T = [(u−cx)/fx, (v−cy)/fy, 1]T = [0.2, 0, 1]T. At depth Z=5m, the 3D point is [1.0, 0, 5]T. The ray direction is the unit vector K−1[u,v,1]T normalized.
A distorted point in normalized coords: (ud, vd) = (0.3, 0.4). The radial distance r2 = ud2 + vd2. With k1 = 0.1 and k2 = 0, the undistorted uu = ud(1 + k1r2). What is uu? (Round to 3 decimal places.)
Radial distortion moves points outward (k1>0, barrel distortion) or inward (k1<0, pincushion). Here k1=0.1 gives a 2.5% outward push at r=0.5. Real wide-angle cameras can have k1 ≈ −0.3 (strong pincushion). Calibration tools (OpenCV, Kalibr) estimate k1, k2, k3 simultaneously with fx, fy, cx, cy.
Point A at depth Z=2m projects to pixel u=380 with fx=800, cx=320 (X/Z = (380−320)/800 = 0.075, so X=0.15m). Point B is on the same ray at depth Z=6m. What is the X-coordinate of B (in meters)?
Both points project to the same pixel because X/Z = 0.075 is the same for both. The ray passes through A=(0.15, ?, 2) and B=(0.45, ?, 6) — three times farther gives three times larger X. This is the fundamental depth ambiguity: you know the ratio X/Z but not X or Z individually from a single frame.
Feature detection finds repeatable, distinctive keypoints. Harris corners use the structure tensor M — the sum of outer products of image gradients — to classify flat regions, edges, and corners by its eigenvalues. The ratio test rejects ambiguous matches, and descriptors (SIFT, ORB) encode local appearance for cross-frame comparison.
M = ∑[[Ix2, IxIy],[IxIy, Iy2]]. With Ix=5 and Iy=0 at every pixel in a 3×3 window (9 pixels): M = 9×[[25,0],[0,0]] = [[225,0],[0,0]]. The eigenvalues are 225 and 0: large in one direction (strong gradient perpendicular to the edge), zero in the other (no variation along the edge). This is a vertical edge (λ1≫λ2=0). Option C is qualitatively correct: one large eigenvalue, one zero.
A window has structure tensor M with det(M) = 1200 and tr(M) = 50. Using k = 0.04, compute R = det(M) − k · tr(M)2. What is R?
R > 0 indicates a corner. R < 0 indicates an edge (det<k·tr2). R ≈ 0 indicates a flat region. For the threshold, practitioners keep keypoints with R > α · max(R) across the image, where α ≈ 0.01.
A SIFT descriptor from image A is compared to all descriptors in image B. The nearest-neighbor distance is d1 = 0.62 and the second-nearest distance is d2 = 0.90. Lowe's ratio threshold is 0.8. Is this match accepted or rejected?
Lowe's test: accept if d1/d2 < 0.8. Here 0.689 < 0.8, so the match is accepted. The intuition: if the best match is much closer than the second-best, it's probably correct. If the ratio is close to 1 (two equally plausible matches), the correspondence is ambiguous (e.g., repeated textures) and should be rejected. Lowe showed that 0.8 keeps 95% of correct matches and rejects 90% of incorrect ones.
Structure tensor M = [[4, 2], [2, 9]]. det(M) = 4×9 − 2×2 = ?
det(M) = λ1λ2 and tr(M) = λ1+λ2. With det=32>0 and tr=13, both eigenvalues are positive — this is a corner region. The Harris response is R = 32 − 0.04×132 = 32 − 6.76 = 25.24 > 0, confirming a corner.
FAST-9 requires 9 consecutive pixels in the 16-pixel Bresenham circle all brighter or all darker than the center by a threshold. A 9-pixel arc is a little over half the circle, making it geometrically impossible for an edge (which would have brighter pixels on one side but at most 8 consecutive bright pixels). FAST-12 (12 pixels) is more conservative; ORB uses FAST-9 as its detector.
Two calibrated views of the same scene are linked by the essential matrix E = [t]×R. Every correct correspondence (y1, y2) satisfies the epipolar constraint y2TEy1 = 0. Triangulation recovers 3D depth, but depth uncertainty grows as Z2⁄baseline.
t = [1, 0, 0]T, R = I. Then E = [t]×R = [t]× = [[0,0,0],[0,0,−1],[0,1,0]]. A candidate match: ỹ1 = [0, 0, 1]T, ỹ2 = [−0.2, 0, 1]T. Compute ỹ2TEỹ1. What is the result?
The epipolar constraint is satisfied: this match is geometrically consistent with the given camera motion. A horizontal baseline (t along x) produces horizontal epipolar lines — y2 must have the same v as y1 (both have normalized v=0 here).
Two cameras: camera 1 at origin, camera 2 at t=[b, 0, 0]T with b=0.12m (12cm baseline), R=I. A 3D point is at X=0 (centred), Y=0, Z. In camera 1 normalized coords: y1 = [0, 0, 1]T. In camera 2: y2 = [−0.04, 0, 1]T. The horizontal disparity is −b/Z → Z = b/|disparity| = 0.12/0.04. What is Z?
With a pure horizontal baseline and no rotation, the depth Z = baseline / disparity (in normalized coords). In pixel coords: Z = f · b / (u1 − u2). This is the stereo depth formula. Doubling the baseline halves depth error; doubling the depth quadruples it.
Stereo depth error: σZ = (Z2 / (f · b)) · σpx. Given f=600px, b=0.1m, σpx=1 pixel. At Z=3m: σZ = (32/(600×0.1))×1. What is σZ in meters?
At Z=6m (double): σZ = 36/60 = 0.6m — quadrupled, as expected from the Z2 scaling. At Z=10m with the same setup: σZ = 100/60 ≈ 1.67m, which is why stereo SLAM loses depth accuracy fast at range. LiDAR has roughly linear depth error, giving it a huge advantage beyond ~10m.
Relative pose has 6 DOF (3 rotation + 3 translation). But only the direction of translation is observable from image correspondences (multiply all 3D points by λ and t by λ → same images). So translation has 2 DOF (unit sphere). Total: 3 + 2 = 5 DOF. Each correspondence provides one scalar constraint (y2TEy1=0). Five constraints = 5 equations for 5 unknowns. The 8-point algorithm uses 8 correspondences and fits a general 3×3 rank-2 matrix (8 DOF before enforcing rank-2), which is over-parameterized but simpler to implement.
A stereo pair with f=500px and baseline b=0.08m observes a point at depth Z=4m. Disparity d = f·b⁄Z. Now the baseline is doubled to 0.16m. By what factor does the disparity change?
Disparity is proportional to baseline: doubling b doubles d. Larger disparity means the depth estimate is more precise (same sub-pixel error over a larger signal). The ZED stereo camera (12cm baseline) and the Intel D435 (5cm baseline) were designed with this tradeoff in mind.
RANSAC tolerates high outlier rates by repeatedly sampling a minimal subset, fitting a model, and counting inliers. The required iteration count N guarantees that at least one all-inlier sample is drawn with probability p, given inlier fraction w and minimal sample size s.
You fit a 2D line (s=2). Inlier fraction w=0.7. Desired confidence p=0.99. Use N = log(1−p) / log(1−ws). Compute N (round up to integer). Note: log(0.01)/log(1−0.49) ≈ log(0.01)/log(0.51). Use ln(0.01)≈−4.605 and ln(0.51)≈−0.673.
With 70% inliers and s=2, you only need 7 iterations to have 99% confidence of hitting an all-inlier pair. The simplicity of the line fit (small s) compensates for uncertainty. Compare to the homography (s=4): w=0.7, N = ln(0.01)/ln(1−0.74) = −4.605/ln(0.7599) ≈ −4.605/−0.2744 ≈ 17 iterations.
8-point essential matrix (s=8). w=0.5 (50% outliers). p=0.99. Compute N. Use w8=0.58=0.00391, so 1−ws=0.99609. ln(0.01)/ln(0.99609). Use ln(0.99609)≈−0.003918.
Compare to the 5-point algorithm at w=0.5: w5=0.03125, 1−ws=0.96875, N = ln(0.01)/ln(0.96875) = −4.605/−0.03175 ≈ 145 iterations. The 5-point solver reduces iterations by 8× at 50% inliers. With 60% outliers (w=0.4): 8-point needs N≈17,000 iterations — effectively unusable. 5-point needs ≈600.
A RANSAC iteration proposes a homography H. You test all N=120 correspondences: residuals below ε=2px count as inliers. You find 78 inliers. What is the estimated inlier fraction w (as a percentage, rounded to integer)?
This inlier fraction can be used to dynamically update N during the RANSAC loop. Adaptive RANSAC starts with N=∞ and, after each iteration, updates N = log(1−p)/log(1−ws) using the current best w. If w improves, N shrinks. PROSAC (Progressive Sample Consensus) additionally sorts correspondences by quality to try high-quality pairs first.
A 3×3 homography matrix H has 9 entries but is defined up to scale (multiply H by any λ ≠ 0 and it represents the same mapping). So it has 8 DOF. Each correspondence (x',y') ↔ (x,y) contributes 2 scalar equations (one for x', one for y' after the homogeneous divide). Four correspondences give 8 equations for 8 unknowns: exactly determined. With 3 correspondences you have 6 equations for 8 unknowns: a 2-parameter family of solutions.
Homography RANSAC, s=4, w=0.6. At p=0.99: N99=ln(0.01)/ln(1−0.64). 0.64=0.1296; ln(0.8704)≈−0.1386; N99≈ln(0.01)/(−0.1386)=33.2 → 34. What is N at p=0.95? (ln(0.05)≈−2.996)
Going from 95% to 99% confidence requires 34 vs. 22 iterations — only 55% more work for a significantly stronger guarantee. In practice, 500–2000 iterations are used for robustness; the formula gives the theoretical minimum. Note that increasing w from 0.6 to 0.8 would reduce N99 from 34 to just 7 iterations.
Camera bundle adjustment, SLAM graph optimization, and IMU preintegration all reduce to NLS: minimize ∑||ri(x)||2. The Gauss-Newton step linearizes each residual with its Jacobian and solves the resulting linear system. Levenberg-Marquardt blends Gauss-Newton with gradient descent via a damping parameter λ.
Two data points: (x=1, y=2) and (x=3, y=4). Fit y = ax. Model matrix A = [[1],[3]], b = [[2],[4]]. Solve (ATA)â = ATb. ATA = 12+32 = 10; ATb = 1×2+3×4 = 14. What is â = ATb / ATA?
The residuals are r = b − Aâ = [2−1.4, 4−4.2]T = [0.6, −0.2]T. Cost = 0.36 + 0.04 = 0.40. Verify: d(cost)/da = 2AT(Aa−b) = 2(10×1.4−14) = 0 ✓. The data doesn't pass through the origin perfectly (truth would be a=1 for y=x), so the least-squares fit balances both residuals.
Scalar problem: minimize f(x) = r(x)2 where r(x) = x2 − 3. Current estimate: x̄ = 2. r(x̄) = 4−3 = 1. J = dr/dx = 2x̄ = 4. Gauss-Newton step: δ = −(JTJ)−1JTr = −r/J. What is the new estimate x̄ + δ?
The true solution is x* = √3 ≈ 1.732. After one step we're at 1.75 — already very close. A second GN step: r(1.75) = 1.752−3 = 0.0625; J=3.5; δ=−0.0179; x=1.732 — converged. GN converges quadratically near the solution for well-conditioned problems.
Scalar problem: JTJ = 4, JTr = −2 (so GN step δGN = −JTr / JTJ = 2/4 = 0.5). Now apply LM with λ=4: (JTJ + λ)δ = −JTr = 2. What is δLM?
Damping λ=4 (equal to JTJ) halves the step from 0.5 to 0.25. As λ→∞, δ→0 (infinitesimal gradient-descent steps). As λ→0, δ→0.5 (pure Gauss-Newton). Typical LM implementations start with λ=10−4·max(diag(JTJ)) and multiply by 10 on failure, divide by 10 on success.
Before GN step: cost f = ∑r2 = 9 (r1=2, r2=−2, r3=1 → cost=4+4+1=9). After one Gauss-Newton step with J = [[2,0],[−1,1],[0,1]], JTJ = [[5,−1],[−1,2]], JTr = [[2],[−1]], the step δ satisfies (JTJ)δ=−JTr = [[−2],[1]]. det(JTJ)=10−1=9; δ1 = (−2×2−1×(−1))⁄9 = (−4+1)⁄9 = −3⁄9 = −1⁄3. What is δ2? (Use Cramer's rule: δ2=(5×1−(−1)×(−2))⁄9)
Cramer's rule: for [[a,b],[c,d]]δ=v, δ1=(dv1−bv2)/det, δ2=(av2−cv1)/det. With v=[[−2],[1]] and det=9: δ2=(5×1−(−1)×(−2))/9=3/9=1/3. In practice, Cholesky factorization solves JTJδ=−JTr in O(n3/3) for dense systems or sparse Cholesky for large SLAM graphs.
Information = 1/σ2. Ratio = (1/0.0001) / (1/1) = 1/0.0001 = 10,000. A sensor with 100× better precision gets 10,000× more weight in the cost function. This means a single good measurement can dominate thousands of poor ones. In SLAM this matters: a well-calibrated camera (sub-pixel, σ≈1px) should heavily outweigh a poor depth prior.
Naive Gauss-Newton on SE(3) adds a δ vector to R and t, but R + δR is not a rotation. The fix: parameterize the update in the Lie algebra so(3), apply a retraction R ← R·exp(δ̂), and define boxminus on the manifold to compute differences. This keeps all iterates on SO(3) throughout optimization.
(I+δ)T(I+δ) = I + δ + δT + δTδ. Since δ is skew-symmetric: δ + δT = 0. So (I+δ)T(I+δ) = I + δTδ. With ε=0.1: δTδ has entries of order ε2=0.01. So the result is I + O(ε2) ≠ I. For large enough ε this is a significant violation. The exponential map avoids this: exp(δ)Texp(δ) = I exactly.
R1 = Rz(30°) and R2 = Rz(50°). The boxminus R1 ⊟ R2 = log(R2TR1) = log(Rz(−50°)Rz(30°)) = log(Rz(−20°)). What is ||R1 ⊟ R2|| in degrees?
Boxminus gives the "angular distance" between two rotations: the rotation that takes R2 to R1. This is the correct Riemannian geodesic distance on SO(3). The Euclidean distance ||R1−R2||F is NOT a good rotation distance — it depends on the embedding in ℝ9 rather than the manifold geometry.
Only R·exp(δ̂) is guaranteed to be in SO(3) for any δ. exp(δ̂) ∈ SO(3) because δ̂ is skew-symmetric (the matrix exponential of a skew-symmetric matrix is orthogonal). The product of two SO(3) elements is in SO(3) (group closure). Options A and C break orthogonality for finite δ. Option D (Gram-Schmidt re-orthogonalization) works but is not the natural manifold retraction.
SE(3) is a 6-dimensional manifold (3 for rotation, 3 for translation). Its Lie algebra se(3) has tangent vectors ξ = [ρ, φ]T ∈ ℝ6. How many scalar parameters does a single SE(3) Gauss-Newton update δ have?
Each camera pose update δ ∈ ℝ6: three components for the rotation update (φ, applied via exp(φ̂)) and three for the translation update (ρ). In a factor graph with N camera poses and M 3D points, the full update vector δ has 6N + 3M components. A typical SLAM graph with 100 keyframes and 1000 landmarks has a 1500-dimensional update vector solved at each LM iteration.
Rodrigues formula: exp(φ̂) = I + sin(θ)/θ · φ̂ + (1−cos(θ))/θ2 · φ̂2, where φ = [0, 0, π]T (180° around z). Then sin(π)=0 and (1−cos(π))/π2 = 2/π2. φ̂2 = [[−1,0,0],[0,−1,0],[0,0,0]] (skew of z-unit).
Result: R = I + 2/π2×π2×diag(−1,−1,0) = I + diag(−2,−2,0). What is R[0][0] (top-left entry)?
R[0][0] = −1. This is Rz(180°): x→−x, y→−y, z→z. Sanity check: tr(R) = −1+−1+1 = −1, and arccos((−1−1)/2)=arccos(−1)=180° ✓. At 180° the formula is well-defined (sin=0 term vanishes cleanly).
Visual odometry (VO) accumulates per-frame pose increments. Each 2% relative error compounds: after 50 steps, total drift can exceed 100% of the path length. VIO fuses camera and IMU: the IMU constrains short-term motion; the camera corrects long-term drift. Without loop closure, both still drift — but VIO drifts much slower.
A VO system has 0.5% relative translation error per step. Each step moves 0.3m. After 200 steps, what is the accumulated position drift in meters? (Assume errors add independently: drift ≈ N × ε × dstep.)
200 steps × 0.3m = 60m total path. Drift = 0.5% × 60m = 0.30m. This is 0.5% of path length, consistent with the per-step error. State-of-the-art VO systems like VINS-Mono achieve ~0.5% drift over hundreds of meters. Loop closure can reduce accumulated drift to zero at the cost of a global optimization.
Monocular SfM recovers structure and motion up to a global scale. If the camera moves 1m or 1km, the image observations are identical after scaling all 3D points and the translation by the same factor. Scale can be set by: (a) a stereo frame (known baseline), (b) IMU integration (metric acceleration), (c) a known-size object, or (d) GPS. Rotation is scale-independent and fully recoverable from pure image correspondences.
An accelerometer has a bias b = 0.01 m/s2 (typical MEMS). The robot dead-reckons for t = 10 seconds with no camera correction. Position error ≈ ½ · b · t2. What is the position drift in meters?
0.5m of drift in 10 seconds is typical for consumer-grade IMU dead-reckoning. At t=30s: drift = ½×0.01×900 = 4.5m. This is why VIO fuses camera and IMU: the camera corrects IMU drift every few milliseconds, keeping overall drift manageable. High-end IMUs (tactical grade) have b≈0.0001 m/s2, giving only 0.005m drift in 10s.
Two motivations: (1) computational efficiency — the bundle adjustment graph grows linearly with keyframe count; selecting keyframes with sufficient baseline keeps the graph manageable. (2) geometric quality — two near-identical views produce near-zero baseline, making triangulation numerically ill-conditioned (depth uncertainty diverges). The 0.5m threshold ensures a meaningful baseline for each new keyframe. ORB-SLAM3 uses a similar strategy, marginalizing redundant keyframes to bound graph size.
VO system: 1.5% drift per meter. VIO system: 0.3% drift per meter. Both traverse a 500m corridor. What is the absolute position error (in meters) of each system at the end?
VIO reduces drift by 5× over VO alone. At 500m the difference is 6m — enough to miss a doorway or hit a wall. State-of-the-art VIO systems (VINS-Fusion, Kimera) achieve ~0.1–0.5% drift over hundreds of meters in indoor environments. Outdoor long-range navigation still requires GNSS or loop closures.
Place recognition matches a query image to a database of past frames. Bag-of-Visual-Words (BoW) maps each local descriptor to its nearest cluster center (visual word), building a histogram over K words. TF-IDF reweights the histogram to downweight ubiquitous words and amplify distinctive ones. Cosine similarity ranks candidates; geometric RANSAC verification confirms the loop closure.
Database: N=1000 images. Visual word k appears in nk=10 images. Compute idfk = log(N⁄nk) (natural log). Use ln(100) ≈ 4.605.
A word appearing in 1% of images has IDF ≈ 4.6. A word appearing in 50% of images: idf = ln(2) ≈ 0.69. A word appearing in every image: idf = ln(1) = 0 (completely suppressed). The IDF is the primary discriminability signal: rare visual words distinguish places far better than common ones (which appear everywhere from generic textures like sky or asphalt).
Query image has 500 features total. Visual word k = "red octagon" appears 25 times in this image (tfk = 25/500 = 0.05). The database IDF for word k is 4.605 (from Ex 8.1). What is the TF-IDF weight wk = tfk × idfk?
This visual word contributes weight 0.230 to the image's TF-IDF vector. A common word appearing 50 times but with idf=0.69 would only contribute 0.050×0.69=0.069 — 3.3× less influential despite being twice as frequent. TF-IDF correctly prioritizes distinctive rare words over common ones.
Query wq = [0.3, 0.4] (2D example, already normalized: ||wq||=0.5). Database image wd = [0.4, 0.3] (||wd||=0.5). Cosine similarity = (wq·wd) / (||wq||·||wd||). What is the cosine similarity?
Cosine similarity 0.96 is very high — the two images share most of their visual content. A threshold of 0.7–0.8 is typically used to identify loop-closure candidates. Identical images have similarity 1.0; completely different scenes have similarity ≈ 0 (orthogonal BoW histograms in high-dimensional vocabulary space).
Candidates 1 and 2 pass geometric verification (42 and 38 > 20 threshold). Candidates 3-5 are rejected as false positives from BoW (8, 5, 3 inliers — these are coincidental visual similarity, not actual loop closures). In practice, systems often require temporal consistency too: the loop-closure candidate must be consistent across multiple consecutive frames, reducing false positives further.
A vocabulary tree with branching factor b=10 and depth d=6 levels. Total number of leaf nodes (visual words) = bd. What is the vocabulary size?
DBoW2 (ORB-SLAM2) uses exactly these parameters: 10-ary tree with 6 levels = 106 words. Quantizing a descriptor takes only 6 comparisons (one per level), not 106. The tree structure gives O(b·d) = O(60) assignment time vs. O(106) for flat k-means. This is why 1M-word vocabularies are practical in real-time SLAM.
SLAM fuses odometry, landmarks, and loop closures in a factor graph. A single outlier loop closure can fold the entire map because squared cost grows unboundedly. Robust costs (Huber, Cauchy, truncated LS) cap the influence of high-residual measurements. IRLS re-weights each measurement by its robust weight at each iteration, effectively down-weighting outliers to near-zero influence.
A SLAM trajectory has 10 poses. VO accumulated 2.0m drift between pose 1 and pose 10. A loop closure is detected: pose 10 should be 0.5m from pose 1 (instead of the VO-predicted 2.5m). The correction of 2.0m is distributed evenly across all 9 inter-pose segments. How much does each segment shift (in meters)?
This is the simplest pose-graph correction: uniform distribution of loop-closure error across the path. A full back-end optimizer (g2o, GTSAM) distributes the correction proportionally to odometry uncertainty at each segment — high-uncertainty segments absorb more of the correction. But even the naive equal-distribution approach dramatically reduces drift vs. no loop closure.
Whitened residual u = 10 (outlier). k=1 for Huber. Compute: ρsq(10) = 102⁄2 = 50. ρHuber(10) = 10−½ = 9.5 (since |10|>1, use |u|−½). What is the ratio ρsq⁄ρHuber? (Round to nearest integer.)
The squared cost is ~5× larger for u=10. At u=100: ρsq=5000 vs. ρHuber=99.5, a ratio of ~50. The Huber loss only grows linearly for large residuals, preventing a single outlier from dominating the graph. From the VNAV lesson's table: at u=10, Squared=50, Huber=9.5, Cauchy≈2.3, TLS=0.5 (fully saturated).
Truncated LS: ρTLS(u) = min(u2⁄2, c2⁄2) with c=1. At u=0.5 (inlier): ρTLS = min(0.125, 0.5) = 0.125. At u=5 (outlier): ρTLS = min(12.5, 0.5) = 0.5. What is the IRLS weight w = ρ′(u)/u for the outlier at u=5? (ρ′TLS(u) = 0 when |u|>c.)
TLS completely suppresses outliers: once the residual exceeds the threshold c, the IRLS weight is exactly zero. This is the harshest robust cost — outliers are entirely ignored. The downside: TLS has flat regions with zero gradient, making it non-convex and prone to local minima. GNC (Graduated Non-Convexity) solves this by starting with a convex surrogate and annealing toward TLS.
Two measurements. Measurement 1 (inlier): u1=0.8, Huber k=1 → ρ′(u)=u=0.8 → w1=0.8/0.8=1.0. Measurement 2 (outlier): u2=8.0 → ρ′(u)=1 (linear branch) → w2=1/8.0 = 0.125. The IRLS weighted least-squares step uses W=diag(w1, w2). After reweighting, by what factor is the inlier weighted MORE than the outlier?
The inlier has 8× more influence than the outlier after one IRLS step. In subsequent iterations, if the outlier residual grows further, its weight decreases further. IRLS converges when the weights stop changing — typically 5–15 iterations for well-conditioned problems. Cauchy weights decrease even faster: w(u)=1/(1+u2/c2); at u=8, c=1: w=1/65≈0.015 — 67× smaller than the inlier.
A factor graph has 100 odometry factors (inliers, u≈1 each) and 1 loop-closure outlier (u=10). With squared cost: total inlier cost = 100×0.5=50. Outlier cost = 50. With Huber (k=1): total inlier cost = 100×0.5=50. Outlier Huber cost = 9.5. What fraction of total robust cost does the outlier contribute? (Round to 1 decimal place.)
With squared cost: outlier fraction = 50/(50+50)=50% — the single outlier equals ALL odometry combined. With Huber: 16%. With Cauchy (c=1, u=10): ρ(10)=ln(1+100)/2≈2.3; fraction=2.3/52.3≈4.4%. With TLS (c=1): ρ(10)=0.5; fraction=0.5/50.5≈1.0%. The robust cost hierarchy correctly assigns less and less budget to the outlier.