Your robot has driven 2 km, building a map as it goes. Visual odometry accumulated 20 m of drift. A single place-recognition match (L12) just detected a loop: the robot is back where it started. That one constraint — out of thousands — should snap the entire 2 km trajectory and all landmark positions into a consistent global map. Simultaneously. In under a second. How? This is the SLAM back-end: joint MAP estimation over all poses and all landmarks, structured as a sparse factor graph, solved with sparse Cholesky decomposition. MIT 16.485 Lectures 23–24 by Luca Carlone.
A drone completes a 2 km outdoor flight collecting stereo images and IMU data. Visual odometry (L11) estimates poses one step at a time, compounding small errors into ~20 m of drift. The map of 3D landmarks is distorted by the same errors. Now place recognition (L12) fires: the current image matches a keyframe from 8 minutes ago — the drone is back at the start. That single measurement says "pose now ≈ pose then." How does this one constraint fix 2 km of trajectory and tens of thousands of landmark positions?
The answer: we never actually threw away the old poses. A SLAM back-end holds the full state — all robot poses T1, T2, ..., Tn and all landmark positions l1, ..., lK — and finds the joint MAP estimate that best explains every measurement we have ever collected. When the loop-closure measurement arrives, the optimizer re-solves for everything simultaneously. The drift redistributes evenly around the loop.
Define the joint state as θ = (T1, ..., Tn, l1, ..., lK). Every measurement — odometry step, landmark observation, and now a loop closure — contributes a residual that penalizes disagreement with θ. The SLAM back-end solves:
where E is the set of all measurements (odometry edges, landmark measurements, loop closures), rij is the residual for measurement (i,j), and Σij is its noise covariance. This is the same nonlinear least-squares framework from L9 — now applied to a graph with thousands of variables.
If we have n=1000 poses in SE(3) (6 DOF each) and K=5000 landmarks (3D each), the state vector θ has 6000+15000 = 21 000 dimensions. A dense Gauss-Newton step requires inverting a 21 000 × 21 000 matrix — about 9 × 1012 floating-point operations. At 10 GFLOPS, that is 900 seconds per iteration. Clearly the dense approach is hopeless. The saving grace: this matrix is overwhelmingly sparse. Exploiting that sparsity is what makes SLAM tractable.
Three poses, one landmark, one loop closure. The state θ = (T1, T2, T3, l1). Measurements: odometry T1→T2, odometry T2→T3, landmark observations (T1,l1) and (T3,l1), plus one loop closure T1↔T3. Imagine: without the loop closure, accumulated odometry drift puts T3 0.8 m off where it should be. With the loop closure, the optimizer pulls all three poses into consistency — each shifting by about 0.27 m — and the landmark lands in the correct global position. We will track this example through Chapters 1–7.
Variable nodes (circles = poses, squares = landmarks) and factor nodes (tiny diamonds). Hover over a factor to see what it constrains.
The Jacobian J of the full residual vector r(θ) has one block-row per measurement and one block-column per variable. Row (i,j) corresponds to measurement (i,j); its only nonzero blocks are in the columns for variables i and j. Everything else is zero. This is because the residual rij depends only on variable i and variable j — no other variable affects it.
The Hessian H = JTW J has a block at position (a,b) if and only if variables a and b share at least one common factor. In a pose graph: consecutive poses Tt and Tt+1 share odometry factors, so H has blocks at (Tt, Tt+1) and (Tt+1, Tt). A loop closure between T5 and T47 adds blocks at (T5, T47). Poses that never share a measurement have no block in H — those are exactly the zero entries.
Return to our 3-pose + 1-landmark example, 2D (pose dims: 3, landmark dims: 2). Variable ordering: [T1, T2, T3, l1] — block sizes [3,3,3,2] → total 11. Measurement set: odometry (1,2), odometry (2,3), observation (1,l1), observation (3,l1), loop closure (1,3). Count nonzero blocks in H (symmetric, so off-diagonal counts double):
• (1,1): appears in odo(1,2), obs(1,l1), lc(1,3) — 3 measurements → diagonal block nonzero
• (1,2): only odo(1,2) → nonzero
• (1,3): loop closure (1,3) → nonzero
• (1,l1): obs(1,l1) → nonzero
• (2,2): only odo(1,2) and odo(2,3)
• (2,3): odo(2,3) → nonzero
• (3,3): odo(2,3), obs(3,l1), lc(1,3)
• (3,l1): obs(3,l1) → nonzero
• (l1,l1): obs(1,l1), obs(3,l1)
Total: 9 nonzero blocks out of 16 possible (4×4). Without the loop closure: only 7 blocks. Loop closure adds 1 new off-diagonal block (T1,T3).
Filled blocks = nonzero. Toggle loop closure to see the extra block it adds. Hover a block to see which measurement creates it.
One Gauss-Newton iteration linearizes r(θ̂) ≈ J δ + r0 and solves for the step δ. Substituting into the NLLS objective and differentiating gives the normal equations:
The matrix H = JTWJ is the information matrix (also called the Hessian). It is symmetric positive definite (when the problem has a unique minimum). The right-hand side b = JTW r0 is the gradient. Both H and b are assembled by summing contributions from each measurement: H += JijT Wij Jij, b += JijT Wij rij.
Since H is sparse, symmetric, and positive definite, we solve Hδ = b using sparse Cholesky factorization: H = L LT where L is a lower-triangular matrix. Then solve L y = b (forward substitution) and LT δ = y (back substitution). The cost scales with the number of nonzeros in L, not in H — and L is only slightly denser than H.
Why not just invert H? Matrix inversion of a sparse matrix gives a dense result — you lose all the sparsity. Factorization preserves sparsity in L (almost). The fill-in in L (new nonzeros created during Cholesky that were zero in H) depends critically on variable ordering — this is Chapter 7's story.
Odometry factor (1,2): residual dim = 3, so J12 is 3×11 with nonzero 3×3 blocks at columns for T1 and T2. Call them A1 and A2 (3×3 Jacobian sub-blocks with W12 = I3 for simplicity). Contribution to H: H11 += A1TA1, H12 += A1TA2, H22 += A2TA2. Sum over all 5 measurements. For the loop-closure factor (1,3): adds to H11, H13, H33. This is the only path by which T1 and T3 become directly coupled in H — before the loop closure, they were not neighbors in the information matrix.
Drag the slider to reorder variables. Watch fill-in (red = Cholesky fill) grow with bad ordering.
Bundle adjustment (BA) is the visual-SLAM back-end when measurements are only landmark projections — no odometry. The name comes from "bundles" of light rays from the camera centers to each 3D point: we jointly adjust all camera poses and point positions until each bundle aligns with the observed image coordinates. BA is the standard back-end in Structure from Motion (SfM) systems like COLMAP.
The objective is exactly (23.7) from the notes: minimize ∑ ‖z̄k,t − π(Tt, lk)‖2Σl−1. The factor graph has only projection factors (no odometry edges between consecutive poses). In SfM, the "poses" correspond to different tourist cameras photographing the same monument — no temporal ordering, no odometry. In SLAM, odometry factors are added on top.
The sparsity pattern is strikingly different from a pure pose graph. In BA: the Hessian has a characteristic arrow pattern — a dense upper-left pose–pose block, a dense lower-right landmark–landmark block, and off-diagonal pose–landmark blocks. The landmark–landmark diagonal block is purely block-diagonal (landmarks are only coupled through shared poses, not directly). This arrow pattern is famous and is why the Schur complement trick (used in Chapter 7) is especially efficient for BA: eliminate all landmarks first (cheap, diagonal blocks), leaving a smaller dense system over poses only.
Adding odometry to BA (giving visual SLAM) fills in the pose–pose block of H along the diagonal and superdiagonal — the chain of sequential pose connections. The landmark structure remains block-diagonal. The combined sparsity is the union of the two patterns. Solving the full system is still efficient because the landmark block can still be marginalized cheaply.
python — BA factor graph assembly (toy: 3 poses, 2 landmarks) import numpy as np # Projection Jacobian blocks (2×3 each for pose, 2×3 for landmark) # J_full is (2*obs, 3*n + 3*K) = (12, 9+6) = (12, 15) n_poses, n_lm, obs_per = 3, 2, 2 # 2 obs per pose J = np.zeros((2*n_poses*obs_per, 3*n_poses + 3*n_lm)) # Fill in Jacobian blocks (synthetic random values) np.random.seed(42) for t in range(n_poses): for k in range(obs_per): row = 2*(t*obs_per + k) J[row:row+2, 3*t:3*t+3] = np.random.randn(2, 3) # pose Jacobian J[row:row+2, 9+3*k:9+3*k+3] = np.random.randn(2, 3) # lm Jacobian H = J.T @ J # 15×15 information matrix print("Nonzeros in H:", np.sum(np.abs(H) > 1e-10), "out of", 15*15) # Sparsity: H[0:9,0:9] = pose-pose (dense), H[9:,9:] = lm-lm (block-diag), # H[0:9,9:] = pose-lm cross-terms # Output: Nonzeros in H: 135 out of 225
There are two philosophies for handling the growing state in SLAM as the robot moves further.
Filtering (EKF-SLAM): maintain only the current pose and the current landmark map. When a new observation arrives, run an EKF prediction + update. Past poses are marginalized out immediately — the filter never remembers where the robot was at time t−100. The state is small (current pose + K landmarks), but the covariance of the landmark map becomes dense because every landmark update correlates all landmarks through the robot pose.
Full smoothing (iSAM, GTSAM): keep all poses T1, ..., Tn in the state. The SLAM back-end is a nonlinear least-squares problem over the entire trajectory. As n grows, so does the system — but it stays sparse, because each measurement only touches 2 variables. Loop closures are handled naturally: just add a new factor and re-solve. No linearization error accumulates from incrementally updating a filter.
Fixed-lag smoothing keeps only the last W poses in the active state (a "sliding window"). Poses older than W are marginalized out using the Schur complement, contributing a dense prior on the oldest active variables. This gives bounded computation O(W) per step at the cost of: (a) losing the ability to do loop closures to poses outside the window, and (b) gradually densifying the prior on the oldest active pose. Libraries like OKVIS and Kimera-VIO use fixed-lag smoothing for real-time performance.
| Approach | State size | Info matrix density | Loop closures | Used in |
|---|---|---|---|---|
| EKF-SLAM | ncurr + K (small, fixed) | Dense (landmark block) | No (marginalized) | Classic SLAM, small maps |
| Fixed-lag smoother | W poses + K landmarks | Moderate (dense prior) | Within window only | OKVIS, Kimera-VIO |
| Full smoother | n poses + K (grows) | Sparse | Any time | iSAM2, GTSAM, ORB-SLAM3 |
Marginalization is the mathematical operation of removing variables from a SLAM system while preserving all the information those variables contributed. It is the mechanism behind fixed-lag smoothing, the EKF update, and the Schur complement trick in BA. But marginalization has a cost: it creates fill-in — new edges between the neighbors of the removed variable, densifying the remaining graph.
Partition the information matrix into blocks for the variables we want to keep (Θa) and the variables we want to remove (Θb):
To eliminate Θb (block Gaussian elimination), multiply the second block row by Hbb−1Hba and subtract from the first. The resulting system for Θa is:
The matrix S = Haa − Hab Hbb−1 Hba is the Schur complement of Hbb. It is the information matrix of Θa after marginalizing Θb. Probabilistically: S is the precision matrix of the marginal distribution p(Θa).
Suppose we marginalize pose Told which was connected to poses TA, TB, TC via factors. Before marginalization: TA and TB have no direct edge (no factor between them). The Schur complement formula: S contributes to HAA, HAB, HBB, HBC, HAC, etc. — all pairs of Told's neighbors become directly coupled. If Told had 4 neighbors, removing it creates C(4,2) = 6 new edges. This is fill-in: the marginal prior after removing Told is a dense Gaussian over its 4 former neighbors.
Click "Marginalize T2" to remove the middle pose and see its neighbors become directly connected.
Remove T2 from the 3-pose example. Partitions: Θb = T2 (3 DOF), Θa = (T1, T3, l1) (8 DOF). For simplicity, set all W=I and use toy Jacobians. Hbb = J12[T2]TJ12[T2] + J23[T2]TJ23[T2] — a 3×3 matrix, cheap to invert. The Schur complement S = Haa − Hab Hbb−1 Hba is an 8×8 dense matrix. Originally (T1,T3) had a 3×3 block only from the loop closure. After marginalization, that block is also augmented by the Schur complement contribution −Hab[T1] Hbb−1 Hba[T3] — even without a loop closure, T1 and T3 would become coupled if T2 between them is marginalized.
This is the full SLAM story in one simulation. A robot drives in a loop, accumulating drift pose-by-pose via noisy odometry. The estimated trajectory spirals outward — no global consistency. The map of landmarks (squares) is smeared by the pose errors. Then: one loop-closure factor fires. Watch the Gauss-Newton optimizer iterate, redistributing residuals across the entire trajectory until the loop closes and the map snaps into a consistent global structure.
Use the controls to: add noise, set the number of Gauss-Newton iterations, toggle the loop closure on/off, and play the optimization step-by-step. The information matrix sparsity pattern is shown live on the right.
Left: trajectory + landmarks. Right: sparsity pattern. Red nodes = incorrect drift. Green = after loop closure optimization.
Every measurement in SLAM follows the same recipe: there is a measurement model hij(θ) that predicts what the sensor should observe given the current state θ, and there is an actual observation zij. The gap between prediction and observation — the residual — should be small if θ is correct. Under Gaussian noise with covariance Σij, maximizing the MAP posterior is equivalent to minimizing the sum of squared Mahalanobis-weighted residuals.
Odometry factor: the relative pose between consecutive frames, measured by VO (L11) or wheel encoders. Residual: rt,t+1 = log([Tt−1 Tt+1]−1 T̄t,t+1)∨ ∈ R6. This is the SE(3) log-map residual from L10: how much the current relative pose differs from the measured one, expressed in the Lie algebra.
Landmark observation factor: for a monocular camera, h(Tt, lk) = π(Tt−1 lk) is the perspective projection of landmark lk into camera frame at pose Tt. Residual: rt,k = z̄t,k − π(Tt−1 lk) ∈ R2 (pixel error).
Loop-closure factor: identical in form to the odometry factor, but connects two non-consecutive poses. This is exactly the measurement delivered by place recognition (L12) + geometric verification: T̄ij is the estimated relative pose between poses i and j. Residual: rij = log([Ti−1 Tj]−1 T̄ij)∨ ∈ R6.
Collect all residuals into a stacked vector r(θ) ∈ Rm. Stack the inverse covariances into block-diagonal W. The MAP estimator is:
This is exactly weighted NLLS from L9. We linearize at θ̂ using Jacobians, solve the normal equations (Chapter 4), and retract the update onto SE(3) (L10). The only novelty in SLAM: the Jacobian J is enormous but sparse.
Our toy example, 2D poses (3 DOF) + 2D landmark (2 DOF): θ = (T1, T2, T3, l1), dim = 3+3+3+2=11. Measurements: odo(1,2): r∈R3; odo(2,3): r∈R3; obs(1,l1): r∈R2; obs(3,l1): r∈R2; lc(1,3): r∈R3. Total residual dim = 3+3+2+2+3=13. J is 13×11, mostly zeros.
python — Gauss-Newton step for tiny pose graph import numpy as np # Toy 2D pose graph: 3 poses (x,y,θ), 1 loop closure # True poses: T1=(0,0,0), T2=(1,0,0), T3=(2,0,π) (returned to start) # Drifted estimate of T3: (2.3, 0.4, π*0.85) poses_gt = np.array([[0,0,0],[1,0,0],[2,0,np.pi]]) poses_est = np.array([[0,0,0],[1,0,0],[2.3,0.4,np.pi*0.85]]) # Loop closure measurement: T3 should be at T1 rotated 180° lc_meas = np.array([2.0, 0.0, np.pi]) # (dx,dy,dθ) from T1 to T3 # Residual of loop closure factor r_lc = poses_est[2] - (poses_est[0] + lc_meas) print(f"Loop closure residual: {r_lc}") # [0.3, 0.4, -0.047] — 0.5m positional error, 2.7° heading error # One GN step: distribute residual across poses (simplified linear case) # Equal sharing: each of the 3 poses absorbs 1/3 of the correction correction = r_lc / 3 poses_corrected = poses_est.copy() poses_corrected[0] += correction # T1 shifts a little poses_corrected[1] += correction # T2 shifts a little poses_corrected[2] -= 2*correction # T3 absorbs most correction print("After 1 GN step, T3 residual:", poses_corrected[2] - (poses_corrected[0] + lc_meas)) # [0.0, 0.0, 0.0] — perfectly satisfied after correction redistribution
| Concept | One-liner |
|---|---|
| SLAM state | θ = all poses T1:n + all landmarks l1:K |
| MAP → NLLS | Gaussian noise ⇒ maximize posterior = minimize weighted squared residuals (same as L9) |
| Factor graph | Bipartite graph: variable nodes (poses & landmarks) + factor nodes (measurements). Each factor = one term in the objective. |
| Odometry factor | Connects consecutive poses Tt, Tt+1; residual = log-map of relative pose error on SE(3) |
| Loop-closure factor | Connects non-consecutive Ti, Tj; same form as odometry; provided by place recognition (L12). Cancels accumulated drift globally. |
| Sparsity | Each factor touches 2 variables ⇒ Hessian H = JTWJ is >99% zeros for large SLAM. Block at (a,b) iff a and b share a factor. |
| Normal equations | H δ = b; H = JTWJ; b = JTWr0. Solve via sparse Cholesky, not dense inversion. |
| Variable ordering | Controls fill-in in L factor of H = LLT. Bad order ⇒ dense L. COLAMD/AMD find near-optimal orderings. |
| Bundle adjustment | SLAM with only projection factors (no odometry). Arrow-pattern Hessian ⇒ eliminate landmarks cheaply via Schur complement. |
| Full smoothing | Keep all poses; sparse; supports loop closures anytime. iSAM2/GTSAM. Better than filtering for large SLAM. |
| Marginalization | Remove variables via Schur complement. Cost: adds edges among neighbors (fill-in). Densifies prior. Price of forgetting. |
| Wrong loop closure | One bad loop closure corrupts the global map catastrophically. Robust methods (L14) defend against this. |
Wrong loop closure (from L12 false positive): the optimizer tries to satisfy an impossible geometric constraint — poses get twisted into geometrically inconsistent configurations. The whole map degrades. This is the key motivation for Chapter L14 (robust SLAM / M-estimation).
Bad variable ordering: Cholesky fill-in grows from O(n) to O(n2) with pathological orderings. Real solvers use COLAMD (Column Approximate Minimum Degree) to find good orderings automatically.
Aggressive marginalization: marginalizing too many old poses too quickly densifies the prior on the active window. The sliding-window system slows down and loses the ability to detect long-range loop closures.
A factor graph is a bipartite graph with two kinds of nodes. Variable nodes represent the unknowns we want to estimate: robot poses (circles) and landmark positions (squares). Factor nodes (small diamonds) represent measurements — each factor encodes a log-likelihood term that couples a small set of variables. An edge connects a factor to every variable it involves.
Factor graphs were introduced by Kschischang, Frey & Loeliger (2001) for belief propagation in coding theory, but Frank Dellaert's group at Georgia Tech showed in the early 2000s that the entire SLAM problem maps naturally onto them. The key papers: Dellaert & Kaess, "Square Root SAM" (IJRR 2006) and Kaess et al., "iSAM" (TRO 2008). Today's standard library is GTSAM (Georgia Tech Smoothing and Mapping).
Prior factor: pins one pose to a known value (e.g., T1 = I). Essential to remove the gauge freedom of the problem — without it, rigidly transforming the whole trajectory is a zero-cost deformation, giving infinite solutions. In GTSAM: PriorFactor<Pose3>.
Odometry factor: between consecutive poses Tt and Tt+1. The "spring" connecting the robot's timeline. In GTSAM: BetweenFactor<Pose3>. These form the backbone — a chain of factors along the trajectory.
Measurement/projection factor: between a pose Tt and a landmark lk. One factor per observation. In a 30-frame window observing 100 landmarks each seen 3 times, this generates 300 such factors. In GTSAM: GenericProjectionFactor (monocular), SmartProjectionFactor (multi-view).
Loop-closure factor: between two non-consecutive poses Ti and Tj, provided by the place-recognition pipeline (L12). Mathematically identical to an odometry factor in form; what differs is the distance in the timeline and the fact that this factor closes the loop — it ties a current pose back to an old one, creating a cycle in the graph that the optimizer uses to cancel drift. In GTSAM: BetweenFactor<Pose3> with the loop-closure measurement.
Toggle factors to grow the graph. Circles = poses, squares = landmarks, diamonds = factors.