Your IMU gives you acceleration readings corrupted by Gaussian noise. Your camera gives you pixel coordinates that are a nonlinear function of a 3D point's position. You want the 3D position (or the camera pose, or the calibration parameters) that best explains all these noisy measurements. This is estimation — and when the measurement model is nonlinear, you need an iterative algorithm to solve it. This lesson derives everything from first principles: why squared error is not arbitrary (it's maximum likelihood under Gaussian noise), how to solve the linear case with the normal equations, and how Gauss-Newton and Levenberg-Marquardt tackle the nonlinear case. MIT 16.485 by Luca Carlone, Lectures 16–18.
You're calibrating a camera. You print a checkerboard and photograph it 30 times. For each photo you detect the corners in pixel coordinates. You know where those corners sit in 3D (it's a rigid board). You want the camera intrinsics — focal length f, principal point (cx, cy), distortion coefficients. There are maybe 8 unknown parameters. You have thousands of pixel measurements. How do you find the best parameters?
The measurement model says: given parameters x and a 3D point Pi, the predicted pixel location is hi(x). Your measured pixel is z̃i. The residual is the gap: ri(x) = z̃i − hi(x). If the parameters are correct and the noise is small, every residual is small. If the parameters are wrong, some residuals will be large.
The question is: what objective function do we minimize over x to make all residuals small simultaneously? The answer is not arbitrary — it comes directly from probability theory, specifically from asking: what assignment of x makes the observed measurements most likely? That is maximum likelihood estimation (MLE).
Every major VNAV computation reduces to this problem. Triangulating a 3D point from pixel observations (L7) is a linear least-squares problem. Recovering camera pose from 2D-3D correspondences (PnP) is a nonlinear least-squares problem. VIO — fusing IMU and camera to get a trajectory — is a large nonlinear least-squares problem over poses and velocities. Bundle adjustment in SLAM is a massive sparse nonlinear least-squares problem. The algorithm you'll learn here is the engine behind all of them.
RANSAC (L8) delivered you a set of inlier correspondences and a rough initial pose estimate. Now estimation takes over: it refines that rough estimate into the best possible answer given the inlier data. The pipeline is: RANSAC finds the right basin → the estimator finds the bottom of that basin.
Click anywhere on the canvas to add a point. The orange line is the least-squares fit; the teal band shows residuals (vertical gaps). Cost = sum of squared residuals, shown top-right. Drag the noise slider to see how measurement noise spreads the cost surface.
Suppose measurement zi is drawn from a Gaussian: zi = hi(x) + εi, where εi ~ N(0, Σi). The probability density of observing zi given x is:
where ri(x) = zi − hi(x) is the residual. Since all N measurements are independent, the joint likelihood of the full dataset Z = {z1, …, zN} is the product of N Gaussians.
Maximum likelihood says: find x that maximizes p(Z | x). Taking the log (which preserves the maximum) and negating (to turn a maximization into a minimization) gives:
This is weighted least squares. The weight matrix is Ωi = Σi−1, called the information matrix. A measurement with small noise (small Σi) gets a large weight Ωi — it is trusted more. A noisy measurement gets a small weight — it pulls the estimate less.
Let's make it concrete. You have a robot wheel that should travel at speed x (unknown). Your N encoder readings are z1, …, zN, each = x + noise. Residual ri = zi − x. The LS cost is:
Setting df/dx = 0: −2 ∑(zi − x) = 0 ⇒ x̂ = (1/N) ∑ zi. The maximum-likelihood estimate of x under Gaussian noise is the sample mean. This is obvious in retrospect — but the derivation shows it's not a coincidence: it emerges from MLE, not from intuition.
When the measurement model hi(x) is linear in x — meaning hi(x) = Ai x for some matrix Ai — the residual ri(x) = zi − Ai x is also linear. The cost function becomes a quadratic bowl in x: exactly one global minimum, reachable in one step.
Stack all the equations. Let A be the (m × n) matrix formed by stacking all Ai vertically, and b be the vector formed by stacking all zi. The full cost is:
where we use the shorthand b = z (the stacked measurements). This is linear least squares (LLS): minimize the squared length of the residual vector Ax − b.
You have 4 data points: (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8). You want to fit a line y = ax + b (two unknowns: a and b). Each point gives one equation: a·xi + b = yi. Stacking all four:
A is 4×2, b is 4×1, x is 2×1. We have 4 equations and 2 unknowns — an overdetermined system. No exact solution exists (the 4 points aren't perfectly collinear). Least squares finds the x that minimizes the squared error.
python import numpy as np # Data points x_data = np.array([1, 2, 3, 4]) y_data = np.array([2.1, 3.9, 6.2, 7.8]) # Build the A matrix (design matrix) A = np.column_stack([x_data, np.ones_like(x_data)]) b = y_data # Solve via numpy least-squares (uses QR internally) x_hat, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None) a_hat, b_hat = x_hat print(f"Slope: {a_hat:.4f}, Intercept: {b_hat:.4f}") # Slope: 1.9200, Intercept: 0.1500 # The residuals r = b - A @ x_hat print(f"Residuals: {r}") # [-0.03 0.07 -0.02 -0.02] — all small, no trend
The cost f(x) = ‖Ax − b‖2 is a quadratic bowl. Its minimum is where the gradient is zero. Let's derive it explicitly.
Taking the gradient with respect to x and setting it to zero:
This gives the normal equations:
And the solution:
The matrix (ATA)−1AT is called the Moore-Penrose pseudoinverse of A. It exists and gives a unique solution whenever ATA is invertible — equivalently, whenever A has full column rank (no redundant or unobservable parameters).
Solving (ATA) x = ATb:
det(ATA) = 30×4 − 10×10 = 20. Inverse = (1/20)[[4,−10],[−10,30]]. Therefore:
So the fitted line is y = 1.94x + 0.15. Let's verify: at x=1, predicted = 2.09 (measured 2.1). At x=4, predicted = 7.91 (measured 7.8). Residuals are all sub-0.1 — the fit is good.
Points are fixed on a true line y = 2x + 1 with noise σ. Watch ATA, ATb, and the solution x̂ update as noise changes. The condition number of ATA (ratio of largest to smallest eigenvalue) indicates how well-conditioned the system is.
Camera calibration's measurement model hi(x) is not linear. The predicted pixel location of a 3D point involves perspective division (a division by the z-coordinate), plus a nonlinear distortion polynomial. Triangulation (L7) involves cross-products and SVD. These are nonlinear functions of the parameters x. So the direct normal-equation approach doesn't apply.
The key idea: around any specific parameter estimate x̄ (our current guess), any smooth nonlinear function can be approximated by a linear one — its first-order Taylor expansion. For residual ri(x):
where δ is a small perturbation and Ji(x̄) is the Jacobian of ri evaluated at x̄. The Jacobian is a matrix of partial derivatives:
If ri is a scalar and x has n components, Ji is a 1×n row vector. If ri has mi components (e.g., a 2D pixel error), Ji is mi×n. Each entry Ji,jk = ∂ri,j/∂xk asks: "if I increase parameter k by a tiny amount, how much does component j of residual i change?"
Model: h(t; a, b) = a · exp(b · t). Two unknowns: x = [a, b]T. Residual for one point (ti, yi): ri(a,b) = yi − a exp(bti).
The Jacobian row for measurement i:
At current estimate ā = 3.0, b̄ = −0.5, and t1 = 1.0: J1 = [−exp(−0.5), −3·1·exp(−0.5)] = [−0.607, −1.819]. This says: if you increase a by 0.001, r1 decreases by 0.000607. If you increase b by 0.001, r1 decreases by 0.001819 — b is more sensitive to this measurement than a.
Now we can combine the two ideas: linearize the residual around the current estimate, then take one linear least-squares step. That's Gauss-Newton.
At current estimate x̄, substitute the Taylor approximation ri(x̄ + δ) ≈ ri(x̄) + Jiδ into the cost:
Stack all Ji into J (an m×n matrix) and all ri(x̄) into r(x̄) (an m-vector). This becomes:
This is a linear least-squares problem in δ! Applying the normal equations:
Update: x̄ ← x̄ + δ*. Then recompute J and r at the new x̄, and repeat. This is the Gauss-Newton algorithm.
Newton's method on f(x) would use the exact Hessian: ∇2f = 2(∑i JiTJi + ∑i ri∇2ri). Gauss-Newton drops the second term ∑i ri∇2ri. This approximation is valid when (a) residuals ri are small near the solution (good fit), and (b) the residuals are nearly linear (small second derivatives). Both hold near a good solution. The payoff: computing second derivatives (∇2ri) is expensive and fragile; dropping it costs almost nothing when the approximation is valid, and gives us a positive-definite JTJ (always invertible if J has full rank).
True model: y = 3 exp(−0.5 t). Noisy observations at t = 1,2,3,4: y = [1.9, 1.2, 0.7, 0.4]. Initial guess: ā = 2, b̄ = −0.3. Predicted: h = [2e−0.3, 2e−0.6, 2e−0.9, 2e−1.2] = [1.481, 1.098, 0.814, 0.603]. Residuals: r = [0.419, 0.102, −0.114, −0.203].
Jacobians (row i = [−exp(b̄ti), −ātiexp(b̄ti)]): J = [[−0.741, −1.481], [−0.549, −2.193], [−0.407, −2.442], [−0.301, −2.411]].
JTJ ≈ [[0.960, 4.899], [4.899, 26.24]]. JTr ≈ [−0.223, −1.096]. Solving: δ ≈ [0.68, −0.095]. New estimate: ā = 2.68, b̄ = −0.395. After just one step, we're moving toward the truth (a=3, b=−0.5). Each subsequent step brings us closer.
python import numpy as np def gauss_newton(t, y, x0, n_iter=20, tol=1e-8): """Fit y = a*exp(b*t) by Gauss-Newton.""" x = x0.copy() history = [x.copy()] for _ in range(n_iter): a, b = x h = a * np.exp(b * t) # predicted values r = y - h # residuals, shape (N,) # Jacobian: dr/d[a,b] = [-exp(bt), -a*t*exp(bt)] J = np.column_stack([ -np.exp(b * t), -a * t * np.exp(b * t) ]) # shape (N, 2) # Gauss-Newton step: (J^T J) delta = -J^T r JtJ = J.T @ J # (2,2) approximate Hessian Jtr = J.T @ r # (2,) gradient direction delta = np.linalg.solve(JtJ, -Jtr) x = x + delta history.append(x.copy()) if np.linalg.norm(delta) < tol: break return x, history t = np.array([1.0, 2.0, 3.0, 4.0]) y = np.array([1.9, 1.2, 0.7, 0.4]) x_hat, hist = gauss_newton(t, y, x0=np.array([2.0, -0.3])) print(f"a={x_hat[0]:.4f}, b={x_hat[1]:.4f}") print(f"Converged in {len(hist)-1} iterations")
Gauss-Newton can diverge. Far from the solution, the linearization is poor — the actual residual surface curves away from the tangent plane, and the GN step overshoots. Levenberg (1944) and Marquardt (1963) independently proposed the same fix: add a damping term to the normal equations.
The Levenberg-Marquardt (LM) step solves:
The only difference from Gauss-Newton is the λI term added to JTJ. Here λ > 0 is the damping parameter.
Think of λ as a trust knob. When λ is very small (λ → 0), the step reduces to Gauss-Newton — we trust the linear approximation completely. When λ is very large (λ → ∞), the diagonal dominates: (JTJ + λI)δ ≈ λδ = −JTr, so δ ≈ −(1/λ)JTr. This is gradient descent with step size 1/λ — small, cautious steps. LM interpolates between the two extremes.
After computing δ*: if the actual cost f(x̄ + δ*) < f(x̄) (the step actually helped), accept the update x̄ ← x̄ + δ* and decrease λ by a factor (e.g., λ ← λ/2 or λ ← λ/10). If f(x̄ + δ*) ≥ f(x̄) (step made things worse), reject the update and increase λ (e.g., λ ← λ×10). Start with λ = 10−3 · max(diag(JTJ)).
The contour plot shows the cost f(a,b) for fitting y = a·exp(b·t) to 5 noisy points. The orange dot is the current estimate; the arrow shows the GN step direction. The linearization is valid near the minimum but can send the estimate off into high-cost regions from far away. Watch what happens when you start from the red "bad init" position.
python def levenberg_marquardt(t, y, x0, n_iter=50, lam0=1e-3, tol=1e-9): x = x0.copy() lam = lam0 cost_history = [] for _ in range(n_iter): a, b = x h = a * np.exp(b * t) r = y - h J = np.column_stack([-np.exp(b*t), -a*t*np.exp(b*t)]) JtJ = J.T @ J Jtr = J.T @ r cost = float(r @ r) cost_history.append(cost) # LM step: add damping delta = np.linalg.solve(JtJ + lam * np.eye(2), -Jtr) x_new = x + delta h_new = x_new[0] * np.exp(x_new[1] * t) r_new = y - h_new cost_new = float(r_new @ r_new) if cost_new < cost: # accept step, decrease lambda x = x_new lam = max(lam / 2, 1e-12) else: # reject step, increase lambda lam = min(lam * 10, 1e10) if np.linalg.norm(delta) < tol: break return x, cost_history
Every NLS problem defines a cost surface f(x) — a scalar function over the parameter space. For a 2-parameter problem, this is a 2D landscape you can visualize. The shape of this landscape determines how hard the optimization problem is and how trustworthy the solution is.
GN and LM are local methods — they find the nearest valley from where you start. If f(x) is convex (one valley), initialization doesn't matter. If f(x) is nonconvex (multiple valleys), different starting points lead to different local minima. In VNAV: triangulation is a convex problem (it has a unique minimum). Camera pose estimation (PnP) is nonconvex when there are measurement outliers. Bundle adjustment over many frames is highly nonconvex. This is exactly why RANSAC (L8) is so important — it gives you an initial estimate that's already in the right basin of attraction.
When NLS converges to x̂, what is the uncertainty in x̂? Near the solution, the cost is approximately quadratic. The curvature of that quadratic tells you how sensitive the estimate is to noise in the measurements. Formally, the covariance of x̂ is:
where Ω = Σmeasurements−1 is the measurement information matrix and J is the Jacobian evaluated at the solution. The matrix JTΩJ is called the information matrix of the estimate (or the approximate Hessian). Its inverse is the covariance — how much uncertainty remains in each parameter direction after observing all the data.
If J has a zero singular value at the solution, then JTJ is singular, and (JTJ)−1 blows up. This means the estimate has infinite variance in some direction — the data simply cannot determine that parameter. Examples in VNAV: a monocular camera cannot determine the global scale factor (all solutions at any scale have the same cost). A bundle adjustment with only two camera views sharing identical baseline cannot determine depth (degenerate configuration). These singularities are observability failures, and they're revealed by the information matrix.
The contour shows f(a,b) for the exponential fit. The orange arrow shows the LM step direction for a given λ. Drag λ from near 0 (full GN step) to large (tiny gradient descent step). Notice how large λ makes the step direction point directly downhill (gradient) rather than toward the valley bottom (GN).
This is the payoff. Fit a noisy dataset with a nonlinear model (exponential decay y = a exp(bt) or circle), watching Gauss-Newton or LM iterate toward the solution step by step. The cost curve shows convergence. Toggle "bad init" to land in a local minimum and watch the algorithm get stuck.
Use Run to fit automatically, or Step to advance one iteration. Toggle between Gauss-Newton (no damping) and Levenberg-Marquardt (adaptive damping). The left panel shows the current fit vs noisy data; the right panel shows the cost curve decreasing (or diverging with bad init + GN). Try the "bad init" toggle with GN to see divergence, then switch to LM to see it recover.
| Concept | Formula / Rule | Key property |
|---|---|---|
| MLE under Gaussian noise | min ∑ riTΩiri | Least squares is not arbitrary |
| Normal equations (linear LS) | (ATA)x = ATb | Closed form, one step |
| Jacobian Ji | ∂ri/∂x at x̄ | Local linear sensitivity matrix |
| Gauss-Newton step | δ = −(JTJ)−1JTr | Fast near solution, can diverge |
| LM step | δ = −(JTJ+λI)−1JTr | Safe (monotone), always invertible |
| Estimate covariance | Σ = (JTΩJ)−1 | Singular = unobservable direction |
| Convergence caveat | Local only — initialization matters | Use RANSAC (L8) for good init |
VNAV L8: RANSAC — delivers the inlier set and rough initial estimate that NLS refines. RANSAC finds the right basin; NLS finds the bottom of that basin. The pipeline is always RANSAC → NLS, never NLS alone on raw (potentially outlier-contaminated) data.
VNAV L7: Two-View Geometry & Triangulation — triangulation is a linear least-squares problem: given two projections and known poses, find the 3D point minimizing reprojection error. The "DLT" (Direct Linear Transform) formulation stacks measurements into A, b and solves the normal equations.
VNAV L10 (manifold optimization) — when the state x includes a rotation or pose (an element of SO(3) or SE(3), not a vector space), you can't just add δ to x. The GN/LM framework extends to Lie groups by replacing x ← x + δ with x ← x · exp(𝚽δ), retraction on the manifold. L2 (Lie Groups) provides the mathematical scaffolding.
VNAV L13 (SLAM / bundle adjustment) — a SLAM factor graph is a massive sparse NLS problem over all poses and landmarks. GTSAM's NonlinearFactorGraph is exactly GN/LM applied to this large sparse system, exploiting sparsity in JTJ via Cholesky or QR on the sparse factor graph. This lesson is the engine behind SLAM.
| Symptom | Cause | Fix |
|---|---|---|
| Cost increases after step | Bad init, nonlinear region | Use LM; improve initialization |
| Oscillates, never converges | GN overshooting | Switch to LM or add line search |
| IndeterminantLinearSystemException (GTSAM) | JTJ singular (unobservable) | Add prior factors; check observability |
| Converges to wrong answer | Wrong local minimum | Better initialization (RANSAC output) |
| Huge estimated covariance in one direction | Near-singular Jacobian direction | Add measurements that excite that direction |
"The goal is to turn data and a model into an estimate, together with a characterization of how good that estimate is. The least-squares framework does exactly this — and the Jacobian is the key that unlocks both the estimate and its uncertainty."
— paraphrased from Luca Carlone, MIT 16.485 Lecture Notes