Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 16–18

Estimation & Nonlinear Least Squares

Your IMU gives you acceleration readings corrupted by Gaussian noise. Your camera gives you pixel coordinates that are a nonlinear function of a 3D point's position. You want the 3D position (or the camera pose, or the calibration parameters) that best explains all these noisy measurements. This is estimation — and when the measurement model is nonlinear, you need an iterative algorithm to solve it. This lesson derives everything from first principles: why squared error is not arbitrary (it's maximum likelihood under Gaussian noise), how to solve the linear case with the normal equations, and how Gauss-Newton and Levenberg-Marquardt tackle the nonlinear case. MIT 16.485 by Luca Carlone, Lectures 16–18.

Prerequisites: VNAV L7 (triangulation as a linear system) · VNAV L8 (RANSAC gives inliers + initial guess) · linear algebra (matrix multiply, transpose, inverse). No calculus beyond partial derivatives.

Chapters

Live Canvases

Derived

From First Principles

Chapter 0: The Problem — Fitting a Model to Noisy Data

You're calibrating a camera. You print a checkerboard and photograph it 30 times. For each photo you detect the corners in pixel coordinates. You know where those corners sit in 3D (it's a rigid board). You want the camera intrinsics — focal length f, principal point (c_x, c_y), distortion coefficients. There are maybe 8 unknown parameters. You have thousands of pixel measurements. How do you find the best parameters?

The measurement model says: given parameters x and a 3D point P_i, the predicted pixel location is h_i(x). Your measured pixel is z̃_i. The residual is the gap: r_i(x) = z̃_i − h_i(x). If the parameters are correct and the noise is small, every residual is small. If the parameters are wrong, some residuals will be large.

The question is: what objective function do we minimize over x to make all residuals small simultaneously? The answer is not arbitrary — it comes directly from probability theory, specifically from asking: what assignment of x makes the observed measurements most likely? That is maximum likelihood estimation (MLE).

The motivating insight: "best fit" isn't a matter of taste. Under Gaussian noise, the most probable parameter assignment is exactly the one that minimizes the sum of squared residuals. Least squares is MLE in disguise — this is WHY we use squared error, not L1 or something else.

Why this matters for VNAV

Every major VNAV computation reduces to this problem. Triangulating a 3D point from pixel observations (L7) is a linear least-squares problem. Recovering camera pose from 2D-3D correspondences (PnP) is a nonlinear least-squares problem. VIO — fusing IMU and camera to get a trajectory — is a large nonlinear least-squares problem over poses and velocities. Bundle adjustment in SLAM is a massive sparse nonlinear least-squares problem. The algorithm you'll learn here is the engine behind all of them.

RANSAC (L8) delivered you a set of inlier correspondences and a rough initial pose estimate. Now estimation takes over: it refines that rough estimate into the best possible answer given the inlier data. The pipeline is: RANSAC finds the right basin → the estimator finds the bottom of that basin.

Drag data points — watch least-squares fit update live

Click anywhere on the canvas to add a point. The orange line is the least-squares fit; the teal band shows residuals (vertical gaps). Cost = sum of squared residuals, shown top-right. Drag the noise slider to see how measurement noise spreads the cost surface.

Noise σ 0.8

You're calibrating a camera. You have 500 pixel measurements, each corrupted by independent Gaussian noise with σ = 1 pixel. You find parameters x* that make every residual exactly 0. Is x* guaranteed to be the maximum-likelihood estimate?

Yes — zero residuals means the model perfectly fits the data, which is always maximum likelihood Yes, but only if the number of parameters equals the number of measurements (the system is square) No — if the system is overdetermined (more measurements than unknowns), zero residuals is impossible. The MLE minimizes the sum of squared residuals, it doesn't drive them to zero. No — MLE always requires iterative methods; a closed-form solution can never be MLE

Chapter 1: Maximum Likelihood → Weighted Least Squares

Suppose measurement z_i is drawn from a Gaussian: z_i = h_i(x) + ε_i, where ε_i ~ N(0, Σ_i). The probability density of observing z_i given x is:

p(z_i | x) ∝ exp(−½ r_i(x)^T Σ_i⁻¹ r_i(x))

where r_i(x) = z_i − h_i(x) is the residual. Since all N measurements are independent, the joint likelihood of the full dataset Z = {z₁, …, z_N} is the product of N Gaussians.

Maximum likelihood says: find x that maximizes p(Z | x). Taking the log (which preserves the maximum) and negating (to turn a maximization into a minimization) gives:

x̂ = argmin ∑_i=1^N r_i(x)^T Σ_i⁻¹ r_i(x)

This is weighted least squares. The weight matrix is Ω_i = Σ_i⁻¹, called the information matrix. A measurement with small noise (small Σ_i) gets a large weight Ω_i — it is trusted more. A noisy measurement gets a small weight — it pulls the estimate less.

Why squared error? Not aesthetic choice — mathematical necessity. The log-likelihood of a Gaussian is quadratic in the residual. Minimizing negative log-likelihood under Gaussian noise is identically minimizing the sum of squared (information-weighted) residuals. If your noise is non-Gaussian (e.g., Laplace → L1 loss, Cauchy → robust Huber loss), you get a different objective. RANSAC (L8) is your defense against non-Gaussian gross outliers; once RANSAC hands you clean inliers, Gaussian LS is the right tool.

The scalar case: one unknown, N measurements

Let's make it concrete. You have a robot wheel that should travel at speed x (unknown). Your N encoder readings are z₁, …, z_N, each = x + noise. Residual r_i = z_i − x. The LS cost is:

f(x) = ∑_i=1^N (z_i − x)²

Setting df/dx = 0: −2 ∑(z_i − x) = 0 ⇒ x̂ = (1/N) ∑ z_i. The maximum-likelihood estimate of x under Gaussian noise is the sample mean. This is obvious in retrospect — but the derivation shows it's not a coincidence: it emerges from MLE, not from intuition.

Misconception: "least squares is arbitrary." It is not. It is the uniquely correct estimator when measurement noise is additive Gaussian. The moment your noise is non-Gaussian or has outliers, squared loss is the wrong choice. RANSAC gave you the inlier set precisely to restore the Gaussian assumption — so that least squares is justified on the remaining data.

Two sensors measure the same temperature. Sensor A has noise σ_A = 0.1°C; sensor B has noise σ_B = 1.0°C. In the MLE/weighted-LS framework, how should their information weights Ω_A and Ω_B compare?

Ω_A = Ω_B — both sensors should be trusted equally since both are measuring the same thing Ω_B > Ω_A — noisier sensors have larger variance, so larger Ω Ω_A > Ω_B — Ω = σ⁻², so smaller noise → larger weight. Sensor A is 100× more trusted. Ω_A = 10 Ω_B — weight scales linearly with the inverse standard deviation

Chapter 2: Linear Least Squares — The Setup

When the measurement model h_i(x) is linear in x — meaning h_i(x) = A_i x for some matrix A_i — the residual r_i(x) = z_i − A_i x is also linear. The cost function becomes a quadratic bowl in x: exactly one global minimum, reachable in one step.

Stack all the equations. Let A be the (m × n) matrix formed by stacking all A_i vertically, and b be the vector formed by stacking all z_i. The full cost is:

f(x) = ‖Ax − b‖² = (Ax − b)^T(Ax − b)

where we use the shorthand b = z (the stacked measurements). This is linear least squares (LLS): minimize the squared length of the residual vector Ax − b.

A worked example: fitting a line by LLS

You have 4 data points: (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8). You want to fit a line y = ax + b (two unknowns: a and b). Each point gives one equation: a·x_i + b = y_i. Stacking all four:

A = [[1, 1], [2, 1], [3, 1], [4, 1]], b = [2.1, 3.9, 6.2, 7.8]^T, x = [a, b]^T

A is 4×2, b is 4×1, x is 2×1. We have 4 equations and 2 unknowns — an overdetermined system. No exact solution exists (the 4 points aren't perfectly collinear). Least squares finds the x that minimizes the squared error.

Geometric interpretation: The vector Ax lives in the column space of A — a 2D subspace of R⁴. The vector b lives somewhere in R⁴ (generally outside the column space). Least squares finds the point Ax̂ in the column space closest to b. The residual vector (b − Ax̂) is perpendicular to the column space of A — this is the key geometric fact.

python
import numpy as np

# Data points
x_data = np.array([1, 2, 3, 4])
y_data = np.array([2.1, 3.9, 6.2, 7.8])

# Build the A matrix (design matrix)
A = np.column_stack([x_data, np.ones_like(x_data)])
b = y_data

# Solve via numpy least-squares (uses QR internally)
x_hat, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
a_hat, b_hat = x_hat
print(f"Slope: {a_hat:.4f}, Intercept: {b_hat:.4f}")
# Slope: 1.9200, Intercept: 0.1500

# The residuals
r = b - A @ x_hat
print(f"Residuals: {r}")
# [-0.03  0.07 -0.02 -0.02]  — all small, no trend

You have 3 data points and want to fit a degree-2 polynomial y = ax² + bx + c (3 unknowns). How many rows and columns does the design matrix A have, and is the system over-, under-, or exactly-determined?

2 rows × 3 columns, underdetermined — fewer equations than unknowns 3 rows × 2 columns, overdetermined — the polynomial isn't quadratic 3 rows × 3 columns, exactly-determined — 3 points uniquely fix a degree-2 polynomial (as long as x values are distinct) 3 rows × 3 columns, overdetermined — more data points always means overdetermined

Chapter 3: The Normal Equations — Deriving the Closed-Form Solution

The cost f(x) = ‖Ax − b‖² is a quadratic bowl. Its minimum is where the gradient is zero. Let's derive it explicitly.

f(x) = (Ax−b)^T(Ax−b) = x^TA^TAx − 2b^TAx + b^Tb

Taking the gradient with respect to x and setting it to zero:

∇_x f = 2A^TAx − 2A^Tb = 0

This gives the normal equations:

(A^TA) x = A^Tb

And the solution:

x̂ = (A^TA)⁻¹ A^Tb

The matrix (A^TA)⁻¹A^T is called the Moore-Penrose pseudoinverse of A. It exists and gives a unique solution whenever A^TA is invertible — equivalently, whenever A has full column rank (no redundant or unobservable parameters).

Worked numbers: solving the line-fitting example

A^TA = [[1,2,3,4],[1,1,1,1]] · [[1,1],[2,1],[3,1],[4,1]] = [[30, 10],[10, 4]]

A^Tb = [[1,2,3,4],[1,1,1,1]] · [2.1, 3.9, 6.2, 7.8]^T = [2.1+7.8+18.6+31.2, 2.1+3.9+6.2+7.8]^T = [59.7, 20.0]^T

Solving (A^TA) x = A^Tb:

[[30,10],[10,4]] · [a, b]^T = [59.7, 20.0]^T

det(A^TA) = 30×4 − 10×10 = 20. Inverse = (1/20)[[4,−10],[−10,30]]. Therefore:

â = (4×59.7 − 10×20)/20 = (238.8−200)/20 = 1.94 b̂ = (−10×59.7 + 30×20)/20 = (−597+600)/20 = 0.15

So the fitted line is y = 1.94x + 0.15. Let's verify: at x=1, predicted = 2.09 (measured 2.1). At x=4, predicted = 7.91 (measured 7.8). Residuals are all sub-0.1 — the fit is good.

When does the solution not exist? When A^TA is singular. This happens when two columns of A are linearly dependent — which means two parameters affect the measurements identically (they're unobservable as separate quantities). Example: fitting y = (a+c)x + b where a and c only appear summed. The system can't distinguish a=1,c=2 from a=2,c=1. No amount of data fixes this. In VNAV, unobservability shows up in monocular scale — you can't determine the 3D scale from a single camera; the entire trajectory is ambiguous up to a global scale factor.

Interactive Normal Equations Solver — drag noise, see the (A^TA)⁻¹A^Tb fit update

Points are fixed on a true line y = 2x + 1 with noise σ. Watch A^TA, A^Tb, and the solution x̂ update as noise changes. The condition number of A^TA (ratio of largest to smallest eigenvalue) indicates how well-conditioned the system is.

Noise σ 0.50

N points 8

The condition number of A^TA is 10⁸. What does this mean for your estimate x̂?

The estimate is very accurate — a large condition number means the system is well-conditioned The estimate is numerically unreliable — small changes in the data b can cause huge changes in x̂. One direction in parameter space is nearly unobservable. The estimate requires 10⁸ iterations to converge The matrix A^TA has 10⁸ nonzero entries — the problem is very large

Chapter 4: The Jacobian — Linearizing a Nonlinear Residual

Camera calibration's measurement model h_i(x) is not linear. The predicted pixel location of a 3D point involves perspective division (a division by the z-coordinate), plus a nonlinear distortion polynomial. Triangulation (L7) involves cross-products and SVD. These are nonlinear functions of the parameters x. So the direct normal-equation approach doesn't apply.

The key idea: around any specific parameter estimate x̄ (our current guess), any smooth nonlinear function can be approximated by a linear one — its first-order Taylor expansion. For residual r_i(x):

r_i(x̄ + δ) ≈ r_i(x̄) + J_i(x̄) δ

where δ is a small perturbation and J_i(x̄) is the Jacobian of r_i evaluated at x̄. The Jacobian is a matrix of partial derivatives:

J_i(x) = ∂r_i ⁄ ∂x = [∂r_i,1/∂x₁ ∂r_i,1/∂x₂ … ; ∂r_i,2/∂x₁ …]

If r_i is a scalar and x has n components, J_i is a 1×n row vector. If r_i has m_i components (e.g., a 2D pixel error), J_i is m_i×n. Each entry J_i,jk = ∂r_i,j/∂x_k asks: "if I increase parameter k by a tiny amount, how much does component j of residual i change?"

Geometric meaning of the Jacobian: Think of the Jacobian as the "local slope matrix." At any point x̄, it maps a change in parameters (δx) to the predicted change in residuals (δr ≈ Jδx). A large Jacobian entry means that parameter is sensitive to that measurement — small changes have big effects. A near-zero Jacobian entry means the parameter doesn't affect that measurement much — it's nearly unobservable from that data.

Concrete example: fitting an exponential

Model: h(t; a, b) = a · exp(b · t). Two unknowns: x = [a, b]^T. Residual for one point (t_i, y_i): r_i(a,b) = y_i − a exp(bt_i).

The Jacobian row for measurement i:

J_i = [∂r_i/∂a, ∂r_i/∂b] = [−exp(bt_i), −a t_i exp(bt_i)]

At current estimate ā = 3.0, b̄ = −0.5, and t₁ = 1.0: J₁ = [−exp(−0.5), −3·1·exp(−0.5)] = [−0.607, −1.819]. This says: if you increase a by 0.001, r₁ decreases by 0.000607. If you increase b by 0.001, r₁ decreases by 0.001819 — b is more sensitive to this measurement than a.

Misconception: "the Jacobian is just bookkeeping." It is the heart of the algorithm. Every step of Gauss-Newton and Levenberg-Marquardt involves solving a linear system whose coefficient matrix is J^TJ. If J is rank-deficient (some parameter combination doesn't affect any measurement), the linear system is singular and the estimate is undefined in that direction. The Jacobian is your observability detector.

Residual r(a, b) = y − a · exp(b · t). At ā = 2, b̄ = 0, t = 3: what is J = [∂r/∂a, ∂r/∂b]?

J = [−exp(0), −2·3·exp(0)] = [−1, −6] is wrong — signs are incorrect J = [exp(0), 2·3·exp(0)] = [1, 6] — positive because we're taking derivative of the model, not the residual J = [−exp(0), −2·3·exp(0)] = [−1, −6] — the negative sign comes from ∂(y − a·e^bt)/∂a = −e^bt J = [−a, −b] = [−2, 0] — derivatives are taken with respect to the exponent

Chapter 5: Gauss-Newton — Iterative Linearized Least Squares

Now we can combine the two ideas: linearize the residual around the current estimate, then take one linear least-squares step. That's Gauss-Newton.

At current estimate x̄, substitute the Taylor approximation r_i(x̄ + δ) ≈ r_i(x̄) + J_iδ into the cost:

f(x̄ + δ) ≈ ∑_i ‖r_i(x̄) + J_iδ‖²

Stack all J_i into J (an m×n matrix) and all r_i(x̄) into r(x̄) (an m-vector). This becomes:

f ≈ ‖Jδ + r(x̄)‖²

This is a linear least-squares problem in δ! Applying the normal equations:

(J^TJ)δ^* = −J^Tr(x̄)

δ^* = −(J^TJ)⁻¹ J^T r(x̄)

Update: x̄ ← x̄ + δ^*. Then recompute J and r at the new x̄, and repeat. This is the Gauss-Newton algorithm.

Why drop the second-order term?

Newton's method on f(x) would use the exact Hessian: ∇²f = 2(∑_i J_i^TJ_i + ∑_i r_i∇²r_i). Gauss-Newton drops the second term ∑_i r_i∇²r_i. This approximation is valid when (a) residuals r_i are small near the solution (good fit), and (b) the residuals are nearly linear (small second derivatives). Both hold near a good solution. The payoff: computing second derivatives (∇²r_i) is expensive and fragile; dropping it costs almost nothing when the approximation is valid, and gives us a positive-definite J^TJ (always invertible if J has full rank).

GN step summary:
1. Evaluate residuals r(x̄) at current estimate.
2. Evaluate Jacobian J(x̄) at current estimate.
3. Solve the linear system (J^TJ)δ = −J^Tr for δ.
4. Update: x̄ ← x̄ + δ.
5. Check convergence: if ‖δ‖ < ε or cost stopped decreasing, stop. Else go to 1.

One GN step by hand: exponential fit

True model: y = 3 exp(−0.5 t). Noisy observations at t = 1,2,3,4: y = [1.9, 1.2, 0.7, 0.4]. Initial guess: ā = 2, b̄ = −0.3. Predicted: h = [2e^−0.3, 2e^−0.6, 2e^−0.9, 2e^−1.2] = [1.481, 1.098, 0.814, 0.603]. Residuals: r = [0.419, 0.102, −0.114, −0.203].

Jacobians (row i = [−exp(b̄t_i), −āt_iexp(b̄t_i)]): J = [[−0.741, −1.481], [−0.549, −2.193], [−0.407, −2.442], [−0.301, −2.411]].

J^TJ ≈ [[0.960, 4.899], [4.899, 26.24]]. J^Tr ≈ [−0.223, −1.096]. Solving: δ ≈ [0.68, −0.095]. New estimate: ā = 2.68, b̄ = −0.395. After just one step, we're moving toward the truth (a=3, b=−0.5). Each subsequent step brings us closer.

python
import numpy as np

def gauss_newton(t, y, x0, n_iter=20, tol=1e-8):
    """Fit y = a*exp(b*t) by Gauss-Newton."""
    x = x0.copy()
    history = [x.copy()]
    for _ in range(n_iter):
        a, b = x
        h = a * np.exp(b * t)          # predicted values
        r = y - h                       # residuals, shape (N,)
        # Jacobian: dr/d[a,b] = [-exp(bt), -a*t*exp(bt)]
        J = np.column_stack([
            -np.exp(b * t),
            -a * t * np.exp(b * t)
        ])                              # shape (N, 2)
        # Gauss-Newton step: (J^T J) delta = -J^T r
        JtJ = J.T @ J                   # (2,2) approximate Hessian
        Jtr = J.T @ r                   # (2,) gradient direction
        delta = np.linalg.solve(JtJ, -Jtr)
        x = x + delta
        history.append(x.copy())
        if np.linalg.norm(delta) < tol:
            break
    return x, history

t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 1.2, 0.7, 0.4])
x_hat, hist = gauss_newton(t, y, x0=np.array([2.0, -0.3]))
print(f"a={x_hat[0]:.4f}, b={x_hat[1]:.4f}")
print(f"Converged in {len(hist)-1} iterations")

Gauss-Newton takes one step and the cost increases. What is the most likely cause?

A bug in the Jacobian computation — correct Jacobians always lead to decreasing cost The learning rate is too small — GN always converges if you use a small enough step The current estimate is far from the solution, where the linear approximation is inaccurate — the GN step overshoots. GN is not globally convergent; it can diverge or oscillate from bad starting points. The problem has no solution — GN cannot solve problems with more measurements than unknowns

Chapter 6: Levenberg-Marquardt — Adding a Safety Net

Gauss-Newton can diverge. Far from the solution, the linearization is poor — the actual residual surface curves away from the tangent plane, and the GN step overshoots. Levenberg (1944) and Marquardt (1963) independently proposed the same fix: add a damping term to the normal equations.

The Levenberg-Marquardt (LM) step solves:

(J^TJ + λI) δ^* = −J^Tr(x̄)

The only difference from Gauss-Newton is the λI term added to J^TJ. Here λ > 0 is the damping parameter.

What does λ do?

Think of λ as a trust knob. When λ is very small (λ → 0), the step reduces to Gauss-Newton — we trust the linear approximation completely. When λ is very large (λ → ∞), the diagonal dominates: (J^TJ + λI)δ ≈ λδ = −J^Tr, so δ ≈ −(1/λ)J^Tr. This is gradient descent with step size 1/λ — small, cautious steps. LM interpolates between the two extremes.

Trust-region interpretation: LM implicitly constrains δ to lie within a sphere of radius ≈ 1/√λ around x̄. Large λ = small sphere = only move a little. Small λ = large sphere = take the full GN step. The algorithm adjusts λ automatically: if a step reduces the cost as predicted by the linear model, λ decreases (expand the trust region — the approximation was good). If the step increases cost, λ increases (shrink the trust region — the step was too ambitious).

The LM update rule for λ

After computing δ^*: if the actual cost f(x̄ + δ^*) < f(x̄) (the step actually helped), accept the update x̄ ← x̄ + δ^* and decrease λ by a factor (e.g., λ ← λ/2 or λ ← λ/10). If f(x̄ + δ^*) ≥ f(x̄) (step made things worse), reject the update and increase λ (e.g., λ ← λ×10). Start with λ = 10⁻³ · max(diag(J^TJ)).

Misconception: "LM converges to the global minimum." It does not — no local method does. LM converges to a local minimum. What LM guarantees over bare GN: it never increases the cost (bad steps are rejected), so it is monotonically decreasing. And the λI term makes J^TJ + λI always invertible (full rank regardless of J's rank), preventing blow-up even when some parameters are unobservable. LM is the industry standard (used in GTSAM, Ceres, g2o) precisely for these robustness properties.

Gauss-Newton step-through — click "Step" to watch the iterate move on the cost surface

The contour plot shows the cost f(a,b) for fitting y = a·exp(b·t) to 5 noisy points. The orange dot is the current estimate; the arrow shows the GN step direction. The linearization is valid near the minimum but can send the estimate off into high-cost regions from far away. Watch what happens when you start from the red "bad init" position.

LM λ 10⁻³

python
def levenberg_marquardt(t, y, x0, n_iter=50, lam0=1e-3, tol=1e-9):
    x = x0.copy()
    lam = lam0
    cost_history = []
    for _ in range(n_iter):
        a, b = x
        h = a * np.exp(b * t)
        r = y - h
        J = np.column_stack([-np.exp(b*t), -a*t*np.exp(b*t)])
        JtJ = J.T @ J
        Jtr = J.T @ r
        cost = float(r @ r)
        cost_history.append(cost)
        # LM step: add damping
        delta = np.linalg.solve(JtJ + lam * np.eye(2), -Jtr)
        x_new = x + delta
        h_new = x_new[0] * np.exp(x_new[1] * t)
        r_new = y - h_new
        cost_new = float(r_new @ r_new)
        if cost_new < cost:        # accept step, decrease lambda
            x = x_new
            lam = max(lam / 2, 1e-12)
        else:                       # reject step, increase lambda
            lam = min(lam * 10, 1e10)
        if np.linalg.norm(delta) < tol:
            break
    return x, cost_history

After an LM step with λ = 10⁻⁴, the cost goes from 12.3 to 14.7. What should happen next?

Accept the step (x updated) and decrease λ — the algorithm must always accept steps and refine Accept the step but increase λ — the step was bad so we damp more next time, but keep the new x Reject the step, decrease λ — the small λ caused overshoot, so reduce damping further Reject the step (x unchanged), increase λ — cost increased means the linear approximation was bad; shrink the trust region by damping more

Chapter 7: Cost Surfaces, Convergence & Covariance

Every NLS problem defines a cost surface f(x) — a scalar function over the parameter space. For a 2-parameter problem, this is a 2D landscape you can visualize. The shape of this landscape determines how hard the optimization problem is and how trustworthy the solution is.

Why initialization matters

GN and LM are local methods — they find the nearest valley from where you start. If f(x) is convex (one valley), initialization doesn't matter. If f(x) is nonconvex (multiple valleys), different starting points lead to different local minima. In VNAV: triangulation is a convex problem (it has a unique minimum). Camera pose estimation (PnP) is nonconvex when there are measurement outliers. Bundle adjustment over many frames is highly nonconvex. This is exactly why RANSAC (L8) is so important — it gives you an initial estimate that's already in the right basin of attraction.

Covariance of the estimate

When NLS converges to x̂, what is the uncertainty in x̂? Near the solution, the cost is approximately quadratic. The curvature of that quadratic tells you how sensitive the estimate is to noise in the measurements. Formally, the covariance of x̂ is:

Σ_x̂ ≈ (J^TΩ J)⁻¹

where Ω = Σ_measurements⁻¹ is the measurement information matrix and J is the Jacobian evaluated at the solution. The matrix J^TΩJ is called the information matrix of the estimate (or the approximate Hessian). Its inverse is the covariance — how much uncertainty remains in each parameter direction after observing all the data.

Reading the covariance: The diagonal entries Σ_ii are the variance of each individual parameter. Large diagonal = large uncertainty in that parameter. But the off-diagonal terms matter too: large Σ_ij means parameters i and j are correlated in their uncertainty — knowing one constrains the other. In SLAM, the correlation between pose and map uncertainty grows over time, until a loop closure "tightens" both simultaneously.

What a singular Jacobian means

If J has a zero singular value at the solution, then J^TJ is singular, and (J^TJ)⁻¹ blows up. This means the estimate has infinite variance in some direction — the data simply cannot determine that parameter. Examples in VNAV: a monocular camera cannot determine the global scale factor (all solutions at any scale have the same cost). A bundle adjustment with only two camera views sharing identical baseline cannot determine depth (degenerate configuration). These singularities are observability failures, and they're revealed by the information matrix.

LM λ explorer — slider morphs step from Gauss-Newton to gradient descent

The contour shows f(a,b) for the exponential fit. The orange arrow shows the LM step direction for a given λ. Drag λ from near 0 (full GN step) to large (tiny gradient descent step). Notice how large λ makes the step direction point directly downhill (gradient) rather than toward the valley bottom (GN).

log₁₀(λ) 0.01

At the converged solution x̂, the information matrix J^TJ has eigenvalues [1000, 0.001]. What does this tell you?

The solution is very accurate — eigenvalue 1000 indicates strong convergence One parameter direction is well-constrained (eigenvalue 1000 → variance ~0.001), but another is barely constrained (eigenvalue 0.001 → variance ~1000). The problem is nearly unobservable in that second direction. The algorithm ran 1000 iterations and then stalled at precision 0.001 The condition number is fine (1000/0.001 = 10⁶) — this ratio is acceptable for double precision

Chapter 8: Showcase — Full NLS Fit with Live Cost Curve

This is the payoff. Fit a noisy dataset with a nonlinear model (exponential decay y = a exp(bt) or circle), watching Gauss-Newton or LM iterate toward the solution step by step. The cost curve shows convergence. Toggle "bad init" to land in a local minimum and watch the algorithm get stuck.

Full Nonlinear Least-Squares Optimizer

Use Run to fit automatically, or Step to advance one iteration. Toggle between Gauss-Newton (no damping) and Levenberg-Marquardt (adaptive damping). The left panel shows the current fit vs noisy data; the right panel shows the cost curve decreasing (or diverging with bad init + GN). Try the "bad init" toggle with GN to see divergence, then switch to LM to see it recover.

Use LM damping Bad initialization

Noise σ 0.30

Chapter 9: Connections & Cheat Sheet

The complete derivation chain

Gaussian noise model

z_i = h_i(x) + ε_i, ε_i ~ N(0,Σ_i)

↓ take −log(likelihood)

Weighted least squares

min ∑ r_i^TΣ_i⁻¹r_i (MLE is WLS)

↓ if h_i(x) = A_ix (linear)

Normal equations

(A^TA) x = A^Tb → x̂ = (A^TA)⁻¹A^Tb

↓ if h_i(x) is nonlinear: linearize via Jacobian

Gauss-Newton step

(J^TJ)δ = −J^Tr → δ^* = −(J^TJ)⁻¹J^Tr

↓ add damping for robustness

Levenberg-Marquardt step

(J^TJ + λI)δ = −J^Tr → interp. GN (λ→0) & grad.desc. (λ→∞)

↓ at convergence

Covariance of estimate

Σ_x̂ = (J^TΩJ)⁻¹ (inverse information matrix)

Quick reference

Concept	Formula / Rule	Key property
MLE under Gaussian noise	min ∑ r_i^TΩ_ir_i	Least squares is not arbitrary
Normal equations (linear LS)	(A^TA)x = A^Tb	Closed form, one step
Jacobian J_i	∂r_i/∂x at x̄	Local linear sensitivity matrix
Gauss-Newton step	δ = −(J^TJ)⁻¹J^Tr	Fast near solution, can diverge
LM step	δ = −(J^TJ+λI)⁻¹J^Tr	Safe (monotone), always invertible
Estimate covariance	Σ = (J^TΩJ)⁻¹	Singular = unobservable direction
Convergence caveat	Local only — initialization matters	Use RANSAC (L8) for good init

Links to this lesson's neighbors

VNAV L8: RANSAC — delivers the inlier set and rough initial estimate that NLS refines. RANSAC finds the right basin; NLS finds the bottom of that basin. The pipeline is always RANSAC → NLS, never NLS alone on raw (potentially outlier-contaminated) data.

VNAV L7: Two-View Geometry & Triangulation — triangulation is a linear least-squares problem: given two projections and known poses, find the 3D point minimizing reprojection error. The "DLT" (Direct Linear Transform) formulation stacks measurements into A, b and solves the normal equations.

VNAV L10 (manifold optimization) — when the state x includes a rotation or pose (an element of SO(3) or SE(3), not a vector space), you can't just add δ to x. The GN/LM framework extends to Lie groups by replacing x ← x + δ with x ← x · exp(𝚽δ), retraction on the manifold. L2 (Lie Groups) provides the mathematical scaffolding.

VNAV L13 (SLAM / bundle adjustment) — a SLAM factor graph is a massive sparse NLS problem over all poses and landmarks. GTSAM's NonlinearFactorGraph is exactly GN/LM applied to this large sparse system, exploiting sparsity in J^TJ via Cholesky or QR on the sparse factor graph. This lesson is the engine behind SLAM.

Common failure modes and fixes

Symptom	Cause	Fix
Cost increases after step	Bad init, nonlinear region	Use LM; improve initialization
Oscillates, never converges	GN overshooting	Switch to LM or add line search
IndeterminantLinearSystemException (GTSAM)	J^TJ singular (unobservable)	Add prior factors; check observability
Converges to wrong answer	Wrong local minimum	Better initialization (RANSAC output)
Huge estimated covariance in one direction	Near-singular Jacobian direction	Add measurements that excite that direction

The Feynman test for this lesson: Can you explain to someone why minimizing the sum of squared residuals is the right thing to do (not just a convention), derive the normal equations by setting a gradient to zero, explain what the Jacobian tells you geometrically, and describe in words what happens when you add λI to the GN normal equations? If yes, you understand estimation at the level needed to read the GTSAM documentation, SLAM papers, and VIO system descriptions without getting lost.

"The goal is to turn data and a model into an estimate, together with a characterization of how good that estimate is. The least-squares framework does exactly this — and the Jacobian is the key that unlocks both the estimate and its uncertainty."
— paraphrased from Luca Carlone, MIT 16.485 Lecture Notes