Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 16–18

Estimation & Nonlinear Least Squares

Your IMU gives you acceleration readings corrupted by Gaussian noise. Your camera gives you pixel coordinates that are a nonlinear function of a 3D point's position. You want the 3D position (or the camera pose, or the calibration parameters) that best explains all these noisy measurements. This is estimation — and when the measurement model is nonlinear, you need an iterative algorithm to solve it. This lesson derives everything from first principles: why squared error is not arbitrary (it's maximum likelihood under Gaussian noise), how to solve the linear case with the normal equations, and how Gauss-Newton and Levenberg-Marquardt tackle the nonlinear case. MIT 16.485 by Luca Carlone, Lectures 16–18.

Prerequisites: VNAV L7 (triangulation as a linear system) · VNAV L8 (RANSAC gives inliers + initial guess) · linear algebra (matrix multiply, transpose, inverse). No calculus beyond partial derivatives.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Problem — Fitting a Model to Noisy Data

You're calibrating a camera. You print a checkerboard and photograph it 30 times. For each photo you detect the corners in pixel coordinates. You know where those corners sit in 3D (it's a rigid board). You want the camera intrinsics — focal length f, principal point (cx, cy), distortion coefficients. There are maybe 8 unknown parameters. You have thousands of pixel measurements. How do you find the best parameters?

The measurement model says: given parameters x and a 3D point Pi, the predicted pixel location is hi(x). Your measured pixel is z̃i. The residual is the gap: ri(x) = z̃i − hi(x). If the parameters are correct and the noise is small, every residual is small. If the parameters are wrong, some residuals will be large.

The question is: what objective function do we minimize over x to make all residuals small simultaneously? The answer is not arbitrary — it comes directly from probability theory, specifically from asking: what assignment of x makes the observed measurements most likely? That is maximum likelihood estimation (MLE).

The motivating insight: "best fit" isn't a matter of taste. Under Gaussian noise, the most probable parameter assignment is exactly the one that minimizes the sum of squared residuals. Least squares is MLE in disguise — this is WHY we use squared error, not L1 or something else.

Why this matters for VNAV

Every major VNAV computation reduces to this problem. Triangulating a 3D point from pixel observations (L7) is a linear least-squares problem. Recovering camera pose from 2D-3D correspondences (PnP) is a nonlinear least-squares problem. VIO — fusing IMU and camera to get a trajectory — is a large nonlinear least-squares problem over poses and velocities. Bundle adjustment in SLAM is a massive sparse nonlinear least-squares problem. The algorithm you'll learn here is the engine behind all of them.

RANSAC (L8) delivered you a set of inlier correspondences and a rough initial pose estimate. Now estimation takes over: it refines that rough estimate into the best possible answer given the inlier data. The pipeline is: RANSAC finds the right basin → the estimator finds the bottom of that basin.

Drag data points — watch least-squares fit update live

Click anywhere on the canvas to add a point. The orange line is the least-squares fit; the teal band shows residuals (vertical gaps). Cost = sum of squared residuals, shown top-right. Drag the noise slider to see how measurement noise spreads the cost surface.

Noise σ 0.8
You're calibrating a camera. You have 500 pixel measurements, each corrupted by independent Gaussian noise with σ = 1 pixel. You find parameters x* that make every residual exactly 0. Is x* guaranteed to be the maximum-likelihood estimate?

Chapter 1: Maximum Likelihood → Weighted Least Squares

Suppose measurement zi is drawn from a Gaussian: zi = hi(x) + εi, where εi ~ N(0, Σi). The probability density of observing zi given x is:

p(zi | x) ∝ exp(−½ ri(x)T Σi−1 ri(x))

where ri(x) = zi − hi(x) is the residual. Since all N measurements are independent, the joint likelihood of the full dataset Z = {z1, …, zN} is the product of N Gaussians.

Maximum likelihood says: find x that maximizes p(Z | x). Taking the log (which preserves the maximum) and negating (to turn a maximization into a minimization) gives:

x̂ = argmin ∑i=1N ri(x)T Σi−1 ri(x)

This is weighted least squares. The weight matrix is Ωi = Σi−1, called the information matrix. A measurement with small noise (small Σi) gets a large weight Ωi — it is trusted more. A noisy measurement gets a small weight — it pulls the estimate less.

Why squared error? Not aesthetic choice — mathematical necessity. The log-likelihood of a Gaussian is quadratic in the residual. Minimizing negative log-likelihood under Gaussian noise is identically minimizing the sum of squared (information-weighted) residuals. If your noise is non-Gaussian (e.g., Laplace → L1 loss, Cauchy → robust Huber loss), you get a different objective. RANSAC (L8) is your defense against non-Gaussian gross outliers; once RANSAC hands you clean inliers, Gaussian LS is the right tool.

The scalar case: one unknown, N measurements

Let's make it concrete. You have a robot wheel that should travel at speed x (unknown). Your N encoder readings are z1, …, zN, each = x + noise. Residual ri = zi − x. The LS cost is:

f(x) = ∑i=1N (zi − x)2

Setting df/dx = 0: −2 ∑(zi − x) = 0 ⇒ x̂ = (1/N) ∑ zi. The maximum-likelihood estimate of x under Gaussian noise is the sample mean. This is obvious in retrospect — but the derivation shows it's not a coincidence: it emerges from MLE, not from intuition.

Misconception: "least squares is arbitrary." It is not. It is the uniquely correct estimator when measurement noise is additive Gaussian. The moment your noise is non-Gaussian or has outliers, squared loss is the wrong choice. RANSAC gave you the inlier set precisely to restore the Gaussian assumption — so that least squares is justified on the remaining data.
Two sensors measure the same temperature. Sensor A has noise σA = 0.1°C; sensor B has noise σB = 1.0°C. In the MLE/weighted-LS framework, how should their information weights ΩA and ΩB compare?

Chapter 2: Linear Least Squares — The Setup

When the measurement model hi(x) is linear in x — meaning hi(x) = Ai x for some matrix Ai — the residual ri(x) = zi − Ai x is also linear. The cost function becomes a quadratic bowl in x: exactly one global minimum, reachable in one step.

Stack all the equations. Let A be the (m × n) matrix formed by stacking all Ai vertically, and b be the vector formed by stacking all zi. The full cost is:

f(x) = ‖Ax − b‖2 = (Ax − b)T(Ax − b)

where we use the shorthand b = z (the stacked measurements). This is linear least squares (LLS): minimize the squared length of the residual vector Ax − b.

A worked example: fitting a line by LLS

You have 4 data points: (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8). You want to fit a line y = ax + b (two unknowns: a and b). Each point gives one equation: a·xi + b = yi. Stacking all four:

A = [[1, 1], [2, 1], [3, 1], [4, 1]],   b = [2.1, 3.9, 6.2, 7.8]T,   x = [a, b]T

A is 4×2, b is 4×1, x is 2×1. We have 4 equations and 2 unknowns — an overdetermined system. No exact solution exists (the 4 points aren't perfectly collinear). Least squares finds the x that minimizes the squared error.

Geometric interpretation: The vector Ax lives in the column space of A — a 2D subspace of R4. The vector b lives somewhere in R4 (generally outside the column space). Least squares finds the point Ax̂ in the column space closest to b. The residual vector (b − Ax̂) is perpendicular to the column space of A — this is the key geometric fact.
python
import numpy as np

# Data points
x_data = np.array([1, 2, 3, 4])
y_data = np.array([2.1, 3.9, 6.2, 7.8])

# Build the A matrix (design matrix)
A = np.column_stack([x_data, np.ones_like(x_data)])
b = y_data

# Solve via numpy least-squares (uses QR internally)
x_hat, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
a_hat, b_hat = x_hat
print(f"Slope: {a_hat:.4f}, Intercept: {b_hat:.4f}")
# Slope: 1.9200, Intercept: 0.1500

# The residuals
r = b - A @ x_hat
print(f"Residuals: {r}")
# [-0.03  0.07 -0.02 -0.02]  — all small, no trend
You have 3 data points and want to fit a degree-2 polynomial y = ax2 + bx + c (3 unknowns). How many rows and columns does the design matrix A have, and is the system over-, under-, or exactly-determined?

Chapter 3: The Normal Equations — Deriving the Closed-Form Solution

The cost f(x) = ‖Ax − b‖2 is a quadratic bowl. Its minimum is where the gradient is zero. Let's derive it explicitly.

f(x) = (Ax−b)T(Ax−b) = xTATAx − 2bTAx + bTb

Taking the gradient with respect to x and setting it to zero:

x f = 2ATAx − 2ATb = 0

This gives the normal equations:

(ATA) x = ATb

And the solution:

x̂ = (ATA)−1 ATb

The matrix (ATA)−1AT is called the Moore-Penrose pseudoinverse of A. It exists and gives a unique solution whenever ATA is invertible — equivalently, whenever A has full column rank (no redundant or unobservable parameters).

Worked numbers: solving the line-fitting example

ATA = [[1,2,3,4],[1,1,1,1]] · [[1,1],[2,1],[3,1],[4,1]] = [[30, 10],[10, 4]]
ATb = [[1,2,3,4],[1,1,1,1]] · [2.1, 3.9, 6.2, 7.8]T = [2.1+7.8+18.6+31.2, 2.1+3.9+6.2+7.8]T = [59.7, 20.0]T

Solving (ATA) x = ATb:

[[30,10],[10,4]] · [a, b]T = [59.7, 20.0]T

det(ATA) = 30×4 − 10×10 = 20. Inverse = (1/20)[[4,−10],[−10,30]]. Therefore:

â = (4×59.7 − 10×20)/20 = (238.8−200)/20 = 1.94     b̂ = (−10×59.7 + 30×20)/20 = (−597+600)/20 = 0.15

So the fitted line is y = 1.94x + 0.15. Let's verify: at x=1, predicted = 2.09 (measured 2.1). At x=4, predicted = 7.91 (measured 7.8). Residuals are all sub-0.1 — the fit is good.

When does the solution not exist? When ATA is singular. This happens when two columns of A are linearly dependent — which means two parameters affect the measurements identically (they're unobservable as separate quantities). Example: fitting y = (a+c)x + b where a and c only appear summed. The system can't distinguish a=1,c=2 from a=2,c=1. No amount of data fixes this. In VNAV, unobservability shows up in monocular scale — you can't determine the 3D scale from a single camera; the entire trajectory is ambiguous up to a global scale factor.
Interactive Normal Equations Solver — drag noise, see the (ATA)−1ATb fit update

Points are fixed on a true line y = 2x + 1 with noise σ. Watch ATA, ATb, and the solution x̂ update as noise changes. The condition number of ATA (ratio of largest to smallest eigenvalue) indicates how well-conditioned the system is.

Noise σ 0.50
N points 8
The condition number of ATA is 108. What does this mean for your estimate x̂?

Chapter 4: The Jacobian — Linearizing a Nonlinear Residual

Camera calibration's measurement model hi(x) is not linear. The predicted pixel location of a 3D point involves perspective division (a division by the z-coordinate), plus a nonlinear distortion polynomial. Triangulation (L7) involves cross-products and SVD. These are nonlinear functions of the parameters x. So the direct normal-equation approach doesn't apply.

The key idea: around any specific parameter estimate x̄ (our current guess), any smooth nonlinear function can be approximated by a linear one — its first-order Taylor expansion. For residual ri(x):

ri(x̄ + δ) ≈ ri(x̄) + Ji(x̄) δ

where δ is a small perturbation and Ji(x̄) is the Jacobian of ri evaluated at x̄. The Jacobian is a matrix of partial derivatives:

Ji(x) = ∂ri ⁄ ∂x = [∂ri,1/∂x1   ∂ri,1/∂x2   … ;   ∂ri,2/∂x1   …]

If ri is a scalar and x has n components, Ji is a 1×n row vector. If ri has mi components (e.g., a 2D pixel error), Ji is mi×n. Each entry Ji,jk = ∂ri,j/∂xk asks: "if I increase parameter k by a tiny amount, how much does component j of residual i change?"

Geometric meaning of the Jacobian: Think of the Jacobian as the "local slope matrix." At any point x̄, it maps a change in parameters (δx) to the predicted change in residuals (δr ≈ Jδx). A large Jacobian entry means that parameter is sensitive to that measurement — small changes have big effects. A near-zero Jacobian entry means the parameter doesn't affect that measurement much — it's nearly unobservable from that data.

Concrete example: fitting an exponential

Model: h(t; a, b) = a · exp(b · t). Two unknowns: x = [a, b]T. Residual for one point (ti, yi): ri(a,b) = yi − a exp(bti).

The Jacobian row for measurement i:

Ji = [∂ri/∂a, ∂ri/∂b] = [−exp(bti), −a ti exp(bti)]

At current estimate ā = 3.0, b̄ = −0.5, and t1 = 1.0: J1 = [−exp(−0.5), −3·1·exp(−0.5)] = [−0.607, −1.819]. This says: if you increase a by 0.001, r1 decreases by 0.000607. If you increase b by 0.001, r1 decreases by 0.001819 — b is more sensitive to this measurement than a.

Misconception: "the Jacobian is just bookkeeping." It is the heart of the algorithm. Every step of Gauss-Newton and Levenberg-Marquardt involves solving a linear system whose coefficient matrix is JTJ. If J is rank-deficient (some parameter combination doesn't affect any measurement), the linear system is singular and the estimate is undefined in that direction. The Jacobian is your observability detector.
Residual r(a, b) = y − a · exp(b · t). At ā = 2, b̄ = 0, t = 3: what is J = [∂r/∂a, ∂r/∂b]?

Chapter 5: Gauss-Newton — Iterative Linearized Least Squares

Now we can combine the two ideas: linearize the residual around the current estimate, then take one linear least-squares step. That's Gauss-Newton.

At current estimate x̄, substitute the Taylor approximation ri(x̄ + δ) ≈ ri(x̄) + Jiδ into the cost:

f(x̄ + δ) ≈ ∑i ‖ri(x̄) + Jiδ‖2

Stack all Ji into J (an m×n matrix) and all ri(x̄) into r(x̄) (an m-vector). This becomes:

f ≈ ‖Jδ + r(x̄)‖2

This is a linear least-squares problem in δ! Applying the normal equations:

(JTJ)δ* = −JTr(x̄)
δ* = −(JTJ)−1 JT r(x̄)

Update: x̄ ← x̄ + δ*. Then recompute J and r at the new x̄, and repeat. This is the Gauss-Newton algorithm.

Why drop the second-order term?

Newton's method on f(x) would use the exact Hessian: ∇2f = 2(∑i JiTJi + ∑i ri2ri). Gauss-Newton drops the second term ∑i ri2ri. This approximation is valid when (a) residuals ri are small near the solution (good fit), and (b) the residuals are nearly linear (small second derivatives). Both hold near a good solution. The payoff: computing second derivatives (∇2ri) is expensive and fragile; dropping it costs almost nothing when the approximation is valid, and gives us a positive-definite JTJ (always invertible if J has full rank).

GN step summary:
1. Evaluate residuals r(x̄) at current estimate.
2. Evaluate Jacobian J(x̄) at current estimate.
3. Solve the linear system (JTJ)δ = −JTr for δ.
4. Update: x̄ ← x̄ + δ.
5. Check convergence: if ‖δ‖ < ε or cost stopped decreasing, stop. Else go to 1.

One GN step by hand: exponential fit

True model: y = 3 exp(−0.5 t). Noisy observations at t = 1,2,3,4: y = [1.9, 1.2, 0.7, 0.4]. Initial guess: ā = 2, b̄ = −0.3. Predicted: h = [2e−0.3, 2e−0.6, 2e−0.9, 2e−1.2] = [1.481, 1.098, 0.814, 0.603]. Residuals: r = [0.419, 0.102, −0.114, −0.203].

Jacobians (row i = [−exp(b̄ti), −ātiexp(b̄ti)]): J = [[−0.741, −1.481], [−0.549, −2.193], [−0.407, −2.442], [−0.301, −2.411]].

JTJ ≈ [[0.960, 4.899], [4.899, 26.24]]. JTr ≈ [−0.223, −1.096]. Solving: δ ≈ [0.68, −0.095]. New estimate: ā = 2.68, b̄ = −0.395. After just one step, we're moving toward the truth (a=3, b=−0.5). Each subsequent step brings us closer.

python
import numpy as np

def gauss_newton(t, y, x0, n_iter=20, tol=1e-8):
    """Fit y = a*exp(b*t) by Gauss-Newton."""
    x = x0.copy()
    history = [x.copy()]
    for _ in range(n_iter):
        a, b = x
        h = a * np.exp(b * t)          # predicted values
        r = y - h                       # residuals, shape (N,)
        # Jacobian: dr/d[a,b] = [-exp(bt), -a*t*exp(bt)]
        J = np.column_stack([
            -np.exp(b * t),
            -a * t * np.exp(b * t)
        ])                              # shape (N, 2)
        # Gauss-Newton step: (J^T J) delta = -J^T r
        JtJ = J.T @ J                   # (2,2) approximate Hessian
        Jtr = J.T @ r                   # (2,) gradient direction
        delta = np.linalg.solve(JtJ, -Jtr)
        x = x + delta
        history.append(x.copy())
        if np.linalg.norm(delta) < tol:
            break
    return x, history

t = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 1.2, 0.7, 0.4])
x_hat, hist = gauss_newton(t, y, x0=np.array([2.0, -0.3]))
print(f"a={x_hat[0]:.4f}, b={x_hat[1]:.4f}")
print(f"Converged in {len(hist)-1} iterations")
Gauss-Newton takes one step and the cost increases. What is the most likely cause?

Chapter 6: Levenberg-Marquardt — Adding a Safety Net

Gauss-Newton can diverge. Far from the solution, the linearization is poor — the actual residual surface curves away from the tangent plane, and the GN step overshoots. Levenberg (1944) and Marquardt (1963) independently proposed the same fix: add a damping term to the normal equations.

The Levenberg-Marquardt (LM) step solves:

(JTJ + λI) δ* = −JTr(x̄)

The only difference from Gauss-Newton is the λI term added to JTJ. Here λ > 0 is the damping parameter.

What does λ do?

Think of λ as a trust knob. When λ is very small (λ → 0), the step reduces to Gauss-Newton — we trust the linear approximation completely. When λ is very large (λ → ∞), the diagonal dominates: (JTJ + λI)δ ≈ λδ = −JTr, so δ ≈ −(1/λ)JTr. This is gradient descent with step size 1/λ — small, cautious steps. LM interpolates between the two extremes.

Trust-region interpretation: LM implicitly constrains δ to lie within a sphere of radius ≈ 1/√λ around x̄. Large λ = small sphere = only move a little. Small λ = large sphere = take the full GN step. The algorithm adjusts λ automatically: if a step reduces the cost as predicted by the linear model, λ decreases (expand the trust region — the approximation was good). If the step increases cost, λ increases (shrink the trust region — the step was too ambitious).

The LM update rule for λ

After computing δ*: if the actual cost f(x̄ + δ*) < f(x̄) (the step actually helped), accept the update x̄ ← x̄ + δ* and decrease λ by a factor (e.g., λ ← λ/2 or λ ← λ/10). If f(x̄ + δ*) ≥ f(x̄) (step made things worse), reject the update and increase λ (e.g., λ ← λ×10). Start with λ = 10−3 · max(diag(JTJ)).

Misconception: "LM converges to the global minimum." It does not — no local method does. LM converges to a local minimum. What LM guarantees over bare GN: it never increases the cost (bad steps are rejected), so it is monotonically decreasing. And the λI term makes JTJ + λI always invertible (full rank regardless of J's rank), preventing blow-up even when some parameters are unobservable. LM is the industry standard (used in GTSAM, Ceres, g2o) precisely for these robustness properties.
Gauss-Newton step-through — click "Step" to watch the iterate move on the cost surface

The contour plot shows the cost f(a,b) for fitting y = a·exp(b·t) to 5 noisy points. The orange dot is the current estimate; the arrow shows the GN step direction. The linearization is valid near the minimum but can send the estimate off into high-cost regions from far away. Watch what happens when you start from the red "bad init" position.

LM λ 10−3
python
def levenberg_marquardt(t, y, x0, n_iter=50, lam0=1e-3, tol=1e-9):
    x = x0.copy()
    lam = lam0
    cost_history = []
    for _ in range(n_iter):
        a, b = x
        h = a * np.exp(b * t)
        r = y - h
        J = np.column_stack([-np.exp(b*t), -a*t*np.exp(b*t)])
        JtJ = J.T @ J
        Jtr = J.T @ r
        cost = float(r @ r)
        cost_history.append(cost)
        # LM step: add damping
        delta = np.linalg.solve(JtJ + lam * np.eye(2), -Jtr)
        x_new = x + delta
        h_new = x_new[0] * np.exp(x_new[1] * t)
        r_new = y - h_new
        cost_new = float(r_new @ r_new)
        if cost_new < cost:        # accept step, decrease lambda
            x = x_new
            lam = max(lam / 2, 1e-12)
        else:                       # reject step, increase lambda
            lam = min(lam * 10, 1e10)
        if np.linalg.norm(delta) < tol:
            break
    return x, cost_history
After an LM step with λ = 10−4, the cost goes from 12.3 to 14.7. What should happen next?

Chapter 7: Cost Surfaces, Convergence & Covariance

Every NLS problem defines a cost surface f(x) — a scalar function over the parameter space. For a 2-parameter problem, this is a 2D landscape you can visualize. The shape of this landscape determines how hard the optimization problem is and how trustworthy the solution is.

Why initialization matters

GN and LM are local methods — they find the nearest valley from where you start. If f(x) is convex (one valley), initialization doesn't matter. If f(x) is nonconvex (multiple valleys), different starting points lead to different local minima. In VNAV: triangulation is a convex problem (it has a unique minimum). Camera pose estimation (PnP) is nonconvex when there are measurement outliers. Bundle adjustment over many frames is highly nonconvex. This is exactly why RANSAC (L8) is so important — it gives you an initial estimate that's already in the right basin of attraction.

Covariance of the estimate

When NLS converges to x̂, what is the uncertainty in x̂? Near the solution, the cost is approximately quadratic. The curvature of that quadratic tells you how sensitive the estimate is to noise in the measurements. Formally, the covariance of x̂ is:

Σ ≈ (JTΩ J)−1

where Ω = Σmeasurements−1 is the measurement information matrix and J is the Jacobian evaluated at the solution. The matrix JTΩJ is called the information matrix of the estimate (or the approximate Hessian). Its inverse is the covariance — how much uncertainty remains in each parameter direction after observing all the data.

Reading the covariance: The diagonal entries Σii are the variance of each individual parameter. Large diagonal = large uncertainty in that parameter. But the off-diagonal terms matter too: large Σij means parameters i and j are correlated in their uncertainty — knowing one constrains the other. In SLAM, the correlation between pose and map uncertainty grows over time, until a loop closure "tightens" both simultaneously.

What a singular Jacobian means

If J has a zero singular value at the solution, then JTJ is singular, and (JTJ)−1 blows up. This means the estimate has infinite variance in some direction — the data simply cannot determine that parameter. Examples in VNAV: a monocular camera cannot determine the global scale factor (all solutions at any scale have the same cost). A bundle adjustment with only two camera views sharing identical baseline cannot determine depth (degenerate configuration). These singularities are observability failures, and they're revealed by the information matrix.

LM λ explorer — slider morphs step from Gauss-Newton to gradient descent

The contour shows f(a,b) for the exponential fit. The orange arrow shows the LM step direction for a given λ. Drag λ from near 0 (full GN step) to large (tiny gradient descent step). Notice how large λ makes the step direction point directly downhill (gradient) rather than toward the valley bottom (GN).

log10(λ) 0.01
At the converged solution x̂, the information matrix JTJ has eigenvalues [1000, 0.001]. What does this tell you?

Chapter 8: Showcase — Full NLS Fit with Live Cost Curve

This is the payoff. Fit a noisy dataset with a nonlinear model (exponential decay y = a exp(bt) or circle), watching Gauss-Newton or LM iterate toward the solution step by step. The cost curve shows convergence. Toggle "bad init" to land in a local minimum and watch the algorithm get stuck.

Full Nonlinear Least-Squares Optimizer

Use Run to fit automatically, or Step to advance one iteration. Toggle between Gauss-Newton (no damping) and Levenberg-Marquardt (adaptive damping). The left panel shows the current fit vs noisy data; the right panel shows the cost curve decreasing (or diverging with bad init + GN). Try the "bad init" toggle with GN to see divergence, then switch to LM to see it recover.

Noise σ 0.30

Chapter 9: Connections & Cheat Sheet

The complete derivation chain

Gaussian noise model
zi = hi(x) + εi, εi ~ N(0,Σi)
↓ take −log(likelihood)
Weighted least squares
min ∑ riTΣi−1ri   (MLE is WLS)
↓ if hi(x) = Aix (linear)
Normal equations
(ATA) x = ATb  →  x̂ = (ATA)−1ATb
↓ if hi(x) is nonlinear: linearize via Jacobian
Gauss-Newton step
(JTJ)δ = −JTr  →  δ* = −(JTJ)−1JTr
↓ add damping for robustness
Levenberg-Marquardt step
(JTJ + λI)δ = −JTr  →  interp. GN (λ→0) & grad.desc. (λ→∞)
↓ at convergence
Covariance of estimate
Σ = (JTΩJ)−1    (inverse information matrix)

Quick reference

ConceptFormula / RuleKey property
MLE under Gaussian noisemin ∑ riTΩiriLeast squares is not arbitrary
Normal equations (linear LS)(ATA)x = ATbClosed form, one step
Jacobian Ji∂ri/∂x at x̄Local linear sensitivity matrix
Gauss-Newton stepδ = −(JTJ)−1JTrFast near solution, can diverge
LM stepδ = −(JTJ+λI)−1JTrSafe (monotone), always invertible
Estimate covarianceΣ = (JTΩJ)−1Singular = unobservable direction
Convergence caveatLocal only — initialization mattersUse RANSAC (L8) for good init

Links to this lesson's neighbors

VNAV L8: RANSAC — delivers the inlier set and rough initial estimate that NLS refines. RANSAC finds the right basin; NLS finds the bottom of that basin. The pipeline is always RANSAC → NLS, never NLS alone on raw (potentially outlier-contaminated) data.

VNAV L7: Two-View Geometry & Triangulation — triangulation is a linear least-squares problem: given two projections and known poses, find the 3D point minimizing reprojection error. The "DLT" (Direct Linear Transform) formulation stacks measurements into A, b and solves the normal equations.

VNAV L10 (manifold optimization) — when the state x includes a rotation or pose (an element of SO(3) or SE(3), not a vector space), you can't just add δ to x. The GN/LM framework extends to Lie groups by replacing x ← x + δ with x ← x · exp(𝚽δ), retraction on the manifold. L2 (Lie Groups) provides the mathematical scaffolding.

VNAV L13 (SLAM / bundle adjustment) — a SLAM factor graph is a massive sparse NLS problem over all poses and landmarks. GTSAM's NonlinearFactorGraph is exactly GN/LM applied to this large sparse system, exploiting sparsity in JTJ via Cholesky or QR on the sparse factor graph. This lesson is the engine behind SLAM.

Common failure modes and fixes

SymptomCauseFix
Cost increases after stepBad init, nonlinear regionUse LM; improve initialization
Oscillates, never convergesGN overshootingSwitch to LM or add line search
IndeterminantLinearSystemException (GTSAM)JTJ singular (unobservable)Add prior factors; check observability
Converges to wrong answerWrong local minimumBetter initialization (RANSAC output)
Huge estimated covariance in one directionNear-singular Jacobian directionAdd measurements that excite that direction
The Feynman test for this lesson: Can you explain to someone why minimizing the sum of squared residuals is the right thing to do (not just a convention), derive the normal equations by setting a gradient to zero, explain what the Jacobian tells you geometrically, and describe in words what happens when you add λI to the GN normal equations? If yes, you understand estimation at the level needed to read the GTSAM documentation, SLAM papers, and VIO system descriptions without getting lost.
"The goal is to turn data and a model into an estimate, together with a characterization of how good that estimate is. The least-squares framework does exactly this — and the Jacobian is the key that unlocks both the estimate and its uncertainty."
— paraphrased from Luca Carlone, MIT 16.485 Lecture Notes