Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 12–13

Feature Detection & Tracking

A robot navigating by camera must re-find the same physical points across frames — corners of buildings, edges of signs, specks on a surface. A flat wall gives nothing to grab. An edge is ambiguous: you can locate it across its direction but it slides freely along itself. Only corners lock a point in every direction. This lesson builds the full pipeline from image gradients, through the structure tensor and its eigenvalues, to Harris corners, SIFT-style descriptors, ratio-test matching, and KLT optical-flow tracking. The same matrix G runs through all of it. MIT 16.485 by Luca Carlone, Lectures 12–13.

Prerequisites: VNAV L5 (image formation & projection). Basic partial derivatives. No advanced linear algebra beyond 2×2 eigenvalues.

Chapters

Live Canvases

Derived

From First Principles

Chapter 0: The Re-Finding Problem

Your autonomous vehicle is cruising down a street. Frame 1: a camera captures a building with a fire-escape corner. Frame 2: the vehicle has moved 0.2 m forward. Where did that corner go in frame 2? If you can answer that question for hundreds of points across dozens of frames, you have enough to compute the vehicle's trajectory, the 3D structure of the scene, and eventually a navigable map. Feature tracking is the front end of all of visual navigation.

But not every pixel is equally trackable. Stare at a patch of blank wall: any small shift of the camera produces a patch that looks identical. There is no way to know if you moved at all. Now stare at a painted horizontal stripe: you can tell if it moved vertically (across the stripe) but not horizontally (along the stripe). Now stare at a corner of a window frame: any motion in any direction changes the patch's appearance. Only corners give you unambiguous information in every direction.

The aperture problem, stated. When you look through a small hole (aperture) at a moving line, you can only perceive motion perpendicular to the line — motion along the line is invisible. Mathematically, the gradient is zero along the edge direction, so the component of motion in that direction is unobservable. Corners break the symmetry: they have gradients in multiple directions, making all motion components observable.

What we need from a feature

A good feature must be repeatable (you can re-detect it from a nearby viewpoint), distinctive (its local appearance is unique — you won't confuse it with a feature 100 pixels away), and localizable (you can pin it down to a precise pixel). Flat regions fail all three. Edges fail distinctiveness and partially fail localizability. Corners pass all three.

Aperture problem — drag the window

Left panel: a synthetic image with a flat region, an edge, and a corner. The orange aperture window follows the slider. Right panel: shows which motion directions are detectable inside that window. Flat = no direction; Edge = one direction; Corner = all directions.

Window position edge

A camera looks at a uniform white wall. You move the camera 10 pixels to the right. What does the KLT tracker report for most tracked points on the wall?

Each point moves exactly 10 pixels to the right — the tracker works perfectly The tracker reports zero motion because the wall is not a feature The tracker fails or returns arbitrary results — a uniform region has no gradient information to anchor the solution Exactly one point on the wall is trackable at any given time

Chapter 1: Image Gradients

The foundation of all feature detectors is the image gradient: how quickly does pixel intensity change as you move in the x (horizontal) or y (vertical) direction? Formally, for a continuous intensity function I(u, v), the gradient is the vector of partial derivatives:

∇I(u,v) = [ ∂I⁄∂u , ∂I⁄∂v ]^T

In practice images are discrete pixel grids, so we approximate derivatives with finite differences. The simplest estimate at pixel (u, v) is:

∂I⁄∂u ≈ I(u+1, v) − I(u, v) ∂I⁄∂v ≈ I(u, v+1) − I(u, v)

Better in practice: use the Sobel operator, a 3×3 kernel that combines finite differences with a small blur to reduce noise sensitivity. The Sobel x-kernel is [−1, 0, 1] replicated vertically with weights [1, 2, 1] — it measures the horizontal rate of change while smoothing vertically.

Gradient magnitude and direction. At any pixel, magnitude |∇I| = √(I_u² + I_v²) tells you how strong the local edge is. Direction θ = atan2(I_v, I_u) tells you which way the intensity rises fastest. An edge runs perpendicular to the gradient direction. A corner has gradients pointing in many different directions within a small window.

Worked numbers

Consider a tiny 3×3 patch of intensity values (row 0 is top):

patch
10  10  10
10  80  80    # ← strong horizontal step at row 1
10  80  80

At the center pixel (1,1): I_u = I(2,1) − I(0,1) = 80 − 10 = 70 (right minus left); I_v = I(1,2) − I(1,0) = 80 − 10 = 70 (bottom minus top). The gradient is [70, 70]^T, pointing diagonally up-right at 45°. This pixel is at a corner of the step — the detector will love it.

Misconception: gradients detect edges, not corners. The gradient at a single pixel only tells you about intensity change at that location. A corner is not a property of one pixel — it is a property of the distribution of gradients across a window. That's what the structure tensor in Ch 2 captures.

In a region of a horizontal edge (intensity = 0 above, 255 below), which gradient component is large?

I_u (horizontal gradient) is large; I_v (vertical gradient) is near zero I_v (vertical gradient) is large; I_u (horizontal gradient) is near zero Both components are equal Neither component is large — edges are invisible to gradient detectors

Chapter 2: The Structure Tensor

A single gradient vector tells us the local edge direction at one pixel. To decide whether a pixel is a corner, we need to ask a bigger question: does shifting a window in any small direction cause the appearance to change? This is precisely what the structure tensor (also called the second-moment matrix or the Harris matrix) captures.

Derivation from the Taylor expansion

Define the sum-of-squared-differences when you shift window W around pixel x̄ by displacement δ = [δ_u, δ_v]^T:

E(δ) = ∑_{x ∈ W(x̄)} [I(x + δ) − I(x)]²

For a small δ, Taylor-expand: I(x + δ) ≈ I(x) + ∇I(x)^Tδ. Substituting:

E(δ) ≈ ∑_{x ∈ W} [∇I(x)^Tδ]² = δ^T · G · δ

where the structure tensor G is:

G = ∑_{x ∈ W} ∇I(x) ∇I(x)^T = ∑_{x ∈ W}

I_u²	I_uI_v
I_uI_v	I_v²

G is a 2×2 symmetric positive-semidefinite matrix. Its eigenvalues λ₁ ≥ λ₂ ≥ 0 measure how much E(δ) changes in the two principal directions. The minimum change over all unit directions δ equals λ₂ (the smaller eigenvalue). So: a good corner has large λ₂.

Eigenvalue intuition: the error ellipse. The quadratic form δ^TGδ draws an ellipse in (δ_u, δ_v) space. Large λ₁ and λ₂ mean the ellipse is fat in both directions — any shift you make changes the appearance. Small λ₂ means the ellipse is very thin in one direction — there is a direction of shift that changes nothing. That thin direction is exactly the aperture-problem direction along the dominant edge.

Worked numbers: hand-computing G from a tiny patch

Use the 3×3 patch from Chapter 1. At the four interior positions (we'll use just the center pixel for simplicity), suppose the gradient is I_u = 70, I_v = 70 at center and zero everywhere else (as an idealization). Then:

G = [[70, 70]]^T[[70, 70]] = [[4900, 4900], [4900, 4900]]

Eigenvalues: tr(G) = 9800, det(G) = 4900×4900 − 4900×4900 = 0. So λ₁ = 9800, λ₂ = 0. This says: edge, not a corner. The gradient only points in one direction across the window, giving rank-1 G. A real corner would have gradients in multiple directions, giving rank-2 G with two large eigenvalues.

Structure tensor eigenvalue explorer

Adjust the dominant gradient direction and a secondary gradient direction. The canvas shows gradient vectors in the window, the resulting G matrix values, eigenvalues λ₁ and λ₂, the error ellipse, and a flat/edge/corner verdict.

Primary gradient angle (°) 45

Secondary gradient strength (0=edge, 1=corner) 0.00

A window's structure tensor has eigenvalues λ₁ = 1200 and λ₂ = 8. What is this region?

A strong corner — both eigenvalues are positive An edge — one eigenvalue is large, the other is near zero A flat region — the minimum eigenvalue is too small to detect anything Ambiguous without knowing the trace

Chapter 3: Harris & Shi-Tomasi Scores

Computing eigenvalues of G at every pixel is expensive. Both classical corner detectors approximate the eigenvalue criterion with a simpler scalar score. The key insight: for a 2×2 matrix, det(G) = λ₁λ₂ and tr(G) = λ₁ + λ₂ — both are cheap to compute without finding actual eigenvalues.

Harris corner response (1988)

Harris and Stephens proposed the score:

C(G) = det(G) − k · tr(G)² = λ₁λ₂ − k(λ₁ + λ₂)²

where k ≈ 0.04–0.06 is a tuning constant. Let's see why this works by case analysis:

Flat region: λ₁ ≈ λ₂ ≈ 0 ⇒ C ≈ 0 (correctly: no corner)
Edge: λ₁ ≫ λ₂ ≈ 0 ⇒ det(G) ≈ 0 but tr(G)² ≈ λ₁² large ⇒ C ≈ −kλ₁² < 0 (correctly: penalizes edges)
Corner: λ₁ ≈ λ₂ ≫ 0 ⇒ C ≈ (1−4k)λ₁² > 0 (correctly: rewards balanced eigenvalues)

Shi-Tomasi score (1994)

Shi and Tomasi simplified further: just use the minimum eigenvalue directly.

S(G) = λ_min(G) = min(λ₁, λ₂)

A pixel is a corner if S exceeds a threshold. This is called "Good Features to Track" and is slightly simpler to tune. Both detectors work by: compute G over a window → compute score → threshold → apply non-maximum suppression (keep only local maxima within a radius).

Worked numbers: Harris score from scratch

A 3-pixel window has gradients: pixel A: (I_u, I_v) = (40, 0); pixel B: (0, 40); pixel C: (30, 30). k = 0.05.

G = [[40,0]]^T[[40,0]] + [[0,40]]^T[[0,40]] + [[30,30]]^T[[30,30]]

G = [[1600,0],[0,0]] + [[0,0],[0,1600]] + [[900,900],[900,900]]

G = [[2500, 900], [900, 2500]]

det(G) = 2500×2500 − 900×900 = 6,250,000 − 810,000 = 5,440,000

tr(G) = 5000 ⇒ tr(G)² = 25,000,000

C = 5,440,000 − 0.05 × 25,000,000 = 5,440,000 − 1,250,000 = 4,190,000

Large positive: this is a strong corner. Eigenvalues: λ = (5000 ± √(2500²−4×5,440,000))⁄2 = (5000 ± 3200)⁄2 ⇒ λ₁ = 4100, λ₂ = 900. Both large, roughly balanced: corner confirmed.

Harris response heatmap on a synthetic scene

A synthetic image with flat regions, edges, and corners. Warm = high Harris response. Toggle threshold to see detected corners (teal dots). Adjust k to see how the flat/edge/corner balance changes.

Harris k 0.05

Threshold (%) 30%

Show Heatmap Corners

Misconception: more features = better navigation. A feature you can't re-find across frames is noise that corrupts your pose estimator. Better to have 50 unambiguous corners than 500 weak edge responses. Non-maximum suppression and a minimum-quality threshold exist for this reason.

For a pixel on a straight horizontal edge, what is the sign of the Harris response C?

Positive and large — edges score well Zero — edges produce no response Negative — the −k·tr² term dominates when one eigenvalue is large and the other near zero Depends on k — can be positive or negative

Chapter 4: Scale & Blob Detection

Harris corners are not scale-invariant: a corner at scale 1 may look like a flat region at scale 4 (zoom out and the fine structure blurs away). For matching across large viewpoint or distance changes — a car seen at 5 m then 50 m — we need to detect features at their "natural" scale.

The scale-space pyramid

Build a Gaussian pyramid: repeatedly blur and downsample the image. Each level is a coarser view of the scene. A feature that is a corner at the correct scale will show a strong response at that level and weak responses at others. Scale-invariant detectors search for extrema across both space AND scale simultaneously.

L(u, v, σ) = G(σ) * I(u, v) (Gaussian blurred image at scale σ)

Difference of Gaussians (DoG) — the SIFT blob detector

SIFT detects blobs (rounded regions of distinct intensity) using the Difference of Gaussians, an efficient approximation to the Laplacian of Gaussian (LoG). For two adjacent pyramid levels with scales σ and kσ:

DoG(u, v, σ) = L(u, v, kσ) − L(u, v, σ)

A keypoint is detected wherever DoG achieves a local extremum (maximum or minimum) over a 3×3×3 neighborhood in (u, v, σ) space — 26 comparisons in total. This gives a (u, v, scale) triplet for each keypoint, providing built-in scale information.

Why blobs, not just corners? On a building at 50 m, individual corners may be too small to detect reliably. But the window's own blob scale (determined by where the DoG response peaks) gives the detector a natural "zoom level" to use when computing descriptors. SIFT's extraordinary scale-invariance comes from always measuring descriptors at the detected scale, not at fixed pixel sizes.

Scale normalization in brief

At each detected keypoint (u, v, σ*), SIFT sets a patch size proportional to σ* before computing its descriptor. This way a feature on a distant building and the same feature when the car is close will produce the same descriptor — the scale information is baked in.

ORB skips scale-space but stays fast. ORB (Oriented FAST and Rotated BRIEF) uses a simpler detector (FAST corners) and a binary descriptor, running 100× faster than SIFT. It handles scale by running on an image pyramid but without the full DoG extremum search. Real-time SLAM systems use ORB; offline reconstruction uses SIFT or SURF.

Why does SIFT detect keypoints in a 3D space (u, v, scale) rather than just 2D (u, v)?

To reduce computation time by searching fewer pixels To support 3D scene reconstruction directly To find each feature at its natural scale, enabling scale-invariant description and matching across large distance changes The third dimension encodes orientation, not scale

Chapter 5: Feature Descriptors

Detecting a corner gives you a pixel location. But how do you recognize the same corner in a completely different image, taken hours later or from a different camera? You need a descriptor: a compact fingerprint of the local appearance around a keypoint that is distinctive enough to match against, and robust to small rotations, lighting changes, and viewpoint shifts.

SIFT descriptor: gradient histograms

Around the keypoint (at its detected scale), SIFT samples a 16×16 pixel region, subdivided into a 4×4 grid of sub-patches. In each sub-patch, it builds an 8-bin histogram of gradient orientations, weighted by gradient magnitude. Concatenating 4×4×8 = 128 values gives the SIFT descriptor vector. Before concatenation, the keypoint is assigned a canonical orientation (the dominant gradient direction in the surrounding region) so that the descriptor is rotation-invariant.

d_SIFT ∈ ℝ¹²⁸ (128-element float vector, L2-normalized)

After computing d, SIFT L2-normalizes it and clips values above 0.2 (to reduce illumination-change sensitivity), then re-normalizes. Two SIFT descriptors are matched if their L2 distance is small enough.

Binary descriptors: ORB / BRIEF

SIFT's 128-float vector costs memory and comparison time. BRIEF (Binary Robust Independent Elementary Features) replaces it with a bit-string: for each of N pre-selected pixel-pair positions (p_i, q_i) in the patch, set bit i = 1 if I(p_i) > I(q_i), else 0. ORB uses BRIEF with 256 bits, with pairs chosen to be maximally uncorrelated (maximizing descriptor distinctiveness).

d_ORB ∈ {0,1}²⁵⁶ (256-bit string, Hamming distance for matching)

Comparing 256-bit strings takes one XOR + popcount — extremely fast on modern CPUs (hardware popcount instruction). ORB runs in real time on embedded systems like the Raspberry Pi.

Descriptor = a point's fingerprint. Descriptors exist so you can match WITHOUT tracking. If you lose a feature (occlusion, large motion jump), you can re-detect it in the new frame and match by descriptor. Tracking (Ch 7) assumes slow motion and uses intensity directly. Descriptor matching handles large viewpoint/time gaps.

Misconception: a corner you detect is uniquely matchable. Many corners look similar — a right-angle junction in a brick building repeats every 0.5 m. The ratio test (Ch 6) exists precisely because descriptors are never perfectly unique, and naive nearest-neighbor matching makes many false matches.

Why does SIFT assign a canonical orientation to each keypoint before computing the descriptor?

To speed up computation by aligning the 4×4 grid with image axes So the descriptor is the same regardless of image rotation — matching works even if the camera rotated between views To reduce the descriptor dimension from 128 to 64 The orientation is embedded as a separate component, not used to rotate the patch

Chapter 6: Matching & the Ratio Test

You have two sets of keypoints with descriptors — one from frame A, one from frame B. The naive approach: for each keypoint in A, find the nearest neighbor in B by descriptor distance. This almost always produces many false matches because many features look similar (repeated patterns, similar corners).

Lowe's ratio test

David Lowe proposed a simple and powerful filter: for each keypoint in A, find its two nearest neighbors in B — call their distances d₁ (best) and d₂ (second-best). Accept the match only if:

d₁ ⁄ d₂ < 0.8 (Lowe's ratio test threshold)

The intuition: if the best match is much closer than the second-best, the match is distinctive and likely correct. If d₁ ≈ d₂, the descriptor matches two candidates almost equally well — this is an ambiguous point (like one of many identical bricks) and we discard it.

Worked decision

Query keypoint q has descriptor distances: d₁ = 42 to match A, d₂ = 51 to match B. Ratio = 42/51 = 0.82 > 0.8 ⇒ rejected. Another query: d₁ = 31, d₂ = 78. Ratio = 31/78 = 0.40 < 0.8 ⇒ accepted. The second is much more distinctive.

Descriptor matching with ratio-test culling

Two synthetic images (left, right) with detected features (dots). Lines connect matched pairs. Adjust the ratio threshold to cull ambiguous matches. Green = high-confidence match; orange = borderline; red = rejected by ratio test.

Ratio threshold 0.80

Code: ratio test in Python

python
import numpy as np

def ratio_test_match(desc_A, desc_B, threshold=0.8):
    """
    desc_A: (N, D) float array of descriptors from frame A
    desc_B: (M, D) float array of descriptors from frame B
    Returns list of (i, j) index pairs that pass the ratio test.
    """
    matches = []
    for i, da in enumerate(desc_A):
        # L2 distance to all descriptors in B
        dists = np.linalg.norm(desc_B - da, axis=1)
        # Two nearest neighbours
        idx = np.argsort(dists)[:2]
        d1, d2 = dists[idx[0]], dists[idx[1]]
        if d1 / d2 < threshold:
            matches.append((i, idx[0]))
    return matches

A keypoint on a brick in a regular pattern has d₁=55, d₂=58 to its two nearest descriptor matches. Ratio = 0.948. What does Lowe's ratio test do with this match?

Accepts it — d₁ < d₂ so the nearest match is still better Rejects it — the two candidates are nearly tied, indicating an ambiguous / repeated-pattern feature Accepts it only if d₁ < 60 (absolute threshold) The ratio test only applies to binary descriptors, not float ones

Chapter 7: KLT Optical Flow

Descriptor matching is powerful but expensive. When frames are close in time (30 fps video), the same physical point doesn't move far in the image. KLT (Kanade-Lucas-Tomasi) tracking exploits this: instead of matching descriptors, it directly solves for the pixel displacement δ = [δ_u, δ_v]^T that makes the warped patch in frame 2 look most like the original patch in frame 1.

The brightness constancy assumption

The core assumption: the intensity of a point doesn't change as the camera moves. For a point moving with velocity (u, v) in pixel coordinates:

I(x + δ, t + 1) = I(x, t) (brightness constancy)

Expanding the left side via Taylor series (small δ):

I(x,t) + ∇I · δ + I_t = I(x,t) ⇒ ∇I · δ = −I_t

where I_t = I(x, t+1) − I(x, t) is the temporal derivative (how much intensity changed at this pixel between frames). This is the optical flow constraint equation: one equation in two unknowns (δ_u, δ_v). At a single pixel it is underdetermined — the aperture problem again.

The KLT solution: sum over a window

Write the constraint for every pixel y in a window W around x. For each y: I_u(y)δ_u + I_v(y)δ_v = −I_t(y). This is an over-determined system (many equations, 2 unknowns). In matrix form:

G · δ = b

where G is exactly the structure tensor from Chapter 2, and b = − ∑_y∈W ∇I(y) · I_t(y). The least-squares solution:

δ* = G⁻¹ b = [ ∑ ∇I ∇I^T ]⁻¹ [ −∑ ∇I I_t ]

G must be invertible for this to work — it must have two large eigenvalues — which is exactly the condition for a good corner. The structure tensor is why corners are trackable.

One matrix rules them all. The same structure tensor G appears in corner detection (Ch 2), Harris scoring (Ch 3), AND KLT tracking (Ch 7). Corner detection finds points where G is well-conditioned. KLT tracking works precisely at those points because a well-conditioned G gives a unique, stable δ*. They are two sides of the same coin.

Worked numbers: one KLT update step

A 3-pixel window. Gradients and temporal differences:

KLT step
# pixel 1: Iu=30, Iv=10, It=-15
# pixel 2: Iu=0,  Iv=40, It=20
# pixel 3: Iu=20, Iv=20, It=-10
G  = [[30,10],[30,10]]T + [[0,40],[0,40]]T + [[20,20],[20,20]]T
   = [[900,300],[300,100]] + [[0,0],[0,1600]] + [[400,400],[400,400]]
   = [[1300,700],[700,2100]]

b  = -([30,10]*(-15) + [0,40]*20 + [20,20]*(-10))
   = -(-[450,150] + [0,800] + -[200,200])
   = -[-650, 450]   =  [650, -450]

det(G) = 1300*2100 - 700*700 = 2,730,000 - 490,000 = 2,240,000
G_inv = (1/2240000)*[[2100,-700],[-700,1300]]
δ* = G_inv @ b:
  δu = (2100*650 + (-700)*(-450)) / 2240000
     = (1,365,000 + 315,000) / 2,240,000
     = 1,680,000 / 2,240,000  ≈ 0.75 pixels
  δv = (-700*650 + 1300*(-450)) / 2240000
     = (-455,000 - 585,000) / 2,240,000
     = -1,040,000 / 2,240,000 ≈ -0.46 pixels

What breaks KLT tracking

KLT fails when its assumptions are violated. Four killers:

Failure mode	Why	Symptom
Low texture	G near-singular (λ₂≈0) — no unique solution	Track drifts randomly
Large motion	Taylor expansion invalid for big δ	Track jumps to wrong patch
Illumination change	Brightness constancy violated	Temporal derivative I_t misleads δ*
Occlusion	Point disappears behind another surface	Track follows occluder's edge

Pyramidal KLT

Large motions are handled by coarse-to-fine tracking: solve KLT on a small (blurry) version of the image first, where the large motion becomes small. Use that estimate to warp the full-resolution image, then refine with another KLT step. Typically 3–4 pyramid levels, allowing tracking of 30–50 pixel displacements robustly.

KLT fails to track a feature on a blank white wall. The structure tensor G at that location has eigenvalues λ₁=2000, λ₂=0.5. What is the mathematical reason for failure?

G is not symmetric so it cannot be inverted The temporal derivative I_t is too large for the Taylor expansion to be valid G is nearly singular (λ₂≈0), so G⁻¹ is numerically unstable — the solution δ*=G⁻¹b can be anything The brightness constancy assumption fails on white surfaces

Chapter 8: Showcase — Full Tracking Simulation

This is the payoff: watch KLT tracking in action on a simulated camera moving over a scene with corners, edges, and flat regions. Use the toggles to trigger tracking failures and see exactly when and why tracks die.

KLT feature tracking simulator

A synthetic scene (circles = corners, lines = edges, gray rectangles = flat regions) moves past the camera. Teal trails = active KLT tracks. Red × = lost track. Use sliders to introduce failures and see tracks die.

Camera speed (px/frame) 5

Illumination flicker 0.00

Occlusion band

Chapter 9: Connections & Cheat Sheet

The complete pipeline in one diagram

Image I₁, I₂

Raw frames from camera

↓

Compute gradients I_u, I_v

Sobel or finite differences → ∇I at each pixel

↓

Structure tensor G

G = ∑_W ∇I ∇I^T over a window

↓

Corner score (Harris / Shi-Tomasi)

C = det(G)−k·tr(G)² or S = λ_min(G); NMS + threshold → keypoints

↓ (choose one)

KLT tracking (consecutive frames)

δ* = G⁻¹b using same G; works for small inter-frame motion

↓ or ↓

Descriptor matching (any pair of frames)

SIFT 128-float or ORB 256-bit; ratio test d₁/d₂<0.8

↓

Matched point pairs (x₁, x₂)

Feed into Two-View Geometry (L7) to recover R, t

Cheat sheet

Concept	Rule of thumb	Math handle
Flat region	λ₁ ≈ λ₂ ≈ 0	G ≈ 0; C≈0
Edge	λ₁≫λ₂≈0	C < 0 (Harris penalizes)
Corner	λ₁≈λ₂≫0	C≫0; S=λ₂≫0
KLT trackable	G invertible (λ₂≫0)	δ*=G⁻¹b unique
Ratio test pass	d₁/d₂<0.8	Distinctive match
Ratio test fail	d₁≈d₂	Ambiguous / repeated pattern
Scale invariance	DoG extremum in (u,v,σ)	SIFT keypoint at natural scale

What breaks tracking

Break	Physics	Detector fix
Low texture	G singular	Detect corners only (both λ large)
Large motion	Taylor invalid	Pyramidal KLT (coarse-to-fine)
Illumination	Brightness constancy violated	Normalize patch; use binary descriptors
Occlusion	Point disappears	Forward-backward check; outlier rejection
Repeated patterns	d₁≈d₂	Ratio test rejects ambiguous matches

Links within this series

Previous: VNAV L5 — Image Formation (the projection pipeline that turns 3D points into pixels — the front end of everything here).

Next: VNAV L7 — Two-View Geometry. The matched point pairs (x₁, x₂) produced by this lesson feed directly into the essential matrix, which recovers relative camera rotation R and translation t from 5 or 8 point correspondences.

Related Gleams

Classical VIO — feature tracks become the measurements for an EKF that estimates camera pose and IMU bias. | Modern SLAM — ORB-SLAM3 uses exactly the ORB detector described here for relocalization. | Kalman Filter — KLT tracks become noisy measurements in the EKF update step.

You can now: Explain why corners beat edges for tracking (both eigenvalues of G large). Compute G from a gradient patch and classify flat/edge/corner. Derive the Harris score from eigenvalues. State why KLT uses the same G as corner detection. Apply the ratio test to accept/reject a descriptor match. Identify what breaks tracking and why.

"The structure tensor is one of the most beautiful objects in computer vision: a 2×2 matrix derived from raw pixel differences that simultaneously tells you what to track and how to track it."

— paraphrase of Luca Carlone, MIT 16.485 lectures