Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 12–13

Feature Detection & Tracking

A robot navigating by camera must re-find the same physical points across frames — corners of buildings, edges of signs, specks on a surface. A flat wall gives nothing to grab. An edge is ambiguous: you can locate it across its direction but it slides freely along itself. Only corners lock a point in every direction. This lesson builds the full pipeline from image gradients, through the structure tensor and its eigenvalues, to Harris corners, SIFT-style descriptors, ratio-test matching, and KLT optical-flow tracking. The same matrix G runs through all of it. MIT 16.485 by Luca Carlone, Lectures 12–13.

Prerequisites: VNAV L5 (image formation & projection). Basic partial derivatives. No advanced linear algebra beyond 2×2 eigenvalues.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Re-Finding Problem

Your autonomous vehicle is cruising down a street. Frame 1: a camera captures a building with a fire-escape corner. Frame 2: the vehicle has moved 0.2 m forward. Where did that corner go in frame 2? If you can answer that question for hundreds of points across dozens of frames, you have enough to compute the vehicle's trajectory, the 3D structure of the scene, and eventually a navigable map. Feature tracking is the front end of all of visual navigation.

But not every pixel is equally trackable. Stare at a patch of blank wall: any small shift of the camera produces a patch that looks identical. There is no way to know if you moved at all. Now stare at a painted horizontal stripe: you can tell if it moved vertically (across the stripe) but not horizontally (along the stripe). Now stare at a corner of a window frame: any motion in any direction changes the patch's appearance. Only corners give you unambiguous information in every direction.

The aperture problem, stated. When you look through a small hole (aperture) at a moving line, you can only perceive motion perpendicular to the line — motion along the line is invisible. Mathematically, the gradient is zero along the edge direction, so the component of motion in that direction is unobservable. Corners break the symmetry: they have gradients in multiple directions, making all motion components observable.

What we need from a feature

A good feature must be repeatable (you can re-detect it from a nearby viewpoint), distinctive (its local appearance is unique — you won't confuse it with a feature 100 pixels away), and localizable (you can pin it down to a precise pixel). Flat regions fail all three. Edges fail distinctiveness and partially fail localizability. Corners pass all three.

Aperture problem — drag the window

Left panel: a synthetic image with a flat region, an edge, and a corner. The orange aperture window follows the slider. Right panel: shows which motion directions are detectable inside that window. Flat = no direction; Edge = one direction; Corner = all directions.

Window position edge
A camera looks at a uniform white wall. You move the camera 10 pixels to the right. What does the KLT tracker report for most tracked points on the wall?

Chapter 1: Image Gradients

The foundation of all feature detectors is the image gradient: how quickly does pixel intensity change as you move in the x (horizontal) or y (vertical) direction? Formally, for a continuous intensity function I(u, v), the gradient is the vector of partial derivatives:

∇I(u,v) = [ ∂I⁄∂u   ,   ∂I⁄∂v ]T

In practice images are discrete pixel grids, so we approximate derivatives with finite differences. The simplest estimate at pixel (u, v) is:

∂I⁄∂u ≈ I(u+1, v) − I(u, v)      ∂I⁄∂v ≈ I(u, v+1) − I(u, v)

Better in practice: use the Sobel operator, a 3×3 kernel that combines finite differences with a small blur to reduce noise sensitivity. The Sobel x-kernel is [−1, 0, 1] replicated vertically with weights [1, 2, 1] — it measures the horizontal rate of change while smoothing vertically.

Gradient magnitude and direction. At any pixel, magnitude |∇I| = √(Iu2 + Iv2) tells you how strong the local edge is. Direction θ = atan2(Iv, Iu) tells you which way the intensity rises fastest. An edge runs perpendicular to the gradient direction. A corner has gradients pointing in many different directions within a small window.

Worked numbers

Consider a tiny 3×3 patch of intensity values (row 0 is top):

patch
10  10  10
10  80  80    # ← strong horizontal step at row 1
10  80  80

At the center pixel (1,1): Iu = I(2,1) − I(0,1) = 80 − 10 = 70 (right minus left); Iv = I(1,2) − I(1,0) = 80 − 10 = 70 (bottom minus top). The gradient is [70, 70]T, pointing diagonally up-right at 45°. This pixel is at a corner of the step — the detector will love it.

Misconception: gradients detect edges, not corners. The gradient at a single pixel only tells you about intensity change at that location. A corner is not a property of one pixel — it is a property of the distribution of gradients across a window. That's what the structure tensor in Ch 2 captures.
In a region of a horizontal edge (intensity = 0 above, 255 below), which gradient component is large?

Chapter 2: The Structure Tensor

A single gradient vector tells us the local edge direction at one pixel. To decide whether a pixel is a corner, we need to ask a bigger question: does shifting a window in any small direction cause the appearance to change? This is precisely what the structure tensor (also called the second-moment matrix or the Harris matrix) captures.

Derivation from the Taylor expansion

Define the sum-of-squared-differences when you shift window W around pixel x̄ by displacement δ = [δu, δv]T:

E(δ) = ∑x ∈ W(x̄) [I(x + δ) − I(x)]2

For a small δ, Taylor-expand: I(x + δ) ≈ I(x) + ∇I(x)Tδ. Substituting:

E(δ) ≈ ∑x ∈ W [∇I(x)Tδ]2 = δT · G · δ

where the structure tensor G is:

G = ∑x ∈ W ∇I(x) ∇I(x)T = ∑x ∈ W
Iu2IuIv
IuIvIv2

G is a 2×2 symmetric positive-semidefinite matrix. Its eigenvalues λ1 ≥ λ2 ≥ 0 measure how much E(δ) changes in the two principal directions. The minimum change over all unit directions δ equals λ2 (the smaller eigenvalue). So: a good corner has large λ2.

Eigenvalue intuition: the error ellipse. The quadratic form δTGδ draws an ellipse in (δu, δv) space. Large λ1 and λ2 mean the ellipse is fat in both directions — any shift you make changes the appearance. Small λ2 means the ellipse is very thin in one direction — there is a direction of shift that changes nothing. That thin direction is exactly the aperture-problem direction along the dominant edge.

Worked numbers: hand-computing G from a tiny patch

Use the 3×3 patch from Chapter 1. At the four interior positions (we'll use just the center pixel for simplicity), suppose the gradient is Iu = 70, Iv = 70 at center and zero everywhere else (as an idealization). Then:

G = [[70, 70]]T[[70, 70]] = [[4900, 4900], [4900, 4900]]

Eigenvalues: tr(G) = 9800, det(G) = 4900×4900 − 4900×4900 = 0. So λ1 = 9800, λ2 = 0. This says: edge, not a corner. The gradient only points in one direction across the window, giving rank-1 G. A real corner would have gradients in multiple directions, giving rank-2 G with two large eigenvalues.

Structure tensor eigenvalue explorer

Adjust the dominant gradient direction and a secondary gradient direction. The canvas shows gradient vectors in the window, the resulting G matrix values, eigenvalues λ1 and λ2, the error ellipse, and a flat/edge/corner verdict.

Primary gradient angle (°) 45
Secondary gradient strength (0=edge, 1=corner) 0.00
A window's structure tensor has eigenvalues λ1 = 1200 and λ2 = 8. What is this region?

Chapter 3: Harris & Shi-Tomasi Scores

Computing eigenvalues of G at every pixel is expensive. Both classical corner detectors approximate the eigenvalue criterion with a simpler scalar score. The key insight: for a 2×2 matrix, det(G) = λ1λ2 and tr(G) = λ1 + λ2 — both are cheap to compute without finding actual eigenvalues.

Harris corner response (1988)

Harris and Stephens proposed the score:

C(G) = det(G) − k · tr(G)2 = λ1λ2 − k(λ1 + λ2)2

where k ≈ 0.04–0.06 is a tuning constant. Let's see why this works by case analysis:

Shi-Tomasi score (1994)

Shi and Tomasi simplified further: just use the minimum eigenvalue directly.

S(G) = λmin(G) = min(λ1, λ2)

A pixel is a corner if S exceeds a threshold. This is called "Good Features to Track" and is slightly simpler to tune. Both detectors work by: compute G over a window → compute score → threshold → apply non-maximum suppression (keep only local maxima within a radius).

Worked numbers: Harris score from scratch

A 3-pixel window has gradients: pixel A: (Iu, Iv) = (40, 0); pixel B: (0, 40); pixel C: (30, 30). k = 0.05.

G = [[40,0]]T[[40,0]] + [[0,40]]T[[0,40]] + [[30,30]]T[[30,30]]
G = [[1600,0],[0,0]] + [[0,0],[0,1600]] + [[900,900],[900,900]]
G = [[2500, 900], [900, 2500]]
det(G) = 2500×2500 − 900×900 = 6,250,000 − 810,000 = 5,440,000
tr(G) = 5000   ⇒   tr(G)2 = 25,000,000
C = 5,440,000 − 0.05 × 25,000,000 = 5,440,000 − 1,250,000 = 4,190,000

Large positive: this is a strong corner. Eigenvalues: λ = (5000 ± √(25002−4×5,440,000))⁄2 = (5000 ± 3200)⁄2 ⇒ λ1 = 4100, λ2 = 900. Both large, roughly balanced: corner confirmed.

Harris response heatmap on a synthetic scene

A synthetic image with flat regions, edges, and corners. Warm = high Harris response. Toggle threshold to see detected corners (teal dots). Adjust k to see how the flat/edge/corner balance changes.

Harris k 0.05
Threshold (%) 30%
Show
Misconception: more features = better navigation. A feature you can't re-find across frames is noise that corrupts your pose estimator. Better to have 50 unambiguous corners than 500 weak edge responses. Non-maximum suppression and a minimum-quality threshold exist for this reason.
For a pixel on a straight horizontal edge, what is the sign of the Harris response C?

Chapter 4: Scale & Blob Detection

Harris corners are not scale-invariant: a corner at scale 1 may look like a flat region at scale 4 (zoom out and the fine structure blurs away). For matching across large viewpoint or distance changes — a car seen at 5 m then 50 m — we need to detect features at their "natural" scale.

The scale-space pyramid

Build a Gaussian pyramid: repeatedly blur and downsample the image. Each level is a coarser view of the scene. A feature that is a corner at the correct scale will show a strong response at that level and weak responses at others. Scale-invariant detectors search for extrema across both space AND scale simultaneously.

L(u, v, σ) = G(σ) * I(u, v)      (Gaussian blurred image at scale σ)

Difference of Gaussians (DoG) — the SIFT blob detector

SIFT detects blobs (rounded regions of distinct intensity) using the Difference of Gaussians, an efficient approximation to the Laplacian of Gaussian (LoG). For two adjacent pyramid levels with scales σ and kσ:

DoG(u, v, σ) = L(u, v, kσ) − L(u, v, σ)

A keypoint is detected wherever DoG achieves a local extremum (maximum or minimum) over a 3×3×3 neighborhood in (u, v, σ) space — 26 comparisons in total. This gives a (u, v, scale) triplet for each keypoint, providing built-in scale information.

Why blobs, not just corners? On a building at 50 m, individual corners may be too small to detect reliably. But the window's own blob scale (determined by where the DoG response peaks) gives the detector a natural "zoom level" to use when computing descriptors. SIFT's extraordinary scale-invariance comes from always measuring descriptors at the detected scale, not at fixed pixel sizes.

Scale normalization in brief

At each detected keypoint (u, v, σ*), SIFT sets a patch size proportional to σ* before computing its descriptor. This way a feature on a distant building and the same feature when the car is close will produce the same descriptor — the scale information is baked in.

ORB skips scale-space but stays fast. ORB (Oriented FAST and Rotated BRIEF) uses a simpler detector (FAST corners) and a binary descriptor, running 100× faster than SIFT. It handles scale by running on an image pyramid but without the full DoG extremum search. Real-time SLAM systems use ORB; offline reconstruction uses SIFT or SURF.
Why does SIFT detect keypoints in a 3D space (u, v, scale) rather than just 2D (u, v)?

Chapter 5: Feature Descriptors

Detecting a corner gives you a pixel location. But how do you recognize the same corner in a completely different image, taken hours later or from a different camera? You need a descriptor: a compact fingerprint of the local appearance around a keypoint that is distinctive enough to match against, and robust to small rotations, lighting changes, and viewpoint shifts.

SIFT descriptor: gradient histograms

Around the keypoint (at its detected scale), SIFT samples a 16×16 pixel region, subdivided into a 4×4 grid of sub-patches. In each sub-patch, it builds an 8-bin histogram of gradient orientations, weighted by gradient magnitude. Concatenating 4×4×8 = 128 values gives the SIFT descriptor vector. Before concatenation, the keypoint is assigned a canonical orientation (the dominant gradient direction in the surrounding region) so that the descriptor is rotation-invariant.

dSIFT ∈ ℝ128     (128-element float vector, L2-normalized)

After computing d, SIFT L2-normalizes it and clips values above 0.2 (to reduce illumination-change sensitivity), then re-normalizes. Two SIFT descriptors are matched if their L2 distance is small enough.

Binary descriptors: ORB / BRIEF

SIFT's 128-float vector costs memory and comparison time. BRIEF (Binary Robust Independent Elementary Features) replaces it with a bit-string: for each of N pre-selected pixel-pair positions (pi, qi) in the patch, set bit i = 1 if I(pi) > I(qi), else 0. ORB uses BRIEF with 256 bits, with pairs chosen to be maximally uncorrelated (maximizing descriptor distinctiveness).

dORB ∈ {0,1}256     (256-bit string, Hamming distance for matching)

Comparing 256-bit strings takes one XOR + popcount — extremely fast on modern CPUs (hardware popcount instruction). ORB runs in real time on embedded systems like the Raspberry Pi.

Descriptor = a point's fingerprint. Descriptors exist so you can match WITHOUT tracking. If you lose a feature (occlusion, large motion jump), you can re-detect it in the new frame and match by descriptor. Tracking (Ch 7) assumes slow motion and uses intensity directly. Descriptor matching handles large viewpoint/time gaps.
Misconception: a corner you detect is uniquely matchable. Many corners look similar — a right-angle junction in a brick building repeats every 0.5 m. The ratio test (Ch 6) exists precisely because descriptors are never perfectly unique, and naive nearest-neighbor matching makes many false matches.
Why does SIFT assign a canonical orientation to each keypoint before computing the descriptor?

Chapter 6: Matching & the Ratio Test

You have two sets of keypoints with descriptors — one from frame A, one from frame B. The naive approach: for each keypoint in A, find the nearest neighbor in B by descriptor distance. This almost always produces many false matches because many features look similar (repeated patterns, similar corners).

Lowe's ratio test

David Lowe proposed a simple and powerful filter: for each keypoint in A, find its two nearest neighbors in B — call their distances d1 (best) and d2 (second-best). Accept the match only if:

d1 ⁄ d2 < 0.8     (Lowe's ratio test threshold)

The intuition: if the best match is much closer than the second-best, the match is distinctive and likely correct. If d1 ≈ d2, the descriptor matches two candidates almost equally well — this is an ambiguous point (like one of many identical bricks) and we discard it.

Worked decision

Query keypoint q has descriptor distances: d1 = 42 to match A, d2 = 51 to match B. Ratio = 42/51 = 0.82 > 0.8 ⇒ rejected. Another query: d1 = 31, d2 = 78. Ratio = 31/78 = 0.40 < 0.8 ⇒ accepted. The second is much more distinctive.

Descriptor matching with ratio-test culling

Two synthetic images (left, right) with detected features (dots). Lines connect matched pairs. Adjust the ratio threshold to cull ambiguous matches. Green = high-confidence match; orange = borderline; red = rejected by ratio test.

Ratio threshold 0.80

Code: ratio test in Python

python
import numpy as np

def ratio_test_match(desc_A, desc_B, threshold=0.8):
    """
    desc_A: (N, D) float array of descriptors from frame A
    desc_B: (M, D) float array of descriptors from frame B
    Returns list of (i, j) index pairs that pass the ratio test.
    """
    matches = []
    for i, da in enumerate(desc_A):
        # L2 distance to all descriptors in B
        dists = np.linalg.norm(desc_B - da, axis=1)
        # Two nearest neighbours
        idx = np.argsort(dists)[:2]
        d1, d2 = dists[idx[0]], dists[idx[1]]
        if d1 / d2 < threshold:
            matches.append((i, idx[0]))
    return matches
A keypoint on a brick in a regular pattern has d1=55, d2=58 to its two nearest descriptor matches. Ratio = 0.948. What does Lowe's ratio test do with this match?

Chapter 7: KLT Optical Flow

Descriptor matching is powerful but expensive. When frames are close in time (30 fps video), the same physical point doesn't move far in the image. KLT (Kanade-Lucas-Tomasi) tracking exploits this: instead of matching descriptors, it directly solves for the pixel displacement δ = [δu, δv]T that makes the warped patch in frame 2 look most like the original patch in frame 1.

The brightness constancy assumption

The core assumption: the intensity of a point doesn't change as the camera moves. For a point moving with velocity (u, v) in pixel coordinates:

I(x + δ, t + 1) = I(x, t)      (brightness constancy)

Expanding the left side via Taylor series (small δ):

I(x,t) + ∇I · δ + It = I(x,t)   ⇒   ∇I · δ = −It

where It = I(x, t+1) − I(x, t) is the temporal derivative (how much intensity changed at this pixel between frames). This is the optical flow constraint equation: one equation in two unknowns (δu, δv). At a single pixel it is underdetermined — the aperture problem again.

The KLT solution: sum over a window

Write the constraint for every pixel y in a window W around x. For each y: Iu(y)δu + Iv(y)δv = −It(y). This is an over-determined system (many equations, 2 unknowns). In matrix form:

G · δ = b

where G is exactly the structure tensor from Chapter 2, and b = − ∑y∈W ∇I(y) · It(y). The least-squares solution:

δ* = G−1 b = [ ∑ ∇I ∇IT ]−1 [ −∑ ∇I It ]

G must be invertible for this to work — it must have two large eigenvalues — which is exactly the condition for a good corner. The structure tensor is why corners are trackable.

One matrix rules them all. The same structure tensor G appears in corner detection (Ch 2), Harris scoring (Ch 3), AND KLT tracking (Ch 7). Corner detection finds points where G is well-conditioned. KLT tracking works precisely at those points because a well-conditioned G gives a unique, stable δ*. They are two sides of the same coin.

Worked numbers: one KLT update step

A 3-pixel window. Gradients and temporal differences:

KLT step
# pixel 1: Iu=30, Iv=10, It=-15
# pixel 2: Iu=0,  Iv=40, It=20
# pixel 3: Iu=20, Iv=20, It=-10
G  = [[30,10],[30,10]]T + [[0,40],[0,40]]T + [[20,20],[20,20]]T
   = [[900,300],[300,100]] + [[0,0],[0,1600]] + [[400,400],[400,400]]
   = [[1300,700],[700,2100]]

b  = -([30,10]*(-15) + [0,40]*20 + [20,20]*(-10))
   = -(-[450,150] + [0,800] + -[200,200])
   = -[-650, 450]   =  [650, -450]

det(G) = 1300*2100 - 700*700 = 2,730,000 - 490,000 = 2,240,000
G_inv = (1/2240000)*[[2100,-700],[-700,1300]]
δ* = G_inv @ b:
  δu = (2100*650 + (-700)*(-450)) / 2240000
     = (1,365,000 + 315,000) / 2,240,000
     = 1,680,000 / 2,240,000  ≈ 0.75 pixels
  δv = (-700*650 + 1300*(-450)) / 2240000
     = (-455,000 - 585,000) / 2,240,000
     = -1,040,000 / 2,240,000 ≈ -0.46 pixels

What breaks KLT tracking

KLT fails when its assumptions are violated. Four killers:

Failure modeWhySymptom
Low textureG near-singular (λ2≈0) — no unique solutionTrack drifts randomly
Large motionTaylor expansion invalid for big δTrack jumps to wrong patch
Illumination changeBrightness constancy violatedTemporal derivative It misleads δ*
OcclusionPoint disappears behind another surfaceTrack follows occluder's edge

Pyramidal KLT

Large motions are handled by coarse-to-fine tracking: solve KLT on a small (blurry) version of the image first, where the large motion becomes small. Use that estimate to warp the full-resolution image, then refine with another KLT step. Typically 3–4 pyramid levels, allowing tracking of 30–50 pixel displacements robustly.

KLT fails to track a feature on a blank white wall. The structure tensor G at that location has eigenvalues λ1=2000, λ2=0.5. What is the mathematical reason for failure?

Chapter 8: Showcase — Full Tracking Simulation

This is the payoff: watch KLT tracking in action on a simulated camera moving over a scene with corners, edges, and flat regions. Use the toggles to trigger tracking failures and see exactly when and why tracks die.

KLT feature tracking simulator

A synthetic scene (circles = corners, lines = edges, gray rectangles = flat regions) moves past the camera. Teal trails = active KLT tracks. Red × = lost track. Use sliders to introduce failures and see tracks die.

Camera speed (px/frame) 5
Illumination flicker 0.00

Chapter 9: Connections & Cheat Sheet

The complete pipeline in one diagram

Image I1, I2
Raw frames from camera
Compute gradients Iu, Iv
Sobel or finite differences → ∇I at each pixel
Structure tensor G
G = ∑W ∇I ∇IT over a window
Corner score (Harris / Shi-Tomasi)
C = det(G)−k·tr(G)2 or S = λmin(G); NMS + threshold → keypoints
↓ (choose one)
KLT tracking (consecutive frames)
δ* = G−1b using same G; works for small inter-frame motion
↓ or ↓
Descriptor matching (any pair of frames)
SIFT 128-float or ORB 256-bit; ratio test d1/d2<0.8
Matched point pairs (x1, x2)
Feed into Two-View Geometry (L7) to recover R, t

Cheat sheet

ConceptRule of thumbMath handle
Flat regionλ1 ≈ λ2 ≈ 0G ≈ 0; C≈0
Edgeλ1≫λ2≈0C < 0 (Harris penalizes)
Cornerλ1≈λ2≫0C≫0; S=λ2≫0
KLT trackableG invertible (λ2≫0)δ*=G−1b unique
Ratio test passd1/d2<0.8Distinctive match
Ratio test faild1≈d2Ambiguous / repeated pattern
Scale invarianceDoG extremum in (u,v,σ)SIFT keypoint at natural scale

What breaks tracking

BreakPhysicsDetector fix
Low textureG singularDetect corners only (both λ large)
Large motionTaylor invalidPyramidal KLT (coarse-to-fine)
IlluminationBrightness constancy violatedNormalize patch; use binary descriptors
OcclusionPoint disappearsForward-backward check; outlier rejection
Repeated patternsd1≈d2Ratio test rejects ambiguous matches

Links within this series

Previous: VNAV L5 — Image Formation (the projection pipeline that turns 3D points into pixels — the front end of everything here).

Next: VNAV L7 — Two-View Geometry. The matched point pairs (x1, x2) produced by this lesson feed directly into the essential matrix, which recovers relative camera rotation R and translation t from 5 or 8 point correspondences.

Related Gleams

Classical VIO — feature tracks become the measurements for an EKF that estimates camera pose and IMU bias.  |  Modern SLAM — ORB-SLAM3 uses exactly the ORB detector described here for relocalization.  |  Kalman Filter — KLT tracks become noisy measurements in the EKF update step.

You can now: Explain why corners beat edges for tracking (both eigenvalues of G large). Compute G from a gradient patch and classify flat/edge/corner. Derive the Harris score from eigenvalues. State why KLT uses the same G as corner detection. Apply the ratio test to accept/reject a descriptor match. Identify what breaks tracking and why.
"The structure tensor is one of the most beautiful objects in computer vision: a 2×2 matrix derived from raw pixel differences that simultaneously tells you what to track and how to track it."

— paraphrase of Luca Carlone, MIT 16.485 lectures