A robot navigating by camera must re-find the same physical points across frames — corners of buildings, edges of signs, specks on a surface. A flat wall gives nothing to grab. An edge is ambiguous: you can locate it across its direction but it slides freely along itself. Only corners lock a point in every direction. This lesson builds the full pipeline from image gradients, through the structure tensor and its eigenvalues, to Harris corners, SIFT-style descriptors, ratio-test matching, and KLT optical-flow tracking. The same matrix G runs through all of it. MIT 16.485 by Luca Carlone, Lectures 12–13.
Your autonomous vehicle is cruising down a street. Frame 1: a camera captures a building with a fire-escape corner. Frame 2: the vehicle has moved 0.2 m forward. Where did that corner go in frame 2? If you can answer that question for hundreds of points across dozens of frames, you have enough to compute the vehicle's trajectory, the 3D structure of the scene, and eventually a navigable map. Feature tracking is the front end of all of visual navigation.
But not every pixel is equally trackable. Stare at a patch of blank wall: any small shift of the camera produces a patch that looks identical. There is no way to know if you moved at all. Now stare at a painted horizontal stripe: you can tell if it moved vertically (across the stripe) but not horizontally (along the stripe). Now stare at a corner of a window frame: any motion in any direction changes the patch's appearance. Only corners give you unambiguous information in every direction.
A good feature must be repeatable (you can re-detect it from a nearby viewpoint), distinctive (its local appearance is unique — you won't confuse it with a feature 100 pixels away), and localizable (you can pin it down to a precise pixel). Flat regions fail all three. Edges fail distinctiveness and partially fail localizability. Corners pass all three.
Left panel: a synthetic image with a flat region, an edge, and a corner. The orange aperture window follows the slider. Right panel: shows which motion directions are detectable inside that window. Flat = no direction; Edge = one direction; Corner = all directions.
The foundation of all feature detectors is the image gradient: how quickly does pixel intensity change as you move in the x (horizontal) or y (vertical) direction? Formally, for a continuous intensity function I(u, v), the gradient is the vector of partial derivatives:
In practice images are discrete pixel grids, so we approximate derivatives with finite differences. The simplest estimate at pixel (u, v) is:
Better in practice: use the Sobel operator, a 3×3 kernel that combines finite differences with a small blur to reduce noise sensitivity. The Sobel x-kernel is [−1, 0, 1] replicated vertically with weights [1, 2, 1] — it measures the horizontal rate of change while smoothing vertically.
Consider a tiny 3×3 patch of intensity values (row 0 is top):
patch 10 10 10 10 80 80 # ← strong horizontal step at row 1 10 80 80
At the center pixel (1,1): Iu = I(2,1) − I(0,1) = 80 − 10 = 70 (right minus left); Iv = I(1,2) − I(1,0) = 80 − 10 = 70 (bottom minus top). The gradient is [70, 70]T, pointing diagonally up-right at 45°. This pixel is at a corner of the step — the detector will love it.
A single gradient vector tells us the local edge direction at one pixel. To decide whether a pixel is a corner, we need to ask a bigger question: does shifting a window in any small direction cause the appearance to change? This is precisely what the structure tensor (also called the second-moment matrix or the Harris matrix) captures.
Define the sum-of-squared-differences when you shift window W around pixel x̄ by displacement δ = [δu, δv]T:
For a small δ, Taylor-expand: I(x + δ) ≈ I(x) + ∇I(x)Tδ. Substituting:
where the structure tensor G is:
| Iu2 | IuIv |
| IuIv | Iv2 |
G is a 2×2 symmetric positive-semidefinite matrix. Its eigenvalues λ1 ≥ λ2 ≥ 0 measure how much E(δ) changes in the two principal directions. The minimum change over all unit directions δ equals λ2 (the smaller eigenvalue). So: a good corner has large λ2.
Use the 3×3 patch from Chapter 1. At the four interior positions (we'll use just the center pixel for simplicity), suppose the gradient is Iu = 70, Iv = 70 at center and zero everywhere else (as an idealization). Then:
Eigenvalues: tr(G) = 9800, det(G) = 4900×4900 − 4900×4900 = 0. So λ1 = 9800, λ2 = 0. This says: edge, not a corner. The gradient only points in one direction across the window, giving rank-1 G. A real corner would have gradients in multiple directions, giving rank-2 G with two large eigenvalues.
Adjust the dominant gradient direction and a secondary gradient direction. The canvas shows gradient vectors in the window, the resulting G matrix values, eigenvalues λ1 and λ2, the error ellipse, and a flat/edge/corner verdict.
Computing eigenvalues of G at every pixel is expensive. Both classical corner detectors approximate the eigenvalue criterion with a simpler scalar score. The key insight: for a 2×2 matrix, det(G) = λ1λ2 and tr(G) = λ1 + λ2 — both are cheap to compute without finding actual eigenvalues.
Harris and Stephens proposed the score:
where k ≈ 0.04–0.06 is a tuning constant. Let's see why this works by case analysis:
Shi and Tomasi simplified further: just use the minimum eigenvalue directly.
A pixel is a corner if S exceeds a threshold. This is called "Good Features to Track" and is slightly simpler to tune. Both detectors work by: compute G over a window → compute score → threshold → apply non-maximum suppression (keep only local maxima within a radius).
A 3-pixel window has gradients: pixel A: (Iu, Iv) = (40, 0); pixel B: (0, 40); pixel C: (30, 30). k = 0.05.
Large positive: this is a strong corner. Eigenvalues: λ = (5000 ± √(25002−4×5,440,000))⁄2 = (5000 ± 3200)⁄2 ⇒ λ1 = 4100, λ2 = 900. Both large, roughly balanced: corner confirmed.
A synthetic image with flat regions, edges, and corners. Warm = high Harris response. Toggle threshold to see detected corners (teal dots). Adjust k to see how the flat/edge/corner balance changes.
Harris corners are not scale-invariant: a corner at scale 1 may look like a flat region at scale 4 (zoom out and the fine structure blurs away). For matching across large viewpoint or distance changes — a car seen at 5 m then 50 m — we need to detect features at their "natural" scale.
Build a Gaussian pyramid: repeatedly blur and downsample the image. Each level is a coarser view of the scene. A feature that is a corner at the correct scale will show a strong response at that level and weak responses at others. Scale-invariant detectors search for extrema across both space AND scale simultaneously.
SIFT detects blobs (rounded regions of distinct intensity) using the Difference of Gaussians, an efficient approximation to the Laplacian of Gaussian (LoG). For two adjacent pyramid levels with scales σ and kσ:
A keypoint is detected wherever DoG achieves a local extremum (maximum or minimum) over a 3×3×3 neighborhood in (u, v, σ) space — 26 comparisons in total. This gives a (u, v, scale) triplet for each keypoint, providing built-in scale information.
At each detected keypoint (u, v, σ*), SIFT sets a patch size proportional to σ* before computing its descriptor. This way a feature on a distant building and the same feature when the car is close will produce the same descriptor — the scale information is baked in.
Detecting a corner gives you a pixel location. But how do you recognize the same corner in a completely different image, taken hours later or from a different camera? You need a descriptor: a compact fingerprint of the local appearance around a keypoint that is distinctive enough to match against, and robust to small rotations, lighting changes, and viewpoint shifts.
Around the keypoint (at its detected scale), SIFT samples a 16×16 pixel region, subdivided into a 4×4 grid of sub-patches. In each sub-patch, it builds an 8-bin histogram of gradient orientations, weighted by gradient magnitude. Concatenating 4×4×8 = 128 values gives the SIFT descriptor vector. Before concatenation, the keypoint is assigned a canonical orientation (the dominant gradient direction in the surrounding region) so that the descriptor is rotation-invariant.
After computing d, SIFT L2-normalizes it and clips values above 0.2 (to reduce illumination-change sensitivity), then re-normalizes. Two SIFT descriptors are matched if their L2 distance is small enough.
SIFT's 128-float vector costs memory and comparison time. BRIEF (Binary Robust Independent Elementary Features) replaces it with a bit-string: for each of N pre-selected pixel-pair positions (pi, qi) in the patch, set bit i = 1 if I(pi) > I(qi), else 0. ORB uses BRIEF with 256 bits, with pairs chosen to be maximally uncorrelated (maximizing descriptor distinctiveness).
Comparing 256-bit strings takes one XOR + popcount — extremely fast on modern CPUs (hardware popcount instruction). ORB runs in real time on embedded systems like the Raspberry Pi.
You have two sets of keypoints with descriptors — one from frame A, one from frame B. The naive approach: for each keypoint in A, find the nearest neighbor in B by descriptor distance. This almost always produces many false matches because many features look similar (repeated patterns, similar corners).
David Lowe proposed a simple and powerful filter: for each keypoint in A, find its two nearest neighbors in B — call their distances d1 (best) and d2 (second-best). Accept the match only if:
The intuition: if the best match is much closer than the second-best, the match is distinctive and likely correct. If d1 ≈ d2, the descriptor matches two candidates almost equally well — this is an ambiguous point (like one of many identical bricks) and we discard it.
Query keypoint q has descriptor distances: d1 = 42 to match A, d2 = 51 to match B. Ratio = 42/51 = 0.82 > 0.8 ⇒ rejected. Another query: d1 = 31, d2 = 78. Ratio = 31/78 = 0.40 < 0.8 ⇒ accepted. The second is much more distinctive.
Two synthetic images (left, right) with detected features (dots). Lines connect matched pairs. Adjust the ratio threshold to cull ambiguous matches. Green = high-confidence match; orange = borderline; red = rejected by ratio test.
python import numpy as np def ratio_test_match(desc_A, desc_B, threshold=0.8): """ desc_A: (N, D) float array of descriptors from frame A desc_B: (M, D) float array of descriptors from frame B Returns list of (i, j) index pairs that pass the ratio test. """ matches = [] for i, da in enumerate(desc_A): # L2 distance to all descriptors in B dists = np.linalg.norm(desc_B - da, axis=1) # Two nearest neighbours idx = np.argsort(dists)[:2] d1, d2 = dists[idx[0]], dists[idx[1]] if d1 / d2 < threshold: matches.append((i, idx[0])) return matches
Descriptor matching is powerful but expensive. When frames are close in time (30 fps video), the same physical point doesn't move far in the image. KLT (Kanade-Lucas-Tomasi) tracking exploits this: instead of matching descriptors, it directly solves for the pixel displacement δ = [δu, δv]T that makes the warped patch in frame 2 look most like the original patch in frame 1.
The core assumption: the intensity of a point doesn't change as the camera moves. For a point moving with velocity (u, v) in pixel coordinates:
Expanding the left side via Taylor series (small δ):
where It = I(x, t+1) − I(x, t) is the temporal derivative (how much intensity changed at this pixel between frames). This is the optical flow constraint equation: one equation in two unknowns (δu, δv). At a single pixel it is underdetermined — the aperture problem again.
Write the constraint for every pixel y in a window W around x. For each y: Iu(y)δu + Iv(y)δv = −It(y). This is an over-determined system (many equations, 2 unknowns). In matrix form:
where G is exactly the structure tensor from Chapter 2, and b = − ∑y∈W ∇I(y) · It(y). The least-squares solution:
G must be invertible for this to work — it must have two large eigenvalues — which is exactly the condition for a good corner. The structure tensor is why corners are trackable.
A 3-pixel window. Gradients and temporal differences:
KLT step # pixel 1: Iu=30, Iv=10, It=-15 # pixel 2: Iu=0, Iv=40, It=20 # pixel 3: Iu=20, Iv=20, It=-10 G = [[30,10],[30,10]]T + [[0,40],[0,40]]T + [[20,20],[20,20]]T = [[900,300],[300,100]] + [[0,0],[0,1600]] + [[400,400],[400,400]] = [[1300,700],[700,2100]] b = -([30,10]*(-15) + [0,40]*20 + [20,20]*(-10)) = -(-[450,150] + [0,800] + -[200,200]) = -[-650, 450] = [650, -450] det(G) = 1300*2100 - 700*700 = 2,730,000 - 490,000 = 2,240,000 G_inv = (1/2240000)*[[2100,-700],[-700,1300]] δ* = G_inv @ b: δu = (2100*650 + (-700)*(-450)) / 2240000 = (1,365,000 + 315,000) / 2,240,000 = 1,680,000 / 2,240,000 ≈ 0.75 pixels δv = (-700*650 + 1300*(-450)) / 2240000 = (-455,000 - 585,000) / 2,240,000 = -1,040,000 / 2,240,000 ≈ -0.46 pixels
KLT fails when its assumptions are violated. Four killers:
| Failure mode | Why | Symptom |
|---|---|---|
| Low texture | G near-singular (λ2≈0) — no unique solution | Track drifts randomly |
| Large motion | Taylor expansion invalid for big δ | Track jumps to wrong patch |
| Illumination change | Brightness constancy violated | Temporal derivative It misleads δ* |
| Occlusion | Point disappears behind another surface | Track follows occluder's edge |
Large motions are handled by coarse-to-fine tracking: solve KLT on a small (blurry) version of the image first, where the large motion becomes small. Use that estimate to warp the full-resolution image, then refine with another KLT step. Typically 3–4 pyramid levels, allowing tracking of 30–50 pixel displacements robustly.
This is the payoff: watch KLT tracking in action on a simulated camera moving over a scene with corners, edges, and flat regions. Use the toggles to trigger tracking failures and see exactly when and why tracks die.
A synthetic scene (circles = corners, lines = edges, gray rectangles = flat regions) moves past the camera. Teal trails = active KLT tracks. Red × = lost track. Use sliders to introduce failures and see tracks die.
| Concept | Rule of thumb | Math handle |
|---|---|---|
| Flat region | λ1 ≈ λ2 ≈ 0 | G ≈ 0; C≈0 |
| Edge | λ1≫λ2≈0 | C < 0 (Harris penalizes) |
| Corner | λ1≈λ2≫0 | C≫0; S=λ2≫0 |
| KLT trackable | G invertible (λ2≫0) | δ*=G−1b unique |
| Ratio test pass | d1/d2<0.8 | Distinctive match |
| Ratio test fail | d1≈d2 | Ambiguous / repeated pattern |
| Scale invariance | DoG extremum in (u,v,σ) | SIFT keypoint at natural scale |
| Break | Physics | Detector fix |
|---|---|---|
| Low texture | G singular | Detect corners only (both λ large) |
| Large motion | Taylor invalid | Pyramidal KLT (coarse-to-fine) |
| Illumination | Brightness constancy violated | Normalize patch; use binary descriptors |
| Occlusion | Point disappears | Forward-backward check; outlier rejection |
| Repeated patterns | d1≈d2 | Ratio test rejects ambiguous matches |
Previous: VNAV L5 — Image Formation (the projection pipeline that turns 3D points into pixels — the front end of everything here).
Next: VNAV L7 — Two-View Geometry. The matched point pairs (x1, x2) produced by this lesson feed directly into the essential matrix, which recovers relative camera rotation R and translation t from 5 or 8 point correspondences.
Classical VIO — feature tracks become the measurements for an EKF that estimates camera pose and IMU bias. | Modern SLAM — ORB-SLAM3 uses exactly the ORB detector described here for relocalization. | Kalman Filter — KLT tracks become noisy measurements in the EKF update step.
"The structure tensor is one of the most beautiful objects in computer vision: a 2×2 matrix derived from raw pixel differences that simultaneously tells you what to track and how to track it."
— paraphrase of Luca Carlone, MIT 16.485 lectures