Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 14

Two-View Geometry: Epipolar & SfM

You have two photos of the same street corner, taken from slightly different spots. Hundreds of feature correspondences link pixels between the images. From those pairs alone — no GPS, no IMU — can you deduce how much the camera moved, and where each matched point sits in 3D space? Yes. But the geometry connecting the two views imposes a tight constraint first: every correct match must lie on a specific line in the second image. That constraint is epipolar geometry, and it is the bridge from pixels to poses. MIT 16.485 by Luca Carlone, Lecture 14.

Prerequisites: VNAV L5 (intrinsic matrix K, back-projection) · VNAV L6 (feature correspondences) · VNAV L1/L2 (SE(3), skew/hat operator). No new tools beyond what you already know.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Two-Photo Problem

A drone flies past a building. Its forward-facing camera snaps two frames 0.3 seconds apart. Your feature tracker (Lecture 12–13) hands you 200 matched pixel pairs: point (u1, v1) in frame 1 corresponds to (u2, v2) in frame 2. Those pairs all picture the same physical points — a window corner, a ledge, a drainpipe. Now what?

The matched pixels alone don't tell you anything about how the camera moved. A 1-pixel shift could mean the drone moved 1 cm or 100 m, depending on the distance to the building. And each pixel is just a ray shooting out from the camera — the feature could be anywhere along that ray. You need more structure.

The saving grace is that the two cameras are geometrically related. They are both looking at the same rigid 3D world. That rigidity imposes a constraint between every correct correspondence — a constraint that can be written as a single equation involving only the pixel coordinates, the camera rotation R, and the translation t between the two views. Epipolar geometry makes that constraint explicit.

The single most important intuition of this lesson: a single pixel correspondence does NOT determine the 3D position of the point. It gives you a ray. Two correspondences from two views give you TWO rays. Where the rays meet is the 3D point — but only if you know R and t. Recovering R and t from many correspondences is the goal of the 8-point algorithm. Then triangulating rays gives you the 3D structure. These two steps together are called Structure from Motion (SfM).

Why matched pixels alone aren't enough

From L5, a calibrated pixel (u, v) in camera 1 back-projects to a ray: every 3D point d · [x, y, 1]T (in normalized camera coordinates) maps to that pixel, for any depth d > 0. The same matching pixel in camera 2 back-projects to another ray through camera 2's origin. The two rays should meet at the 3D point — but you don't know R and t yet, so you can't intersect them.

The epipolar constraint is what you can assert before knowing R and t. It says: whatever R and t are, the correct match must satisfy a single algebraic equation. If you collect enough correct matches, that equation system is solvable for R and t.

Two views of the same 3D point

Drag the 3D point (blue dot) or the second camera position (orange dot) to see how the projected image positions change. Notice that the match carries no depth information on its own — the point could be anywhere along either dashed ray.

You are given a single pixel (u, v) in a calibrated camera. After back-projecting through K−1, what do you know about the 3D location of the corresponding scene point?

Chapter 1: Epipolar Geometry

Place two cameras in space. Pick any 3D point P visible to both. Draw lines from P to each camera's optical center O1 and O2. Those three points — P, O1, O2 — define a plane. This plane is called the epipolar plane.

The epipolar plane intersects each image plane in a line. These are the epipolar lines: in camera 1's image, the epipolar line corresponding to a point in camera 2; and vice versa. All epipolar lines in camera 1 pass through a single special point called the epipole e1 — the image of camera 2's optical center as seen by camera 1. Symmetrically, e2 is camera 1's center projected into camera 2.

The 2D-to-1D reduction. Without epipolar geometry, finding the match for a point in image 1 requires searching the entire image 2 (a 2D search). With epipolar geometry, the match must lie exactly on the epipolar line — a 1D search. This is one of the most practically important facts in all of computer vision: it halves the dimensionality of the correspondence problem and dramatically reduces false matches.

Here is the key geometric fact: as P moves along a 3D ray through O1, the epipolar plane rotates around the baseline (the line through O1 and O2). The image of P in camera 1 stays fixed (it's always the same pixel on the ray), but the image in camera 2 traces out the epipolar line. So: given the pixel in camera 1, the epipolar geometry tells you exactly which line to search in camera 2.

This is not just elegant geometry — it has a direct computational consequence. Stereo matching algorithms exploit it to limit correlation search to single scan-lines. Dense stereo SLAM (like DepthAI or ZED camera systems) is powered by exactly this: rectify the image pair so epipolar lines become horizontal, then search each row independently. The reduction from 2D search (W × H comparisons per pixel) to 1D search (W comparisons per pixel) yields a roughly ×H speedup — typically 500× for a 500-pixel-tall image.

The epipolar line equation

Given a point ỹ1 in camera 1's calibrated coordinates, the corresponding epipolar line in camera 2 is given by:

l2 = E ỹ1

This is a 3-vector [a, b, c]T representing the line au + bv + c = 0 in calibrated coordinates. Any correct match ỹ2 must satisfy l2T2 = 0 — which is exactly the epipolar constraint ỹ2TEỹ1 = 0. The distance from a point ỹ2 to this line is |l2T2| / ||[a,b]|| and is called the Sampson distance — a first-order approximation to the reprojection error used in RANSAC inlier scoring.

Epipoles: where all epipolar lines meet

The epipole e1 is the projection of O2 into camera 1. It may fall outside the image frame — that's fine, it's still a well-defined point (at infinity if the cameras are side-by-side with the same heading). All epipolar lines in camera 1 pass through e1. Symmetrically, all epipolar lines in camera 2 pass through e2.

Special case: parallel cameras (e.g., a stereo pair with identical orientations). Here O1 and O2 differ only in the horizontal direction. The epipoles are at infinity, and all epipolar lines are horizontal — the y-coordinate of a match is identical in both images. This is the classical stereo rectification setup.

Misconception: epipolar geometry depends on the scene. It does NOT. The epipolar geometry depends only on the camera positions and orientations — the extrinsic relationship between the two cameras. A single 3D point determines which epipolar plane we're in, but the family of all epipolar lines (and the epipoles) is fixed by R and t alone. Change the scene, keep the cameras: same epipolar geometry.

Counting the epipolar geometry DOF

The relative pose between two cameras has 6 DOF (3 rotation + 3 translation). But two-view geometry only determines translation direction, not magnitude — that's the scale ambiguity. So the epipolar geometry has 5 DOF: 3 for R, 2 for the unit direction of t. This is why the minimal solver for the essential matrix needs exactly 5 correspondences (the 5-point algorithm), and why the 8-point algorithm over-parameterizes by 3 DOF (it fits a general 3×3 matrix with 8 DOF before projecting to rank 2).

Epipolar geometry — drag the 3D point

Left: camera 1's image. Right: camera 2's image. Drag the orange dot (3D point position projected onto left image). Watch how the epipolar line in camera 2 (right panel) updates — the match for the orange dot must lie on that teal line. The blue dots show the epipoles.

If the two cameras are pointed in exactly the same direction (no relative rotation), just displaced sideways by baseline b, where do the epipoles lie?

Chapter 2: The Epipolar Constraint

Let's derive the epipolar constraint from scratch. Set camera 1 at the world origin with no rotation — a convenient choice that doesn't lose generality. Camera 2 has position t and rotation R relative to camera 1. A 3D point P is at depth d1 in camera 1's frame, so:

d1 · ỹ1 = Pcam1     d2 · ỹ2 = R · Pcam1 + t

where ỹ1 = K−11 and ỹ2 = K−12 are the calibrated (normalized) coordinates — pixel coordinates pre-multiplied by K−1 to remove the camera's intrinsics. From the left equation, Pcam1 = d11. Substituting into the right:

d2 · ỹ2 = d1 R ỹ1 + t

Now cross-multiply both sides by t (i.e., apply [t]×, the skew-symmetric matrix of t). Since [t]×t = 0 (any vector crossed with itself is zero):

d2 [t]×2 = d1 [t]× R ỹ1

Dot both sides with ỹ2T. The left side becomes d22T[t]×2 = 0, because [t]×2 is orthogonal to ỹ2 (cross product gives a perpendicular vector). So:

2T [t]× R ỹ1 = 0

This is the epipolar constraint. It must hold for every correct correspondence (ỹ1, ỹ2). The quantity in the middle — [t]×R — is the Essential matrix E (Chapter 3).

The skew-symmetric matrix [t]×

If t = [tx, ty, tz]T, the skew-symmetric (hat) matrix is:

[t]× = [ row1: 0, −tz, ty | row2: tz, 0, −tx | row3: −ty, tx, 0 ]

This is the hat operator from VNAV L1/L2 — the same tool used to convert the angular velocity vector to the cross-product matrix. Here it converts t into a matrix so the cross product t × v can be written as [t]×v.

Worked numeric check

Let t = [1, 0, 0]T (camera 2 is 1 unit to the right), R = I (no rotation). A 3D point at [0, 0, 5] in camera 1's frame. Normalized coords: ỹ1 = [0, 0, 1]T; ỹ2 = [−1/5, 0, 1]T (camera 2 sees the point shifted left by 1/5 = baseline/depth). Check:

epipolar constraint check
# t = [1,0,0],  [t]× = [[0,0,0],[0,0,-1],[0,1,0]]  (skew of [1,0,0])
# Wait: [1,0,0] skew: [ 0, -tz, ty ; tz, 0, -tx ; -ty, tx, 0 ]
#                  = [ 0,  0,  0 ;  0, 0, -1 ;   0,  1,  0 ]
E = [t]× R = [t]×  (since R=I)
# y2^T E y1 = [-1/5, 0, 1] * [[0,0,0],[0,0,-1],[0,1,0]] * [0,0,1]
# E y1 = [0,0,0;0,0,-1;0,1,0]*[0,0,1] = [0,-1,0]
# y2^T * [0,-1,0] = (-1/5)(0) + (0)(-1) + (1)(0) = 0  ✓ constraint holds

Now test a wrong match: ỹ2,wrong = [0, 0, 1]T (pointing straight forward in camera 2, not the correct leftward-shifted point). Then ỹ2,wrongT E ỹ1 = [0,0,1]·[0,−1,0] = 0 (by coincidence in this degenerate case — but for a general rotation and t, wrong matches give nonzero values). In practice you test |ỹ2TEỹ1| < threshold.

Non-degenerate example: rotation + translation

Let t = [0.5, 0, 0.866]T (camera 2 is 1 unit forward-right at 60°), R = rotation of −30° around y-axis. A point P = [1, 0, 5] in camera 1's frame. First, ỹ1 = [1/5, 0, 1]T = [0.2, 0, 1]. In camera 2's frame, P2 = RP − Rt (relative pose), approximately [0.4, 0, 4.7] after the rotation, so ỹ2 ≈ [0.085, 0, 1]. The epipolar constraint check:

non-degenerate constraint verification
import numpy as np

# Camera 2 rotated -30deg around Y from camera 1
theta = np.radians(-30)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0, 0.866])  # ||t|| ≈ 1

t_hat = np.array([[0, -t[2], t[1]],
                  [t[2], 0, -t[0]],
                  [-t[1], t[0], 0]])
E = t_hat @ R

P = np.array([1.0, 0.0, 5.0])   # 3D point in cam1 frame
y1 = P / P[2]                   # [0.2, 0, 1]
P2 = R @ P + t                  # point in cam2 frame
y2 = P2 / P2[2]                 # ≈ [0.085, 0, 1]

constraint = y2 @ E @ y1        # should be ≈ 0
print(f"Constraint: {constraint:.2e}")  # → ~0.0  ✓

# Wrong match — shift y2 by (0.3, 0, 0)
y2_wrong = y2 + np.array([0.3, 0, 0])
constraint_wrong = y2_wrong @ E @ y1
print(f"Wrong match: {constraint_wrong:.4f}")  # → nonzero ≈ -0.28
Verified: the epipolar constraint is a scalar equation — a single number that equals zero for any correct match and is nonzero for incorrect ones (in non-degenerate cases). This is what makes it testable in RANSAC (L8).
What geometric condition does [t]×2 = 0 encode in the epipolar constraint derivation?

Chapter 3: The Essential Matrix

The Essential matrix is the 3×3 matrix at the heart of the epipolar constraint for calibrated cameras:

E = [t]× R

It encodes the full relative pose between two calibrated cameras — six DOF in principle (3 for R, 3 for t), but reduced to 5 DOF because E is scale-invariant (multiplying t by any scalar leaves E unchanged up to scale, since the epipolar constraint ỹ2TEỹ1=0 is homogeneous). This is the scale ambiguity: from two images alone you can recover the direction of t but not its magnitude.

Critical misconception — you cannot recover absolute scale from two views. Suppose the true scene has baseline 1 m and a building at 10 m. A scaled-down version with baseline 1 cm and a model at 10 cm produces identical images. No pixel information can distinguish these two scenarios. You need an external reference (known object size, IMU, GPS, stereo rig with fixed baseline) to recover metric scale. This is a fundamental limit, not a limitation of any particular algorithm.

Properties of E

An essential matrix has singular values {σ, σ, 0}. The two equal nonzero singular values reflect the rotational degrees of freedom; the zero singular value reflects the fact that E is rank-2 (the epipolar constraint collapses the 3D cone of rays to a line in the other image). After running the 8-point algorithm (Ch 5), we project the estimated matrix onto the essential space by enforcing this structure via SVD:

Eproj = U · diag(σ̄, σ̄, 0) · VT     where σ̄ = (σ1 + σ2) ⁄ 2

where UΣVT is the SVD of the unconstrained estimate. This "closest essential matrix" projection is the standard post-processing step.

Stereo camera example — parallel cameras

The simplest case: t = [b, 0, 0]T (horizontal baseline b), R = I. The skew matrix [t]× has only two nonzero entries: −b in position (1,2) and b in position (2,1). Then E = [t]×I = [t]×, so:

2T E ỹ1 = ỹ2T [0,0,0; 0,0,−b; 0,b,0] ỹ1

Working this out with ỹ1 = [u1, v1, 1] and ỹ2 = [u2, v2, 1]: the result is b(v2 − v1) = 0, so v2 = v1. For a horizontal baseline and no rotation, the match must have the same vertical coordinate in both images — the epipolar lines are horizontal scan-lines. This is exactly the stereo rectification property.

Calibrated vs uncalibrated cameras

The derivation above used normalized coordinates ỹ = K−1x̃ — pixel coordinates mapped to unit focal-length space. This assumes K is known. When cameras are calibrated (K known), we work with E. When K is unknown, we must work with raw pixel coordinates and the Fundamental matrix F = K−TEK−1 instead (Chapter 4).

python
import numpy as np

def hat(t):
    """Skew-symmetric (hat) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]],
                     [t[2],  0, -t[0]],
                     [-t[1], t[0],  0]])

def essential_matrix(R, t):
    """E = [t]× R  (calibrated cameras, unit ||t||)."""
    t_hat = hat(t / np.linalg.norm(t))  # normalize t
    return t_hat @ R

# Example: camera 2 is 1 unit right, no rotation
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = essential_matrix(R, t)
# E = [[0, 0, 0],[0, 0,-1],[0, 1, 0]]
# Singular values of E should be {1, 1, 0}
U, s, Vt = np.linalg.svd(E)
print(s)  # → [1. 1. 0.]  ✓
The Essential matrix E = [t]×R has rank 2 (one zero singular value). What does this zero singular value correspond to geometrically?

Chapter 4: The Fundamental Matrix

What if you don't know K? Many SfM pipelines encounter images from unknown cameras — internet photos, historical footage, smartphone images with EXIF stripped. The Fundamental matrix F generalizes the epipolar constraint to raw pixel coordinates without requiring calibration:

2T F x̃1 = 0

where x̃1, x̃2 are pixel coordinates (not normalized). The relationship between F and E is:

F = K2−T E K1−1 = K2−T [t]× R K1−1

F absorbs K into the constraint — it works directly with raw pixel coordinates. Like E, F is rank-2 and determined only up to scale, so it has 7 degrees of freedom (9 entries minus 1 for scale minus 1 for the rank constraint). In practice you estimate F with the 8-point algorithm on pixel coordinates, then if you know K you can recover E = K2T F K1.

The epipolar line equation

Given a point x1 in image 1, the corresponding epipolar line in image 2 is:

l2 = F x̃1

This is a 3-vector [a, b, c]T representing the line au + bv + c = 0 in pixel coordinates. The point x2 must lie on this line: l2T2 = 0, which is exactly the constraint x̃2TFx̃1 = 0. Symmetrically, l1 = FT2 is the epipolar line in image 1 for point x2.

Scale ambiguity — slide the global scale

Both panels show the same two-view scenario. Left: baseline = 1 m, scene depth = 10 m. Right: baseline = 1 cm, scene depth = 10 cm. The slider multiplies both the baseline and all depths by the same factor. Watch: the projected images (shown as colored dots on image planes) are identical regardless of scale. This is why you cannot recover absolute scale from two images alone.

Global scale 1.00×
Summary: F works on pixels. E works on calibrated (K−1 normalized) coordinates. If you know K, recover E from F. If you don't, use F directly and accept that you can't separate R and t from K. In robotics (calibrated cameras), E is standard. In internet-scale SfM (unknown cameras), F is standard.

Numerical sanity check

With K = diag(500, 500, 1) (focal length 500 px, principal point at origin) and the E computed above (E = [t]×I for t=[1,0,0]):

F = K−T E K−1 = (1⁄500) · [[0,0,0],[0,0,−1],[0,1,0]] · (1⁄500)

So F = (1/250000) · [[0,0,0],[0,0,−1],[0,1,0]]. A pixel at x1 = [250, 250, 1]T (center-left): l2 = Fx1 = [0, −1/250000, 1/250000]T, or equivalently the line −v + 1 = 0, i.e., v = 1 pixel row. The horizontal epipolar line makes sense for a horizontal baseline.

You estimate F from pixel correspondences. A friend says: "great, now we can recover R and t". What is the catch?

Chapter 5: The 8-Point Algorithm

You have N ≥ 8 matched pixel pairs. You want to estimate F (or E for calibrated cameras). The 8-point algorithm (Hartley 1997) turns the epipolar constraint into a linear system.

Building the constraint matrix

Each correspondence (x1 = [u1, v1, 1]T, x2 = [u2, v2, 1]T) contributes one row to a constraint matrix A. Expanding x2TFx1 = 0 and stacking the 9 entries of F as a vector f = vec(F):

akT f = 0     where    akT = [u2u1, u2v1, u2, v2u1, v2v1, v2, u1, v1, 1]

Stack N such rows into A ∈ ℝN×9 and solve Af = 0. The unique (up to scale) solution is the right singular vector corresponding to the smallest singular value of A. With exactly 8 points and exact correspondences, A has rank 8 and the null space is 1D.

Worked rows

Match 1: x1 = [100, 200, 1], x2 = [95, 198, 1]
a1T = [100×95, 200×95, 95, 100×198, 200×198, 198, 100, 200, 1]
= [9500, 19000, 95, 19800, 39600, 198, 100, 200, 1]

Match 2: x1 = [300, 150, 1], x2 = [294, 151, 1]
a2T = [294×300, 294×150, 294, 151×300, 151×150, 151, 300, 150, 1]
= [88200, 44100, 294, 45300, 22650, 151, 300, 150, 1]

With 8 such rows assembled, SVD gives the least-squares solution for f (and thus F).

Rank-2 enforcement

The SVD solution gives a generic 3×3 matrix — it won't have rank 2. We enforce rank 2 by zeroing the smallest singular value:

F = U diag(σ1, σ2, 0) VT

This is the nearest rank-2 matrix (in Frobenius norm) to the unconstrained estimate. The same step, with the averaging modification shown in Ch 3, projects onto the essential space.

Why normalization matters (Hartley's insight)

Raw pixel coordinates span [0, W] × [0, H] — numbers in the hundreds. The constraint matrix A has entries like u2u1 ≈ 104 in one column and 1 in another. This poor conditioning makes the SVD numerically unstable. Hartley (1997) showed you should normalize each image's points to have zero mean and unit mean distance from origin before running the algorithm, then denormalize at the end. This simple step reduces errors by orders of magnitude.

Key insight: 5 is actually enough, 8 is convenient. E has 5 DOF (3 for R + 2 for the unit direction of t). Each correspondence gives 1 equation. So 5 correspondences suffice in theory — Nistér's 5-point algorithm is a minimal solver used in modern SLAM. The 8-point algorithm uses 8 because it ignores the E-specific structure and solves for a generic 3×3 matrix first. More data (N >> 8) is handled by least-squares minimization of ||Af||2.
python
import numpy as np

def normalize_pts(pts):
    """Hartley normalization: zero mean, unit mean distance."""
    mu = pts.mean(axis=0)
    d  = np.linalg.norm(pts - mu, axis=1).mean()
    s  = np.sqrt(2) / (d + 1e-8)
    T  = np.array([[s, 0, -s*mu[0]],
                   [0, s, -s*mu[1]],
                   [0, 0, 1]])
    pts_h = np.hstack([pts, np.ones((len(pts),1))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """8-point algorithm for Fundamental matrix."""
    p1n, T1 = normalize_pts(pts1)
    p2n, T2 = normalize_pts(pts2)
    N = len(pts1)
    A = np.zeros((N, 9))
    for k in range(N):
        u1, v1 = p1n[k, 0], p1n[k, 1]
        u2, v2 = p2n[k, 0], p2n[k, 1]
        A[k] = [u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1]
    # Least-squares: right singular vector of A
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2
    U, s, Vt2 = np.linalg.svd(F)
    s[2] = 0
    F = U @ np.diag(s) @ Vt2
    # Denormalize
    F = T2.T @ F @ T1
    return F / F[2, 2]
After running the 8-point algorithm you get a 3×3 matrix F̂. Its singular values are {8.4, 3.1, 0.7}. What do you do next before using F̂ as a Fundamental matrix?

Chapter 6: Recovering R & t from E

You have a valid Essential matrix E (either computed directly from calibrated coordinates, or recovered as E = KTFK1 from F). Now recover R and t. The result from Ma et al. (2004, Thm 5.7) gives exactly two rotation-translation pairs from E — and another two from −E, giving four candidates total.

SVD decomposition

Compute the SVD: E = UΣVT. Define the 90° rotation matrix around the z-axis:

Rz(±π⁄2) = [[0, ∓1, 0], [±1, 0, 0], [0, 0, 1]]

The two pose candidates from E are:

R1 = U Rz(+π⁄2) VT     t1 = U Rz(+π⁄2) Σ UT
R2 = U Rz(−π⁄2) VT     t2 = U Rz(−π⁄2) Σ UT

Since −E is also a valid essential matrix (the constraint ỹ2TEỹ1 = 0 holds for −E too), there are two more candidates from −E. Total: four (R, t) pairs.

The cheirality test

Three of the four candidates place the reconstructed 3D points behind one or both cameras (negative depth). Only one candidate has all points in front of both cameras. This is the cheirality constraint (from the Greek for "hand" — it distinguishes left from right, front from back). To apply it: triangulate a test point using each candidate (R, t), and check the signs of d1 and d2. The candidate with d1 > 0 and d2 > 0 for all (or most) test points is the correct one.

The four-fold ambiguity, explained geometrically. Two of the four solutions correspond to points in front of camera 1 but behind camera 2 (or vice versa). The fourth has points behind both cameras — the camera has "flipped" through the baseline. Only one solution has points in front of both cameras, which is the physical one. In practice you check just one or a few test points; that's enough to identify the correct solution.
python
import numpy as np

def recover_pose(E, pts1_cal, pts2_cal):
    """Recover R, t from E using cheirality test.
    pts1_cal, pts2_cal: calibrated 2D points (K^-1 * pixel)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0: U *= -1
    if np.linalg.det(Vt) < 0: Vt *= -1
    Rz90p = np.array([[0,-1,0],[1,0,0],[0,0,1]])
    Rz90m = np.array([[0, 1,0],[-1,0,0],[0,0,1]])
    candidates = [
        (U @ Rz90p @ Vt,  U[:, 2]),
        (U @ Rz90p @ Vt, -U[:, 2]),
        (U @ Rz90m @ Vt,  U[:, 2]),
        (U @ Rz90m @ Vt, -U[:, 2]),
    ]
    best_R, best_t, best_count = None, None, -1
    for R, t in candidates:
        count = 0
        for p1, p2 in zip(pts1_cal, pts2_cal):
            X = triangulate_point(p1, p2, R, t)
            d1 = X[2]
            d2 = (R @ X + t)[2]
            if d1 > 0 and d2 > 0: count += 1
        if count > best_count:
            best_count, best_R, best_t = count, R, t
    return best_R, best_t
Four-fold ambiguity: toggle the four (R, t) candidates

Each button shows one of the four decomposition candidates. Watch where the reconstructed point (orange dot) lands relative to both cameras (triangles). Only the correct candidate puts the point in front of both cameras.

After decomposing E into 4 candidate poses, you triangulate a single test point under each candidate. Two candidates yield d1=−2, d2=3; d1=4, d2=−1 respectively. The third gives d1=−3, d2=−5. The fourth gives d1=4, d2=3. Which candidate is physically correct?

Chapter 7: Triangulation

You have R and t. You have calibrated correspondences (ỹ1, ỹ2). To find the 3D position of the matched point, you triangulate: intersect the two back-projected rays.

Ray from camera 1: all points d11 for d1 > 0. Ray from camera 2 expressed in camera 1's frame: all points Rỹ2·d2 + t for d2 > 0. In the absence of noise these rays exactly intersect. With noise they are skew lines. We find the 3D point that minimizes reprojection error.

Linear triangulation (DLT)

Rewrite d22 = Rỹ1d1 + t as two equations in (d1, d2):

[ỹ2 | −Rỹ1] · [d2, d1]T = t

This is Ax = b with A ∈ ℝ3×2, x = [d2, d1]T, b = t. Solve via least-squares: x = (ATA)−1ATb. The 3D point is then P = d11.

Worked numbers

Let R = I, t = [1, 0, 0]T (camera 2 is 1 m to the right). A 3D point P = [0, 0, 5]. Camera 1 sees ỹ1 = [0, 0, 1]T (point straight ahead at depth 5). Camera 2 sees ỹ2 = [−0.2, 0, 1]T (point shifted left by 1/5).

triangulation worked example
# A = [y2 | -R*y1] where y2=[-.2,0,1], R*y1=[0,0,1]
# A = [[-0.2,  0],       (row 0: y2[0], -y1[0])
#      [ 0.0,  0],       (row 1: y2[1], -y1[1])
#      [ 1.0, -1]]       (row 2: y2[2], -y1[2])
# b = t = [1, 0, 0]
# A^T A = [[-0.2,0,1]·[-0.2,0,1], [-0.2,0,1]·[0,0,-1]] = [[1.04, -1],[-1, 1]]
# A^T b = [-0.2*1 + 0*0 + 1*0, 0*1 + 0*0 + (-1)*0] = [-0.2, 0]
# Solve: (A^T A) x = A^T b  →  d2≈1.0, d1≈5.0  ✓
# P = d1 * y1 = 5 * [0,0,1] = [0,0,5]  ✓

Linear triangulation — full code

python
def triangulate_point(y1, y2, R, t):
    """Linear triangulation.
    y1, y2: calibrated 2D points (K^-1 * pixel, shape (2,))
    R: rotation, t: translation (camera 2 relative to camera 1).
    Returns 3D point in camera 1 frame."""
    y1h = np.array([y1[0], y1[1], 1.0])
    y2h = np.array([y2[0], y2[1], 1.0])
    Ry2 = R @ y2h  # rotate ray from cam2 into cam1 frame
    # System: [y1h | -Ry2] [d1; d2] = t
    A = np.column_stack([y1h, -Ry2])   # shape (3, 2)
    # Least-squares: d = (A^T A)^-1 A^T t
    d, _, _, _ = np.linalg.lstsq(A, t, rcond=None)
    d1, d2 = d[0], d[1]
    # Midpoint estimate for robustness
    P_from_cam1 = d1 * y1h
    P_from_cam2 = R.T @ (d2 * y2h - t)
    return (P_from_cam1 + P_from_cam2) * 0.5

# Quick test: t=[1,0,0], R=I, y1=[0,0,1], y2=[-0.2,0,1], P=[0,0,5]
R, t = np.eye(3), np.array([1.0, 0, 0])
P_est = triangulate_point([0.0,0.0], [-0.2,0.0], R, t)
print(P_est)  # → [0. 0. 5.]  ✓

Why small baseline kills depth accuracy

The angular uncertainty in a correspondence translates to a large depth uncertainty when the rays are nearly parallel (small baseline). If the baseline is b and the scene is at depth Z, the depth estimation error scales as Z2 / b. A drone moving 1 cm between frames cannot accurately triangulate a building 50 m away — the rays are nearly parallel and any noise in the match pushes the intersection point wildly. This is the motivation for having large baseline in stereo systems.

Misconception: more feature matches improve depth accuracy. More matches average away noise in R and t estimates (good!), but they don't fix the fundamental geometry: shallow-angle intersections give poor depth even with perfect R and t. The geometry of the baseline and scene depth, not the number of matches, limits triangulation quality.
Triangulation noise — two rays that nearly meet

Top-down view. Two cameras (triangles) look at a 3D point (green). With zero noise the two rays (blue, purple dashed lines) intersect exactly at the true point. Add noise: the rays become skew lines and the midpoint estimate (orange) drifts. Small baseline makes this much worse — the rays are nearly parallel, so even tiny angular noise causes a huge shift in the intersection.

Match noise (σ px) 0.00
Baseline (m) 1.00
A drone takes two frames 1 mm apart (tiny baseline) and triangulates a building 30 m away. A second test takes frames 1 m apart (large baseline) to the same building. Assuming same pixel noise in both cases, which gives more accurate depth estimates?

Chapter 8: Showcase — Full SfM Simulator

This simulation runs the complete two-view geometry pipeline: generate synthetic 3D points, project them into two cameras, add noise, run the 8-point algorithm, recover R and t, and triangulate. Use the sliders to explore how each parameter degrades the reconstruction.

Two-View SfM — full pipeline

Left panel: top-down view of the two cameras (triangles) and the true 3D points (teal dots) vs reconstructed points (orange dots). Right panel: reprojection error histogram. Adjust sliders to break the reconstruction.

Baseline (m) 1.00
Noise (px) 0.50
N points 20

Chapter 9: Connections & Cheat Sheet

The full two-view geometry cheat sheet

ConceptFormulaNotes
Epipolar constraint (calibrated)2T E ỹ1 = 0ỹ = K−1
Essential matrixE = [t]×Rrank 2, SVs {σ,σ,0}, 5 DOF
Fundamental matrixF = K2−TEK1−1works on pixels, 7 DOF
8-point algorithmAf = 0 → SVD → rank-2 projectNormalize first (Hartley)
Pose from ESVD → 4 candidatesCheirality picks the correct one
Triangulation[ỹ2 | −Rỹ1][d2;d1] = tLeast-squares for depths
Scale ambiguity||t|| = 1 by conventionNeed external reference for metric scale

Degraded cases — when two-view geometry fails

ScenarioWhat breaksFix
Pure rotation (no translation)E = [0]×R = 0 — singular, E undefined; cannot triangulate depthUse homography instead; add IMU for translation
Small baselineRays nearly parallel → depth error ∝ Z2⁄bLarger baseline; multi-view SfM for averaging
Planar sceneE degenerate; all points on one plane → homography betterUse homography H instead of E; degenerate 8-pt
Outlier matchesEven one outlier row poisons the linear systemRANSAC (L8): random 8-subset, fit, count inliers

Connections to the VNAV curriculum

Intrinsic matrix K · back-projection · "one pixel = a ray" — the foundation that makes ỹ = K−1x̃ meaningful
Harris corners, SIFT, KLT — produces the matched pairs (x1, x2) that feed the 8-point algorithm
L7: Two-View Geometry (this lesson)
Epipolar constraint · E and F · 8-point · cheirality · triangulation · scale ambiguity
L8: RANSAC
Random sample consensus: robustly estimate E⁄F in the presence of outlier matches (wrong correspondences from L6)
VIO & SLAM
Chain two-view pose estimates over hundreds of frames; add IMU for metric scale and loop closure
Connections to VNAV L1 & L2: The skew-symmetric matrix [t]× is the hat operator from L1/L2's SE(3) treatment. The rotation R in E = [t]×R is an SO(3) element. The SVD decomposition for recovering R ensures R stays in SO(3) (det = +1) — the same manifold structure explored with Lie groups in L2.
Scale ambiguity in the wild. Real SLAM systems (ORB-SLAM3, COLMAP) handle scale ambiguity by either (a) using a stereo rig with fixed known baseline — then you know ||t|| exactly; (b) fusing with IMU which gives metric acceleration; (c) detecting loop closures and enforcing global consistency. Without one of these, monocular SfM produces a "shape-up-to-scale" reconstruction: the shape is correct, but every depth could be multiplied by any constant.

How two-view geometry fits into a full SLAM pipeline

Visual Odometry (VO)

Chain pairwise two-view estimates across hundreds of frames. Each frame pair gives a relative pose (Rk, tk). Compose them: T0→N = TN−1 · TN−2 · … · T1. Drift accumulates since each R and t has small errors. Bundle adjustment periodically refines all poses and 3D points jointly to minimize reprojection error across all views.

VIO (adding IMU)

IMU provides: (1) metric acceleration, which integrates to give metric translation — solving the scale ambiguity; (2) orientation estimates via gyroscope, tightening R recovery; (3) high-rate prediction (100–400 Hz) between camera frames (10–30 Hz). Tight coupling of IMU and camera in a single factor graph is the foundation of modern VIO systems (VINS-Mono, OKVIS, MSCKF).

The full SfM pipeline in five steps: (1) Detect features (L6). (2) Match features across views (ratio test, L6). (3) Estimate F or E with 8-point + RANSAC (L8). (4) Recover R, t via SVD + cheirality. (5) Triangulate 3D points. Then add more views (multi-view SfM) and refine with bundle adjustment. This pipeline is the core of COLMAP, ORB-SLAM, and every visual navigation system in production today.
A monocular drone estimates its trajectory using two-view geometry between consecutive frames, no IMU, no known objects. What can it recover?