Two-View Geometry: Epipolar & SfM

Chapter 0: The Two-Photo Problem

A drone flies past a building. Its forward-facing camera snaps two frames 0.3 seconds apart. Your feature tracker (Lecture 12–13) hands you 200 matched pixel pairs: point (u₁, v₁) in frame 1 corresponds to (u₂, v₂) in frame 2. Those pairs all picture the same physical points — a window corner, a ledge, a drainpipe. Now what?

The matched pixels alone don't tell you anything about how the camera moved. A 1-pixel shift could mean the drone moved 1 cm or 100 m, depending on the distance to the building. And each pixel is just a ray shooting out from the camera — the feature could be anywhere along that ray. You need more structure.

The saving grace is that the two cameras are geometrically related. They are both looking at the same rigid 3D world. That rigidity imposes a constraint between every correct correspondence — a constraint that can be written as a single equation involving only the pixel coordinates, the camera rotation R, and the translation t between the two views. Epipolar geometry makes that constraint explicit.

The single most important intuition of this lesson: a single pixel correspondence does NOT determine the 3D position of the point. It gives you a ray. Two correspondences from two views give you TWO rays. Where the rays meet is the 3D point — but only if you know R and t. Recovering R and t from many correspondences is the goal of the 8-point algorithm. Then triangulating rays gives you the 3D structure. These two steps together are called Structure from Motion (SfM).

Why matched pixels alone aren't enough

From L5, a calibrated pixel (u, v) in camera 1 back-projects to a ray: every 3D point d · [x, y, 1]^T (in normalized camera coordinates) maps to that pixel, for any depth d > 0. The same matching pixel in camera 2 back-projects to another ray through camera 2's origin. The two rays should meet at the 3D point — but you don't know R and t yet, so you can't intersect them.

The epipolar constraint is what you can assert before knowing R and t. It says: whatever R and t are, the correct match must satisfy a single algebraic equation. If you collect enough correct matches, that equation system is solvable for R and t.

Two views of the same 3D point

Drag the 3D point (blue dot) or the second camera position (orange dot) to see how the projected image positions change. Notice that the match carries no depth information on its own — the point could be anywhere along either dashed ray.

You are given a single pixel (u, v) in a calibrated camera. After back-projecting through K⁻¹, what do you know about the 3D location of the corresponding scene point?

The exact 3D position: it is the unique point that projects to (u, v) It lies somewhere on a ray from the camera center through (u, v) — depth is unknown It lies on a plane parallel to the image plane at the camera's focal length Nothing — a single pixel gives zero 3D information

Chapter 1: Epipolar Geometry

Place two cameras in space. Pick any 3D point P visible to both. Draw lines from P to each camera's optical center O₁ and O₂. Those three points — P, O₁, O₂ — define a plane. This plane is called the epipolar plane.

The epipolar plane intersects each image plane in a line. These are the epipolar lines: in camera 1's image, the epipolar line corresponding to a point in camera 2; and vice versa. All epipolar lines in camera 1 pass through a single special point called the epipole e₁ — the image of camera 2's optical center as seen by camera 1. Symmetrically, e₂ is camera 1's center projected into camera 2.

The 2D-to-1D reduction. Without epipolar geometry, finding the match for a point in image 1 requires searching the entire image 2 (a 2D search). With epipolar geometry, the match must lie exactly on the epipolar line — a 1D search. This is one of the most practically important facts in all of computer vision: it halves the dimensionality of the correspondence problem and dramatically reduces false matches.

Here is the key geometric fact: as P moves along a 3D ray through O₁, the epipolar plane rotates around the baseline (the line through O₁ and O₂). The image of P in camera 1 stays fixed (it's always the same pixel on the ray), but the image in camera 2 traces out the epipolar line. So: given the pixel in camera 1, the epipolar geometry tells you exactly which line to search in camera 2.

This is not just elegant geometry — it has a direct computational consequence. Stereo matching algorithms exploit it to limit correlation search to single scan-lines. Dense stereo SLAM (like DepthAI or ZED camera systems) is powered by exactly this: rectify the image pair so epipolar lines become horizontal, then search each row independently. The reduction from 2D search (W × H comparisons per pixel) to 1D search (W comparisons per pixel) yields a roughly ×H speedup — typically 500× for a 500-pixel-tall image.

The epipolar line equation

Given a point ỹ₁ in camera 1's calibrated coordinates, the corresponding epipolar line in camera 2 is given by:

l₂ = E ỹ₁

This is a 3-vector [a, b, c]^T representing the line au + bv + c = 0 in calibrated coordinates. Any correct match ỹ₂ must satisfy l₂^Tỹ₂ = 0 — which is exactly the epipolar constraint ỹ₂^TEỹ₁ = 0. The distance from a point ỹ₂ to this line is |l₂^Tỹ₂| / ||[a,b]|| and is called the Sampson distance — a first-order approximation to the reprojection error used in RANSAC inlier scoring.

Epipoles: where all epipolar lines meet

The epipole e₁ is the projection of O₂ into camera 1. It may fall outside the image frame — that's fine, it's still a well-defined point (at infinity if the cameras are side-by-side with the same heading). All epipolar lines in camera 1 pass through e₁. Symmetrically, all epipolar lines in camera 2 pass through e₂.

Special case: parallel cameras (e.g., a stereo pair with identical orientations). Here O₁ and O₂ differ only in the horizontal direction. The epipoles are at infinity, and all epipolar lines are horizontal — the y-coordinate of a match is identical in both images. This is the classical stereo rectification setup.

Misconception: epipolar geometry depends on the scene. It does NOT. The epipolar geometry depends only on the camera positions and orientations — the extrinsic relationship between the two cameras. A single 3D point determines which epipolar plane we're in, but the family of all epipolar lines (and the epipoles) is fixed by R and t alone. Change the scene, keep the cameras: same epipolar geometry.

Counting the epipolar geometry DOF

The relative pose between two cameras has 6 DOF (3 rotation + 3 translation). But two-view geometry only determines translation direction, not magnitude — that's the scale ambiguity. So the epipolar geometry has 5 DOF: 3 for R, 2 for the unit direction of t. This is why the minimal solver for the essential matrix needs exactly 5 correspondences (the 5-point algorithm), and why the 8-point algorithm over-parameterizes by 3 DOF (it fits a general 3×3 matrix with 8 DOF before projecting to rank 2).

Epipolar geometry — drag the 3D point

Left: camera 1's image. Right: camera 2's image. Drag the orange dot (3D point position projected onto left image). Watch how the epipolar line in camera 2 (right panel) updates — the match for the orange dot must lie on that teal line. The blue dots show the epipoles.

If the two cameras are pointed in exactly the same direction (no relative rotation), just displaced sideways by baseline b, where do the epipoles lie?

At the center of each image Anywhere along the vertical axis of each image At infinity — the cameras' optical axes are parallel and never converge At the corners of each image

Chapter 2: The Epipolar Constraint

Let's derive the epipolar constraint from scratch. Set camera 1 at the world origin with no rotation — a convenient choice that doesn't lose generality. Camera 2 has position t and rotation R relative to camera 1. A 3D point P is at depth d₁ in camera 1's frame, so:

d₁ · ỹ₁ = P_cam1 d₂ · ỹ₂ = R · P_cam1 + t

where ỹ₁ = K⁻¹x̃₁ and ỹ₂ = K⁻¹x̃₂ are the calibrated (normalized) coordinates — pixel coordinates pre-multiplied by K⁻¹ to remove the camera's intrinsics. From the left equation, P_cam1 = d₁ỹ₁. Substituting into the right:

d₂ · ỹ₂ = d₁ R ỹ₁ + t

Now cross-multiply both sides by t (i.e., apply [t]_×, the skew-symmetric matrix of t). Since [t]_×t = 0 (any vector crossed with itself is zero):

d₂ [t]_× ỹ₂ = d₁ [t]_× R ỹ₁

Dot both sides with ỹ₂^T. The left side becomes d₂ ỹ₂^T[t]_×ỹ₂ = 0, because [t]_×ỹ₂ is orthogonal to ỹ₂ (cross product gives a perpendicular vector). So:

ỹ₂^T [t]_× R ỹ₁ = 0

This is the epipolar constraint. It must hold for every correct correspondence (ỹ₁, ỹ₂). The quantity in the middle — [t]_×R — is the Essential matrix E (Chapter 3).

The skew-symmetric matrix [t]_×

If t = [t_x, t_y, t_z]^T, the skew-symmetric (hat) matrix is:

[t]_× = [ row1: 0, −t_z, t_y | row2: t_z, 0, −t_x | row3: −t_y, t_x, 0 ]

This is the hat operator from VNAV L1/L2 — the same tool used to convert the angular velocity vector to the cross-product matrix. Here it converts t into a matrix so the cross product t × v can be written as [t]_×v.

Worked numeric check

Let t = [1, 0, 0]^T (camera 2 is 1 unit to the right), R = I (no rotation). A 3D point at [0, 0, 5] in camera 1's frame. Normalized coords: ỹ₁ = [0, 0, 1]^T; ỹ₂ = [−1/5, 0, 1]^T (camera 2 sees the point shifted left by 1/5 = baseline/depth). Check:

epipolar constraint check
# t = [1,0,0],  [t]× = [[0,0,0],[0,0,-1],[0,1,0]]  (skew of [1,0,0])
# Wait: [1,0,0] skew: [ 0, -tz, ty ; tz, 0, -tx ; -ty, tx, 0 ]
#                  = [ 0,  0,  0 ;  0, 0, -1 ;   0,  1,  0 ]
E = [t]× R = [t]×  (since R=I)
# y2^T E y1 = [-1/5, 0, 1] * [[0,0,0],[0,0,-1],[0,1,0]] * [0,0,1]
# E y1 = [0,0,0;0,0,-1;0,1,0]*[0,0,1] = [0,-1,0]
# y2^T * [0,-1,0] = (-1/5)(0) + (0)(-1) + (1)(0) = 0  ✓ constraint holds

Now test a wrong match: ỹ_2,wrong = [0, 0, 1]^T (pointing straight forward in camera 2, not the correct leftward-shifted point). Then ỹ_2,wrong^T E ỹ₁ = [0,0,1]·[0,−1,0] = 0 (by coincidence in this degenerate case — but for a general rotation and t, wrong matches give nonzero values). In practice you test |ỹ₂^TEỹ₁| < threshold.

Non-degenerate example: rotation + translation

Let t = [0.5, 0, 0.866]^T (camera 2 is 1 unit forward-right at 60°), R = rotation of −30° around y-axis. A point P = [1, 0, 5] in camera 1's frame. First, ỹ₁ = [1/5, 0, 1]^T = [0.2, 0, 1]. In camera 2's frame, P₂ = RP − Rt (relative pose), approximately [0.4, 0, 4.7] after the rotation, so ỹ₂ ≈ [0.085, 0, 1]. The epipolar constraint check:

non-degenerate constraint verification
import numpy as np

# Camera 2 rotated -30deg around Y from camera 1
theta = np.radians(-30)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0, 0.866])  # ||t|| ≈ 1

t_hat = np.array([[0, -t[2], t[1]],
                  [t[2], 0, -t[0]],
                  [-t[1], t[0], 0]])
E = t_hat @ R

P = np.array([1.0, 0.0, 5.0])   # 3D point in cam1 frame
y1 = P / P[2]                   # [0.2, 0, 1]
P2 = R @ P + t                  # point in cam2 frame
y2 = P2 / P2[2]                 # ≈ [0.085, 0, 1]

constraint = y2 @ E @ y1        # should be ≈ 0
print(f"Constraint: {constraint:.2e}")  # → ~0.0  ✓

# Wrong match — shift y2 by (0.3, 0, 0)
y2_wrong = y2 + np.array([0.3, 0, 0])
constraint_wrong = y2_wrong @ E @ y1
print(f"Wrong match: {constraint_wrong:.4f}")  # → nonzero ≈ -0.28

Verified: the epipolar constraint is a scalar equation — a single number that equals zero for any correct match and is nonzero for incorrect ones (in non-degenerate cases). This is what makes it testable in RANSAC (L8).

What geometric condition does [t]_×ỹ₂ = 0 encode in the epipolar constraint derivation?

The translation and the second camera ray are perpendicular ỹ₂^T[t]_×ỹ₂ = 0 because a cross-product is orthogonal to both of its inputs The rotation matrix R is orthogonal, so the constraint simplifies The depths d₁ and d₂ cancel because the scene is planar

Chapter 3: The Essential Matrix

The Essential matrix is the 3×3 matrix at the heart of the epipolar constraint for calibrated cameras:

E = [t]_× R

It encodes the full relative pose between two calibrated cameras — six DOF in principle (3 for R, 3 for t), but reduced to 5 DOF because E is scale-invariant (multiplying t by any scalar leaves E unchanged up to scale, since the epipolar constraint ỹ₂^TEỹ₁=0 is homogeneous). This is the scale ambiguity: from two images alone you can recover the direction of t but not its magnitude.

Critical misconception — you cannot recover absolute scale from two views. Suppose the true scene has baseline 1 m and a building at 10 m. A scaled-down version with baseline 1 cm and a model at 10 cm produces identical images. No pixel information can distinguish these two scenarios. You need an external reference (known object size, IMU, GPS, stereo rig with fixed baseline) to recover metric scale. This is a fundamental limit, not a limitation of any particular algorithm.

Properties of E

An essential matrix has singular values {σ, σ, 0}. The two equal nonzero singular values reflect the rotational degrees of freedom; the zero singular value reflects the fact that E is rank-2 (the epipolar constraint collapses the 3D cone of rays to a line in the other image). After running the 8-point algorithm (Ch 5), we project the estimated matrix onto the essential space by enforcing this structure via SVD:

E_proj = U · diag(σ̄, σ̄, 0) · V^T where σ̄ = (σ₁ + σ₂) ⁄ 2

where UΣV^T is the SVD of the unconstrained estimate. This "closest essential matrix" projection is the standard post-processing step.

Stereo camera example — parallel cameras

The simplest case: t = [b, 0, 0]^T (horizontal baseline b), R = I. The skew matrix [t]_× has only two nonzero entries: −b in position (1,2) and b in position (2,1). Then E = [t]_×I = [t]_×, so:

ỹ₂^T E ỹ₁ = ỹ₂^T [0,0,0; 0,0,−b; 0,b,0] ỹ₁

Working this out with ỹ₁ = [u₁, v₁, 1] and ỹ₂ = [u₂, v₂, 1]: the result is b(v₂ − v₁) = 0, so v₂ = v₁. For a horizontal baseline and no rotation, the match must have the same vertical coordinate in both images — the epipolar lines are horizontal scan-lines. This is exactly the stereo rectification property.

Calibrated vs uncalibrated cameras

The derivation above used normalized coordinates ỹ = K⁻¹x̃ — pixel coordinates mapped to unit focal-length space. This assumes K is known. When cameras are calibrated (K known), we work with E. When K is unknown, we must work with raw pixel coordinates and the Fundamental matrix F = K^−TEK⁻¹ instead (Chapter 4).

python
import numpy as np

def hat(t):
    """Skew-symmetric (hat) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]],
                     [t[2],  0, -t[0]],
                     [-t[1], t[0],  0]])

def essential_matrix(R, t):
    """E = [t]× R  (calibrated cameras, unit ||t||)."""
    t_hat = hat(t / np.linalg.norm(t))  # normalize t
    return t_hat @ R

# Example: camera 2 is 1 unit right, no rotation
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = essential_matrix(R, t)
# E = [[0, 0, 0],[0, 0,-1],[0, 1, 0]]
# Singular values of E should be {1, 1, 0}
U, s, Vt = np.linalg.svd(E)
print(s)  # → [1. 1. 0.]  ✓

The Essential matrix E = [t]_×R has rank 2 (one zero singular value). What does this zero singular value correspond to geometrically?

The camera has zero baseline, so no 3D reconstruction is possible The rotation R has a determinant of zero Rays along the baseline direction (epipolar lines through the epipole) map to the same point — the constraint is degenerate in that direction The translation cannot be estimated because it is only 3 numbers

Chapter 4: The Fundamental Matrix

What if you don't know K? Many SfM pipelines encounter images from unknown cameras — internet photos, historical footage, smartphone images with EXIF stripped. The Fundamental matrix F generalizes the epipolar constraint to raw pixel coordinates without requiring calibration:

x̃₂^T F x̃₁ = 0

where x̃₁, x̃₂ are pixel coordinates (not normalized). The relationship between F and E is:

F = K₂^−T E K₁⁻¹ = K₂^−T [t]_× R K₁⁻¹

F absorbs K into the constraint — it works directly with raw pixel coordinates. Like E, F is rank-2 and determined only up to scale, so it has 7 degrees of freedom (9 entries minus 1 for scale minus 1 for the rank constraint). In practice you estimate F with the 8-point algorithm on pixel coordinates, then if you know K you can recover E = K₂^T F K₁.

The epipolar line equation

Given a point x₁ in image 1, the corresponding epipolar line in image 2 is:

l₂ = F x̃₁

This is a 3-vector [a, b, c]^T representing the line au + bv + c = 0 in pixel coordinates. The point x₂ must lie on this line: l₂^Tx̃₂ = 0, which is exactly the constraint x̃₂^TFx̃₁ = 0. Symmetrically, l₁ = F^Tx̃₂ is the epipolar line in image 1 for point x₂.

Scale ambiguity — slide the global scale

Both panels show the same two-view scenario. Left: baseline = 1 m, scene depth = 10 m. Right: baseline = 1 cm, scene depth = 10 cm. The slider multiplies both the baseline and all depths by the same factor. Watch: the projected images (shown as colored dots on image planes) are identical regardless of scale. This is why you cannot recover absolute scale from two images alone.

Global scale 1.00×

Summary: F works on pixels. E works on calibrated (K⁻¹ normalized) coordinates. If you know K, recover E from F. If you don't, use F directly and accept that you can't separate R and t from K. In robotics (calibrated cameras), E is standard. In internet-scale SfM (unknown cameras), F is standard.

Numerical sanity check

With K = diag(500, 500, 1) (focal length 500 px, principal point at origin) and the E computed above (E = [t]_×I for t=[1,0,0]):

F = K^−T E K⁻¹ = (1⁄500) · [[0,0,0],[0,0,−1],[0,1,0]] · (1⁄500)

So F = (1/250000) · [[0,0,0],[0,0,−1],[0,1,0]]. A pixel at x₁ = [250, 250, 1]^T (center-left): l₂ = Fx₁ = [0, −1/250000, 1/250000]^T, or equivalently the line −v + 1 = 0, i.e., v = 1 pixel row. The horizontal epipolar line makes sense for a horizontal baseline.

You estimate F from pixel correspondences. A friend says: "great, now we can recover R and t". What is the catch?

F gives R and t directly via SVD, no catch You also need to run RANSAC on F before extracting poses You need K to recover E = K^TFK from F; without K you can't separate the intrinsics from the pose F has 9 entries so you need 9 correspondences, not 8

Chapter 5: The 8-Point Algorithm

You have N ≥ 8 matched pixel pairs. You want to estimate F (or E for calibrated cameras). The 8-point algorithm (Hartley 1997) turns the epipolar constraint into a linear system.

Building the constraint matrix

Each correspondence (x₁ = [u₁, v₁, 1]^T, x₂ = [u₂, v₂, 1]^T) contributes one row to a constraint matrix A. Expanding x₂^TFx₁ = 0 and stacking the 9 entries of F as a vector f = vec(F):

a_k^T f = 0 where a_k^T = [u₂u₁, u₂v₁, u₂, v₂u₁, v₂v₁, v₂, u₁, v₁, 1]

Stack N such rows into A ∈ ℝ^N×9 and solve Af = 0. The unique (up to scale) solution is the right singular vector corresponding to the smallest singular value of A. With exactly 8 points and exact correspondences, A has rank 8 and the null space is 1D.

Worked rows

Match 1: x₁ = [100, 200, 1], x₂ = [95, 198, 1]
a₁^T = [100×95, 200×95, 95, 100×198, 200×198, 198, 100, 200, 1]
= [9500, 19000, 95, 19800, 39600, 198, 100, 200, 1]

Match 2: x₁ = [300, 150, 1], x₂ = [294, 151, 1]
a₂^T = [294×300, 294×150, 294, 151×300, 151×150, 151, 300, 150, 1]
= [88200, 44100, 294, 45300, 22650, 151, 300, 150, 1]

With 8 such rows assembled, SVD gives the least-squares solution for f (and thus F).

Rank-2 enforcement

The SVD solution gives a generic 3×3 matrix — it won't have rank 2. We enforce rank 2 by zeroing the smallest singular value:

F = U diag(σ₁, σ₂, 0) V^T

This is the nearest rank-2 matrix (in Frobenius norm) to the unconstrained estimate. The same step, with the averaging modification shown in Ch 3, projects onto the essential space.

Why normalization matters (Hartley's insight)

Raw pixel coordinates span [0, W] × [0, H] — numbers in the hundreds. The constraint matrix A has entries like u₂u₁ ≈ 10⁴ in one column and 1 in another. This poor conditioning makes the SVD numerically unstable. Hartley (1997) showed you should normalize each image's points to have zero mean and unit mean distance from origin before running the algorithm, then denormalize at the end. This simple step reduces errors by orders of magnitude.

Key insight: 5 is actually enough, 8 is convenient. E has 5 DOF (3 for R + 2 for the unit direction of t). Each correspondence gives 1 equation. So 5 correspondences suffice in theory — Nistér's 5-point algorithm is a minimal solver used in modern SLAM. The 8-point algorithm uses 8 because it ignores the E-specific structure and solves for a generic 3×3 matrix first. More data (N >> 8) is handled by least-squares minimization of ||Af||².

python
import numpy as np

def normalize_pts(pts):
    """Hartley normalization: zero mean, unit mean distance."""
    mu = pts.mean(axis=0)
    d  = np.linalg.norm(pts - mu, axis=1).mean()
    s  = np.sqrt(2) / (d + 1e-8)
    T  = np.array([[s, 0, -s*mu[0]],
                   [0, s, -s*mu[1]],
                   [0, 0, 1]])
    pts_h = np.hstack([pts, np.ones((len(pts),1))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """8-point algorithm for Fundamental matrix."""
    p1n, T1 = normalize_pts(pts1)
    p2n, T2 = normalize_pts(pts2)
    N = len(pts1)
    A = np.zeros((N, 9))
    for k in range(N):
        u1, v1 = p1n[k, 0], p1n[k, 1]
        u2, v2 = p2n[k, 0], p2n[k, 1]
        A[k] = [u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1]
    # Least-squares: right singular vector of A
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2
    U, s, Vt2 = np.linalg.svd(F)
    s[2] = 0
    F = U @ np.diag(s) @ Vt2
    # Denormalize
    F = T2.T @ F @ T1
    return F / F[2, 2]

After running the 8-point algorithm you get a 3×3 matrix F̂. Its singular values are {8.4, 3.1, 0.7}. What do you do next before using F̂ as a Fundamental matrix?

Nothing — it already looks like a valid Fundamental matrix Normalize so the singular values sum to 1 Zero the smallest singular value and recompose: F = U diag(8.4, 3.1, 0) V^T Run the algorithm again with 16 points to improve the estimate

Chapter 6: Recovering R & t from E

You have a valid Essential matrix E (either computed directly from calibrated coordinates, or recovered as E = K^TFK₁ from F). Now recover R and t. The result from Ma et al. (2004, Thm 5.7) gives exactly two rotation-translation pairs from E — and another two from −E, giving four candidates total.

SVD decomposition

Compute the SVD: E = UΣV^T. Define the 90° rotation matrix around the z-axis:

R_z(±π⁄2) = [[0, ∓1, 0], [±1, 0, 0], [0, 0, 1]]

The two pose candidates from E are:

R₁ = U R_z(+π⁄2) V^T t₁ = U R_z(+π⁄2) Σ U^T

R₂ = U R_z(−π⁄2) V^T t₂ = U R_z(−π⁄2) Σ U^T

Since −E is also a valid essential matrix (the constraint ỹ₂^TEỹ₁ = 0 holds for −E too), there are two more candidates from −E. Total: four (R, t) pairs.

The cheirality test

Three of the four candidates place the reconstructed 3D points behind one or both cameras (negative depth). Only one candidate has all points in front of both cameras. This is the cheirality constraint (from the Greek for "hand" — it distinguishes left from right, front from back). To apply it: triangulate a test point using each candidate (R, t), and check the signs of d₁ and d₂. The candidate with d₁ > 0 and d₂ > 0 for all (or most) test points is the correct one.

The four-fold ambiguity, explained geometrically. Two of the four solutions correspond to points in front of camera 1 but behind camera 2 (or vice versa). The fourth has points behind both cameras — the camera has "flipped" through the baseline. Only one solution has points in front of both cameras, which is the physical one. In practice you check just one or a few test points; that's enough to identify the correct solution.

python
import numpy as np

def recover_pose(E, pts1_cal, pts2_cal):
    """Recover R, t from E using cheirality test.
    pts1_cal, pts2_cal: calibrated 2D points (K^-1 * pixel)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0: U *= -1
    if np.linalg.det(Vt) < 0: Vt *= -1
    Rz90p = np.array([[0,-1,0],[1,0,0],[0,0,1]])
    Rz90m = np.array([[0, 1,0],[-1,0,0],[0,0,1]])
    candidates = [
        (U @ Rz90p @ Vt,  U[:, 2]),
        (U @ Rz90p @ Vt, -U[:, 2]),
        (U @ Rz90m @ Vt,  U[:, 2]),
        (U @ Rz90m @ Vt, -U[:, 2]),
    ]
    best_R, best_t, best_count = None, None, -1
    for R, t in candidates:
        count = 0
        for p1, p2 in zip(pts1_cal, pts2_cal):
            X = triangulate_point(p1, p2, R, t)
            d1 = X[2]
            d2 = (R @ X + t)[2]
            if d1 > 0 and d2 > 0: count += 1
        if count > best_count:
            best_count, best_R, best_t = count, R, t
    return best_R, best_t

Four-fold ambiguity: toggle the four (R, t) candidates

Each button shows one of the four decomposition candidates. Watch where the reconstructed point (orange dot) lands relative to both cameras (triangles). Only the correct candidate puts the point in front of both cameras.

After decomposing E into 4 candidate poses, you triangulate a single test point under each candidate. Two candidates yield d₁=−2, d₂=3; d₁=4, d₂=−1 respectively. The third gives d₁=−3, d₂=−5. The fourth gives d₁=4, d₂=3. Which candidate is physically correct?

First (d₁=−2, d₂=3) — one positive depth is enough Second (d₁=4, d₂=−1) — largest d₁ wins Third (d₁=−3, d₂=−5) — the most negative depths indicate the correct camera orientation Fourth (d₁=4, d₂=3) — both depths positive means the point is in front of both cameras

Chapter 7: Triangulation

You have R and t. You have calibrated correspondences (ỹ₁, ỹ₂). To find the 3D position of the matched point, you triangulate: intersect the two back-projected rays.

Ray from camera 1: all points d₁ỹ₁ for d₁ > 0. Ray from camera 2 expressed in camera 1's frame: all points Rỹ₂·d₂ + t for d₂ > 0. In the absence of noise these rays exactly intersect. With noise they are skew lines. We find the 3D point that minimizes reprojection error.

Linear triangulation (DLT)

Rewrite d₂ỹ₂ = Rỹ₁d₁ + t as two equations in (d₁, d₂):

[ỹ₂ | −Rỹ₁] · [d₂, d₁]^T = t

This is Ax = b with A ∈ ℝ^3×2, x = [d₂, d₁]^T, b = t. Solve via least-squares: x = (A^TA)⁻¹A^Tb. The 3D point is then P = d₁ỹ₁.

Worked numbers

Let R = I, t = [1, 0, 0]^T (camera 2 is 1 m to the right). A 3D point P = [0, 0, 5]. Camera 1 sees ỹ₁ = [0, 0, 1]^T (point straight ahead at depth 5). Camera 2 sees ỹ₂ = [−0.2, 0, 1]^T (point shifted left by 1/5).

triangulation worked example
# A = [y2 | -R*y1] where y2=[-.2,0,1], R*y1=[0,0,1]
# A = [[-0.2,  0],       (row 0: y2[0], -y1[0])
#      [ 0.0,  0],       (row 1: y2[1], -y1[1])
#      [ 1.0, -1]]       (row 2: y2[2], -y1[2])
# b = t = [1, 0, 0]
# A^T A = [[-0.2,0,1]·[-0.2,0,1], [-0.2,0,1]·[0,0,-1]] = [[1.04, -1],[-1, 1]]
# A^T b = [-0.2*1 + 0*0 + 1*0, 0*1 + 0*0 + (-1)*0] = [-0.2, 0]
# Solve: (A^T A) x = A^T b  →  d2≈1.0, d1≈5.0  ✓
# P = d1 * y1 = 5 * [0,0,1] = [0,0,5]  ✓

Linear triangulation — full code

python
def triangulate_point(y1, y2, R, t):
    """Linear triangulation.
    y1, y2: calibrated 2D points (K^-1 * pixel, shape (2,))
    R: rotation, t: translation (camera 2 relative to camera 1).
    Returns 3D point in camera 1 frame."""
    y1h = np.array([y1[0], y1[1], 1.0])
    y2h = np.array([y2[0], y2[1], 1.0])
    Ry2 = R @ y2h  # rotate ray from cam2 into cam1 frame
    # System: [y1h | -Ry2] [d1; d2] = t
    A = np.column_stack([y1h, -Ry2])   # shape (3, 2)
    # Least-squares: d = (A^T A)^-1 A^T t
    d, _, _, _ = np.linalg.lstsq(A, t, rcond=None)
    d1, d2 = d[0], d[1]
    # Midpoint estimate for robustness
    P_from_cam1 = d1 * y1h
    P_from_cam2 = R.T @ (d2 * y2h - t)
    return (P_from_cam1 + P_from_cam2) * 0.5

# Quick test: t=[1,0,0], R=I, y1=[0,0,1], y2=[-0.2,0,1], P=[0,0,5]
R, t = np.eye(3), np.array([1.0, 0, 0])
P_est = triangulate_point([0.0,0.0], [-0.2,0.0], R, t)
print(P_est)  # → [0. 0. 5.]  ✓

Why small baseline kills depth accuracy

The angular uncertainty in a correspondence translates to a large depth uncertainty when the rays are nearly parallel (small baseline). If the baseline is b and the scene is at depth Z, the depth estimation error scales as Z² / b. A drone moving 1 cm between frames cannot accurately triangulate a building 50 m away — the rays are nearly parallel and any noise in the match pushes the intersection point wildly. This is the motivation for having large baseline in stereo systems.

Misconception: more feature matches improve depth accuracy. More matches average away noise in R and t estimates (good!), but they don't fix the fundamental geometry: shallow-angle intersections give poor depth even with perfect R and t. The geometry of the baseline and scene depth, not the number of matches, limits triangulation quality.

Triangulation noise — two rays that nearly meet

Top-down view. Two cameras (triangles) look at a 3D point (green). With zero noise the two rays (blue, purple dashed lines) intersect exactly at the true point. Add noise: the rays become skew lines and the midpoint estimate (orange) drifts. Small baseline makes this much worse — the rays are nearly parallel, so even tiny angular noise causes a huge shift in the intersection.

Match noise (σ px) 0.00

Baseline (m) 1.00

A drone takes two frames 1 mm apart (tiny baseline) and triangulates a building 30 m away. A second test takes frames 1 m apart (large baseline) to the same building. Assuming same pixel noise in both cases, which gives more accurate depth estimates?

1 mm baseline — more correspondences fit the epipolar constraint Both give identical accuracy — depth accuracy depends only on feature detection quality 1 m baseline — wider baseline means the two rays intersect at a much larger angle, so depth error (Z²⁄b) is 1000× smaller 1 mm baseline — the shorter the baseline, the less relative rotation, which simplifies triangulation

Chapter 8: Showcase — Full SfM Simulator

This simulation runs the complete two-view geometry pipeline: generate synthetic 3D points, project them into two cameras, add noise, run the 8-point algorithm, recover R and t, and triangulate. Use the sliders to explore how each parameter degrades the reconstruction.

Two-View SfM — full pipeline

Left panel: top-down view of the two cameras (triangles) and the true 3D points (teal dots) vs reconstructed points (orange dots). Right panel: reprojection error histogram. Adjust sliders to break the reconstruction.

Baseline (m) 1.00

Noise (px) 0.50

N points 20

Chapter 9: Connections & Cheat Sheet

The full two-view geometry cheat sheet

Concept	Formula	Notes
Epipolar constraint (calibrated)	ỹ₂^T E ỹ₁ = 0	ỹ = K⁻¹x̃
Essential matrix	E = [t]_×R	rank 2, SVs {σ,σ,0}, 5 DOF
Fundamental matrix	F = K₂^−TEK₁⁻¹	works on pixels, 7 DOF
8-point algorithm	Af = 0 → SVD → rank-2 project	Normalize first (Hartley)
Pose from E	SVD → 4 candidates	Cheirality picks the correct one
Triangulation	[ỹ₂ \| −Rỹ₁][d₂;d₁] = t	Least-squares for depths
Scale ambiguity	\|\|t\|\| = 1 by convention	Need external reference for metric scale

Degraded cases — when two-view geometry fails

Scenario	What breaks	Fix
Pure rotation (no translation)	E = [0]_×R = 0 — singular, E undefined; cannot triangulate depth	Use homography instead; add IMU for translation
Small baseline	Rays nearly parallel → depth error ∝ Z²⁄b	Larger baseline; multi-view SfM for averaging
Planar scene	E degenerate; all points on one plane → homography better	Use homography H instead of E; degenerate 8-pt
Outlier matches	Even one outlier row poisons the linear system	RANSAC (L8): random 8-subset, fit, count inliers

Connections to the VNAV curriculum

L5: Image Formation

Intrinsic matrix K · back-projection · "one pixel = a ray" — the foundation that makes ỹ = K⁻¹x̃ meaningful

↓

L6: Feature Detection & Tracking

Harris corners, SIFT, KLT — produces the matched pairs (x₁, x₂) that feed the 8-point algorithm

↓

L7: Two-View Geometry (this lesson)

Epipolar constraint · E and F · 8-point · cheirality · triangulation · scale ambiguity

↓

L8: RANSAC

Random sample consensus: robustly estimate E⁄F in the presence of outlier matches (wrong correspondences from L6)

↓

VIO & SLAM

Chain two-view pose estimates over hundreds of frames; add IMU for metric scale and loop closure

Connections to VNAV L1 & L2: The skew-symmetric matrix [t]_× is the hat operator from L1/L2's SE(3) treatment. The rotation R in E = [t]_×R is an SO(3) element. The SVD decomposition for recovering R ensures R stays in SO(3) (det = +1) — the same manifold structure explored with Lie groups in L2.

Scale ambiguity in the wild. Real SLAM systems (ORB-SLAM3, COLMAP) handle scale ambiguity by either (a) using a stereo rig with fixed known baseline — then you know ||t|| exactly; (b) fusing with IMU which gives metric acceleration; (c) detecting loop closures and enforcing global consistency. Without one of these, monocular SfM produces a "shape-up-to-scale" reconstruction: the shape is correct, but every depth could be multiplied by any constant.

How two-view geometry fits into a full SLAM pipeline

Visual Odometry (VO)

Chain pairwise two-view estimates across hundreds of frames. Each frame pair gives a relative pose (R_k, t_k). Compose them: T_0→N = T_N−1 · T_N−2 · … · T₁. Drift accumulates since each R and t has small errors. Bundle adjustment periodically refines all poses and 3D points jointly to minimize reprojection error across all views.

VIO (adding IMU)

IMU provides: (1) metric acceleration, which integrates to give metric translation — solving the scale ambiguity; (2) orientation estimates via gyroscope, tightening R recovery; (3) high-rate prediction (100–400 Hz) between camera frames (10–30 Hz). Tight coupling of IMU and camera in a single factor graph is the foundation of modern VIO systems (VINS-Mono, OKVIS, MSCKF).

The full SfM pipeline in five steps: (1) Detect features (L6). (2) Match features across views (ratio test, L6). (3) Estimate F or E with 8-point + RANSAC (L8). (4) Recover R, t via SVD + cheirality. (5) Triangulate 3D points. Then add more views (multi-view SfM) and refine with bundle adjustment. This pipeline is the core of COLMAP, ORB-SLAM, and every visual navigation system in production today.

A monocular drone estimates its trajectory using two-view geometry between consecutive frames, no IMU, no known objects. What can it recover?

The complete 6-DOF trajectory in metric units (meters and radians) The rotation R exactly, and the direction of translation t, but not the magnitude of t — metric scale is unrecoverable without an external reference Only the rotation R — translation cannot be recovered from monocular images Nothing without an IMU — two cameras cannot determine relative pose

Two-View Geometry: Epipolar & SfM

Chapter 0: The Two-Photo Problem

Why matched pixels alone aren't enough

Chapter 1: Epipolar Geometry

The epipolar line equation

Epipoles: where all epipolar lines meet

Counting the epipolar geometry DOF

Chapter 2: The Epipolar Constraint

The skew-symmetric matrix [t]×

Worked numeric check

Non-degenerate example: rotation + translation

Chapter 3: The Essential Matrix

Properties of E

Stereo camera example — parallel cameras

Calibrated vs uncalibrated cameras

Chapter 4: The Fundamental Matrix

The epipolar line equation

Numerical sanity check

Chapter 5: The 8-Point Algorithm

Building the constraint matrix

Worked rows

Rank-2 enforcement

Why normalization matters (Hartley's insight)

Chapter 6: Recovering R & t from E

SVD decomposition

The cheirality test

Chapter 7: Triangulation

Linear triangulation (DLT)

Worked numbers

Linear triangulation — full code

Why small baseline kills depth accuracy

Chapter 8: Showcase — Full SfM Simulator

Chapter 9: Connections & Cheat Sheet

The full two-view geometry cheat sheet

Degraded cases — when two-view geometry fails

Connections to the VNAV curriculum

How two-view geometry fits into a full SLAM pipeline

Visual Odometry (VO)

VIO (adding IMU)

The skew-symmetric matrix [t]_×