You have two photos of the same street corner, taken from slightly different spots. Hundreds of feature correspondences link pixels between the images. From those pairs alone — no GPS, no IMU — can you deduce how much the camera moved, and where each matched point sits in 3D space? Yes. But the geometry connecting the two views imposes a tight constraint first: every correct match must lie on a specific line in the second image. That constraint is epipolar geometry, and it is the bridge from pixels to poses. MIT 16.485 by Luca Carlone, Lecture 14.
A drone flies past a building. Its forward-facing camera snaps two frames 0.3 seconds apart. Your feature tracker (Lecture 12–13) hands you 200 matched pixel pairs: point (u1, v1) in frame 1 corresponds to (u2, v2) in frame 2. Those pairs all picture the same physical points — a window corner, a ledge, a drainpipe. Now what?
The matched pixels alone don't tell you anything about how the camera moved. A 1-pixel shift could mean the drone moved 1 cm or 100 m, depending on the distance to the building. And each pixel is just a ray shooting out from the camera — the feature could be anywhere along that ray. You need more structure.
The saving grace is that the two cameras are geometrically related. They are both looking at the same rigid 3D world. That rigidity imposes a constraint between every correct correspondence — a constraint that can be written as a single equation involving only the pixel coordinates, the camera rotation R, and the translation t between the two views. Epipolar geometry makes that constraint explicit.
From L5, a calibrated pixel (u, v) in camera 1 back-projects to a ray: every 3D point d · [x, y, 1]T (in normalized camera coordinates) maps to that pixel, for any depth d > 0. The same matching pixel in camera 2 back-projects to another ray through camera 2's origin. The two rays should meet at the 3D point — but you don't know R and t yet, so you can't intersect them.
The epipolar constraint is what you can assert before knowing R and t. It says: whatever R and t are, the correct match must satisfy a single algebraic equation. If you collect enough correct matches, that equation system is solvable for R and t.
Drag the 3D point (blue dot) or the second camera position (orange dot) to see how the projected image positions change. Notice that the match carries no depth information on its own — the point could be anywhere along either dashed ray.
Place two cameras in space. Pick any 3D point P visible to both. Draw lines from P to each camera's optical center O1 and O2. Those three points — P, O1, O2 — define a plane. This plane is called the epipolar plane.
The epipolar plane intersects each image plane in a line. These are the epipolar lines: in camera 1's image, the epipolar line corresponding to a point in camera 2; and vice versa. All epipolar lines in camera 1 pass through a single special point called the epipole e1 — the image of camera 2's optical center as seen by camera 1. Symmetrically, e2 is camera 1's center projected into camera 2.
Here is the key geometric fact: as P moves along a 3D ray through O1, the epipolar plane rotates around the baseline (the line through O1 and O2). The image of P in camera 1 stays fixed (it's always the same pixel on the ray), but the image in camera 2 traces out the epipolar line. So: given the pixel in camera 1, the epipolar geometry tells you exactly which line to search in camera 2.
This is not just elegant geometry — it has a direct computational consequence. Stereo matching algorithms exploit it to limit correlation search to single scan-lines. Dense stereo SLAM (like DepthAI or ZED camera systems) is powered by exactly this: rectify the image pair so epipolar lines become horizontal, then search each row independently. The reduction from 2D search (W × H comparisons per pixel) to 1D search (W comparisons per pixel) yields a roughly ×H speedup — typically 500× for a 500-pixel-tall image.
Given a point ỹ1 in camera 1's calibrated coordinates, the corresponding epipolar line in camera 2 is given by:
This is a 3-vector [a, b, c]T representing the line au + bv + c = 0 in calibrated coordinates. Any correct match ỹ2 must satisfy l2Tỹ2 = 0 — which is exactly the epipolar constraint ỹ2TEỹ1 = 0. The distance from a point ỹ2 to this line is |l2Tỹ2| / ||[a,b]|| and is called the Sampson distance — a first-order approximation to the reprojection error used in RANSAC inlier scoring.
The epipole e1 is the projection of O2 into camera 1. It may fall outside the image frame — that's fine, it's still a well-defined point (at infinity if the cameras are side-by-side with the same heading). All epipolar lines in camera 1 pass through e1. Symmetrically, all epipolar lines in camera 2 pass through e2.
Special case: parallel cameras (e.g., a stereo pair with identical orientations). Here O1 and O2 differ only in the horizontal direction. The epipoles are at infinity, and all epipolar lines are horizontal — the y-coordinate of a match is identical in both images. This is the classical stereo rectification setup.
The relative pose between two cameras has 6 DOF (3 rotation + 3 translation). But two-view geometry only determines translation direction, not magnitude — that's the scale ambiguity. So the epipolar geometry has 5 DOF: 3 for R, 2 for the unit direction of t. This is why the minimal solver for the essential matrix needs exactly 5 correspondences (the 5-point algorithm), and why the 8-point algorithm over-parameterizes by 3 DOF (it fits a general 3×3 matrix with 8 DOF before projecting to rank 2).
Left: camera 1's image. Right: camera 2's image. Drag the orange dot (3D point position projected onto left image). Watch how the epipolar line in camera 2 (right panel) updates — the match for the orange dot must lie on that teal line. The blue dots show the epipoles.
Let's derive the epipolar constraint from scratch. Set camera 1 at the world origin with no rotation — a convenient choice that doesn't lose generality. Camera 2 has position t and rotation R relative to camera 1. A 3D point P is at depth d1 in camera 1's frame, so:
where ỹ1 = K−1x̃1 and ỹ2 = K−1x̃2 are the calibrated (normalized) coordinates — pixel coordinates pre-multiplied by K−1 to remove the camera's intrinsics. From the left equation, Pcam1 = d1ỹ1. Substituting into the right:
Now cross-multiply both sides by t (i.e., apply [t]×, the skew-symmetric matrix of t). Since [t]×t = 0 (any vector crossed with itself is zero):
Dot both sides with ỹ2T. The left side becomes d2 ỹ2T[t]×ỹ2 = 0, because [t]×ỹ2 is orthogonal to ỹ2 (cross product gives a perpendicular vector). So:
This is the epipolar constraint. It must hold for every correct correspondence (ỹ1, ỹ2). The quantity in the middle — [t]×R — is the Essential matrix E (Chapter 3).
If t = [tx, ty, tz]T, the skew-symmetric (hat) matrix is:
This is the hat operator from VNAV L1/L2 — the same tool used to convert the angular velocity vector to the cross-product matrix. Here it converts t into a matrix so the cross product t × v can be written as [t]×v.
Let t = [1, 0, 0]T (camera 2 is 1 unit to the right), R = I (no rotation). A 3D point at [0, 0, 5] in camera 1's frame. Normalized coords: ỹ1 = [0, 0, 1]T; ỹ2 = [−1/5, 0, 1]T (camera 2 sees the point shifted left by 1/5 = baseline/depth). Check:
epipolar constraint check # t = [1,0,0], [t]× = [[0,0,0],[0,0,-1],[0,1,0]] (skew of [1,0,0]) # Wait: [1,0,0] skew: [ 0, -tz, ty ; tz, 0, -tx ; -ty, tx, 0 ] # = [ 0, 0, 0 ; 0, 0, -1 ; 0, 1, 0 ] E = [t]× R = [t]× (since R=I) # y2^T E y1 = [-1/5, 0, 1] * [[0,0,0],[0,0,-1],[0,1,0]] * [0,0,1] # E y1 = [0,0,0;0,0,-1;0,1,0]*[0,0,1] = [0,-1,0] # y2^T * [0,-1,0] = (-1/5)(0) + (0)(-1) + (1)(0) = 0 ✓ constraint holds
Now test a wrong match: ỹ2,wrong = [0, 0, 1]T (pointing straight forward in camera 2, not the correct leftward-shifted point). Then ỹ2,wrongT E ỹ1 = [0,0,1]·[0,−1,0] = 0 (by coincidence in this degenerate case — but for a general rotation and t, wrong matches give nonzero values). In practice you test |ỹ2TEỹ1| < threshold.
Let t = [0.5, 0, 0.866]T (camera 2 is 1 unit forward-right at 60°), R = rotation of −30° around y-axis. A point P = [1, 0, 5] in camera 1's frame. First, ỹ1 = [1/5, 0, 1]T = [0.2, 0, 1]. In camera 2's frame, P2 = RP − Rt (relative pose), approximately [0.4, 0, 4.7] after the rotation, so ỹ2 ≈ [0.085, 0, 1]. The epipolar constraint check:
non-degenerate constraint verification import numpy as np # Camera 2 rotated -30deg around Y from camera 1 theta = np.radians(-30) R = np.array([[np.cos(theta), 0, np.sin(theta)], [0, 1, 0], [-np.sin(theta), 0, np.cos(theta)]]) t = np.array([0.5, 0, 0.866]) # ||t|| ≈ 1 t_hat = np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]) E = t_hat @ R P = np.array([1.0, 0.0, 5.0]) # 3D point in cam1 frame y1 = P / P[2] # [0.2, 0, 1] P2 = R @ P + t # point in cam2 frame y2 = P2 / P2[2] # ≈ [0.085, 0, 1] constraint = y2 @ E @ y1 # should be ≈ 0 print(f"Constraint: {constraint:.2e}") # → ~0.0 ✓ # Wrong match — shift y2 by (0.3, 0, 0) y2_wrong = y2 + np.array([0.3, 0, 0]) constraint_wrong = y2_wrong @ E @ y1 print(f"Wrong match: {constraint_wrong:.4f}") # → nonzero ≈ -0.28
The Essential matrix is the 3×3 matrix at the heart of the epipolar constraint for calibrated cameras:
It encodes the full relative pose between two calibrated cameras — six DOF in principle (3 for R, 3 for t), but reduced to 5 DOF because E is scale-invariant (multiplying t by any scalar leaves E unchanged up to scale, since the epipolar constraint ỹ2TEỹ1=0 is homogeneous). This is the scale ambiguity: from two images alone you can recover the direction of t but not its magnitude.
An essential matrix has singular values {σ, σ, 0}. The two equal nonzero singular values reflect the rotational degrees of freedom; the zero singular value reflects the fact that E is rank-2 (the epipolar constraint collapses the 3D cone of rays to a line in the other image). After running the 8-point algorithm (Ch 5), we project the estimated matrix onto the essential space by enforcing this structure via SVD:
where UΣVT is the SVD of the unconstrained estimate. This "closest essential matrix" projection is the standard post-processing step.
The simplest case: t = [b, 0, 0]T (horizontal baseline b), R = I. The skew matrix [t]× has only two nonzero entries: −b in position (1,2) and b in position (2,1). Then E = [t]×I = [t]×, so:
Working this out with ỹ1 = [u1, v1, 1] and ỹ2 = [u2, v2, 1]: the result is b(v2 − v1) = 0, so v2 = v1. For a horizontal baseline and no rotation, the match must have the same vertical coordinate in both images — the epipolar lines are horizontal scan-lines. This is exactly the stereo rectification property.
The derivation above used normalized coordinates ỹ = K−1x̃ — pixel coordinates mapped to unit focal-length space. This assumes K is known. When cameras are calibrated (K known), we work with E. When K is unknown, we must work with raw pixel coordinates and the Fundamental matrix F = K−TEK−1 instead (Chapter 4).
python import numpy as np def hat(t): """Skew-symmetric (hat) matrix of a 3-vector.""" return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]) def essential_matrix(R, t): """E = [t]× R (calibrated cameras, unit ||t||).""" t_hat = hat(t / np.linalg.norm(t)) # normalize t return t_hat @ R # Example: camera 2 is 1 unit right, no rotation R = np.eye(3) t = np.array([1.0, 0.0, 0.0]) E = essential_matrix(R, t) # E = [[0, 0, 0],[0, 0,-1],[0, 1, 0]] # Singular values of E should be {1, 1, 0} U, s, Vt = np.linalg.svd(E) print(s) # → [1. 1. 0.] ✓
What if you don't know K? Many SfM pipelines encounter images from unknown cameras — internet photos, historical footage, smartphone images with EXIF stripped. The Fundamental matrix F generalizes the epipolar constraint to raw pixel coordinates without requiring calibration:
where x̃1, x̃2 are pixel coordinates (not normalized). The relationship between F and E is:
F absorbs K into the constraint — it works directly with raw pixel coordinates. Like E, F is rank-2 and determined only up to scale, so it has 7 degrees of freedom (9 entries minus 1 for scale minus 1 for the rank constraint). In practice you estimate F with the 8-point algorithm on pixel coordinates, then if you know K you can recover E = K2T F K1.
Given a point x1 in image 1, the corresponding epipolar line in image 2 is:
This is a 3-vector [a, b, c]T representing the line au + bv + c = 0 in pixel coordinates. The point x2 must lie on this line: l2Tx̃2 = 0, which is exactly the constraint x̃2TFx̃1 = 0. Symmetrically, l1 = FTx̃2 is the epipolar line in image 1 for point x2.
Both panels show the same two-view scenario. Left: baseline = 1 m, scene depth = 10 m. Right: baseline = 1 cm, scene depth = 10 cm. The slider multiplies both the baseline and all depths by the same factor. Watch: the projected images (shown as colored dots on image planes) are identical regardless of scale. This is why you cannot recover absolute scale from two images alone.
With K = diag(500, 500, 1) (focal length 500 px, principal point at origin) and the E computed above (E = [t]×I for t=[1,0,0]):
So F = (1/250000) · [[0,0,0],[0,0,−1],[0,1,0]]. A pixel at x1 = [250, 250, 1]T (center-left): l2 = Fx1 = [0, −1/250000, 1/250000]T, or equivalently the line −v + 1 = 0, i.e., v = 1 pixel row. The horizontal epipolar line makes sense for a horizontal baseline.
You have N ≥ 8 matched pixel pairs. You want to estimate F (or E for calibrated cameras). The 8-point algorithm (Hartley 1997) turns the epipolar constraint into a linear system.
Each correspondence (x1 = [u1, v1, 1]T, x2 = [u2, v2, 1]T) contributes one row to a constraint matrix A. Expanding x2TFx1 = 0 and stacking the 9 entries of F as a vector f = vec(F):
Stack N such rows into A ∈ ℝN×9 and solve Af = 0. The unique (up to scale) solution is the right singular vector corresponding to the smallest singular value of A. With exactly 8 points and exact correspondences, A has rank 8 and the null space is 1D.
Match 1: x1 = [100, 200, 1], x2 = [95, 198, 1]
a1T = [100×95, 200×95, 95, 100×198, 200×198, 198, 100, 200, 1]
= [9500, 19000, 95, 19800, 39600, 198, 100, 200, 1]
Match 2: x1 = [300, 150, 1], x2 = [294, 151, 1]
a2T = [294×300, 294×150, 294, 151×300, 151×150, 151, 300, 150, 1]
= [88200, 44100, 294, 45300, 22650, 151, 300, 150, 1]
With 8 such rows assembled, SVD gives the least-squares solution for f (and thus F).
The SVD solution gives a generic 3×3 matrix — it won't have rank 2. We enforce rank 2 by zeroing the smallest singular value:
This is the nearest rank-2 matrix (in Frobenius norm) to the unconstrained estimate. The same step, with the averaging modification shown in Ch 3, projects onto the essential space.
Raw pixel coordinates span [0, W] × [0, H] — numbers in the hundreds. The constraint matrix A has entries like u2u1 ≈ 104 in one column and 1 in another. This poor conditioning makes the SVD numerically unstable. Hartley (1997) showed you should normalize each image's points to have zero mean and unit mean distance from origin before running the algorithm, then denormalize at the end. This simple step reduces errors by orders of magnitude.
python import numpy as np def normalize_pts(pts): """Hartley normalization: zero mean, unit mean distance.""" mu = pts.mean(axis=0) d = np.linalg.norm(pts - mu, axis=1).mean() s = np.sqrt(2) / (d + 1e-8) T = np.array([[s, 0, -s*mu[0]], [0, s, -s*mu[1]], [0, 0, 1]]) pts_h = np.hstack([pts, np.ones((len(pts),1))]) return (T @ pts_h.T).T, T def eight_point(pts1, pts2): """8-point algorithm for Fundamental matrix.""" p1n, T1 = normalize_pts(pts1) p2n, T2 = normalize_pts(pts2) N = len(pts1) A = np.zeros((N, 9)) for k in range(N): u1, v1 = p1n[k, 0], p1n[k, 1] u2, v2 = p2n[k, 0], p2n[k, 1] A[k] = [u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1] # Least-squares: right singular vector of A _, _, Vt = np.linalg.svd(A) F = Vt[-1].reshape(3, 3) # Enforce rank 2 U, s, Vt2 = np.linalg.svd(F) s[2] = 0 F = U @ np.diag(s) @ Vt2 # Denormalize F = T2.T @ F @ T1 return F / F[2, 2]
You have a valid Essential matrix E (either computed directly from calibrated coordinates, or recovered as E = KTFK1 from F). Now recover R and t. The result from Ma et al. (2004, Thm 5.7) gives exactly two rotation-translation pairs from E — and another two from −E, giving four candidates total.
Compute the SVD: E = UΣVT. Define the 90° rotation matrix around the z-axis:
The two pose candidates from E are:
Since −E is also a valid essential matrix (the constraint ỹ2TEỹ1 = 0 holds for −E too), there are two more candidates from −E. Total: four (R, t) pairs.
Three of the four candidates place the reconstructed 3D points behind one or both cameras (negative depth). Only one candidate has all points in front of both cameras. This is the cheirality constraint (from the Greek for "hand" — it distinguishes left from right, front from back). To apply it: triangulate a test point using each candidate (R, t), and check the signs of d1 and d2. The candidate with d1 > 0 and d2 > 0 for all (or most) test points is the correct one.
python import numpy as np def recover_pose(E, pts1_cal, pts2_cal): """Recover R, t from E using cheirality test. pts1_cal, pts2_cal: calibrated 2D points (K^-1 * pixel).""" U, _, Vt = np.linalg.svd(E) if np.linalg.det(U) < 0: U *= -1 if np.linalg.det(Vt) < 0: Vt *= -1 Rz90p = np.array([[0,-1,0],[1,0,0],[0,0,1]]) Rz90m = np.array([[0, 1,0],[-1,0,0],[0,0,1]]) candidates = [ (U @ Rz90p @ Vt, U[:, 2]), (U @ Rz90p @ Vt, -U[:, 2]), (U @ Rz90m @ Vt, U[:, 2]), (U @ Rz90m @ Vt, -U[:, 2]), ] best_R, best_t, best_count = None, None, -1 for R, t in candidates: count = 0 for p1, p2 in zip(pts1_cal, pts2_cal): X = triangulate_point(p1, p2, R, t) d1 = X[2] d2 = (R @ X + t)[2] if d1 > 0 and d2 > 0: count += 1 if count > best_count: best_count, best_R, best_t = count, R, t return best_R, best_t
Each button shows one of the four decomposition candidates. Watch where the reconstructed point (orange dot) lands relative to both cameras (triangles). Only the correct candidate puts the point in front of both cameras.
You have R and t. You have calibrated correspondences (ỹ1, ỹ2). To find the 3D position of the matched point, you triangulate: intersect the two back-projected rays.
Ray from camera 1: all points d1ỹ1 for d1 > 0. Ray from camera 2 expressed in camera 1's frame: all points Rỹ2·d2 + t for d2 > 0. In the absence of noise these rays exactly intersect. With noise they are skew lines. We find the 3D point that minimizes reprojection error.
Rewrite d2ỹ2 = Rỹ1d1 + t as two equations in (d1, d2):
This is Ax = b with A ∈ ℝ3×2, x = [d2, d1]T, b = t. Solve via least-squares: x = (ATA)−1ATb. The 3D point is then P = d1ỹ1.
Let R = I, t = [1, 0, 0]T (camera 2 is 1 m to the right). A 3D point P = [0, 0, 5]. Camera 1 sees ỹ1 = [0, 0, 1]T (point straight ahead at depth 5). Camera 2 sees ỹ2 = [−0.2, 0, 1]T (point shifted left by 1/5).
triangulation worked example # A = [y2 | -R*y1] where y2=[-.2,0,1], R*y1=[0,0,1] # A = [[-0.2, 0], (row 0: y2[0], -y1[0]) # [ 0.0, 0], (row 1: y2[1], -y1[1]) # [ 1.0, -1]] (row 2: y2[2], -y1[2]) # b = t = [1, 0, 0] # A^T A = [[-0.2,0,1]·[-0.2,0,1], [-0.2,0,1]·[0,0,-1]] = [[1.04, -1],[-1, 1]] # A^T b = [-0.2*1 + 0*0 + 1*0, 0*1 + 0*0 + (-1)*0] = [-0.2, 0] # Solve: (A^T A) x = A^T b → d2≈1.0, d1≈5.0 ✓ # P = d1 * y1 = 5 * [0,0,1] = [0,0,5] ✓
python def triangulate_point(y1, y2, R, t): """Linear triangulation. y1, y2: calibrated 2D points (K^-1 * pixel, shape (2,)) R: rotation, t: translation (camera 2 relative to camera 1). Returns 3D point in camera 1 frame.""" y1h = np.array([y1[0], y1[1], 1.0]) y2h = np.array([y2[0], y2[1], 1.0]) Ry2 = R @ y2h # rotate ray from cam2 into cam1 frame # System: [y1h | -Ry2] [d1; d2] = t A = np.column_stack([y1h, -Ry2]) # shape (3, 2) # Least-squares: d = (A^T A)^-1 A^T t d, _, _, _ = np.linalg.lstsq(A, t, rcond=None) d1, d2 = d[0], d[1] # Midpoint estimate for robustness P_from_cam1 = d1 * y1h P_from_cam2 = R.T @ (d2 * y2h - t) return (P_from_cam1 + P_from_cam2) * 0.5 # Quick test: t=[1,0,0], R=I, y1=[0,0,1], y2=[-0.2,0,1], P=[0,0,5] R, t = np.eye(3), np.array([1.0, 0, 0]) P_est = triangulate_point([0.0,0.0], [-0.2,0.0], R, t) print(P_est) # → [0. 0. 5.] ✓
The angular uncertainty in a correspondence translates to a large depth uncertainty when the rays are nearly parallel (small baseline). If the baseline is b and the scene is at depth Z, the depth estimation error scales as Z2 / b. A drone moving 1 cm between frames cannot accurately triangulate a building 50 m away — the rays are nearly parallel and any noise in the match pushes the intersection point wildly. This is the motivation for having large baseline in stereo systems.
Top-down view. Two cameras (triangles) look at a 3D point (green). With zero noise the two rays (blue, purple dashed lines) intersect exactly at the true point. Add noise: the rays become skew lines and the midpoint estimate (orange) drifts. Small baseline makes this much worse — the rays are nearly parallel, so even tiny angular noise causes a huge shift in the intersection.
This simulation runs the complete two-view geometry pipeline: generate synthetic 3D points, project them into two cameras, add noise, run the 8-point algorithm, recover R and t, and triangulate. Use the sliders to explore how each parameter degrades the reconstruction.
Left panel: top-down view of the two cameras (triangles) and the true 3D points (teal dots) vs reconstructed points (orange dots). Right panel: reprojection error histogram. Adjust sliders to break the reconstruction.
| Concept | Formula | Notes |
|---|---|---|
| Epipolar constraint (calibrated) | ỹ2T E ỹ1 = 0 | ỹ = K−1x̃ |
| Essential matrix | E = [t]×R | rank 2, SVs {σ,σ,0}, 5 DOF |
| Fundamental matrix | F = K2−TEK1−1 | works on pixels, 7 DOF |
| 8-point algorithm | Af = 0 → SVD → rank-2 project | Normalize first (Hartley) |
| Pose from E | SVD → 4 candidates | Cheirality picks the correct one |
| Triangulation | [ỹ2 | −Rỹ1][d2;d1] = t | Least-squares for depths |
| Scale ambiguity | ||t|| = 1 by convention | Need external reference for metric scale |
| Scenario | What breaks | Fix |
|---|---|---|
| Pure rotation (no translation) | E = [0]×R = 0 — singular, E undefined; cannot triangulate depth | Use homography instead; add IMU for translation |
| Small baseline | Rays nearly parallel → depth error ∝ Z2⁄b | Larger baseline; multi-view SfM for averaging |
| Planar scene | E degenerate; all points on one plane → homography better | Use homography H instead of E; degenerate 8-pt |
| Outlier matches | Even one outlier row poisons the linear system | RANSAC (L8): random 8-subset, fit, count inliers |
Chain pairwise two-view estimates across hundreds of frames. Each frame pair gives a relative pose (Rk, tk). Compose them: T0→N = TN−1 · TN−2 · … · T1. Drift accumulates since each R and t has small errors. Bundle adjustment periodically refines all poses and 3D points jointly to minimize reprojection error across all views.
IMU provides: (1) metric acceleration, which integrates to give metric translation — solving the scale ambiguity; (2) orientation estimates via gyroscope, tightening R recovery; (3) high-rate prediction (100–400 Hz) between camera frames (10–30 Hz). Tight coupling of IMU and camera in a single factor graph is the foundation of modern VIO systems (VINS-Mono, OKVIS, MSCKF).