Every pixel your autonomous vehicle sees is the shadow of a 3D ray collapsed to a single point. Navigation from cameras requires understanding this projection exactly — so we can invert it, reconstruct depth, and triangulate structure. This lesson derives the pinhole model from similar triangles, builds the full projection pipeline P = K[R|t], shows what each number in the intrinsic matrix K means physically, and confronts the depth-is-lost problem head-on. Hand-derived numbers, five interactive canvases, worked projection/back-projection/distortion examples. MIT 16.485 by Luca Carlone, Lecture 11.
Your robot drives toward a stop sign. Its front camera sees a red octagon filling 12% of the image. Is the sign 3 meters away, or 30? Both distances produce a sign at exactly the same pixel coordinates — only the apparent size changes, and size is ambiguous without knowing the real size of the sign.
This is not a sensor limitation that better hardware can fix. It is a mathematical fact: a camera collapses 3D space onto a 2D image, and that collapse is irreversible from a single view. One pixel corresponds to an entire ray of 3D points, all projecting to the same location.
To navigate, a robot must estimate ego-motion (how it moved between frames) and reconstruct the 3D structure of the environment. Both tasks require undoing the camera's projection — inverting the map from 3D to 2D. You cannot invert what you don't understand. This lesson derives the forward model so carefully that every step of the inverse becomes obvious.
The camera center is at the left. The image plane is the vertical line. Two 3D points sit on the same ray — different depths but identical pixel coordinates. Drag the depth slider to move the near point. Watch: the pixel never moves.
The pinhole camera model imagines a box with a single tiny hole at one face. Light from the world passes through the hole and lands on the opposite face — the image plane. No lens, no blur, just straight rays.
Place the camera center (the pinhole, also called the optical center) at the origin of the camera frame. The optical axis is the z-axis, pointing into the scene. The image plane is at distance f (the focal length) along the z-axis.
A 3D point in the camera frame has coordinates pc = (pxc, pyc, pzc). The ray from the origin through pc intersects the image plane at some 2D location (um, vm), where the subscript m means "in meters" (we convert to pixels later).
Look at the xz-plane only. The point is at (pxc, pzc). The image plane is at z = f. The ray from origin through the point hits z = f at:
This is pure similar-triangles geometry: the small triangle (origin to image plane, height um, base f) is similar to the large triangle (origin to point, height pxc, base pzc). By symmetry, the same holds for y:
A point is at pc = (0.3 m, −0.1 m, 2.0 m). Focal length f = 0.05 m (5 cm — a typical wide-angle lens).
The image is 7.5 mm right of center and 2.5 mm above center (y-axis flipped in image coords). Now look at pc = (0.6 m, −0.2 m, 4.0 m) — double the distance, double all coords. Same pixel! Scale ambiguity confirmed.
3D point (warm dot) in the scene projects through the pinhole onto the image plane. Drag sliders to change X, Z, or focal length. Watch how the projected pixel (teal) moves.
The projection formulas um = f·X/Z and vm = f·Y/Z involve a division by Z, making them nonlinear. Nonlinear equations are hard to chain together, hard to invert, and hard to reason about algebraically. Homogeneous coordinates solve this by embedding a 3D point into 4D space, transforming the nonlinear projection into a linear matrix multiplication — at the cost of one extra dimension.
A 3D point p = (X, Y, Z) becomes the homogeneous vector p̃ = (X, Y, Z, 1)T. A 2D image point (u, v) becomes (u, v, 1)T, or more generally (u·w, v·w, w)T for any w ≠ 0 — all such triples represent the same image point.
Watch the division-by-Z become the third row in a matrix product:
The right side gives (fX, fY, Z)T. Dividing all entries by Z (the third component) recovers (fX/Z, fY/Z, 1)T = (um, vm, 1)T. This "divide by the third entry to get physical coordinates" is called dehomogenization.
More compactly, using the 3×4 canonical projection matrix Π0:
Point pc = (0.3, −0.1, 2.0, 1)T in homogeneous form. With f = 0.05:
python import numpy as np f = 0.05 K_metric = np.array([[f, 0, 0], [0, f, 0], [0, 0, 1]]) Pi0 = np.array([[1,0,0,0], [0,1,0,0], [0,0,1,0]]) p_hom = np.array([0.3, -0.1, 2.0, 1.0]) lam_x = K_metric @ Pi0 @ p_hom # = [0.015, -0.005, 2.0] u_m = lam_x[0] / lam_x[2] # = 0.0075 m v_m = lam_x[1] / lam_x[2] # = -0.0025 m print(f"u_m = {u_m:.4f} m, v_m = {v_m:.4f} m") # u_m = 0.0075 m, v_m = -0.0025 m (same as hand calc)
We have projection in meters (um, vm). Sensors don't return meters — they return pixels. Converting involves two steps: scaling by pixel density and shifting the origin to the image corner.
A digital sensor has sx horizontal pixels per meter and sy vertical pixels per meter. Multiplying the metric coordinates by these densities converts to pixels:
where (ox, oy) is the principal point — the pixel where the optical axis pierces the image plane (typically near the center of the image). The shift is needed because the metric frame has its origin at the image center, but pixels are counted from the top-left corner.
Substituting um = f·X/Z and the pixel conversion together, and writing it as a single matrix acting on camera-frame coordinates:
where:
| Entry | Value | Physical meaning |
|---|---|---|
| fx = sx·f | e.g. 800 px | Focal length in horizontal pixels |
| fy = sy·f | e.g. 800 px | Focal length in vertical pixels |
| cx = ox | e.g. 320 px | Principal point x (column of optical axis) |
| cy = oy | e.g. 240 px | Principal point y (row of optical axis) |
| s | ≈ 0 | Pixel skew (non-rectangular pixels; essentially zero on modern cameras) |
The full expression, step by step:
Camera: fx=800, fy=800, cx=320, cy=240, s=0. Point in camera frame: (0.3, −0.1, 2.0).
A 3×3 grid of scene points is projected using your K. Adjust fx, fy, cx, cy and watch: bigger f = narrower field of view (zoom); moving cx/cy shifts the principal point marker (teal cross) and the whole projected grid.
The intrinsic matrix K handles the camera's internal geometry. But our 3D points often live in a world frame w, while K expects coordinates in the camera frame c. We need a transform that moves points from world to camera.
From VNAV L1, the pose of camera frame c relative to world frame w is the SE(3) matrix Twc. To transform a world point pw into camera coordinates pc, we need the inverse — how to express a world point in camera coordinates. This is Tcw = Twc−1:
where Rcw is the rotation from world to camera (a 3×3 orthogonal matrix) and tcw is the camera center position expressed in the camera frame (roughly, where the world origin is, as seen by the camera).
In homogeneous coordinates, this transform becomes a 3×4 matrix:
It maps a homogeneous world point p̃w = (Xw, Yw, Zw, 1)T to a (non-homogeneous) camera-frame vector (Xc, Yc, Zc)T.
tcw = −Rcw C, where C is the camera center in world coordinates. It is not the camera center position — it is the world origin expressed in the camera frame. This is a common confusion. To find where the camera is in the world: C = −RcwT tcw.
A cube of 3D world points (warm) is projected into the image. Adjust camera position (xcam, zcam) and yaw angle to change the extrinsics. Watch how the projected image (teal dots) shifts and distorts.
We now have all the pieces: extrinsics [R|t] to go from world to camera, and K to go from camera 3D to image pixels. Chaining them gives the projection matrix P — a single 3×4 matrix that projects any world point directly to a pixel.
Where p̃w = (Xw, Yw, Zw, 1)T is the world point in homogeneous coordinates and Zc is the depth in the camera frame (required for the final dehomogenization step).
Compute q = P p̃w ∈ R3. Then:
Camera at world position Cw = (1.0, 0, 0) m, looking along the z-axis (identity rotation). World point at pw = (1.3, −0.1, 3.0). K: fx=800, fy=800, cx=320, cy=240.
python import numpy as np # Camera at C_w = [1.0, 0, 0], looking along world-z (identity rotation) R_cw = np.eye(3) t_cw = -R_cw @ np.array([1.0, 0.0, 0.0]) # = [-1, 0, 0] K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]]) Ext = np.column_stack([R_cw, t_cw]) # 3x4 P = K @ Ext # 3x4 projection matrix p_world = np.array([1.3, -0.1, 3.0, 1.0]) q = P @ p_world # = [P*p], 3-vector u, v = q[0]/q[2], q[1]/q[2] print(f"pixel: ({u:.1f}, {v:.1f})") # Step by step: # p_c = R_cw @ p_world[:3] + t_cw = [1.3-1, -0.1, 3.0] = [0.3, -0.1, 3.0] # K @ p_c = [800*0.3/3+320, 800*(-0.1)/3+240] = [400, 213.3] # pixel: (400.0, 213.3)
Given a pixel (u, v), can we find the corresponding 3D point? Not uniquely — we can only find the ray along which the 3D point lies. This is the inverse of projection, and it is the foundation of triangulation, depth estimation, and structure-from-motion.
Starting from u = (fx X/Z) + cx, solve for X/Z:
This gives us the normalized image coordinates (xn, yn) = ((u−cx)/fx, (v−cy)/fy). The ray in camera coordinates is proportional to:
Any 3D point on the ray satisfies pc = λ dc for some scalar λ > 0. The depth λ is unknown from a single image.
K: fx=800, fy=800, cx=320, cy=240.
Ray direction (in camera frame): d = (0.15, −0.05, 1)T. Normalize: |d| = √(0.0225 + 0.0025 + 1) ≈ 1.0124. Unit ray: (0.148, −0.049, 0.988).
If we know depth Zc=2m: pc = 2 × (0.15, −0.05, 1) = (0.3, −0.1, 2.0) — which matches the original 3D point from Chapter 1. The pipeline is consistent.
python import numpy as np K = np.array([[800,0,320],[0,800,240],[0,0,1]]) K_inv = np.linalg.inv(K) pixel = np.array([440, 200, 1]) # homogeneous pixel ray = K_inv @ pixel # = [0.15, -0.05, 1.0] ray_unit = ray / np.linalg.norm(ray) print("Ray direction:", ray) # [0.15, -0.05, 1.0] # If depth Z_c = 2m is known: depth = 2.0 p_cam = depth * ray # = [0.3, -0.1, 2.0] ✓
The pinhole model assumes perfectly straight rays. Real lenses bend light. Wide-angle lenses bend it a lot. The result is lens distortion: straight lines in the world appear curved in the image.
The most common distortion is radial: the further a point is from the image center, the more it shifts outward (barrel distortion) or inward (pincushion distortion). The model in the camera frame (before adding the principal point):
where r2 = (udistort)2 + (vdistort)2 is the squared distance from the image center. The coefficients a1, a2 are called distortion coefficients:
| Sign | Type | Effect |
|---|---|---|
| a1 < 0 | Barrel distortion | Straight lines bow outward; common in wide-angle/fish-eye |
| a1 > 0 | Pincushion distortion | Straight lines bow inward; common in telephoto |
Using the image frame (with principal point), distortion correction becomes:
Distorted pixel at (ud=350, vd=150), cx=320, cy=240, a1=−0.2, a2=0.05.
Wait — that blows up! This is why distortion coefficients are always very small (|a1| < 0.5, |a2| < 0.1), and r is measured in normalized coordinates (not pixels). Let's redo with normalized coords (after dividing by f):
A gentle inward pull of 0.28% — typical barrel correction. The corrected normalized coords: (0.0375 × 0.99719, −0.1125 × 0.99719) ≈ (0.0374, −0.1122).
python import numpy as np def undistort_pixel(u_d, v_d, K, a1, a2): """Apply radial undistortion to a single pixel.""" fx, fy = K[0,0], K[1,1] cx, cy = K[0,2], K[1,2] # Normalize to camera frame (unitless) xn = (u_d - cx) / fx yn = (v_d - cy) / fy r2 = xn**2 + yn**2 factor = 1 + a1*r2 + a2*r2**2 # Undistorted normalized coords xn_u = xn * factor yn_u = yn * factor # Back to pixels u = xn_u * fx + cx v = yn_u * fy + cy return u, v K = np.array([[800,0,320],[0,800,240],[0,0,1]]) u, v = undistort_pixel(350, 150, K, a1=-0.2, a2=0.05) print(f"Undistorted: ({u:.2f}, {v:.2f})") # (349.92, 150.25)
A grid of undistorted points is shown (warm). Apply the distortion model: the teal dots show where those points appear in the raw (distorted) image. Drag a1 negative for barrel, positive for pincushion. The outer grid points distort most.
Now everything comes together. A 3D scene (wireframe cube) is projected through the full pipeline P = K[R|t], with optional radial distortion applied on top. You control K (focal length, principal point), camera pose (translation, yaw), and distortion coefficients. This is the complete model you need to understand to do any visual navigation.
Left half: top-down 3D view showing the cube (warm), camera (teal triangle), and projected rays. Right half: the resulting image with projected points and connecting lines. All parameters update live.
You now hold the complete forward model for how cameras see the world. Here is the distilled reference you'll reach for constantly as you go deeper into visual navigation.
| Quantity | Formula | Key point |
|---|---|---|
| Perspective projection (m) | um = f·X/Z, vm = f·Y/Z | Similar triangles; divide by depth |
| Intrinsic matrix K | [[fx, s, cx],[0, fy, cy],[0,0,1]] | fx=sxf in pixels; s≈0 |
| Extrinsic matrix | [Rcw | tcw] | World → camera; Cworld = −RTt |
| Full projection | P = K[R|t]; Z·[u,v,1]T=P·p̃w | Dehomogenize: u=q[0]/q[2] |
| Back-projection | d = K−1[u,v,1]T | One pixel = one ray, depth unknown |
| Scale ambiguity | p & λp same pixel for any λ>0 | Need stereo/motion/LiDAR for depth |
| Radial distortion | factor = 1+a1r2+a2r4 | r in normalized coords; undistort before K |
"The camera does not capture reality. It records projections of reality. Understanding the projection is the first step toward recovering reality."
— Luca Carlone, MIT 16.485