Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 11

Image Formation: The Pinhole Camera Model

Every pixel your autonomous vehicle sees is the shadow of a 3D ray collapsed to a single point. Navigation from cameras requires understanding this projection exactly — so we can invert it, reconstruct depth, and triangulate structure. This lesson derives the pinhole model from similar triangles, builds the full projection pipeline P = K[R|t], shows what each number in the intrinsic matrix K means physically, and confronts the depth-is-lost problem head-on. Hand-derived numbers, five interactive canvases, worked projection/back-projection/distortion examples. MIT 16.485 by Luca Carlone, Lecture 11.

Prerequisites: VNAV L1 (3D frames & SE(3)). Basic similar triangles. No linear algebra beyond matrix multiplication.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Depth-Loss Problem

Your robot drives toward a stop sign. Its front camera sees a red octagon filling 12% of the image. Is the sign 3 meters away, or 30? Both distances produce a sign at exactly the same pixel coordinates — only the apparent size changes, and size is ambiguous without knowing the real size of the sign.

This is not a sensor limitation that better hardware can fix. It is a mathematical fact: a camera collapses 3D space onto a 2D image, and that collapse is irreversible from a single view. One pixel corresponds to an entire ray of 3D points, all projecting to the same location.

One pixel = one ray. When you look at a pixel, you know the direction of the corresponding 3D point (the ray through the camera center), but not the distance along that ray. Depth is the lost dimension. Recovering it requires either multiple views (stereo, SfM) or additional sensors (LiDAR, radar).

Why does this matter for autonomous navigation?

To navigate, a robot must estimate ego-motion (how it moved between frames) and reconstruct the 3D structure of the environment. Both tasks require undoing the camera's projection — inverting the map from 3D to 2D. You cannot invert what you don't understand. This lesson derives the forward model so carefully that every step of the inverse becomes obvious.

Depth ambiguity: two points, one pixel

The camera center is at the left. The image plane is the vertical line. Two 3D points sit on the same ray — different depths but identical pixel coordinates. Drag the depth slider to move the near point. Watch: the pixel never moves.

Depth of near point (m) 2.0
A camera captures a scene. Point A is at depth Z=2m and point B is at depth Z=6m. Both appear at exactly the same pixel (u,v). What can you conclude?

Chapter 1: Similar Triangles: x = fX/Z

The pinhole camera model imagines a box with a single tiny hole at one face. Light from the world passes through the hole and lands on the opposite face — the image plane. No lens, no blur, just straight rays.

Setting up coordinates

Place the camera center (the pinhole, also called the optical center) at the origin of the camera frame. The optical axis is the z-axis, pointing into the scene. The image plane is at distance f (the focal length) along the z-axis.

A 3D point in the camera frame has coordinates pc = (pxc, pyc, pzc). The ray from the origin through pc intersects the image plane at some 2D location (um, vm), where the subscript m means "in meters" (we convert to pixels later).

Deriving um = f · pxc / pzc

Look at the xz-plane only. The point is at (pxc, pzc). The image plane is at z = f. The ray from origin through the point hits z = f at:

um / f = pxc / pzc   ⇒   um = f · pxc / pzc

This is pure similar-triangles geometry: the small triangle (origin to image plane, height um, base f) is similar to the large triangle (origin to point, height pxc, base pzc). By symmetry, the same holds for y:

vm = f · pyc / pzc
Perspective division. Both formulas divide by pzc, the depth. This single division is the source of all perspective effects: objects appear smaller as depth increases (divide by larger pz), and the rate of apparent motion of a moving object depends on how far it is (optical flow ∝ 1/Z).

Worked example

A point is at pc = (0.3 m, −0.1 m, 2.0 m). Focal length f = 0.05 m (5 cm — a typical wide-angle lens).

um = 0.05 × (0.3 / 2.0) = 0.0075 m = 7.5 mm
vm = 0.05 × (−0.1 / 2.0) = −0.0025 m = −2.5 mm

The image is 7.5 mm right of center and 2.5 mm above center (y-axis flipped in image coords). Now look at pc = (0.6 m, −0.2 m, 4.0 m) — double the distance, double all coords. Same pixel! Scale ambiguity confirmed.

Misconception: focal length is a lens property. The focal length f here is in meters — it is a physical distance (how far the image plane is from the pinhole). But "focal length in pixels" (what your K matrix stores) is f measured in units of pixel-widths. A 50mm lens on a 1-inch sensor gives different pixel focal lengths than the same lens on a smaller phone sensor. More on this in Chapter 3.
Pinhole projection — 3D points through the optical center

3D point (warm dot) in the scene projects through the pinhole onto the image plane. Drag sliders to change X, Z, or focal length. Watch how the projected pixel (teal) moves.

X (m) 0.50
Z depth (m) 2.0
Focal length f (m) 0.050
A point at (X=1m, Z=4m) in camera coordinates projects to pixel x using f=0.04m. What is um?

Chapter 2: Homogeneous Coordinates

The projection formulas um = f·X/Z and vm = f·Y/Z involve a division by Z, making them nonlinear. Nonlinear equations are hard to chain together, hard to invert, and hard to reason about algebraically. Homogeneous coordinates solve this by embedding a 3D point into 4D space, transforming the nonlinear projection into a linear matrix multiplication — at the cost of one extra dimension.

The homogeneous lift

A 3D point p = (X, Y, Z) becomes the homogeneous vector p̃ = (X, Y, Z, 1)T. A 2D image point (u, v) becomes (u, v, 1)T, or more generally (u·w, v·w, w)T for any w ≠ 0 — all such triples represent the same image point.

Projection as matrix multiplication

Watch the division-by-Z become the third row in a matrix product:

Z · [um, vm, 1]T = [f, 0, 0; 0, f, 0; 0, 0, 1] · [X, Y, Z]T

The right side gives (fX, fY, Z)T. Dividing all entries by Z (the third component) recovers (fX/Z, fY/Z, 1)T = (um, vm, 1)T. This "divide by the third entry to get physical coordinates" is called dehomogenization.

More compactly, using the 3×4 canonical projection matrix Π0:

Π0 = [I3 | 03] = [[1,0,0,0], [0,1,0,0], [0,0,1,0]]
c = [X, Y, Z, 1]T  ⇒  z·[um, vm, 1]T = [f,0,0; 0,f,0; 0,0,1] · Π0 · p̃c
Why homogeneous? Because 3D rotations and translations and projections all become matrix multiplications in homogeneous space. You can chain any number of transforms — rotation, translation, perspective projection — as a single matrix product. This is why all of computer graphics and computer vision uses homogeneous coordinates.

Worked example

Point pc = (0.3, −0.1, 2.0, 1)T in homogeneous form. With f = 0.05:

python
import numpy as np

f = 0.05
K_metric = np.array([[f, 0, 0],
                     [0, f, 0],
                     [0, 0, 1]])

Pi0 = np.array([[1,0,0,0],
                [0,1,0,0],
                [0,0,1,0]])

p_hom = np.array([0.3, -0.1, 2.0, 1.0])
lam_x = K_metric @ Pi0 @ p_hom   # = [0.015, -0.005, 2.0]
u_m = lam_x[0] / lam_x[2]        # = 0.0075 m
v_m = lam_x[1] / lam_x[2]        # = -0.0025 m
print(f"u_m = {u_m:.4f} m, v_m = {v_m:.4f} m")
# u_m = 0.0075 m, v_m = -0.0025 m  (same as hand calc)
In homogeneous coordinates, the image point (u, v) is represented as which vector?

Chapter 3: The Intrinsic Matrix K

We have projection in meters (um, vm). Sensors don't return meters — they return pixels. Converting involves two steps: scaling by pixel density and shifting the origin to the image corner.

Step 1: meter → pixel scaling

A digital sensor has sx horizontal pixels per meter and sy vertical pixels per meter. Multiplying the metric coordinates by these densities converts to pixels:

upx = sx · um + ox     vpx = sy · vm + oy

where (ox, oy) is the principal point — the pixel where the optical axis pierces the image plane (typically near the center of the image). The shift is needed because the metric frame has its origin at the image center, but pixels are counted from the top-left corner.

Step 2: combining into K

Substituting um = f·X/Z and the pixel conversion together, and writing it as a single matrix acting on camera-frame coordinates:

K = [[fx, s, cx], [0, fy, cy], [0, 0, 1]]

where:

EntryValuePhysical meaning
fx = sx·fe.g. 800 pxFocal length in horizontal pixels
fy = sy·fe.g. 800 pxFocal length in vertical pixels
cx = oxe.g. 320 pxPrincipal point x (column of optical axis)
cy = oye.g. 240 pxPrincipal point y (row of optical axis)
s≈ 0Pixel skew (non-rectangular pixels; essentially zero on modern cameras)
Misconception: focal length in pixels depends only on the lens. fx = sx · f depends on BOTH the physical focal length f (a property of the lens) AND the pixel density sx (a property of the sensor). Crop a sensor (same lens): fx goes up. Bin pixels (same lens, same sensor): fx goes down. fx encodes the combination of lens + sensor — it is not a pure lens property.

Projection formula in pixels

The full expression, step by step:

Z · [u, v, 1]T = K · [X, Y, Z]T
u = (fx · X + s · Y) / Z + cx     v = fy · Y / Z + cy

Worked numbers

Camera: fx=800, fy=800, cx=320, cy=240, s=0. Point in camera frame: (0.3, −0.1, 2.0).

u = 800 × 0.3 / 2.0 + 320 = 120 + 320 = 440 px
v = 800 × (−0.1) / 2.0 + 240 = −40 + 240 = 200 px
Intrinsics explorer — how K shifts and scales the image

A 3×3 grid of scene points is projected using your K. Adjust fx, fy, cx, cy and watch: bigger f = narrower field of view (zoom); moving cx/cy shifts the principal point marker (teal cross) and the whole projected grid.

fx (px) 600
fy (px) 600
cx (px) 320
cy (px) 240
A camera has f=0.02m focal length and sx=5000 pixels/meter. What is fx?

Chapter 4: Extrinsics [R|t]: World → Camera

The intrinsic matrix K handles the camera's internal geometry. But our 3D points often live in a world frame w, while K expects coordinates in the camera frame c. We need a transform that moves points from world to camera.

The rigid body transform

From VNAV L1, the pose of camera frame c relative to world frame w is the SE(3) matrix Twc. To transform a world point pw into camera coordinates pc, we need the inverse — how to express a world point in camera coordinates. This is Tcw = Twc−1:

pc = Rcw pw + tcw

where Rcw is the rotation from world to camera (a 3×3 orthogonal matrix) and tcw is the camera center position expressed in the camera frame (roughly, where the world origin is, as seen by the camera).

The extrinsic matrix

In homogeneous coordinates, this transform becomes a 3×4 matrix:

[Rcw | tcw]   is the extrinsic matrix

It maps a homogeneous world point p̃w = (Xw, Yw, Zw, 1)T to a (non-homogeneous) camera-frame vector (Xc, Yc, Zc)T.

Extrinsics encode the camera's pose. If you know where the camera is in the world (its translation tcw) and how it's oriented (Rcw), you know the extrinsics. Estimating the extrinsics from matched image points is camera localization — the core problem in autonomous navigation.

Physical intuition: what does tcw mean?

tcw = −Rcw C, where C is the camera center in world coordinates. It is not the camera center position — it is the world origin expressed in the camera frame. This is a common confusion. To find where the camera is in the world: C = −RcwT tcw.

Extrinsics — how camera pose changes the view

A cube of 3D world points (warm) is projected into the image. Adjust camera position (xcam, zcam) and yaw angle to change the extrinsics. Watch how the projected image (teal dots) shifts and distorts.

Camera x (m) 0.0
Camera z offset (m) 4.0
Yaw θ (deg) 0
The extrinsic matrix [R|t] transforms points from which frame to which frame?

Chapter 5: Full Pipeline: P = K[R|t]

We now have all the pieces: extrinsics [R|t] to go from world to camera, and K to go from camera 3D to image pixels. Chaining them gives the projection matrix P — a single 3×4 matrix that projects any world point directly to a pixel.

P = K [Rcw | tcw]
Zc · [u, v, 1]T = P · p̃w

Where p̃w = (Xw, Yw, Zw, 1)T is the world point in homogeneous coordinates and Zc is the depth in the camera frame (required for the final dehomogenization step).

Recovering pixel coordinates

Compute q = P p̃w ∈ R3. Then:

u = q[0] / q[2]     v = q[1] / q[2]

Worked end-to-end example

Camera at world position Cw = (1.0, 0, 0) m, looking along the z-axis (identity rotation). World point at pw = (1.3, −0.1, 3.0). K: fx=800, fy=800, cx=320, cy=240.

python
import numpy as np

# Camera at C_w = [1.0, 0, 0], looking along world-z (identity rotation)
R_cw = np.eye(3)
t_cw = -R_cw @ np.array([1.0, 0.0, 0.0])  # = [-1, 0, 0]

K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]])

Ext = np.column_stack([R_cw, t_cw])   # 3x4
P = K @ Ext                             # 3x4 projection matrix

p_world = np.array([1.3, -0.1, 3.0, 1.0])
q = P @ p_world                         # = [P*p], 3-vector
u, v = q[0]/q[2], q[1]/q[2]
print(f"pixel: ({u:.1f}, {v:.1f})")
# Step by step:
# p_c = R_cw @ p_world[:3] + t_cw = [1.3-1, -0.1, 3.0] = [0.3, -0.1, 3.0]
# K @ p_c = [800*0.3/3+320, 800*(-0.1)/3+240] = [400, 213.3]
# pixel: (400.0, 213.3)
The projection pipeline in one line. pworld → [R|t] → pcamera → K → dehomogenize → pixel. Memorize this pipeline. Every classical computer vision algorithm starts here.
The projection matrix P = K[R|t] is a 3×4 matrix that maps a 4D homogeneous world point to a 3D homogeneous image point. To get actual pixel coordinates (u,v) from the result q ∈ R3, you must:

Chapter 6: Back-Projection: Pixel → Ray

Given a pixel (u, v), can we find the corresponding 3D point? Not uniquely — we can only find the ray along which the 3D point lies. This is the inverse of projection, and it is the foundation of triangulation, depth estimation, and structure-from-motion.

Inverting K to get a direction

Starting from u = (fx X/Z) + cx, solve for X/Z:

X/Z = (u − cx) / fx     Y/Z = (v − cy) / fy

This gives us the normalized image coordinates (xn, yn) = ((u−cx)/fx, (v−cy)/fy). The ray in camera coordinates is proportional to:

dc = (xn, yn, 1)T = K−1 [u, v, 1]T

Any 3D point on the ray satisfies pc = λ dc for some scalar λ > 0. The depth λ is unknown from a single image.

Worked example: back-projecting pixel (440, 200)

K: fx=800, fy=800, cx=320, cy=240.

xn = (440 − 320) / 800 = 0.15    yn = (200 − 240) / 800 = −0.05

Ray direction (in camera frame): d = (0.15, −0.05, 1)T. Normalize: |d| = √(0.0225 + 0.0025 + 1) ≈ 1.0124. Unit ray: (0.148, −0.049, 0.988).

If we know depth Zc=2m: pc = 2 × (0.15, −0.05, 1) = (0.3, −0.1, 2.0) — which matches the original 3D point from Chapter 1. The pipeline is consistent.

python
import numpy as np

K = np.array([[800,0,320],[0,800,240],[0,0,1]])
K_inv = np.linalg.inv(K)

pixel = np.array([440, 200, 1])    # homogeneous pixel
ray = K_inv @ pixel                 # = [0.15, -0.05, 1.0]
ray_unit = ray / np.linalg.norm(ray)
print("Ray direction:", ray)          # [0.15, -0.05, 1.0]

# If depth Z_c = 2m is known:
depth = 2.0
p_cam = depth * ray                  # = [0.3, -0.1, 2.0] ✓
Stereo triangulation in a nutshell. With two cameras, you get two rays for the same 3D point. Find where those rays intersect (or their closest approach) in 3D — that's the triangulated depth. This is why stereo cameras recover depth: two views, two rays, one intersection point.
Given K with fx=fy=500, cx=cy=250, and pixel (u=300, v=200), what are the normalized image coordinates (xn, yn)?

Chapter 7: Lens Distortion

The pinhole model assumes perfectly straight rays. Real lenses bend light. Wide-angle lenses bend it a lot. The result is lens distortion: straight lines in the world appear curved in the image.

Radial distortion

The most common distortion is radial: the further a point is from the image center, the more it shifts outward (barrel distortion) or inward (pincushion distortion). The model in the camera frame (before adding the principal point):

uc = (1 + a1r2 + a2r4) · udistortc
vc = (1 + a1r2 + a2r4) · vdistortc

where r2 = (udistort)2 + (vdistort)2 is the squared distance from the image center. The coefficients a1, a2 are called distortion coefficients:

SignTypeEffect
a1 < 0Barrel distortionStraight lines bow outward; common in wide-angle/fish-eye
a1 > 0Pincushion distortionStraight lines bow inward; common in telephoto

Undistortion in image coordinates

Using the image frame (with principal point), distortion correction becomes:

u = (1 + a1r2 + a2r4)(udistort − cx) + cx
r2 = (udistort − cx)2 + (vdistort − cy)2
Calibration order matters. In practice, you first undistort the raw image, then apply K. Distortion happens at the physical lens (before digitization), while K converts camera-frame coords to pixels. Always undistort first — the K matrix is defined for undistorted images.

Worked distortion example

Distorted pixel at (ud=350, vd=150), cx=320, cy=240, a1=−0.2, a2=0.05.

r2 = (350−320)2 + (150−240)2 = 900 + 8100 = 9000   (in px2)
factor = 1 + (−0.2)(9000) + (0.05)(81000000) = 1 − 1800 + 4050000 ≈ 4048201

Wait — that blows up! This is why distortion coefficients are always very small (|a1| < 0.5, |a2| < 0.1), and r is measured in normalized coordinates (not pixels). Let's redo with normalized coords (after dividing by f):

rn2 = (30/800)2 + (−90/800)2 = 0.001392 + 0.01266 = 0.01406
factor = 1 + (−0.2)(0.01406) + (0.05)(0.0001977) ≈ 1 − 0.00281 = 0.99719

A gentle inward pull of 0.28% — typical barrel correction. The corrected normalized coords: (0.0375 × 0.99719, −0.1125 × 0.99719) ≈ (0.0374, −0.1122).

python
import numpy as np

def undistort_pixel(u_d, v_d, K, a1, a2):
    """Apply radial undistortion to a single pixel."""
    fx, fy = K[0,0], K[1,1]
    cx, cy = K[0,2], K[1,2]
    # Normalize to camera frame (unitless)
    xn = (u_d - cx) / fx
    yn = (v_d - cy) / fy
    r2 = xn**2 + yn**2
    factor = 1 + a1*r2 + a2*r2**2
    # Undistorted normalized coords
    xn_u = xn * factor
    yn_u = yn * factor
    # Back to pixels
    u = xn_u * fx + cx
    v = yn_u * fy + cy
    return u, v

K = np.array([[800,0,320],[0,800,240],[0,0,1]])
u, v = undistort_pixel(350, 150, K, a1=-0.2, a2=0.05)
print(f"Undistorted: ({u:.2f}, {v:.2f})")  # (349.92, 150.25)
Lens distortion — barrel vs pincushion

A grid of undistorted points is shown (warm). Apply the distortion model: the teal dots show where those points appear in the raw (distorted) image. Drag a1 negative for barrel, positive for pincushion. The outer grid points distort most.

a1 (radial) 0.00
a2 (radial r4) 0.00
A camera exhibits barrel distortion. Lines near the image edges appear to bow outward. What sign would you expect for the dominant radial distortion coefficient a1?

Chapter 8: Showcase: Full Camera Explorer

Now everything comes together. A 3D scene (wireframe cube) is projected through the full pipeline P = K[R|t], with optional radial distortion applied on top. You control K (focal length, principal point), camera pose (translation, yaw), and distortion coefficients. This is the complete model you need to understand to do any visual navigation.

Full camera pipeline: K · [R|t] · distortion

Left half: top-down 3D view showing the cube (warm), camera (teal triangle), and projected rays. Right half: the resulting image with projected points and connecting lines. All parameters update live.

fx=fy600
cx320
Cam X0.0
Cam Z5.0
Yaw (deg)0
Distortion a10.00

Chapter 9: Connections & Cheat Sheet

You now hold the complete forward model for how cameras see the world. Here is the distilled reference you'll reach for constantly as you go deeper into visual navigation.

Camera Cheat Sheet

QuantityFormulaKey point
Perspective projection (m)um = f·X/Z, vm = f·Y/ZSimilar triangles; divide by depth
Intrinsic matrix K[[fx, s, cx],[0, fy, cy],[0,0,1]]fx=sxf in pixels; s≈0
Extrinsic matrix[Rcw | tcw]World → camera; Cworld = −RTt
Full projectionP = K[R|t]; Z·[u,v,1]T=P·p̃wDehomogenize: u=q[0]/q[2]
Back-projectiond = K−1[u,v,1]TOne pixel = one ray, depth unknown
Scale ambiguityp & λp same pixel for any λ>0Need stereo/motion/LiDAR for depth
Radial distortionfactor = 1+a1r2+a2r4r in normalized coords; undistort before K

What's next in VNAV

L11 (This lesson)
Camera model: pixels are projections of 3D rays
L12: Feature Detection & Tracking
Find and match keypoints across frames using the projection model
L13–14: Two-View Geometry
The Essential/Fundamental matrices encode geometry between two cameras
L15–16: Structure from Motion
Recover camera poses and 3D structure from many views

Links to related Gleams

The key insight to carry forward. Every visual navigation algorithm — feature tracking, pose estimation, SLAM, NeRF — is built on the projection model P = K[R|t]. When something goes wrong with a vision algorithm, the first question is always: "Is the camera model correct?" That means: calibrated K, correct R and t, and undistorted images.
"The camera does not capture reality. It records projections of reality. Understanding the projection is the first step toward recovering reality."
— Luca Carlone, MIT 16.485