Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 11

Image Formation: The Pinhole Camera Model

Every pixel your autonomous vehicle sees is the shadow of a 3D ray collapsed to a single point. Navigation from cameras requires understanding this projection exactly — so we can invert it, reconstruct depth, and triangulate structure. This lesson derives the pinhole model from similar triangles, builds the full projection pipeline P = K[R|t], shows what each number in the intrinsic matrix K means physically, and confronts the depth-is-lost problem head-on. Hand-derived numbers, five interactive canvases, worked projection/back-projection/distortion examples. MIT 16.485 by Luca Carlone, Lecture 11.

Prerequisites: VNAV L1 (3D frames & SE(3)). Basic similar triangles. No linear algebra beyond matrix multiplication.

Chapter 0: The Depth-Loss Problem

Your robot drives toward a stop sign. Its front camera sees a red octagon filling 12% of the image. Is the sign 3 meters away, or 30? A sign at 3 m and a sign ten times larger at 30 m produce exactly the same image, so apparent size tells you the distance only if you already know the real size of the sign.

This is not a sensor limitation that better hardware can fix. It is a mathematical fact: a camera collapses 3D space onto a 2D image, and that collapse is irreversible from a single view. One pixel corresponds to an entire ray of 3D points, all projecting to the same location.

One pixel = one ray. When you look at a pixel, you know the direction of the corresponding 3D point (the ray through the camera center), but not the distance along that ray. Depth is the lost dimension. Recovering it requires either multiple views (stereo, SfM) or additional sensors (LiDAR, radar).
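
The one-pixel-one-ray statement can be checked numerically. The snippet below is a minimal sketch, assuming an ideal pinhole with the image plane at unit distance and using the divide-by-depth rule derived in Chapter 1; the ray direction `d` and the sampled depths are illustrative values, not taken from the lecture.

```python
import numpy as np

# Illustrative ray direction in the camera frame (z points into the scene).
d = np.array([0.15, -0.05, 1.0])

# Points at several depths along the same ray: p = depth * d.
depths = np.array([2.0, 6.0, 30.0])
points = depths[:, None] * d              # shape (3, 3), one point per row

# Ideal pinhole projection onto the plane z = 1: divide x and y by z.
image_xy = points[:, :2] / points[:, 2:3]

for depth, xy in zip(depths, image_xy):
    print(f"depth {depth:5.1f} m -> image point ({xy[0]:.3f}, {xy[1]:.3f})")
```

Every row of `image_xy` is identical, so nothing in a single image distinguishes the three depths.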

Why does this matter for autonomous navigation?

To navigate, a robot must estimate ego-motion (how it moved between frames) and reconstruct the 3D structure of the environment. Both tasks require undoing the camera's projection — inverting the map from 3D to 2D. You cannot invert what you don't understand. This lesson derives the forward model so carefully that every step of the inverse becomes obvious.

Depth ambiguity: two points, one pixel

The camera center is at the left. The image plane is the vertical line. Two 3D points sit on the same ray — different depths but identical pixel coordinates. Drag the depth slider to move the near point. Watch: the pixel never moves.

Depth of near point (m) 2.0

A camera captures a scene. Point A is at depth Z=2m and point B is at depth Z=6m. Both appear at exactly the same pixel (u,v). What can you conclude?

The camera has a calibration error
Points A and B are at the same 3D location
Points A and B lie on the same ray from the camera center; depth is ambiguous from a single image
Point A is occluded by point B

Chapter 1: Similar Triangles: x = fX/Z

The pinhole camera model imagines a box with a single tiny hole at one face. Light from the world passes through the hole and lands on the opposite face — the image plane. No lens, no blur, just straight rays.

Setting up coordinates

Place the camera center (the pinhole, also called the optical center) at the origin of the camera frame. The optical axis is the z-axis, pointing into the scene. The image plane is at distance f (the focal length) along the z-axis.

A 3D point in the camera frame has coordinates p^c = (p_x^c, p_y^c, p_z^c). The ray from the origin through p^c intersects the image plane at some 2D location (u_m, v_m), where the subscript m means "in meters" (we convert to pixels later).

Deriving u_m = f · p_x^c / p_z^c

Look at the xz-plane only. The point is at (p_x^c, p_z^c). The image plane is at z = f. The ray from origin through the point hits z = f at:

u_m / f = p_x^c / p_z^c ⇒ u_m = f · p_x^c / p_z^c

This is pure similar-triangles geometry: the small triangle (origin to image plane, height u_m, base f) is similar to the large triangle (origin to point, height p_x^c, base p_z^c). By symmetry, the same holds for y:

v_m = f · p_y^c / p_z^c

Perspective division. Both formulas divide by p_z^c, the depth. This single division is the source of all perspective effects: objects appear smaller as depth increases (divide by larger p_z), and the rate of apparent motion of a moving object depends on how far it is (optical flow ∝ 1/Z).

Worked example

A point is at p^c = (0.3 m, −0.1 m, 2.0 m). Focal length f = 0.05 m (50 mm, a typical standard lens).

u_m = 0.05 × (0.3 / 2.0) = 0.0075 m = 7.5 mm

v_m = 0.05 × (−0.1 / 2.0) = −0.0025 m = −2.5 mm

The image point is 7.5 mm to the right of center and 2.5 mm above it (the image y-axis points down, so a negative v_m means up). Now take p^c = (0.6 m, −0.2 m, 4.0 m): double the distance, double all coordinates. It lands on the same image point, (7.5 mm, −2.5 mm). Scale ambiguity confirmed.
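
The worked example can be reproduced in a few lines. This is a minimal sketch: the helper name `project_metric` and the 0.75 m sign width are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def project_metric(p_c, f):
    """Pinhole projection of a camera-frame point onto the image plane, in meters."""
    x, y, z = p_c
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return np.array([f * x / z, f * y / z])

f = 0.05  # focal length in meters

near = project_metric(np.array([0.3, -0.1, 2.0]), f)
far = project_metric(np.array([0.6, -0.2, 4.0]), f)   # same ray, twice as far

print("near point (mm):", near * 1000)
print("far point  (mm):", far * 1000)

# Perspective division also sets apparent size: an object of width W at
# depth Z spans f * W / Z on the image plane.
sign_width = 0.75                                     # meters (assumed)
apparent = f * sign_width / np.array([3.0, 30.0])
print("apparent width at 3 m and 30 m (mm):", apparent * 1000)
```

Both points print the same image coordinates, and the sign appears ten times smaller at ten times the distance.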

Misconception: focal length is a lens property. The focal length f here is in meters: it is a physical distance (how far the image plane is from the pinhole). But the "focal length in pixels" that your K matrix stores is f measured in units of pixel widths. The same 50 mm lens gives a different pixel focal length on a sensor with large pixels than on a phone sensor with much smaller pixels. More on this in Chapter 3.

Pinhole projection — 3D points through the optical center

3D point (warm dot) in the scene projects through the pinhole onto the image plane. Drag sliders to change X, Z, or focal length. Watch how the projected pixel (teal) moves.

X (m) 0.50

Z depth (m) 2.0

Focal length f (m) 0.050

A point at (X = 1 m, Z = 4 m) in camera coordinates is projected with f = 0.04 m. What is u_m?

0.16 m (= f × Z)
0.4 m
0.01 m (= f × X / Z = 0.04 × 1 / 4)
4 m (= Z)

Chapter 2: Homogeneous Coordinates

The projection formulas u_m = f·X/Z and v_m = f·Y/Z involve a division by Z, making them nonlinear. Nonlinear equations are hard to chain together, hard to invert, and hard to reason about algebraically. Homogeneous coordinates solve this by embedding a 3D point into 4D space, transforming the nonlinear projection into a linear matrix multiplication — at the cost of one extra dimension.

The homogeneous lift

A 3D point p = (X, Y, Z) becomes the homogeneous vector p̃ = (X, Y, Z, 1)^T. A 2D image point (u, v) becomes (u, v, 1)^T, or more generally (u·w, v·w, w)^T for any w ≠ 0 — all such triples represent the same image point.
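
The lift and its inverse are one-liners in NumPy. A minimal sketch follows; the helper names `to_homogeneous` and `from_homogeneous` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def to_homogeneous(p):
    """Append a 1 to a Euclidean point: (X, Y, Z) -> (X, Y, Z, 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def from_homogeneous(p_h):
    """Divide by the last entry and drop it: (u*w, v*w, w) -> (u, v)."""
    p_h = np.asarray(p_h, dtype=float)
    if np.isclose(p_h[-1], 0.0):
        raise ValueError("last entry is zero: no finite Euclidean point")
    return p_h[:-1] / p_h[-1]

p_tilde = to_homogeneous([0.3, -0.1, 2.0])

# Three different triples, all representing the image point (3, 2).
a = from_homogeneous([3.0, 2.0, 1.0])
b = from_homogeneous([6.0, 4.0, 2.0])
c = from_homogeneous([-1.5, -1.0, -0.5])
print(p_tilde, a, b, c)
```

Scaling a homogeneous vector by any nonzero w leaves the point it represents unchanged, which is why `a`, `b`, and `c` agree.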

Projection as matrix multiplication

Watch the division-by-Z become the third row in a matrix product:

Z · [u_m, v_m, 1]^T = [f, 0, 0; 0, f, 0; 0, 0, 1] · [X, Y, Z]^T

The right side gives (fX, fY, Z)^T. Dividing all entries by Z (the third component) recovers (fX/Z, fY/Z, 1)^T = (u_m, v_m, 1)^T. This "divide by the third entry to get physical coordinates" is called dehomogenization.

More compactly, using the 3×4 canonical projection matrix Π₀:

Π₀ = [I₃ | 0₃] = [[1,0,0,0], [0,1,0,0], [0,0,1,0]]

p̃^c = [X, Y, Z, 1]^T ⇒ Z·[u_m, v_m, 1]^T = [f,0,0; 0,f,0; 0,0,1] · Π₀ · p̃^c

Why homogeneous? Because 3D rotations and translations and projections all become matrix multiplications in homogeneous space. You can chain any number of transforms — rotation, translation, perspective projection — as a single matrix product. This is why all of computer graphics and computer vision uses homogeneous coordinates.

Worked example

Point p̃^c = (0.3, −0.1, 2.0, 1)^T in homogeneous form. With f = 0.05:

python
import numpy as np

f = 0.05
K_metric = np.array([[f, 0, 0],
                     [0, f, 0],
                     [0, 0, 1]])

Pi0 = np.array([[1,0,0,0],
                [0,1,0,0],
                [0,0,1,0]])

p_hom = np.array([0.3, -0.1, 2.0, 1.0])
lam_x = K_metric @ Pi0 @ p_hom   # = [0.015, -0.005, 2.0]
u_m = lam_x[0] / lam_x[2]        # = 0.0075 m
v_m = lam_x[1] / lam_x[2]        # = -0.0025 m
print(f"u_m = {u_m:.4f} m, v_m = {v_m:.4f} m")
# u_m = 0.0075 m, v_m = -0.0025 m  (same as hand calc)

In homogeneous coordinates, the image point (u, v) is represented as which vector?

(u, v, 0)^T
(u, v, 1)^T or (ku, kv, k)^T for any k ≠ 0
(1, u, v)^T
(u, v, u+v)^T

Chapter 3: The Intrinsic Matrix K

We have projection in meters (u_m, v_m). Sensors don't return meters — they return pixels. Converting involves two steps: scaling by pixel density and shifting the origin to the image corner.

Step 1: meter → pixel scaling

A digital sensor has s_x horizontal pixels per meter and s_y vertical pixels per meter. Multiplying the metric coordinates by these densities converts to pixels:

u_px = s_x · u_m + o_x

v_px = s_y · v_m + o_y

where (o_x, o_y) is the principal point — the pixel where the optical axis pierces the image plane (typically near the center of the image). The shift is needed because the metric frame has its origin at the image center, but pixels are counted from the top-left corner.

Step 2: combining into K

Substituting u_m = f·X/Z and the pixel conversion together, and writing it as a single matrix acting on camera-frame coordinates:

K = [[f_x, s, c_x], [0, f_y, c_y], [0, 0, 1]]

where:

Entry	Value	Physical meaning
f_x = s_x·f	e.g. 800 px	Focal length in horizontal pixels
f_y = s_y·f	e.g. 800 px	Focal length in vertical pixels
c_x = o_x	e.g. 320 px	Principal point x (column of optical axis)
c_y = o_y	e.g. 240 px	Principal point y (row of optical axis)
s	≈ 0	Pixel skew (non-rectangular pixels; essentially zero on modern cameras)

Misconception: focal length in pixels depends only on the lens. f_x = s_x · f depends on BOTH the physical focal length f (a property of the lens) AND the pixel density s_x (a property of the sensor). Swap in a sensor with smaller pixels (same lens): f_x goes up. Bin pixels 2×2 (same lens, same sensor): f_x is halved. Cropping the image changes neither f nor s_x, so f_x stays the same; only the principal point and the field of view change. f_x encodes the combination of lens and sensor; it is not a pure lens property.
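
A quick numeric check of f_x = s_x · f. This is a sketch under assumed values: the 10 mm lens and the 12.5 µm pixel pitch are hypothetical, picked so that f_x comes out at the 800 px used elsewhere in this lesson.

```python
f = 0.010                # physical focal length in meters (assumed 10 mm lens)
pixel_pitch = 12.5e-6    # width of one pixel in meters (assumed sensor)

s_x = 1.0 / pixel_pitch  # pixel density in pixels per meter
f_x = s_x * f            # focal length in pixels

# 2x2 binning merges neighbouring pixels: the effective pitch doubles,
# the density halves, and so does the focal length in pixels.
s_x_binned = 1.0 / (2.0 * pixel_pitch)
f_x_binned = s_x_binned * f

print(f"s_x = {s_x:.0f} px/m, f_x = {f_x:.1f} px")
print(f"after 2x2 binning: f_x = {f_x_binned:.1f} px")
```

The lens never changed between the two lines of output; only the sensor readout did.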

Projection formula in pixels

The full expression, step by step:

Z · [u, v, 1]^T = K · [X, Y, Z]^T

u = (f_x · X + s · Y) / Z + c_x

v = f_y · Y / Z + c_y

Worked numbers

Camera: f_x=800, f_y=800, c_x=320, c_y=240, s=0. Point in camera frame: (0.3, −0.1, 2.0).

u = 800 × 0.3 / 2.0 + 320 = 120 + 320 = 440 px

v = 800 × (−0.1) / 2.0 + 240 = −40 + 240 = 200 px
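
The same numbers fall out of the matrix form Z · [u, v, 1]^T = K · [X, Y, Z]^T. The sketch below checks this; the helper name `project_pixels` and the nonzero skew value are illustrative assumptions (real cameras have s ≈ 0).

```python
import numpy as np

def project_pixels(p_c, K):
    """Project a camera-frame point to pixels: q = K p_c, then divide by q[2]."""
    q = K @ np.asarray(p_c, dtype=float)
    return q[:2] / q[2]

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

uv = project_pixels([0.3, -0.1, 2.0], K)
print("zero skew:", uv)

# Same camera with an exaggerated skew s = 4 (hypothetical), to show where
# the s * Y / Z term enters the u coordinate.
K_skew = K.copy()
K_skew[0, 1] = 4.0
uv_skew = project_pixels([0.3, -0.1, 2.0], K_skew)
print("skew s=4: ", uv_skew)
```

With skew, u shifts by s · Y / Z = 4 × (−0.1) / 2 = −0.2 px while v is unaffected.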

Intrinsics explorer — how K shifts and scales the image

A 3×3 grid of scene points is projected using your K. Adjust f_x, f_y, c_x, c_y and watch: bigger f = narrower field of view (zoom); moving c_x/c_y shifts the principal point marker (teal cross) and the whole projected grid.

f_x (px) 600

f_y (px) 600

c_x (px) 320

c_y (px) 240
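
The "bigger f, narrower field of view" relationship can be made quantitative. A pixel at the image edge sits W/2 pixels from the principal point, so the half-angle θ satisfies tan θ = (W/2) / f_x. The sketch below assumes an ideal pinhole, a principal point at the image center, and a 640 px wide image (an assumption consistent with c_x = 320); the function name is hypothetical.

```python
import math

def horizontal_fov_deg(f_x, image_width):
    """Horizontal field of view in degrees.

    Assumes an ideal pinhole with the principal point at the image center.
    """
    return math.degrees(2.0 * math.atan(image_width / (2.0 * f_x)))

image_width = 640  # pixels (assumed)
for f_x in (400, 600, 800):
    fov = horizontal_fov_deg(f_x, image_width)
    print(f"f_x = {f_x} px -> horizontal FOV = {fov:.1f} deg")
```

Doubling f_x does not halve the field of view exactly, because the relationship goes through the arctangent.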

A camera has f=0.02m focal length and s_x=5000 pixels/meter. What is f_x?

0.02 px
5000 px
100 px (= 5000 × 0.02)
250000 px (= 5000 / 0.02)

Chapter 4: Extrinsics [R|t]: World → Camera

The intrinsic matrix K handles the camera's internal geometry. But our 3D points often live in a world frame w, while K expects coordinates in the camera frame c. We need a transform that moves points from world to camera.

The rigid body transform

From VNAV L1, the pose of camera frame c relative to world frame w is the SE(3) matrix T_wc. To transform a world point p^w into camera coordinates p^c, we need the inverse — how to express a world point in camera coordinates. This is T_cw = T_wc⁻¹:

p^c = R_cw p^w + t_cw

where R_cw is the rotation from world to camera (a 3×3 rotation matrix: orthogonal, with determinant +1) and t_cw is the position of the world origin expressed in the camera frame (where the world origin is, as seen by the camera). It is not the position of the camera center.

The extrinsic matrix

In homogeneous coordinates, this transform becomes a 3×4 matrix:

[R_cw | t_cw] is the extrinsic matrix

It maps a homogeneous world point p̃^w = (X_w, Y_w, Z_w, 1)^T to a (non-homogeneous) camera-frame vector (X_c, Y_c, Z_c)^T.

Extrinsics encode the camera's pose. If you know where the camera is in the world (its center C, which gives t_cw = −R_cw C) and how it is oriented (R_cw), you know the extrinsics. Estimating the extrinsics from matched image points is camera localization, the core problem in autonomous navigation.

Physical intuition: what does t_cw mean?

t_cw = −R_cw C, where C is the camera center in world coordinates. It is not the camera center position — it is the world origin expressed in the camera frame. This is a common confusion. To find where the camera is in the world: C = −R_cw^T t_cw.
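
The distinction between t_cw and the camera center is easy to verify numerically. The sketch below starts from a camera pose in the world (orientation R_wc and center C) and inverts it. The pose values are hypothetical, and yaw is assumed to be a rotation about the camera's y-axis.

```python
import numpy as np

def yaw_rotation(theta):
    """Rotation by theta (radians) about the y-axis (assumed yaw axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# Camera pose in the world frame (hypothetical values).
R_wc = yaw_rotation(np.deg2rad(30.0))
C = np.array([1.0, 0.0, -2.0])

# Extrinsics are the inverse of the pose.
R_cw = R_wc.T
t_cw = -R_cw @ C

def world_to_camera(p_w):
    """Express a world point in the camera frame."""
    return R_cw @ p_w + t_cw

print("camera center in camera frame:", world_to_camera(C))
print("world origin in camera frame: ", world_to_camera(np.zeros(3)))
print("t_cw:                         ", t_cw)
print("recovered C = -R_cw^T t_cw:   ", -R_cw.T @ t_cw)
```

The camera center maps to the origin of the camera frame, the world origin maps to t_cw, and t_cw differs from C whenever the rotation or the center is nontrivial.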

Extrinsics — how camera pose changes the view

A cube of 3D world points (warm) is projected into the image. Adjust camera position (x_cam, z_cam) and yaw angle to change the extrinsics. Watch how the projected image (teal dots) shifts and distorts.

Camera x (m) 0.0

Camera z offset (m) 4.0

Yaw θ (deg) 0

The extrinsic matrix [R|t] transforms points from which frame to which frame?

Camera frame → image plane (pixels)
World frame → camera frame (3D coordinates in camera's reference)
Image plane → 3D world
Camera frame → world frame

Chapter 5: Full Pipeline: P = K[R|t]

We now have all the pieces: extrinsics [R|t] to go from world to camera, and K to go from camera 3D to image pixels. Chaining them gives the projection matrix P — a single 3×4 matrix that projects any world point directly to a pixel.

P = K [R_cw | t_cw]

Z_c · [u, v, 1]^T = P · p̃^w

Where p̃^w = (X_w, Y_w, Z_w, 1)^T is the world point in homogeneous coordinates and Z_c is the depth in the camera frame (required for the final dehomogenization step).

Recovering pixel coordinates

Compute q = P p̃^w ∈ R³. Then:

u = q[0] / q[2]

v = q[1] / q[2]

Worked end-to-end example

Camera at world position C^w = (1.0, 0, 0) m, looking along the z-axis (identity rotation). World point at p^w = (1.3, −0.1, 3.0). K: f_x=800, f_y=800, c_x=320, c_y=240.

python
import numpy as np

# Camera at C_w = [1.0, 0, 0], looking along world-z (identity rotation)
R_cw = np.eye(3)
t_cw = -R_cw @ np.array([1.0, 0.0, 0.0])  # = [-1, 0, 0]

K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]])

Ext = np.column_stack([R_cw, t_cw])   # 3x4
P = K @ Ext                             # 3x4 projection matrix

p_world = np.array([1.3, -0.1, 3.0, 1.0])
q = P @ p_world                         # homogeneous pixel, = [1200, 640, 3]
u, v = q[0]/q[2], q[1]/q[2]             # dehomogenize: divide by q[2] = Z_c
print(f"pixel: ({u:.1f}, {v:.1f})")
# Step by step:
# p_c = R_cw @ p_world[:3] + t_cw = [1.3-1, -0.1, 3.0] = [0.3, -0.1, 3.0]
# K @ p_c = [800*0.3 + 320*3, 800*(-0.1) + 240*3, 3] = [1200, 640, 3]
# divide by 3: (1200/3, 640/3) = (400.0, 213.3)
# pixel: (400.0, 213.3)

The projection pipeline in one line. p^world → [R|t] → p^camera → K → dehomogenize → pixel. Memorize this pipeline. Every classical computer vision algorithm starts here.
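
In practice the pipeline is applied to many points at once, and points behind the camera must be rejected before dividing. The following is a minimal vectorized sketch using the camera from the worked example. The function name `project_points` and the two extra test points are assumptions for illustration, and the depth test relies on the last row of K being [0, 0, 1].

```python
import numpy as np

def project_points(P, pts_w):
    """Project an (N, 3) array of world points with a 3x4 matrix P.

    Returns (N, 2) pixel coordinates (NaN for points with non-positive
    depth) and an (N,) boolean mask of points in front of the camera.
    """
    pts_h = np.hstack([pts_w, np.ones((len(pts_w), 1))])   # (N, 4) homogeneous
    q = pts_h @ P.T                                        # (N, 3)
    in_front = q[:, 2] > 0                                 # q[:, 2] is Z_c here
    uv = np.full((len(pts_w), 2), np.nan)
    uv[in_front] = q[in_front, :2] / q[in_front, 2:3]
    return uv, in_front

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R_cw = np.eye(3)
t_cw = -R_cw @ np.array([1.0, 0.0, 0.0])
P = K @ np.column_stack([R_cw, t_cw])

pts_w = np.array([[1.3, -0.1, 3.0],    # the worked-example point
                  [1.0, 0.0, 5.0],     # on the optical axis
                  [1.0, 0.0, -2.0]])   # behind the camera
uv, in_front = project_points(P, pts_w)
print(uv)
print(in_front)
```

The on-axis point lands exactly on the principal point (320, 240), and the point behind the camera is masked out instead of producing a misleading pixel.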

The projection matrix P = K[R|t] is a 3×4 matrix that maps a 4D homogeneous world point to a 3D homogeneous image point. To get actual pixel coordinates (u,v) from the result q ∈ R³, you must:

Take q[0] and q[1] directly as pixels
Normalize q so that q[2]=1, then K maps to pixels
Divide q[0] and q[1] by q[2] (dehomogenize)
Multiply q by the inverse of K

Chapter 6: Back-Projection: Pixel → Ray

Given a pixel (u, v), can we find the corresponding 3D point? Not uniquely — we can only find the ray along which the 3D point lies. This is the inverse of projection, and it is the foundation of triangulation, depth estimation, and structure-from-motion.

Inverting K to get a direction

Starting from u = (f_x X/Z) + c_x, solve for X/Z:

X/Z = (u − c_x) / f_x

Y/Z = (v − c_y) / f_y

This gives us the normalized image coordinates (x_n, y_n) = ((u−c_x)/f_x, (v−c_y)/f_y). The ray in camera coordinates is proportional to:

d^c = (x_n, y_n, 1)^T = K⁻¹ [u, v, 1]^T

Any 3D point on the ray satisfies p^c = λ d^c for some scalar λ > 0. The depth λ is unknown from a single image.

Worked example: back-projecting pixel (440, 200)

K: f_x=800, f_y=800, c_x=320, c_y=240.

x_n = (440 − 320) / 800 = 0.15

y_n = (200 − 240) / 800 = −0.05

Ray direction (in camera frame): d = (0.15, −0.05, 1)^T. Normalize: |d| = √(0.0225 + 0.0025 + 1) ≈ 1.0124. Unit ray: (0.148, −0.049, 0.988).

If we know depth Z_c=2m: p^c = 2 × (0.15, −0.05, 1) = (0.3, −0.1, 2.0) — which matches the original 3D point from Chapter 1. The pipeline is consistent.

python
import numpy as np

K = np.array([[800,0,320],[0,800,240],[0,0,1]])
K_inv = np.linalg.inv(K)

pixel = np.array([440, 200, 1])    # homogeneous pixel
ray = K_inv @ pixel                 # = [0.15, -0.05, 1.0]
ray_unit = ray / np.linalg.norm(ray)
print("Ray direction:", ray)          # [0.15, -0.05, 1.0]

# If depth Z_c = 2m is known:
depth = 2.0
p_cam = depth * ray                  # = [0.3, -0.1, 2.0] ✓

Stereo triangulation in a nutshell. With two cameras, you get two rays for the same 3D point. Find where those rays intersect (or their closest approach) in 3D — that's the triangulated depth. This is why stereo cameras recover depth: two views, two rays, one intersection point.
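
The two-rays idea can be sketched with the closest-approach (midpoint) method. This is one simple triangulation scheme under stated assumptions, not necessarily the method used later in the course: the stereo rig (identical orientation, 0.5 m baseline along x) and all function names are hypothetical, and the pixels are noise-free.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(K, R_cw, t_cw, p_w):
    """World point -> pixel via q = K (R_cw p_w + t_cw), then dehomogenize."""
    q = K @ (R_cw @ p_w + t_cw)
    return q[:2] / q[2]

def pixel_to_world_ray(K, R_cw, t_cw, uv):
    """Back-project a pixel: camera center and unit direction, in the world frame."""
    origin = -R_cw.T @ t_cw
    d_c = np.linalg.solve(K, np.array([uv[0], uv[1], 1.0]))  # K^{-1} [u, v, 1]
    d_w = R_cw.T @ d_c
    return origin, d_w / np.linalg.norm(d_w)

def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between two non-parallel rays."""
    A = np.column_stack([d1, -d2])                     # solve s*d1 - t*d2 = o2 - o1
    s, t = np.linalg.lstsq(A, o2 - o1, rcond=None)[0]
    return 0.5 * ((o1 + s * d1) + (o2 + t * d2))

# Hypothetical stereo rig.
R = np.eye(3)
C1, C2 = np.zeros(3), np.array([0.5, 0.0, 0.0])
t1, t2 = -R @ C1, -R @ C2

p_true = np.array([0.3, -0.1, 2.0])
uv1 = project(K, R, t1, p_true)
uv2 = project(K, R, t2, p_true)

o1, d1 = pixel_to_world_ray(K, R, t1, uv1)
o2, d2 = pixel_to_world_ray(K, R, t2, uv2)
p_hat = triangulate_midpoint(o1, d1, o2, d2)
print("pixels:", uv1, uv2)
print("triangulated point:", p_hat)
```

With perfect pixels the rays intersect and the midpoint recovers the point exactly; with noisy pixels the rays generally miss each other, which is why the closest approach is used.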

Given K with f_x=f_y=500, c_x=c_y=250, and pixel (u=300, v=200), what are the normalized image coordinates (x_n, y_n)?

(300, 200)
(0.1, −0.1) since x_n=(300−250)/500=0.1, y_n=(200−250)/500=−0.1
(1.2, 0.8) since x_n=300/250, y_n=200/250
(50, −50) since (300−250)=50, (200−250)=−50

Chapter 7: Lens Distortion

The pinhole model assumes perfectly straight rays. Real lenses bend light. Wide-angle lenses bend it a lot. The result is lens distortion: straight lines in the world appear curved in the image.

Radial distortion

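The most common distortion is radial: each point is displaced along the line joining it to the image center, by an amount that grows with its distance from the center. Straight lines then bow outward (barrel distortion) or inward (pincushion distortion). The model, written in camera-frame image coordinates (before adding the principal point), maps distorted coordinates to undistorted ones: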

u_c = (1 + a₁r² + a₂r⁴) · u^distort_c

v_c = (1 + a₁r² + a₂r⁴) · v^distort_c

where r² = (u^distort_c)² + (v^distort_c)² is the squared distance of the distorted point from the image center. The coefficients a₁, a₂ are called distortion coefficients:

Sign	Type	Effect
a₁ < 0	Barrel distortion	Straight lines bow outward; common in wide-angle/fish-eye
a₁ > 0	Pincushion distortion	Straight lines bow inward; common in telephoto

Undistortion in image coordinates

Using the image frame (with principal point), distortion correction becomes:

u = (1 + a₁r² + a₂r⁴)(u^distort − c_x) + c_x

v = (1 + a₁r² + a₂r⁴)(v^distort − c_y) + c_y

r² = (u^distort − c_x)² + (v^distort − c_y)²

Calibration order matters. In practice, you first undistort the raw image, then apply K. Distortion happens at the physical lens (before digitization), while K converts camera-frame coords to pixels. Always undistort first — the K matrix is defined for undistorted images.

Worked distortion example

Distorted pixel at (u^d=350, v^d=150), c_x=320, c_y=240, a₁=−0.2, a₂=0.05.

r² = (350−320)² + (150−240)² = 900 + 8100 = 9000 (in px²)

factor = 1 + (−0.2)(9000) + (0.05)(81000000) = 1 − 1800 + 4050000 ≈ 4048201

Wait, that blows up! The mistake is in the units: these coefficients are meant for r measured in normalized coordinates (dimensionless, after subtracting the principal point and dividing by the focal length), not in pixels. In those units r is well below 1 over most of the image, and typical coefficients are small (roughly |a₁| < 0.5, |a₂| < 0.1). Let's redo the calculation in normalized coordinates:

r_n² = (30/800)² + (−90/800)² = 0.001406 + 0.012656 = 0.014063

factor = 1 + (−0.2)(0.014063) + (0.05)(0.0001978) ≈ 1 − 0.002813 + 0.000010 ≈ 0.99720

A gentle inward pull of 0.28%. The corrected normalized coords: (0.0375 × 0.99720, −0.1125 × 0.99720) ≈ (0.0374, −0.1122).

python
import numpy as np

def undistort_pixel(u_d, v_d, K, a1, a2):
    """Apply radial undistortion to a single pixel."""
    fx, fy = K[0,0], K[1,1]
    cx, cy = K[0,2], K[1,2]
    # Normalize to camera frame (unitless)
    xn = (u_d - cx) / fx
    yn = (v_d - cy) / fy
    r2 = xn**2 + yn**2
    factor = 1 + a1*r2 + a2*r2**2
    # Undistorted normalized coords
    xn_u = xn * factor
    yn_u = yn * factor
    # Back to pixels
    u = xn_u * fx + cx
    v = yn_u * fy + cy
    return u, v

K = np.array([[800,0,320],[0,800,240],[0,0,1]])
u, v = undistort_pixel(350, 150, K, a1=-0.2, a2=0.05)
print(f"Undistorted: ({u:.2f}, {v:.2f})")  # (349.92, 150.25)
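
The model above gives undistorted coordinates in closed form, but the opposite direction (finding where an ideal point lands in the raw image) has no closed form, because the factor depends on the unknown distorted radius. One common workaround is fixed-point iteration. The sketch below works in normalized coordinates; the function names and the iteration count are assumptions, and convergence is only expected for mild distortion.

```python
import numpy as np

def undistort_normalized(xd, a1, a2):
    """Closed-form model: undistorted = factor(r_d) * distorted."""
    r2 = xd @ xd
    return xd * (1 + a1 * r2 + a2 * r2**2)

def distort_normalized(xu, a1, a2, iters=20):
    """Invert the model by fixed-point iteration: xd <- xu / factor(xd)."""
    xd = xu.copy()                       # initial guess: no distortion
    for _ in range(iters):
        r2 = xd @ xd
        xd = xu / (1 + a1 * r2 + a2 * r2**2)
    return xd

a1, a2 = -0.2, 0.05
xd0 = np.array([0.0375, -0.1125])        # distorted point from the worked example

xu = undistort_normalized(xd0, a1, a2)
xd_rec = distort_normalized(xu, a1, a2)

print("undistorted:        ", xu)
print("re-distorted:       ", xd_rec)
print("round-trip error:   ", np.abs(xd_rec - xd0).max())
```

The round trip returns to the starting point, which is a useful sanity check whenever distortion and undistortion are implemented separately.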

Lens distortion — barrel vs pincushion

A grid of undistorted points is shown (warm). Apply the distortion model: the teal dots show where those points appear in the raw (distorted) image. Drag a₁ negative for barrel, positive for pincushion. The outer grid points distort most.

a₁ (radial) 0.00

a₂ (radial r⁴) 0.00

A camera exhibits barrel distortion. Lines near the image edges appear to bow outward. What sign would you expect for the dominant radial distortion coefficient a₁?

a₁ > 0 (pincushion)
a₁ < 0 (barrel)
a₁ = 0 (no distortion)
The sign depends on the focal length

Chapter 8: Showcase: Full Camera Explorer

Now everything comes together. A 3D scene (wireframe cube) is projected through the full pipeline P = K[R|t], with optional radial distortion applied on top. You control K (focal length, principal point), camera pose (translation, yaw), and distortion coefficients. This is the complete model you need to understand to do any visual navigation.

Full camera pipeline: K · [R|t] · distortion

Left half: top-down 3D view showing the cube (warm), camera (teal triangle), and projected rays. Right half: the resulting image with projected points and connecting lines. All parameters update live.

f_x=f_y 600

c_x 320

Cam X 0.0

Cam Z 5.0

Yaw (deg) 0

Distortion a₁ 0.00

Chapter 9: Connections & Cheat Sheet

You now hold the complete forward model for how cameras see the world. Here is the distilled reference you'll reach for constantly as you go deeper into visual navigation.

Camera Cheat Sheet

Quantity	Formula	Key point
Perspective projection (m)	u_m = f·X/Z, v_m = f·Y/Z	Similar triangles; divide by depth
Intrinsic matrix K	[[f_x, s, c_x],[0, f_y, c_y],[0,0,1]]	f_x = s_x·f in pixels; s ≈ 0
Extrinsic matrix	[R_cw | t_cw]	World → camera; C_world = −R_cw^T t_cw
Full projection	P = K[R|t]; Z·[u,v,1]^T = P·p̃^w	Dehomogenize: u = q[0]/q[2], v = q[1]/q[2]
Back-projection	d = K⁻¹[u,v,1]^T	One pixel = one ray, depth unknown
Scale ambiguity	p & λp same pixel for any λ>0	Need stereo/motion/LiDAR for depth
Radial distortion	factor = 1+a₁r²+a₂r⁴	r in normalized coords; undistort before K

What's next in VNAV

L11 (This lesson)

Camera model: pixels are projections of 3D rays

↓

L12: Feature Detection & Tracking

Find and match keypoints across frames using the projection model

↓

L13–14: Two-View Geometry

The Essential/Fundamental matrices encode geometry between two cameras

↓

L15–16: Structure from Motion

Recover camera poses and 3D structure from many views

Links to related Gleams

VNAV L1: 3D Geometry & SE(3) — the reference frames and transforms used by [R|t]
VNAV L2: Lie Groups — optimizing camera poses on the SO(3) manifold
Multiview Geometry — epipolar constraints, triangulation, and the Essential matrix
NeRF & 3D Gaussian Splatting — modern 3D reconstruction using the camera model from this lesson
Classical VIO — fusing camera projection with IMU for visual inertial odometry

The key insight to carry forward. Every visual navigation algorithm — feature tracking, pose estimation, SLAM, NeRF — is built on the projection model P = K[R|t]. When something goes wrong with a vision algorithm, the first question is always: "Is the camera model correct?" That means: calibrated K, correct R and t, and undistorted images.

"The camera does not capture reality. It records projections of reality. Understanding the projection is the first step toward recovering reality."
— Luca Carlone, MIT 16.485
