Visual Navigation for Autonomous Vehicles · MIT 16.485 · Lecture 1

3D Geometry: Rotations, Rigid Transforms & Frames

A drone's camera spots a landmark "2 m ahead, 1 m left." But in what frame? Without rigorously relating coordinate frames — world, body, camera — navigation is impossible. This lesson builds the geometric language every later VNAV topic speaks: points and frames, rotation matrices in SO(3), Euler angles and their singularities, axis-angle and quaternions, and rigid-body transforms in SE(3). Every formula derived with actual numbers. MIT 16.485 by Luca Carlone.

Prerequisites: linear algebra basics (matrix multiply, transpose, determinant). No prior robotics assumed.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Frame Problem

Your autonomous drone hovers at 10 m altitude. Its forward-facing camera detects a traffic cone directly ahead at a pixel offset that translates to "2.0 m in front, 1.0 m to the left." You want to navigate around it. Simple enough — but here is the first question that will define the next ten chapters: 2.0 m in front and 1.0 m to the left of what, exactly?

The camera has its own camera frame: x-axis pointing right, y-axis pointing down, z-axis pointing into the scene. The drone body has a body frame: x forward, y left, z up. The world has a world frame with a fixed origin on the ground. These three frames are all rotated and translated with respect to each other — and they change relative to each other every millisecond the drone moves.

If you naively treat "2 m forward" in the camera frame as "2 m forward" in the world frame, you will command the drone to fly into the cone if the camera is pitched down, or to miss it entirely if the drone has yawed 90 degrees since the last measurement. Navigation collapses without a way to rigorously convert coordinates between frames.

The core problem: every sensor speaks in its own frame. A LIDAR returns distances in a sensor-centered frame. A GPS gives position in a Earth-centered frame. An IMU measures accelerations in a body frame. Vision gives pixels in an image frame. To fuse them, you must relate all those frames to one another with precise mathematical transforms.

This is not a small bookkeeping detail. The history of robotics disasters is full of frame confusion: the Mars Climate Orbiter was lost in 1999 because one team used metric units and another imperial — a frame/units mismatch worth $327 million. In visual navigation the analogous mistake is misrepresenting rotation: apply the wrong transform and a quadrotor that should hover begins to spiral.

The good news: the math is clean, closed-form, and fully derivable from first principles. By the end of this lesson you will have a complete toolkit — rotation matrices, Euler angles, quaternions, and 4×4 homogeneous transforms — and the intuition to know when to use each one.

Frame problem visualizer — three coordinate frames

The world frame (grey) is fixed. The body frame (orange) is attached to the drone. The camera frame (teal) is attached to the camera. Drag the Yaw slider to rotate the drone; notice the body frame rotates but the world frame stays fixed. The white dot is a landmark — its coordinates change in every frame as the drone rotates.

Body Yaw 30°
A camera reports a landmark at position [0, 0, 3] in the camera frame (3 m straight ahead of the lens). The camera is mounted on a drone that has yawed 90° clockwise from its initial heading. In the world frame, where is the landmark relative to where the drone started?

Chapter 1: Points, Vectors & Coordinate Frames

Before we can rotate anything, we need to be precise about what a coordinate frame is. A coordinate frame is a set of three mutually orthogonal axes meeting at an origin. We attach frames to physical objects: the world, the robot body, each sensor. Every frame is right-handed — if you point your right-hand index finger along the x-axis and your middle finger along y, your thumb points along z.

The standard conventions in VNAV (following MIT 16.485):

Note that the camera frame is rotated roughly 180° relative to what you might expect: z points away from you, not toward you, and y points down. This is a deliberate convention that simplifies perspective projection math (Lecture 4). For now, just register that cameras and bodies use different conventions — and that transform between them matters.

Representing points algebraically

The advantage of a frame is that it lets us turn geometric intuition into algebra. A 3D point p has coordinates in the world frame:

pw = [pwx, pwy, pwz]T

The superscript w is not decoration — it is essential. The same physical point has completely different numerical coordinates in the camera frame c or the body frame r. The superscript tells you which frame's ruler you are using. In this lesson (and VNAV generally), we always write the frame as a superscript on a vector. Dropping the superscript is one of the most common mistakes in robotics code.

Translations between points

Given two points p₁ and p₂ both expressed in the world frame, the displacement (translation vector) from p₁ to p₂ is simply:

tw12 = pw2 − pw1

And the inverse: tw21 = −tw12. This "composition" rule for translations (p₂ = p₁ + t₁₂) and its inverse look trivially obvious in a single frame — but the real challenge begins when p₁ and p₂ are expressed in different frames. That requires rotation, the topic of the next six chapters.

Key insight: vectors are frame-less, coordinates are not. The physical vector pointing from p₁ to p₂ exists in the real world independent of any frame. But the three numbers [1.2, −0.4, 3.1] only mean something relative to a specific frame's axes. Change the frame, the numbers change completely — even though the physical displacement is the same.
python
import numpy as np

# A point expressed in the world frame
p1_w = np.array([1.0, 2.0, 0.5])  # [x, y, z] in world frame
p2_w = np.array([3.0, 2.0, 0.5])  # same frame — subtraction is valid

t12_w = p2_w - p1_w          # displacement in world frame = [2, 0, 0]
t21_w = p1_w - p2_w          # inverse displacement = [-2, 0, 0]

print(f"t12 in world: {t12_w}")  # [2.  0.  0.]
print(f"Inverse: {t21_w}")       # [-2.  0.  0.]

# NOTE: you CANNOT subtract vectors from different frames directly:
# p1_world - p1_camera is meaningless without first converting frames.
Point p₁ is at world coordinates [1, 0, 0] and point p₂ is at [4, 3, 0]. What is the displacement vector t₁₂ from p₁ to p₂, expressed in the world frame?

Chapter 2: Rotation Matrices & SO(3)

Imagine you have two frames sharing the same origin but pointing in different directions — say, the world frame and a body frame that has been rotated. How do you represent that orientation difference mathematically? The surprisingly elegant answer: treat each body axis as a point, and stack the world-frame coordinates of those axes as columns of a matrix.

Let the body frame have axes xr, yr, zr (unit vectors in the real world). Expressed in the world frame, these three unit vectors have coordinates xwr, ywr, zwr ∈ ℝ³. The rotation matrix Rwr is:

Rwr = [ xwr | ywr | zwr ] ∈ ℝ3×3

Read this as: "the columns of Rwr are the body axes expressed in the world frame." This is not just a clever bookkeeping trick — it is exactly the right representation because it makes transforming points a simple matrix-vector multiply (Chapter 3).

The SO(3) constraints

Not every 3×3 matrix is a rotation matrix. A valid rotation matrix must satisfy two constraints that come directly from the geometry of rigid frames:

1. Orthogonality: The three axes are unit-length and mutually orthogonal. This is equivalent to RTR = I (where I is the 3×3 identity). A matrix satisfying this is called orthogonal, and it implies:

R−1 = RT

This is the most important property of rotation matrices for computation. Inverting a matrix normally costs O(n³) operations. For a rotation matrix, the inverse is free — just take the transpose.

2. Positive determinant: The axes must satisfy the right-hand rule, which requires det(R) = +1. Orthogonal matrices with det = −1 are reflections, not rotations.

The set of all 3×3 rotation matrices forms a mathematical group called SO(3) — the Special Orthogonal group in 3D. "Special" means det = +1. Closed under matrix multiplication (composing two rotations gives a rotation), associative, has identity (I₃), and inverses (RT).

Worked example: 90° rotation about z-axis

Build the rotation matrix for a 90° counterclockwise rotation about the world z-axis. After this rotation, the body's x-axis points in the world y-direction, and the body's y-axis points in the negative world x-direction:

xwr = [cos(90°), sin(90°), 0]T = [0, 1, 0]T
ywr = [−sin(90°), cos(90°), 0]T = [−1, 0, 0]T
zwr = [0, 0, 1]T (z unchanged)

Rz(90°) =
  col1=[0,1,0]   col2=[−1,0,0]   col3=[0,0,1]

  = [[0, −1, 0],
     [1,  0, 0],
     [0,  0, 1]]

Verify: RTR = I? Column 1 dotted with column 1: 0² + 1² + 0² = 1 ✓. Column 1 dotted with column 2: 0·(−1) + 1·0 + 0·0 = 0 ✓. det = 1 ✓.

python
import numpy as np

def Rz(theta):
    """Rotation matrix for angle theta about z-axis (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1]])

def Rx(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1,  0,  0],
                     [0,  c, -s],
                     [0,  s,  c]])

def Ry(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, 0, s],
                     [ 0, 1, 0],
                     [-s, 0, c]])

R90 = Rz(np.pi/2)  # 90° about z
print(np.round(R90, 4))
# [[ 0. -1.  0.]
#  [ 1.  0.  0.]
#  [ 0.  0.  1.]]

# Verify orthogonality: R^T R should be identity
print(np.round(R90.T @ R90, 4))  # Identity matrix
# Verify det = +1
print(np.round(np.linalg.det(R90), 4))  # 1.0
Misconception: the inverse of a rotation matrix requires expensive linear algebra. Because R is orthogonal (RTR = I), the inverse is exactly RT — a free operation. Never call np.linalg.inv(R) on a rotation matrix in real-time robotics code. It wastes compute and introduces numerical error. Just transpose.
What two properties uniquely characterize a rotation matrix R ∈ SO(3)?

Chapter 3: Rotating Points — Composition & Inverse

We have a rotation matrix Rwr (attitude of body frame r relative to world frame w). Now three fundamental operations: express a point in a new frame, compose two rotations, and invert a rotation.

Expressing a point in a rotated frame

Given the coordinates of a point p in the body frame r, and given Rwr, what are the world-frame coordinates pw?

pw = Rwr · pr

Proof sketch: pw = prx · xwr + pry · ywr + prz · zwr — the world coordinates of p are a weighted sum of the body axes (columns of R), weighted by the body-frame coordinates of p. That weighted sum is exactly a matrix-vector product.

Worked example: The body has yawed 90° about z. A point sits 1 m directly in front of the body (in body frame: pr = [1, 0, 0]T). Where is it in the world frame?

pw = Rz(90°) · [1, 0, 0]T
= [[0,−1,0],[1,0,0],[0,0,1]] · [1,0,0]T
= [0·1+(−1)·0+0·0, 1·1+0·0+0·0, 0·1+0·0+1·0]T
= [0, 1, 0]T

The point is now 1 m along the world y-axis — makes sense, because after a 90° yaw, the body's forward direction (xr) aligns with the world's y direction.

Rotation composition

Given Rwr (r relative to w) and Rrc (camera frame c relative to r), how do we get Rwc (camera relative to world)?

Rwc = Rwr · Rrc

Proof: the columns of Rrc are the camera axes in the body frame. Multiplying on the left by Rwr expresses each of those axes in the world frame — which is exactly what Rwc should contain.

Important: Rotation composition is not commutative. Rwr Rrc ≠ Rrc Rwr in general. The notation subscript–superscript rule keeps you safe: you can only compose RAB · RBC — the subscript of the first must match the superscript of the second, and they "cancel" to give RAC.

Inverting a rotation

Given Rwr (r's orientation in world), how do we compute Rrw (world's orientation in r)?

Rrw = (Rwr)−1 = (Rwr)T

Proof: Rwr Rrw = Rww = I (identity — the world expressed relative to itself). Multiplying both sides on the left by (Rwr)−1 gives Rrw = (Rwr)−1. And since R is orthogonal, (R)−1 = RT.

Rotate a point — live visualization

Drag Rotation angle θ to rotate the body frame about z. The red dot is at pr = [1, 0] in the body frame. Watch the blue dot (pw = R·pr) trace an arc as you rotate. Numeric coordinates shown in real time.

Angle θ
pr = [1.00, 0.00]  |  pw = R·pr = [1.00, 0.00]
python
import numpy as np

# Compose rotations: world←body←camera
R_w_r = Rz(np.radians(45))   # body yawed 45° relative to world
R_r_c = Rx(np.radians(-30))  # camera pitched -30° relative to body
R_w_c = R_w_r @ R_r_c         # camera expressed in world
print("Camera in world:\n", np.round(R_w_c, 3))

# Rotate a point: landmark 1m ahead of camera, in body frame
p_r = np.array([1.0, 0.0, 0.0])
p_w = R_w_r @ p_r
print(f"Landmark in world frame: {np.round(p_w, 3)}")
# With 45° yaw: [0.707, 0.707, 0.0] — 45° between x and y axes

# Invert a rotation: world→body (just transpose)
R_r_w = R_w_r.T
print("Identity check:", np.round(R_w_r @ R_r_w, 4))  # [[1,0,0],[0,1,0],[0,0,1]]
The body frame is rotated 90° about z relative to the world frame: R = [[0,−1,0],[1,0,0],[0,0,1]]. A landmark is at p_body = [0, 2, 0]ᵀ (2 m to the left of the body). What are its world-frame coordinates?

Chapter 4: Euler Angles & Gimbal Lock

The rotation matrix R needs 9 numbers, constrained by 6 equations (RᵀR=I, det=1), leaving only 3 free parameters — which matches our intuition that 3D orientation has 3 degrees of freedom (roll, pitch, yaw). Is there a 3-number representation that directly gives us those DOF?

Euler's rotation theorem (1775) guarantees that any 3D rotation can be written as a product of no more than three elementary rotations about coordinate axes. The three angles used are called Euler angles. A particularly popular choice is the roll-pitch-yaw (RPY) convention:

R = Rz(γ) · Ry(β) · Rx(α)

Where γ is yaw (rotation about z — turning left/right), β is pitch (rotation about y — nose up/down), and α is roll (rotation about x — tilting sideways). Applied right-to-left: first roll, then pitch, then yaw.

The three elementary rotation matrices

Rx(α) = [[1, 0, 0], [0, cosα, −sinα], [0, sinα, cosα]]
Ry(β) = [[cosβ, 0, sinβ], [0, 1, 0], [−sinβ, 0, cosβ]]
Rz(γ) = [[cosγ, −sinγ, 0], [sinγ, cosγ, 0], [0, 0, 1]]

Gimbal lock — losing a degree of freedom

Euler angles look perfect: 3 intuitive parameters for 3 DOF. But they have a fatal flaw: gimbal lock. At certain configurations (specifically, when the pitch angle β = ±90°), the first and third rotation axes align. At that point, roll and yaw both rotate about the same world axis — and you lose one degree of freedom entirely.

Concretely, at β = π/2:

Rz(δ) · Ry(π/2) · Rx(α+δ) = Rz(0) · Ry(π/2) · Rx(α)

No matter how you change δ — the yaw angle — the overall rotation matrix doesn't change. Roll and yaw have become redundant: varying δ in both simultaneously leaves the result unchanged. You have a 2-parameter family of (roll, yaw) combinations that all map to the same physical orientation.

Why gimbal lock is catastrophic in navigation. If your attitude estimator uses Euler angles and the vehicle reaches a 90° pitch, the estimator becomes numerically degenerate — small measurement noise on roll and yaw produces huge corrections in opposite directions that cancel out, making gradient descent or Kalman filter updates ill-conditioned. Unmanned aircraft that attempt aerobatic maneuvers with Euler-angle estimators can lose tracking entirely at the top of a loop.

Euler angles are still widely used for display and pilot interfaces (a human understands "pitch 30°" immediately). But for computation — especially in estimation and control — rotation matrices, axis-angle, or quaternions avoid singularities entirely.

Gimbal lock demonstration

Observe how at Pitch = 90°, changing the Roll slider produces no change in the 3D frame orientation — roll and yaw have merged into the same axis. The orange triangle shows the locked state. Drag Pitch to 90° and try varying Roll.

Roll α
Pitch β
Yaw γ
Gimbal lock occurs in RPY (yaw-pitch-roll) Euler angles when the pitch angle β reaches ±90°. What goes wrong?

Chapter 5: Axis-Angle & Rodrigues' Formula

Euler's second rotation theorem (1776) suggests a different way to think about rotations: every rotation in 3D is equivalent to a single rotation of some angle θ about some fixed axis u (a unit vector). This is the axis-angle representation, and it has a beautiful geometric clarity that matrices and Euler angles lack.

Given a rotation described as (u, θ) — a unit axis u = [ux, uy, uz]T with ‖u‖=1, and an angle θ — we can convert to a rotation matrix via Rodrigues' rotation formula:

R(u, θ) = cos(θ) I3 + sin(θ) [u]× + (1 − cos(θ)) uuT

Where [u]× is the skew-symmetric matrix (also called the cross-product matrix) of u:

[u]× =   [[ 0, −uz, uy],
 [ uz, 0, −ux],
 [−uy, ux, 0]]

The skew-symmetric matrix encodes cross products: [u]×v = u × v. This matrix will reappear throughout VNAV — it is the infinitesimal generator of rotations and the key object in Lie algebra (next lecture).

Worked example: 90° rotation about [0, 0, 1] via Rodrigues

Let u = [0, 0, 1]T (z-axis), θ = π/2. Then:

[u]× = [[0,−1,0],[1,0,0],[0,0,0]]
uuT = [[0,0,0],[0,0,0],[0,0,1]]

R = cos(90°)I + sin(90°)[u]× + (1−cos(90°))uuT
= 0·I + 1·[[0,−1,0],[1,0,0],[0,0,0]] + 1·[[0,0,0],[0,0,0],[0,0,1]]
= [[0,−1,0],[1,0,0],[0,0,1]]

This matches Rz(90°) exactly, confirming Rodrigues' formula is consistent with the column-stacking definition.

Axis-angle pros and cons

The axis-angle representation is elegant for visualization — "rotate 30° about [0,1,0]" is immediately understandable. The inverse is trivial: R(u, θ)⁻¹ = R(u, −θ) = R(−u, θ). But computing a product (composing two rotations) requires expanding each back to a matrix first — there is no closed-form axis-angle composition rule as simple as matrix multiply.

Key insight: 3 independent parameters, no singularity. Axis-angle encodes orientation with 4 numbers (ux, uy, uz, θ) but only 3 are free (because ‖u‖=1 and θ lives in [0, 2π)). Crucially, it has no singularity — unlike Euler angles. This makes it the basis for the even more powerful quaternion representation (next chapter).
python
import numpy as np

def skew(u):
    """Skew-symmetric (cross-product) matrix of a 3-vector u."""
    return np.array([[      0, -u[2],  u[1]],
                     [ u[2],       0, -u[0]],
                     [-u[1],  u[0],       0]])

def rodrigues(u, theta):
    """Rodrigues' formula: rotation matrix from axis u (unit) and angle theta."""
    u = np.asarray(u, dtype=float)
    u /= np.linalg.norm(u)  # ensure unit vector
    K = skew(u)
    return (np.cos(theta) * np.eye(3)
            + np.sin(theta) * K
            + (1 - np.cos(theta)) * np.outer(u, u))

# 30° rotation about the [1, 1, 0]/√2 axis
axis = np.array([1, 1, 0]) / np.sqrt(2)
R = rodrigues(axis, np.radians(30))
print(np.round(R, 3))
# [[0.933 0.067 0.354]
#  [0.067 0.933 -0.354]
#  [-0.354 0.354 0.866]]

# Verify: trace(R) = 1 + 2cos(theta)
tr = np.trace(R)
theta_recovered = np.arccos((tr - 1) / 2)
print(f"Recovered angle: {np.degrees(theta_recovered):.1f}°")  # 30.0°
In Rodrigues' formula R = cos(θ)I + sin(θ)[u]× + (1−cos(θ))uuᵀ, what is the role of the term (1−cos(θ))uuᵀ?

Chapter 6: Quaternions — No Singularities, Minimal Parameters

Euler angles use 3 parameters but have singularities. Rotation matrices use 9 (with constraints). Axis-angle uses 4 (with ‖u‖=1 reducing it to 3 effective). Is there a representation that is singularity-free and uses the minimum number of parameters? Hamilton answered this in 1843 with quaternions.

The key idea: rearrange the axis-angle (u, θ) as a 4-vector by splitting the axis into a "scaled" form:

q = [vT, s]T   where   v = sin(θ/2) u,    s = cos(θ/2)

That is: the vector part v = sin(θ/2)·u encodes the axis scaled by half-angle sine, and the scalar part s = cos(θ/2) encodes the half-angle cosine. Notice this is a unit vector in ℝ⁴: ‖q‖² = sin²(θ/2)‖u‖² + cos²(θ/2) = sin²(θ/2) + cos²(θ/2) = 1. So a unit quaternion lives on the 3-sphere S³ ⊂ ℝ⁴.

Why half-angles?

The half-angle encoding is not arbitrary — it arises naturally from the algebraic structure of Hamilton's quaternion multiplication, and it is what makes quaternions singularity-free. Unlike Euler angles, there is no pitch angle at which two axes align. The cost: the same physical rotation is represented by both q and −q (negating all four components flips θ → θ + 2π, which is the same rotation). This double cover of SO(3) by S³ is the only ambiguity in quaternion representations.

Quaternion to rotation matrix

Given a unit quaternion q = [q₁, q₂, q₃, q₄]T (where q₄ is the scalar part in our convention), the corresponding rotation matrix is:

R(q) =
[[q₁²−q₂²−q₃²+q₄², 2(q₁q₂−q₃q₄), 2(q₁q₃+q₂q₄)],
 [2(q₁q₂+q₃q₄), −q₁²+q₂²−q₃²+q₄², 2(q₂q₃−q₁q₄)],
 [2(q₁q₃−q₂q₄), 2(q₂q₃+q₁q₄), −q₁²−q₂²+q₃²+q₄²]]

Worked example: A 90° rotation about z: θ=90°, u=[0,0,1]. Then v = sin(45°)·[0,0,1] = [0, 0, 0.707], s = cos(45°) = 0.707. So q = [0, 0, 0.707, 0.707]T. Substituting into R(q): q₁=0, q₂=0, q₃=0.707, q₄=0.707.

R11 = 0−0−0.5+0.5 = 0 ✓
R21 = 2(0+0.707·0.707) = 2·0.5 = 1 ✓
R33 = 0−0+0.5+0.5 = 1 ✓

Quaternion composition and inverse

Composing two rotations qA and qB (applying B first, then A) uses quaternion product ⊗:

qAC = qA ⊗ qC

The product involves 16 multiplications (versus 27 for matrix multiply) — a computational advantage for graphics and robotics code that must transform many points quickly. The inverse of a unit quaternion is its conjugate: flip the sign of the vector part:

q−1 = [−vT, s]T

This reflects the axis-angle intuition: to undo a rotation of θ about u, rotate −θ about u, which maps sin(−θ/2)=−sin(θ/2) but leaves cos(−θ/2)=cos(θ/2) unchanged.

Rotation representation explorer

Drag the Angle θ slider. The shape (orange square) rotates by θ about the z-axis. Below, see the equivalent quaternion, rotation matrix column 1, and axis-angle. Notice the quaternion components vary smoothly while the rotation matrix entries oscillate.

Angle θ 45°
python
import numpy as np

def axis_angle_to_quat(u, theta):
    """Convert axis-angle to unit quaternion [v; s] where v=sin(θ/2)u, s=cos(θ/2)."""
    u = np.array(u) / np.linalg.norm(u)
    v = np.sin(theta / 2) * u
    s = np.cos(theta / 2)
    return np.append(v, s)  # [v1, v2, v3, s]

def quat_to_rot(q):
    """Unit quaternion [q1,q2,q3,q4] (scalar last) to rotation matrix."""
    q1, q2, q3, q4 = q
    return np.array([
        [q1**2-q2**2-q3**2+q4**2, 2*(q1*q2-q3*q4),         2*(q1*q3+q2*q4)        ],
        [2*(q1*q2+q3*q4),         -q1**2+q2**2-q3**2+q4**2, 2*(q2*q3-q1*q4)        ],
        [2*(q1*q3-q2*q4),          2*(q2*q3+q1*q4),        -q1**2-q2**2+q3**2+q4**2]
    ])

# 90° rotation about z-axis via quaternion
q = axis_angle_to_quat([0,0,1], np.pi/2)
print(f"q = {np.round(q, 3)}")  # [0.    0.    0.707 0.707]

R_from_q = quat_to_rot(q)
print(np.round(R_from_q, 3))
# [[0. -1.  0.]
#  [1.  0.  0.]   ← matches R_z(90°) exactly
#  [0.  0.  1.]]

# Double cover: q and -q give the same rotation
R_neg = quat_to_rot(-q)
print("Same matrix?", np.allclose(R_from_q, R_neg))  # True
Summary of rotation representations: Rotation matrices (9 params, no singularity, slow to compose), Euler angles (3 params, intuitive, gimbal lock), axis-angle (4 params/3 DOF, clear geometry, no singularity), quaternions (4 params/3 DOF, no singularity, fast composition). For computation in estimation and navigation: rotation matrices or quaternions. For display: Euler angles.
A quaternion q and its negation −q both represent the same rotation. What is the physical reason for this double cover?

Chapter 7: Rigid Body Transforms & SE(3)

So far we have only handled rotation — frames that share the same origin but point in different directions. In practice, a camera is not at the same location as the robot's center of mass, and the robot is not at the world origin. We need to handle both rotation and translation simultaneously. This combined operation is a rigid-body transform, and the set of all such transforms forms the group SE(3) — the Special Euclidean group in 3D.

A rigid-body transform (or pose) of frame r relative to world frame w is fully described by a pair: a rotation Rwr ∈ SO(3) and a translation twr ∈ ℝ³ (position of the r origin in world coordinates). The transformation of a point pr (in frame r) to world coordinates is:

pw = Rwr · pr + twr

Simple enough — but composing two such transforms (applying T₁ then T₂) becomes messy with separate R and t: we need (R₂R₁)p + R₂t₁ + t₂. The algebra gets worse with each additional transform in the chain. The solution is elegant: lift into homogeneous coordinates.

Homogeneous coordinates — the magic trick

Append a 1 to any 3D point to get a homogeneous coordinate vector in ℝ⁴:

r = [prx, pry, prz, 1]T ∈ ℝ4

Now define the 4×4 transform matrix Twr ∈ SE(3):

Twr = [[Rwr, twr], [0T, 1]]

Written out fully — a 4×4 matrix where the top-left 3×3 is the rotation, the top-right 3×1 is the translation, the bottom row is [0 0 0 1]:

Twr =
 [[r₁₁, r₁₂, r₁₃, tx],
  [r₂₁, r₂₂, r₂₃, ty],
  [r₃₁, r₃₂, r₃₃, tz],
  [0, 0, 0, 1]]

Then the transformation of a point is simply:

w = Twr · p̃r

Expanding: the top three rows give R·pr + t (the rotation plus translation), and the bottom row ensures the last entry stays 1. The homogeneous 1 "carries" the translation through the matrix multiply — an algebraic trick that turns an affine operation into a linear one.

Composing transforms

Given Twr (pose of body in world) and Trc (pose of camera in body), the pose of the camera in the world:

Twc = Twr · Trc

Proof by block matrix multiplication:

Twr Trc = [[Rwr, twr],[0,1]] · [[Rrc, trc],[0,1]]
= [[RwrRrc, Rwrtrc+twr],[0,1]]
= [[Rwc, twc],[0,1]] = Twc

The world-frame translation of the camera is Rwr·trc + twr: first rotate the camera's offset into world coordinates, then add the body's world-frame position. Exactly right geometrically.

Inverting a transform

Given Twr, compute Trw (world expressed in body frame):

Trw = (Twr)−1 = [[(Rwr)T, −(Rwr)Ttwr], [0, 1]]

Unlike pure rotations, the inverse of a transform matrix is not its transpose (T is not orthogonal in general). But it still has a closed form — just transpose the rotation block and multiply the translation by it. This is O(12) operations, much cheaper than general 4×4 matrix inversion.

Worked example — full chain: Body is at world position [5, 0, 0] with 90° yaw. Camera is at body position [0.3, 0, 0.2] with 0° rotation relative to body. Where is the camera in the world?

Twr: R=Rz(90°), t=[5,0,0]
Trc: R=I, t=[0.3,0,0.2]

twc = Rwr·trc + twr
= [[0,−1,0],[1,0,0],[0,0,1]]·[0.3,0,0.2] + [5,0,0]
= [0·0.3+(−1)·0+0·0.2, 1·0.3+0·0+0·0.2, 0·0.3+0·0+1·0.2] + [5,0,0]
= [0, 0.3, 0.2] + [5, 0, 0] = [5.0, 0.3, 0.2]

The camera sits 5 m along world-x, 0.3 m along world-y (because the 90° yaw maps body-x to world-y), and 0.2 m up. Correct.

python
import numpy as np

def make_T(R, t):
    """Build 4×4 SE(3) transform from 3×3 rotation R and 3-vector t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3,  3] = t
    return T

def invert_T(T):
    """Invert SE(3) transform using the closed-form formula (NOT np.linalg.inv)."""
    R = T[:3, :3]
    t = T[:3,  3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3,  3] = -R.T @ t
    return T_inv

# Worked example: body at world [5,0,0] with 90° yaw
T_w_r = make_T(Rz(np.pi/2), np.array([5, 0, 0]))
T_r_c = make_T(np.eye(3),     np.array([0.3, 0, 0.2]))

T_w_c = T_w_r @ T_r_c  # camera pose in world
print("Camera position in world:", np.round(T_w_c[:3, 3], 3))
# [5.   0.3  0.2]  ← matches worked example above

# Invert: express world frame in body frame
T_r_w = invert_T(T_w_r)
print("Identity check:", np.round(T_w_r @ T_r_w, 3))  # 4×4 identity

# Transform a 3D point from camera frame to world frame
p_c = np.array([0, 0, 3, 1])  # 3m ahead of camera (homogeneous)
p_w_hom = T_w_c @ p_c
print("Landmark in world:", np.round(p_w_hom[:3], 3))
Misconception: invert a transform with np.linalg.inv(T). This computes a general matrix inverse (O(n³) with floating-point pivoting). For SE(3) transforms you have an exact closed form: T⁻¹ = [[Rᵀ, −Rᵀt],[0,1]]. It is ~10× faster and numerically cleaner. In real-time robotics, calling general matrix inversion on a pose is a code smell.
A transform T_w_r has rotation R and translation t. The formula for its inverse is T_r_w = [[Rᵀ, −Rᵀt],[0,1]]. Why is the translation block −Rᵀt (not just −t)?

Chapter 8: Showcase — Transform Composer

This chapter is the payoff. You control an entire chain of coordinate transforms: world → body → camera → point. Drag the sliders to move and rotate each frame. Watch the point's coordinates update live in every frame. Compose transforms, invert them, and see how a single physical point has completely different numerical representations depending on which frame's ruler you use.

This is exactly the computation that runs inside every VIO/SLAM system 100+ times per second: an IMU gives the body pose in the world, calibration gives the camera pose in the body, and the pipeline computes where each image feature lies in the world.

Transform composer — world → body → camera → landmark

Drag sliders to change the body position (tx, ty) and orientation (yaw), and the camera offset from the body (camx, camy). A landmark (white star) sits at a fixed world position. The display shows the landmark's coordinates in the world frame, body frame, and camera frame in real time.

Body pos X 1.0
Body pos Y 0.0
Body yaw
Camera offset X 0.5
Key takeaway from the showcase: The same physical point (landmark in world) has three very different coordinate vectors depending on the frame. World coordinates never change. Body-frame coordinates depend on the drone's pose. Camera-frame coordinates depend on the drone's pose AND the camera's mounting. All three are related by exact matrix operations — no approximation, no loss of information.

Chapter 9: Connections & Cheat Sheet

Rotation representation comparison

RepresentationDOF / paramsSingularity?CompositionBest for
Rotation matrix R ∈ SO(3)3 DOF / 9 paramsNoMatrix multiply (27 mults)Computation, kinematics chains
Euler angles (roll, pitch, yaw)3 DOF / 3 paramsYes (gimbal lock at ±90° pitch)Slow (trig)Display, pilot interface
Axis-angle (u, θ)3 DOF / 4 paramsNo (but θ=0 has non-unique axis)Convert to R firstVisualization, Lie algebra
Quaternion q ∈ S³3 DOF / 4 paramsNo (double cover: q = −q)Quat product (16 mults)Storage, interpolation (SLERP), graphics

SO(3) / SE(3) key properties cheat sheet

OperationSO(3) rotationSE(3) transform
InverseR−1 = RTT−1 = [[RT, −RTt],[0,1]]
CompositionRAC = RAB · RBCTAC = TAB · TBC
Transform pointpw = R prw = T p̃r
Commutative?No (R₁R₂ ≠ R₂R₁ generally)No
Notation ruleRAB · RBC: inner indices cancelSame rule
ConstraintsRTR=I, det=+1Bottom row = [0,0,0,1]

What comes next: Lie Groups (Lecture 2)

SO(3) and SE(3) are not just sets — they are Lie groups, meaning they are both groups (closed under composition, have identity and inverses) and smooth manifolds (you can take derivatives on them). This matters for estimation: the Kalman filter requires adding perturbations to a state, but "adding" two rotation matrices does not give a rotation matrix. Lecture 2 covers the Lie algebra of SO(3) — the tangent space at identity, where you can do ordinary vector addition — and the exponential map that brings you back to the group. The skew-symmetric matrix [u]× from Rodrigues' formula is the Lie algebra element; the full Rodrigues formula is the exponential map.

Related lessons on this site

Uses the same frame conventions for state vectors (pose of robot in world). The "state" is a point in SE(3); the filter update lives in the Lie algebra.
↓ depends on rotation/transform fundamentals
Linearization of nonlinear state transitions — the Jacobian of a rotation is closely related to the skew-symmetric matrix from this lesson.
Visual-Inertial Odometry: SE(3) transforms are the core state; camera projection adds a frame (pixel frame) to the chain covered here.
SLAM maintains a map of landmarks and a robot pose — all in SE(3). The constraint graph connecting frames is exactly transform composition.
"What I cannot create, I do not understand." You now have everything needed to implement a full rigid-body transform pipeline from scratch: build rotation matrices from axis-angle via Rodrigues, compose them into SE(3) transforms, transform point clouds between sensor frames, and invert transforms to get the world from the robot's perspective. That is the geometric core of every visual navigation system ever built.
You have three frames: world (w), body (r), camera (c). You know T_w_r (body in world) and T_w_c (camera in world). How do you compute T_r_c (camera in body frame)?