A drone's camera spots a landmark "2 m ahead, 1 m left." But in what frame? Without rigorously relating coordinate frames — world, body, camera — navigation is impossible. This lesson builds the geometric language every later VNAV topic speaks: points and frames, rotation matrices in SO(3), Euler angles and their singularities, axis-angle and quaternions, and rigid-body transforms in SE(3). Every formula derived with actual numbers. MIT 16.485 by Luca Carlone.
Your autonomous drone hovers at 10 m altitude. Its forward-facing camera detects a traffic cone directly ahead at a pixel offset that translates to "2.0 m in front, 1.0 m to the left." You want to navigate around it. Simple enough — but here is the first question that will define the next ten chapters: 2.0 m in front and 1.0 m to the left of what, exactly?
The camera has its own camera frame: x-axis pointing right, y-axis pointing down, z-axis pointing into the scene. The drone body has a body frame: x forward, y left, z up. The world has a world frame with a fixed origin on the ground. These three frames are all rotated and translated with respect to each other — and they change relative to each other every millisecond the drone moves.
If you naively treat "2 m forward" in the camera frame as "2 m forward" in the world frame, you will command the drone to fly into the cone if the camera is pitched down, or to miss it entirely if the drone has yawed 90 degrees since the last measurement. Navigation collapses without a way to rigorously convert coordinates between frames.
This is not a small bookkeeping detail. The history of robotics disasters is full of frame confusion: the Mars Climate Orbiter was lost in 1999 because one team used metric units and another imperial — a frame/units mismatch worth $327 million. In visual navigation the analogous mistake is misrepresenting rotation: apply the wrong transform and a quadrotor that should hover begins to spiral.
The good news: the math is clean, closed-form, and fully derivable from first principles. By the end of this lesson you will have a complete toolkit — rotation matrices, Euler angles, quaternions, and 4×4 homogeneous transforms — and the intuition to know when to use each one.
The world frame (grey) is fixed. The body frame (orange) is attached to the drone. The camera frame (teal) is attached to the camera. Drag the Yaw slider to rotate the drone; notice the body frame rotates but the world frame stays fixed. The white dot is a landmark — its coordinates change in every frame as the drone rotates.
Before we can rotate anything, we need to be precise about what a coordinate frame is. A coordinate frame is a set of three mutually orthogonal axes meeting at an origin. We attach frames to physical objects: the world, the robot body, each sensor. Every frame is right-handed — if you point your right-hand index finger along the x-axis and your middle finger along y, your thumb points along z.
The standard conventions in VNAV (following MIT 16.485):
Note that the camera frame is rotated roughly 180° relative to what you might expect: z points away from you, not toward you, and y points down. This is a deliberate convention that simplifies perspective projection math (Lecture 4). For now, just register that cameras and bodies use different conventions — and that transform between them matters.
The advantage of a frame is that it lets us turn geometric intuition into algebra. A 3D point p has coordinates in the world frame:
The superscript w is not decoration — it is essential. The same physical point has completely different numerical coordinates in the camera frame c or the body frame r. The superscript tells you which frame's ruler you are using. In this lesson (and VNAV generally), we always write the frame as a superscript on a vector. Dropping the superscript is one of the most common mistakes in robotics code.
Given two points p₁ and p₂ both expressed in the world frame, the displacement (translation vector) from p₁ to p₂ is simply:
And the inverse: tw21 = −tw12. This "composition" rule for translations (p₂ = p₁ + t₁₂) and its inverse look trivially obvious in a single frame — but the real challenge begins when p₁ and p₂ are expressed in different frames. That requires rotation, the topic of the next six chapters.
python import numpy as np # A point expressed in the world frame p1_w = np.array([1.0, 2.0, 0.5]) # [x, y, z] in world frame p2_w = np.array([3.0, 2.0, 0.5]) # same frame — subtraction is valid t12_w = p2_w - p1_w # displacement in world frame = [2, 0, 0] t21_w = p1_w - p2_w # inverse displacement = [-2, 0, 0] print(f"t12 in world: {t12_w}") # [2. 0. 0.] print(f"Inverse: {t21_w}") # [-2. 0. 0.] # NOTE: you CANNOT subtract vectors from different frames directly: # p1_world - p1_camera is meaningless without first converting frames.
Imagine you have two frames sharing the same origin but pointing in different directions — say, the world frame and a body frame that has been rotated. How do you represent that orientation difference mathematically? The surprisingly elegant answer: treat each body axis as a point, and stack the world-frame coordinates of those axes as columns of a matrix.
Let the body frame have axes xr, yr, zr (unit vectors in the real world). Expressed in the world frame, these three unit vectors have coordinates xwr, ywr, zwr ∈ ℝ³. The rotation matrix Rwr is:
Read this as: "the columns of Rwr are the body axes expressed in the world frame." This is not just a clever bookkeeping trick — it is exactly the right representation because it makes transforming points a simple matrix-vector multiply (Chapter 3).
Not every 3×3 matrix is a rotation matrix. A valid rotation matrix must satisfy two constraints that come directly from the geometry of rigid frames:
1. Orthogonality: The three axes are unit-length and mutually orthogonal. This is equivalent to RTR = I (where I is the 3×3 identity). A matrix satisfying this is called orthogonal, and it implies:
This is the most important property of rotation matrices for computation. Inverting a matrix normally costs O(n³) operations. For a rotation matrix, the inverse is free — just take the transpose.
2. Positive determinant: The axes must satisfy the right-hand rule, which requires det(R) = +1. Orthogonal matrices with det = −1 are reflections, not rotations.
The set of all 3×3 rotation matrices forms a mathematical group called SO(3) — the Special Orthogonal group in 3D. "Special" means det = +1. Closed under matrix multiplication (composing two rotations gives a rotation), associative, has identity (I₃), and inverses (RT).
Build the rotation matrix for a 90° counterclockwise rotation about the world z-axis. After this rotation, the body's x-axis points in the world y-direction, and the body's y-axis points in the negative world x-direction:
Verify: RTR = I? Column 1 dotted with column 1: 0² + 1² + 0² = 1 ✓. Column 1 dotted with column 2: 0·(−1) + 1·0 + 0·0 = 0 ✓. det = 1 ✓.
python import numpy as np def Rz(theta): """Rotation matrix for angle theta about z-axis (radians).""" c, s = np.cos(theta), np.sin(theta) return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]]) def Rx(theta): c, s = np.cos(theta), np.sin(theta) return np.array([[1, 0, 0], [0, c, -s], [0, s, c]]) def Ry(theta): c, s = np.cos(theta), np.sin(theta) return np.array([[ c, 0, s], [ 0, 1, 0], [-s, 0, c]]) R90 = Rz(np.pi/2) # 90° about z print(np.round(R90, 4)) # [[ 0. -1. 0.] # [ 1. 0. 0.] # [ 0. 0. 1.]] # Verify orthogonality: R^T R should be identity print(np.round(R90.T @ R90, 4)) # Identity matrix # Verify det = +1 print(np.round(np.linalg.det(R90), 4)) # 1.0
We have a rotation matrix Rwr (attitude of body frame r relative to world frame w). Now three fundamental operations: express a point in a new frame, compose two rotations, and invert a rotation.
Given the coordinates of a point p in the body frame r, and given Rwr, what are the world-frame coordinates pw?
Proof sketch: pw = prx · xwr + pry · ywr + prz · zwr — the world coordinates of p are a weighted sum of the body axes (columns of R), weighted by the body-frame coordinates of p. That weighted sum is exactly a matrix-vector product.
Worked example: The body has yawed 90° about z. A point sits 1 m directly in front of the body (in body frame: pr = [1, 0, 0]T). Where is it in the world frame?
The point is now 1 m along the world y-axis — makes sense, because after a 90° yaw, the body's forward direction (xr) aligns with the world's y direction.
Given Rwr (r relative to w) and Rrc (camera frame c relative to r), how do we get Rwc (camera relative to world)?
Proof: the columns of Rrc are the camera axes in the body frame. Multiplying on the left by Rwr expresses each of those axes in the world frame — which is exactly what Rwc should contain.
Important: Rotation composition is not commutative. Rwr Rrc ≠ Rrc Rwr in general. The notation subscript–superscript rule keeps you safe: you can only compose RAB · RBC — the subscript of the first must match the superscript of the second, and they "cancel" to give RAC.
Given Rwr (r's orientation in world), how do we compute Rrw (world's orientation in r)?
Proof: Rwr Rrw = Rww = I (identity — the world expressed relative to itself). Multiplying both sides on the left by (Rwr)−1 gives Rrw = (Rwr)−1. And since R is orthogonal, (R)−1 = RT.
Drag Rotation angle θ to rotate the body frame about z. The red dot is at pr = [1, 0] in the body frame. Watch the blue dot (pw = R·pr) trace an arc as you rotate. Numeric coordinates shown in real time.
python import numpy as np # Compose rotations: world←body←camera R_w_r = Rz(np.radians(45)) # body yawed 45° relative to world R_r_c = Rx(np.radians(-30)) # camera pitched -30° relative to body R_w_c = R_w_r @ R_r_c # camera expressed in world print("Camera in world:\n", np.round(R_w_c, 3)) # Rotate a point: landmark 1m ahead of camera, in body frame p_r = np.array([1.0, 0.0, 0.0]) p_w = R_w_r @ p_r print(f"Landmark in world frame: {np.round(p_w, 3)}") # With 45° yaw: [0.707, 0.707, 0.0] — 45° between x and y axes # Invert a rotation: world→body (just transpose) R_r_w = R_w_r.T print("Identity check:", np.round(R_w_r @ R_r_w, 4)) # [[1,0,0],[0,1,0],[0,0,1]]
The rotation matrix R needs 9 numbers, constrained by 6 equations (RᵀR=I, det=1), leaving only 3 free parameters — which matches our intuition that 3D orientation has 3 degrees of freedom (roll, pitch, yaw). Is there a 3-number representation that directly gives us those DOF?
Euler's rotation theorem (1775) guarantees that any 3D rotation can be written as a product of no more than three elementary rotations about coordinate axes. The three angles used are called Euler angles. A particularly popular choice is the roll-pitch-yaw (RPY) convention:
Where γ is yaw (rotation about z — turning left/right), β is pitch (rotation about y — nose up/down), and α is roll (rotation about x — tilting sideways). Applied right-to-left: first roll, then pitch, then yaw.
Euler angles look perfect: 3 intuitive parameters for 3 DOF. But they have a fatal flaw: gimbal lock. At certain configurations (specifically, when the pitch angle β = ±90°), the first and third rotation axes align. At that point, roll and yaw both rotate about the same world axis — and you lose one degree of freedom entirely.
Concretely, at β = π/2:
No matter how you change δ — the yaw angle — the overall rotation matrix doesn't change. Roll and yaw have become redundant: varying δ in both simultaneously leaves the result unchanged. You have a 2-parameter family of (roll, yaw) combinations that all map to the same physical orientation.
Euler angles are still widely used for display and pilot interfaces (a human understands "pitch 30°" immediately). But for computation — especially in estimation and control — rotation matrices, axis-angle, or quaternions avoid singularities entirely.
Observe how at Pitch = 90°, changing the Roll slider produces no change in the 3D frame orientation — roll and yaw have merged into the same axis. The orange triangle shows the locked state. Drag Pitch to 90° and try varying Roll.
Euler's second rotation theorem (1776) suggests a different way to think about rotations: every rotation in 3D is equivalent to a single rotation of some angle θ about some fixed axis u (a unit vector). This is the axis-angle representation, and it has a beautiful geometric clarity that matrices and Euler angles lack.
Given a rotation described as (u, θ) — a unit axis u = [ux, uy, uz]T with ‖u‖=1, and an angle θ — we can convert to a rotation matrix via Rodrigues' rotation formula:
Where [u]× is the skew-symmetric matrix (also called the cross-product matrix) of u:
The skew-symmetric matrix encodes cross products: [u]×v = u × v. This matrix will reappear throughout VNAV — it is the infinitesimal generator of rotations and the key object in Lie algebra (next lecture).
Let u = [0, 0, 1]T (z-axis), θ = π/2. Then:
This matches Rz(90°) exactly, confirming Rodrigues' formula is consistent with the column-stacking definition.
The axis-angle representation is elegant for visualization — "rotate 30° about [0,1,0]" is immediately understandable. The inverse is trivial: R(u, θ)⁻¹ = R(u, −θ) = R(−u, θ). But computing a product (composing two rotations) requires expanding each back to a matrix first — there is no closed-form axis-angle composition rule as simple as matrix multiply.
python import numpy as np def skew(u): """Skew-symmetric (cross-product) matrix of a 3-vector u.""" return np.array([[ 0, -u[2], u[1]], [ u[2], 0, -u[0]], [-u[1], u[0], 0]]) def rodrigues(u, theta): """Rodrigues' formula: rotation matrix from axis u (unit) and angle theta.""" u = np.asarray(u, dtype=float) u /= np.linalg.norm(u) # ensure unit vector K = skew(u) return (np.cos(theta) * np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * np.outer(u, u)) # 30° rotation about the [1, 1, 0]/√2 axis axis = np.array([1, 1, 0]) / np.sqrt(2) R = rodrigues(axis, np.radians(30)) print(np.round(R, 3)) # [[0.933 0.067 0.354] # [0.067 0.933 -0.354] # [-0.354 0.354 0.866]] # Verify: trace(R) = 1 + 2cos(theta) tr = np.trace(R) theta_recovered = np.arccos((tr - 1) / 2) print(f"Recovered angle: {np.degrees(theta_recovered):.1f}°") # 30.0°
Euler angles use 3 parameters but have singularities. Rotation matrices use 9 (with constraints). Axis-angle uses 4 (with ‖u‖=1 reducing it to 3 effective). Is there a representation that is singularity-free and uses the minimum number of parameters? Hamilton answered this in 1843 with quaternions.
The key idea: rearrange the axis-angle (u, θ) as a 4-vector by splitting the axis into a "scaled" form:
That is: the vector part v = sin(θ/2)·u encodes the axis scaled by half-angle sine, and the scalar part s = cos(θ/2) encodes the half-angle cosine. Notice this is a unit vector in ℝ⁴: ‖q‖² = sin²(θ/2)‖u‖² + cos²(θ/2) = sin²(θ/2) + cos²(θ/2) = 1. So a unit quaternion lives on the 3-sphere S³ ⊂ ℝ⁴.
The half-angle encoding is not arbitrary — it arises naturally from the algebraic structure of Hamilton's quaternion multiplication, and it is what makes quaternions singularity-free. Unlike Euler angles, there is no pitch angle at which two axes align. The cost: the same physical rotation is represented by both q and −q (negating all four components flips θ → θ + 2π, which is the same rotation). This double cover of SO(3) by S³ is the only ambiguity in quaternion representations.
Given a unit quaternion q = [q₁, q₂, q₃, q₄]T (where q₄ is the scalar part in our convention), the corresponding rotation matrix is:
Worked example: A 90° rotation about z: θ=90°, u=[0,0,1]. Then v = sin(45°)·[0,0,1] = [0, 0, 0.707], s = cos(45°) = 0.707. So q = [0, 0, 0.707, 0.707]T. Substituting into R(q): q₁=0, q₂=0, q₃=0.707, q₄=0.707.
Composing two rotations qA and qB (applying B first, then A) uses quaternion product ⊗:
The product involves 16 multiplications (versus 27 for matrix multiply) — a computational advantage for graphics and robotics code that must transform many points quickly. The inverse of a unit quaternion is its conjugate: flip the sign of the vector part:
This reflects the axis-angle intuition: to undo a rotation of θ about u, rotate −θ about u, which maps sin(−θ/2)=−sin(θ/2) but leaves cos(−θ/2)=cos(θ/2) unchanged.
Drag the Angle θ slider. The shape (orange square) rotates by θ about the z-axis. Below, see the equivalent quaternion, rotation matrix column 1, and axis-angle. Notice the quaternion components vary smoothly while the rotation matrix entries oscillate.
python import numpy as np def axis_angle_to_quat(u, theta): """Convert axis-angle to unit quaternion [v; s] where v=sin(θ/2)u, s=cos(θ/2).""" u = np.array(u) / np.linalg.norm(u) v = np.sin(theta / 2) * u s = np.cos(theta / 2) return np.append(v, s) # [v1, v2, v3, s] def quat_to_rot(q): """Unit quaternion [q1,q2,q3,q4] (scalar last) to rotation matrix.""" q1, q2, q3, q4 = q return np.array([ [q1**2-q2**2-q3**2+q4**2, 2*(q1*q2-q3*q4), 2*(q1*q3+q2*q4) ], [2*(q1*q2+q3*q4), -q1**2+q2**2-q3**2+q4**2, 2*(q2*q3-q1*q4) ], [2*(q1*q3-q2*q4), 2*(q2*q3+q1*q4), -q1**2-q2**2+q3**2+q4**2] ]) # 90° rotation about z-axis via quaternion q = axis_angle_to_quat([0,0,1], np.pi/2) print(f"q = {np.round(q, 3)}") # [0. 0. 0.707 0.707] R_from_q = quat_to_rot(q) print(np.round(R_from_q, 3)) # [[0. -1. 0.] # [1. 0. 0.] ← matches R_z(90°) exactly # [0. 0. 1.]] # Double cover: q and -q give the same rotation R_neg = quat_to_rot(-q) print("Same matrix?", np.allclose(R_from_q, R_neg)) # True
So far we have only handled rotation — frames that share the same origin but point in different directions. In practice, a camera is not at the same location as the robot's center of mass, and the robot is not at the world origin. We need to handle both rotation and translation simultaneously. This combined operation is a rigid-body transform, and the set of all such transforms forms the group SE(3) — the Special Euclidean group in 3D.
A rigid-body transform (or pose) of frame r relative to world frame w is fully described by a pair: a rotation Rwr ∈ SO(3) and a translation twr ∈ ℝ³ (position of the r origin in world coordinates). The transformation of a point pr (in frame r) to world coordinates is:
Simple enough — but composing two such transforms (applying T₁ then T₂) becomes messy with separate R and t: we need (R₂R₁)p + R₂t₁ + t₂. The algebra gets worse with each additional transform in the chain. The solution is elegant: lift into homogeneous coordinates.
Append a 1 to any 3D point to get a homogeneous coordinate vector in ℝ⁴:
Now define the 4×4 transform matrix Twr ∈ SE(3):
Written out fully — a 4×4 matrix where the top-left 3×3 is the rotation, the top-right 3×1 is the translation, the bottom row is [0 0 0 1]:
Then the transformation of a point is simply:
Expanding: the top three rows give R·pr + t (the rotation plus translation), and the bottom row ensures the last entry stays 1. The homogeneous 1 "carries" the translation through the matrix multiply — an algebraic trick that turns an affine operation into a linear one.
Given Twr (pose of body in world) and Trc (pose of camera in body), the pose of the camera in the world:
Proof by block matrix multiplication:
The world-frame translation of the camera is Rwr·trc + twr: first rotate the camera's offset into world coordinates, then add the body's world-frame position. Exactly right geometrically.
Given Twr, compute Trw (world expressed in body frame):
Unlike pure rotations, the inverse of a transform matrix is not its transpose (T is not orthogonal in general). But it still has a closed form — just transpose the rotation block and multiply the translation by it. This is O(12) operations, much cheaper than general 4×4 matrix inversion.
Worked example — full chain: Body is at world position [5, 0, 0] with 90° yaw. Camera is at body position [0.3, 0, 0.2] with 0° rotation relative to body. Where is the camera in the world?
The camera sits 5 m along world-x, 0.3 m along world-y (because the 90° yaw maps body-x to world-y), and 0.2 m up. Correct.
python import numpy as np def make_T(R, t): """Build 4×4 SE(3) transform from 3×3 rotation R and 3-vector t.""" T = np.eye(4) T[:3, :3] = R T[:3, 3] = t return T def invert_T(T): """Invert SE(3) transform using the closed-form formula (NOT np.linalg.inv).""" R = T[:3, :3] t = T[:3, 3] T_inv = np.eye(4) T_inv[:3, :3] = R.T T_inv[:3, 3] = -R.T @ t return T_inv # Worked example: body at world [5,0,0] with 90° yaw T_w_r = make_T(Rz(np.pi/2), np.array([5, 0, 0])) T_r_c = make_T(np.eye(3), np.array([0.3, 0, 0.2])) T_w_c = T_w_r @ T_r_c # camera pose in world print("Camera position in world:", np.round(T_w_c[:3, 3], 3)) # [5. 0.3 0.2] ← matches worked example above # Invert: express world frame in body frame T_r_w = invert_T(T_w_r) print("Identity check:", np.round(T_w_r @ T_r_w, 3)) # 4×4 identity # Transform a 3D point from camera frame to world frame p_c = np.array([0, 0, 3, 1]) # 3m ahead of camera (homogeneous) p_w_hom = T_w_c @ p_c print("Landmark in world:", np.round(p_w_hom[:3], 3))
This chapter is the payoff. You control an entire chain of coordinate transforms: world → body → camera → point. Drag the sliders to move and rotate each frame. Watch the point's coordinates update live in every frame. Compose transforms, invert them, and see how a single physical point has completely different numerical representations depending on which frame's ruler you use.
This is exactly the computation that runs inside every VIO/SLAM system 100+ times per second: an IMU gives the body pose in the world, calibration gives the camera pose in the body, and the pipeline computes where each image feature lies in the world.
Drag sliders to change the body position (tx, ty) and orientation (yaw), and the camera offset from the body (camx, camy). A landmark (white star) sits at a fixed world position. The display shows the landmark's coordinates in the world frame, body frame, and camera frame in real time.
| Representation | DOF / params | Singularity? | Composition | Best for |
|---|---|---|---|---|
| Rotation matrix R ∈ SO(3) | 3 DOF / 9 params | No | Matrix multiply (27 mults) | Computation, kinematics chains |
| Euler angles (roll, pitch, yaw) | 3 DOF / 3 params | Yes (gimbal lock at ±90° pitch) | Slow (trig) | Display, pilot interface |
| Axis-angle (u, θ) | 3 DOF / 4 params | No (but θ=0 has non-unique axis) | Convert to R first | Visualization, Lie algebra |
| Quaternion q ∈ S³ | 3 DOF / 4 params | No (double cover: q = −q) | Quat product (16 mults) | Storage, interpolation (SLERP), graphics |
| Operation | SO(3) rotation | SE(3) transform |
|---|---|---|
| Inverse | R−1 = RT | T−1 = [[RT, −RTt],[0,1]] |
| Composition | RAC = RAB · RBC | TAC = TAB · TBC |
| Transform point | pw = R pr | p̃w = T p̃r |
| Commutative? | No (R₁R₂ ≠ R₂R₁ generally) | No |
| Notation rule | RAB · RBC: inner indices cancel | Same rule |
| Constraints | RTR=I, det=+1 | Bottom row = [0,0,0,1] |
SO(3) and SE(3) are not just sets — they are Lie groups, meaning they are both groups (closed under composition, have identity and inverses) and smooth manifolds (you can take derivatives on them). This matters for estimation: the Kalman filter requires adding perturbations to a state, but "adding" two rotation matrices does not give a rotation matrix. Lecture 2 covers the Lie algebra of SO(3) — the tangent space at identity, where you can do ordinary vector addition — and the exponential map that brings you back to the group. The skew-symmetric matrix [u]× from Rodrigues' formula is the Lie algebra element; the full Rodrigues formula is the exponential map.