Classical ML · CS229

Principal Component Analysis

Finding the few directions that matter in data with too many dimensions. Squeeze a thousand features down to a handful that capture almost everything — using one elegant eigenvector calculation.

Prerequisites: Vectors & projection + the idea of variance. Eigenvectors built from scratch.
9 chapters · 7+ simulations · 0 assumed knowledge

Chapter 0: When You Have Too Many Numbers

Imagine a dataset of cars, each described by dozens of attributes: top speed, horsepower, weight, fuel economy, price, acceleration, and so on. But many of these are redundant. Suppose two columns are "top speed in miles per hour" and "top speed in kilometers per hour" — they're essentially the same number in different units, almost perfectly correlated. That second column adds a whole dimension but no new information. Your data lives in fewer real dimensions than the number of columns suggests.
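
To make the redundancy concrete, here is a minimal sketch with made-up numbers (the speed range, the extra horsepower column, and the sample size are all assumptions for illustration). It builds a table with three columns and checks how many independent directions the data really spans.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
mph = rng.uniform(90.0, 180.0, size=n)    # made-up top speeds in mph
kph = mph * 1.609344                      # the same quantity in km/h
hp = rng.uniform(80.0, 400.0, size=n)     # an unrelated made-up feature

X = np.column_stack([mph, kph, hp])       # 3 columns on paper
corr = np.corrcoef(mph, kph)[0, 1]        # correlation of the two speed columns
rank = np.linalg.matrix_rank(X - X.mean(axis=0))

print(f"corr(mph, kph) = {corr:.4f}")
print(f"columns = {X.shape[1]}, independent directions = {rank}")
```

The rank of the centered table comes out one lower than the column count: the km/h column occupies a slot but contributes no direction of its own.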

This is everywhere, and usually subtler. In a survey of radio-control helicopter pilots, "piloting skill" and "enjoyment of flying" turn out strongly correlated — because only people who truly enjoy it stick with this hard hobby long enough to get good. So the two-dimensional data really lies along a single diagonal axis, an underlying "flying passion" that drives both, plus a little noise off to the side. Two measured numbers, but essentially one hidden quantity.

Two features, but the data lives on a line

Here's 2D data — "skill" on one axis, "enjoyment" on the other. Notice it doesn't fill the plane; it hugs a diagonal line. The spread along that diagonal is huge (real variation between people); the spread across it is tiny (just noise). So this "2D" data is really almost 1D — one direction holds nearly all the information. PCA's job is to find that direction automatically.

[Interactive control: correlation slider, shown at 0.85]
The more correlated, the flatter the cloud — the more "1D" it truly is.

Why does this matter so much? Three reasons. Visualization: you can't plot 50-dimensional data, but you can plot its 2 most important directions. Efficiency & overfitting: fewer dimensions means faster algorithms and less room to overfit (recall the bias-variance lesson). Noise reduction: the unimportant directions are often just noise; dropping them cleans the data. The technique that finds the few directions worth keeping is Principal Component Analysis — PCA — and remarkably, the whole thing comes down to a single eigenvector calculation.

The plan. Ch.1: preprocessing (center the data). Ch.2: the core idea — find the direction of maximum variance. Ch.3: the covariance matrix that encodes all the directions. Ch.4: the punchline — the principal directions are the eigenvectors of the covariance. Ch.5: a PCA lab. Ch.6: reducing dimensions and reconstructing. Ch.7: real applications (compression, eigenfaces, noise removal). From "too many redundant numbers" to a clean, low-dimensional summary that keeps what matters.
Common misconception: "PCA throws away features to reduce dimensions — it picks the most important columns." PCA doesn't select original features; it invents new ones — each principal component is a combination of all the original features (a direction in the full space). "Skill × 0.7 + enjoyment × 0.7" might be the first component, a blend neither column alone captures. PCA finds informative directions, not informative columns.
Quiz: Why can a dataset with many features often be represented with far fewer dimensions?

Chapter 1: Center First — The Preprocessing That Matters

Before PCA can find directions of variation, we must prepare the data with one essential step (and one usually-recommended one). It's tempting to skip, but getting it wrong quietly breaks everything.

Centering: subtract the mean

The essential step is centering: compute the mean of each feature and subtract it, so the data cloud is centered on the origin. Why is this non-negotiable? Because PCA is about variation around the center — the directions the data spreads. If the cloud sits far from the origin, the dominant "direction" PCA finds would just point from the origin toward the cloud's location — capturing where the data is, not how it varies. Centering removes the position so PCA sees only the shape.
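
A small sketch of the centering step, assuming NumPy and a made-up 2D cloud placed far from the origin. It shows that subtracting the mean moves the cloud without changing its shape, using the covariance matrix as the measure of shape.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 2D cloud: stretched, tilted, and parked far from the origin.
Z = rng.normal(size=(500, 2))
X = Z @ np.array([[2.0, 1.2], [0.0, 0.5]]) + np.array([40.0, 25.0])

mu = X.mean(axis=0)   # per-feature mean
Xc = X - mu           # centered data: same shape, new position

# The spread and tilt (the covariance) are identical before and after.
same_shape = np.allclose(np.cov(X, rowvar=False), np.cov(Xc, rowvar=False))

print("mean before:", mu.round(2))
print("mean after: ", Xc.mean(axis=0).round(2))
print("shape unchanged:", same_shape)
```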

Centering: slide the cloud to the origin

The raw data cloud sits off-center. Press Center to subtract the mean, sliding it onto the origin (the crosshairs). Notice the shape — the spread and tilt — is completely unchanged; only the position moves. That shape is all PCA cares about, which is exactly why we remove the position first.

[Readout: current mean of the data]

Scaling: equalize the units

The usually-recommended step is scaling (standardizing) each feature to unit variance. Here's the trap it avoids: imagine one feature is "weight in kilograms" (values in the thousands) and another is "fuel economy" (values around 30). The weight feature has vastly larger numbers, so it has vastly larger variance, and PCA, which chases variance, would declare "weight" the most important direction purely because of its units, not its true importance. Scaling each feature to comparable spread prevents arbitrary unit choices from dominating. (Skip scaling only when all features are already in the same meaningful units, like pixel intensities.)
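
Here is a hedged illustration of the unit trap, using made-up data: a weight-like feature with values in the thousands and a fuel-economy-like feature with values around 30 that genuinely co-varies with it. The helper `first_pc` is a name introduced here for the sketch, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
weight = rng.normal(1500.0, 300.0, n)                     # values in the thousands
economy = 45.0 - 0.012 * weight + rng.normal(0, 2.0, n)   # values around 30
X = np.column_stack([weight, economy])

def first_pc(X):
    """Top eigenvector of the covariance of X (np.cov centers internally)."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, -1]            # eigh sorts ascending, so the last column is PC1

pc1_raw = first_pc(X)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # unit variance per feature
pc1_std = first_pc(X_std)

print("PC1 on raw units:    ", np.abs(pc1_raw).round(3))
print("PC1 after scaling:   ", np.abs(pc1_std).round(3))
```

On the raw units the first component is almost entirely the weight axis. After scaling, both features get equal weight in the first component, which reflects their correlation and not their units.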

Common misconception: "Centering is just a minor cleanup step you can skip." Skipping it can completely wreck PCA. On un-centered data, the first principal component often just points toward the data's mean location — a useless direction that captures the cloud's position instead of its variation. Centering isn't optional polish; it's a precondition for PCA to mean anything. (Scaling is more of a judgment call, but centering is mandatory.)
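
The failure is easy to reproduce. The sketch below assumes the "no centering" version of PCA means taking the top eigenvector of the raw second-moment matrix XᵀX/n (note that np.cov would quietly center for you, so it is avoided here). The data and the helper name `top_eigenvector` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
long_axis = np.array([1.0, -1.0]) / np.sqrt(2)    # true direction of variation
short_axis = np.array([1.0, 1.0]) / np.sqrt(2)
X = (np.outer(rng.normal(0, 3.0, n), long_axis)
     + np.outer(rng.normal(0, 0.5, n), short_axis)
     + np.array([50.0, 50.0]))                    # cloud sits far from the origin

def top_eigenvector(M):
    """Eigenvector with the largest eigenvalue of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)   # eigenvalues come back in ascending order
    return vecs[:, -1]

Xc = X - X.mean(axis=0)
u_raw = top_eigenvector(X.T @ X / n)          # centering skipped
u_centered = top_eigenvector(Xc.T @ Xc / n)   # centering done

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
align_raw_mean = abs(u_raw @ mean_dir)             # close to 1: points at the mean
align_centered_axis = abs(u_centered @ long_axis)  # close to 1: finds the real axis

print(f"uncentered PC1 vs mean direction: {align_raw_mean:.3f}")
print(f"centered PC1 vs true long axis:   {align_centered_axis:.3f}")
```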
Quiz: Why must you center the data (subtract the mean) before running PCA?

Chapter 2: The Direction of Maximum Variance

Now the heart of PCA, and it's a single beautiful idea: find the direction along which the data varies the most. That direction — the one that captures the largest spread — is the first principal component, the single most informative axis through your data.

To make this precise, we need the idea of projection. Pick any direction — a unit vector u (an arrow of length 1). Now drop each data point perpendicularly onto the line through u; where it lands is the point's projection, a single number measuring how far along u the point sits. Project all the points and you've squashed your 2D cloud down to a 1D set of values on that line. The variance of those projected values — how spread out they are — measures how much of the data's variation that direction captures.

Different directions capture different amounts. Point u along the data's long diagonal, and the projections are widely spread — high variance, lots of information preserved. Point u across the cloud's thin width, and the projections bunch up — low variance, little information. PCA's first component is simply the u that maximizes the projected variance. Find it, and you've found the single direction worth keeping if you could keep only one.
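
The "rotate and check" search can be sketched in a few lines. This is a brute-force illustration, assuming a made-up cloud whose long axis is tilted at 35° and a 1° search grid; `projected_variance` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
true_angle = np.deg2rad(35.0)   # made-up tilt of the cloud's long axis
R = np.array([[np.cos(true_angle), -np.sin(true_angle)],
              [np.sin(true_angle),  np.cos(true_angle)]])
X = (rng.normal(size=(2000, 2)) * np.array([3.0, 0.6])) @ R.T
Xc = X - X.mean(axis=0)

def projected_variance(Xc, angle_deg):
    """Variance of the centered points projected onto the unit vector at angle_deg."""
    a = np.deg2rad(angle_deg)
    u = np.array([np.cos(a), np.sin(a)])
    return np.var(Xc @ u, ddof=1)    # Xc @ u is each point's position along u

angles = np.arange(0, 180)           # a direction and its opposite give the same line
variances = np.array([projected_variance(Xc, a) for a in angles])
best_angle = angles[np.argmax(variances)]

print("angle with maximum projected variance (degrees):", best_angle)
```

The winning angle lands on the cloud's long axis, up to sampling noise and the grid resolution.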

Rotate the line, watch the variance

Rotate the direction line and watch the points project onto it (the small marks on the line) and the projected variance bar respond. Hunt for the angle that maximizes the variance — you'll find it lines up with the cloud's long axis. That maximum-variance direction is the first principal component. Press Snap to best to jump there.

[Interactive controls: direction angle slider (shown at 20°); readout of the projected variance]
Common misconception: "The best direction is the one that goes through the most points." It's the direction of greatest spread, not greatest density. The first component points along the axis where the projected points are most stretched out — maximizing variance — which is the cloud's longest direction, even if points are evenly scattered. "Most variance preserved" is the precise criterion, and it's what makes the dropped directions safe to discard (they had little variance to lose).
Quiz: What defines the first principal component of a dataset?

Chapter 3: The Covariance Matrix — All Directions at Once

Rotating a line by hand to find maximum variance works, but it's clumsy and won't scale to 50 dimensions. We need the variance in every direction packaged into one object we can compute with. That object is the covariance matrix — and it's the bridge from "rotate and check" to "solve in one shot."

For centered data, the variance of the projection onto a direction u has a wonderfully compact form. It equals uᵀΣu, where Σ (Sigma) is the covariance matrix, the same object from the generative-learning lesson. For 2D data it's a 2×2 matrix: the diagonal entries are the variance of each feature on its own, and the off-diagonals are the covariance, which measures how the two features vary together. Σ encodes the entire shape of the data cloud: its spread along each axis and its tilt.

So "find the direction of maximum variance" becomes the crisp problem: find the unit vector u that maximizes uᵀΣu. No more rotating by hand; it's now a clean optimization over a known matrix. The covariance matrix is the data's shape distilled into numbers, and PCA is about to read the best directions straight out of it.
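
A quick numerical check of that identity, assuming made-up 2D data and an arbitrary unit vector. Both sides use the n − 1 convention that np.cov uses by default.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 2)) @ np.array([[2.0, 0.8], [0.0, 0.6]])
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)          # 2x2 covariance (divides by n - 1)

u = np.array([np.cos(0.7), np.sin(0.7)])  # any unit vector works
var_by_projection = np.var(Xc @ u, ddof=1)   # project first, then measure spread
var_by_formula = u @ Sigma @ u               # read it off the covariance matrix

print(var_by_projection, var_by_formula)
```

The two numbers agree to floating-point precision, which is what lets the covariance matrix stand in for every possible "rotate and check" at once.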

The covariance ellipse is the data's shape

The ellipse drawn here is the covariance matrix made visible — it traces the data's spread, stretching along the directions of high variance and pinching where variance is low. Reshape the data (drag the stretch and tilt sliders) and watch the covariance ellipse follow. The ellipse's long axis is exactly the maximum-variance direction PCA seeks — you can already see where this is going.

[Interactive controls: stretch slider (shown at 2.4), tilt slider (shown at 30°); readout of the covariance matrix Σ]
Common misconception: "The covariance matrix just stores variances on the diagonal." The off-diagonal terms are what make PCA interesting — they capture how features co-vary, which is what tilts the ellipse off the axes. If the off-diagonals were zero, the features would be uncorrelated, the ellipse axis-aligned, and the original axes would already be the principal components. PCA earns its keep precisely when the off-diagonals are non-zero — it rotates to a new basis where they vanish.
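
The "off-diagonals vanish" claim can be checked directly. This sketch borrows np.linalg.eigh, the routine this lesson's final code listing uses, to get the rotated basis; the data is made up.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(800, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])
Xc = X - X.mean(axis=0)

Sigma = np.cov(Xc, rowvar=False)
print("original axes:\n", Sigma.round(3))        # off-diagonals are non-zero

_, vecs = np.linalg.eigh(Sigma)                  # columns form the rotated basis
Sigma_rot = np.cov(Xc @ vecs, rowvar=False)      # covariance measured in that basis
print("rotated axes:\n", Sigma_rot.round(3))     # off-diagonals are ~0
```

In the rotated coordinates the features are uncorrelated, so all of the cloud's shape sits on the diagonal.
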
Quiz: For centered data, the variance of the projection onto a unit direction u equals uᵀΣu. What is Σ?

Chapter 4: Eigenvectors — The Punchline

Here's the result that makes PCA one of the most elegant algorithms in all of machine learning. We want the unit vector u that maximizes uᵀΣu. When you solve this optimization (with a touch of calculus), the answer is breathtakingly clean: the maximizing direction is the top eigenvector of the covariance matrix Σ.
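
For readers who want the "touch of calculus" spelled out, here is one standard way to do it, sketched with a Lagrange multiplier λ (a symbol introduced here, not used elsewhere in the lesson) to enforce the unit-length constraint:

```latex
\begin{aligned}
&\max_{u}\; u^{\top}\Sigma u \quad \text{subject to} \quad u^{\top}u = 1 \\
&\mathcal{L}(u,\lambda) = u^{\top}\Sigma u - \lambda\,(u^{\top}u - 1) \\
&\nabla_{u}\mathcal{L} = 2\Sigma u - 2\lambda u = 0 \;\Longrightarrow\; \Sigma u = \lambda u \\
&u^{\top}\Sigma u = \lambda\, u^{\top}u = \lambda
\end{aligned}
```

The third line says every candidate solution is an eigenvector of Σ, and the last line says the variance it captures equals its eigenvalue. So the maximum is reached at the eigenvector with the largest eigenvalue.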

Let's unpack what that means, building eigenvectors from intuition. An eigenvector of a matrix is a special direction that the matrix only stretches, never rotates — multiply the matrix by that vector and you get the same vector back, just scaled. The amount it's scaled by is the eigenvalue. For a covariance matrix, the eigenvectors are the data's natural axes — the directions along which the cloud is purely stretched — and each eigenvalue is exactly the variance captured along that axis.

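So PCA becomes a recipe of stunning simplicity:
  1. Center the data by subtracting each feature's mean.
  2. Compute the covariance matrix Σ.
  3. Find the eigenvectors and eigenvalues of Σ.
  4. Sort the eigenvectors by eigenvalue, largest first.
  5. Keep the top k eigenvectors and project the data onto them.
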
That's the entire algorithm. The principal components are the eigenvectors of the covariance, ordered by how much variance they explain. And they're automatically orthogonal (mutually perpendicular), forming a clean new coordinate system perfectly aligned with the data's natural directions. No rotating by hand, no iteration — one eigenvector calculation and you're done.
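
The three claims in this chapter (eigenvectors are only stretched, they are mutually perpendicular, and each eigenvalue is the variance along its axis) can each be checked in a line. A minimal sketch on made-up 2D data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(600, 2)) @ np.array([[2.5, 1.0], [0.0, 0.7]])
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

vals, vecs = np.linalg.eigh(Sigma)        # eigh is meant for symmetric matrices
order = np.argsort(vals)[::-1]            # largest variance first
vals, vecs = vals[order], vecs[:, order]  # column j of vecs pairs with vals[j]

stretch_only = np.allclose(Sigma @ vecs, vecs * vals)     # Sigma v = lambda v
orthonormal = np.allclose(vecs.T @ vecs, np.eye(2))       # perpendicular unit axes
scores = Xc @ vecs                                        # coordinates along PC1, PC2
variance_matches = np.allclose(scores.var(axis=0, ddof=1), vals)

print(stretch_only, orthonormal, variance_matches)
```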

The principal components, read from the covariance

The two arrows are the eigenvectors of the data's covariance — the principal components. PC1 (long arrow) points along maximum variance; PC2 (short arrow) is perpendicular, capturing the leftover. Each arrow's length is its eigenvalue (variance captured). Reshape the data and watch the components re-align to the cloud's natural axes — always perpendicular, always ordered by variance.

[Interactive controls: tilt slider (shown at 30°), stretch slider (shown at 2.6); readouts of the variance along PC1 and PC2 and the share PC1 explains]
Why eigenvectors, intuitively. The covariance matrix "acts" on directions by stretching them according to the data's spread. The eigenvectors are the directions that survive this stretching unrotated — they're the matrix's own preferred axes. And those preferred axes are precisely the directions of pure, uncorrelated variation in the data: the longest one is where the data varies most. It's no coincidence that "the matrix's natural axes" and "the data's natural axes of variation" are the same thing — the covariance matrix is the data's shape, so its eigenvectors are the shape's principal directions.
Common misconception: "You have to learn heavy linear algebra to use PCA." You should understand that PCA = eigenvectors of the covariance ordered by eigenvalue; that's the concept. But computing them is a one-line library call (np.linalg.eigh, which is built for symmetric matrices like Σ, or, better, the SVD of the centered data). The deep ideas are intuitive (max-variance directions = natural axes); the mechanics are automated. Don't let "eigenvector" intimidate you: it just means "the matrix's natural stretch direction."
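
As a sanity check that the two library routes agree, the sketch below compares the eigendecomposition of the covariance with the SVD of the centered data on made-up 3-feature data. The conversion from singular values to variances assumes the n − 1 convention of np.cov.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
X = rng.normal(size=(n, 3)) * np.array([4.0, 2.0, 0.5])   # made-up 3-feature data
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]                    # largest first

# Route 2: SVD of the centered data, with no covariance matrix formed.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
vals_svd = s**2 / (n - 1)                                 # singular values -> variances

same_variances = np.allclose(vals, vals_svd)
# Directions agree only up to sign, so compare absolute dot products.
same_directions = np.allclose(np.abs(np.sum(vecs * Vt.T, axis=0)), 1.0)

print(same_variances, same_directions)
```

The sign ambiguity is worth remembering: a principal component and its negation describe the same axis, so different routines may return either one.
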
Quiz: What are the principal components of a dataset, mathematically?

Chapter 5: The PCA Lab

Let's put the whole pipeline in motion. Below, PCA runs live on data you control: it centers the points, computes the covariance, finds the principal components, and — the payoff — projects the data onto the top component, reducing 2D to 1D. Watch each 2D point collapse onto the principal line, becoming a single number, while losing almost nothing.

Things to try:
  • Toggle the projection to watch every point slide onto PC1 — that's 2D→1D compression. See how little moves when the data is elongated (little is lost).
  • Drag points to reshape the cloud and watch the principal components instantly re-align and the "variance explained" update.
  • Make the cloud round (no clear long axis) — now PC1 explains only ~50%, and projecting to 1D loses a lot. PCA is only powerful when the data has a dominant direction.
  • Make it a thin line — PC1 explains ~99%, and the 1D projection is nearly lossless. That's the ideal case for dimensionality reduction.
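
The last two experiments can also be run offline. This is a rough stand-in for the lab, assuming synthetic Gaussian clouds; `pc1_share` is a helper name introduced here.

```python
import numpy as np

def pc1_share(X):
    """Fraction of total variance captured by the first principal component."""
    vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))   # ascending eigenvalues
    return vals[-1] / vals.sum()

rng = np.random.default_rng(9)
round_cloud = rng.normal(size=(5000, 2))                         # equal spread everywhere
thin_cloud = rng.normal(size=(5000, 2)) * np.array([5.0, 0.3])   # one dominant axis

round_share = pc1_share(round_cloud)
thin_share = pc1_share(thin_cloud)
print(f"round cloud: PC1 explains {round_share:.0%}")
print(f"thin cloud:  PC1 explains {thin_share:.0%}")
```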
PCA Lab — project 2D down to 1D

Drag points to reshape the data. The arrows are the principal components; the bar shows how much variance each explains. Toggle Project to 1D to collapse every point onto PC1 (orange line) — dimensionality reduction, live.

[Readout: share of variance PC1 explains]

(No quiz — the lab is the test. If you can predict, by looking at a cloud's shape, roughly what percent of variance PC1 will explain — near 100% for a thin cloud, near 50% for a round one — you understand what PCA measures and when it helps.)

Chapter 6: Reduce, Reconstruct, and Choose k

PCA gives you a new set of axes (the principal components). The actual reduction is simple: to compress a point, keep only its coordinates along the top k components and drop the rest. A point that lived in d dimensions is now described by just k numbers — its projections onto the most informative directions.

The magic is that you can reconstruct an approximation of the original from those k numbers: rebuild the point as a combination of the kept components (plus the mean you subtracted). The gap between the original and the reconstruction is the reconstruction error, and averaged over the dataset its squared size is exactly the variance you threw away by dropping the minor components. This reveals PCA's beautiful dual nature: maximizing the variance you keep is the very same thing as minimizing the reconstruction error. Two views, one algorithm.
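
The equality between reconstruction error and discarded variance can be verified numerically. A minimal sketch on made-up 4-feature data, keeping k = 2 components and averaging the squared error with the same n − 1 that np.cov uses:

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 500, 2
X = rng.normal(size=(n, 4)) * np.array([3.0, 2.0, 0.5, 0.2]) + 7.0   # made-up data
mu = X.mean(axis=0)
Xc = X - mu

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]      # largest variance first
U = vecs[:, :k]                             # kept components (columns)

Y = Xc @ U                                  # compress: 4 numbers -> k numbers per point
X_recon = Y @ U.T + mu                      # rebuild an approximation in 4 dims

# Total squared error per point, averaged with n - 1 to match np.cov.
recon_error = np.sum((X - X_recon) ** 2) / (n - 1)
discarded_variance = vals[k:].sum()         # eigenvalues of the dropped components

print(recon_error, discarded_variance)
```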

Keep k components, reconstruct, see the error

3D-ish data shown in 2D. Slide k (how many components to keep). At k=1 each point is forced onto the PC1 line — big compression, visible reconstruction error (the gap from each original point to its reconstruction). At k=2 (full) the reconstruction is perfect. Watch the error shrink as you keep more components, and the variance-explained climb.

[Interactive controls: slider for components kept (k), shown at 1; readouts of variance kept and reconstruction error]

Choosing k: the scree plot

How many components should you keep? The same elbow idea from k-means returns, here called a scree plot: plot the variance explained by each component, in order. The early components capture a lot; later ones capture progressively less, until they're just noise. You keep enough components to capture a target fraction of the total variance — 95% or 99% is typical — or you look for the "elbow" where the curve flattens. Often a handful of components capture nearly all the variance of data with hundreds of features — which is exactly why PCA compresses so dramatically.
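
The target-fraction rule is simple to automate. Below is one possible helper, `choose_k` (a hypothetical name), run on a made-up eigenvalue spectrum; it covers only the threshold rule, not the visual elbow.

```python
import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k whose top-k eigenvalues reach `target` of the total variance."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(cumulative, target)) + 1   # first index reaching target
    return min(k, len(vals))                           # guard against rounding at 100%

# Made-up spectrum: cumulative shares are 0.50, 0.80, 0.95, 0.98, 0.995, 1.00
spectrum = [5.0, 3.0, 1.5, 0.3, 0.15, 0.05]
print(choose_k(spectrum, target=0.90))   # -> 3
print(choose_k(spectrum, target=0.99))   # -> 5
```

Plotting `vals` against the component index gives the scree plot itself, for those who prefer to judge the elbow by eye.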

Common misconception: "Keeping more components is always safer." Keeping all components reconstructs perfectly but achieves no compression and no noise reduction — you've done nothing. The point is to drop the low-variance components, which are usually noise. Keeping them back in re-introduces the noise you wanted gone. The skill is keeping enough to preserve the signal (high variance) while dropping enough to shed the noise (low variance) — the scree plot's elbow is where that balance lives.
Quiz: PCA's two equivalent views are "maximize variance kept" and what?

Chapter 7: PCA in the Wild

PCA isn't just a theoretical tidy-up — it's one of the most-used tools in data science, and the same eigenvector trick powers all of these applications.

Visualization

You can't plot 50-dimensional data, but you can reduce it to its top 2 or 3 components and plot those. Suddenly you can see your data — which cars are similar, which customers cluster together, whether there are distinct groups at all. PCA is the standard first step for laying eyes on high-dimensional data, often paired with clustering (run k-means on the 2D projection).

Compression

Representing each point with k numbers instead of d is literal compression. A famous example: eigenfaces. Each face image of 100×100 pixels is a 10,000-dimensional vector. PCA on a collection of faces finds that just a few dozen "eigenface" directions — combinations of pixels capturing how faces actually vary — reconstruct any face remarkably well. Store ~50 numbers instead of 10,000, and you can still recognize the person. The kept components capture the meaningful variation between faces; the dropped ones were lighting quirks and noise.
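
A back-of-the-envelope count shows where the savings come from. The collection size of 5,000 images is an assumption for illustration; the image size and k = 50 follow the example above. The compressed total includes the shared costs a real system would also pay: the k component vectors and the mean face.

```python
def storage_counts(n_images, d, k):
    """Numbers stored raw vs. with a k-component PCA code (basis and mean included)."""
    raw = n_images * d
    compressed = n_images * k + k * d + d   # per-image codes + k components + mean
    return raw, compressed

raw, compressed = storage_counts(n_images=5000, d=100 * 100, k=50)
print(raw, compressed, round(raw / compressed, 1))   # -> 50000000 760000 65.8
```

The per-image saving (50 numbers versus 10,000) is only realized across a collection, because the components themselves are stored once and shared by every face.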

Noise reduction & preprocessing

Because noise tends to live in the low-variance directions, dropping those directions cleans the data — reconstructing from the top components alone filters out the noise (this is the RC-pilot "karma" intuition: PCA recovers the true signal from noisy measurements). And as a preprocessing step before supervised learning, reducing dimensions speeds up training and — recalling the bias-variance lesson — shrinks the hypothesis space, helping prevent overfitting on high-dimensional inputs.
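
Here is a small synthetic demonstration of the denoising effect. It assumes a setup chosen to suit PCA: a clean signal confined to a made-up 2D subspace of a 20-dimensional space, with equal-strength noise added in every direction. Real data is rarely this tidy.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, k = 400, 20, 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]          # made-up 2D signal subspace
clean = (rng.normal(size=(n, k)) * np.array([4.0, 3.0])) @ basis.T
noisy = clean + rng.normal(0, 0.5, size=(n, d))           # noise in every direction

mu = noisy.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(noisy - mu, rowvar=False))
U = vecs[:, ::-1][:, :k]                                  # top-k components
denoised = (noisy - mu) @ U @ U.T + mu                    # project, then rebuild

# Mean squared distance to the clean signal, before and after.
err_noisy = np.mean(np.sum((noisy - clean) ** 2, axis=1))
err_denoised = np.mean(np.sum((denoised - clean) ** 2, axis=1))
print(f"error before: {err_noisy:.2f}, after: {err_denoised:.2f}")
```

Projecting onto the top components keeps the noise that falls inside the signal subspace but removes the noise in the other 18 directions, which is most of it.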

One algorithm, many faces. Compression, visualization, noise removal, speed-up, overfitting control — all from the single act of projecting onto the top eigenvectors of the covariance. That's the hallmark of a foundational technique: a simple core idea (keep the high-variance directions) that pays off in a dozen different ways. When you reach for "let me reduce the dimensionality first," PCA is almost always what you reach for.
Common misconception: "PCA always helps before supervised learning." Not always — PCA keeps the directions of greatest variance, but variance isn't always relevance. It's possible (though uncommon) that the signal predicting your label lives in a low-variance direction that PCA discards, hurting accuracy. PCA is unsupervised — it doesn't know your labels — so it optimizes for variance, not predictiveness. Usually a great default; occasionally it throws away exactly what you needed. Validate, don't assume.
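
A deliberately adversarial toy case makes the warning concrete. The data is made up so that the label depends only on a low-variance feature, with correlation against the label used as a rough proxy for predictiveness.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 2000
loud = rng.normal(0, 5.0, n)     # high variance, unrelated to the label
quiet = rng.normal(0, 0.5, n)    # low variance, decides the label
X = np.column_stack([loud, quiet])
y = (quiet > 0).astype(float)    # made-up binary label

Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ vecs[:, ::-1]      # column 0 = PC1 scores, column 1 = PC2 scores

corr_pc1 = abs(np.corrcoef(scores[:, 0], y)[0, 1])   # what keeping k=1 retains
corr_pc2 = abs(np.corrcoef(scores[:, 1], y)[0, 1])   # what keeping k=1 discards
print(f"|corr(PC1, label)| = {corr_pc1:.2f}, |corr(PC2, label)| = {corr_pc2:.2f}")
```

Keeping only PC1 here would retain the loud, irrelevant direction and drop the one that carries the label.
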
Quiz: Why does PCA act as a noise-reduction technique?

Chapter 8: Connections & Cheat Sheet

You've turned "too many redundant numbers" into a precise, one-eigenvector-calculation method for finding the few directions that matter. PCA is the workhorse of dimensionality reduction, and you now understand it from the geometry up.

The whole lesson on one page

Concept | What it means
Goal | Find a few directions that capture most of the data's variation; reduce dimensions, keep information.
Center (mandatory) | Subtract the mean so PCA sees variation, not position. Scale features too if units differ.
Principal component | The direction of maximum projected variance; the next is the max variance perpendicular to it; etc.
Variance in direction u | uᵀΣu, where Σ is the covariance matrix (the data's shape).
The punchline | Principal components = eigenvectors of Σ, ordered by eigenvalue (= variance captured). Automatically orthogonal.
Reduce | Project onto the top k components: x (d-dim) → y (k-dim).
Reconstruct | Rebuild from k components; error = discarded variance. Max-variance-kept = min-reconstruction-error.
Choose k | Scree plot / variance explained; keep enough for ~95–99% of variance.
Uses | Visualization, compression (eigenfaces), noise reduction, preprocessing.

PCA in code

python
import numpy as np
def pca(X, k):
    """PCA on X (rows = samples, columns = features), keeping the top k components."""
    mu = X.mean(axis=0)                     # per-feature mean, reused when reconstructing
    Xc = X - mu                             # 1. CENTER (mandatory!)
    Sigma = np.cov(Xc, rowvar=False)        # 2. covariance matrix
    vals, vecs = np.linalg.eigh(Sigma)      # 3. eigenvectors + eigenvalues (ascending)
    order = np.argsort(vals)[::-1]          # 4. sort by variance, largest first
    U = vecs[:, order[:k]]                  # top-k principal components (columns)
    Y = Xc @ U                              # 5. REDUCE: project to k dims
    X_recon = Y @ U.T + mu                  # reconstruct (approx)
    var_explained = vals[order[:k]].sum() / vals.sum()
    return Y, U, var_explained, X_recon

# sklearn (uses the numerically-stable SVD under the hood):
from sklearn.decomposition import PCA
p = PCA(n_components=2).fit(X)
print(p.explained_variance_ratio_)      # variance each component captures

Where to go next

You can now teach this. Real data often lives near a low-dimensional subspace because features are redundant. PCA finds it: center the data, build the covariance matrix, and take its top eigenvectors — these are the principal components, the orthogonal directions of maximum variance, ordered by how much variance (eigenvalue) they capture. Project onto the top k to compress; reconstruct to recover most of the original; the variance you keep is the information you keep. One eigenvector calculation, a dozen applications. The foundation of dimensionality reduction.

"The whole is simpler than the sum of its parts." — Willard Gibbs. PCA finds that simpler whole — the few essential directions hiding inside a thicket of correlated measurements.