Classical ML · CS229

k-Means Clustering

Finding hidden groups in data when nobody tells you the groups exist. The simplest, most-used unsupervised algorithm — a two-step dance of assign and average that always settles, but not always where you'd hope.

Prerequisites: Basic geometry (distance between points) + the idea of an average. That's genuinely it.
9
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Finding Groups Nobody Labeled

Every algorithm in this course so far has been supervised — you had inputs and the right answers, and learned to map one to the other. Predict the price (you knew past prices). Classify the email (you had labeled spam). But a huge amount of real data comes with no labels at all. You have a pile of customers and their purchase histories, but nobody told you which "type" each one is. You have a million photos, but no categories. You just have... data, and a hunch that it has structure.

This is unsupervised learning: finding patterns in data that has no answer key. And the most fundamental unsupervised task is clustering — automatically grouping data points so that similar ones land together and dissimilar ones stay apart. Discover that your customers fall into three natural segments. Find that your photos cluster into beaches, faces, and food. Nobody defined those groups; the algorithm has to find them in the geometry of the data itself.

The data has groups — but which point belongs to which?

Here's a cloud of unlabeled points. Your eye instantly sees a few natural blobs — but to the computer, every point is just an identical gray dot with coordinates. There's no column saying "group A" or "group B." The challenge: write an algorithm that recovers the grouping your eye sees, using nothing but the positions. Press Reveal to see the groups a clustering algorithm would aim to find.

Your eye sees the structure instantly. Can an algorithm?

The trouble is that "similar" and "group" are slippery when no one defines them. How does a machine decide where one cluster ends and another begins, with no labels to learn from? The breakthrough idea behind k-means — the algorithm we'll build — is disarmingly simple: a cluster is defined by its center. Pick a few center points, assign every data point to its nearest center, then move each center to the middle of the points that chose it. Repeat. That's the whole idea, and from it emerges one of the most widely-used algorithms on Earth.

The plan. Ch.1: the two-step dance (assign to nearest center, move center to the average) that is k-means. Ch.2: the distortion function it secretly minimizes. Ch.3: why it always converges (it's coordinate descent). Ch.4: a full clustering lab. Ch.5: the catch — local minima, and how restarts escape them. Ch.6: choosing how many clusters (the elbow). Ch.7: where k-means fails (and what comes next). From "data with no answer key" to a working group-finder you understand completely.
Common misconception: "Clustering finds the 'correct' groups in the data." There often is no single correct answer — clustering is exploratory. The same customers could sensibly be grouped by spending, by age, or by location, giving different clusters, all valid. k-means finds a grouping that's geometrically cohesive given your choice of how many clusters and what counts as "close"; it doesn't reveal a hidden truth. Clustering is a lens, not an oracle.
What makes clustering an unsupervised learning problem?

Chapter 1: The Two-Step Dance

Here is the entire k-means algorithm. It's so simple you could run it by hand on paper. First, decide how many clusters you want — call it k — and drop k centroids (cluster centers) onto the data at random positions. Then repeat two steps until nothing changes:

Step A — Assign
For every data point, find the nearest centroid (by straight-line distance) and assign the point to that centroid's cluster. "Paint" each point the color of its closest center.
Step B — Update
For every centroid, move it to the average position (the mean) of all the points currently assigned to it. The center relocates to the middle of its flock.
↻ repeat until assignments stop changing

That's it. Assign points to centers; move centers to the middle of their points; repeat. The two steps chase each other — moving the centers changes which points are nearest, which changes where the centers should be — until they reach a happy agreement and stop. Watch the dance unfold:

Step through the assign / update loop

The crosses are centroids, the dots are data. Press Assign to paint each point the color of its nearest centroid (you'll see the boundaries snap into place). Press Update to slide each centroid to the mean of its points. Alternate them and watch the centroids march into the heart of each cluster, the coloring sharpening each round.

phase: ready

A worked step, by hand

Let's do one update by hand to demystify "move to the mean." Suppose a centroid currently owns three points at positions (2, 1), (4, 3), and (3, 2). The new centroid position is just the average of each coordinate:

# new centroid = mean of assigned points
x-coordinate: (2 + 4 + 3) / 3 = 9 / 3 = 3.0
y-coordinate: (1 + 3 + 2) / 3 = 6 / 3 = 2.0
# the centroid moves to (3.0, 2.0) — the center of mass of its flock

No calculus, no gradients — just averaging. The centroid literally becomes the center of gravity of the points that chose it. And the assign step is even simpler: for each point, compute its distance to each centroid and pick the smallest. Two operations a child could do, looped until they agree.

Common misconception: "The centroids are real data points." Usually not — a centroid is the average of its assigned points, which is typically a brand-new position that no actual data point sits on (the average of (0,0) and (2,2) is (1,1), even if nothing is at (1,1)). The centroid is an invented "prototype" for the cluster, a synthetic representative. (A cousin algorithm, k-medoids, does force centers to be real points — useful when an average doesn't make sense.)
In the "update" step of k-means, where does each centroid move?

Chapter 2: Distortion — The Number k-Means Minimizes

The two-step dance feels like a heuristic — a clever trick. But underneath, k-means is doing something principled: it's minimizing a specific cost, exactly like every other algorithm in this course. Naming that cost reveals what "good clustering" actually means to k-means.

The cost is called the distortion, and it measures how tightly the points hug their assigned centroids. For each point, take its distance to its own centroid, square it, and add all of these up across every point:

J = ∑i ∥ x(i) − μc(i)2

In words: the total squared distance from every point to the center of its cluster. (Here x(i) is point i, and μc(i) is the centroid that point i was assigned to.) Small distortion means every point is close to its center — tight, cohesive clusters. Large distortion means points are scattered far from their centers — a poor grouping. So k-means' goal, stated honestly, is: place the centroids and assign the points so that the total squared distance is as small as possible.

Why squared distance, again? The same reasons as least squares back in linear regression: squaring makes all distances positive (no cancellation) and is smooth and mathematically convenient — and critically, it's what makes the "move to the mean" step optimal. The mean is precisely the point that minimizes the sum of squared distances to a set of points. So "move each centroid to the mean of its flock" isn't a guess; it's the exact best move for shrinking the distortion, given the current assignments. The two steps and the cost fit together perfectly.

Watch the distortion as you cluster

Each thin line connects a point to its assigned centroid — the distortion J is the total of these squared lengths. Step through assign and update, and watch the lines shorten and J drop. A good clustering is one where these connector lines are as short as possible — points snug against their centers.

distortion J =
Common misconception: "Lower distortion always means a better clustering." Lower distortion means tighter clusters for a given k, but it's not a universal quality score. Crucially, adding more clusters always lowers distortion (in the extreme, one centroid per point gives zero distortion and zero insight). So you can't pick the number of clusters by minimizing distortion — it would always say "more clusters." Distortion compares clusterings at the same k; choosing k itself needs a different idea (Chapter 6).
What does the distortion function J that k-means minimizes actually measure?

Chapter 3: Why It Always Converges

A natural worry: this assign-update loop could bounce around forever, never settling. It doesn't — k-means is guaranteed to converge, and the reason is elegant. Both steps only ever decrease the distortion J, never increase it. Since J can't go below zero, a quantity that only decreases must eventually stop. Let's see why each step shrinks J.

This is an instance of a powerful technique called coordinate descent: when a cost depends on two sets of variables, you minimize over one set while holding the other fixed, then swap, and repeat. Here the two "coordinates" are the assignments (which point goes to which cluster) and the centroids (where the centers sit):

Two steps, each provably reducing J, alternating forever. J marches monotonically downhill and converges. That's the entire convergence proof, and it's why k-means is so reliable in practice — it can't diverge or blow up the way gradient descent can with a bad learning rate. There's no learning rate at all; each step takes the exact optimal move for its half of the problem.

Distortion only goes down

Run k-means and watch the distortion curve. Every assign step and every update step steps J down — never up. The curve is a monotone staircase descending to a flat plateau, where assignments stop changing and the algorithm has converged. Press Run to watch it settle.

iterations: 0 · J =
Common misconception: "Converging means k-means found the best possible clustering." Converging only means J stopped decreasing — it reached a bottom, not necessarily the bottom. Because the distortion is non-convex (it has many local minima), k-means can converge to a clustering that's locally optimal but globally mediocre. Guaranteed convergence and guaranteed good are different promises; k-means makes only the first. That gap is the entire subject of the next chapter but one.
Why is k-means guaranteed to converge?

Chapter 4: The Clustering Lab

Everything in one playground. Below is a full k-means engine: choose how many clusters, scatter the data, and watch the algorithm find the groups in real time. Step through it slowly to see each assign and update, or press play and watch the centroids glide into place. This is where the algorithm stops being a description and becomes something you can feel.

Things to try:
  • Run on the default blobs and watch the centroids snap to the cluster centers, the coloring crystallizing.
  • Change k to more or fewer than the natural number of groups — see what k-means does when you ask for the "wrong" number (it splits or merges blobs to fit your k).
  • Click to add points or drag them, then re-run — watch the clustering reshape around your edits.
  • Reset the init a few times — notice you sometimes get a different final clustering from a different random start. (That's the cliffhanger for the next chapter.)
k-Means Lab

Click empty space to add a point. Set k, press Run, and watch the assign/update loop converge. The crosses are centroids; points are colored by their current cluster.

clusters k3
iter 0 · J ·

(No quiz — the lab is the test. If you can predict, before pressing Run, roughly where the centroids will end up, you understand k-means. And if you've noticed that different starts sometimes give different answers, you've discovered the algorithm's one real weakness on your own.)

Chapter 5: Local Minima — The Catch, and the Cure

Here's the uncomfortable truth you may have already glimpsed: k-means doesn't always find the best clustering. It always converges — but where it converges depends on where the centroids started. A bad random initialization can trap the algorithm in a poor clustering that it can't escape, because every local move only makes things worse (even though a totally different arrangement would be far better).

This happens because the distortion function is non-convex — it's a bumpy landscape with many valleys, not a single smooth bowl. Coordinate descent rolls into the nearest valley and stops, and the nearest valley depends entirely on the starting point. Land in a good valley and you get a great clustering; land in a bad one and you get nonsense like one centroid hogging two real clusters while another splits a single cluster in half.

Same data, different starts, different answers

The data has three clear blobs. Press Random restart repeatedly: each run starts the centroids in new random spots and converges — but sometimes to the right answer (one centroid per blob, low J) and sometimes to a bad one (centroids splitting or merging blobs, high J). The best J so far tracker shows why the cure is simply: try many times, keep the best.

this run J = · best J so far =

The fix is beautifully pragmatic: run k-means many times from different random starts, and keep the clustering with the lowest distortion. Each run is cheap, and with a handful of restarts you're very likely to hit a good valley at least once. This "best of many restarts" is standard practice — it's why real implementations default to 10 runs. A smarter trick, k-means++, chooses the initial centroids to be spread out (each new one placed far from existing ones), which dramatically reduces the chance of a bad start. Most libraries use k-means++ by default, so you rarely see the worst failures — but they're lurking, and knowing they exist is what keeps you from trusting a single run blindly.

Common misconception: "If k-means gave me a result, that's the clustering of my data." It's one clustering, from one random start — possibly a bad local minimum. Always run multiple restarts (or trust that your library did) and compare distortions. A single run that happened to start badly can produce a confidently wrong grouping that looks authoritative. The discipline of restarts is what separates a reliable result from a lucky (or unlucky) one.
k-means always converges, yet can still produce a poor clustering. Why, and what's the standard fix?

Chapter 6: Choosing k — The Elbow

So far we've assumed you know how many clusters to look for. But k is something you must choose, and it's often the hardest part — the data doesn't come with a sticky note saying "there are 3 groups." How do you pick k honestly?

The tempting idea — "try every k and pick the one with lowest distortion" — fails, and we already know why from Chapter 2: distortion always decreases as k increases. More centroids means every point can be closer to some center, so J keeps dropping, all the way to zero when k equals the number of points (each point is its own cluster). Minimizing distortion would always tell you to use as many clusters as you have points — useless.

The trick is to look not at the distortion itself, but at how fast it drops. Plot distortion against k. At first, adding clusters helps a lot — going from 1 to 2 to 3 clusters on three-blob data slices the distortion dramatically, because you're separating genuinely distinct groups. But once you have enough clusters to cover the real groups, adding more only helps a little — you're now just subdividing already-tight blobs. That transition, where the steep drop suddenly flattens into a gentle slope, makes a visible bend in the curve: the elbow. The k at the elbow is usually the right number of clusters.

The elbow method

For data with a true number of groups, this plots the converged distortion at each k. Watch for the elbow — the k where the curve stops plunging and starts crawling. Before the elbow, each extra cluster buys a big drop (separating real groups); after it, extra clusters barely help (just splitting tight blobs). Slide the true number of groups in the data and watch the elbow move to match.

true groups in data3
elbow detected at k =

The elbow is a heuristic, not a theorem — sometimes the bend is fuzzy, and judgment is required. Other methods exist (the silhouette score, which measures how well-separated the clusters are, or the gap statistic), and often the right k comes from domain knowledge rather than any plot ("we want exactly 5 customer tiers because marketing has 5 campaigns"). But the elbow is the first thing to reach for, and it captures the key intuition: the best k is where adding more clusters stops paying off.

Common misconception: "There's a mathematically correct value of k that the data determines." Usually not — k is a modeling choice, and reasonable people (and methods) can disagree. The elbow may be ambiguous; the silhouette may suggest a different k; your application may demand a specific number regardless. Clustering is exploratory, so k is a dial you set based on the plot, the method, and what you'll do with the clusters — not a hidden constant waiting to be discovered.
Why can't you choose the number of clusters k by simply picking the k with the lowest distortion?

Chapter 7: When It Breaks — The Spherical Assumption

k-means is fast, simple, and reliable — but it has a strong hidden assumption baked into that "nearest centroid" rule, and when the assumption is violated, k-means fails in characteristic ways. Knowing them tells you when to reach for something else.

Because a point is assigned to whichever centroid is closest by straight-line distance, k-means implicitly carves the space into regions that are spherical and equal-sized around each centroid. This means k-means quietly assumes your clusters are roughly round blobs of similar size and density. When they're not, it stumbles:

k-means on the wrong shapes

Toggle between round blobs (where k-means shines) and awkward shapes — two elongated clusters and two concentric rings. Run k-means and watch it succeed on the blobs but carve the elongated and ring data into nonsensical straight-edged pieces. Its "nearest center" rule simply can't represent these shapes.

The fix points straight to the next lesson. k-means uses hard assignments (each point belongs fully to one cluster) and assumes round, equal clusters. The Gaussian Mixture Model — the very next topic — relaxes both: it gives each point a soft, probabilistic membership across clusters, and it models each cluster as a full Gaussian with its own shape, size, and orientation. GMM handles elongated and unequal clusters that defeat k-means. In fact, k-means turns out to be a special, simplified case of the GMM. The next lesson is k-means, leveled up.
Common misconception: "k-means failing on weird shapes means it's a bad algorithm." It means it's the wrong tool for those shapes — not a bad tool. For the enormous number of real problems where clusters are roughly round and comparable in size (and they often are, especially in high dimensions), k-means is fast, robust, and excellent. Knowing its assumption lets you recognize the cases where it'll mislead you, and reach for GMM, spectral clustering, or DBSCAN instead. Right tool, right shape.
What shape of clusters does k-means implicitly assume, and when does it fail?

Chapter 8: Connections & Cheat Sheet

You've gone from "data with no answer key" to a complete understanding of the world's most-used clustering algorithm — what it does, why it converges, where it gets stuck, and when it's the wrong tool. That's real unsupervised-learning fluency.

The whole lesson on one page

ConceptWhat it means
ClusteringUnsupervised: group unlabeled points by similarity, using only their positions.
CentroidA cluster's center — the mean of its assigned points (usually not a real data point).
Assign stepEach point joins its nearest centroid's cluster.
Update stepEach centroid moves to the mean of its assigned points.
Distortion JTotal squared distance from points to their centroids — what k-means minimizes (at fixed k).
ConvergenceBoth steps only decrease J (coordinate descent) → guaranteed to converge.
Local minimaJ is non-convex → bad initialization can trap a poor clustering. Fix: many restarts, keep lowest J; or k-means++ init.
Choosing kDistortion always falls with k → use the elbow (where the drop flattens), silhouette, or domain knowledge.
LimitationAssumes spherical, equal-sized clusters; fails on elongated/unequal/non-convex shapes.

k-means from scratch

python
import numpy as np
def kmeans(X, k, iters=100):
    # init: k random data points as starting centroids
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # ASSIGN: each point → index of nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :])**2).sum(2)
        labels = dists.argmin(1)
        # UPDATE: each centroid → mean of its assigned points
        new = np.array([X[labels==j].mean(0) for j in range(k)])
        if np.allclose(new, centroids): break   # converged
        centroids = new
    return labels, centroids

# sklearn (uses k-means++ init and 10 restarts by default):
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_, km.inertia_)   # inertia_ = the distortion J

Where to go next

You can now teach this. Clustering finds groups in unlabeled data. k-means defines clusters by centers and alternates two steps: assign each point to its nearest center, move each center to the mean of its points. This is coordinate descent on the distortion (total squared distance), so it always converges — but to a local minimum, so you run many restarts and keep the best. Choose k with the elbow. And remember its blind spot: it assumes round, equal blobs, which is why GMM comes next. The foundation of unsupervised learning, fully in hand.

"The goal is to turn data into information, and information into insight." — Carly Fiorina. k-means turns a featureless cloud of points into named groups — the first, indispensable step from raw data toward understanding.