Finding hidden groups in data when nobody tells you the groups exist. The simplest, most-used unsupervised algorithm — a two-step dance of assign and average that always settles, but not always where you'd hope.
Every algorithm in this course so far has been supervised — you had inputs and the right answers, and learned to map one to the other. Predict the price (you knew past prices). Classify the email (you had labeled spam). But a huge amount of real data comes with no labels at all. You have a pile of customers and their purchase histories, but nobody told you which "type" each one is. You have a million photos, but no categories. You just have... data, and a hunch that it has structure.
This is unsupervised learning: finding patterns in data that has no answer key. And the most fundamental unsupervised task is clustering — automatically grouping data points so that similar ones land together and dissimilar ones stay apart. Discover that your customers fall into three natural segments. Find that your photos cluster into beaches, faces, and food. Nobody defined those groups; the algorithm has to find them in the geometry of the data itself.
Here's a cloud of unlabeled points. Your eye instantly sees a few natural blobs — but to the computer, every point is just an identical gray dot with coordinates. There's no column saying "group A" or "group B." The challenge: write an algorithm that recovers the grouping your eye sees, using nothing but the positions. Press Reveal to see the groups a clustering algorithm would aim to find.
The trouble is that "similar" and "group" are slippery when no one defines them. How does a machine decide where one cluster ends and another begins, with no labels to learn from? The breakthrough idea behind k-means — the algorithm we'll build — is disarmingly simple: a cluster is defined by its center. Pick a few center points, assign every data point to its nearest center, then move each center to the middle of the points that chose it. Repeat. That's the whole idea, and from it emerges one of the most widely-used algorithms on Earth.
Here is the entire k-means algorithm. It's so simple you could run it by hand on paper. First, decide how many clusters you want — call it k — and drop k centroids (cluster centers) onto the data at random positions. Then repeat two steps until nothing changes:
That's it. Assign points to centers; move centers to the middle of their points; repeat. The two steps chase each other — moving the centers changes which points are nearest, which changes where the centers should be — until they reach a happy agreement and stop. Watch the dance unfold:
The crosses are centroids, the dots are data. Press Assign to paint each point the color of its nearest centroid (you'll see the boundaries snap into place). Press Update to slide each centroid to the mean of its points. Alternate them and watch the centroids march into the heart of each cluster, the coloring sharpening each round.
Let's do one update by hand to demystify "move to the mean." Suppose a centroid currently owns three points at positions (2, 1), (4, 3), and (3, 2). The new centroid position is just the average of each coordinate:
# new centroid = mean of assigned points x-coordinate: (2 + 4 + 3) / 3 = 9 / 3 = 3.0 y-coordinate: (1 + 3 + 2) / 3 = 6 / 3 = 2.0 # the centroid moves to (3.0, 2.0) — the center of mass of its flock
No calculus, no gradients — just averaging. The centroid literally becomes the center of gravity of the points that chose it. And the assign step is even simpler: for each point, compute its distance to each centroid and pick the smallest. Two operations a child could do, looped until they agree.
The two-step dance feels like a heuristic — a clever trick. But underneath, k-means is doing something principled: it's minimizing a specific cost, exactly like every other algorithm in this course. Naming that cost reveals what "good clustering" actually means to k-means.
The cost is called the distortion, and it measures how tightly the points hug their assigned centroids. For each point, take its distance to its own centroid, square it, and add all of these up across every point:
In words: the total squared distance from every point to the center of its cluster. (Here x(i) is point i, and μc(i) is the centroid that point i was assigned to.) Small distortion means every point is close to its center — tight, cohesive clusters. Large distortion means points are scattered far from their centers — a poor grouping. So k-means' goal, stated honestly, is: place the centroids and assign the points so that the total squared distance is as small as possible.
Each thin line connects a point to its assigned centroid — the distortion J is the total of these squared lengths. Step through assign and update, and watch the lines shorten and J drop. A good clustering is one where these connector lines are as short as possible — points snug against their centers.
A natural worry: this assign-update loop could bounce around forever, never settling. It doesn't — k-means is guaranteed to converge, and the reason is elegant. Both steps only ever decrease the distortion J, never increase it. Since J can't go below zero, a quantity that only decreases must eventually stop. Let's see why each step shrinks J.
This is an instance of a powerful technique called coordinate descent: when a cost depends on two sets of variables, you minimize over one set while holding the other fixed, then swap, and repeat. Here the two "coordinates" are the assignments (which point goes to which cluster) and the centroids (where the centers sit):
Two steps, each provably reducing J, alternating forever. J marches monotonically downhill and converges. That's the entire convergence proof, and it's why k-means is so reliable in practice — it can't diverge or blow up the way gradient descent can with a bad learning rate. There's no learning rate at all; each step takes the exact optimal move for its half of the problem.
Run k-means and watch the distortion curve. Every assign step and every update step steps J down — never up. The curve is a monotone staircase descending to a flat plateau, where assignments stop changing and the algorithm has converged. Press Run to watch it settle.
Everything in one playground. Below is a full k-means engine: choose how many clusters, scatter the data, and watch the algorithm find the groups in real time. Step through it slowly to see each assign and update, or press play and watch the centroids glide into place. This is where the algorithm stops being a description and becomes something you can feel.
Click empty space to add a point. Set k, press Run, and watch the assign/update loop converge. The crosses are centroids; points are colored by their current cluster.
(No quiz — the lab is the test. If you can predict, before pressing Run, roughly where the centroids will end up, you understand k-means. And if you've noticed that different starts sometimes give different answers, you've discovered the algorithm's one real weakness on your own.)
Here's the uncomfortable truth you may have already glimpsed: k-means doesn't always find the best clustering. It always converges — but where it converges depends on where the centroids started. A bad random initialization can trap the algorithm in a poor clustering that it can't escape, because every local move only makes things worse (even though a totally different arrangement would be far better).
This happens because the distortion function is non-convex — it's a bumpy landscape with many valleys, not a single smooth bowl. Coordinate descent rolls into the nearest valley and stops, and the nearest valley depends entirely on the starting point. Land in a good valley and you get a great clustering; land in a bad one and you get nonsense like one centroid hogging two real clusters while another splits a single cluster in half.
The data has three clear blobs. Press Random restart repeatedly: each run starts the centroids in new random spots and converges — but sometimes to the right answer (one centroid per blob, low J) and sometimes to a bad one (centroids splitting or merging blobs, high J). The best J so far tracker shows why the cure is simply: try many times, keep the best.
The fix is beautifully pragmatic: run k-means many times from different random starts, and keep the clustering with the lowest distortion. Each run is cheap, and with a handful of restarts you're very likely to hit a good valley at least once. This "best of many restarts" is standard practice — it's why real implementations default to 10 runs. A smarter trick, k-means++, chooses the initial centroids to be spread out (each new one placed far from existing ones), which dramatically reduces the chance of a bad start. Most libraries use k-means++ by default, so you rarely see the worst failures — but they're lurking, and knowing they exist is what keeps you from trusting a single run blindly.
So far we've assumed you know how many clusters to look for. But k is something you must choose, and it's often the hardest part — the data doesn't come with a sticky note saying "there are 3 groups." How do you pick k honestly?
The tempting idea — "try every k and pick the one with lowest distortion" — fails, and we already know why from Chapter 2: distortion always decreases as k increases. More centroids means every point can be closer to some center, so J keeps dropping, all the way to zero when k equals the number of points (each point is its own cluster). Minimizing distortion would always tell you to use as many clusters as you have points — useless.
The trick is to look not at the distortion itself, but at how fast it drops. Plot distortion against k. At first, adding clusters helps a lot — going from 1 to 2 to 3 clusters on three-blob data slices the distortion dramatically, because you're separating genuinely distinct groups. But once you have enough clusters to cover the real groups, adding more only helps a little — you're now just subdividing already-tight blobs. That transition, where the steep drop suddenly flattens into a gentle slope, makes a visible bend in the curve: the elbow. The k at the elbow is usually the right number of clusters.
For data with a true number of groups, this plots the converged distortion at each k. Watch for the elbow — the k where the curve stops plunging and starts crawling. Before the elbow, each extra cluster buys a big drop (separating real groups); after it, extra clusters barely help (just splitting tight blobs). Slide the true number of groups in the data and watch the elbow move to match.
The elbow is a heuristic, not a theorem — sometimes the bend is fuzzy, and judgment is required. Other methods exist (the silhouette score, which measures how well-separated the clusters are, or the gap statistic), and often the right k comes from domain knowledge rather than any plot ("we want exactly 5 customer tiers because marketing has 5 campaigns"). But the elbow is the first thing to reach for, and it captures the key intuition: the best k is where adding more clusters stops paying off.
k-means is fast, simple, and reliable — but it has a strong hidden assumption baked into that "nearest centroid" rule, and when the assumption is violated, k-means fails in characteristic ways. Knowing them tells you when to reach for something else.
Because a point is assigned to whichever centroid is closest by straight-line distance, k-means implicitly carves the space into regions that are spherical and equal-sized around each centroid. This means k-means quietly assumes your clusters are roughly round blobs of similar size and density. When they're not, it stumbles:
Toggle between round blobs (where k-means shines) and awkward shapes — two elongated clusters and two concentric rings. Run k-means and watch it succeed on the blobs but carve the elongated and ring data into nonsensical straight-edged pieces. Its "nearest center" rule simply can't represent these shapes.
You've gone from "data with no answer key" to a complete understanding of the world's most-used clustering algorithm — what it does, why it converges, where it gets stuck, and when it's the wrong tool. That's real unsupervised-learning fluency.
| Concept | What it means |
|---|---|
| Clustering | Unsupervised: group unlabeled points by similarity, using only their positions. |
| Centroid | A cluster's center — the mean of its assigned points (usually not a real data point). |
| Assign step | Each point joins its nearest centroid's cluster. |
| Update step | Each centroid moves to the mean of its assigned points. |
| Distortion J | Total squared distance from points to their centroids — what k-means minimizes (at fixed k). |
| Convergence | Both steps only decrease J (coordinate descent) → guaranteed to converge. |
| Local minima | J is non-convex → bad initialization can trap a poor clustering. Fix: many restarts, keep lowest J; or k-means++ init. |
| Choosing k | Distortion always falls with k → use the elbow (where the drop flattens), silhouette, or domain knowledge. |
| Limitation | Assumes spherical, equal-sized clusters; fails on elongated/unequal/non-convex shapes. |
python import numpy as np def kmeans(X, k, iters=100): # init: k random data points as starting centroids centroids = X[np.random.choice(len(X), k, replace=False)] for _ in range(iters): # ASSIGN: each point → index of nearest centroid dists = ((X[:, None, :] - centroids[None, :, :])**2).sum(2) labels = dists.argmin(1) # UPDATE: each centroid → mean of its assigned points new = np.array([X[labels==j].mean(0) for j in range(k)]) if np.allclose(new, centroids): break # converged centroids = new return labels, centroids # sklearn (uses k-means++ init and 10 restarts by default): from sklearn.cluster import KMeans km = KMeans(n_clusters=3, n_init=10).fit(X) print(km.labels_, km.inertia_) # inertia_ = the distortion J
"The goal is to turn data into information, and information into insight." — Carly Fiorina. k-means turns a featureless cloud of points into named groups — the first, indispensable step from raw data toward understanding.