k-Means Clustering — From Absolute Zero to Mastery

Chapter 0: Finding Groups Nobody Labeled

Every algorithm in this course so far has been supervised — you had inputs and the right answers, and learned to map one to the other. Predict the price (you knew past prices). Classify the email (you had labeled spam). But a huge amount of real data comes with no labels at all. You have a pile of customers and their purchase histories, but nobody told you which "type" each one is. You have a million photos, but no categories. You just have... data, and a hunch that it has structure.

This is unsupervised learning: finding patterns in data that has no answer key. And the most fundamental unsupervised task is clustering — automatically grouping data points so that similar ones land together and dissimilar ones stay apart. Discover that your customers fall into three natural segments. Find that your photos cluster into beaches, faces, and food. Nobody defined those groups; the algorithm has to find them in the geometry of the data itself.

The data has groups — but which point belongs to which?

Here's a cloud of unlabeled points. Your eye instantly sees a few natural blobs — but to the computer, every point is just an identical gray dot with coordinates. There's no column saying "group A" or "group B." The challenge: write an algorithm that recovers the grouping your eye sees, using nothing but the positions. Press Reveal to see the groups a clustering algorithm would aim to find.

Your eye sees the structure instantly. Can an algorithm?

The trouble is that "similar" and "group" are slippery when no one defines them. How does a machine decide where one cluster ends and another begins, with no labels to learn from? The breakthrough idea behind k-means — the algorithm we'll build — is disarmingly simple: a cluster is defined by its center. Pick a few center points, assign every data point to its nearest center, then move each center to the middle of the points that chose it. Repeat. That's the whole idea, and from it emerges one of the most widely-used algorithms on Earth.

The plan. Ch.1: the two-step dance (assign to nearest center, move center to the average) that is k-means. Ch.2: the distortion function it secretly minimizes. Ch.3: why it always converges (it's coordinate descent). Ch.4: a full clustering lab. Ch.5: the catch — local minima, and how restarts escape them. Ch.6: choosing how many clusters (the elbow). Ch.7: where k-means fails (and what comes next). From "data with no answer key" to a working group-finder you understand completely.

Common misconception: "Clustering finds the 'correct' groups in the data." There often is no single correct answer — clustering is exploratory. The same customers could sensibly be grouped by spending, by age, or by location, giving different clusters, all valid. k-means finds a grouping that's geometrically cohesive given your choice of how many clusters and what counts as "close"; it doesn't reveal a hidden truth. Clustering is a lens, not an oracle.

What makes clustering an unsupervised learning problem?

It uses a neural network. It requires more data than supervised learning. The data has no labels — the algorithm must find structure (groups) using only the data points themselves, with no answer key. It can only be done by humans.

Chapter 1: The Two-Step Dance

Here is the entire k-means algorithm. It's so simple you could run it by hand on paper. First, decide how many clusters you want — call it k — and drop k centroids (cluster centers) onto the data at random positions. Then repeat two steps until nothing changes:

Step A — Assign

For every data point, find the nearest centroid (by straight-line distance) and assign the point to that centroid's cluster. "Paint" each point the color of its closest center.

↓

Step B — Update

For every centroid, move it to the average position (the mean) of all the points currently assigned to it. The center relocates to the middle of its flock.

↻ repeat until assignments stop changing

That's it. Assign points to centers; move centers to the middle of their points; repeat. The two steps chase each other — moving the centers changes which points are nearest, which changes where the centers should be — until they reach a happy agreement and stop. Watch the dance unfold:

Step through the assign / update loop

The crosses are centroids, the dots are data. Press Assign to paint each point the color of its nearest centroid (you'll see the boundaries snap into place). Press Update to slide each centroid to the mean of its points. Alternate them and watch the centroids march into the heart of each cluster, the coloring sharpening each round.

phase: ready

A worked step, by hand

Let's do one update by hand to demystify "move to the mean." Suppose a centroid currently owns three points at positions (2, 1), (4, 3), and (3, 2). The new centroid position is just the average of each coordinate:

# new centroid = mean of assigned points
x-coordinate: (2 + 4 + 3) / 3 = 9 / 3 = 3.0
y-coordinate: (1 + 3 + 2) / 3 = 6 / 3 = 2.0
# the centroid moves to (3.0, 2.0) — the center of mass of its flock

No calculus, no gradients — just averaging. The centroid literally becomes the center of gravity of the points that chose it. And the assign step is even simpler: for each point, compute its distance to each centroid and pick the smallest. Two operations a child could do, looped until they agree.

Common misconception: "The centroids are real data points." Usually not — a centroid is the average of its assigned points, which is typically a brand-new position that no actual data point sits on (the average of (0,0) and (2,2) is (1,1), even if nothing is at (1,1)). The centroid is an invented "prototype" for the cluster, a synthetic representative. (A cousin algorithm, k-medoids, does force centers to be real points — useful when an average doesn't make sense.)

In the "update" step of k-means, where does each centroid move?

To a random new location. To the mean (average position) of all the data points currently assigned to it. To the nearest data point. It doesn't move; only the points move.

Chapter 2: Distortion — The Number k-Means Minimizes

The two-step dance feels like a heuristic — a clever trick. But underneath, k-means is doing something principled: it's minimizing a specific cost, exactly like every other algorithm in this course. Naming that cost reveals what "good clustering" actually means to k-means.

The cost is called the distortion, and it measures how tightly the points hug their assigned centroids. For each point, take its distance to its own centroid, square it, and add all of these up across every point:

J = ∑_i ∥ x⁽ⁱ⁾ − μ_c(i) ∥²

In words: the total squared distance from every point to the center of its cluster. (Here x⁽ⁱ⁾ is point i, and μ_c(i) is the centroid that point i was assigned to.) Small distortion means every point is close to its center — tight, cohesive clusters. Large distortion means points are scattered far from their centers — a poor grouping. So k-means' goal, stated honestly, is: place the centroids and assign the points so that the total squared distance is as small as possible.

Why squared distance, again? The same reasons as least squares back in linear regression: squaring makes all distances positive (no cancellation) and is smooth and mathematically convenient — and critically, it's what makes the "move to the mean" step optimal. The mean is precisely the point that minimizes the sum of squared distances to a set of points. So "move each centroid to the mean of its flock" isn't a guess; it's the exact best move for shrinking the distortion, given the current assignments. The two steps and the cost fit together perfectly.

Watch the distortion as you cluster

Each thin line connects a point to its assigned centroid — the distortion J is the total of these squared lengths. Step through assign and update, and watch the lines shorten and J drop. A good clustering is one where these connector lines are as short as possible — points snug against their centers.

distortion J = —

Common misconception: "Lower distortion always means a better clustering." Lower distortion means tighter clusters for a given k, but it's not a universal quality score. Crucially, adding more clusters always lowers distortion (in the extreme, one centroid per point gives zero distortion and zero insight). So you can't pick the number of clusters by minimizing distortion — it would always say "more clusters." Distortion compares clusterings at the same k; choosing k itself needs a different idea (Chapter 6).

What does the distortion function J that k-means minimizes actually measure?

The number of clusters. The total squared distance from every point to the centroid of its assigned cluster — small J means tight, cohesive clusters. The distance between centroids. The number of data points.

Chapter 3: Why It Always Converges

A natural worry: this assign-update loop could bounce around forever, never settling. It doesn't — k-means is guaranteed to converge, and the reason is elegant. Both steps only ever decrease the distortion J, never increase it. Since J can't go below zero, a quantity that only decreases must eventually stop. Let's see why each step shrinks J.

This is an instance of a powerful technique called coordinate descent: when a cost depends on two sets of variables, you minimize over one set while holding the other fixed, then swap, and repeat. Here the two "coordinates" are the assignments (which point goes to which cluster) and the centroids (where the centers sit):

The assign step minimizes J over assignments, holding centroids fixed. With the centroids frozen, the best thing each point can do to reduce its squared distance is go to its nearest centroid — which is exactly what the assign step does. So J can only drop (or stay equal).
The update step minimizes J over centroids, holding assignments fixed. With the assignments frozen, the position that minimizes the total squared distance to a group of points is their mean — which is exactly what the update step computes. So again J can only drop.

Two steps, each provably reducing J, alternating forever. J marches monotonically downhill and converges. That's the entire convergence proof, and it's why k-means is so reliable in practice — it can't diverge or blow up the way gradient descent can with a bad learning rate. There's no learning rate at all; each step takes the exact optimal move for its half of the problem.

Distortion only goes down

Run k-means and watch the distortion curve. Every assign step and every update step steps J down — never up. The curve is a monotone staircase descending to a flat plateau, where assignments stop changing and the algorithm has converged. Press Run to watch it settle.

iterations: 0 · J = —

Common misconception: "Converging means k-means found the best possible clustering." Converging only means J stopped decreasing — it reached a bottom, not necessarily the bottom. Because the distortion is non-convex (it has many local minima), k-means can converge to a clustering that's locally optimal but globally mediocre. Guaranteed convergence and guaranteed good are different promises; k-means makes only the first. That gap is the entire subject of the next chapter but one.

Why is k-means guaranteed to converge?

Because it uses a small learning rate. Because both steps (assign and update) only ever decrease the distortion J, and a quantity that only decreases and is bounded below must eventually stop — it's coordinate descent on J. Because it runs for a fixed number of steps. It isn't guaranteed to converge.

Chapter 4: The Clustering Lab

Everything in one playground. Below is a full k-means engine: choose how many clusters, scatter the data, and watch the algorithm find the groups in real time. Step through it slowly to see each assign and update, or press play and watch the centroids glide into place. This is where the algorithm stops being a description and becomes something you can feel.

Things to try:

Run on the default blobs and watch the centroids snap to the cluster centers, the coloring crystallizing.
Change k to more or fewer than the natural number of groups — see what k-means does when you ask for the "wrong" number (it splits or merges blobs to fit your k).
Click to add points or drag them, then re-run — watch the clustering reshape around your edits.
Reset the init a few times — notice you sometimes get a different final clustering from a different random start. (That's the cliffhanger for the next chapter.)

k-Means Lab

Click empty space to add a point. Set k, press Run, and watch the assign/update loop converge. The crosses are centroids; points are colored by their current cluster.

clusters k3

iter 0 · J — ·

(No quiz — the lab is the test. If you can predict, before pressing Run, roughly where the centroids will end up, you understand k-means. And if you've noticed that different starts sometimes give different answers, you've discovered the algorithm's one real weakness on your own.)

Chapter 5: Local Minima — The Catch, and the Cure

Here's the uncomfortable truth you may have already glimpsed: k-means doesn't always find the best clustering. It always converges — but where it converges depends on where the centroids started. A bad random initialization can trap the algorithm in a poor clustering that it can't escape, because every local move only makes things worse (even though a totally different arrangement would be far better).

This happens because the distortion function is non-convex — it's a bumpy landscape with many valleys, not a single smooth bowl. Coordinate descent rolls into the nearest valley and stops, and the nearest valley depends entirely on the starting point. Land in a good valley and you get a great clustering; land in a bad one and you get nonsense like one centroid hogging two real clusters while another splits a single cluster in half.

Same data, different starts, different answers

The data has three clear blobs. Press Random restart repeatedly: each run starts the centroids in new random spots and converges — but sometimes to the right answer (one centroid per blob, low J) and sometimes to a bad one (centroids splitting or merging blobs, high J). The best J so far tracker shows why the cure is simply: try many times, keep the best.

this run J = — · best J so far = —

The fix is beautifully pragmatic: run k-means many times from different random starts, and keep the clustering with the lowest distortion. Each run is cheap, and with a handful of restarts you're very likely to hit a good valley at least once. This "best of many restarts" is standard practice — it's why real implementations default to 10 runs. A smarter trick, k-means++, chooses the initial centroids to be spread out (each new one placed far from existing ones), which dramatically reduces the chance of a bad start. Most libraries use k-means++ by default, so you rarely see the worst failures — but they're lurking, and knowing they exist is what keeps you from trusting a single run blindly.

Common misconception: "If k-means gave me a result, that's the clustering of my data." It's one clustering, from one random start — possibly a bad local minimum. Always run multiple restarts (or trust that your library did) and compare distortions. A single run that happened to start badly can produce a confidently wrong grouping that looks authoritative. The discipline of restarts is what separates a reliable result from a lucky (or unlucky) one.

k-means always converges, yet can still produce a poor clustering. Why, and what's the standard fix?

It converges too slowly; the fix is a bigger learning rate. The distortion is non-convex, so it can get stuck in a local minimum depending on the random initialization; the fix is to run many random restarts and keep the lowest-distortion result (or use k-means++ init). It needs labeled data; the fix is to add labels. It always finds the global optimum, so there's no problem.

Chapter 6: Choosing k — The Elbow

So far we've assumed you know how many clusters to look for. But k is something you must choose, and it's often the hardest part — the data doesn't come with a sticky note saying "there are 3 groups." How do you pick k honestly?

The tempting idea — "try every k and pick the one with lowest distortion" — fails, and we already know why from Chapter 2: distortion always decreases as k increases. More centroids means every point can be closer to some center, so J keeps dropping, all the way to zero when k equals the number of points (each point is its own cluster). Minimizing distortion would always tell you to use as many clusters as you have points — useless.

The trick is to look not at the distortion itself, but at how fast it drops. Plot distortion against k. At first, adding clusters helps a lot — going from 1 to 2 to 3 clusters on three-blob data slices the distortion dramatically, because you're separating genuinely distinct groups. But once you have enough clusters to cover the real groups, adding more only helps a little — you're now just subdividing already-tight blobs. That transition, where the steep drop suddenly flattens into a gentle slope, makes a visible bend in the curve: the elbow. The k at the elbow is usually the right number of clusters.

The elbow method

For data with a true number of groups, this plots the converged distortion at each k. Watch for the elbow — the k where the curve stops plunging and starts crawling. Before the elbow, each extra cluster buys a big drop (separating real groups); after it, extra clusters barely help (just splitting tight blobs). Slide the true number of groups in the data and watch the elbow move to match.

true groups in data3

elbow detected at k = —

The elbow is a heuristic, not a theorem — sometimes the bend is fuzzy, and judgment is required. Other methods exist (the silhouette score, which measures how well-separated the clusters are, or the gap statistic), and often the right k comes from domain knowledge rather than any plot ("we want exactly 5 customer tiers because marketing has 5 campaigns"). But the elbow is the first thing to reach for, and it captures the key intuition: the best k is where adding more clusters stops paying off.

Common misconception: "There's a mathematically correct value of k that the data determines." Usually not — k is a modeling choice, and reasonable people (and methods) can disagree. The elbow may be ambiguous; the silhouette may suggest a different k; your application may demand a specific number regardless. Clustering is exploratory, so k is a dial you set based on the plot, the method, and what you'll do with the clusters — not a hidden constant waiting to be discovered.

Why can't you choose the number of clusters k by simply picking the k with the lowest distortion?

Because distortion can't be computed for different k. Because distortion always decreases as k increases (reaching zero when every point is its own cluster), so it always favors more clusters — you look for the "elbow" where the drop flattens instead. Because distortion increases with k. Because k must always equal 3.

Chapter 7: When It Breaks — The Spherical Assumption

k-means is fast, simple, and reliable — but it has a strong hidden assumption baked into that "nearest centroid" rule, and when the assumption is violated, k-means fails in characteristic ways. Knowing them tells you when to reach for something else.

Because a point is assigned to whichever centroid is closest by straight-line distance, k-means implicitly carves the space into regions that are spherical and equal-sized around each centroid. This means k-means quietly assumes your clusters are roughly round blobs of similar size and density. When they're not, it stumbles:

Elongated or stretched clusters: a long, thin cluster gets sliced crosswise into chunks, because points at its far ends are closer to a different centroid than to their own cluster's center.
Clusters of very different sizes: a big sprawling cluster and a tiny dense one — k-means tends to steal the small cluster's points or split the big one, because it can't represent different spreads.
Non-convex shapes: two interleaved crescents or concentric rings — hopeless. "Nearest center" can never wrap around a curve, so k-means cuts straight across them.

k-means on the wrong shapes

Toggle between round blobs (where k-means shines) and awkward shapes — two elongated clusters and two concentric rings. Run k-means and watch it succeed on the blobs but carve the elongated and ring data into nonsensical straight-edged pieces. Its "nearest center" rule simply can't represent these shapes.

—

The fix points straight to the next lesson. k-means uses hard assignments (each point belongs fully to one cluster) and assumes round, equal clusters. The Gaussian Mixture Model — the very next topic — relaxes both: it gives each point a soft, probabilistic membership across clusters, and it models each cluster as a full Gaussian with its own shape, size, and orientation. GMM handles elongated and unequal clusters that defeat k-means. In fact, k-means turns out to be a special, simplified case of the GMM. The next lesson is k-means, leveled up.

Common misconception: "k-means failing on weird shapes means it's a bad algorithm." It means it's the wrong tool for those shapes — not a bad tool. For the enormous number of real problems where clusters are roughly round and comparable in size (and they often are, especially in high dimensions), k-means is fast, robust, and excellent. Knowing its assumption lets you recognize the cases where it'll mislead you, and reach for GMM, spectral clustering, or DBSCAN instead. Right tool, right shape.

What shape of clusters does k-means implicitly assume, and when does it fail?

It assumes square clusters; it fails on round ones. It assumes roughly spherical, equal-sized clusters (from the nearest-centroid rule), so it fails on elongated, very unequal, or non-convex shapes like rings. It makes no assumptions and works on any shape. It only works on one-dimensional data.

Chapter 8: Connections & Cheat Sheet

You've gone from "data with no answer key" to a complete understanding of the world's most-used clustering algorithm — what it does, why it converges, where it gets stuck, and when it's the wrong tool. That's real unsupervised-learning fluency.

The whole lesson on one page

Concept	What it means
Clustering	Unsupervised: group unlabeled points by similarity, using only their positions.
Centroid	A cluster's center — the mean of its assigned points (usually not a real data point).
Assign step	Each point joins its nearest centroid's cluster.
Update step	Each centroid moves to the mean of its assigned points.
Distortion J	Total squared distance from points to their centroids — what k-means minimizes (at fixed k).
Convergence	Both steps only decrease J (coordinate descent) → guaranteed to converge.
Local minima	J is non-convex → bad initialization can trap a poor clustering. Fix: many restarts, keep lowest J; or k-means++ init.
Choosing k	Distortion always falls with k → use the elbow (where the drop flattens), silhouette, or domain knowledge.
Limitation	Assumes spherical, equal-sized clusters; fails on elongated/unequal/non-convex shapes.

k-means from scratch

python
import numpy as np
def kmeans(X, k, iters=100):
    # init: k random data points as starting centroids
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # ASSIGN: each point → index of nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :])**2).sum(2)
        labels = dists.argmin(1)
        # UPDATE: each centroid → mean of its assigned points
        new = np.array([X[labels==j].mean(0) for j in range(k)])
        if np.allclose(new, centroids): break   # converged
        centroids = new
    return labels, centroids

# sklearn (uses k-means++ init and 10 restarts by default):
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_, km.inertia_)   # inertia_ = the distortion J

Where to go next

EM & Gaussian Mixture Models (next) — "soft k-means": probabilistic memberships and full elliptical clusters with their own shapes. Handles what k-means can't, and reveals k-means as a special case.
Generative Learning — the Gaussian modeling there is exactly what GMM clustering uses, but without labels.
Linear Regression — the "squared distance + move to the mean" math is the same least-squares idea, here in the unsupervised setting.
Vector quantization & image compression — a beautiful application: cluster an image's colors into k centroids, then store each pixel as a cluster index. k-means as a compressor.

You can now teach this. Clustering finds groups in unlabeled data. k-means defines clusters by centers and alternates two steps: assign each point to its nearest center, move each center to the mean of its points. This is coordinate descent on the distortion (total squared distance), so it always converges — but to a local minimum, so you run many restarts and keep the best. Choose k with the elbow. And remember its blind spot: it assumes round, equal blobs, which is why GMM comes next. The foundation of unsupervised learning, fully in hand.

"The goal is to turn data into information, and information into insight." — Carly Fiorina. k-means turns a featureless cloud of points into named groups — the first, indispensable step from raw data toward understanding.