k-means, leveled up. Soft, probabilistic clustering with stretchable elliptical groups — powered by a chicken-and-egg algorithm so elegant it became one of the most important ideas in all of statistics.
The k-means lesson ended on a confession: k-means fails on elongated and overlapping clusters, because its "nearest center" rule can only carve out round, equal blobs. But there's a second, subtler problem we glossed over — one that bothers us even when the clusters are round.
Consider a point sitting right on the border between two clusters. k-means forces it to pick exactly one — a hard assignment. But that's a lie about what we actually know. The honest answer is "this point is about 55% cluster A and 45% cluster B — I'm genuinely unsure." k-means erases that uncertainty, slamming the point fully into one cluster and pretending it's certain. For points deep inside a cluster, fine. For the many points near boundaries, we're throwing away real information.
Drag the orange point between the two clusters. The hard assignment bar (k-means) snaps fully to whichever cluster is nearer — 100%/0%, flipping abruptly as you cross the midpoint. The soft assignment bar (what we want) smoothly reflects genuine membership — 60/40, 50/50, 45/55 — capturing the uncertainty k-means destroys.
We want a clustering method that does two things k-means can't. First, soft assignments: every point gets a probability of belonging to each cluster, not a forced single choice. Second, flexible cluster shapes: each cluster modeled as a full Gaussian with its own size, stretch, and tilt — so elongated and unequal clusters fit naturally. The method that delivers both is the Gaussian Mixture Model (GMM), and the algorithm that trains it — Expectation-Maximization, or EM — is one of the most beautiful and widely-used ideas in machine learning.
To cluster probabilistically, we first tell a story about how the data was generated — just like the generative-learning lesson. The story is called a Gaussian Mixture Model, and it imagines each data point being born in two steps:
Repeat for every point, and you get data that looks like several overlapping elliptical blobs of different sizes — exactly the kind of data k-means struggles with. The crucial twist: we see the points, but not which cluster each one came from. That cluster identity — the hidden label z — is a latent variable: real, but unobserved. Our whole job is to work backwards from the visible points to recover the k Gaussians and the hidden labels.
Press Sample to generate points from a 3-Gaussian mixture, watching them fall: each point first picks a cluster (weighted by size), then lands inside that cluster's elliptical blob. The contours show the three true Gaussians. This is the generative story — and clustering is the inverse: given only the dots, recover the blobs.
This generative view is powerful because it tells us exactly what we're estimating: the mixing weights φj (how big each cluster is), the means μj (where each is centered), and the covariances Σj (each cluster's shape). Find those, and you've not just clustered the data — you've built a full probabilistic model that can score how likely any new point is, generate synthetic data, and report calibrated cluster memberships. It's clustering and density estimation in one.
Here's what makes fitting a mixture model genuinely tricky — and the trap reveals the elegant idea that escapes it. We want to find the Gaussians (the φ's, μ's, Σ's) and recover the hidden labels z. But these two unknowns are locked in a circular dependency.
So each unknown is easy if you have the other. But we have neither — we start with just a cloud of unlabeled points. It's a perfect chicken-and-egg deadlock: to find the labels we need the Gaussians, and to find the Gaussians we need the labels.
The Expectation-Maximization (EM) algorithm breaks the deadlock with a move you've now seen several times in this course: guess, then refine. Start with random Gaussians. Then alternate two steps, each one using the current best guess of one unknown to improve the other:
That's the whole idea, and it should feel deeply familiar — it's the exact same "alternate between two easy sub-problems" structure as k-means (assign, then update). In fact, as we'll see in Chapter 6, k-means is EM with the soft guesses hardened to 0-or-1. EM is k-means that kept the probabilities.
Let's build the first half of EM. The E-step (Expectation) answers: given my current guess of the Gaussians, how likely is each point to have come from each cluster? The answer is a set of soft memberships called responsibilities — for point i and cluster j, the responsibility wij is the probability that point i was generated by cluster j.
We compute it with Bayes' rule — the exact same tool from the generative-learning lesson. Two ingredients decide how much cluster j "claims" a point:
Multiply prior × likelihood for each cluster, then normalize so the responsibilities across all clusters sum to 1. That normalization is what makes them honest probabilities — "this point is 70% cluster A, 25% cluster B, 5% cluster C." A point dead-center in one blob gets responsibility ≈1 for it; a point on a boundary splits its responsibility between the neighbors, exactly the soft, uncertain answer we wanted in Chapter 0.
Drag the white query point around the three fixed Gaussians. The bars show its responsibilities — the probability it belongs to each cluster, from Bayes' rule (bigger/closer Gaussians claim more). Move it into a blob's heart and one bar dominates; move it to a no-man's-land between blobs and the responsibilities split. The point is "painted" by a blend of cluster colors weighted by responsibility.
Notice this is precisely the soft version of k-means' assign step. k-means said "go to your single nearest center." The E-step says "distribute yourself across all centers, weighted by how probable each is." Same role — figure out cluster membership — but k-means commits to one answer while EM keeps the full distribution. And because it uses the full Gaussian (not just distance to the center), it naturally accounts for each cluster's shape and size: a point can be far from a stretched cluster's center yet still firmly belong, if it lies along the cluster's long axis.
Now the second half. The M-step (Maximization) takes the soft responsibilities from the E-step and uses them to update each Gaussian. It pretends the soft memberships are correct and asks: given these weighted points, what's the best Gaussian for each cluster? The answer is a beautiful weighted version of the same count-and-average we used for hard labels.
For each cluster j, every point contributes to it in proportion to its responsibility wij for that cluster. A point that's 90% cluster A counts as 0.9 of a point toward A's statistics and 0.1 toward B's. Then:
Compare these to the hard-label formulas from the generative-learning lesson and they're identical — except the crisp 0/1 "is this point in cluster j?" indicator is replaced by the soft responsibility wij ∈ [0,1]. That's the entire difference between EM and supervised Gaussian fitting: EM uses fractional memberships because it doesn't actually know the labels, only its current best guess of them. Each M-step move is the exact maximum-likelihood update given those soft guesses.
Points are softly colored by their current responsibilities. Press M-step and watch each Gaussian's ellipse relocate and reshape — its center slides to the weighted mean of its points, and its ellipse stretches and tilts to the weighted spread. Press E-step to re-soften the colors under the new Gaussians, then M-step again. The two steps chase each other into a perfect fit.
Everything together, live. Below is a full GMM trained by EM. Watch the Gaussians begin as random circles and, through alternating E and M steps, flow outward — stretching, tilting, and sliding until each ellipse wraps snugly around a real cluster. This is where you see EM do what k-means cannot.
Each ellipse is one Gaussian (mean + covariance); points are softly colored by responsibility. Press Run to alternate E and M steps to convergence, or Step through them. Switch the data shape to see GMM handle what k-means can't.
(No quiz — the lab is the test. If you can watch an ellipse and predict which way it'll stretch on the next M-step, you understand EM. Notice how the soft coloring in overlap regions never fully commits — that honesty about uncertainty is the whole point.)
We've hinted at it repeatedly; now let's make it precise. k-means is a special case of EM. They're not cousins — one is literally the other with two simplifications. Seeing this unifies everything you've learned across these two lessons.
Take EM for Gaussian mixtures and make two changes. First, harden the responsibilities: instead of soft probabilities, force each point to give responsibility 1 to its most-probable cluster and 0 to all others. Second, freeze the covariances: force every cluster to be a fixed-size sphere (identity covariance), so the Gaussian density depends only on distance to the center. Make those two changes, and the E-step becomes "assign each point to its nearest center" and the M-step becomes "move each center to the mean of its assigned points." That's exactly k-means.
| k-means | EM / GMM | |
|---|---|---|
| Assignment | Hard (one cluster, 0/1) | Soft (probabilities) |
| Cluster shape | Fixed spheres | Full ellipses (own size, stretch, tilt) |
| "Assign" step | Nearest centroid | E-step: Bayes' responsibilities |
| "Update" step | Mean of assigned points | M-step: weighted mean & covariance |
| Output | A label per point | A probability distribution + full density model |
| Handles overlap/stretch | No | Yes |
Left: k-means (hard, spherical). Right: GMM/EM (soft, elliptical). Run both on the same data and compare. On round blobs they agree. Switch to elongated clusters and watch k-means slice them wrongly while GMM's ellipses elongate to fit — the visual proof that EM is k-means with the constraints lifted.
EM feels like a heuristic — guess, refine, repeat — but it carries a real guarantee: every iteration increases (or holds) the likelihood of the data. The model gets strictly more probable each round until it settles. Let's understand why, without drowning in algebra.
The quantity EM is climbing is the log-likelihood — how probable the observed data is under the current Gaussians. We can't maximize it directly (that's the chicken-and-egg problem). So EM does something clever: at each step it builds a lower bound — a simpler function that sits underneath the true log-likelihood and touches it at the current parameters. The E-step constructs this bound (using the responsibilities); the M-step jumps to the top of the bound. Because the bound touches the true objective where you started and you climbed the bound, you must have climbed the true objective too.
The mathematical tool that builds the bound is Jensen's inequality — a fact about convex functions stating that "the average of a function is at least the function of the average" (for convex functions; the reverse for concave). Applied to the log (a concave function), it lets us swap a hard-to-handle "log of a sum" for an easy-to-maximize "sum of logs," producing exactly the bound EM optimizes. You don't need the derivation to use EM, but the upshot is solid: EM is principled hill-climbing on the data likelihood, guaranteed never to go downhill.
Run EM and watch the data's log-likelihood climb with every E+M iteration — a monotone ascent to a plateau where the model has converged. Just like k-means' distortion went only down, EM's likelihood goes only up. (And just like k-means, it can settle on a local optimum — so multiple restarts still matter.)
Two practical consequences follow. First, like k-means, EM can get stuck in a local optimum — the likelihood surface is bumpy, and a bad initialization lands in a mediocre valley. The cure is the same: run multiple random restarts (often initialized with k-means!) and keep the highest-likelihood result. Second, EM is a general framework, not just a clustering trick: any model with hidden variables — mixtures, hidden Markov models, topic models, missing-data problems — can be trained with the same E-step/M-step recipe. Learning EM here unlocks a huge swath of probabilistic modeling.
You've turned k-means' rigid "pick one" into a full probabilistic model — soft memberships, flexible shapes, and a convergence guarantee — and met EM, a framework that reaches far beyond clustering. That's a major step into modern probabilistic machine learning.
| Concept | What it means |
|---|---|
| Gaussian Mixture Model | Data = a blend of k Gaussians; each point picks a cluster (weight φj) then draws from that cluster's N(μj, Σj). |
| Latent variable z | The hidden "which cluster" label — real but unobserved. |
| Chicken-and-egg | Gaussians are easy if you know z; z is easy if you know the Gaussians; you know neither. |
| E-step | Soft-guess z: responsibilities wij = Bayes' (prior × Gaussian density), normalized. |
| M-step | Re-fit each Gaussian: responsibility-weighted mixing weight, mean, and covariance. |
| Convergence | Each iteration increases the log-likelihood (Jensen's inequality) → converges, but to a local optimum (use restarts). |
| k-means link | k-means = EM with hard (0/1) responsibilities + fixed spherical covariances. |
| EM is general | Any latent-variable model (HMMs, topic models, missing data) trains with the same E/M recipe. |
python import numpy as np from scipy.stats import multivariate_normal def em_gmm(X, k, iters=100): n, d = X.shape # init: random means, identity covariances, equal weights mu = X[np.random.choice(n, k, replace=False)] Sigma = [np.eye(d) for _ in range(k)]; phi = np.ones(k)/k for _ in range(iters): # E-STEP: responsibilities = prior × Gaussian density, normalized W = np.array([phi[j]*multivariate_normal(mu[j],Sigma[j]).pdf(X) for j in range(k)]).T W /= W.sum(1, keepdims=True) # M-STEP: weighted re-estimation Nk = W.sum(0); phi = Nk/n mu = (W.T @ X) / Nk[:,None] for j in range(k): diff = X - mu[j] Sigma[j] = (W[:,j,None]*diff).T @ diff / Nk[j] + 1e-6*np.eye(d) return phi, mu, Sigma, W # sklearn (k-means init + restarts built in): from sklearn.mixture import GaussianMixture gmm = GaussianMixture(n_components=3, n_init=5).fit(X) gmm.predict_proba(X_new) # soft responsibilities for new points
"When you have eliminated the impossible, whatever remains, however improbable, must be the truth." — Sherlock Holmes. EM never eliminates — it keeps every possibility alive with a probability, and lets the data slowly sharpen the improbable into the clear.