Classical ML · CS229

EM & Gaussian Mixture Models

k-means, leveled up. Soft, probabilistic clustering with stretchable elliptical groups — powered by a chicken-and-egg algorithm so elegant it became one of the most important ideas in all of statistics.

Prerequisites: k-Means + the Gaussian & Bayes' rule. We combine them.
9
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: When "Pick One" Throws Away the Truth

The k-means lesson ended on a confession: k-means fails on elongated and overlapping clusters, because its "nearest center" rule can only carve out round, equal blobs. But there's a second, subtler problem we glossed over — one that bothers us even when the clusters are round.

Consider a point sitting right on the border between two clusters. k-means forces it to pick exactly one — a hard assignment. But that's a lie about what we actually know. The honest answer is "this point is about 55% cluster A and 45% cluster B — I'm genuinely unsure." k-means erases that uncertainty, slamming the point fully into one cluster and pretending it's certain. For points deep inside a cluster, fine. For the many points near boundaries, we're throwing away real information.

The border point: a hard pick vs. an honest answer

Drag the orange point between the two clusters. The hard assignment bar (k-means) snaps fully to whichever cluster is nearer — 100%/0%, flipping abruptly as you cross the midpoint. The soft assignment bar (what we want) smoothly reflects genuine membership — 60/40, 50/50, 45/55 — capturing the uncertainty k-means destroys.

hard:    soft:

We want a clustering method that does two things k-means can't. First, soft assignments: every point gets a probability of belonging to each cluster, not a forced single choice. Second, flexible cluster shapes: each cluster modeled as a full Gaussian with its own size, stretch, and tilt — so elongated and unequal clusters fit naturally. The method that delivers both is the Gaussian Mixture Model (GMM), and the algorithm that trains it — Expectation-Maximization, or EM — is one of the most beautiful and widely-used ideas in machine learning.

The plan. Ch.1: the mixture model — data as a blend of several Gaussians, with a hidden "which one" label. Ch.2: the chicken-and-egg problem that makes it hard. Ch.3: the E-step (soft-guess the hidden labels). Ch.4: the M-step (update the Gaussians using those guesses). Ch.5: a live EM lab. Ch.6: EM vs k-means, revealing k-means as a special case. Ch.7: why EM converges. From k-means' rigid "pick one" to a probabilistic model that captures what your data is really doing.
Common misconception: "Soft assignments are just hard assignments with extra decimal places — you round them at the end anyway." The soft probabilities carry information that changes the fit itself, not just the final labels. A border point contributes partially to both clusters' means and shapes during training, which lets overlapping clusters be estimated correctly — something impossible when every point is forced wholly into one. The softness isn't cosmetic; it's what makes the model work where k-means breaks.
What is the key limitation of k-means' "hard assignment" that GMM fixes with soft assignments?

Chapter 1: The Mixture Model — A Recipe for the Data

To cluster probabilistically, we first tell a story about how the data was generated — just like the generative-learning lesson. The story is called a Gaussian Mixture Model, and it imagines each data point being born in two steps:

Step 1 — Pick a cluster
Roll a weighted die with k sides to choose which cluster this point belongs to. Cluster j is chosen with probability φj (the mixing weight — how big that cluster is). This choice is the hidden label z.
Step 2 — Draw from its Gaussian
Now draw the point's position from cluster j's own Gaussian — a blob with its own center μj and its own shape Σj (size, stretch, tilt). The point lands somewhere in that blob.

Repeat for every point, and you get data that looks like several overlapping elliptical blobs of different sizes — exactly the kind of data k-means struggles with. The crucial twist: we see the points, but not which cluster each one came from. That cluster identity — the hidden label z — is a latent variable: real, but unobserved. Our whole job is to work backwards from the visible points to recover the k Gaussians and the hidden labels.

Sampling from a mixture of Gaussians

Press Sample to generate points from a 3-Gaussian mixture, watching them fall: each point first picks a cluster (weighted by size), then lands inside that cluster's elliptical blob. The contours show the three true Gaussians. This is the generative story — and clustering is the inverse: given only the dots, recover the blobs.

points80

This generative view is powerful because it tells us exactly what we're estimating: the mixing weights φj (how big each cluster is), the means μj (where each is centered), and the covariances Σj (each cluster's shape). Find those, and you've not just clustered the data — you've built a full probabilistic model that can score how likely any new point is, generate synthetic data, and report calibrated cluster memberships. It's clustering and density estimation in one.

Common misconception: "A mixture of Gaussians can only model blob-shaped data." Each component is a single ellipse, but a mixture of enough Gaussians can approximate almost any distribution — curves, multiple modes, weird shapes — by tiling it with overlapping ellipses. With enough components, GMMs are universal density approximators. The single-blob intuition is just the k=1 case; the power comes from the mixture.
In a Gaussian Mixture Model, what is the "latent variable" z?

Chapter 2: The Chicken-and-Egg Problem

Here's what makes fitting a mixture model genuinely tricky — and the trap reveals the elegant idea that escapes it. We want to find the Gaussians (the φ's, μ's, Σ's) and recover the hidden labels z. But these two unknowns are locked in a circular dependency.

If we knew the labels z — which cluster each point came from — finding the Gaussians would be trivial. Just gather the points of each cluster and compute their mixing weight (fraction of points), mean (average position), and covariance (their spread). It's exactly the closed-form Gaussian-fitting from the generative-learning lesson, with z playing the role of the labels.
If we knew the Gaussians — each cluster's center and shape — finding the labels would be trivial. Just use Bayes' rule: for each point, compute the probability it came from each cluster (prior × how well that Gaussian explains it), and there's its membership.

So each unknown is easy if you have the other. But we have neither — we start with just a cloud of unlabeled points. It's a perfect chicken-and-egg deadlock: to find the labels we need the Gaussians, and to find the Gaussians we need the labels.

The Expectation-Maximization (EM) algorithm breaks the deadlock with a move you've now seen several times in this course: guess, then refine. Start with random Gaussians. Then alternate two steps, each one using the current best guess of one unknown to improve the other:

E-step (Expectation)
Pretend the current Gaussians are right, and soft-guess the labels: compute each point's probability of belonging to each cluster (Bayes' rule). These soft guesses are the "responsibilities."
M-step (Maximization)
Pretend the soft labels are right, and re-fit the Gaussians: update each cluster's weight, mean, and shape using the responsibility-weighted points.
↻ repeat — each step improves the fit

That's the whole idea, and it should feel deeply familiar — it's the exact same "alternate between two easy sub-problems" structure as k-means (assign, then update). In fact, as we'll see in Chapter 6, k-means is EM with the soft guesses hardened to 0-or-1. EM is k-means that kept the probabilities.

Common misconception: "EM is just a heuristic that happens to work." It's far more principled than that — EM is a rigorous algorithm for maximum-likelihood estimation when there are hidden variables, and (Chapter 7) it's guaranteed to increase the data's likelihood every iteration. The "guess and refine" framing makes it intuitive, but underneath it's optimizing exactly the quantity you'd want: the probability of the observed data under the model. It earns its convergence guarantee.
Why can't we just directly fit a Gaussian Mixture Model in one shot?

Chapter 3: The E-Step — Soft-Guessing the Labels

Let's build the first half of EM. The E-step (Expectation) answers: given my current guess of the Gaussians, how likely is each point to have come from each cluster? The answer is a set of soft memberships called responsibilities — for point i and cluster j, the responsibility wij is the probability that point i was generated by cluster j.

We compute it with Bayes' rule — the exact same tool from the generative-learning lesson. Two ingredients decide how much cluster j "claims" a point:

Multiply prior × likelihood for each cluster, then normalize so the responsibilities across all clusters sum to 1. That normalization is what makes them honest probabilities — "this point is 70% cluster A, 25% cluster B, 5% cluster C." A point dead-center in one blob gets responsibility ≈1 for it; a point on a boundary splits its responsibility between the neighbors, exactly the soft, uncertain answer we wanted in Chapter 0.

Responsibilities: how each cluster claims a point

Drag the white query point around the three fixed Gaussians. The bars show its responsibilities — the probability it belongs to each cluster, from Bayes' rule (bigger/closer Gaussians claim more). Move it into a blob's heart and one bar dominates; move it to a no-man's-land between blobs and the responsibilities split. The point is "painted" by a blend of cluster colors weighted by responsibility.

responsibilities w = []

Notice this is precisely the soft version of k-means' assign step. k-means said "go to your single nearest center." The E-step says "distribute yourself across all centers, weighted by how probable each is." Same role — figure out cluster membership — but k-means commits to one answer while EM keeps the full distribution. And because it uses the full Gaussian (not just distance to the center), it naturally accounts for each cluster's shape and size: a point can be far from a stretched cluster's center yet still firmly belong, if it lies along the cluster's long axis.

Common misconception: "The responsibility is just based on which centroid is closest." It's based on the full Gaussian density, which folds in the cluster's size, stretch, and orientation — not raw distance. A point can be closer to centroid A by straight-line distance yet have higher responsibility for B, if B is a large, sprawling cluster whose Gaussian still assigns the point decent density while A is a tiny tight one. Accounting for cluster shape is exactly what lets GMM handle data k-means can't.
In the E-step, how is each point's responsibility (soft membership) for a cluster computed?

Chapter 4: The M-Step — Re-Fitting the Gaussians

Now the second half. The M-step (Maximization) takes the soft responsibilities from the E-step and uses them to update each Gaussian. It pretends the soft memberships are correct and asks: given these weighted points, what's the best Gaussian for each cluster? The answer is a beautiful weighted version of the same count-and-average we used for hard labels.

For each cluster j, every point contributes to it in proportion to its responsibility wij for that cluster. A point that's 90% cluster A counts as 0.9 of a point toward A's statistics and 0.1 toward B's. Then:

Compare these to the hard-label formulas from the generative-learning lesson and they're identical — except the crisp 0/1 "is this point in cluster j?" indicator is replaced by the soft responsibility wij ∈ [0,1]. That's the entire difference between EM and supervised Gaussian fitting: EM uses fractional memberships because it doesn't actually know the labels, only its current best guess of them. Each M-step move is the exact maximum-likelihood update given those soft guesses.

One M-step: ellipses snap to their points

Points are softly colored by their current responsibilities. Press M-step and watch each Gaussian's ellipse relocate and reshape — its center slides to the weighted mean of its points, and its ellipse stretches and tilts to the weighted spread. Press E-step to re-soften the colors under the new Gaussians, then M-step again. The two steps chase each other into a perfect fit.

The full loop, in one breath. E-step: with the current Gaussians, softly color every point by Bayes' rule. M-step: with those colors, re-fit each Gaussian as the weighted mean and weighted spread of its points. Each step strictly improves the model's fit to the data, and repeating them drives the Gaussians to lock onto the true clusters — shapes, sizes, overlaps and all. It's the same assign-update rhythm as k-means, but every quantity is soft and every cluster is a full, flexible ellipse.
Common misconception: "The M-step covariance update is just bookkeeping — the means are what matter." The covariances are the thing that makes GMM more powerful than k-means. By letting each cluster learn its own shape, size, and orientation, the M-step covariance update is what allows GMM to fit the stretched, unequal, overlapping clusters that defeat k-means' rigid spheres. Drop the covariance flexibility (force every cluster to be a fixed-size circle) and you collapse GMM back down to k-means.
How does the M-step update differ from the supervised (known-label) Gaussian fitting formulas?

Chapter 5: The EM Lab

Everything together, live. Below is a full GMM trained by EM. Watch the Gaussians begin as random circles and, through alternating E and M steps, flow outward — stretching, tilting, and sliding until each ellipse wraps snugly around a real cluster. This is where you see EM do what k-means cannot.

Things to try:
  • Run and watch the ellipses grow from tiny circles into shapes that match each cluster's stretch and tilt — the soft coloring sharpening as confidence grows.
  • Try the elongated data — the very shape that broke k-means. Watch GMM's ellipses elongate to fit it perfectly, where k-means could only cut straight across.
  • Try overlapping clusters — points in the overlap region stay softly two-colored, honestly uncertain, instead of being arbitrarily split.
  • Step slowly through E and M to see each half of the dance in isolation.
EM for Gaussian Mixtures — the ellipses come alive

Each ellipse is one Gaussian (mean + covariance); points are softly colored by responsibility. Press Run to alternate E and M steps to convergence, or Step through them. Switch the data shape to see GMM handle what k-means can't.

iter 0 · log-lik

(No quiz — the lab is the test. If you can watch an ellipse and predict which way it'll stretch on the next M-step, you understand EM. Notice how the soft coloring in overlap regions never fully commits — that honesty about uncertainty is the whole point.)

Chapter 6: EM vs. k-Means — The Same Algorithm, Softened

We've hinted at it repeatedly; now let's make it precise. k-means is a special case of EM. They're not cousins — one is literally the other with two simplifications. Seeing this unifies everything you've learned across these two lessons.

Take EM for Gaussian mixtures and make two changes. First, harden the responsibilities: instead of soft probabilities, force each point to give responsibility 1 to its most-probable cluster and 0 to all others. Second, freeze the covariances: force every cluster to be a fixed-size sphere (identity covariance), so the Gaussian density depends only on distance to the center. Make those two changes, and the E-step becomes "assign each point to its nearest center" and the M-step becomes "move each center to the mean of its assigned points." That's exactly k-means.

k-meansEM / GMM
AssignmentHard (one cluster, 0/1)Soft (probabilities)
Cluster shapeFixed spheresFull ellipses (own size, stretch, tilt)
"Assign" stepNearest centroidE-step: Bayes' responsibilities
"Update" stepMean of assigned pointsM-step: weighted mean & covariance
OutputA label per pointA probability distribution + full density model
Handles overlap/stretchNoYes
Same data, two algorithms, side by side

Left: k-means (hard, spherical). Right: GMM/EM (soft, elliptical). Run both on the same data and compare. On round blobs they agree. Switch to elongated clusters and watch k-means slice them wrongly while GMM's ellipses elongate to fit — the visual proof that EM is k-means with the constraints lifted.

Why bother with the harder algorithm? k-means is faster and simpler, and when clusters really are round, well-separated blobs, it's the right tool. Reach for GMM/EM when you need what k-means throws away: soft memberships (calibrated "how sure?" per point), flexible shapes (elongated or unequal clusters), or a full density model (to score new points or generate data). The cost is more computation and more parameters to estimate; the payoff is honesty and flexibility. Knowing both, and when each fits, is the mark of fluency.
Common misconception: "Since GMM is more general, it's always the better choice." More general means more parameters (each cluster's full covariance), which means it needs more data to estimate reliably and is more prone to overfitting and to degenerate solutions (a Gaussian collapsing onto a single point). With little data or genuinely spherical clusters, k-means' simplicity is a virtue, not a limitation. Generality is a tool with a cost — spend it where it buys you something.
What two simplifications turn EM for Gaussian mixtures into exactly k-means?

Chapter 7: Why EM Converges

EM feels like a heuristic — guess, refine, repeat — but it carries a real guarantee: every iteration increases (or holds) the likelihood of the data. The model gets strictly more probable each round until it settles. Let's understand why, without drowning in algebra.

The quantity EM is climbing is the log-likelihood — how probable the observed data is under the current Gaussians. We can't maximize it directly (that's the chicken-and-egg problem). So EM does something clever: at each step it builds a lower bound — a simpler function that sits underneath the true log-likelihood and touches it at the current parameters. The E-step constructs this bound (using the responsibilities); the M-step jumps to the top of the bound. Because the bound touches the true objective where you started and you climbed the bound, you must have climbed the true objective too.

The mathematical tool that builds the bound is Jensen's inequality — a fact about convex functions stating that "the average of a function is at least the function of the average" (for convex functions; the reverse for concave). Applied to the log (a concave function), it lets us swap a hard-to-handle "log of a sum" for an easy-to-maximize "sum of logs," producing exactly the bound EM optimizes. You don't need the derivation to use EM, but the upshot is solid: EM is principled hill-climbing on the data likelihood, guaranteed never to go downhill.

The log-likelihood only goes up

Run EM and watch the data's log-likelihood climb with every E+M iteration — a monotone ascent to a plateau where the model has converged. Just like k-means' distortion went only down, EM's likelihood goes only up. (And just like k-means, it can settle on a local optimum — so multiple restarts still matter.)

iter 0 · log-lik

Two practical consequences follow. First, like k-means, EM can get stuck in a local optimum — the likelihood surface is bumpy, and a bad initialization lands in a mediocre valley. The cure is the same: run multiple random restarts (often initialized with k-means!) and keep the highest-likelihood result. Second, EM is a general framework, not just a clustering trick: any model with hidden variables — mixtures, hidden Markov models, topic models, missing-data problems — can be trained with the same E-step/M-step recipe. Learning EM here unlocks a huge swath of probabilistic modeling.

Common misconception: "EM converging to higher likelihood each step means it finds the global best model." Monotonic improvement guarantees you reach a peak, not the peak — the log-likelihood is non-convex with multiple local maxima, exactly like k-means' distortion. EM climbs reliably but only to the nearest summit. That's why, despite its convergence guarantee, you still run several initializations and compare. Guaranteed-uphill is not the same as guaranteed-best.
What does EM's convergence guarantee actually promise?

Chapter 8: Connections & Cheat Sheet

You've turned k-means' rigid "pick one" into a full probabilistic model — soft memberships, flexible shapes, and a convergence guarantee — and met EM, a framework that reaches far beyond clustering. That's a major step into modern probabilistic machine learning.

The whole lesson on one page

ConceptWhat it means
Gaussian Mixture ModelData = a blend of k Gaussians; each point picks a cluster (weight φj) then draws from that cluster's N(μj, Σj).
Latent variable zThe hidden "which cluster" label — real but unobserved.
Chicken-and-eggGaussians are easy if you know z; z is easy if you know the Gaussians; you know neither.
E-stepSoft-guess z: responsibilities wij = Bayes' (prior × Gaussian density), normalized.
M-stepRe-fit each Gaussian: responsibility-weighted mixing weight, mean, and covariance.
ConvergenceEach iteration increases the log-likelihood (Jensen's inequality) → converges, but to a local optimum (use restarts).
k-means linkk-means = EM with hard (0/1) responsibilities + fixed spherical covariances.
EM is generalAny latent-variable model (HMMs, topic models, missing data) trains with the same E/M recipe.

GMM in code

python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100):
    n, d = X.shape
    # init: random means, identity covariances, equal weights
    mu = X[np.random.choice(n, k, replace=False)]
    Sigma = [np.eye(d) for _ in range(k)]; phi = np.ones(k)/k
    for _ in range(iters):
        # E-STEP: responsibilities = prior × Gaussian density, normalized
        W = np.array([phi[j]*multivariate_normal(mu[j],Sigma[j]).pdf(X) for j in range(k)]).T
        W /= W.sum(1, keepdims=True)
        # M-STEP: weighted re-estimation
        Nk = W.sum(0); phi = Nk/n
        mu = (W.T @ X) / Nk[:,None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (W[:,j,None]*diff).T @ diff / Nk[j] + 1e-6*np.eye(d)
    return phi, mu, Sigma, W

# sklearn (k-means init + restarts built in):
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, n_init=5).fit(X)
gmm.predict_proba(X_new)   # soft responsibilities for new points

Where to go next

You can now teach this. A GMM models data as a blend of Gaussians with a hidden cluster label. EM breaks the chicken-and-egg deadlock: the E-step soft-guesses each point's cluster membership (Bayes' responsibilities), the M-step re-fits each Gaussian as the responsibility-weighted mean and covariance. It climbs the data likelihood every step (to a local optimum), and k-means is just EM with hard assignments and fixed spheres. Soft, elliptical, probabilistic clustering — and a framework that powers half of probabilistic ML.

"When you have eliminated the impossible, whatever remains, however improbable, must be the truth." — Sherlock Holmes. EM never eliminates — it keeps every possibility alive with a probability, and lets the data slowly sharpen the improbable into the clear.