Instead of drawing a line between the classes, model what each class looks like — then ask which one most likely produced your data. The other half of classification, and the engine behind the spam filter that defined an era.
Suppose you're building a classifier to tell elephants from dogs, from a few measurements like height and weight. Everything you've learned so far — logistic regression, the perceptron — takes one approach: look at all the animals at once and find the dividing line that best separates elephants from dogs. To classify a new animal, check which side of the line it falls on. This is discriminative learning: it learns the boundary directly.
But there's a completely different way to think about it — arguably a more human one. First, study only the elephants and build a mental picture of what a typical elephant looks like: big, heavy, this range of proportions. Then, separately, study the dogs and build a picture of a typical dog. To classify a new animal, hold it up against both pictures and ask: does this look more like my elephant model, or my dog model? You never drew a boundary at all. This is generative learning: it learns what each class looks like.
Why bother with a second philosophy? Three reasons we'll make concrete. Data efficiency: if you have strong, correct assumptions about what each class looks like, the generative approach can learn from far fewer examples. Insight: a model of each class can generate new synthetic examples and tells you why something was classified the way it was. And speed: the most famous generative classifier, Naive Bayes, is so simple and fast it ran the world's spam filters for two decades and is still the textbook "first thing to try."
Two classes of animals (blue dogs, orange elephants). Toggle between the discriminative view — a single dividing line — and the generative view — a separate blob (a fitted Gaussian) modeling each class. Both classify the same way in the end, but they think about the problem completely differently.
The generative approach builds a model of each class — but our actual goal is to predict the class given the features. We need a bridge from "what does each class look like" to "which class is this." That bridge is Bayes' rule, one of the most important formulas in all of statistics.
A generative model gives us two ingredients. First, the class prior — how common each class is before we look at any features. (If 30% of animals in our world are elephants, the prior for elephant is 0.3.) Second, the class-conditional — how likely the features are given a class. (How likely is "height 3 metres, weight 5 tonnes" if the animal is an elephant? Very. If it's a dog? Essentially zero.)
Bayes' rule combines these into what we actually want — the probability of the class given the features:
Read it as a recipe. To find how probable "elephant" is given the measurements, multiply how well elephants explain these measurements (the class-conditional) by how common elephants are (the prior), then normalize. Do the same for "dog," and pick whichever comes out larger. The numerator is "evidence for this class times how often we expect it"; the denominator just rescales so the probabilities sum to one.
A new animal's measurement gives a likelihood under each class model (how well each class explains it). Slide the class prior for elephants and watch the posterior (the final probability) shift. Even a strong likelihood can be overruled by a strong prior — that's Bayes balancing evidence against expectation.
To build a generative model for continuous features (height, weight), we need a way to describe "what a typical elephant looks like" as a probability distribution over those features. The natural tool is the bell curve — but generalized from one dimension to many. That's the multivariate Gaussian, and it's the heart of the GDA model in the next chapter.
In one dimension, a Gaussian needs two numbers: a mean (where the bump is centered) and a variance (how wide it is). In two or more dimensions, those generalize:
The result is an elliptical hill of probability: tallest at the mean, fading outward, stretched and tilted by the covariance. Round blob = features are independent and equally spread. Tilted, stretched ellipse = features are correlated. Get a feel for it:
The contour rings are a 2D Gaussian's "blueprint" — a top-down view of the probability hill. Slide the spread to make it bigger or smaller, and the correlation to tilt it. Positive correlation stretches it along the diagonal (tall→heavy); negative tilts it the other way. This single shape is how GDA will model an entire class.
Now we assemble the pieces into our first generative classifier: Gaussian Discriminant Analysis, or GDA. The idea is exactly the elephant/dog blueprint from Chapter 0, made precise. We model each class as a multivariate Gaussian blob, and we use Bayes' rule to classify. Three assumptions define the model:
That last detail — both classes share one covariance — matters enormously, and we'll see why in the next chapter. For now: two blobs of the same shape and orientation, sitting at two different locations.
Here's the delight of GDA: unlike logistic regression, there's no iteration, no gradient descent, no learning rate. The maximum-likelihood fit is a closed-form recipe you could do by hand:
That's it. Count, average, average, measure-spread. A few passes over the data and you're done — no optimization loop at all. This is part of why generative models can be so data-efficient: each parameter is a simple statistic with a clear meaning.
Drag any point to move it; press Fit to recompute. GDA places a Gaussian blob (contours) over each class — same shape, two centers — and the white line is the resulting decision boundary where the two classes are equally probable. Watch the blobs and the boundary update as you reshape the data.
Here's a result that should genuinely surprise you. We built GDA from a completely different philosophy than logistic regression — modeling each class as a Gaussian rather than drawing a boundary. Yet if you take a trained GDA model and ask it the discriminative question — "what's the probability of class 1 given the features?" — and work through the Bayes' rule algebra, out pops:
The sigmoid. Exactly the logistic regression form, with some weight vector θ that's a specific function of GDA's means, prior, and covariance. The reason traces back to the shared covariance assumption: because both Gaussians have the same shape, the curved parts of their densities cancel when you take the ratio, leaving a linear function of x inside the exponential — and that's precisely what makes the boundary a straight line and the posterior a sigmoid.
That one-way street is the whole story of when to use which. GDA's Gaussian assumption implies a logistic posterior — if the data really is two shared-covariance Gaussians, the boundary is logistic. But the reverse is false: data with a logistic boundary need not be Gaussian. Many different generative stories (Gaussian, Poisson, and more) all lead to the same logistic posterior. So logistic regression assumes less — it commits only to the boundary shape, staying agnostic about the actual distribution of each class. GDA assumes more — it commits to the full Gaussian shape of each class. That difference in how much they assume is exactly the trade-off we unpack next.
Move the query point across the GDA model and watch its class-1 probability. Trace the value as you sweep left-to-right: it follows the smooth S-curve of a sigmoid, hitting one-half exactly on the straight decision boundary. The generative blobs and the discriminative sigmoid are the same model, seen two ways.
We now have two classifiers that produce the same shape of answer but make very different bets. Which should you reach for? The answer is one of the most important practical lessons in machine learning, and it's all about how much you assume.
If the data truly is two shared-covariance Gaussians, GDA is making exactly the right assumption — and it exploits that to learn from less data. It's asymptotically efficient: in the limit, no algorithm estimates the class probabilities better. With strong correct assumptions, you need fewer examples, because you're not asking the data to discover the shape — you've told it the shape. For small datasets where you're confident the classes are roughly Gaussian, GDA is the better bet.
But fit Gaussians to data that isn't Gaussian — skewed, multi-modal, heavy-tailed — and GDA is confidently wrong. Logistic regression, by committing only to a linear boundary and nothing about the class shapes, is far more robust. It doesn't care whether each class is Gaussian, Poisson, or some weird blob — as long as a line can separate them reasonably, it finds a good one. With large datasets and uncertain assumptions, logistic regression is the safer, and usually better, choice. This is why, in practice, logistic regression is used far more often than GDA.
Toggle the data between Gaussian (clean blobs) and non-Gaussian (skewed/curved clusters). Watch both boundaries: GDA (from the fitted blobs) and logistic (from the labels). On Gaussian data they nearly agree. On non-Gaussian data, GDA's blobs warp and its boundary drifts wrong, while logistic regression keeps separating the classes sensibly.
| GDA (generative) | Logistic (discriminative) | |
|---|---|---|
| Assumes | Each class is Gaussian (shared Σ) | Only that the boundary is linear |
| Fitting | Closed form: count & average | Iterative: gradient ascent |
| If assumption right | Best — data-efficient | Good, needs more data |
| If assumption wrong | Confidently misled | Robust — still works |
| Best when | Small data, classes ~Gaussian | Large data, unsure of shape |
GDA handled continuous features with Gaussians. But the most famous generative classifier tackles a different beast: discrete features, especially text. Meet Naive Bayes, the algorithm that filtered spam for the entire internet through the 2000s. Let's build a spam filter and discover why it needs to be "naive."
We represent an email as a long list of yes/no features — one per word in our vocabulary. Feature j is 1 if word j appears in the email, 0 if not. With a vocabulary of 50,000 words, each email is a 50,000-long vector of zeros and ones. To classify generatively, we need to model how spam generates these word-vectors, and how non-spam does — the class-conditionals from Chapter 1.
Here's the catastrophe. A full model of "which combinations of 50,000 words appear together" would need to assign a probability to every possible word-vector — and there are two-to-the-fifty-thousand of them. That's a number with over 15,000 digits. More parameters than atoms in the universe, by an absurd margin. We cannot possibly estimate that from any dataset. The honest, correlation-aware model is computationally impossible.
So we make a wildly strong simplifying assumption, the Naive Bayes assumption: given the class, the words are independent of each other. Knowing an email is spam, the presence of "buy" tells you nothing about whether "price" also appears. This is plainly false — spam words travel in packs — which is exactly why it's called "naive." But it collapses the impossible model into a trivial one: instead of modeling all word combinations, we just model each word separately. The probability of a whole email given its class becomes the product of each word's individual probability given the class.
Now each parameter is dead simple to estimate, by counting: the probability of word j appearing in spam is just the fraction of spam emails that contain word j. Count the spam emails with "viagra," divide by the total spam emails. Do this for every word, for both classes, plus the overall spam rate. That's the entire training procedure — a few counts. To classify a new email, multiply the relevant per-word probabilities together with the class prior (Bayes' rule from Chapter 1) and compare spam vs. non-spam.
Let's build the actual thing. Below is a working Naive Bayes spam filter, trained on a tiny corpus of emails. Toggle which words appear in an incoming email, and watch the classifier compute the probability of spam in real time — multiplying each word's evidence by the prior, exactly the Bayes' rule recipe from Chapter 1. You can see each word vote.
Click words to add/remove them from the incoming email. The gauge shows P(spam | words); the bars show how each present word votes (orange = toward spam, teal = toward ham). All probabilities were learned by counting word frequencies in a labeled corpus.
That gauge is the entire algorithm: a prior (the base spam rate), times a product of per-word likelihoods (each learned by counting), normalized by Bayes' rule. No gradient descent, no iterations, no tuning. It trains in one pass over the data and classifies in microseconds. For two decades, a version of exactly this guarded billions of inboxes.
(No quiz — the lab is the test. If you can predict which way the gauge moves before toggling a word, you understand Naive Bayes.)
Our spam filter has a hidden landmine, and it's worth understanding because it teaches a deep statistical lesson. Suppose a brand-new word appears in an incoming email — say "neurips" — that never showed up in any training email, spam or ham. What probability does our counting procedure assign to it?
Zero. The fraction of spam emails containing "neurips" is zero (none did), and the fraction of ham emails is also zero. And here's the disaster: because we multiply all the per-word probabilities together, a single zero anywhere in the product makes the entire result zero — for both classes. The classifier computes zero divided by zero and throws up its hands. One unseen word and the whole prediction collapses.
The cure is delightfully simple, called Laplace smoothing (or add-one smoothing). Before counting, imagine you'd already seen every word once in every class. Concretely: add 1 to every word's count, and adjust the denominator to keep things a valid probability. Now nothing has probability zero — every word gets a tiny, non-zero share, just in case. The data still dominates for words you've seen plenty of times; smoothing only matters for the rare and unseen, gently nudging their zero up to "small but possible."
A word appears in 0 of the spam emails. Slide its count and toggle Laplace smoothing. Without smoothing, a count of zero gives a probability of exactly zero (which zeros out the whole product, shown by the broken gauge). With smoothing, even an unseen word gets a small positive probability, and the classifier keeps working.
You've now seen both halves of classification. Discriminative models (logistic regression) draw boundaries; generative models (GDA, Naive Bayes) model each class and flip with Bayes. Knowing both — and when each shines — is what makes you fluent rather than just functional.
| Idea | What it says |
|---|---|
| Discriminative | Model P(label | features) directly — learn the boundary (logistic regression). |
| Generative | Model P(features | class) and P(class), then use Bayes' rule to get P(class | features). |
| Bayes' rule | posterior ∝ likelihood × prior. Ignore the denominator when picking a winner. |
| Multivariate Gaussian | Mean vector (center) + covariance matrix (shape, including feature correlations). |
| GDA | Each class is a shared-covariance Gaussian. Fit by counting & averaging. Boundary is linear; posterior is a sigmoid. |
| GDA vs logistic | GDA assumes more (Gaussian) → data-efficient if right, misled if wrong. Logistic is robust. |
| Naive Bayes | Discrete features (words). Assume conditional independence given class → product of per-word probabilities. Fit by counting. |
| Laplace smoothing | Add 1 to every count so no event has probability zero. Prevents one unseen word from zeroing the product. |
python import numpy as np def train_naive_bayes(X, y): # X: n×d binary word-presence matrix # Laplace-smoothed P(word j = 1 | class), and class prior phi_spam = (X[y==1].sum(0) + 1) / (np.sum(y==1) + 2) phi_ham = (X[y==0].sum(0) + 1) / (np.sum(y==0) + 2) prior = y.mean() return phi_spam, phi_ham, prior def predict(x, phi_spam, phi_ham, prior): # work in log-space: products of many small probabilities underflow to 0 ls = np.log(prior) + (x*np.log(phi_spam) + (1-x)*np.log(1-phi_spam)).sum() lh = np.log(1-prior) + (x*np.log(phi_ham) + (1-x)*np.log(1-phi_ham)).sum() return 1 if ls > lh else 0 # sklearn: GaussianNB for GDA-like continuous, MultinomialNB/BernoulliNB for text from sklearn.naive_bayes import BernoulliNB BernoulliNB().fit(X, y) # Laplace smoothing built in (alpha=1)
np.log trick everywhere probabilities get multiplied in bulk."It is the mark of a truly intelligent person to be moved by statistics." — George Bernard Shaw. Naive Bayes is statistics moving billions of emails into the right folder, one humble count at a time.