Classical ML · CS229

Generative Learning: GDA & Naive Bayes

Instead of drawing a line between the classes, model what each class looks like — then ask which one most likely produced your data. The other half of classification, and the engine behind the spam filter that defined an era.

Prerequisites: Logistic Regression + basic probability. We build Bayes from scratch.
10
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Two Ways to Tell an Elephant from a Dog

Suppose you're building a classifier to tell elephants from dogs, from a few measurements like height and weight. Everything you've learned so far — logistic regression, the perceptron — takes one approach: look at all the animals at once and find the dividing line that best separates elephants from dogs. To classify a new animal, check which side of the line it falls on. This is discriminative learning: it learns the boundary directly.

But there's a completely different way to think about it — arguably a more human one. First, study only the elephants and build a mental picture of what a typical elephant looks like: big, heavy, this range of proportions. Then, separately, study the dogs and build a picture of a typical dog. To classify a new animal, hold it up against both pictures and ask: does this look more like my elephant model, or my dog model? You never drew a boundary at all. This is generative learning: it learns what each class looks like.

The two philosophies in one line. A discriminative model learns the probability of the label given the features — it draws the boundary. A generative model learns how each class generates its features — it builds a model of each class, then uses Bayes' rule to flip that around into a prediction. Logistic regression is discriminative. The two algorithms in this lesson — GDA and Naive Bayes — are generative.

Why bother with a second philosophy? Three reasons we'll make concrete. Data efficiency: if you have strong, correct assumptions about what each class looks like, the generative approach can learn from far fewer examples. Insight: a model of each class can generate new synthetic examples and tells you why something was classified the way it was. And speed: the most famous generative classifier, Naive Bayes, is so simple and fast it ran the world's spam filters for two decades and is still the textbook "first thing to try."

Boundary vs. blueprints: the same data, two views

Two classes of animals (blue dogs, orange elephants). Toggle between the discriminative view — a single dividing line — and the generative view — a separate blob (a fitted Gaussian) modeling each class. Both classify the same way in the end, but they think about the problem completely differently.

Common misconception: "Generative models are obviously better because they understand the data more deeply." Not always — 'understanding' each class means making stronger assumptions about its shape, and if those assumptions are wrong, the model is confidently misled. As we'll see in Chapter 5, discriminative logistic regression is more robust precisely because it assumes less. Stronger assumptions are a double-edged sword: a huge advantage when right, a liability when wrong.
What fundamentally distinguishes a generative classifier (like GDA) from a discriminative one (like logistic regression)?

Chapter 1: Bayes' Rule — Flipping the Question Around

The generative approach builds a model of each class — but our actual goal is to predict the class given the features. We need a bridge from "what does each class look like" to "which class is this." That bridge is Bayes' rule, one of the most important formulas in all of statistics.

A generative model gives us two ingredients. First, the class prior — how common each class is before we look at any features. (If 30% of animals in our world are elephants, the prior for elephant is 0.3.) Second, the class-conditional — how likely the features are given a class. (How likely is "height 3 metres, weight 5 tonnes" if the animal is an elephant? Very. If it's a dog? Essentially zero.)

Bayes' rule combines these into what we actually want — the probability of the class given the features:

P(class | features) = P(features | class) · P(class) / P(features)

Read it as a recipe. To find how probable "elephant" is given the measurements, multiply how well elephants explain these measurements (the class-conditional) by how common elephants are (the prior), then normalize. Do the same for "dog," and pick whichever comes out larger. The numerator is "evidence for this class times how often we expect it"; the denominator just rescales so the probabilities sum to one.

The shortcut that makes this practical. To decide between classes, we compare their scores — and the denominator P(features) is the same for every class (it doesn't mention the class at all). So it can't change which class wins. We can ignore it entirely and just compare the numerators: P(features | class) × P(class). Classifying becomes "for each class, multiply the class-conditional by the prior, and take the biggest." No expensive normalization needed to pick a winner.
Bayes' rule in action: prior × likelihood

A new animal's measurement gives a likelihood under each class model (how well each class explains it). Slide the class prior for elephants and watch the posterior (the final probability) shift. Even a strong likelihood can be overruled by a strong prior — that's Bayes balancing evidence against expectation.

prior P(elephant)0.30
likelihood ratio3.0
posterior P(elephant | features) =
Common misconception: "The class with the higher likelihood P(features|class) always wins." Only if the priors are equal. A class that explains the data twice as well but is ten times rarer will usually lose — Bayes multiplies likelihood by prior. This is why a positive result on a test for a rare disease can still mean you're probably fine: the tiny prior outweighs the test's evidence. Generative classifiers get this balance right automatically; forgetting the prior is the classic Bayesian blunder.
When using Bayes' rule to decide which class an example belongs to, why can we ignore the denominator P(features)?

Chapter 2: The Multivariate Gaussian — A Blob in Many Dimensions

To build a generative model for continuous features (height, weight), we need a way to describe "what a typical elephant looks like" as a probability distribution over those features. The natural tool is the bell curve — but generalized from one dimension to many. That's the multivariate Gaussian, and it's the heart of the GDA model in the next chapter.

In one dimension, a Gaussian needs two numbers: a mean (where the bump is centered) and a variance (how wide it is). In two or more dimensions, those generalize:

The result is an elliptical hill of probability: tallest at the mean, fading outward, stretched and tilted by the covariance. Round blob = features are independent and equally spread. Tilted, stretched ellipse = features are correlated. Get a feel for it:

Shaping a 2D Gaussian

The contour rings are a 2D Gaussian's "blueprint" — a top-down view of the probability hill. Slide the spread to make it bigger or smaller, and the correlation to tilt it. Positive correlation stretches it along the diagonal (tall→heavy); negative tilts it the other way. This single shape is how GDA will model an entire class.

spread1.0
correlation0.0
Covariance shape:
Common misconception: "The covariance matrix just stores how spread out each feature is." It stores more: the off-diagonal entries encode how features move together. Ignore them (force the blob to stay axis-aligned) and you lose the ability to represent "tall elephants are also heavy." That correlation is often the most informative thing about a class — and, foreshadowing, the Naive Bayes algorithm later will deliberately throw it away for the sake of speed, which is exactly why it's called "naive."
In a multivariate Gaussian, what do the off-diagonal entries of the covariance matrix represent?

Chapter 3: Gaussian Discriminant Analysis

Now we assemble the pieces into our first generative classifier: Gaussian Discriminant Analysis, or GDA. The idea is exactly the elephant/dog blueprint from Chapter 0, made precise. We model each class as a multivariate Gaussian blob, and we use Bayes' rule to classify. Three assumptions define the model:

That last detail — both classes share one covariance — matters enormously, and we'll see why in the next chapter. For now: two blobs of the same shape and orientation, sitting at two different locations.

Fitting it is just counting and averaging

Here's the delight of GDA: unlike logistic regression, there's no iteration, no gradient descent, no learning rate. The maximum-likelihood fit is a closed-form recipe you could do by hand:

That's it. Count, average, average, measure-spread. A few passes over the data and you're done — no optimization loop at all. This is part of why generative models can be so data-efficient: each parameter is a simple statistic with a clear meaning.

GDA: fit a blob to each class

Drag any point to move it; press Fit to recompute. GDA places a Gaussian blob (contours) over each class — same shape, two centers — and the white line is the resulting decision boundary where the two classes are equally probable. Watch the blobs and the boundary update as you reshape the data.

accuracy:
Common misconception: "Each class should get its own covariance shape — forcing them to share one is a hack." Sharing the covariance is a deliberate modeling choice with a beautiful payoff: it makes the decision boundary perfectly straight (next chapter). If you let each class have its own covariance, you get a related model (Quadratic Discriminant Analysis) whose boundary is a curve — more flexible, but needing more data and losing the clean link to logistic regression. Shared covariance is the elegant default.
How are the parameters of a GDA model fit to the training data?

Chapter 4: The Sigmoid Returns — GDA's Hidden Link to Logistic Regression

Here's a result that should genuinely surprise you. We built GDA from a completely different philosophy than logistic regression — modeling each class as a Gaussian rather than drawing a boundary. Yet if you take a trained GDA model and ask it the discriminative question — "what's the probability of class 1 given the features?" — and work through the Bayes' rule algebra, out pops:

P(y = 1 | x) = 1 / (1 + e−θTx)

The sigmoid. Exactly the logistic regression form, with some weight vector θ that's a specific function of GDA's means, prior, and covariance. The reason traces back to the shared covariance assumption: because both Gaussians have the same shape, the curved parts of their densities cancel when you take the ratio, leaving a linear function of x inside the exponential — and that's precisely what makes the boundary a straight line and the posterior a sigmoid.

Two roads, one destination — sometimes. Start from "each class is a shared-covariance Gaussian" (generative) and you arrive at a sigmoid posterior with a linear boundary. Start from "the log-odds are linear in x" (discriminative logistic regression) and you also get a sigmoid with a linear boundary. The same prediction shape, reached from opposite directions. But — and this is the crucial asymmetry — the implication only runs one way.

That one-way street is the whole story of when to use which. GDA's Gaussian assumption implies a logistic posterior — if the data really is two shared-covariance Gaussians, the boundary is logistic. But the reverse is false: data with a logistic boundary need not be Gaussian. Many different generative stories (Gaussian, Poisson, and more) all lead to the same logistic posterior. So logistic regression assumes less — it commits only to the boundary shape, staying agnostic about the actual distribution of each class. GDA assumes more — it commits to the full Gaussian shape of each class. That difference in how much they assume is exactly the trade-off we unpack next.

GDA's posterior is a sigmoid

Move the query point across the GDA model and watch its class-1 probability. Trace the value as you sweep left-to-right: it follows the smooth S-curve of a sigmoid, hitting one-half exactly on the straight decision boundary. The generative blobs and the discriminative sigmoid are the same model, seen two ways.

query position0.0
P(class 1 | query) =
Common misconception: "Since GDA gives a sigmoid too, GDA and logistic regression always produce the same boundary." They give the same functional form (sigmoid, linear boundary) but generally different boundaries on the same data, because they fit θ differently — GDA via class statistics, logistic regression via likelihood on the labels directly. Same shape of answer, different specific answer. Which one's better depends entirely on whether the Gaussian assumption holds.
When you compute P(y=1 | x) from a trained GDA model, what form does it take, and what's the key asymmetry with logistic regression?

Chapter 5: GDA vs. Logistic — The Assumption Trade-off

We now have two classifiers that produce the same shape of answer but make very different bets. Which should you reach for? The answer is one of the most important practical lessons in machine learning, and it's all about how much you assume.

When the Gaussian assumption is correct: GDA wins

If the data truly is two shared-covariance Gaussians, GDA is making exactly the right assumption — and it exploits that to learn from less data. It's asymptotically efficient: in the limit, no algorithm estimates the class probabilities better. With strong correct assumptions, you need fewer examples, because you're not asking the data to discover the shape — you've told it the shape. For small datasets where you're confident the classes are roughly Gaussian, GDA is the better bet.

When the assumption is wrong: logistic regression wins

But fit Gaussians to data that isn't Gaussian — skewed, multi-modal, heavy-tailed — and GDA is confidently wrong. Logistic regression, by committing only to a linear boundary and nothing about the class shapes, is far more robust. It doesn't care whether each class is Gaussian, Poisson, or some weird blob — as long as a line can separate them reasonably, it finds a good one. With large datasets and uncertain assumptions, logistic regression is the safer, and usually better, choice. This is why, in practice, logistic regression is used far more often than GDA.

Right assumption vs. wrong assumption

Toggle the data between Gaussian (clean blobs) and non-Gaussian (skewed/curved clusters). Watch both boundaries: GDA (from the fitted blobs) and logistic (from the labels). On Gaussian data they nearly agree. On non-Gaussian data, GDA's blobs warp and its boundary drifts wrong, while logistic regression keeps separating the classes sensibly.

GDA acc · logistic acc
GDA (generative)Logistic (discriminative)
AssumesEach class is Gaussian (shared Σ)Only that the boundary is linear
FittingClosed form: count & averageIterative: gradient ascent
If assumption rightBest — data-efficientGood, needs more data
If assumption wrongConfidently misledRobust — still works
Best whenSmall data, classes ~GaussianLarge data, unsure of shape
Common misconception: "More assumptions are always bad — you should use the most flexible model." Assumptions aren't inherently bad; correct assumptions are pure advantage, letting you learn from less data. The danger is wrong assumptions. The skill is matching your model's assumptions to what you actually know about the problem — strong assumptions when you're confident, weak ones when you're not. That judgment, not raw flexibility, is what separates good practitioners.
You have a small dataset and you're confident the two classes are roughly Gaussian blobs. Which classifier is the better bet, and why?

Chapter 6: Naive Bayes — Generative Learning for Words

GDA handled continuous features with Gaussians. But the most famous generative classifier tackles a different beast: discrete features, especially text. Meet Naive Bayes, the algorithm that filtered spam for the entire internet through the 2000s. Let's build a spam filter and discover why it needs to be "naive."

We represent an email as a long list of yes/no features — one per word in our vocabulary. Feature j is 1 if word j appears in the email, 0 if not. With a vocabulary of 50,000 words, each email is a 50,000-long vector of zeros and ones. To classify generatively, we need to model how spam generates these word-vectors, and how non-spam does — the class-conditionals from Chapter 1.

The combinatorial wall

Here's the catastrophe. A full model of "which combinations of 50,000 words appear together" would need to assign a probability to every possible word-vector — and there are two-to-the-fifty-thousand of them. That's a number with over 15,000 digits. More parameters than atoms in the universe, by an absurd margin. We cannot possibly estimate that from any dataset. The honest, correlation-aware model is computationally impossible.

The naive assumption that rescues everything

So we make a wildly strong simplifying assumption, the Naive Bayes assumption: given the class, the words are independent of each other. Knowing an email is spam, the presence of "buy" tells you nothing about whether "price" also appears. This is plainly false — spam words travel in packs — which is exactly why it's called "naive." But it collapses the impossible model into a trivial one: instead of modeling all word combinations, we just model each word separately. The probability of a whole email given its class becomes the product of each word's individual probability given the class.

Now each parameter is dead simple to estimate, by counting: the probability of word j appearing in spam is just the fraction of spam emails that contain word j. Count the spam emails with "viagra," divide by the total spam emails. Do this for every word, for both classes, plus the overall spam rate. That's the entire training procedure — a few counts. To classify a new email, multiply the relevant per-word probabilities together with the class prior (Bayes' rule from Chapter 1) and compare spam vs. non-spam.

"Naive" is a feature, not just a bug. The independence assumption is false, yet Naive Bayes works remarkably well in practice. Why? Because for classification we only need to get the winner right, not the exact probabilities. Even with miscalibrated probabilities from the false assumption, the spam class usually still comes out ahead for spam emails. Trading correctness-of-probability for simplicity-and-speed is often a brilliant deal — Naive Bayes is the patron saint of "good enough, instantly."
Common misconception: "The Naive Bayes assumption says all the words are independent, period." It says they're conditionally independent — independent given the class label. That's a much weaker, more reasonable claim. "Buy" and "price" are correlated overall (both appear in commercial emails), but the assumption only requires that once you know an email is spam, they provide independent evidence. Still not true, but far less crazy than claiming raw independence.
What does the "naive" assumption in Naive Bayes actually assume, and why is it necessary?

Chapter 7: The Spam Filter Lab

Let's build the actual thing. Below is a working Naive Bayes spam filter, trained on a tiny corpus of emails. Toggle which words appear in an incoming email, and watch the classifier compute the probability of spam in real time — multiplying each word's evidence by the prior, exactly the Bayes' rule recipe from Chapter 1. You can see each word vote.

What to watch:
  • Toggle spammy words (free, money, winner) on — the spam probability shoots up, each word adding its evidence.
  • Toggle hammy words (meeting, report, lunch) on — the probability drops back toward "not spam."
  • Mix them — watch the classifier weigh competing evidence, the product of probabilities tilting toward whichever class the words favor.
  • Notice each word's contribution bar: how strongly that single word pushes toward spam or ham, learned purely from counting.
Naive Bayes Spam Filter — toggle the words

Click words to add/remove them from the incoming email. The gauge shows P(spam | words); the bars show how each present word votes (orange = toward spam, teal = toward ham). All probabilities were learned by counting word frequencies in a labeled corpus.

P(spam | this email) =

That gauge is the entire algorithm: a prior (the base spam rate), times a product of per-word likelihoods (each learned by counting), normalized by Bayes' rule. No gradient descent, no iterations, no tuning. It trains in one pass over the data and classifies in microseconds. For two decades, a version of exactly this guarded billions of inboxes.

(No quiz — the lab is the test. If you can predict which way the gauge moves before toggling a word, you understand Naive Bayes.)

Chapter 8: Laplace Smoothing — The Zero That Breaks Everything

Our spam filter has a hidden landmine, and it's worth understanding because it teaches a deep statistical lesson. Suppose a brand-new word appears in an incoming email — say "neurips" — that never showed up in any training email, spam or ham. What probability does our counting procedure assign to it?

Zero. The fraction of spam emails containing "neurips" is zero (none did), and the fraction of ham emails is also zero. And here's the disaster: because we multiply all the per-word probabilities together, a single zero anywhere in the product makes the entire result zero — for both classes. The classifier computes zero divided by zero and throws up its hands. One unseen word and the whole prediction collapses.

The deeper lesson: it is statistically reckless to declare an event impossible (probability exactly zero) just because you haven't seen it yet in a finite sample. Your training set is a tiny window on the world; absence of evidence isn't evidence of absence. Assigning a hard zero to "never observed" is overconfidence of the worst kind — and the multiplication ruthlessly punishes it.

The fix: pretend you saw everything once

The cure is delightfully simple, called Laplace smoothing (or add-one smoothing). Before counting, imagine you'd already seen every word once in every class. Concretely: add 1 to every word's count, and adjust the denominator to keep things a valid probability. Now nothing has probability zero — every word gets a tiny, non-zero share, just in case. The data still dominates for words you've seen plenty of times; smoothing only matters for the rare and unseen, gently nudging their zero up to "small but possible."

A zero probability, before and after smoothing

A word appears in 0 of the spam emails. Slide its count and toggle Laplace smoothing. Without smoothing, a count of zero gives a probability of exactly zero (which zeros out the whole product, shown by the broken gauge). With smoothing, even an unseen word gets a small positive probability, and the classifier keeps working.

word count in spam0
estimated probability =
Common misconception: "Laplace smoothing is a hack to avoid a divide-by-zero error." It's principled, not a kludge — it's the optimal estimator under a reasonable prior belief that all outcomes are a priori possible. It encodes genuine humility: "I haven't seen this, but I won't bet my life it can never happen." Almost every real text classifier uses it, and it's the simplest example of the Bayesian idea of a prior regularizing your estimates away from overconfident extremes.
Why does an unseen word break an un-smoothed Naive Bayes classifier, and how does Laplace smoothing fix it?

Chapter 9: Connections & Cheat Sheet

You've now seen both halves of classification. Discriminative models (logistic regression) draw boundaries; generative models (GDA, Naive Bayes) model each class and flip with Bayes. Knowing both — and when each shines — is what makes you fluent rather than just functional.

The whole lesson on one page

IdeaWhat it says
DiscriminativeModel P(label | features) directly — learn the boundary (logistic regression).
GenerativeModel P(features | class) and P(class), then use Bayes' rule to get P(class | features).
Bayes' ruleposterior ∝ likelihood × prior. Ignore the denominator when picking a winner.
Multivariate GaussianMean vector (center) + covariance matrix (shape, including feature correlations).
GDAEach class is a shared-covariance Gaussian. Fit by counting & averaging. Boundary is linear; posterior is a sigmoid.
GDA vs logisticGDA assumes more (Gaussian) → data-efficient if right, misled if wrong. Logistic is robust.
Naive BayesDiscrete features (words). Assume conditional independence given class → product of per-word probabilities. Fit by counting.
Laplace smoothingAdd 1 to every count so no event has probability zero. Prevents one unseen word from zeroing the product.

Naive Bayes from scratch

python
import numpy as np
def train_naive_bayes(X, y):     # X: n×d binary word-presence matrix
    # Laplace-smoothed P(word j = 1 | class), and class prior
    phi_spam = (X[y==1].sum(0) + 1) / (np.sum(y==1) + 2)
    phi_ham  = (X[y==0].sum(0) + 1) / (np.sum(y==0) + 2)
    prior = y.mean()
    return phi_spam, phi_ham, prior

def predict(x, phi_spam, phi_ham, prior):
    # work in log-space: products of many small probabilities underflow to 0
    ls = np.log(prior)   + (x*np.log(phi_spam) + (1-x)*np.log(1-phi_spam)).sum()
    lh = np.log(1-prior) + (x*np.log(phi_ham)  + (1-x)*np.log(1-phi_ham)).sum()
    return 1 if ls > lh else 0

# sklearn: GaussianNB for GDA-like continuous, MultinomialNB/BernoulliNB for text
from sklearn.naive_bayes import BernoulliNB
BernoulliNB().fit(X, y)   # Laplace smoothing built in (alpha=1)
One detail that matters in practice: multiplying thousands of small probabilities underflows to zero on a real computer. So real Naive Bayes works in log-space — adding logarithms instead of multiplying probabilities. Same math, no underflow. You'll see this np.log trick everywhere probabilities get multiplied in bulk.

Where to go next

You can now teach this. Generative models learn what each class looks like and flip with Bayes' rule. GDA fits a shared-covariance Gaussian to each class by counting and averaging, and — surprise — its posterior is a sigmoid, linking it to logistic regression, though it assumes more. Naive Bayes assumes words are conditionally independent given the class, turning an impossible model into a product of counts, and Laplace smoothing keeps one unseen word from breaking everything. The other half of classification, mastered.

"It is the mark of a truly intelligent person to be moved by statistics." — George Bernard Shaw. Naive Bayes is statistics moving billions of emails into the right folder, one humble count at a time.