Generative Learning: GDA & Naive Bayes — From Absolute Zero to Mastery

Chapter 0: Two Ways to Tell an Elephant from a Dog

Suppose you're building a classifier to tell elephants from dogs, from a few measurements like height and weight. Everything you've learned so far — logistic regression, the perceptron — takes one approach: look at all the animals at once and find the dividing line that best separates elephants from dogs. To classify a new animal, check which side of the line it falls on. This is discriminative learning: it learns the boundary directly.

But there's a completely different way to think about it — arguably a more human one. First, study only the elephants and build a mental picture of what a typical elephant looks like: big, heavy, this range of proportions. Then, separately, study the dogs and build a picture of a typical dog. To classify a new animal, hold it up against both pictures and ask: does this look more like my elephant model, or my dog model? You never drew a boundary at all. This is generative learning: it learns what each class looks like.

The two philosophies in one line. A discriminative model learns the probability of the label given the features — it draws the boundary. A generative model learns how each class generates its features — it builds a model of each class, then uses Bayes' rule to flip that around into a prediction. Logistic regression is discriminative. The two algorithms in this lesson — GDA and Naive Bayes — are generative.

Why bother with a second philosophy? Three reasons we'll make concrete. Data efficiency: if you have strong, correct assumptions about what each class looks like, the generative approach can learn from far fewer examples. Insight: a model of each class can generate new synthetic examples and tells you why something was classified the way it was. And speed: the most famous generative classifier, Naive Bayes, is so simple and fast it ran the world's spam filters for two decades and is still the textbook "first thing to try."

Boundary vs. blueprints: the same data, two views

Two classes of animals (blue dogs, orange elephants). Toggle between the discriminative view — a single dividing line — and the generative view — a separate blob (a fitted Gaussian) modeling each class. Both classify the same way in the end, but they think about the problem completely differently.

Common misconception: "Generative models are obviously better because they understand the data more deeply." Not always — 'understanding' each class means making stronger assumptions about its shape, and if those assumptions are wrong, the model is confidently misled. As we'll see in Chapter 5, discriminative logistic regression is more robust precisely because it assumes less. Stronger assumptions are a double-edged sword: a huge advantage when right, a liability when wrong.

What fundamentally distinguishes a generative classifier (like GDA) from a discriminative one (like logistic regression)?

Generative models are always more accurate. Generative models don't use any probability. A discriminative model learns the boundary (label given features) directly; a generative model learns what each class looks like (features given label), then uses Bayes' rule to classify. Generative models can only handle two classes.

Chapter 1: Bayes' Rule — Flipping the Question Around

The generative approach builds a model of each class — but our actual goal is to predict the class given the features. We need a bridge from "what does each class look like" to "which class is this." That bridge is Bayes' rule, one of the most important formulas in all of statistics.

A generative model gives us two ingredients. First, the class prior — how common each class is before we look at any features. (If 30% of animals in our world are elephants, the prior for elephant is 0.3.) Second, the class-conditional — how likely the features are given a class. (How likely is "height 3 metres, weight 5 tonnes" if the animal is an elephant? Very. If it's a dog? Essentially zero.)

Bayes' rule combines these into what we actually want — the probability of the class given the features:

P(class | features) = P(features | class) · P(class) / P(features)

Read it as a recipe. To find how probable "elephant" is given the measurements, multiply how well elephants explain these measurements (the class-conditional) by how common elephants are (the prior), then normalize. Do the same for "dog," and pick whichever comes out larger. The numerator is "evidence for this class times how often we expect it"; the denominator just rescales so the probabilities sum to one.

The shortcut that makes this practical. To decide between classes, we compare their scores — and the denominator P(features) is the same for every class (it doesn't mention the class at all). So it can't change which class wins. We can ignore it entirely and just compare the numerators: P(features | class) × P(class). Classifying becomes "for each class, multiply the class-conditional by the prior, and take the biggest." No expensive normalization needed to pick a winner.

Bayes' rule in action: prior × likelihood

A new animal's measurement gives a likelihood under each class model (how well each class explains it). Slide the class prior for elephants and watch the posterior (the final probability) shift. Even a strong likelihood can be overruled by a strong prior — that's Bayes balancing evidence against expectation.

prior P(elephant)0.30

likelihood ratio3.0

posterior P(elephant | features) = —

Common misconception: "The class with the higher likelihood P(features|class) always wins." Only if the priors are equal. A class that explains the data twice as well but is ten times rarer will usually lose — Bayes multiplies likelihood by prior. This is why a positive result on a test for a rare disease can still mean you're probably fine: the tiny prior outweighs the test's evidence. Generative classifiers get this balance right automatically; forgetting the prior is the classic Bayesian blunder.

When using Bayes' rule to decide which class an example belongs to, why can we ignore the denominator P(features)?

Because it's always equal to 1. Because it's the same for every class, so it can't change which class has the larger score — we just compare P(features|class)×P(class). Because the denominator is always zero. Because the prior cancels it out.

Chapter 2: The Multivariate Gaussian — A Blob in Many Dimensions

To build a generative model for continuous features (height, weight), we need a way to describe "what a typical elephant looks like" as a probability distribution over those features. The natural tool is the bell curve — but generalized from one dimension to many. That's the multivariate Gaussian, and it's the heart of the GDA model in the next chapter.

In one dimension, a Gaussian needs two numbers: a mean (where the bump is centered) and a variance (how wide it is). In two or more dimensions, those generalize:

The mean vector replaces the single mean — it's the center of the blob, one coordinate per feature. "The average elephant is 3m tall and 5 tonnes."
The covariance matrix replaces the single variance — it describes the blob's shape: how wide it is in each direction, and how the features tilt together. Its diagonal entries are the spread along each axis; its off-diagonal entries capture correlation — whether tall elephants also tend to be heavy (positive correlation tilts the blob diagonally).

The result is an elliptical hill of probability: tallest at the mean, fading outward, stretched and tilted by the covariance. Round blob = features are independent and equally spread. Tilted, stretched ellipse = features are correlated. Get a feel for it:

Shaping a 2D Gaussian

The contour rings are a 2D Gaussian's "blueprint" — a top-down view of the probability hill. Slide the spread to make it bigger or smaller, and the correlation to tilt it. Positive correlation stretches it along the diagonal (tall→heavy); negative tilts it the other way. This single shape is how GDA will model an entire class.

spread1.0

correlation0.0

Covariance shape: —

Common misconception: "The covariance matrix just stores how spread out each feature is." It stores more: the off-diagonal entries encode how features move together. Ignore them (force the blob to stay axis-aligned) and you lose the ability to represent "tall elephants are also heavy." That correlation is often the most informative thing about a class — and, foreshadowing, the Naive Bayes algorithm later will deliberately throw it away for the sake of speed, which is exactly why it's called "naive."

In a multivariate Gaussian, what do the off-diagonal entries of the covariance matrix represent?

The mean of each feature. The number of data points. The correlation between features — how they vary together, which tilts and stretches the probability blob diagonally. Nothing; only the diagonal matters.

Chapter 3: Gaussian Discriminant Analysis

Now we assemble the pieces into our first generative classifier: Gaussian Discriminant Analysis, or GDA. The idea is exactly the elephant/dog blueprint from Chapter 0, made precise. We model each class as a multivariate Gaussian blob, and we use Bayes' rule to classify. Three assumptions define the model:

The label y is a coin flip with some bias — a Bernoulli with prior probability φ of being class 1. (This is the class prior.)
The features of class 0 come from a Gaussian centered at mean μ₀.
The features of class 1 come from a Gaussian centered at a different mean μ₁ — but, crucially, with the same covariance matrix Σ as class 0.

That last detail — both classes share one covariance — matters enormously, and we'll see why in the next chapter. For now: two blobs of the same shape and orientation, sitting at two different locations.

Fitting it is just counting and averaging

Here's the delight of GDA: unlike logistic regression, there's no iteration, no gradient descent, no learning rate. The maximum-likelihood fit is a closed-form recipe you could do by hand:

The prior φ is just the fraction of training examples that are class 1. Count them, divide by the total.
Each mean μ is just the average of the feature vectors within that class. Add up all the elephants' measurements, divide by the number of elephants.
The shared covariance Σ is the average "spread" of points around their own class mean, pooled across both classes.

That's it. Count, average, average, measure-spread. A few passes over the data and you're done — no optimization loop at all. This is part of why generative models can be so data-efficient: each parameter is a simple statistic with a clear meaning.

GDA: fit a blob to each class

Drag any point to move it; press Fit to recompute. GDA places a Gaussian blob (contours) over each class — same shape, two centers — and the white line is the resulting decision boundary where the two classes are equally probable. Watch the blobs and the boundary update as you reshape the data.

accuracy: —

Common misconception: "Each class should get its own covariance shape — forcing them to share one is a hack." Sharing the covariance is a deliberate modeling choice with a beautiful payoff: it makes the decision boundary perfectly straight (next chapter). If you let each class have its own covariance, you get a related model (Quadratic Discriminant Analysis) whose boundary is a curve — more flexible, but needing more data and losing the clean link to logistic regression. Shared covariance is the elegant default.

How are the parameters of a GDA model fit to the training data?

By gradient descent with a tuned learning rate. By a closed-form recipe: the prior is the class fraction, each mean is the class average, and the covariance is the pooled spread — just counting and averaging, no iteration. By trying every possible boundary. By running logistic regression first.

Chapter 4: The Sigmoid Returns — GDA's Hidden Link to Logistic Regression

Here's a result that should genuinely surprise you. We built GDA from a completely different philosophy than logistic regression — modeling each class as a Gaussian rather than drawing a boundary. Yet if you take a trained GDA model and ask it the discriminative question — "what's the probability of class 1 given the features?" — and work through the Bayes' rule algebra, out pops:

P(y = 1 | x) = 1 / (1 + e^{−θ^Tx})

The sigmoid. Exactly the logistic regression form, with some weight vector θ that's a specific function of GDA's means, prior, and covariance. The reason traces back to the shared covariance assumption: because both Gaussians have the same shape, the curved parts of their densities cancel when you take the ratio, leaving a linear function of x inside the exponential — and that's precisely what makes the boundary a straight line and the posterior a sigmoid.

Two roads, one destination — sometimes. Start from "each class is a shared-covariance Gaussian" (generative) and you arrive at a sigmoid posterior with a linear boundary. Start from "the log-odds are linear in x" (discriminative logistic regression) and you also get a sigmoid with a linear boundary. The same prediction shape, reached from opposite directions. But — and this is the crucial asymmetry — the implication only runs one way.

That one-way street is the whole story of when to use which. GDA's Gaussian assumption implies a logistic posterior — if the data really is two shared-covariance Gaussians, the boundary is logistic. But the reverse is false: data with a logistic boundary need not be Gaussian. Many different generative stories (Gaussian, Poisson, and more) all lead to the same logistic posterior. So logistic regression assumes less — it commits only to the boundary shape, staying agnostic about the actual distribution of each class. GDA assumes more — it commits to the full Gaussian shape of each class. That difference in how much they assume is exactly the trade-off we unpack next.

GDA's posterior is a sigmoid

Move the query point across the GDA model and watch its class-1 probability. Trace the value as you sweep left-to-right: it follows the smooth S-curve of a sigmoid, hitting one-half exactly on the straight decision boundary. The generative blobs and the discriminative sigmoid are the same model, seen two ways.

query position0.0

P(class 1 | query) = —

Common misconception: "Since GDA gives a sigmoid too, GDA and logistic regression always produce the same boundary." They give the same functional form (sigmoid, linear boundary) but generally different boundaries on the same data, because they fit θ differently — GDA via class statistics, logistic regression via likelihood on the labels directly. Same shape of answer, different specific answer. Which one's better depends entirely on whether the Gaussian assumption holds.

When you compute P(y=1 | x) from a trained GDA model, what form does it take, and what's the key asymmetry with logistic regression?

A straight line; the two models are identical in every way. A sigmoid — GDA's Gaussian assumption implies a logistic posterior, but a logistic posterior does NOT imply Gaussian data, so GDA assumes more. A quadratic; GDA never relates to logistic regression. A uniform distribution.

Chapter 5: GDA vs. Logistic — The Assumption Trade-off

We now have two classifiers that produce the same shape of answer but make very different bets. Which should you reach for? The answer is one of the most important practical lessons in machine learning, and it's all about how much you assume.

When the Gaussian assumption is correct: GDA wins

If the data truly is two shared-covariance Gaussians, GDA is making exactly the right assumption — and it exploits that to learn from less data. It's asymptotically efficient: in the limit, no algorithm estimates the class probabilities better. With strong correct assumptions, you need fewer examples, because you're not asking the data to discover the shape — you've told it the shape. For small datasets where you're confident the classes are roughly Gaussian, GDA is the better bet.

When the assumption is wrong: logistic regression wins

But fit Gaussians to data that isn't Gaussian — skewed, multi-modal, heavy-tailed — and GDA is confidently wrong. Logistic regression, by committing only to a linear boundary and nothing about the class shapes, is far more robust. It doesn't care whether each class is Gaussian, Poisson, or some weird blob — as long as a line can separate them reasonably, it finds a good one. With large datasets and uncertain assumptions, logistic regression is the safer, and usually better, choice. This is why, in practice, logistic regression is used far more often than GDA.

Right assumption vs. wrong assumption

Toggle the data between Gaussian (clean blobs) and non-Gaussian (skewed/curved clusters). Watch both boundaries: GDA (from the fitted blobs) and logistic (from the labels). On Gaussian data they nearly agree. On non-Gaussian data, GDA's blobs warp and its boundary drifts wrong, while logistic regression keeps separating the classes sensibly.

GDA acc — · logistic acc —

	GDA (generative)	Logistic (discriminative)
Assumes	Each class is Gaussian (shared Σ)	Only that the boundary is linear
Fitting	Closed form: count & average	Iterative: gradient ascent
If assumption right	Best — data-efficient	Good, needs more data
If assumption wrong	Confidently misled	Robust — still works
Best when	Small data, classes ~Gaussian	Large data, unsure of shape

Common misconception: "More assumptions are always bad — you should use the most flexible model." Assumptions aren't inherently bad; correct assumptions are pure advantage, letting you learn from less data. The danger is wrong assumptions. The skill is matching your model's assumptions to what you actually know about the problem — strong assumptions when you're confident, weak ones when you're not. That judgment, not raw flexibility, is what separates good practitioners.

You have a small dataset and you're confident the two classes are roughly Gaussian blobs. Which classifier is the better bet, and why?

Logistic regression, because it's always better. GDA — its Gaussian assumption is correct here, making it asymptotically efficient and able to learn well from less data. Neither can handle small data. It doesn't matter; they give identical results.

Chapter 6: Naive Bayes — Generative Learning for Words

GDA handled continuous features with Gaussians. But the most famous generative classifier tackles a different beast: discrete features, especially text. Meet Naive Bayes, the algorithm that filtered spam for the entire internet through the 2000s. Let's build a spam filter and discover why it needs to be "naive."

We represent an email as a long list of yes/no features — one per word in our vocabulary. Feature j is 1 if word j appears in the email, 0 if not. With a vocabulary of 50,000 words, each email is a 50,000-long vector of zeros and ones. To classify generatively, we need to model how spam generates these word-vectors, and how non-spam does — the class-conditionals from Chapter 1.

The combinatorial wall

Here's the catastrophe. A full model of "which combinations of 50,000 words appear together" would need to assign a probability to every possible word-vector — and there are two-to-the-fifty-thousand of them. That's a number with over 15,000 digits. More parameters than atoms in the universe, by an absurd margin. We cannot possibly estimate that from any dataset. The honest, correlation-aware model is computationally impossible.

The naive assumption that rescues everything

So we make a wildly strong simplifying assumption, the Naive Bayes assumption: given the class, the words are independent of each other. Knowing an email is spam, the presence of "buy" tells you nothing about whether "price" also appears. This is plainly false — spam words travel in packs — which is exactly why it's called "naive." But it collapses the impossible model into a trivial one: instead of modeling all word combinations, we just model each word separately. The probability of a whole email given its class becomes the product of each word's individual probability given the class.

Now each parameter is dead simple to estimate, by counting: the probability of word j appearing in spam is just the fraction of spam emails that contain word j. Count the spam emails with "viagra," divide by the total spam emails. Do this for every word, for both classes, plus the overall spam rate. That's the entire training procedure — a few counts. To classify a new email, multiply the relevant per-word probabilities together with the class prior (Bayes' rule from Chapter 1) and compare spam vs. non-spam.

"Naive" is a feature, not just a bug. The independence assumption is false, yet Naive Bayes works remarkably well in practice. Why? Because for classification we only need to get the winner right, not the exact probabilities. Even with miscalibrated probabilities from the false assumption, the spam class usually still comes out ahead for spam emails. Trading correctness-of-probability for simplicity-and-speed is often a brilliant deal — Naive Bayes is the patron saint of "good enough, instantly."

Common misconception: "The Naive Bayes assumption says all the words are independent, period." It says they're conditionally independent — independent given the class label. That's a much weaker, more reasonable claim. "Buy" and "price" are correlated overall (both appear in commercial emails), but the assumption only requires that once you know an email is spam, they provide independent evidence. Still not true, but far less crazy than claiming raw independence.

What does the "naive" assumption in Naive Bayes actually assume, and why is it necessary?

That all emails are spam. That words are conditionally independent given the class — necessary because modeling all word combinations would need an astronomically impossible number of parameters. That there are only two words in the language. That the features are continuous.

Chapter 7: The Spam Filter Lab

Let's build the actual thing. Below is a working Naive Bayes spam filter, trained on a tiny corpus of emails. Toggle which words appear in an incoming email, and watch the classifier compute the probability of spam in real time — multiplying each word's evidence by the prior, exactly the Bayes' rule recipe from Chapter 1. You can see each word vote.

What to watch:

Toggle spammy words (free, money, winner) on — the spam probability shoots up, each word adding its evidence.
Toggle hammy words (meeting, report, lunch) on — the probability drops back toward "not spam."
Mix them — watch the classifier weigh competing evidence, the product of probabilities tilting toward whichever class the words favor.
Notice each word's contribution bar: how strongly that single word pushes toward spam or ham, learned purely from counting.

Naive Bayes Spam Filter — toggle the words

Click words to add/remove them from the incoming email. The gauge shows P(spam | words); the bars show how each present word votes (orange = toward spam, teal = toward ham). All probabilities were learned by counting word frequencies in a labeled corpus.

P(spam | this email) = — → —

That gauge is the entire algorithm: a prior (the base spam rate), times a product of per-word likelihoods (each learned by counting), normalized by Bayes' rule. No gradient descent, no iterations, no tuning. It trains in one pass over the data and classifies in microseconds. For two decades, a version of exactly this guarded billions of inboxes.

(No quiz — the lab is the test. If you can predict which way the gauge moves before toggling a word, you understand Naive Bayes.)

Chapter 8: Laplace Smoothing — The Zero That Breaks Everything

Our spam filter has a hidden landmine, and it's worth understanding because it teaches a deep statistical lesson. Suppose a brand-new word appears in an incoming email — say "neurips" — that never showed up in any training email, spam or ham. What probability does our counting procedure assign to it?

Zero. The fraction of spam emails containing "neurips" is zero (none did), and the fraction of ham emails is also zero. And here's the disaster: because we multiply all the per-word probabilities together, a single zero anywhere in the product makes the entire result zero — for both classes. The classifier computes zero divided by zero and throws up its hands. One unseen word and the whole prediction collapses.

The deeper lesson: it is statistically reckless to declare an event impossible (probability exactly zero) just because you haven't seen it yet in a finite sample. Your training set is a tiny window on the world; absence of evidence isn't evidence of absence. Assigning a hard zero to "never observed" is overconfidence of the worst kind — and the multiplication ruthlessly punishes it.

The fix: pretend you saw everything once

The cure is delightfully simple, called Laplace smoothing (or add-one smoothing). Before counting, imagine you'd already seen every word once in every class. Concretely: add 1 to every word's count, and adjust the denominator to keep things a valid probability. Now nothing has probability zero — every word gets a tiny, non-zero share, just in case. The data still dominates for words you've seen plenty of times; smoothing only matters for the rare and unseen, gently nudging their zero up to "small but possible."

A zero probability, before and after smoothing

A word appears in 0 of the spam emails. Slide its count and toggle Laplace smoothing. Without smoothing, a count of zero gives a probability of exactly zero (which zeros out the whole product, shown by the broken gauge). With smoothing, even an unseen word gets a small positive probability, and the classifier keeps working.

word count in spam0

estimated probability = —

Common misconception: "Laplace smoothing is a hack to avoid a divide-by-zero error." It's principled, not a kludge — it's the optimal estimator under a reasonable prior belief that all outcomes are a priori possible. It encodes genuine humility: "I haven't seen this, but I won't bet my life it can never happen." Almost every real text classifier uses it, and it's the simplest example of the Bayesian idea of a prior regularizing your estimates away from overconfident extremes.

Why does an unseen word break an un-smoothed Naive Bayes classifier, and how does Laplace smoothing fix it?

It doesn't break anything; unseen words are ignored. An unseen word gets probability zero, and since probabilities are multiplied, one zero makes the whole product zero; Laplace smoothing adds 1 to every count so nothing is exactly zero. The classifier runs slower; smoothing speeds it up. Unseen words get probability one, overwhelming everything; smoothing lowers it.

Chapter 9: Connections & Cheat Sheet

You've now seen both halves of classification. Discriminative models (logistic regression) draw boundaries; generative models (GDA, Naive Bayes) model each class and flip with Bayes. Knowing both — and when each shines — is what makes you fluent rather than just functional.

The whole lesson on one page

Idea	What it says
Discriminative	Model P(label \| features) directly — learn the boundary (logistic regression).
Generative	Model P(features \| class) and P(class), then use Bayes' rule to get P(class \| features).
Bayes' rule	posterior ∝ likelihood × prior. Ignore the denominator when picking a winner.
Multivariate Gaussian	Mean vector (center) + covariance matrix (shape, including feature correlations).
GDA	Each class is a shared-covariance Gaussian. Fit by counting & averaging. Boundary is linear; posterior is a sigmoid.
GDA vs logistic	GDA assumes more (Gaussian) → data-efficient if right, misled if wrong. Logistic is robust.
Naive Bayes	Discrete features (words). Assume conditional independence given class → product of per-word probabilities. Fit by counting.
Laplace smoothing	Add 1 to every count so no event has probability zero. Prevents one unseen word from zeroing the product.

Naive Bayes from scratch

python
import numpy as np
def train_naive_bayes(X, y):     # X: n×d binary word-presence matrix
    # Laplace-smoothed P(word j = 1 | class), and class prior
    phi_spam = (X[y==1].sum(0) + 1) / (np.sum(y==1) + 2)
    phi_ham  = (X[y==0].sum(0) + 1) / (np.sum(y==0) + 2)
    prior = y.mean()
    return phi_spam, phi_ham, prior

def predict(x, phi_spam, phi_ham, prior):
    # work in log-space: products of many small probabilities underflow to 0
    ls = np.log(prior)   + (x*np.log(phi_spam) + (1-x)*np.log(1-phi_spam)).sum()
    lh = np.log(1-prior) + (x*np.log(phi_ham)  + (1-x)*np.log(1-phi_ham)).sum()
    return 1 if ls > lh else 0

# sklearn: GaussianNB for GDA-like continuous, MultinomialNB/BernoulliNB for text
from sklearn.naive_bayes import BernoulliNB
BernoulliNB().fit(X, y)   # Laplace smoothing built in (alpha=1)

One detail that matters in practice: multiplying thousands of small probabilities underflows to zero on a real computer. So real Naive Bayes works in log-space — adding logarithms instead of multiplying probabilities. Same math, no underflow. You'll see this np.log trick everywhere probabilities get multiplied in bulk.

Where to go next

Logistic Regression — the discriminative counterpart; revisit Chapter 5's trade-off now that you've seen both.
Generalized Linear Models — the discriminative-family unification; GDA's sigmoid posterior connects the two worlds.
Multinomial Naive Bayes — a refinement that counts word occurrences (not just presence) and often does even better on text; the natural next step from the Bernoulli model here.
Bayesian distributions — go deeper on priors, posteriors, and the smoothing idea generalized.
Bayes Classifiers (EE269) — the same ideas from a signal-processing angle.

You can now teach this. Generative models learn what each class looks like and flip with Bayes' rule. GDA fits a shared-covariance Gaussian to each class by counting and averaging, and — surprise — its posterior is a sigmoid, linking it to logistic regression, though it assumes more. Naive Bayes assumes words are conditionally independent given the class, turning an impossible model into a product of counts, and Laplace smoothing keeps one unseen word from breaking everything. The other half of classification, mastered.

"It is the mark of a truly intelligent person to be moved by statistics." — George Bernard Shaw. Naive Bayes is statistics moving billions of emails into the right folder, one humble count at a time.