Classical ML · CS229

Generalized Linear Models

One recipe that produces linear regression, logistic regression, softmax, and a model for counting customers — all from a single idea. The framework that explains why two different algorithms shared the exact same update rule.

Prerequisites: Linear Regression + Logistic Regression. We unify them here.
10 Chapters · 7+ Simulations · 0 Assumed Knowledge

Chapter 0: The Mystery of the Identical Update Rule

You've now built two learning algorithms. Linear regression predicts a continuous number — a house price — by fitting a line and minimizing squared error. Logistic regression predicts a yes/no label by squashing a line through a sigmoid and minimizing cross-entropy. Two different goals, two different hypotheses, two different loss functions, derived from two different assumptions about the data.

And yet. When you derived how to train each one, the exact same update rule fell out: nudge each weight by the learning rate, times the error (target minus prediction), times the feature. Character for character identical. We flagged it as suspicious at the time. Now we explain it.

Coincidences in mathematics are usually a sign you haven't found the deeper structure yet. The identical update rule is not luck. Linear regression and logistic regression are not two unrelated tricks — they are two faces of one underlying family of models. Find that family, and the shared update rule stops being a surprise and becomes inevitable.

But here's a second motivation, more practical. Suppose you're not predicting a price or a yes/no, but a count — how many customers walk into your store in an hour, based on features like the weather and whether there's a sale. Counts are whole numbers, zero or more. Let's see what our two tools do with this, and watch both fail.

Neither tool fits count data

Dots are hours: a feature (say, ad spend) on the horizontal axis, the number of customers who showed up on the vertical. Toggle the straight-line fit (linear regression) on and off over the data. The line predicts negative customers on the left, which is impossible, and its constant slope can't capture how counts tend to grow multiplicatively. Logistic regression fares no better: its sigmoid output is confined between 0 and 1, so it can never predict 5 or 50 customers. We need a model built for counts.
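The failure of the straight line can be reproduced in a few lines. The sketch below uses a small made-up dataset (the numbers are illustrative assumptions, chosen so the counts roughly follow e^x) and fits an ordinary least-squares line with NumPy's `polyfit`.

```python
import numpy as np

# Hypothetical hourly data: ad spend (x) and customer counts (y).
# The counts were chosen to grow roughly like exp(x), i.e. multiplicatively.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1, 2, 3, 4, 7, 12, 20, 33, 55], dtype=float)

# Ordinary least-squares line; for deg=1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Prediction of the fitted line at the left edge of the data.
line_at_left = slope * x[0] + intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"line predicts {line_at_left:.2f} customers at x = {x[0]}")
```

On this toy data the fitted intercept is negative (about −8), so the line predicts a negative number of customers at the left edge, which is exactly the flaw described above.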


Counts want a model whose output is always positive and grows multiplicatively — the natural choice is the Poisson distribution, the standard model for "how many independent events happen in a window." But we don't have a "Poisson regression" in our toolkit. Do we have to derive a whole new algorithm from scratch, the way we did for linear and logistic regression?

No. By the end of this lesson, you'll be able to write down a regression model for counts — or for almost any kind of target — in about three lines, using a single universal recipe called a Generalized Linear Model, or GLM. Linear regression, logistic regression, softmax, and Poisson regression are all just this one recipe with a different choice of distribution plugged in. Learn the recipe once, and you can invent the right model for any prediction problem on demand. That's the power we're after.

The plan. Ch.1: the exponential family, the secret club of distributions that all share one mathematical form. Ch.2–3: show the Gaussian and Bernoulli are secretly members, and watch the sigmoid fall out of the Bernoulli, no longer arbitrary. Ch.4: the three-line GLM recipe. Ch.5: turn the crank and watch linear and logistic regression pop out. Ch.6: name the link and response functions. Ch.7: invent Poisson regression live. Ch.8: prove all GLMs share that update rule. Ch.9: connections and a cheat sheet. The payoff: you stop memorizing models and start deriving them.
Why does the lesson claim linear regression and logistic regression having the identical update rule is significant?

Chapter 1: The Exponential Family — A Secret Club of Distributions

The key that unlocks everything is a realization: many of the probability distributions you know — the Gaussian (bell curve), the Bernoulli (coin flip), the Poisson (counts), and a dozen more — can all be rewritten in one common mathematical form. They look completely different on the surface, but underneath they're built from the same template. That template defines the exponential family.

A distribution belongs to the exponential family if its probability can be written in this shape:

p(y; η) = b(y) · exp( η · T(y) − a(η) )

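Don't be intimidated. It's just three labelled parts plus a base factor, and each has a job. Let's name them with their real-world roles:

  • η (eta) is the natural parameter: the single knob that selects which member of the family you get.
  • T(y) is the sufficient statistic: the summary of the data that the distribution actually looks at. In the Gaussian, Bernoulli, and Poisson examples in this lesson it is simply y itself.
  • a(η) is the log-partition function: the normalizer that makes the probabilities sum (or integrate) to one.

And b(y) is a base measure that doesn't depend on η, usually a constant or a simple factor. Pick a specific T, a, and b, and you've pinned down a family of distributions; varying η then sweeps through the members of that family.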

Why should you care that distributions share a form? Because anything we can prove about the generic form p(y; η) instantly holds for every member — Gaussian, Bernoulli, Poisson, all at once. Derive the training rule for the exponential family in general, and you've simultaneously derived it for linear regression, logistic regression, and any GLM you'll ever build. One proof, infinitely many models. That's the leverage.
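To make the template concrete, here is a minimal sketch that plugs the Poisson distribution into it and compares the result with the usual Poisson formula. The decomposition used (η = log of the rate, T(y) = y, a(η) = e^η, b(y) = 1/y!) is the standard textbook one, not something derived in this lesson so far; the helper name `exp_family_pmf` is made up for illustration.

```python
import math

def exp_family_pmf(y, eta, T, a, b):
    """Generic exponential-family probability: b(y) * exp(eta * T(y) - a(eta))."""
    return b(y) * math.exp(eta * T(y) - a(eta))

# Poisson pieces (standard decomposition, assumed here):
#   eta = log(rate), T(y) = y, a(eta) = exp(eta), b(y) = 1 / y!
poisson_T = lambda y: y
poisson_a = lambda eta: math.exp(eta)
poisson_b = lambda y: 1.0 / math.factorial(y)

rate = 3.0
eta = math.log(rate)

for y in range(6):
    via_template = exp_family_pmf(y, eta, poisson_T, poisson_a, poisson_b)
    textbook = rate ** y * math.exp(-rate) / math.factorial(y)
    print(y, round(via_template, 6), round(textbook, 6))
```

The two columns agree, which is all "membership in the exponential family" means: the familiar formula can be rearranged into the template without changing any probability.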
One form, three faces

Each of these distributions is a member of the exponential family. Pick one and slide its natural parameter η — watch the same single knob reshape a bell curve, tilt a coin's bias, or stretch a count distribution. Different surfaces, identical underlying machinery.

Common misconception: "The exponential family is some exotic, rarely-used set of distributions." The opposite: it's almost everything you use in practice, including the Gaussian, Bernoulli, binomial, Poisson, gamma, exponential, beta, Dirichlet, and categorical. The distributions that aren't in it (like the heavy-tailed Cauchy, or a uniform whose endpoints are the unknown parameters) are the exceptions. The exponential family is the workhorse stable of statistics, which is exactly why building a recipe on top of it covers so much ground.
In the exponential family form, what is the role of the natural parameter η?

Chapter 2: The Gaussian in Disguise

Claims are cheap. Let's actually catch a familiar distribution wearing the exponential-family costume. Start with the Gaussian — the bell curve behind linear regression. To keep the algebra clean, we'll fix its spread to one (recall from the linear regression lesson that the spread didn't affect the best-fit weights anyway, so we lose nothing).

The bell curve's formula has the shape "a constant, times e raised to the power negative one-half times (the data minus the mean), squared." Our job is to algebraically twist that into the exponential-family template — "base, times e to the power (eta times the data, minus the normalizer)." The trick is to expand the square inside the exponent and sort the pieces by which ones touch the data and which touch the mean.

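When you multiply out negative one-half times the squared difference, three terms appear: one with only the data squared, one with the data times the mean (the cross term), and one with only the mean squared. Group them by job:
  • The data-squared term, −y²/2, involves no mean at all, so it joins the constant out front as the base measure b(y).
  • The cross term, μ·y, is the mean multiplying the data: that is the η · T(y) slot, with η = μ and T(y) = y.
  • The mean-squared term, −μ²/2, involves no data: that is −a(η), so a(η) = μ²/2.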

The result, the Gaussian's secret identity:

η = μ  (the mean)  ·  T(y) = y  ·  a(η) = η²/2

The headline is that first equation. For a Gaussian, the natural parameter simply IS the mean. The exotic-sounding η is, in this case, nothing but the average. Hold onto that — it's the reason linear regression's prediction will turn out to be the bare linear model with no squashing function at all.
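A quick numerical check of this identity, as a sketch: the first function writes the unit-variance bell curve the usual way, the second writes it in exponential-family form with η = μ, T(y) = y, a(η) = η²/2, and the base measure b(y) = exp(−y²/2)/√(2π). The function names are made up for illustration.

```python
import math

def gaussian_pdf(y, mu):
    """Unit-variance Gaussian density, written the usual way."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def gaussian_exp_family(y, eta):
    """The same density as b(y) * exp(eta * T(y) - a(eta))."""
    b = math.exp(-0.5 * y ** 2) / math.sqrt(2 * math.pi)  # base measure: depends on y only
    T = y                                                  # sufficient statistic
    a = 0.5 * eta ** 2                                     # log-partition function
    return b * math.exp(eta * T - a)

mu = 0.6  # for the Gaussian, the natural parameter eta equals the mean
for y in (-1.0, 0.0, 0.6, 2.5):
    print(y, gaussian_pdf(y, mu), gaussian_exp_family(y, eta=mu))
```

Both columns match at every y, so the rewrite is pure algebra: nothing about the distribution has changed, only how its formula is organized.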

Reading off the Gaussian's exponential-family parts

Slide the mean. The bell curve slides with it — and because the natural parameter equals the mean for a Gaussian, the η readout below tracks it one-to-one. The normalizer a(η) grows as the curve moves from center, keeping the total area at exactly one.

Why fixing the spread to one was legitimate. In the linear regression lesson, we proved the noise level dropped out of the best-fit weights entirely: the answer didn't depend on how noisy we thought the data was. Same here. A known, fixed spread σ shows up only in the base measure b(y) and as a constant rescaling of η (η becomes μ/σ²), and a constant rescaling is absorbed by the weights once η is tied to the linear model. So setting the spread to one simplifies the algebra without changing a single prediction.
Common misconception: "Rewriting the Gaussian this way is just notational trickery with no payoff." It has an enormous payoff. Once η equals the mean, and once (in Chapter 4) we connect η to the linear model θᵀx, the prediction becomes "the mean equals θᵀx", which is precisely ordinary linear regression, now derived rather than assumed. The disguise is how the unification happens.
When the Gaussian is written in exponential-family form, what does its natural parameter η turn out to equal?

Chapter 3: The Bernoulli, and Where the Sigmoid Comes From

Now the distribution behind classification: the Bernoulli, a single coin flip that lands 1 with probability φ (phi) and 0 otherwise. Let's catch it in the exponential-family costume too — and this time something wonderful falls out.

The Bernoulli probability can be written as "φ if y is 1, and one-minus-φ if y is 0," which (using the same powers-as-switches trick from the logistic lesson) becomes φ to the y, times one-minus-φ to the one-minus-y. To force this into the exponential form, we do the standard move: write it as e raised to the log of itself, then expand the log of the product into a sum of logs. After sorting the terms, the part multiplying the data y — the natural-parameter slot — reads:

η = log( φ / (1 − φ) )

Stop and recognize this. The natural parameter of the Bernoulli is the log-odds — the logarithm of the probability of yes divided by the probability of no. That's the exact "logit" we met in the logistic regression lesson. The exponential family is telling us, on its own, that the natural way to parameterize a coin flip is by its log-odds, not its raw probability.

Now invert it. We have η in terms of φ; let's solve for φ in terms of η — "given the natural parameter, what's the probability?" Exponentiate both sides, then rearrange to isolate φ:

η = log(φ/(1−φ))  ⇒  e^η = φ/(1−φ)  ⇒  φ = e^η/(1+e^η) = 1/(1 + e^(−η))

That is the sigmoid function. Exactly. The S-shaped squashing curve we pulled out of thin air in the logistic regression lesson was not an arbitrary choice after all — it is the inverse of the Bernoulli's natural-parameter map. The sigmoid is the function that converts a log-odds back into a probability, and it appears the instant you ask "the Bernoulli is an exponential-family distribution; what's the mean as a function of the natural parameter?" The mystery of "why the sigmoid?" is now solved: because the Bernoulli is in the exponential family, and the sigmoid is what its structure demands.
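The round trip between probability and log-odds is easy to check numerically. The sketch below also writes the Bernoulli in exponential-family form, using b(y) = 1 and a(η) = log(1 + e^η); that normalizer is the standard result of the sorting step described above, stated here as an assumption rather than derived.

```python
import math

def logit(phi):
    """Bernoulli natural parameter: the log-odds of phi."""
    return math.log(phi / (1.0 - phi))

def sigmoid(eta):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_exp_family(y, eta):
    """Bernoulli pmf as b(y) * exp(eta * y - a(eta)), with b(y) = 1, a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log(1.0 + math.exp(eta)))

phi = 0.7
eta = logit(phi)

print("log-odds:", eta)
print("sigmoid(log-odds):", sigmoid(eta))          # recovers phi
print("P(y=1):", bernoulli_exp_family(1, eta))     # equals phi
print("P(y=0):", bernoulli_exp_family(0, eta))     # equals 1 - phi
```

Starting from φ = 0.7, the log-odds is fed through the sigmoid and lands back on 0.7 (up to floating-point rounding), and the exponential-family form assigns that same probability to y = 1.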

Log-odds and its inverse, the sigmoid

Left: the probability-to-log-odds map (the natural parameter). Right: its inverse, the sigmoid, mapping the natural parameter back to a probability. Slide the probability φ and watch the dot travel up the log-odds curve; the matching natural parameter feeds the sigmoid on the right and lands right back at φ. They undo each other.

The big reveal. Two lessons ago you might have wondered: of all possible S-curves, why the sigmoid specifically? Here's the answer in one sentence. The sigmoid is the canonical response function of the Bernoulli distribution. It's not a design choice — it's a mathematical consequence of choosing to model a yes/no outcome as a coin flip. Likewise, linear regression's "no squashing at all" is the canonical response of the Gaussian. The activation function is dictated by the distribution, not invented.
Common misconception: "So the sigmoid is the only squashing function you're ever allowed to use for classification." Not quite — it's the canonical one for the Bernoulli, the one the math hands you for free, with the nicest properties (it's why the cross-entropy gradient came out so clean). Other choices exist (the probit, from a Gaussian-threshold story). But the sigmoid is canonical, and now you know precisely why: it's the Bernoulli's natural-parameter inverse.
When the Bernoulli distribution is written in exponential-family form, what is its natural parameter, and what falls out when you invert that relationship?

Chapter 4: The Recipe — Three Lines to Any Model

We've seen the secret club and two of its members. Now the payoff: a recipe that turns "which distribution describes my target?" into "here's the regression model." A Generalized Linear Model is defined by exactly three assumptions. Memorize these three lines and you can build a model for almost any prediction problem.

Assumption 1 — the distribution
Given the input x, the target y follows some exponential-family distribution with natural parameter η. (You pick the distribution to match your data: Gaussian for prices, Bernoulli for yes/no, Poisson for counts.)
Assumption 2 — predict the mean
We want our prediction h(x) to equal the expected value of y given x. The model outputs the average target it expects for this input.
Assumption 3 — go linear
The natural parameter is a linear function of the input: η = θᵀx. This is the "linear" in Generalized Linear Model, and it is a design choice rather than something the math forces.

That's the entire framework. Read it as a pipeline: your features get combined into a single number by the linear model (θᵀx), that number is the natural parameter η, and the distribution's structure then converts η into the predicted mean, which is your output. The only thing that changes from one GLM to the next is which distribution you plug into Assumption 1. Everything else is fixed.

Trace the data flow, because that's the whole idea. Input features go in. The linear model crushes them to one number — the natural parameter. The distribution's "response function" (the sigmoid for Bernoulli, the identity for Gaussian, the exponential for Poisson) maps that number to the predicted mean. Output. Change the distribution, and only the response function in the middle changes; the linear front-end and the mean-prediction back-end stay put. That modularity is why one recipe covers so many models.
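The data flow just described can be sketched as a single function. The weights and input below are made-up values chosen so that the linear output is 0.8; `glm_predict` and `RESPONSES` are illustrative names, not part of any library.

```python
import numpy as np

# Response functions: natural parameter eta -> predicted mean.
RESPONSES = {
    "gaussian": lambda eta: eta,                          # identity
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # sigmoid
    "poisson": lambda eta: np.exp(eta),                   # exponential
}

def glm_predict(theta, x, family):
    """Linear front-end, then the family's response function."""
    eta = np.dot(theta, x)          # Assumption 3: eta is linear in x
    return RESPONSES[family](eta)   # Assumption 2: output the mean

theta = np.array([0.5, 0.3])   # hypothetical weights
x = np.array([1.0, 1.0])       # hypothetical input, so eta = 0.8

for family in RESPONSES:
    print(family, glm_predict(theta, x, family))
```

The same η = 0.8 comes out as 0.8 (a price), roughly 0.69 (a probability), or roughly 2.23 (a count rate), depending only on which response function sits in the middle.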

The GLM assembly line

Pick a distribution and slide the linear output θᵀx. Watch the same number flow through the same pipeline, with only the middle "response function" box differing, producing a price (identity), a probability (sigmoid), or a count rate (exponential).

Common misconception: "The third assumption, that η is linear in x, is a deep law of nature." It isn't — it's a design choice, the simplest one we could make, and it's exactly what keeps these models "linear" and easy to train. You could make η a nonlinear function of x (that's essentially what a neural network does — many nonlinear layers producing the natural parameter for a final GLM head). The GLM framework just takes the simplest option: a plain linear combination. Knowing it's a choice, not a law, is what lets you generalize later.
In the GLM recipe, what is the ONE thing you change to go from linear regression to logistic regression to Poisson regression?

Chapter 5: Turning the Crank — Two Old Friends Fall Out

Let's prove the recipe works by deriving our two known models from it — watching linear and logistic regression appear, not as separate inventions, but as two settings of the same machine.

Linear regression = Gaussian GLM

Choose the Gaussian for Assumption 1, because the target (a price) is continuous. Now turn the crank. Assumption 2 says the prediction is the expected value of y, which for a Gaussian is its mean. From Chapter 2, the Gaussian's mean equals its natural parameter η. And Assumption 3 says η equals θᵀx. Chain them together:

h(x) = mean = η = θᵀx

The prediction is the bare linear model, no squashing. That is exactly ordinary linear regression. We didn't assume it — it fell out of "the target is Gaussian." The "identity response function" (output equals input) is just what the Gaussian's structure dictates.

Logistic regression = Bernoulli GLM

Now choose the Bernoulli, because the target is yes/no. Assumption 2: the prediction is the expected value of y, which for a Bernoulli is the probability φ. From Chapter 3, φ equals the sigmoid of the natural parameter. Assumption 3: η equals θᵀx. Chain them:

h(x) = mean = φ = sigmoid(η) = sigmoid(θᵀx)

The prediction is the linear model passed through a sigmoid. That is exactly logistic regression. Again we didn't assume the sigmoid — choosing "the target is Bernoulli" forced it. Same machine, same crank; only Assumption 1's distribution changed, and out came a completely different-looking model.

One machine, two models — flip the distribution

The same GLM pipeline, side by side. Toggle the distribution and watch the only thing that changes: the response box in the middle. Gaussian gives you a straight prediction line; Bernoulli bends it into the logistic S-curve. Same front-end θᵀx, same "predict the mean" back-end.

This is the unification, made concrete. Linear regression and logistic regression were never two unrelated algorithms. They are the Gaussian setting and the Bernoulli setting of one recipe. Swap the distribution, turn the crank, get a different model — but the skeleton (linear front-end, predict-the-mean, exponential-family target) never moves. And because the skeleton is shared, so is the training rule. We prove that last claim in Chapter 8.
Deriving logistic regression from the GLM recipe shows that the sigmoid is:

Chapter 6: Link and Response Functions — The Two Translators

We keep using the phrase "response function." Let's name the two translators precisely, because the vocabulary is everywhere in statistics and you'll want to recognize it.

The response function is the map from the natural parameter to the mean — it answers "given η, what's the average target?" It's the box in the middle of our assembly line. For the Gaussian it's the identity (mean equals η, output equals input). For the Bernoulli it's the sigmoid. For the Poisson, as we'll see next chapter, it's the exponential.

The link function is its inverse — the map from the mean back to the natural parameter, answering "given the average target, what natural parameter produced it?" For the Gaussian it's the identity again; for the Bernoulli it's the logit (log-odds); for the Poisson it's the logarithm. The link function is the one that "linearizes" your target: apply the link to the mean, and the result is a plain linear function of x.

Distribution | Model | Response (η → mean) | Link (mean → η)
Gaussian | Linear regression | identity | identity
Bernoulli | Logistic regression | sigmoid | logit (log-odds)
Poisson | Poisson regression | exponential | logarithm
Multinomial | Softmax regression | softmax | log-odds against a reference class (multinomial logit)

When the link function is exactly the one the exponential-family math hands you for free — the inverse of the natural-parameter relationship — it's called the canonical link. Every model in that table uses its canonical link, which is why the math stays clean and the training rule stays simple. You can use a non-canonical link (statisticians sometimes do), but the canonical one is the default and the one that makes everything sing.
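That the link and response really are inverses can be checked directly. This is a small sketch covering the three scalar families from the table; the dictionary name `PAIRS` is made up for illustration.

```python
import numpy as np

# (response, link) pairs for the three scalar families in the table.
PAIRS = {
    "gaussian": (lambda eta: eta, lambda mean: mean),
    "bernoulli": (lambda eta: 1.0 / (1.0 + np.exp(-eta)),
                  lambda mean: np.log(mean / (1.0 - mean))),
    "poisson": (np.exp, np.log),
}

etas = np.linspace(-3.0, 3.0, 13)   # a grid of natural-parameter values

round_trip_ok = {}
for name, (response, link) in PAIRS.items():
    mean = response(etas)        # eta -> mean
    recovered = link(mean)       # mean -> eta
    round_trip_ok[name] = bool(np.allclose(recovered, etas))
    print(name, round_trip_ok[name])
```

Each family prints True: applying the response and then the link returns the η you started with, which is what "canonical link" means in practice.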

A mnemonic that captures it. The link takes your real-world mean and links it up to the linear world where θᵀx lives. The response responds by bringing the linear number back down to a real-world prediction. Link goes up to linear-land; response comes back down to the answer. They're inverses, and choosing the distribution chooses both.

Common misconception: "The link function is applied to the model's output." Careful: it's the other way around. The response function is applied to θᵀx to produce the output. The link function is the conceptual inverse that tells you how the mean relates back to the linear predictor; you rarely apply it directly during prediction. Mixing them up is the single most common GLM confusion. Remember: you predict by running the response function forward.
What is the relationship between a GLM's response function and its link function?

Chapter 7: Invent a New Model — Poisson Regression Live

Time to cash in everything. Back in Chapter 0, neither linear nor logistic regression could model the count of customers per hour. Now you have the recipe, so let's build the right model on the spot — without deriving a single thing from scratch.

Three lines. One: counts of independent events are modeled by the Poisson distribution (its single parameter is the average rate of events), and the Poisson is in the exponential family. Two: predict the mean, the expected count. Three: the natural parameter is θᵀx. The Poisson's response function (the inverse of its log link) is the exponential, so the model is:

h(x) = expected count = e^(θᵀx)

That's Poisson regression, and you just invented it in one line by turning the crank. Look at what the exponential buys you, and why it's exactly right for counts. The output e-to-something is always positive — you can never predict a negative number of customers, the flaw that killed the straight line in Chapter 0. And because it's exponential, the model says counts grow multiplicatively: each unit of ad spend multiplies the expected turnout by a fixed factor, rather than adding a fixed number. That matches how counts actually behave in the world.
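The multiplicative behaviour is easy to see with numbers. In this sketch the weights are made up (an intercept of 0.2 and an ad-spend weight of 0.5); nothing here is fitted to data.

```python
import numpy as np

theta = np.array([0.2, 0.5])   # hypothetical weights: [intercept, ad-spend weight]

def expected_count(ad_spend):
    """Poisson regression prediction: exp(theta . x), with x = [1, ad_spend]."""
    x = np.array([1.0, ad_spend])
    return np.exp(theta @ x)

ratios = []
for spend in (0.0, 1.0, 2.0, 3.0):
    ratio = expected_count(spend + 1.0) / expected_count(spend)
    ratios.append(ratio)
    print(f"spend={spend}: expected count={expected_count(spend):.3f}, "
          f"next/current ratio={ratio:.4f}")
```

Every extra unit of ad spend multiplies the expected count by the same factor, e^0.5 ≈ 1.65, wherever you start. And because the output is an exponential, even a very negative linear output gives a small positive count, never a negative one.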

Experiments (each teaches something):
  • Fit Poisson and watch the smooth exponential mean-curve sweep up through the cloud of counts — positive everywhere, curving the way counts really grow.
  • Overlay the linear fit and see it dive below zero on the left (impossible counts) and miss the upward curve.
  • Drag a high-count point and watch the exponential curve respond multiplicatively — it bends, it never goes negative.
Poisson Regression Lab

Drag any point to move it. The orange curve is the fitted Poisson mean (always positive, exponential growth); toggle the linear fit to see it predict impossible negative counts. The Poisson model is the GLM you just derived — fit by the same gradient ascent as logistic regression.

This is the whole point of the lesson. You needed a model for counts, and instead of inventing one from scratch (choosing a loss, deriving a gradient, proving convergence) you picked a distribution and turned a crank. That's the superpower a framework gives you. Predicting time-to-failure? Use the gamma or exponential GLM. Predicting proportions? Beta regression, a close relative built on the same idea. The recipe is the same every time. You've graduated from memorizing models to deriving them on demand.

(No quiz — the lab is the test. If you can explain why the exponential response function is the right choice for count data, you understand GLMs.)

Chapter 8: Why They All Share One Update Rule

We opened with a mystery: linear and logistic regression had the identical update rule. Now we can finally solve it — and the solution covers every GLM, including the Poisson model you just built. The secret lives in that humble normalizer, the log-partition function a(η).

Here's the small miracle. For any exponential-family distribution, the log-partition function has a magical property: its derivative gives you the expected value of the sufficient statistic T(y), and since T(y) = y for the Gaussian, Bernoulli, and Poisson, that is simply the mean. The function whose only job seemed to be "make probabilities sum to one" turns out to secretly encode the distribution's average. (Its second derivative gives the variance, but the mean is what we need here.)
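This property can be checked numerically without any calculus. The sketch below differentiates each log-partition function by finite differences and compares the result with the known mean and variance. The Bernoulli and Poisson normalizers used (log(1 + e^η) and e^η) are the standard ones, stated here as assumptions since the lesson only derives the Gaussian's η²/2 explicitly.

```python
import math

def derivative(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# name: (log-partition a(eta), known mean, known variance), all as functions of eta
FAMILIES = {
    "gaussian (unit variance)": (lambda e: 0.5 * e * e, lambda e: e, lambda e: 1.0),
    "bernoulli": (lambda e: math.log(1.0 + math.exp(e)),
                  sigmoid,
                  lambda e: sigmoid(e) * (1.0 - sigmoid(e))),
    "poisson": (lambda e: math.exp(e), lambda e: math.exp(e), lambda e: math.exp(e)),
}

eta = 0.8
results = {}
for name, (a, mean, var) in FAMILIES.items():
    results[name] = (derivative(a, eta), mean(eta), second_derivative(a, eta), var(eta))
    print(name, results[name])
```

In each row the first two numbers agree (a′(η) is the mean) and the last two agree (a″(η) is the variance), up to finite-difference error.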

Now follow the consequence. To train any GLM by maximum likelihood, we take the slope of the log-likelihood with respect to each weight. Writing out the exponential-family form and taking the log, the only η-dependent pieces are "η times the data" and "minus a(η)." Differentiating the first with respect to η gives the data y; differentiating the second gives the derivative of a, which we just said is the mean, and the mean is exactly our prediction h(x). That makes y − h(x) the slope with respect to η. Because η = θᵀx, the chain rule contributes one more factor: the slope of η with respect to weight j is the feature x_j. So the slope reduces to:

slope for weight j = (y − h(x)) · x_j

There it is. The error, times the feature. For every GLM built with its canonical link. The data minus the predicted mean, scaled by the input. Because the log-partition function's derivative is always the mean, the "predicted" term in the gradient is always the model's output, and the gradient always collapses to this same clean shape. Linear regression, logistic regression, Poisson regression, softmax (with one such error term per class): one gradient to rule them all.
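One way to convince yourself of this is a gradient check: compute the slope of the log-likelihood numerically and compare it with the "error times feature" formula. The sketch below does this for the Bernoulli and Poisson cases on a tiny made-up dataset; the log-likelihoods drop terms that don't involve θ (such as −log y! for the Poisson).

```python
import numpy as np

def loglik_bernoulli(theta, X, y):
    eta = X @ theta
    return np.sum(y * eta - np.log1p(np.exp(eta)))   # eta*y - a(eta), a = log(1 + e^eta)

def loglik_poisson(theta, X, y):
    eta = X @ theta
    return np.sum(y * eta - np.exp(eta))             # eta*y - a(eta), a = e^eta

def numerical_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

# Made-up data: an intercept column plus one feature.
X = np.array([[1.0, 0.2], [1.0, -1.0], [1.0, 1.5], [1.0, 0.7]])
theta = np.array([0.1, -0.3])
y_binary = np.array([1.0, 0.0, 1.0, 0.0])
y_counts = np.array([2.0, 0.0, 5.0, 1.0])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# The "universal" gradient: X^T (y - h), with h from each model's response function.
analytic_bern = X.T @ (y_binary - sigmoid(X @ theta))
analytic_pois = X.T @ (y_counts - np.exp(X @ theta))

numeric_bern = numerical_grad(lambda t: loglik_bernoulli(t, X, y_binary), theta)
numeric_pois = numerical_grad(lambda t: loglik_poisson(t, X, y_counts), theta)

print(np.allclose(analytic_bern, numeric_bern, atol=1e-6))
print(np.allclose(analytic_pois, numeric_pois, atol=1e-6))
```

Both checks agree: two different log-likelihoods, one gradient formula, with only the response function changing between them.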

The mystery, solved. Linear and logistic regression share an update rule not by coincidence, but because both are GLMs, and every GLM's maximum-likelihood gradient is "error times feature." The shared structure (exponential-family target, predict-the-mean, linear natural parameter) forces the shared training rule. This is also why the same gradient-descent code, with only the prediction function swapped, can train any of these models. The unification is complete: one family, one recipe, one gradient.
Common misconception: "Then every GLM is trained by literally identical code." The gradient shape is identical (error times feature), but the word "error" hides the model's prediction h(x), which is computed with that model's own response function (identity, sigmoid, or exponential). So the code differs in exactly one line: how h(x) is computed from θᵀx. Everything else (the gradient formula, the update step) is genuinely shared. One line of difference, total.
Why do ALL generalized linear models share the update rule "weight += learning-rate × (y − prediction) × feature"?

Chapter 9: Connections & Cheat Sheet

You came in with two algorithms that mysteriously shared a training rule. You leave with a framework that explains the mystery and hands you new models on demand. That's a genuine leap in understanding — from collecting models to generating them.

The whole lesson on one page

Idea | What it says
Exponential family | Many distributions share the form b(y)·exp(η·T(y) − a(η)). η = natural parameter, T = sufficient statistic, a = log-partition (normalizer).
GLM Assumption 1 | Target y is exponential-family given x (you pick the distribution).
GLM Assumption 2 | Predict the mean: h(x) = E[y | x].
GLM Assumption 3 | Natural parameter is linear: η = θᵀx.
Response function | Maps η → mean. Identity (Gaussian), sigmoid (Bernoulli), exp (Poisson), softmax (multinomial).
Link function | The inverse: maps mean → η. Identity, logit, log, multinomial logit (log-odds against a reference class).
Universal gradient | (y − h(x))·x_j, because a′(η) = mean. Same for every canonical-link GLM.

Inventing a GLM in code

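```python
import numpy as np
import statsmodels.api as sm

# The ONLY thing that changes between GLMs is the response function.
responses = {
    'gaussian': lambda z: z,                       # identity -> linear regression
    'bernoulli': lambda z: 1 / (1 + np.exp(-z)),   # sigmoid  -> logistic regression
    'poisson': lambda z: np.exp(z),                # exp      -> Poisson regression
}

def fit_glm(X, y, family, alpha=0.01, iters=3000):
    """Fit a canonical-link GLM by batch gradient ascent on the log-likelihood."""
    g = responses[family]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = g(X @ theta)                           # predicted mean: the ONE line that varies
        # The universal (y - h)·x gradient, averaged over examples so the
        # step size does not grow with the size of the dataset.
        theta += alpha * X.T @ (y - h) / len(y)
    return theta

# Tiny made-up dataset: an intercept column plus one feature, with count targets.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 7.0, 20.0])
theta = fit_glm(X, y, 'poisson')

# statsmodels exposes the whole family directly:
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # swap Poisson -> Gaussian -> Binomial
```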

Where to go next

You can now teach this. The exponential family is a club of distributions sharing one form. A GLM is three lines: pick the distribution, predict its mean, make the natural parameter linear. Turn the crank and linear regression (Gaussian), logistic regression (Bernoulli), and Poisson regression (counts) all fall out — the sigmoid and exponential are forced, not chosen. And they share one update rule because the log-partition function's derivative is the mean. From two mysterious coincidences to one unifying framework.

"The purpose of models is not to fit the data but to sharpen the questions." — Samuel Karlin. GLMs sharpen one question to a point: what kind of thing is my target, really? Answer that, and the model writes itself.