One recipe that produces linear regression, logistic regression, softmax, and a model for counting customers — all from a single idea. The framework that explains why two different algorithms shared the exact same update rule.
You've now built two learning algorithms. Linear regression predicts a continuous number — a house price — by fitting a line and minimizing squared error. Logistic regression predicts a yes/no label by squashing a line through a sigmoid and minimizing cross-entropy. Two different goals, two different hypotheses, two different loss functions, derived from two different assumptions about the data.
And yet. When you derived how to train each one, the exact same update rule fell out: nudge each weight by the learning rate, times the error (target minus prediction), times the feature. Character for character identical. We flagged it as suspicious at the time. Now we explain it.
But here's a second motivation, more practical. Suppose you're not predicting a price or a yes/no, but a count — how many customers walk into your store in an hour, based on features like the weather and whether there's a sale. Counts are whole numbers, zero or more. Let's see what our two tools do with this, and watch both fail.
Dots are hours: a feature (say, ad spend) on the horizontal axis, the number of customers who showed up on the vertical. Toggle between a straight line (linear regression) and the data. The line predicts negative customers on the left — impossible — and its constant slope can't capture how counts tend to grow multiplicatively. We need a model built for counts.
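The failure is easy to reproduce numerically. The sketch below uses made-up, noise-free counts that grow like e^x (an assumption for illustration, not data from this lesson) and fits an ordinary least-squares line with `np.polyfit`.

```python
import numpy as np

# Hypothetical data: ad spend (x) and hourly customer counts that grow
# multiplicatively. These numbers are invented for illustration.
ad_spend = np.linspace(0.0, 4.0, 9)
customers = np.round(np.exp(ad_spend))   # 1, 2, 3, 4, 7, 12, 20, 33, 55

# Ordinary least-squares straight line: customers ≈ slope * ad_spend + intercept
slope, intercept = np.polyfit(ad_spend, customers, deg=1)

line_at_zero = slope * 0.0 + intercept
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted customers at zero ad spend: {line_at_zero:.2f}")
```

The intercept comes out negative, so the line predicts fewer than zero customers at low ad spend even though every observed count is non-negative.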
Counts want a model whose output is always positive and grows multiplicatively — the natural choice is the Poisson distribution, the standard model for "how many independent events happen in a window." But we don't have a "Poisson regression" in our toolkit. Do we have to derive a whole new algorithm from scratch, the way we did for linear and logistic regression?
No. By the end of this lesson, you'll be able to write down a regression model for counts — or for almost any kind of target — in about three lines, using a single universal recipe called a Generalized Linear Model, or GLM. Linear regression, logistic regression, softmax, and Poisson regression are all just this one recipe with a different choice of distribution plugged in. Learn the recipe once, and you can invent the right model for any prediction problem on demand. That's the power we're after.
The key that unlocks everything is a realization: many of the probability distributions you know — the Gaussian (bell curve), the Bernoulli (coin flip), the Poisson (counts), and a dozen more — can all be rewritten in one common mathematical form. They look completely different on the surface, but underneath they're built from the same template. That template defines the exponential family.
A distribution belongs to the exponential family if its probability can be written in this shape:
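In symbols, using the same names as this lesson's summary table, the template reads:

```latex
p(y;\eta) \;=\; b(y)\,\exp\!\big(\eta\,T(y) - a(\eta)\big)
```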
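Don't be intimidated: it's just three labelled parts, and each has a job. Let's name them with their real-world roles. η (eta) is the natural parameter, the single knob that selects which member of the family you get. T(y) is the sufficient statistic, the summary of the data that the distribution actually looks at (for the Gaussian, Bernoulli, and Poisson below it is simply y itself). a(η) is the log-partition function, the normalizer that makes the probabilities add up to one.
And b(y) is a base measure that doesn't depend on η, usually a constant or a simple factor. Pick a specific T, a, and b, and you've pinned down a family of distributions; varying η then sweeps through the members of that family.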
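To make "pick T, a, and b, then vary η" concrete, here is a minimal sketch of the template as a single function. The specific T, a, and b for the Bernoulli and Poisson are the standard textbook choices, stated here as assumptions rather than derived; `exp_family_prob` is a hypothetical helper name.

```python
import math

def exp_family_prob(y, eta, T, a, b):
    """Evaluate b(y) * exp(eta * T(y) - a(eta))."""
    return b(y) * math.exp(eta * T(y) - a(eta))

# Bernoulli: T(y) = y, a(eta) = log(1 + e^eta), b(y) = 1
bernoulli = dict(T=lambda y: y,
                 a=lambda eta: math.log1p(math.exp(eta)),
                 b=lambda y: 1.0)

# Poisson: T(y) = y, a(eta) = e^eta, b(y) = 1 / y!
poisson = dict(T=lambda y: y,
               a=lambda eta: math.exp(eta),
               b=lambda y: 1.0 / math.factorial(y))

eta = 0.7  # one arbitrary setting of the knob

# Probabilities over all outcomes should add up to one
# (the Poisson sum is truncated at 60, where the tail is negligible).
bernoulli_total = sum(exp_family_prob(y, eta, **bernoulli) for y in (0, 1))
poisson_total = sum(exp_family_prob(y, eta, **poisson) for y in range(60))

# The Poisson written the usual way, with rate lam = e^eta, for comparison.
lam = math.exp(eta)
textbook = lam**3 * math.exp(-lam) / math.factorial(3)
template = exp_family_prob(3, eta, **poisson)
print(bernoulli_total, poisson_total, textbook, template)
```

Changing `eta` moves you between members of the same family; swapping the `T`, `a`, `b` triple switches families.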
Each of these distributions is a member of the exponential family. Pick one and slide its natural parameter η — watch the same single knob reshape a bell curve, tilt a coin's bias, or stretch a count distribution. Different surfaces, identical underlying machinery.
Claims are cheap. Let's actually catch a familiar distribution wearing the exponential-family costume. Start with the Gaussian — the bell curve behind linear regression. To keep the algebra clean, we'll fix its spread to one (recall from the linear regression lesson that the spread didn't affect the best-fit weights anyway, so we lose nothing).
The bell curve's formula has the shape "a constant, times e raised to the power negative one-half times (the data minus the mean), squared." Our job is to algebraically twist that into the exponential-family template — "base, times e to the power (eta times the data, minus the normalizer)." The trick is to expand the square inside the exponent and sort the pieces by which ones touch the data and which touch the mean.
When you multiply out negative one-half times the squared difference, three terms appear: one with only the data squared, one with the data times the mean (the cross term), and one with only the mean squared. Group them by job:
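Spelled out for the unit-variance Gaussian with mean μ (a reconstruction in standard notation), the expansion and the grouping look like this:

```latex
-\tfrac{1}{2}(y-\mu)^2
  \;=\; \underbrace{-\tfrac{1}{2}y^2}_{\text{data only: absorbed into } b(y)}
  \;+\; \underbrace{\mu\,y}_{\text{cross term: } \eta\,T(y)}
  \;-\; \underbrace{\tfrac{1}{2}\mu^2}_{\text{mean only: } a(\eta)}
```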
The result, the Gaussian's secret identity:
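Matching the unit-variance Gaussian density against the template b(y)·exp(η·T(y) − a(η)) gives the following identification (standard notation, with μ the mean):

```latex
\begin{aligned}
\eta    &= \mu \\
T(y)    &= y \\
a(\eta) &= \tfrac{1}{2}\mu^2 = \tfrac{1}{2}\eta^2 \\
b(y)    &= \tfrac{1}{\sqrt{2\pi}}\,\exp\!\left(-\tfrac{1}{2}y^2\right)
\end{aligned}
```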
The headline is that first equation. For a Gaussian, the natural parameter simply IS the mean. The exotic-sounding η is, in this case, nothing but the average. Hold onto that — it's the reason linear regression's prediction will turn out to be the bare linear model with no squashing function at all.
Slide the mean. The bell curve slides with it, and because the natural parameter equals the mean for a Gaussian, the η readout below tracks it one-to-one. The normalizer a(η) = η²/2 grows as the mean moves away from zero, keeping the total area at exactly one.
Now the distribution behind classification: the Bernoulli, a single coin flip that lands 1 with probability φ (phi) and 0 otherwise. Let's catch it in the exponential-family costume too — and this time something wonderful falls out.
The Bernoulli probability can be written as "φ if y is 1, and one-minus-φ if y is 0," which (using the same powers-as-switches trick from the logistic lesson) becomes φ to the y, times one-minus-φ to the one-minus-y. To force this into the exponential form, we do the standard move: write it as e raised to the log of itself, then expand the log of the product into a sum of logs. After sorting the terms, the part multiplying the data y — the natural-parameter slot — reads:
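In symbols, the rewrite goes as follows (a standard derivation, reconstructed here in the lesson's notation):

```latex
\begin{aligned}
p(y;\phi) &= \phi^{y}(1-\phi)^{1-y} \\
          &= \exp\!\big(y\log\phi + (1-y)\log(1-\phi)\big) \\
          &= \exp\!\Big(y\,\log\frac{\phi}{1-\phi} + \log(1-\phi)\Big)
\end{aligned}
\qquad\Longrightarrow\qquad
\eta = \log\frac{\phi}{1-\phi}
```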
Stop and recognize this. The natural parameter of the Bernoulli is the log-odds — the logarithm of the probability of yes divided by the probability of no. That's the exact "logit" we met in the logistic regression lesson. The exponential family is telling us, on its own, that the natural way to parameterize a coin flip is by its log-odds, not its raw probability.
Now invert it. We have η in terms of φ; let's solve for φ in terms of η — "given the natural parameter, what's the probability?" Exponentiate both sides, then rearrange to isolate φ:
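Step by step, starting from η = log(φ / (1 − φ)):

```latex
\begin{aligned}
e^{\eta} &= \frac{\phi}{1-\phi} \\
e^{\eta}\,(1-\phi) &= \phi \\
\phi &= \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
\end{aligned}
```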
That is the sigmoid function. Exactly. The S-shaped squashing curve we pulled out of thin air in the logistic regression lesson was not an arbitrary choice after all — it is the inverse of the Bernoulli's natural-parameter map. The sigmoid is the function that converts a log-odds back into a probability, and it appears the instant you ask "the Bernoulli is an exponential-family distribution; what's the mean as a function of the natural parameter?" The mystery of "why the sigmoid?" is now solved: because the Bernoulli is in the exponential family, and the sigmoid is what its structure demands.
Left: the probability-to-log-odds map (the natural parameter). Right: its inverse, the sigmoid, mapping the natural parameter back to a probability. Slide the probability φ and watch the dot travel up the log-odds curve; the matching natural parameter feeds the sigmoid on the right and lands right back at φ. They undo each other.
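The "they undo each other" claim is easy to check numerically. A minimal sketch, with `logit` and `sigmoid` as illustrative helper names:

```python
import numpy as np

def logit(phi):
    """Probability -> log-odds (the Bernoulli's natural parameter)."""
    return np.log(phi / (1.0 - phi))

def sigmoid(eta):
    """Log-odds -> probability (the inverse map)."""
    return 1.0 / (1.0 + np.exp(-eta))

phi = np.linspace(0.01, 0.99, 99)
round_trip = sigmoid(logit(phi))          # phi -> eta -> back to phi
max_error = np.max(np.abs(round_trip - phi))
print(f"largest round-trip error: {max_error:.2e}")
```

Up to floating-point rounding, every probability comes back unchanged.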
We've seen the secret club and two of its members. Now the payoff: a recipe that turns "which distribution describes my target?" into "here's the regression model." A Generalized Linear Model is defined by exactly three assumptions. Memorize these three lines and you can build a model for almost any prediction problem.
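Stated compactly, and matching the summary table at the end of this lesson, the three assumptions are:

```latex
\begin{aligned}
&\text{1. } y \mid x;\theta \;\sim\; \text{ExponentialFamily}(\eta) \\
&\text{2. } h(x) = \mathbb{E}[\,y \mid x\,] \\
&\text{3. } \eta = \theta^{T}x
\end{aligned}
```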
That's the entire framework. Read it as a pipeline: your features get combined into a single number by the linear model (θᵀx), that number is the natural parameter η, and the distribution's structure then converts η into the predicted mean, which is your output. The only thing that changes from one GLM to the next is which distribution you plug into Assumption 1. Everything else is fixed.
Pick a distribution and slide the linear output θᵀx. Watch the same number flow through the same pipeline, where only the middle "response function" box differs, producing a price (identity), a probability (sigmoid), or a count rate (exponential).
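The pipeline can be sketched in a few lines. The weights and features below are invented for illustration, and `predict_mean` is a hypothetical helper name, not part of any library.

```python
import numpy as np

# Response functions: natural parameter eta -> predicted mean.
RESPONSE = {
    "gaussian": lambda eta: eta,                          # identity
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # sigmoid
    "poisson": lambda eta: np.exp(eta),                   # exponential
}

def predict_mean(theta, x, family):
    """GLM pipeline: features -> eta = theta^T x -> mean via the response."""
    eta = np.dot(theta, x)          # Assumption 3: eta is linear in x
    return RESPONSE[family](eta)    # Assumption 2: output the mean

theta = np.array([0.5, -1.0, 2.0])   # hypothetical weights
x = np.array([1.0, 0.5, 0.25])       # hypothetical features (first entry = intercept)

predictions = {family: float(predict_mean(theta, x, family)) for family in RESPONSE}
for family, value in predictions.items():
    print(f"{family:10s} -> {value:.4f}")
```

The same η (here 0.5) enters all three models; only the dictionary lookup differs.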
Let's prove the recipe works by deriving our two known models from it — watching linear and logistic regression appear, not as separate inventions, but as two settings of the same machine.
Choose the Gaussian for Assumption 1, because the target (a price) is continuous. Now turn the crank. Assumption 2 says the prediction is the expected value of y, which for a Gaussian is its mean. From Chapter 2, the Gaussian's mean equals its natural parameter η. And Assumption 3 says η equals θᵀx. Chain them together:
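Written as one chain of equalities (μ is the Gaussian's mean):

```latex
h(x) \;=\; \mathbb{E}[\,y \mid x\,] \;=\; \mu \;=\; \eta \;=\; \theta^{T}x
```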
The prediction is the bare linear model, no squashing. That is exactly ordinary linear regression. We didn't assume it — it fell out of "the target is Gaussian." The "identity response function" (output equals input) is just what the Gaussian's structure dictates.
Now choose the Bernoulli, because the target is yes/no. Assumption 2: the prediction is the expected value of y, which for a Bernoulli is the probability φ. From Chapter 3, φ equals the sigmoid of the natural parameter. Assumption 3: η equals θᵀx. Chain them:
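Written as one chain of equalities:

```latex
h(x) \;=\; \mathbb{E}[\,y \mid x\,] \;=\; \phi \;=\; \frac{1}{1+e^{-\eta}} \;=\; \frac{1}{1+e^{-\theta^{T}x}}
```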
The prediction is the linear model passed through a sigmoid. That is exactly logistic regression. Again we didn't assume the sigmoid — choosing "the target is Bernoulli" forced it. Same machine, same crank; only Assumption 1's distribution changed, and out came a completely different-looking model.
The same GLM pipeline, side by side. Toggle the distribution and watch the only thing that changes: the response box in the middle. Gaussian gives you a straight prediction line; Bernoulli bends it into the logistic S-curve. Same front-end θᵀx, same "predict the mean" back-end.
We keep using the phrase "response function." Let's name the two translators precisely, because the vocabulary is everywhere in statistics and you'll want to recognize it.
The response function is the map from the natural parameter to the mean — it answers "given η, what's the average target?" It's the box in the middle of our assembly line. For the Gaussian it's the identity (mean equals η, output equals input). For the Bernoulli it's the sigmoid. For the Poisson, as we'll see next chapter, it's the exponential.
The link function is its inverse — the map from the mean back to the natural parameter, answering "given the average target, what natural parameter produced it?" For the Gaussian it's the identity again; for the Bernoulli it's the logit (log-odds); for the Poisson it's the logarithm. The link function is the one that "linearizes" your target: apply the link to the mean, and the result is a plain linear function of x.
| Distribution | Model | Response (η → mean) | Link (mean → η) |
|---|---|---|---|
| Gaussian | Linear regression | identity | identity |
| Bernoulli | Logistic regression | sigmoid | logit (log-odds) |
| Poisson | Poisson regression | exponential | logarithm |
| Multinomial | Softmax regression | softmax | log-odds against a reference class |
When the link function is exactly the one the exponential-family math hands you for free — the inverse of the natural-parameter relationship — it's called the canonical link. Every model in that table uses its canonical link, which is why the math stays clean and the training rule stays simple. You can use a non-canonical link (statisticians sometimes do), but the canonical one is the default and the one that makes everything sing.
Time to cash in everything. Back in Chapter 0, neither linear nor logistic regression could model the count of customers per hour. Now you have the recipe, so let's build the right model on the spot — without deriving a single thing from scratch.
Three lines. One: counts of independent events are modeled by the Poisson distribution (its single parameter is the average rate of events), and the Poisson is in the exponential family. Two: predict the mean, that is, the expected count. Three: the natural parameter is θᵀx. The Poisson's response function (the inverse of its log link) is the exponential, so the model is:
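In symbols, with λ the Poisson rate (a standard rewrite, reconstructed here rather than quoted from the lesson):

```latex
\begin{aligned}
p(y;\lambda) &= \frac{\lambda^{y}e^{-\lambda}}{y!}
              = \frac{1}{y!}\exp\!\big(y\log\lambda - \lambda\big)
\quad\Longrightarrow\quad
\eta = \log\lambda,\;\; a(\eta) = e^{\eta},\;\; b(y) = \tfrac{1}{y!} \\[4pt]
h(x) &= \mathbb{E}[\,y \mid x\,] = \lambda = e^{\eta} = e^{\theta^{T}x}
\end{aligned}
```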
That's Poisson regression, and you just invented it in one line by turning the crank. Look at what the exponential buys you, and why it's exactly right for counts. The output e-to-something is always positive — you can never predict a negative number of customers, the flaw that killed the straight line in Chapter 0. And because it's exponential, the model says counts grow multiplicatively: each unit of ad spend multiplies the expected turnout by a fixed factor, rather than adding a fixed number. That matches how counts actually behave in the world.
Drag any point to move it. The orange curve is the fitted Poisson mean (always positive, exponential growth); toggle the linear fit to see it predict impossible negative counts. The Poisson model is the GLM you just derived — fit by the same gradient ascent as logistic regression.
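Outside the interactive lab, the same fit can be sketched in a few lines of NumPy. Everything about the data here is an assumption made for illustration: the sample size, the "true" weights, and the ad-spend range are invented, and the gradient is averaged over examples so that the step size does not depend on the dataset size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (invented for illustration): intercept column plus ad spend.
n = 200
ad_spend = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), ad_spend])
true_theta = np.array([0.5, 0.8])             # hypothetical "ground truth"
y = rng.poisson(np.exp(X @ true_theta))       # counts: whole numbers >= 0

# Gradient ascent on the average Poisson log-likelihood.
theta = np.zeros(2)
alpha = 0.05
for _ in range(5000):
    h = np.exp(X @ theta)                     # predicted mean count, always > 0
    theta += alpha * X.T @ (y - h) / n        # (target - prediction) * feature

per_unit_factor = np.exp(theta[1])            # multiplier per extra unit of ad spend
print("fitted theta:", theta)
print("each extra unit of ad spend multiplies the expected count by", per_unit_factor)
```

Because the prediction is e raised to a linear function, adding one unit of ad spend multiplies the predicted count by `exp(theta[1])` no matter where you start, which is the multiplicative growth described above.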
(No quiz — the lab is the test. If you can explain why the exponential response function is the right choice for count data, you understand GLMs.)
We opened with a mystery: linear and logistic regression had the identical update rule. Now we can finally solve it — and the solution covers every GLM, including the Poisson model you just built. The secret lives in that humble normalizer, the log-partition function a(η).
Here's the small miracle. For any exponential-family distribution, the log-partition function has a magical property: its derivative gives you the mean of the sufficient statistic T(y), and since T(y) is just y for the Gaussian, Bernoulli, and Poisson, that is the mean of the target itself. The function whose only job seemed to be "make probabilities sum to one" turns out to secretly encode the distribution's average. (Its second derivative gives the variance, but the mean is what we need here.)
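This property can be checked numerically without any calculus. The sketch below takes the standard log-partition functions for three families (stated as assumptions, matching the forms used in this lesson) and compares a finite-difference derivative of a(η) with the known mean.

```python
import numpy as np

# Log-partition functions a(eta) and the means they should produce.
families = {
    # name: (a(eta), mean as a function of eta)
    "gaussian (unit variance)": (lambda eta: 0.5 * eta**2,
                                 lambda eta: eta),
    "bernoulli": (lambda eta: np.log1p(np.exp(eta)),
                  lambda eta: 1.0 / (1.0 + np.exp(-eta))),
    "poisson": (lambda eta: np.exp(eta),
                lambda eta: np.exp(eta)),
}

etas = np.linspace(-2.0, 2.0, 9)
step = 1e-5
max_gap = {}
for name, (a, mean) in families.items():
    # Central finite difference approximates a'(eta)
    a_prime = (a(etas + step) - a(etas - step)) / (2 * step)
    max_gap[name] = float(np.max(np.abs(a_prime - mean(etas))))
    print(f"{name:26s} max |a'(eta) - mean| = {max_gap[name]:.2e}")
```

In each case the gap is at the level of finite-difference error: the derivative of the normalizer is the identity, the sigmoid, and the exponential respectively, which are exactly the three response functions.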
Now follow the consequence. To train any GLM by maximum likelihood, we take the slope of the log-likelihood with respect to each weight. Writing out the exponential-family form, taking the log, and differentiating, the only η-dependent pieces are "η times the data" and "minus a(η)." Differentiating the first with respect to η gives the data y; differentiating the second gives the derivative of a, which we just said is the mean, and the mean is exactly our prediction h(x). Because η = θᵀx, the chain rule then multiplies that difference by the feature attached to the weight in question. So the slope reduces to:
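For a single example, assuming T(y) = y as in the Gaussian, Bernoulli, and Poisson cases, the calculation is:

```latex
\frac{\partial}{\partial\theta_j}\log p(y \mid x;\theta)
  \;=\; \frac{\partial}{\partial\eta}\big(\eta\,y - a(\eta)\big)\cdot\frac{\partial\eta}{\partial\theta_j}
  \;=\; \big(y - a'(\eta)\big)\,x_j
  \;=\; \big(y - h(x)\big)\,x_j
```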
There it is. The error, times the feature. For every single GLM. The data minus the predicted mean, scaled by the input. Because the log-partition function's derivative is always the mean, the "predicted" term in the gradient is always the model's output, and the gradient always collapses to this same clean shape. Linear regression, logistic regression, Poisson regression, softmax — one gradient to rule them all.
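A skeptical reader can verify this with a gradient check. The sketch below builds the log-likelihood η·y − a(η) for three families (dropping the log b(y) term, which does not involve θ), differentiates it numerically, and compares the result with the universal formula. The data are random numbers generated only for the check, and the helper names are hypothetical.

```python
import numpy as np

# Per-family pieces: response (eta -> mean) and log-partition a(eta).
FAMILIES = {
    "gaussian": (lambda eta: eta, lambda eta: 0.5 * eta**2),
    "bernoulli": (lambda eta: 1.0 / (1.0 + np.exp(-eta)),
                  lambda eta: np.log1p(np.exp(eta))),
    "poisson": (lambda eta: np.exp(eta), lambda eta: np.exp(eta)),
}

def log_likelihood(theta, X, y, a):
    """Sum over examples of eta*y - a(eta), with eta = X @ theta."""
    eta = X @ theta
    return np.sum(eta * y - a(eta))

def universal_gradient(theta, X, y, response):
    """The claimed gradient: sum over examples of (y - h(x)) * x."""
    return X.T @ (y - response(X @ theta))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
theta = rng.normal(scale=0.3, size=3)
targets = {
    "gaussian": rng.normal(size=20),
    "bernoulli": rng.integers(0, 2, size=20).astype(float),
    "poisson": rng.poisson(2.0, size=20).astype(float),
}

step = 1e-6
gaps = {}
for name, (response, a) in FAMILIES.items():
    y = targets[name]
    numeric = np.zeros(3)
    for j in range(3):
        bump = np.zeros(3)
        bump[j] = step
        # Central finite difference in the j-th weight
        numeric[j] = (log_likelihood(theta + bump, X, y, a)
                      - log_likelihood(theta - bump, X, y, a)) / (2 * step)
    gaps[name] = float(np.max(np.abs(numeric - universal_gradient(theta, X, y, response))))
    print(f"{name:10s} max gradient gap = {gaps[name]:.2e}")
```

The same `universal_gradient` function matches the numerical slope for all three log-likelihoods; only the response passed into it changes.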
You came in with two algorithms that mysteriously shared a training rule. You leave with a framework that explains the mystery and hands you new models on demand. That's a genuine leap in understanding — from collecting models to generating them.
| Idea | What it says |
|---|---|
| Exponential family | Many distributions share the form b(y)·exp(ηT(y) − a(η)). η = natural parameter, T = sufficient statistic, a = log-partition (normalizer). |
| GLM Assumption 1 | Target y is exponential-family given x (you pick the distribution). |
| GLM Assumption 2 | Predict the mean: h(x) = E[y \| x]. |
| GLM Assumption 3 | Natural parameter is linear: η = θᵀx. |
| Response function | Maps η → mean. Identity (Gaussian), sigmoid (Bernoulli), exp (Poisson), softmax (multinomial). |
| Link function | The inverse: maps mean → η. Identity, logit, log, log-odds against a reference class. |
| Universal gradient | (y − h(x))·x_j, because a′(η) = mean. Same for every GLM. |
```python
import numpy as np

# The ONLY thing that changes between GLMs is the response function.
responses = {
    'gaussian':  lambda z: z,                      # identity → linear regression
    'bernoulli': lambda z: 1 / (1 + np.exp(-z)),   # sigmoid  → logistic regression
    'poisson':   lambda z: np.exp(z),              # exp      → Poisson regression
}

def fit_glm(X, y, family, alpha=0.01, iters=3000):
    g = responses[family]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = g(X @ theta)                  # predicted mean: the ONE line that varies
        theta += alpha * X.T @ (y - h)    # the universal (y − h)·x gradient
    return theta

# statsmodels exposes the whole family directly
# (X and y are your own feature matrix and targets):
import statsmodels.api as sm
sm.GLM(y, X, family=sm.families.Poisson()).fit()   # swap Poisson→Gaussian→Binomial
```
"The purpose of models is not to fit the data but to sharpen the questions." — Samuel Karlin. GLMs sharpen one question to a point: what kind of thing is my target, really? Answer that, and the model writes itself.