How a line learns to say yes or no. Squash a weighted sum into a probability, derive the loss from maximum likelihood, and meet the update rule that — astonishingly — is identical to linear regression's. The doorway from regression to all of classification.
In the last lesson you predicted a house's price — a continuous number that could be anything from $200k to $540k. Now a different kind of question: an email arrives. Is it spam? The answer isn't a number on a sliding scale. It's a verdict: yes or no. 1 or 0.
This is classification — predicting which of a small set of categories an input belongs to. Spam vs. not-spam. Tumor vs. benign. Fraud vs. legitimate. The target y is no longer a real number; it's a label, here y ∈ {0, 1}. By convention we call y = 1 the positive class and y = 0 the negative class (the names are just labels — "positive" doesn't mean "good," it means "the thing we're detecting").
Here's the tempting shortcut: we already know how to fit a line. Why not just run linear regression on the 0/1 labels and call it spam if the prediction is above 0.5? Let's try exactly that and watch it fall apart — because the way it fails tells us precisely what we need to fix.
Each dot is an email: position = some feature (say, count of the word "free"), height = its label (0 = not spam at the bottom, 1 = spam at the top). The straight line is linear regression's best fit. Drag the far-right point rightward — a single obvious-spam email far out — and watch the line tilt, dragging its 0.5 crossing the wrong way and misclassifying points it had right before.
Two things just broke, and both are fatal:
The fix, in one sentence: instead of letting the line's output fly off to ±∞, squash it through an S-shaped function that gently saturates at 0 and 1. That function is the sigmoid, and bolting it onto our linear model turns "predict a number" into "predict a probability." That single change — plus a loss derived from maximum likelihood — is the whole of logistic regression, the workhorse classifier behind credit scoring, medical diagnosis, ad click prediction, and the final layer of countless neural networks.
We need a function that takes any real number — the unbounded score θTx from our linear model — and gently bends it into the range (0, 1). It should map large positive scores near 1 ("definitely spam"), large negative scores near 0 ("definitely not"), and a score of 0 to exactly 0.5 ("a coin flip"). The function that does this, and does it most naturally, is the logistic function, also called the sigmoid (Greek for "S-shaped"):
Let's sanity-check its behavior by plugging in three values. The whole function lives or dies on the term e−z in the denominator:
# z = 0: e⁻⁰ = 1 → g = 1/(1+1) = 0.5 (maximum uncertainty) # z = +5: e⁻⁵ ≈ 0.0067 → g = 1/1.0067 ≈ 0.993 (almost certainly class 1) # z = −5: e⁻⁽⁻⁵⁾ = e⁵ ≈ 148 → g = 1/149 ≈ 0.0067 (almost certainly class 0)
As z → +∞, e−z → 0 so g → 1. As z → −∞, e−z → ∞ so g → 0. And it passes smoothly through 0.5 at z = 0. No matter how extreme the input, the output is trapped in (0, 1) — exactly the leash we needed. Our hypothesis becomes the linear score, squashed:
We keep every convention from linear regression — including the bias trick x0 = 1, so θTx = θ0 + θ1x1 + …. The only change is wrapping that weighted sum in g. Everything you learned about θTx as a dot product still holds; we've just added a final squash.
The S-curve is g(z). Move the input z slider to see the marker ride the curve toward 0 or 1. Then change the weight magnitude — bigger weights make the transition steeper (more decisive, sharper boundary); near-zero weights flatten it toward a constant 0.5 (the model "shrugs").
We'll need the derivative of g for training (next chapters). It has a famously beautiful form — the sigmoid's slope can be written entirely in terms of the sigmoid's own value. Let's derive it, every step, because the result is what makes the logistic update rule come out so simply:
So g′(z) = g(z)(1 − g(z)). The slope of the sigmoid at any point is its output times one-minus-its-output. It's biggest (= 0.25) at z = 0 where g = 0.5, and vanishes toward the flat tails where g is near 0 or 1. Remember this identity — it's the hinge that, in Chapter 4, collapses a scary-looking gradient into the elegant (y − h)x.
Now the geometry. With two features x1, x2 (say "count of the word free" and "count of links"), every email is a point in a 2D plane. The model h(x) = g(θ0 + θ1x1 + θ2x2) assigns each point a probability. Where does it switch from "probably not spam" to "probably spam"? Exactly where the probability crosses 0.5 — and we showed g(z) = 0.5 precisely when z = 0. So the decision boundary is the set of points where
That's the equation of a straight line (in higher dimensions, a flat hyperplane). On one side θTx > 0 so h > 0.5 → predict class 1; on the other side θTx < 0 so h < 0.5 → predict class 0. This is the key picture: logistic regression carves the feature space with a straight cut, and the sigmoid paints a smooth probability gradient that's steepest right at the cut and saturates as you move away from it.
Blue dots are class 0, orange dots are class 1. The background shading is the predicted P(y=1) — teal where the model says "class 0", warm where it says "class 1", with the white line the 0.5 boundary. Tune the three weights and watch the boundary rotate and slide. Try to separate the two clouds. Notice the shading sharpens as you grow the weights.
There's a deeper reading of θTx. Start from h = 1/(1+e−z) with z = θTx, and solve for z in terms of the probability. A few lines of algebra (invert the sigmoid):
So θTx = log( h / (1−h) ). The quantity h/(1−h) is the odds of class 1 (probability of yes divided by probability of no — "3-to-1 odds" means h = 0.75). Its logarithm is the log-odds or logit. The punchline: logistic regression is a plain linear model in log-odds space. Each weight θj says how many units the log-odds of spam shift per unit of feature j. That's why it's "linear" regression — the linearity lives in the log-odds, and the sigmoid is just the inverse map back to probability.
python import numpy as np def sigmoid(z): return 1 / (1 + np.exp(-z)) def predict_proba(theta, x): return sigmoid(theta @ x) # P(y=1 | x) def predict_label(theta, x): return 1 if theta @ x >= 0 else 0 # θᵀx ≥ 0 ⇔ proba ≥ 0.5
We have a model that outputs probabilities. Now we need a loss — a score of how wrong the model is — so we can train it. The obvious move is to reuse squared error from the last lesson: penalize (h(x) − y)2. Resist it. With the sigmoid inside, squared error becomes a non-convex function of θ — a bumpy landscape full of flat spots and local dips where gradient descent stalls. We need a loss that's convex (one bowl, like before) and that respects the probabilistic meaning of h. Maximum likelihood — the same principle that justified least squares in the last lesson — hands us exactly that.
Our model is a probability statement. It says P(y = 1 | x; θ) = h(x), and therefore P(y = 0 | x; θ) = 1 − h(x). Here's a slick trick to write both cases in one formula, using the fact that anything to the power 0 is 1:
Check it. If y = 1: the formula is h1(1−h)0 = h — correct. If y = 0: it's h0(1−h)1 = 1−h — correct. One expression, both labels. The exponents act as switches that turn each factor on or off.
Assuming the n training emails are independent, the probability of the entire labeled dataset is the product over all of them. Viewed as a function of θ, that's the likelihood:
As before, products of many small numbers are miserable to differentiate, so take the log — turning the product into a sum and the powers into multipliers. The log-likelihood:
Maximum likelihood says: pick the θ that makes this as large as possible (the data most probable). Equivalently, flip the sign and minimize its negative — and the negative of that expression is the famous cross-entropy loss (also called log loss or binary cross-entropy), the single most-used classification loss in machine learning. For one example:
Look at what it does to a single confident-but-wrong prediction. Suppose the true label is y = 1 but the model says h = 0.01 ("almost certainly not spam"). The loss is −log(0.01) ≈ 4.6 — a big penalty. Push the model to h = 0.0001 and the loss rockets to ≈9.2. As h → 0 while y = 1, the loss → +∞. Cross-entropy punishes confident wrongness without mercy — being sure and mistaken is the worst thing a probabilistic classifier can do, and the loss agrees. Meanwhile, when the model is right and confident (h → 1 with y = 1), −log(1) = 0: no penalty. That asymmetry — gentle on correct confidence, brutal on wrong confidence — is exactly what you want, and it falls straight out of the −log.
For a true label, slide the model's predicted probability and watch the cross-entropy penalty. Flip the true label to see both branches: when y=1 the loss is −log(h) (explodes as h→0); when y=0 it's −log(1−h) (explodes as h→1). The penalty is 0 only when the model is confidently right.
python def cross_entropy(h, y, eps=1e-12): h = np.clip(h, eps, 1-eps) # avoid log(0) return -(y*np.log(h) + (1-y)*np.log(1-h)) cross_entropy(0.99, 1) # 0.01 — confident & right: tiny loss cross_entropy(0.01, 1) # 4.61 — confident & WRONG: huge loss cross_entropy(0.50, 1) # 0.69 — a coin flip: ln(2)
We have the log-likelihood ℓ(θ) to maximize. As in the last lesson, we climb it by gradient ascent: take a step in the direction that increases ℓ. (Ascent, not descent — note the plus sign — because we're maximizing now, not minimizing.) The update is θj := θj + α · ∂ℓ/∂θj. So we need that derivative. Brace for it to look ugly — and then watch the sigmoid's magic identity from Chapter 1 demolish the ugliness.
Work with a single example (x, y) so we can drop the sum. We differentiate ℓ = y log g(θTx) + (1−y) log(1−g(θTx)), writing g for g(θTx):
The g(1−g) from the sigmoid's derivative is precisely what's needed to cancel the 1/g and 1/(1−g) from the log's derivative. Everything collapses. Since g = h(x), the gradient is simply:
which gives the stochastic gradient-ascent update rule:
Look hard at that update. Compare it to the LMS rule from the linear regression lesson: θj := θj + α(y − h)xj. They are character-for-character identical. Same learning rate, same (target − prediction) error, same feature gating. Yet these are completely different algorithms: linear regression's h is the raw line θTx; logistic regression's h is the squashed sigmoid g(θTx). Two different models, two different losses, derived from two different probability assumptions — and out pops the same update rule.
Is this a coincidence? No. It's the first visible crack of a deep structure. Both Gaussian-noise regression and Bernoulli classification are members of one family — the exponential family — and for every member, maximum-likelihood gradient ascent produces this same "(error) × (feature)" update. That unifying framework is Generalized Linear Models, the entire subject of the next lesson. For now, savor the surprise: the rule you learned for predicting house prices also trains a spam filter, unchanged.
Let's run a single logistic update with numbers. One example: x = (1, 2) (bias + one feature), true label y = 1, current θ = (0, 0), learning rate α = 0.5.
Step 1 — score and squash. z = θTx = 0·1 + 0·2 = 0. h = g(0) = 1/(1+e0) = 1/2 = 0.5. (At θ = 0 the model is maximally uncertain — sensible.)
Step 2 — error. y − h = 1 − 0.5 = 0.5. The true label is 1, the model said 0.5, so push the probability up.
Step 3 — update each weight via θj += α(y−h)xj:
θ₀ := 0 + 0.5 × 0.5 × 1 = 0.25 θ₁ := 0 + 0.5 × 0.5 × 2 = 0.50
Step 4 — verify. New z = 0.25·1 + 0.50·2 = 1.25; new h = g(1.25) = 1/(1+e−1.25) ≈ 0.777. The predicted probability of the (correct) class rose from 0.50 to 0.78 in one step. The model is learning to say "yes."
python def logreg_step(theta, x, y, alpha): h = sigmoid(theta @ x) # squashed prediction return theta + alpha * (y - h) * x # IDENTICAL shape to LMS theta = np.zeros(2) print(logreg_step(theta, np.array([1.,2.]), 1, 0.5)) # [0.25 0.5] ✓
Time to put it all in motion. Below is a live logistic-regression trainer. Two clouds of points — class 0 (blue) and class 1 (orange) — sit in the feature plane. The model starts with random weights, so its decision boundary is nonsense. Press Train and watch gradient ascent on the log-likelihood swing the boundary into place, the probability field sharpen, and the log-likelihood climb. Everything from Chapters 1–4 happening at once.
Click empty space to add a point of the selected class · drag a point to move it · shift-click to delete. Background = predicted P(y=1); white line = decision boundary.
One thing to watch for, because it's the soul of the algorithm: as training proceeds, the points the model is already sure about stop contributing. Recall the update weight is (y − h) — the error. For a point deep in its own territory, h is already ~1 (or ~0) and matches y, so (y − h) ≈ 0: that point stops tugging on the boundary. The only points still exerting force are the ones near the boundary, where the model is uncertain. Logistic regression naturally focuses its attention on the hard, borderline cases — the same instinct that, taken to its logical extreme, gives the support vector machine.
(No quiz — the lab is the test. If you can predict how the boundary will move before pressing Train, you understand logistic regression.)
Let's take a short, illuminating detour back in time. Logistic regression's sigmoid gives a soft answer — "73% spam." What if we force a hard verdict instead, replacing the smooth S-curve with a brutal step function that outputs only 0 or 1?
Keep everything else identical — same hypothesis h(x) = g(θTx), same update rule θj := θj + α(y − h(x))xj. With this threshold g, the algorithm is the perceptron, invented by Frank Rosenblatt in 1958 and once breathlessly hailed as a model of how neurons in the brain learn. It is the great-grandparent of every neural network alive today.
Because h now outputs only 0 or 1, the error term (y − h) takes only three values, and the update becomes almost comically simple:
Two separable classes. Press Step to process one point: if it's misclassified (flashing ring), the boundary jumps to fix it; if correct, nothing happens. Watch the line lurch only on errors and freeze once everything is classified right. Hit Make non-separable to overlap the classes — now the boundary never stops twitching, because there's always a mistake to chase.
The perceptron has a celebrated guarantee, the Perceptron Convergence Theorem: if the data is linearly separable, this mistake-driven rule will find a separating boundary in a finite number of steps — full stop. But the flip side is its fatal flaw: if the data is not separable (even one stray point), it never converges — it cycles forever, chasing the unfixable mistake. This brittleness, exposed in Minsky and Papert's 1969 book (the perceptron can't even learn XOR), helped trigger the first "AI winter."
Spam-or-not has two classes. But what about classifying an email into spam, personal, or work — or an image into one of 1,000 object categories? We need to generalize from a yes/no sigmoid to a "pick one of k" model. The generalization is one of the most important functions in all of deep learning: the softmax.
The idea: instead of one weight vector θ, keep k of them — θ1, …, θk, one per class. Each produces a raw score (a logit) ti = θiTx measuring how much x looks like class i. But raw scores can be negative and don't sum to 1, so they aren't probabilities. The softmax fixes both problems at once — exponentiate (forcing positivity) and normalize (forcing them to sum to 1):
The exp makes every score positive and amplifies differences (a slightly larger logit becomes a much larger probability — "winner takes more"); dividing by the sum guarantees the k outputs form a valid probability distribution. With k = 2, softmax collapses exactly back to the sigmoid — it's the honest generalization, not a different idea.
Three classes, three weight vectors. The plane is colored by which class wins; the shading shows confidence (vivid = sure, washed-out = a three-way tie near the meeting point). Slide the logits for the query point (the white dot) and watch the probability bars renormalize — pushing one logit up steals probability from the others.
The loss for the multiclass case is cross-entropy again — the negative log-probability the model assigned to the true class y: ℓCE = −log φy = −log( exp(ty) / ∑j exp(tj) ). And its gradient with respect to logit i is breathtakingly simple:
where 1{y = i} is the indicator — 1 if i is the true class, 0 otherwise (the "one-hot" label). In words: the gradient is predicted probability minus the true (one-hot) label — written compactly, φ − ey. This is the exact same "(prediction − target)" shape as the binary case and as LMS. Push down the probabilities of wrong classes, push up the probability of the right one, in proportion to how off you are. Chain it through ti = θiTx and you get ∂ℓ/∂θi = (φi − 1{y=i})·x — the universal update, one more time.
python def softmax(t): t = t - t.max() # subtract max for numerical stability e = np.exp(t); return e / e.sum() def softmax_ce_grad(Theta, x, y, k): phi = softmax(Theta @ x) # predicted class probabilities, shape (k,) onehot = np.eye(k)[y] # true label as a one-hot vector return np.outer(phi - onehot, x) # ∂ℓ/∂Θ = (φ − e_y) xᵀ
Gradient ascent is reliable but slow — it only knows the slope, so near a gently-curving maximum it takes many timid steps. What if we also used the curvature — how fast the slope itself is changing — to jump intelligently? That's Newton's method, and for logistic regression it converges in a handful of iterations where gradient ascent needs hundreds.
Start with the classic use of Newton's method: finding a zero of a function f, i.e. solving f(θ) = 0. The trick: at your current guess, replace the curvy f with its tangent line, and jump to where that line crosses zero. The tangent has slope f′(θ), so a little geometry gives the update:
To maximize ℓ(θ) rather than zero a function, recall a maximum sits where the derivative is zero. So apply Newton's zero-finder to f = ℓ′: we want ℓ′(θ) = 0. Substituting f = ℓ′ (so f′ = ℓ″) gives the optimization form:
The update divides the slope by the curvature. Where the function is sharply curved (large ℓ″), it takes cautious steps; where it's nearly flat, it takes bold ones — automatically, with no learning rate to tune. That's the magic: Newton's method has no α. The curvature sets the step size for you.
Both climb the same 1-D log-likelihood toward its peak (green line). Press Step and watch Newton leap nearly to the top in one or two jumps by following the tangent-to-the-derivative, while gradient ascent inches along. Newton uses the curvature; gradient ascent doesn't.
With many parameters, the slope ℓ′ becomes the gradient vector ∇θℓ, and the curvature ℓ″ becomes the Hessian matrix H — the matrix of all second partial derivatives, Hij = ∂2ℓ / ∂θi∂θj. The update generalizes to the Newton–Raphson rule:
Applied to logistic regression's log-likelihood, this is called Fisher scoring (or IRLS — iteratively reweighted least squares). It typically converges in well under ten iterations. The cost: each step must build and invert the d×d Hessian, which is O(d3) — cheap when features number in the dozens, but ruinous for the millions of parameters in a neural net. That's the trade-off in one line: Newton uses second-order information for blistering convergence but pays a d3 price per step; gradient descent is first-order, slow but O(d) and scalable. It's exactly why deep learning runs on (stochastic) gradient descent and its cousins, not Newton's method — the same scalability story from the normal-equations-vs-GD discussion in the last lesson.
You can now build a classifier from zero, derive its loss from first principles, and train it three ways. Before the cheat sheet, the failure modes — because knowing where logistic regression breaks is what makes you trust it where it doesn't.
If the two classes are perfectly separable, there's no finite "best" θ. The model can always raise the log-likelihood a hair by scaling θ larger — making the boundary sharper and pushing every correct probability closer to a perfect 0 or 1. So gradient ascent drives |θ| → ∞ and never settles (you saw this in the Ch.5 lab with well-separated clouds). The cure is regularization: add a penalty λ|θ|2 on the weights' size, so the model balances fitting the data against staying modest. This also gives more sensible, better-calibrated probabilities.
If 99% of emails are non-spam, a model can score 99% accuracy by predicting "not spam" always — useless. Cross-entropy and class-weighting help, but you must watch precision/recall, not raw accuracy. And remember Ch.2: the boundary is always a straight hyperplane. For tangled classes (XOR, concentric rings), feed in nonlinear features or move to a kernel method or neural net — logistic regression on the right features is astonishingly strong, but on raw tangled inputs it's helpless.
| Concept | Formula | In words |
|---|---|---|
| Sigmoid | g(z) = 1/(1+e−z) | Squash any score into (0,1) |
| Hypothesis | h(x) = g(θTx) = P(y=1|x) | Predicted probability of the positive class |
| Decision boundary | θTx = 0 | Straight line / hyperplane where h = 0.5 |
| Sigmoid derivative | g′ = g(1−g) | Slope in terms of the output |
| Cross-entropy loss | −[y log h + (1−y) log(1−h)] | MLE under a Bernoulli model; convex |
| Update rule | θj := θj + α(y−h)xj | Identical to LMS — (error)×(feature) |
| Softmax (k classes) | φi = eti/∑etj | Probabilities over k classes; gradient φ−ey |
| Newton step | θ := θ − H−1∇ℓ | Use curvature; few iterations, O(d³) each |
python import numpy as np def sigmoid(z): return 1/(1+np.exp(-z)) def fit_logistic(X, y, alpha=0.1, iters=2000): theta = np.zeros(X.shape[1]) for _ in range(iters): h = sigmoid(X @ theta) # predicted probabilities theta += alpha * X.T @ (y - h) # gradient ASCENT: + (y−h)·x summed return theta # sklearn one-liner (uses a fast solver like L-BFGS + L2 regularization): from sklearn.linear_model import LogisticRegression clf = LogisticRegression().fit(X_raw, y) clf.predict_proba(X_new) # calibrated P(y=1)
"The perceptron may eventually be able to learn, make decisions, and translate languages." — The New York Times, 1958. They were early, but not wrong — the line that learned to say yes became the last layer of every neural network.