Logistic Regression & the Perceptron — From Absolute Zero to Mastery

Chapter 0: The Spam Problem — When the Answer Is Yes or No

In the last lesson you predicted a house's price — a continuous number that could be anything from $200k to $540k. Now a different kind of question: an email arrives. Is it spam? The answer isn't a number on a sliding scale. It's a verdict: yes or no. 1 or 0.

This is classification — predicting which of a small set of categories an input belongs to. Spam vs. not-spam. Tumor vs. benign. Fraud vs. legitimate. The target y is no longer a real number; it's a label, here y ∈ {0, 1}. By convention we call y = 1 the positive class and y = 0 the negative class (the names are just labels — "positive" doesn't mean "good," it means "the thing we're detecting").

Here's the tempting shortcut: we already know how to fit a line. Why not just run linear regression on the 0/1 labels and call it spam if the prediction is above 0.5? Let's try exactly that and watch it fall apart — because the way it fails tells us precisely what we need to fix.

Why a straight line can't classify

Each dot is an email: position = some feature (say, count of the word "free"), height = its label (0 = not spam at the bottom, 1 = spam at the top). The straight line is linear regression's best fit. Drag the far-right point rightward — a single obvious-spam email far out — and watch the line tilt, dragging its 0.5 crossing the wrong way and misclassifying points it had right before.

Line's value at the far point: — Points misclassified by the 0.5 rule: —

Two things just broke, and both are fatal:

The output leaves [0, 1]. The line happily predicts 1.4 or −0.3. But what is a "1.4 probability of spam"? Meaningless. A 0/1 problem demands an output that stays between 0 and 1 so we can read it as a probability.
One confident example wrecks the boundary. Squared error punishes the line for being "too right" — a clear-spam email far out on the axis has a big residual against the line (which only reaches ~1), so the line tilts to chase it, moving the decision threshold and flipping easy points to the wrong side. Classification shouldn't care how far past the boundary a point is, only which side it's on.

The fix, in one sentence: instead of letting the line's output fly off to ±∞, squash it through an S-shaped function that gently saturates at 0 and 1. That function is the sigmoid, and bolting it onto our linear model turns "predict a number" into "predict a probability." That single change — plus a loss derived from maximum likelihood — is the whole of logistic regression, the workhorse classifier behind credit scoring, medical diagnosis, ad click prediction, and the final layer of countless neural networks.

The plan. Ch.1: the sigmoid squashing function. Ch.2: how θ^Tx becomes a probability and draws a decision boundary. Ch.3–4: the right loss (cross-entropy, from MLE) and the update rule — which turns out to be identical to linear regression's LMS rule, a beautiful surprise we'll explain. Ch.5: train one live. Ch.6: the perceptron (logistic regression's hard-edged ancestor). Ch.7: softmax for many classes. Ch.8: Newton's method. The arc: from "a line that says yes/no" to the foundation of all classification.

Common misconception: "Logistic regression is for regression — it's in the name." No — despite the name, logistic regression is a classification algorithm. The "regression" refers to the fact that it models a continuous quantity under the hood (the probability, or more precisely the log-odds), but its job is to assign discrete class labels. Blame a century-old naming convention.

Why does plain linear regression fail when applied directly to 0/1 classification labels?

Linear regression can't handle two features at once. Because 0 and 1 are too close together for a line to separate. Its output isn't bounded to [0,1] (so it can't be a probability), and squared error lets a single far-out point tilt the line and move the decision boundary. Linear regression has no parameters to train.

Chapter 1: The Sigmoid — A Squashing Function

We need a function that takes any real number — the unbounded score θ^Tx from our linear model — and gently bends it into the range (0, 1). It should map large positive scores near 1 ("definitely spam"), large negative scores near 0 ("definitely not"), and a score of 0 to exactly 0.5 ("a coin flip"). The function that does this, and does it most naturally, is the logistic function, also called the sigmoid (Greek for "S-shaped"):

g(z) = 1 / (1 + e^−z)

Let's sanity-check its behavior by plugging in three values. The whole function lives or dies on the term e^−z in the denominator:

# z = 0:   e⁻⁰ = 1            → g = 1/(1+1)   = 0.5   (maximum uncertainty)
# z = +5:  e⁻⁵ ≈ 0.0067       → g = 1/1.0067  ≈ 0.993 (almost certainly class 1)
# z = −5:  e⁻⁽⁻⁵⁾ = e⁵ ≈ 148  → g = 1/149     ≈ 0.0067 (almost certainly class 0)

As z → +∞, e^−z → 0 so g → 1. As z → −∞, e^−z → ∞ so g → 0. And it passes smoothly through 0.5 at z = 0. No matter how extreme the input, the output is trapped in (0, 1) — exactly the leash we needed. Our hypothesis becomes the linear score, squashed:

h_θ(x) = g(θ^Tx) = 1 / (1 + e^{−θ^Tx})

We keep every convention from linear regression — including the bias trick x₀ = 1, so θ^Tx = θ₀ + θ₁x₁ + …. The only change is wrapping that weighted sum in g. Everything you learned about θ^Tx as a dot product still holds; we've just added a final squash.

The sigmoid, and what the weights do to it

The S-curve is g(z). Move the input z slider to see the marker ride the curve toward 0 or 1. Then change the weight magnitude — bigger weights make the transition steeper (more decisive, sharper boundary); near-zero weights flatten it toward a constant 0.5 (the model "shrugs").

input z1.5

weight scale1.0

g(z) = — → predicted P(y=1) = —

A derivative so clean it feels like a gift

We'll need the derivative of g for training (next chapters). It has a famously beautiful form — the sigmoid's slope can be written entirely in terms of the sigmoid's own value. Let's derive it, every step, because the result is what makes the logistic update rule come out so simply:

g′(z) = d/dz · (1 + e^−z)⁻¹
= −(1 + e^−z)⁻² · (−e^−z) ← chain rule; derivative of e^−z is −e^−z
= (1 / (1 + e^−z)) · (e^−z / (1 + e^−z)) ← split the fraction
= g(z) · (1 − g(z)) ← because e^−z/(1+e^−z) = 1 − g(z)

So g′(z) = g(z)(1 − g(z)). The slope of the sigmoid at any point is its output times one-minus-its-output. It's biggest (= 0.25) at z = 0 where g = 0.5, and vanishes toward the flat tails where g is near 0 or 1. Remember this identity — it's the hinge that, in Chapter 4, collapses a scary-looking gradient into the elegant (y − h)x.

Common misconception: "The sigmoid's output of, say, 0.7 is the model's accuracy or confidence score I can read loosely." It's a specific claim: h(x) is the model's estimated probability that y = 1. A well-trained logistic model is calibrated — among all emails it scores 0.7, about 70% really are spam. That probabilistic meaning (which a raw neural-net logit lacks) is exactly what we'll earn in Chapter 3 by deriving the loss from maximum likelihood.

What is the derivative of the sigmoid g(z) = 1/(1+e^−z)?

g′(z) = e^−z g′(z) = g(z)(1 − g(z)) — the output times one minus the output g′(z) = 1/(1+z) g′(z) = g(z)²

Chapter 2: From Scores to a Decision Boundary

Now the geometry. With two features x₁, x₂ (say "count of the word free" and "count of links"), every email is a point in a 2D plane. The model h(x) = g(θ₀ + θ₁x₁ + θ₂x₂) assigns each point a probability. Where does it switch from "probably not spam" to "probably spam"? Exactly where the probability crosses 0.5 — and we showed g(z) = 0.5 precisely when z = 0. So the decision boundary is the set of points where

θ^Tx = θ₀ + θ₁x₁ + θ₂x₂ = 0

That's the equation of a straight line (in higher dimensions, a flat hyperplane). On one side θ^Tx > 0 so h > 0.5 → predict class 1; on the other side θ^Tx < 0 so h < 0.5 → predict class 0. This is the key picture: logistic regression carves the feature space with a straight cut, and the sigmoid paints a smooth probability gradient that's steepest right at the cut and saturates as you move away from it.

The weights have a crisp geometric meaning. The vector (θ₁, θ₂) points perpendicular to the boundary, in the direction of increasing probability — it's the "this way to spam" arrow. Its length sets how fast probability ramps from 0 to 1 as you cross (big weights = a knife-edge boundary; small weights = a gentle fade). And θ₀ shifts the boundary toward or away from the origin. Same three-knob intuition as linear regression, now drawing a frontier instead of a trend.

The decision boundary and the probability field

Blue dots are class 0, orange dots are class 1. The background shading is the predicted P(y=1) — teal where the model says "class 0", warm where it says "class 1", with the white line the 0.5 boundary. Tune the three weights and watch the boundary rotate and slide. Try to separate the two clouds. Notice the shading sharpens as you grow the weights.

θ₀ (bias)0.0

θ₁1.2

θ₂1.2

Training accuracy: —

The log-odds interpretation

There's a deeper reading of θ^Tx. Start from h = 1/(1+e^−z) with z = θ^Tx, and solve for z in terms of the probability. A few lines of algebra (invert the sigmoid):

h = 1/(1+e^−z) ⇒ 1 + e^−z = 1/h ⇒ e^−z = (1−h)/h
⇒ e^z = h/(1−h) ⇒ z = log( h / (1−h) )

So θ^Tx = log( h / (1−h) ). The quantity h/(1−h) is the odds of class 1 (probability of yes divided by probability of no — "3-to-1 odds" means h = 0.75). Its logarithm is the log-odds or logit. The punchline: logistic regression is a plain linear model in log-odds space. Each weight θ_j says how many units the log-odds of spam shift per unit of feature j. That's why it's "linear" regression — the linearity lives in the log-odds, and the sigmoid is just the inverse map back to probability.

python
import numpy as np
def sigmoid(z): return 1 / (1 + np.exp(-z))

def predict_proba(theta, x):
    return sigmoid(theta @ x)        # P(y=1 | x)
def predict_label(theta, x):
    return 1 if theta @ x >= 0 else 0   # θᵀx ≥ 0  ⇔  proba ≥ 0.5

Common misconception: "Logistic regression can fit any curvy boundary between the classes." Its boundary is always a straight line / flat hyperplane in the feature space, because θ^Tx = 0 is a linear equation. If two classes are tangled in a way no straight cut separates (think a bullseye pattern), basic logistic regression cannot do it — you'd first map the data through nonlinear features (x₁², x₁x₂, …), exactly as we made linear regression fit curves in the last lesson.

In logistic regression, the decision boundary (where the model switches its predicted class) is the set of points satisfying:

h(x) = 1 θ^Tx = 1 θ^Tx = 0, equivalently h(x) = 0.5 — a straight line / hyperplane e^{−θ^Tx} = 0

Chapter 3: The Right Loss — Cross-Entropy from Maximum Likelihood

We have a model that outputs probabilities. Now we need a loss — a score of how wrong the model is — so we can train it. The obvious move is to reuse squared error from the last lesson: penalize (h(x) − y)². Resist it. With the sigmoid inside, squared error becomes a non-convex function of θ — a bumpy landscape full of flat spots and local dips where gradient descent stalls. We need a loss that's convex (one bowl, like before) and that respects the probabilistic meaning of h. Maximum likelihood — the same principle that justified least squares in the last lesson — hands us exactly that.

Write down the probability of the data

Our model is a probability statement. It says P(y = 1 | x; θ) = h(x), and therefore P(y = 0 | x; θ) = 1 − h(x). Here's a slick trick to write both cases in one formula, using the fact that anything to the power 0 is 1:

p(y | x; θ) = h(x)^y · (1 − h(x))^1−y

Check it. If y = 1: the formula is h¹(1−h)⁰ = h — correct. If y = 0: it's h⁰(1−h)¹ = 1−h — correct. One expression, both labels. The exponents act as switches that turn each factor on or off.

Assuming the n training emails are independent, the probability of the entire labeled dataset is the product over all of them. Viewed as a function of θ, that's the likelihood:

L(θ) = ∏_i=1ⁿ h(x⁽ⁱ⁾)^y⁽ⁱ⁾ (1 − h(x⁽ⁱ⁾))^{1−y⁽ⁱ⁾}

As before, products of many small numbers are miserable to differentiate, so take the log — turning the product into a sum and the powers into multipliers. The log-likelihood:

ℓ(θ) = ∑_i=1ⁿ [ y⁽ⁱ⁾ log h(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h(x⁽ⁱ⁾)) ]

Maximum likelihood says: pick the θ that makes this as large as possible (the data most probable). Equivalently, flip the sign and minimize its negative — and the negative of that expression is the famous cross-entropy loss (also called log loss or binary cross-entropy), the single most-used classification loss in machine learning. For one example:

ℓ_CE(h, y) = −[ y log h + (1 − y) log(1 − h) ]

Why this loss has exactly the right shape

Look at what it does to a single confident-but-wrong prediction. Suppose the true label is y = 1 but the model says h = 0.01 ("almost certainly not spam"). The loss is −log(0.01) ≈ 4.6 — a big penalty. Push the model to h = 0.0001 and the loss rockets to ≈9.2. As h → 0 while y = 1, the loss → +∞. Cross-entropy punishes confident wrongness without mercy — being sure and mistaken is the worst thing a probabilistic classifier can do, and the loss agrees. Meanwhile, when the model is right and confident (h → 1 with y = 1), −log(1) = 0: no penalty. That asymmetry — gentle on correct confidence, brutal on wrong confidence — is exactly what you want, and it falls straight out of the −log.

The penalty curve — why confident mistakes hurt

For a true label, slide the model's predicted probability and watch the cross-entropy penalty. Flip the true label to see both branches: when y=1 the loss is −log(h) (explodes as h→0); when y=0 it's −log(1−h) (explodes as h→1). The penalty is 0 only when the model is confidently right.

predicted h = P(y=1)0.50

cross-entropy loss = —

python
def cross_entropy(h, y, eps=1e-12):
    h = np.clip(h, eps, 1-eps)          # avoid log(0)
    return -(y*np.log(h) + (1-y)*np.log(1-h))

cross_entropy(0.99, 1)   # 0.01  — confident & right: tiny loss
cross_entropy(0.01, 1)   # 4.61  — confident & WRONG: huge loss
cross_entropy(0.50, 1)   # 0.69  — a coin flip: ln(2)

Same principle, second appearance. In the last lesson, maximum likelihood under Gaussian noise gave us squared error. Here, maximum likelihood under a Bernoulli (coin-flip) model gives us cross-entropy. Different assumption about the data → different loss — but the recipe ("write the likelihood, take the log, minimize the negative") is identical. In Lesson 3 (GLMs) we'll see these aren't two tricks but one framework.

Common misconception: "Just use accuracy as the loss — count how many it gets right." Accuracy is what we care about, but it's a terrible training signal: it's flat and stair-stepped (nudging θ a little usually changes zero predictions, so the gradient is 0 almost everywhere). Cross-entropy is a smooth surrogate — it always has a non-zero slope telling the model which way to improve, even when no prediction has flipped yet. You optimize cross-entropy; you report accuracy.

Why do we train logistic regression with cross-entropy loss instead of the squared error used in linear regression?

Squared error is harder to type. Squared error with a sigmoid inside is non-convex (gradient descent gets stuck); cross-entropy is the convex maximum-likelihood loss and punishes confident-but-wrong predictions infinitely. Cross-entropy always gives 100% accuracy. Squared error can't be computed for two classes.

Chapter 4: The Update Rule — A Stunning Coincidence

We have the log-likelihood ℓ(θ) to maximize. As in the last lesson, we climb it by gradient ascent: take a step in the direction that increases ℓ. (Ascent, not descent — note the plus sign — because we're maximizing now, not minimizing.) The update is θ_j := θ_j + α · ∂ℓ/∂θ_j. So we need that derivative. Brace for it to look ugly — and then watch the sigmoid's magic identity from Chapter 1 demolish the ugliness.

The derivative (every step)

Work with a single example (x, y) so we can drop the sum. We differentiate ℓ = y log g(θ^Tx) + (1−y) log(1−g(θ^Tx)), writing g for g(θ^Tx):

∂ℓ/∂θ_j = [ y · (1/g) − (1−y) · (1/(1−g)) ] · ∂g/∂θ_j ← derivative of log, chain rule
= [ y/g − (1−y)/(1−g) ] · g(1−g) · ∂(θ^Tx)/∂θ_j ← use g′ = g(1−g) from Ch.1
= [ y(1−g) − (1−y)g ] · x_j ← the g(1−g) cancels both denominators!
= [ y − yg − g + yg ] · x_j = (y − g) x_j ← expand; −yg and +yg cancel

The g(1−g) from the sigmoid's derivative is precisely what's needed to cancel the 1/g and 1/(1−g) from the log's derivative. Everything collapses. Since g = h(x), the gradient is simply:

∂ℓ/∂θ_j = (y − h_θ(x)) x_j

which gives the stochastic gradient-ascent update rule:

θ_j := θ_j + α (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾)) x_j⁽ⁱ⁾

Wait — we've seen this exact rule before

Look hard at that update. Compare it to the LMS rule from the linear regression lesson: θ_j := θ_j + α(y − h)x_j. They are character-for-character identical. Same learning rate, same (target − prediction) error, same feature gating. Yet these are completely different algorithms: linear regression's h is the raw line θ^Tx; logistic regression's h is the squashed sigmoid g(θ^Tx). Two different models, two different losses, derived from two different probability assumptions — and out pops the same update rule.

Is this a coincidence? No. It's the first visible crack of a deep structure. Both Gaussian-noise regression and Bernoulli classification are members of one family — the exponential family — and for every member, maximum-likelihood gradient ascent produces this same "(error) × (feature)" update. That unifying framework is Generalized Linear Models, the entire subject of the next lesson. For now, savor the surprise: the rule you learned for predicting house prices also trains a spam filter, unchanged.

The universal learning shape. "weight += learning_rate × (target − prediction) × input." You first met it as LMS. Here it is again for classification. It reappears as the perceptron rule (next chapter), in the softmax gradient (Ch.7), and — via backpropagation — as the update for the output layer of every neural network. If you internalize one formula from these two lessons, make it this one.

One update, by hand

Let's run a single logistic update with numbers. One example: x = (1, 2) (bias + one feature), true label y = 1, current θ = (0, 0), learning rate α = 0.5.

Step 1 — score and squash. z = θ^Tx = 0·1 + 0·2 = 0. h = g(0) = 1/(1+e⁰) = 1/2 = 0.5. (At θ = 0 the model is maximally uncertain — sensible.)

Step 2 — error. y − h = 1 − 0.5 = 0.5. The true label is 1, the model said 0.5, so push the probability up.

Step 3 — update each weight via θ_j += α(y−h)x_j:

θ₀ := 0 + 0.5 × 0.5 × 1 = 0.25
θ₁ := 0 + 0.5 × 0.5 × 2 = 0.50

Step 4 — verify. New z = 0.25·1 + 0.50·2 = 1.25; new h = g(1.25) = 1/(1+e^−1.25) ≈ 0.777. The predicted probability of the (correct) class rose from 0.50 to 0.78 in one step. The model is learning to say "yes."

python
def logreg_step(theta, x, y, alpha):
    h = sigmoid(theta @ x)              # squashed prediction
    return theta + alpha * (y - h) * x    # IDENTICAL shape to LMS

theta = np.zeros(2)
print(logreg_step(theta, np.array([1.,2.]), 1, 0.5))  # [0.25 0.5] ✓

Common misconception: "Identical update rule means logistic and linear regression are the same algorithm." They share the update's form, but h means different things (raw line vs. squashed probability), so the numbers flowing through differ entirely, and they solve different problems with different losses. The shared shape is a clue about deep structure (GLMs) — not evidence they're interchangeable. Never run linear regression on a classification task expecting logistic's behavior.

The logistic-regression update rule turns out to be θ_j := θ_j + α(y − h)x_j — identical in form to linear regression's LMS rule. Why?

It's a typo in the textbook; they're actually different. Because logistic regression uses squared error too. Both are maximum-likelihood fits of exponential-family models, which all share this (error)×(feature) gradient — the deep reason is Generalized Linear Models. Because the sigmoid has no effect on the gradient.

Chapter 5: The Training Lab — Watch a Classifier Learn

Time to put it all in motion. Below is a live logistic-regression trainer. Two clouds of points — class 0 (blue) and class 1 (orange) — sit in the feature plane. The model starts with random weights, so its decision boundary is nonsense. Press Train and watch gradient ascent on the log-likelihood swing the boundary into place, the probability field sharpen, and the log-likelihood climb. Everything from Chapters 1–4 happening at once.

Experiments that teach (do these):

Train on the default clouds — the boundary slides to the gap and the shading snaps from blurry to crisp as the weights grow.
Add points on the wrong side (click) to make the classes overlap. Now perfect separation is impossible; watch the boundary settle on the best compromise, the probability field staying soft (honest uncertainty) instead of hardening.
Drag the clouds far apart (well-separated). The weights grow huge, the boundary becomes a knife-edge, and the log-likelihood keeps creeping up forever — a preview of the "separation blows up the weights" pathology in Ch.9.
Crank the learning rate and watch the boundary jitter or overshoot.

Logistic Regression Trainer

Click empty space to add a point of the selected class · drag a point to move it · shift-click to delete. Background = predicted P(y=1); white line = decision boundary.

learning rate α0.30

add as class

      log-likelihood —  |  accuracy —  |  iter 0  |  |θ| —
    

One thing to watch for, because it's the soul of the algorithm: as training proceeds, the points the model is already sure about stop contributing. Recall the update weight is (y − h) — the error. For a point deep in its own territory, h is already ~1 (or ~0) and matches y, so (y − h) ≈ 0: that point stops tugging on the boundary. The only points still exerting force are the ones near the boundary, where the model is uncertain. Logistic regression naturally focuses its attention on the hard, borderline cases — the same instinct that, taken to its logical extreme, gives the support vector machine.

(No quiz — the lab is the test. If you can predict how the boundary will move before pressing Train, you understand logistic regression.)

Chapter 6: The Perceptron — Logistic Regression's Hard-Edged Ancestor

Let's take a short, illuminating detour back in time. Logistic regression's sigmoid gives a soft answer — "73% spam." What if we force a hard verdict instead, replacing the smooth S-curve with a brutal step function that outputs only 0 or 1?

g(z) = 1 if z ≥ 0, 0 if z < 0

Keep everything else identical — same hypothesis h(x) = g(θ^Tx), same update rule θ_j := θ_j + α(y − h(x))x_j. With this threshold g, the algorithm is the perceptron, invented by Frank Rosenblatt in 1958 and once breathlessly hailed as a model of how neurons in the brain learn. It is the great-grandparent of every neural network alive today.

Because h now outputs only 0 or 1, the error term (y − h) takes only three values, and the update becomes almost comically simple:

Correct prediction (h = y): error = 0 → no change. The perceptron learns only from its mistakes.
Predicted 0, true 1 (y − h = +1): add α·x to θ — nudge the boundary so this point falls on the "1" side next time.
Predicted 1, true 0 (y − h = −1): subtract α·x from θ — push the point to the "0" side.

Perceptron learning — driven by mistakes

Two separable classes. Press Step to process one point: if it's misclassified (flashing ring), the boundary jumps to fix it; if correct, nothing happens. Watch the line lurch only on errors and freeze once everything is classified right. Hit Make non-separable to overlap the classes — now the boundary never stops twitching, because there's always a mistake to chase.

mistakes this pass: — · updates: 0

The perceptron has a celebrated guarantee, the Perceptron Convergence Theorem: if the data is linearly separable, this mistake-driven rule will find a separating boundary in a finite number of steps — full stop. But the flip side is its fatal flaw: if the data is not separable (even one stray point), it never converges — it cycles forever, chasing the unfixable mistake. This brittleness, exposed in Minsky and Papert's 1969 book (the perceptron can't even learn XOR), helped trigger the first "AI winter."

Common misconception: "The perceptron and logistic regression are basically the same since they share the update rule." Cosmetically similar, fundamentally different. The perceptron outputs a hard 0/1 with no probability — you can't ask "how confident?", you can't calibrate it, and there's no clean way to derive it as maximum likelihood. Logistic regression's soft sigmoid gives genuine probabilities and a convex loss, so it converges gracefully even on messy, overlapping data where the perceptron spins out. The sigmoid isn't a cosmetic upgrade — it's what makes the method well-behaved.

The perceptron updates its weights only when it makes a mistake. What happens if the data is NOT linearly separable?

It finds the perfect boundary anyway. It immediately stops with an error message. It never converges — there's always a misclassified point, so it keeps updating forever, cycling around. It converts itself into logistic regression.

Chapter 7: Many Classes — Softmax Regression

Spam-or-not has two classes. But what about classifying an email into spam, personal, or work — or an image into one of 1,000 object categories? We need to generalize from a yes/no sigmoid to a "pick one of k" model. The generalization is one of the most important functions in all of deep learning: the softmax.

The idea: instead of one weight vector θ, keep k of them — θ₁, …, θ_k, one per class. Each produces a raw score (a logit) t_i = θ_i^Tx measuring how much x looks like class i. But raw scores can be negative and don't sum to 1, so they aren't probabilities. The softmax fixes both problems at once — exponentiate (forcing positivity) and normalize (forcing them to sum to 1):

P(y = i | x) = φ_i = exp(t_i) / ∑_j=1^k exp(t_j)

The exp makes every score positive and amplifies differences (a slightly larger logit becomes a much larger probability — "winner takes more"); dividing by the sum guarantees the k outputs form a valid probability distribution. With k = 2, softmax collapses exactly back to the sigmoid — it's the honest generalization, not a different idea.

Softmax over three classes

Three classes, three weight vectors. The plane is colored by which class wins; the shading shows confidence (vivid = sure, washed-out = a three-way tie near the meeting point). Slide the logits for the query point (the white dot) and watch the probability bars renormalize — pushing one logit up steals probability from the others.

logit t₁ (red)2.0

logit t₂ (green)1.0

logit t₃ (blue)0.0

P = [—]

The loss and its gorgeous gradient

The loss for the multiclass case is cross-entropy again — the negative log-probability the model assigned to the true class y: ℓ_CE = −log φ_y = −log( exp(t_y) / ∑_j exp(t_j) ). And its gradient with respect to logit i is breathtakingly simple:

∂ℓ_CE/∂t_i = φ_i − 1{y = i}

where 1{y = i} is the indicator — 1 if i is the true class, 0 otherwise (the "one-hot" label). In words: the gradient is predicted probability minus the true (one-hot) label — written compactly, φ − e_y. This is the exact same "(prediction − target)" shape as the binary case and as LMS. Push down the probabilities of wrong classes, push up the probability of the right one, in proportion to how off you are. Chain it through t_i = θ_i^Tx and you get ∂ℓ/∂θ_i = (φ_i − 1{y=i})·x — the universal update, one more time.

python
def softmax(t):
    t = t - t.max()                  # subtract max for numerical stability
    e = np.exp(t); return e / e.sum()

def softmax_ce_grad(Theta, x, y, k):
    phi = softmax(Theta @ x)         # predicted class probabilities, shape (k,)
    onehot = np.eye(k)[y]            # true label as a one-hot vector
    return np.outer(phi - onehot, x)   # ∂ℓ/∂Θ = (φ − e_y) xᵀ

This is the last layer of almost every classifier you'll ever use. ImageNet networks, language models predicting the next token, anything choosing among many options — the final step is softmax over class logits, trained with cross-entropy. You've just derived the engine room of modern deep learning. The only thing a neural net adds is many nonlinear layers before the logits; the head is exactly this.

Common misconception: "Softmax just picks the biggest logit." Softmax is soft — it returns a full probability distribution, not a single winner. Taking the argmax is a separate, final step you do at test time. During training you need the soft probabilities, because the gradient (φ − e_y) depends on how much probability mass landed on each class, not just which one won. Hard-max would have zero gradient almost everywhere — the same flatness problem that ruled out training on accuracy.

The softmax function converts k raw logits into class probabilities. What does it guarantee about its outputs?

They are all equal. They are all non-negative and sum to exactly 1 — a valid probability distribution over the k classes. Exactly one output is 1 and the rest are 0. They are always between −1 and 1.

Chapter 8: Newton's Method — Using Curvature to Go Faster

Gradient ascent is reliable but slow — it only knows the slope, so near a gently-curving maximum it takes many timid steps. What if we also used the curvature — how fast the slope itself is changing — to jump intelligently? That's Newton's method, and for logistic regression it converges in a handful of iterations where gradient ascent needs hundreds.

Start with the classic use of Newton's method: finding a zero of a function f, i.e. solving f(θ) = 0. The trick: at your current guess, replace the curvy f with its tangent line, and jump to where that line crosses zero. The tangent has slope f′(θ), so a little geometry gives the update:

θ := θ − f(θ) / f′(θ)

To maximize ℓ(θ) rather than zero a function, recall a maximum sits where the derivative is zero. So apply Newton's zero-finder to f = ℓ′: we want ℓ′(θ) = 0. Substituting f = ℓ′ (so f′ = ℓ″) gives the optimization form:

θ := θ − ℓ′(θ) / ℓ″(θ)

The update divides the slope by the curvature. Where the function is sharply curved (large ℓ″), it takes cautious steps; where it's nearly flat, it takes bold ones — automatically, with no learning rate to tune. That's the magic: Newton's method has no α. The curvature sets the step size for you.

Newton's method vs. gradient ascent — a convergence race

Both climb the same 1-D log-likelihood toward its peak (green line). Press Step and watch Newton leap nearly to the top in one or two jumps by following the tangent-to-the-derivative, while gradient ascent inches along. Newton uses the curvature; gradient ascent doesn't.

Newton θ — · GD θ — · step 0

The multidimensional version: Fisher scoring

With many parameters, the slope ℓ′ becomes the gradient vector ∇_θℓ, and the curvature ℓ″ becomes the Hessian matrix H — the matrix of all second partial derivatives, H_ij = ∂²ℓ / ∂θ_i∂θ_j. The update generalizes to the Newton–Raphson rule:

θ := θ − H⁻¹ ∇_θℓ(θ)

Applied to logistic regression's log-likelihood, this is called Fisher scoring (or IRLS — iteratively reweighted least squares). It typically converges in well under ten iterations. The cost: each step must build and invert the d×d Hessian, which is O(d³) — cheap when features number in the dozens, but ruinous for the millions of parameters in a neural net. That's the trade-off in one line: Newton uses second-order information for blistering convergence but pays a d³ price per step; gradient descent is first-order, slow but O(d) and scalable. It's exactly why deep learning runs on (stochastic) gradient descent and its cousins, not Newton's method — the same scalability story from the normal-equations-vs-GD discussion in the last lesson.

Common misconception: "Newton's method is strictly better, so always use it." Only when d is small and ℓ is nicely concave. For high-dimensional, non-convex problems (neural nets), the Hessian is enormous, expensive, and possibly indefinite (curving up in some directions, down in others), which can send Newton's method toward saddle points or even uphill in the wrong direction. Practitioners use cheaper "quasi-Newton" approximations (L-BFGS) or first-order methods (Adam) instead. Curvature is powerful but costly — and dangerous when the landscape isn't a clean bowl.

What is the key advantage of Newton's method over gradient ascent, and its main cost?

It needs no data; but it's slow. It uses curvature (the Hessian) to converge in very few iterations with no learning rate — but each step must invert a d×d matrix (O(d³)), which doesn't scale to many parameters. It always finds the global maximum of any function for free. It replaces the gradient with a random guess.

Chapter 9: When It Breaks, & Connections

You can now build a classifier from zero, derive its loss from first principles, and train it three ways. Before the cheat sheet, the failure modes — because knowing where logistic regression breaks is what makes you trust it where it doesn't.

Perfect separation blows up the weights

If the two classes are perfectly separable, there's no finite "best" θ. The model can always raise the log-likelihood a hair by scaling θ larger — making the boundary sharper and pushing every correct probability closer to a perfect 0 or 1. So gradient ascent drives |θ| → ∞ and never settles (you saw this in the Ch.5 lab with well-separated clouds). The cure is regularization: add a penalty λ|θ|² on the weights' size, so the model balances fitting the data against staying modest. This also gives more sensible, better-calibrated probabilities.

Class imbalance & the non-linear boundary

If 99% of emails are non-spam, a model can score 99% accuracy by predicting "not spam" always — useless. Cross-entropy and class-weighting help, but you must watch precision/recall, not raw accuracy. And remember Ch.2: the boundary is always a straight hyperplane. For tangled classes (XOR, concentric rings), feed in nonlinear features or move to a kernel method or neural net — logistic regression on the right features is astonishingly strong, but on raw tangled inputs it's helpless.

The whole lesson on one page

Concept	Formula	In words
Sigmoid	g(z) = 1/(1+e^−z)	Squash any score into (0,1)
Hypothesis	h(x) = g(θ^Tx) = P(y=1\|x)	Predicted probability of the positive class
Decision boundary	θ^Tx = 0	Straight line / hyperplane where h = 0.5
Sigmoid derivative	g′ = g(1−g)	Slope in terms of the output
Cross-entropy loss	−[y log h + (1−y) log(1−h)]	MLE under a Bernoulli model; convex
Update rule	θ_j := θ_j + α(y−h)x_j	Identical to LMS — (error)×(feature)
Softmax (k classes)	φ_i = e^t_i/∑e^t_j	Probabilities over k classes; gradient φ−e_y
Newton step	θ := θ − H⁻¹∇ℓ	Use curvature; few iterations, O(d³) each

Full from-scratch trainer

python
import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)                 # predicted probabilities
        theta += alpha * X.T @ (y - h)         # gradient ASCENT: + (y−h)·x summed
    return theta

# sklearn one-liner (uses a fast solver like L-BFGS + L2 regularization):
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_raw, y)
clf.predict_proba(X_new)    # calibrated P(y=1)

Where to go next

Linear Regression & LMS — the prerequisite; the identical update rule started here.
Generalized Linear Models (next) — the framework that explains why linear and logistic regression share an update rule. Both are exponential-family GLMs; the magic stops being magic.
Loss Functions — cross-entropy in full generality, plus the contrastive and ranking losses that extend it.
Support Vector Machines — the other great linear classifier; it focuses entirely on the borderline points logistic regression already leans toward.
Kernels — how to give a linear classifier curved boundaries without manually building features.

You can now teach this. Sigmoid squashes a score into a probability; the decision boundary is where θ^Tx = 0; cross-entropy is the maximum-likelihood loss under a Bernoulli model; the update rule is identical to LMS for a deep reason (GLMs); the perceptron is the hard-threshold ancestor; softmax extends it to many classes; Newton's method trades a d³ cost for blazing convergence. From "a line that says yes/no" to the foundation of all classification.

"The perceptron may eventually be able to learn, make decisions, and translate languages." — The New York Times, 1958. They were early, but not wrong — the line that learned to say yes became the last layer of every neural network.