Classical ML · CS229

Linear Regression & the LMS Algorithm

The first machine learning algorithm anyone should truly understand. Fit a line to data, measure how wrong it is, and roll downhill until it's right — then discover that the same answer falls out of pure probability.

Prerequisites: Basic algebra + what a derivative is. That's it. We build the rest.
11
Chapters
9+
Simulations
0
Assumed Knowledge

Chapter 0: The Portland Housing Problem

You're standing in a real-estate office in Portland, Oregon. On the desk is a spreadsheet: 47 houses that sold last year, each with two numbers — the living area in square feet, and the price it sold for, in thousands of dollars.

A client walks in. "I'm selling a 2,500 ft² house," she says. "What's it worth?"

Your spreadsheet has no 2,500 ft² house in it. There's a 2,400 that sold for $369k and a 3,000 that sold for $540k. So the answer is... somewhere in between? But where exactly? And what's your reasoning — what principle lets you turn 47 past sales into a confident number for a house you've never seen?

This is the entire problem of supervised learning in one sentence: you have a pile of example input–output pairs, and you want to learn a rule that predicts the output for new inputs. When the output is a continuous number — a price, a temperature, a blood-pressure reading — we call it a regression problem. When it's a small set of categories — cat vs. dog, spam vs. not — it's classification. Today: regression.

The supervised learning contract. You are given n training examples, each a pair (x, y): x is the input (also called a feature), y is the target we want to predict. Your job is to produce a function h — called the hypothesis — so that h(x) is a good guess for y, even on inputs you've never seen.

Let's look at the actual data. Here are the first few rows, and below them, the same data as a scatter plot — each house is one dot, area on the horizontal axis, price on the vertical.

Living area (ft²)Price ($1000s)
2104400
1600330
2400369
1416232
3000540
The data, and your first guess at a rule

Each blue dot is a house that actually sold. Drag the two handles on the line — the left handle moves the intercept, the right handle tilts the slope. Try to lay the line so it threads through the cloud of dots. When you think it's good, ask yourself: how would you know if a slightly different line were better?

Predict price for a 2,500 ft² house: Total squared error:

Play with it for a moment. You can feel when the line is bad — it sits above all the dots, or it's tilted the wrong way. And you can feel when it's "pretty good." But notice the discomfort: two people would draw two slightly different lines, and neither could prove the other wrong. Eyeballing is not a principle. We need something a machine can compute and optimize.

Here's the plan for the whole lesson, and it's a beautiful arc. First (Ch. 1) we'll pin down exactly what "a line" is as a mathematical object with adjustable knobs. Then (Ch. 2) we'll invent a single number that scores how wrong any line is — turning "looks good" into arithmetic. Then (Ch. 3–6) we'll find the line that makes that number as small as possible, two completely different ways: by rolling downhill (gradient descent and its famous LMS update rule) and by solving an equation exactly (the normal equations). Chapter 7 is a live lab where you watch both methods race. And in Chapter 8 comes the twist that elevates this from "a trick" to "the right thing to do": we'll show that minimizing squared error is exactly what you get if you assume the world is a clean line plus Gaussian noise and ask for the most probable parameters. The recipe isn't arbitrary. It's inevitable.

Common misconception: "Regression means drawing a smooth curve through the points." Not quite. A regression model can be a straight line, a plane, or a high-order curve — what makes it linear regression is that the prediction is a weighted sum of features, linear in the parameters we're tuning, not necessarily a straight line in the input. We'll see in Ch. 9 that you can fit a wiggly polynomial and it's still "linear regression" under the hood.
Why isn't "just eyeball the best line through the dots" a satisfactory solution to the housing problem?

Chapter 1: The Hypothesis — A Line Is Just Two Knobs

In the last chapter you dragged a line around with two handles. Those two handles are the entire model. Let's name them.

A straight line that predicts price from area has the form

h(x) = θ0 + θ1 x

Read it aloud: "the prediction h at input x equals theta-zero plus theta-one times x." The symbol h stands for hypothesis — historical jargon for "the function we're proposing as the rule." The two numbers θ0 and θ1 are the parameters (also called weights). They are the knobs. Everything about machine learning is choosing good values for knobs like these.

Each knob has a concrete meaning you can see on the plot:

Watch what each knob does on its own:

Two knobs, two motions

Slide θ0 and the line shifts vertically without changing its tilt. Slide θ1 and it pivots. These are the only two degrees of freedom a line in a plane has.

θ0 (intercept) 80
θ1 (slope) 0.15
Equation:

More than one feature

Living area isn't the only thing that sets a price. Suppose we also know the number of bedrooms. Now each house is described by two numbers, and our line becomes a tilted plane:

h(x) = θ0 + θ1 x1 + θ2 x2

Here x1 is living area, x2 is bedrooms, and we've gained a third knob θ2 — "dollars per bedroom." There's nothing special about stopping at two. With d features we'd write h(x) = θ0 + θ1x1 + … + θdxd. Writing that sum out every time is tedious, so let's find a cleaner notation.

The x0 = 1 trick

That lonely θ0 with no x attached is annoying — it breaks the pattern. Here's a slick fix that the whole field uses: invent a fake feature x0 that is always equal to 1. Then

θ0 = θ0 · 1 = θ0 x0

and suddenly every term has the same shape θjxj. The hypothesis collapses into one tidy sum:

h(x) = ∑j=0d θj xj = θTx

That last form, θTx, is the dot product of the parameter vector θ = (θ0, θ1, …, θd) and the feature vector x = (x0, x1, …, xd) = (1, x1, …, xd). The little T means "transpose" — it just lines the two vectors up so we can multiply them entry-by-entry and add. The dot product is "multiply each feature by its weight and sum the results." That's the beating heart of a linear model, and you'll meet it again in logistic regression, neural networks, and attention layers. Learn it once here.

Why this matters: The fake feature x0=1 isn't a hack to hide later — it's the standard convention. From now on "θTx" silently includes the intercept. When you see a paper write h(x) = θTx with no separate bias term, this is why: the bias is hiding in the x0=1 slot.

A worked prediction, by hand

Let's compute one prediction with actual numbers so the dot product stops being abstract. Suppose after training we found

θ = (θ0, θ1, θ2) = (89.6,  0.139,  −8.74)

(These are close to the real CS229 fitted values for area + bedrooms.) A house has 2,500 ft² and 4 bedrooms. With the x0=1 convention, its feature vector is x = (1, 2500, 4). The prediction is the dot product:

# h(x) = θ₀·x₀ + θ₁·x₁ + θ₂·x₂
h(x) = 89.6 × 1  +  0.139 × 2500  +  (−8.74) × 4
     = 89.6      +  347.5         +  (−34.96)
     = 402.14

So the model predicts about $402k. Notice every step: we multiplied each feature by its weight (89.6×1, 0.139×2500, −8.74×4), then added the three products. The negative weight on bedrooms is real and counterintuitive — given a fixed area, more bedrooms means smaller rooms, which the data says is slightly less desirable. We'll revisit that surprise. For now, just internalize: prediction = weighted sum of features.

Dot-product calculator

Set the two features and three weights; watch each product and the running sum that forms the prediction. This is literally what theta @ x does in NumPy.

x₁ area2500
x₂ beds4
θ₁0.139
python
import numpy as np

theta = np.array([89.6, 0.139, -8.74])   # θ₀ (bias), θ₁, θ₂
x     = np.array([1.0, 2500, 4])         # x₀=1 (bias trick), area, beds

# Three equivalent ways to compute the same prediction:
pred = theta[0]*x[0] + theta[1]*x[1] + theta[2]*x[2]  # by hand
pred = np.sum(theta * x)                            # elementwise multiply, then sum
pred = theta @ x                                    # the dot product θᵀx
print(pred)   # 402.14
Common misconception:Tx must mean some heavy matrix math." No — for a single example it's just one number: multiply pairs, add them up. The transpose notation only exists to make the bookkeeping line up. If you can compute 89.6 + 0.139×2500 − 8.74×4 on paper, you can compute θTx.
We introduce the fake feature x0 = 1. What does it accomplish?

Chapter 2: Measuring Wrongness — The Cost Function

We have a model with knobs. We have a vague feeling that some knob settings are better than others. To turn that feeling into something a machine can optimize, we need to score any choice of θ with a single number: small when the line fits well, large when it fits badly. Then "find the best line" becomes "find the θ that makes the score smallest" — a math problem, not an opinion.

Start with one house. The line predicts h(x(i)); the house actually sold for y(i). The gap between them,

r(i) = h(x(i)) − y(i)

is called the residual — how far off we were on house i. (The superscript (i) is just an index labelling the i-th training example; it is not an exponent. h(x(2)) means "the prediction for house number 2.") A positive residual means we overpredicted; negative means we underpredicted.

We want a total score across all n houses. Three natural ideas, and why two of them fail:

IdeaFormulaProblem
Sum the raw residuals∑ (h(x(i)) − y(i))A +50 error and a −50 error cancel to zero. A wild line can score "perfect."
Sum the absolute residuals∑ |h(x(i)) − y(i)|No cancellation — this actually works (it's "MAE"), but the absolute value has a sharp corner at 0, so its derivative is awkward.
Sum the squared residuals∑ (h(x(i)) − y(i)No cancellation, smooth everywhere, easy derivative. Winner.

Squaring does two jobs at once. First, it kills the sign — (−50)² = 2500, same as (+50)² — so errors can't cancel. Second, it is a smooth, bowl-shaped function with a derivative everywhere, which (next chapter) lets us roll downhill on it. There's also a third, deeper reason rooted in probability that we'll earn in Chapter 8. For now, accept squaring as the natural choice and define the cost function (also called the least-squares or mean-squared-error objective):

J(θ) = ½ ∑i=1n (hθ(x(i)) − y(i)

Two questions you should be asking. Why the ½ out front? Purely for convenience: when we differentiate the square next chapter, the 2 from the power rule cancels the ½, leaving a clean expression. It doesn't change where the minimum is — scaling a function by a constant doesn't move its lowest point — so we're free to put it there. And why is J a function of θ, not x? Because the data (the x's and y's) is fixed — it's the spreadsheet, it doesn't move. The only things we can change are the knobs θ. J(θ) asks: "for this setting of the knobs, how much total squared error do we rack up across all the houses?"

The mental picture. Draw your line through the scatter. From each data point, draw a vertical segment up or down to the line — that's the residual. Now build an actual square on each segment (side length = residual). J(θ) is half the total area of all those squares. Minimizing J means shrinking that total area. Tilt or shift the line and the squares grow and shrink; the best line is the one with the least total square-area.
The residual squares

Drag the slope and intercept. Each orange square's area is one residual squared; the cost J is half their total area. Hunt for the setting that makes the squares smallest — that's least squares, literally.

θ0120
θ10.08
Cost J(θ) =   

Computing J by hand on a tiny dataset

Abstractions lie; numbers don't. Let's compute J for a 3-point dataset, every step shown. Take these three points and the candidate line h(x) = 1 + 1·x (so θ0=1, θ1=1):

ix(i)y(i)h(x)=1+xresidual h−yresidual²
11122 − 1 = 11
22333 − 3 = 00
33444 − 4 = 00

Now sum the last column and halve it:

# J(θ) = ½ Σ (h(xᵢ) − yᵢ)²
J = ½ × (1 + 0 + 0)
  = ½ × 1
  = 0.5

So this line scores J = 0.5. Is that good? On its own the number is meaningless — cost is only useful for comparison. Let's try a worse line, h(x) = 0 + 1·x (drop the intercept to 0):

# Line h(x) = x:  predictions are 1, 2, 3
residuals = (1−1), (2−3), (3−4)   =   0, −1, −1
squared   =    0,    1,    1
J = ½ × (0 + 1 + 1) = ½ × 2 = 1.0

This line scores J = 1.0 — twice as costly as the first. So the machine now has a definite, defensible verdict: the line h(x) = 1 + x fits these three points better than h(x) = x. No eyeballing, no opinion — just a smaller number. That comparison is the whole game. The rest of the lesson is finding the θ with the very smallest J.

python
import numpy as np

X = np.array([[1,1],[1,2],[1,3]])  # columns: x₀=1, x₁
y = np.array([1,3,4])

def cost(theta):
    pred = X @ theta              # h(xᵢ) for every row at once
    resid = pred - y
    return 0.5 * np.sum(resid**2)   # ½ Σ residual²

print(cost(np.array([1,1])))  # 0.5  — the good line
print(cost(np.array([0,1])))  # 1.0  — the worse line
Common misconception: "A lower cost is always a better model." Lower cost means a better fit to the training data — which is not the same as a better predictor. A wiggly curve can drive J all the way to zero by passing through every training dot and still predict garbage on new houses (overfitting, Ch. 9). Cost measures fit, not wisdom. We minimize it, but we never trust it blindly.
Why does the cost function square the residuals instead of just summing them?

Chapter 3: Rolling Downhill — Gradient Descent

We have a score J(θ) and a goal: find the θ that makes it smallest. We could try every possible (θ0, θ1) pair — but there are infinitely many, and with d features the search space explodes. We need a smarter method. Here's one of the most important algorithms in all of computing, and it's based on a child's intuition: if you want to reach the bottom of a valley in thick fog, just keep stepping downhill.

Picture the cost J as a landscape. The ground position is the knob setting (θ0, θ1); the height at that position is the cost J(θ). For least-squares linear regression this landscape is a single smooth bowl — a convex paraboloid — with exactly one lowest point and no other dips to get trapped in. (We'll prove the "one minimum" claim in Ch. 6.) You can't see the whole bowl, but standing at any point you can feel which way is downhill. So: step downhill, feel again, step again. Repeat until the ground is flat. That flat spot is the minimum.

The key analogy: Gradient descent is a blindfolded hiker feeling for the steepest descent with their feet and taking a step that way, over and over. The "feeling for the slope" is computing a derivative. The "size of the step" is the learning rate. That's the entire algorithm.

The update rule

"Feel the slope in the θj direction" is precisely the partial derivative ∂J/∂θj — the rate at which cost changes as you nudge knob j. If that derivative is positive, increasing θj increases cost, so we should decrease θj. If negative, the opposite. Either way, we move against the derivative. That gives the gradient descent update:

θj := θj − α · ∂J(θ)/∂θj   (for every j, simultaneously)

The := means "assign" — overwrite the old θj with the new value (like theta_j = theta_j - ... in code), not an equation claiming the two sides are already equal. The number α (alpha) is the learning rate: how big a step to take. Small α = timid baby steps; large α = bold leaps. We'll see both fail in their own way.

Working out the derivative (every step)

To run the algorithm we need an actual formula for ∂J/∂θj. Let's derive it for the simplest case — a single training example (x, y), so the sum in J disappears and J = ½(h(x) − y)². Watch the chain rule peel it apart:

∂/∂θj · ½(h(x) − y)²
= 2 · ½(h(x) − y) · ∂/∂θj(h(x) − y)  ← chain rule; bring down the 2
= (h(x) − y) · ∂/∂θj0x0 + … + θdxd − y)  ← the 2 and ½ cancel
= (h(x) − y) · xj  ← only the θjxj term survives the ∂/∂θj

Look at how clean that is. This is why we put the ½ in J: the 2 from squaring and the ½ out front annihilate each other, leaving no stray constant. And the last step is the magic — differentiating the sum θ0x0 + θ1x1 + … with respect to θj, every term is a constant (w.r.t. θj) except the one containing θj, whose derivative is just its coefficient xj. So:

∂J/∂θj = (h(x) − y) xj

Plug that into the update and you get gradient descent for linear regression. We'll unpack what it means — the famous LMS rule — in the very next chapter. First, let's see the descent happen and feel how α controls everything.

Descending the cost bowl

The contour rings are the cost landscape over (θ0, θ1) — the bullseye is the minimum. Press Step to take one gradient step; Run to animate. Then crank the learning rate α and watch the personality change: too small and it crawls; too large and it overshoots, zig-zags, and finally explodes off the chart.

learning rate α0.25
iter 0 · J =

The Goldilocks problem of α

The learning rate is the single most temperamental knob in machine learning. Three regimes, all visible in the widget above:

There's no universal best α; it depends on the data's scale. A standard remedy (Ch. 9) is to rescale features so the bowl is round rather than a long thin canyon — round bowls forgive larger steps.

python
def gradient_descent(X, y, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])          # start at the origin
    n = len(y)
    for _ in range(iters):
        resid = X @ theta - y                # h(xᵢ) − yᵢ for all i
        grad  = X.T @ resid                  # Σ (h−y)·xⱼ  →  ∂J/∂θ for every j at once
        theta = theta - alpha * grad         # the update, all knobs together
    return theta
Common misconception: "Bigger learning rate = learns faster, so crank it up." Up to a point, yes — but past a threshold the step jumps over the valley and lands higher than it left. More speed becomes negative speed: the cost increases every iteration and the model blows up. If your loss is suddenly NaN or growing, the very first thing to try is a smaller α.
You run gradient descent and the cost increases every iteration, eventually becoming infinite. The single most likely cause?

Chapter 4: The LMS Update Rule

Substitute the derivative we just derived, ∂J/∂θj = (h(x) − y)xj, back into the gradient-descent update θj := θj − α·∂J/∂θj. The two minus signs (one in the update, one if we flip the residual) let us write it in the form everyone quotes. For a single training example (x(i), y(i)):

θj := θj + α (y(i) − hθ(x(i))) xj(i)

This is the LMS update rule — "LMS" stands for Least Mean Squares — also known as the Widrow–Hoff learning rule, after the engineers who used it in adaptive filters in 1960. It is one of the most important three-term expressions in machine learning, and once you understand it you understand the skeleton of how almost every model on Earth learns, including deep neural networks. So let's read it slowly.

Notice we flipped the residual: instead of (h − y) we wrote the error (y − h), "target minus prediction," and the sign in front became a plus. The piece (y(i) − hθ(x(i))) is how far below our prediction the truth was — positive when we underpredicted, negative when we overpredicted. Three behaviors fall out of this one term:

Read it as a sentence: "Nudge each weight by (how big a step we allow) × (how wrong we were) × (how much this feature contributed)." Learning rate, error, input. Memorize that triple — the same shape reappears in the perceptron, in logistic regression, and as the final layer of backpropagation.

One LMS step, fully by hand

Let's execute exactly one update with real numbers so nothing is hidden. Setup: one training example with area x1 = 2 (in thousands of ft², so the numbers stay readable) and true price y = 5. We use the bias trick, so x = (x0, x1) = (1, 2). Current knobs θ = (θ0, θ1) = (0, 1). Learning rate α = 0.1.

Step 1 — predict. h(x) = θTx = θ0·1 + θ1·2 = 0·1 + 1·2 = 2.

Step 2 — error. y − h(x) = 5 − 2 = 3. We underpredicted by 3, so both knobs should rise.

Step 3 — update each knob using θj := θj + α(y − h)xj:

# θ₀:  feature x₀ = 1
θ₀ := 0 + 0.1 × 3 × 1  =  0 + 0.3  =  0.3
# θ₁:  feature x₁ = 2  (so it gets twice the correction of θ₀)
θ₁ := 1 + 0.1 × 3 × 2  =  1 + 0.6  =  1.6

New knobs: θ = (0.3, 1.6). Step 4 — verify we improved. Re-predict: h(x) = 0.3·1 + 1.6·2 = 0.3 + 3.2 = 3.5. The new error is 5 − 3.5 = 1.5 — we cut it in half, from 3 down to 1.5, in a single step. And notice θ1 moved twice as far as θ0 (0.6 vs 0.3), exactly because its feature x1=2 was twice as large as x0=1. The feature gated the blame, just as promised.

Watch the error shrink, step by step

A single point (orange) and a line driven by the LMS rule. Each press of LMS step applies one update; the dashed segment is the current error (y − h). Watch it collapse toward zero. Try a big α and watch the line overshoot the point and oscillate.

α0.12
h = · error =
python
# One LMS update for a single example (x, y). x includes the bias x₀=1.
def lms_step(theta, x, y, alpha):
    h     = theta @ x                 # prediction θᵀx
    error = y - h                     # target − prediction
    return theta + alpha * error * x     # θⱼ += α·(y−h)·xⱼ  for all j

theta = np.array([0.0, 1.0])
x, y  = np.array([1.0, 2.0]), 5.0
print(lms_step(theta, x, y, 0.1))   # [0.3 1.6] — matches our hand calc
Common misconception: "The error term (y − h) means the rule only cares about getting the average right." No — the error is computed per example and multiplied by that example's feature. A weight tied to a feature that's usually zero (say, "has a swimming pool") only gets corrected on the rare houses where that feature fires. The feature xj routes each example's error to exactly the weights that helped cause it.
In the LMS update θj := θj + α(y − h)xj, what is the size of the parameter change proportional to?

Chapter 5: Batch vs. Stochastic Gradient Descent

The LMS rule of Chapter 4 updated θ from a single example. But we have n houses, not one. How do we use the whole training set? There are two answers, and the difference between them — "look at everything before each step" vs. "step after every example" — is one of the most consequential design choices in large-scale machine learning. Let's build both.

Batch gradient descent: average the wisdom of the crowd

The true cost J(θ) = ½∑i(h(x(i)) − y(i))² sums over all examples, so its true derivative also sums over all examples. We worked out the single-example derivative; the full one is just the sum of those:

∂J/∂θj = ∑i=1n (hθ(x(i)) − y(i)) xj(i)

Plug it in and you get batch gradient descent: every step scans the entire training set, accumulates the error contribution from all n houses, and takes one well-informed step:

θj := θj + α ∑i=1n (y(i) − hθ(x(i))) xj(i)   (for every j)

"Batch" because one step consumes the whole batch of data. Each step points in the exact direction of steepest descent on the true cost, so the path is smooth and the convergence is reliable. The cost: every single step requires touching all n examples. With n in the millions, one step is expensive, and you need many steps.

Stochastic gradient descent: step on every example

The alternative throws out the "wait until you've seen everything" rule. Stochastic gradient descent (SGD, also called incremental gradient descent) loops over the examples and applies the single-example LMS update immediately after each one:

Loop {
  for i = 1 to n {
    θj := θj + α (y(i) − hθ(x(i))) xj(i)  (for every j)
  }
}

Each step uses one example's error as a noisy estimate of the true gradient. Any single example might point slightly wrong — but on average they point downhill, so the parameters drift toward the minimum while jittering around the path. The payoff: SGD starts improving after the very first example, before batch GD has even finished its first scan. On a huge dataset this is transformative.

The price: SGD never quite settles. Near the minimum, each example still yanks θ a little in its own direction, so the parameters keep oscillating in a small cloud around the true optimum rather than landing dead center. In practice that cloud is close enough — and you can shrink it by slowly decreasing α toward zero as training proceeds, which lets the jitter die down and the parameters converge. For large datasets, SGD's head start usually wins, which is why essentially all deep learning uses SGD (or a mini-batch hybrid of the two).

Two descents, same data, same start

Watch the two trajectories on the cost contours. Batch glides smoothly toward the bullseye. SGD takes a jagged, noisy path — faster early progress, but it never stops twitching near the center. Add noise to the data and SGD's jitter grows.

data noise1.0
batch J · sgd J

Why the batch gradient really is the sum of LMS terms

Let's verify the "sum of single-example gradients" claim with a 2-point dataset, because it's the conceptual hinge. Points: (x(1), y(1)) = ((1,1), 3) and (x(2), y(2)) = ((1,2), 5), with current θ = (0, 1).

# Predictions
h⁽¹⁾ = θᵀx⁽¹⁾ = 0·1 + 1·1 = 1     →  error y−h = 3−1 = 2
h⁽²⁾ = θᵀx⁽²⁾ = 0·1 + 1·2 = 2     →  error y−h = 5−2 = 3

# Per-example contribution to ∂J/∂θⱼ is (h−y)·xⱼ = −(error)·xⱼ
# θ₀ direction (x₀=1 for both):
  example 1:  −2 × 1 = −2
  example 2:  −3 × 1 = −3        sum = −5
# θ₁ direction (x₁ = 1 then 2):
  example 1:  −2 × 1 = −2
  example 2:  −3 × 2 = −6        sum = −8

# Batch gradient ∇J = (−5, −8). One batch step with α=0.1:
θ := (0,1) − 0.1·(−5,−8) = (0.5, 1.8)

The batch step moved both knobs using the combined evidence of both houses at once. SGD, by contrast, would have stepped on house 1, changed θ, then stepped on house 2 from the new θ — two smaller, sequential nudges instead of one summed leap. Same destination, different road.

Batch GDStochastic GD
Examples per stepAll nOne
Step directionExact steepest descentNoisy estimate
PathSmoothJagged
Cost per stepExpensive (O(n))Cheap (O(1))
Near the minimumSettles cleanlyOscillates in a cloud
Best whenSmall / medium dataLarge data — fast early progress
The modern default is a compromise. Real systems use mini-batch gradient descent: average the gradient over a small handful of examples (32, 64, 256) per step. It's noisy enough to be cheap and to escape bad spots, smooth enough to be stable, and it maps beautifully onto GPU hardware that loves processing batches in parallel. Batch and SGD are the two ends of a dial; mini-batch lives in the sweet middle.
Common misconception: "SGD is just a sloppy approximation you'd never use if you could afford batch." The noise is a feature, not just a cost-saver. In the bumpy loss landscapes of deep networks, SGD's jitter helps it jump out of bad local minima and saddle points that smooth batch descent would get stuck in. For linear regression's single bowl this doesn't matter — but it's why the whole field runs on SGD.
Why is stochastic gradient descent often preferred over batch gradient descent when the training set is very large?

Chapter 6: The Closed Form — Normal Equations

Gradient descent creeps toward the minimum over many iterations. But for linear regression there's something almost magical: we can jump straight to the answer in one shot, solving an equation instead of iterating. The cost is a perfect bowl, and calculus says the bottom of a bowl is exactly where the slope is zero in every direction. So instead of rolling downhill, set the gradient to zero and solve.

To do that for all parameters at once without drowning in subscripts, we switch to matrix notation. Stack the training inputs as rows of the design matrix X (n rows, one per house; d+1 columns, including the x0=1 bias column), and stack the targets in a vector y:

X = [ row i = (x(i))T ]  (n × (d+1)),    y = (y(1), …, y(n))T  (n × 1)

Now the magic of matrix multiplication: the matrix–vector product Xθ computes all n predictions at once. Row i of X is house i's features, and (row i)·θ = hθ(x(i)). So Xθ is the vector of every prediction, and Xθ − y is the vector of every residual. Stacking residuals into a vector r, the cost is half the sum of their squares — and "sum of squares of a vector's entries" is exactly rTr (a vector dotted with itself). Therefore:

J(θ) = ½ (Xθ − y)T(Xθ − y)

This is the same J as before — same ½, same sum of squared residuals — just written compactly so we can differentiate it in one move instead of d+1 separate ones.

Setting the gradient to zero (every step)

Expand the product and differentiate with respect to the vector θ. Three matrix-calculus facts do all the work: (a) aTb = bTa (dot products commute), (b) ∇θ(bTθ) = b, and (c) ∇θTAθ) = 2Aθ for symmetric A. Here A = XTX, which is always symmetric. Watch:

θJ = ½ ∇θ(Xθ − y)T(Xθ − y)
= ½ ∇θTXTXθ − 2(XTy)Tθ + yTy)  ← expand; combine the two cross terms using aTb=bTa
= ½ (2XTXθ − 2XTy)  ← apply facts (b) and (c); yTy has no θ, derivative 0
= XTXθ − XTy

Set that gradient to zero — the defining condition for the bottom of the bowl — and you get the famous normal equations:

XTX θ = XTy

This is a plain linear system: a known matrix (XTX) times the unknown θ equals a known vector (XTy). Multiply both sides by the inverse of XTX and you have the parameters in closed form — no iteration, no learning rate, no convergence to wait for:

θ = (XTX)−1 XTy
This also proves the bowl has exactly one minimum. Because J is a convex quadratic, the single point where the gradient vanishes is the global minimum — there are no other dips. That's why gradient descent in Chapters 3–5 always reaches the same answer this formula gives (for a small enough α): both methods are chasing the one and only bottom of the same bowl.

Solving the normal equations by hand

Let's grind one out completely on the 3-point set from Chapter 2: points (x,y) = (1,1), (2,3), (3,4). With the bias column, the design matrix and target are

X = [1 1]      y = [1]
    [1 2]          [3]
    [1 3]          [4]

Step 1 — form XTX (a 2×2 matrix). XT is 2×3; multiply by X (3×2):

# (XᵀX)₀₀ = Σ x₀·x₀ = 1+1+1 = 3
# (XᵀX)₀₁ = Σ x₀·x₁ = 1+2+3 = 6
# (XᵀX)₁₁ = Σ x₁·x₁ = 1+4+9 = 14
XᵀX = [ 3   6 ]
      [ 6  14 ]

Step 2 — form XTy (a length-2 vector):

# (Xᵀy)₀ = Σ x₀·y = 1·1 + 1·3 + 1·4 = 8
# (Xᵀy)₁ = Σ x₁·y = 1·1 + 2·3 + 3·4 = 1+6+12 = 19
Xᵀy = [ 8 ]
      [19 ]

Step 3 — invert XTX. For a 2×2 matrix [[a,b],[c,d]] the inverse is 1/(ad−bc)·[[d,−b],[−c,a]]. The determinant is ad−bc = 3·14 − 6·6 = 42 − 36 = 6.

(XᵀX)⁻¹ = (1/6) [ 14  −6 ]
                 [ −6   3 ]

Step 4 — multiply θ = (XTX)−1XTy:

θ = (1/6) [ 14  −6 ] [ 8 ]
          [ −6   3 ] [19 ]

θ₀ = (1/6)(14·8  + (−6)·19) = (1/6)(112 − 114) = (1/6)(−2) = −0.333
θ₁ = (1/6)((−6)·8 + 3·19)   = (1/6)(−48 + 57)  = (1/6)(9)  =  1.5

So the exact best-fit line is h(x) = −0.333 + 1.5x. No iterations, no learning rate — just arithmetic. Sanity check at x=2: h = −0.333 + 3.0 = 2.667, sitting neatly between the y-values 1, 3, 4. This is the provably optimal line; gradient descent from Chapter 3, run long enough, would converge to these same two numbers.

python
X = np.array([[1,1],[1,2],[1,3]], dtype=float)
y = np.array([1,3,4], dtype=float)

# Closed form θ = (XᵀX)⁻¹ Xᵀy — but prefer the numerically stable solver:
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                         # [-0.333  1.5]

# Or skip the normal equations entirely — lstsq is more robust:
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)                         # [-0.333  1.5]

When the inverse doesn't exist

The formula needs (XTX)−1, which only exists when XTX is invertible (non-singular). Two situations break it: (1) more features than examples (d+1 > n) — not enough data to pin down all the knobs; and (2) redundant features — e.g. you include area in both square feet and square meters; one is a constant multiple of the other, so they carry no independent information and XTX collapses. Both are fixable (drop the redundant feature, gather more data, or add a regularization term that makes the matrix invertible again — ridge regression), but the bare formula above silently fails. This is one practical reason gradient descent stays useful: it just keeps creeping downhill and never has to invert anything.

Normal equations vs. gradient descent — when to use which. Normal equations: exact, no α to tune, but you must invert a (d+1)×(d+1) matrix — cost grows like d³, painful past ~10,000 features, and impossible if XTX is singular. Gradient descent: needs α and many iterations, but scales to millions of features and examples and never inverts anything. Few features → normal equations. Massive or streaming data → gradient descent. Same answer, different engineering trade-offs.
Common misconception: "Since the normal equations give the exact answer instantly, gradient descent is obsolete." For a 2-feature toy problem, sure, use the formula. But the matrix inverse costs roughly d³ operations and needs XTX to fit in memory and be invertible. With 100,000 features (common in text or genomics) inverting a 100,000×100,000 matrix is hopeless, and with redundant features it's undefined. Gradient descent — and its descendants that train every modern neural net — sidesteps both problems.
The normal equations give θ = (XTX)−1XTy in closed form. What is their main practical limitation compared to gradient descent?

Chapter 7: The Living Regression Lab

Everything so far — the hypothesis, the cost, gradient descent, the LMS rule, batch vs. SGD, the normal equations — now collides in one playground. This is the chapter where it all becomes one moving picture. Take your time here; if the lesson clicks anywhere, it clicks here.

The left panel is the data and the fitted line. The right panel is the cost landscape — the same bowl from Chapter 3, seen from above as contour rings — with a moving dot showing where the current (θ0, θ1) sits on it. As the line on the left improves, the dot on the right slides downhill toward the bullseye. Watch the two panels together: a better line is literally a lower point on the bowl. That correspondence — "fitting the data" = "descending the cost surface" — is the single most important idea in the whole lesson, and here you can see both halves at once.

Things to try (each teaches something):
  • Run gradient descent and watch the line swing into place while the dot spirals into the bullseye.
  • Hit "Normal equations" to teleport straight to the exact optimum — the dot lands dead center instantly. Compare where GD was.
  • Crank α past the danger line and watch the dot ricochet outward off the bowl — divergence, live.
  • Drag a data point far away to create an outlier, refit, and watch the whole line bend toward it. That fragility is the seed of Chapter 9.
  • Add points by clicking empty space; delete the lab's certainty about the line.
Regression Lab — data ↔ cost surface, live

Click empty space to add a point · drag a point to move it · shift-click a point to delete it. The right panel is the cost bowl seen from above; the white dot is your current (θ01).

learning rate α0.080
mode
h(x) =  |  J =  |  iter 0  | 

Sit with one observation before moving on: gradient descent and the normal equations land on the same spot. Run GD to convergence, note the line; hit "Normal equations," and the line doesn't budge. Two utterly different procedures — one a thousand tiny downhill steps, the other a single matrix inversion — agree exactly, because both are solving the identical problem: find the bottom of this one convex bowl. When two independent methods agree, you've usually found something real.

The big synthesis. "Training a model" = "defining a cost surface from the data, then finding its lowest point." The data shapes the bowl; the algorithm descends it. Change the data (drag a point) and the bowl deforms, so the minimum moves, so the best line changes. This three-way dance — data → cost surface → optimal parameters — is not special to linear regression. It is the template for nearly all of machine learning. You just watched it happen.

(No quiz this chapter — the lab is the test. If you can predict, before clicking, what dragging a point will do to the line and to the dot on the bowl, you understand linear regression.)

Chapter 8: Why Squares? The Probabilistic Story

Back in Chapter 2 we chose to square the residuals, and the reasons were practical: squaring kills the sign and gives a smooth derivative. Honest reasons — but they leave a nagging question. Why squares and not, say, fourth powers, or absolute values? Those also kill the sign. Is least squares just a convenient convention, or is it somehow the right thing to do? This chapter delivers the deep answer, and it's one of the most satisfying results in all of statistics: if the world is a clean line corrupted by Gaussian noise, then minimizing squared error is exactly equivalent to finding the most probable parameters. Least squares isn't arbitrary. It's maximum likelihood in disguise.

A model of where the data comes from

Let's stop treating the y-values as arbitrary and propose a generative story: each true price is given by the perfect line plus some random scatter,

y(i) = θTx(i) + ε(i)

where ε(i) (epsilon) is an error term — everything the line doesn't capture: features we left out (the house's view, the neighbourhood school), measurement noise, plain randomness. We now make a specific, famous assumption about that noise: the ε(i) are independent and identically distributed (IID) and follow a Gaussian (Normal) distribution with mean 0 and some variance σ², written ε(i) ~ 𝒩(0, σ²). Mean 0 says the noise doesn't systematically push prices up or down; the bell shape says small deviations are common and big ones are rare. Its density is the bell curve:

p(ε(i)) = (1 / √(2π)σ) · exp( −(ε(i))² / 2σ² )

Since y(i) = θTx(i) + ε(i), and θTx(i) is a fixed number once we pick θ, the randomness in y comes entirely from ε. So y(i) is also bell-shaped, centered on the line's prediction. Substituting ε(i) = y(i) − θTx(i) into the density:

p(y(i) | x(i); θ) = (1 / √(2π)σ) · exp( −(y(i) − θTx(i))² / 2σ² )

Read this as: "given a house's features x and a choice of line θ, here's the probability the house sold for price y." A price right on the line is most probable; prices far from the line are exponentially less probable — and notice that exponent already contains a squared residual. The squares are creeping in, not by our choice this time, but from the shape of the bell curve itself.

The line is a ridge of probability

The generative model says each point is drawn from a bell curve centered on the line. Tilt the line and slide the noise σ; the shaded band is ±1σ. The data log-likelihood below peaks when the line threads the points — and that peak is exactly where squared error is smallest.

θ1 (slope)0.10
noise σ50
log-likelihood ℓ(θ) =   (higher = more probable)  ·  ½∑resid² =

Maximum likelihood: pick the θ that makes the data most probable

Here's the principle. Different lines θ assign different probabilities to the prices we actually observed. The maximum likelihood principle says: choose the θ under which the observed data was most probable. It's common sense — prefer the explanation that makes what you saw least surprising.

Because the examples are independent, the probability of the whole dataset is the product of the individual probabilities. Viewed as a function of θ, this product is called the likelihood L(θ):

L(θ) = ∏i=1n p(y(i) | x(i); θ) = ∏i=1n (1/√(2π)σ) exp( −(y(i) − θTx(i))² / 2σ² )

Maximizing a product of exponentials is awkward. The standard trick: take the logarithm. The log is monotonically increasing, so whatever θ maximizes L also maximizes log L — but the log turns the product into a sum and cancels the exp. Call it the log-likelihood ℓ(θ):

ℓ(θ) = log L(θ)
= ∑i=1n log[ (1/√(2π)σ) exp(−(y(i)−θTx(i))² / 2σ²) ]  ← log of product = sum of logs
= n·log(1/√(2π)σ) − (1/σ²)·½∑i=1n(y(i)−θTx(i))²  ← log·exp cancels; pull out constants

The punchline

Now look at that last line as a function of θ. The first term, n·log(1/√(2π)σ), has no θ in it — it's a constant we can ignore. The 1/σ² is a positive constant multiplier. So maximizing ℓ(θ) means maximizing −½∑(y − θTx)², which — because of the minus sign — means minimizing:

½ ∑i=1n (y(i) − θTx(i))²  =  J(θ)

That is exactly our least-squares cost J from Chapter 2. We didn't aim for it; it fell out. Maximizing the probability of the data under Gaussian-noise assumptions is identical to minimizing squared error. The squares were never an arbitrary choice — they are the fingerprint of the Gaussian's exp(−something²) shape. Choose a different noise distribution and you'd get a different cost: Laplacian (double-exponential) noise gives absolute error (the robust loss of Ch. 9). Least squares is the right answer to a specific, named assumption about the world.

One more elegant detail: notice σ² multiplied the whole sum but dropped out of the answer — the best-fit θ doesn't depend on how noisy we think the data is. That's why we never had to know or estimate σ to do linear regression. The noise level changes how confident we should be in the fit, but not the fit itself.

python
# Maximizing Gaussian log-likelihood == minimizing ½Σ(y − ŷ)². Demo:
def neg_log_likelihood(theta, X, y, sigma):
    resid = y - X @ theta
    n = len(y)
    # −ℓ(θ) up to constants that don't depend on θ:
    return n*np.log(np.sqrt(2*np.pi)*sigma) + (1/sigma**2)*0.5*np.sum(resid**2)

# The θ that minimizes −ℓ is the SAME θ that minimizes ½Σresid², for ANY σ>0.
# → least squares IS maximum likelihood under Gaussian noise.
Common misconception: "This proves least squares is always correct." It proves least squares is the maximum-likelihood estimator under the specific assumption that noise is IID Gaussian. When that assumption is wrong — heavy-tailed noise, outliers, skewed errors — least squares is no longer optimal and can be badly misled (Ch. 9). The probabilistic view is powerful precisely because it makes the hidden assumption explicit: it tells you exactly when to trust the method and when to reach for a different loss.
Under what assumption about the data is minimizing squared error (least squares) exactly equivalent to maximum-likelihood estimation of θ?

Chapter 9: When It Breaks — Underfitting, Overfitting, Outliers

You now own a complete, working algorithm. A good engineer's next question is always: how does it fail? Knowing the failure modes is what separates "I can run LinearRegression().fit()" from "I understand linear regression." There are four ways it breaks, and each one points to a whole branch of the field.

1. Underfitting — the model is too simple

A straight line can only ever be straight. If the true relationship curves — prices that rise fast then plateau for huge mansions — no choice of θ0, θ1 will fit it well. The line systematically misses, leaving a pattern in the residuals. This is underfitting: the model lacks the flexibility to capture the structure that's plainly in the data. The fix is to give it more expressive features.

And here's the trick that surprises people: you can fit a curve with linear regression. Just invent new features that are powers of x — feed the model x, x², x³ as separate columns. The hypothesis h(x) = θ0 + θ1x + θ2x² + θ3x³ is a wiggly cubic in the input x, yet it's still perfectly linear in the parameters θ, so every formula we derived still applies unchanged. "Linear" regression was never about straight lines — it's about being linear in the knobs.

2. Overfitting — the model is too flexible

If a little curvature helps, why not a lot? Fit a 5th-order polynomial and the curve can pass through every training point exactly, driving J all the way to zero. Victory? No — disaster. Between the points the curve thrashes wildly up and down, and on a new house it predicts nonsense. The model memorized the training data, including its noise, instead of learning the trend. This is overfitting: low training cost, terrible real-world predictions. Remember the Chapter 2 warning — low cost is fit, not wisdom. Here it bites.

Underfit → just right → overfit

Same data, three model complexities. Slide the polynomial degree. Degree 1 underfits (misses the curve). A middle degree captures the trend. Crank it high and the curve contorts to hit every point — flawless on these dots, useless between them. Watch the training cost fall to ~0 even as the curve gets obviously worse.

polynomial degree1
training cost J =  

The cure is regularization — add a penalty on large θ values to the cost, so the model is rewarded for staying simple unless the data really demands complexity. (That's ridge and lasso regression, and conveniently the ridge penalty also fixes the non-invertibility problem below.) The bias–variance tradeoff that governs this balance gets its own lesson.

3. Outliers — the Achilles' heel of squaring

The very thing that made squared error convenient now backfires. Because errors are squared, a single wild point — a data-entry typo, a mansion sold to a relative for $1 — contributes a huge penalty (its residual squared), and the line bends hard to reduce it, abandoning the other 46 honest points. Least squares is not robust: one bad point can wreck the whole fit. This is the Gaussian assumption from Chapter 8 failing — real noise had a heavy tail the bell curve didn't expect. The remedy is a loss that grows more slowly than the square (absolute error, Huber loss), which is exactly the robust estimation story.

4. Singular XTX and unscaled features

Two more practical traps. First, as we saw in Chapter 6, if features are redundant (one is a linear combination of others) or you have fewer examples than features, XTX has no inverse and the normal equations simply don't define an answer. Second, for gradient descent, features on wildly different scales (area in the thousands, bedrooms in single digits) stretch the cost bowl into a long thin canyon, so any α that's safe for one direction crawls hopelessly in the other. The fix is feature scaling — standardize each feature to mean 0, variance 1 — which rounds the bowl and lets gradient descent take confident steps. Always scale your features before running GD.

Every failure is a doorway. Underfitting → richer features & basis functions. Overfitting → regularization & the bias–variance tradeoff. Outliers → robust losses. Non-invertibility → ridge regression. Slow convergence → feature scaling & better optimizers. The humble line you just mastered is the trunk; every advanced method in supervised learning grows from patching one of these cracks.
Common misconception: "If my model fits the training data perfectly (J = 0), it's the best possible model." Perfect training fit is a red flag, not a trophy — it almost always means the model memorized noise and will generalize poorly. The goal is low error on unseen data, which you estimate by holding out a validation set, never by celebrating a zero on the data you trained on.
A 5th-order polynomial passes through every training point, giving training cost J = 0, but predicts terribly on new houses. What is this, and why does it happen?

Chapter 10: Connections & Cheat Sheet

You started this lesson unable to defend any particular line through a scatter of houses. You can now: define a hypothesis, score it with a principled cost, minimize that cost two different ways, and — the deep part — explain why squared error is the right cost when noise is Gaussian. That's a real, complete piece of machine learning, mastered from zero. More importantly, you've met the template that nearly every other model reuses.

The whole lesson on one page

ConceptFormulaIn words
Hypothesishθ(x) = θTx = ∑j θjxjPrediction = weighted sum of features (x0=1 holds the bias)
Cost (least squares)J(θ) = ½∑i(h(x(i))−y(i)Half the total squared residual — how wrong the line is
Gradient∂J/∂θj = ∑i(h(x(i))−y(i))xj(i)Slope of the cost in the θj direction
GD updateθj := θj − α·∂J/∂θjStep downhill against the gradient
LMS rule (1 example)θj := θj + α(y−h)xjNudge by learning rate × error × input
Normal equationsθ = (XTX)−1XTyClosed-form exact minimum (no iteration)
Probabilistic viewy = θTx + ε,  ε ~ 𝒩(0,σ²)Least squares = MLE under Gaussian noise

Symbol glossary

SymbolMeaning
x(i), y(i)features and target of the i-th training example ((i) is an index, not a power)
θthe parameter / weight vector — the knobs we tune
x0 = 1the fake feature that carries the intercept (bias trick)
αlearning rate — gradient-descent step size
ε, σ²noise term and its variance
X, ydesign matrix (examples × features) and target vector

Production code — the whole thing, both ways

python
import numpy as np

def fit_normal_equations(X, y):
    """Exact closed-form solution. Best for few features."""
    return np.linalg.lstsq(X, y, rcond=None)[0]   # stabler than (XᵀX)⁻¹Xᵀy

def fit_gradient_descent(X, y, alpha=0.01, iters=5000):
    """Iterative. Scales to huge data. Remember to feature-scale X first."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y))   # θ −= α·∇J
    return theta

# sklearn one-liner — does the normal equations under the hood:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_raw, y)     # handles the bias for you
print(model.coef_, model.intercept_)

Where to go next

You can now teach this. If a friend asked "what is linear regression, really?" you could draw the scatter, define the line as two knobs, build the squared-error cost as literal squares, roll downhill with the LMS rule, jump to the exact answer with the normal equations, and finish with the kicker that it's all just maximum likelihood under Gaussian noise. From a vague feeling about a line to a complete, defensible theory. That's mastery.

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." — John von Neumann, on the danger of overfitting. You now know exactly what he meant.