Classical ML · CS229

Bias, Variance & Double Descent

Why a model that fits its training data perfectly can predict garbage — and why, against all classical wisdom, making it even bigger can fix it. The anatomy of generalization, from the textbook tradeoff to the deep-learning twist.

Prerequisites: Linear Regression + the idea of overfitting. We build the rest.
10 chapters · 7+ simulations · 0 assumed knowledge

Chapter 0: The Model That Aced the Test and Failed the Class

Back in the linear regression lesson, we saw something unsettling. A high-degree polynomial could thread its way through every single training point, driving the training error to exactly zero — a flawless fit. And yet, on new data, it predicted nonsense, wiggling wildly between the points it had memorized. A perfect score on the practice exam, a failing grade on the real one.

This is the central puzzle of machine learning, and it deserves a real explanation, not just the label "overfitting." Why does fitting the training data better sometimes make predictions worse? And how do we know, before we see the test results, how complex a model to build? Get this wrong and your model is either too dumb to learn or too clever for its own good.

Watch a model overfit in real time

The dots are noisy samples from a smooth true curve (a gentle quadratic). Slide the model complexity (polynomial degree). Watch the fitted curve: too low and it misses the shape; just right and it tracks the truth; too high and it contorts to hit every noisy dot — training error plummets to zero while the fit gets visibly worse. The faint test points are new data the model never saw.

[Interactive simulation. Control: model complexity (polynomial degree). Readouts: training error, test error.]

Slide it to degree 12 and stare at the numbers: training error near zero, test error enormous. The model isn't learning the signal (the smooth curve) — it's memorizing the noise (the random scatter of this particular sample). On a different sample, that noise would be different, so the memorized wiggles would be wrong. That's the crux.
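The same experiment can be sketched offline. This is a minimal illustration, not the page's simulation: the true curve `0.5 * x**2`, the noise level, the sample sizes, and the helper names `true_curve`, `sample`, and `mse` are all assumptions chosen for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_curve(x):
    # Assumed ground truth for this sketch: a gentle quadratic.
    return 0.5 * x**2

def sample(n, noise_sd=0.2):
    # Noisy observations of the true curve at random x in [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, true_curve(x) + rng.normal(0.0, noise_sd, size=n)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

x_train, y_train = sample(15)     # the data the model gets to memorize
x_test, y_test = sample(1000)     # fresh data it never saw

errors = {}
for degree in (2, 12):
    fit = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (mse(fit(x_train), y_train), mse(fit(x_test), y_test))
    print(f"degree {degree:2d}: train {errors[degree][0]:.4f}  test {errors[degree][1]:.4f}")
```

The degree-12 fit cannot do worse than the degree-2 fit on the training points, because its family contains every quadratic. What it does on the test points is the interesting part.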

To understand this precisely — and to predict the right complexity in advance — we're going to take the test error apart like a watchmaker, into three distinct pieces. Two of them we can control by choosing the model; one is an unbeatable floor set by the noise itself. The two controllable pieces are called bias and variance, and they pull in opposite directions. Master their tug-of-war and overfitting stops being a mystery and becomes a dial you can tune.

The plan. Ch.1: training vs test error — what we actually care about. Ch.2: bias, the error of a too-simple model. Ch.3: variance, the wobble of a too-complex one. Ch.4: the clean decomposition that adds them up. Ch.5: the classic tradeoff curve and its sweet spot. Ch.6: how to fight variance (more data, regularization). Then the modern plot twist — Ch.7–8: double descent, where making a model absurdly large makes it generalize better, shattering the textbook curve and explaining why deep learning works at all.
Common misconception: "Lower training error is always progress." Training error measures memorization, not understanding. Past a point, driving training error down only fits noise — test error climbs even as training error falls. The two can move in opposite directions, and only one of them (test error) is the thing you actually want small. Celebrating a tiny training error is the rookie's victory lap before the real race.
A degree-12 polynomial achieves nearly zero training error but huge test error on a smooth underlying function. What is it actually doing?

Chapter 1: Training Error vs. Test Error — The Gap That Matters

Let's get the vocabulary exactly right, because the whole subject hinges on one distinction. When you train a model, you minimize the training error — the average loss on the examples you fit. But that's never the real goal. The real goal is the test error — the average loss on new, unseen examples drawn from the same source. You don't care how well the model recites the data it studied; you care how well it predicts the data it hasn't.

The difference between them has a name: the generalization gap. A model that has truly learned has a small gap — it does about as well on new data as on its training data. A model that has overfit has a large gap — great on training, poor on test. The entire art of building models that work is the art of keeping this gap small while still fitting the signal.

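This gives us crisp definitions of the two ways a model can fail:

Underfitting: training error is already high, so test error is high too. The model is too simple to capture the signal.
Overfitting: training error is low but test error is high. The model has fit the noise, and the generalization gap is large.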

The two curves that tell the whole story

As model complexity grows (left→right), watch the two error curves. Training error falls monotonically — more complexity always fits the seen data better. But test error is U-shaped: it falls, bottoms out, then rises. The gap between them — the shaded region — is the generalization gap, widening as the model overfits. Slide to find where test error is lowest.

[Interactive simulation. Control: complexity.]
Why you must hold out a test set. You cannot measure the generalization gap using the training data — by definition the model has seen it. So in practice you hold out a portion of your data, never let the model train on it, and use it only to estimate test error. Peeking at the test set during model-building secretly shrinks the gap you measure while the true gap stays large — the cardinal sin of machine learning. The test set is sacred precisely because it's unseen.
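A hold-out split might look like the sketch below. The dataset, the 30% split fraction, and the degree-9 model are hypothetical choices for illustration; the only point is that the held-out indices never reach the fitting call.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Hypothetical dataset: 60 noisy samples of a gentle quadratic.
x = rng.uniform(-1.0, 1.0, size=60)
y = 0.5 * x**2 + rng.normal(0.0, 0.2, size=60)

# Shuffle once, then set aside 30% that the model is never trained on.
idx = rng.permutation(len(x))
n_test = len(x) * 3 // 10
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit on the training portion only.
fit = Polynomial.fit(x[train_idx], y[train_idx], deg=9)

train_err = float(np.mean((fit(x[train_idx]) - y[train_idx]) ** 2))
test_err = float(np.mean((fit(x[test_idx]) - y[test_idx]) ** 2))
gap = test_err - train_err   # estimate of the generalization gap
print(f"train {train_err:.4f}  test {test_err:.4f}  gap {gap:.4f}")
```

If you tune the degree by watching `test_err`, that number stops being an honest estimate; in that case a third split (a validation set) would be used for tuning.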
Common misconception: "If training and test data come from the same distribution, training error should equal test error." Even from the identical source, the model has seen the training points and not the test points — and a flexible model can exploit the specific quirks of what it saw. That asymmetry (seen vs. unseen) is the entire reason a gap exists. Same distribution, different role: one taught the model, the other audits it.
What is the "generalization gap," and which scenario produces a large one?

Chapter 2: Bias — The Error You Can't Train Away

The first of our two controllable error sources is bias, and it's the failure of being too simple. Here's the precise idea: imagine you could train your model on an infinite amount of data, so much that noise averages out completely and you find the genuinely best model your chosen family can offer. The error that still remains at that point, over and above the irreducible noise in the labels, is the bias. It's the gap between the best your model family can do and the truth.

A straight line is the classic high-bias model. If the true relationship is curved, no line — not even the best possible line fit to infinite data — can ever match it. The line is fundamentally, structurally incapable of representing a curve. That permanent, unfixable error is bias. Crucially, more data does not help bias one bit. Throw a billion points at a line trying to fit a parabola and it's still a line, still wrong in the same way. Bias is a limitation of the model's expressiveness, not of the data's quantity.

Bias doesn't shrink with more data

A curved true function and a best-fit straight line. Crank the amount of data all the way up — the line gets more confident, but it never gets more correct: it stays stubbornly far from the curve in the same places. That persistent error, even with mountains of data, is bias. The line simply cannot bend.

[Interactive simulation. Control: number of training points. Readout: the best-fit line's error against the true curve, which barely changes.]

Think of bias as a model's stubbornness or its built-in blind spot. A linear model has strong assumptions baked in — "the world is a straight line" — and if those assumptions are wrong, no amount of evidence will talk it out of them. High bias means high stubbornness: the model imposes its preconception on the data and underfits whatever doesn't match. The cure for bias is not more data — it's a more flexible model (a higher-degree polynomial, more features, a neural network) that can actually represent the truth.
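A quick numerical check of the claim that data cannot cure bias. This sketch assumes a true curve of `x**2` on [-1, 1] and a noise level of 0.2, neither of which comes from the lesson; it measures the fitted line against the noiseless curve, so what remains is the line's structural error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 401)   # where we compare the line to the truth

def true_curve(x):
    # Assumed curved ground truth.
    return x**2

line_error = {}
for n in (20, 200, 20_000):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_curve(x) + rng.normal(0.0, 0.2, size=n)
    line = Polynomial.fit(x, y, deg=1)            # best-fit straight line
    # Error against the noiseless curve, so label noise is excluded.
    line_error[n] = float(np.mean((line(grid) - true_curve(grid)) ** 2))
    print(f"n = {n:6d}: line vs true curve MSE = {line_error[n]:.4f}")
```

No straight line can get below roughly 0.09 on this grid, so the printed error levels off near that floor instead of heading to zero as `n` grows.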

Common misconception: "Bias means the model is skewed toward one answer, like a biased coin." In this context bias has a specific technical meaning: the error of the best possible model in your family, the part that survives even with infinite data. It's about expressiveness, not prejudice. A high-bias model isn't 'unfair' — it's just too rigid to capture the underlying pattern, no matter how much you train it.
You fit a straight line to data from a curved function. You collect 100× more data. What happens to the bias?

Chapter 3: Variance — The Wobble Across Datasets

The opposite failure is being too complex, and its signature is variance. Here's the thought experiment that defines it: imagine training your model not once, but on many different training sets — each a fresh random sample from the same source. A low-variance model gives nearly the same answer every time. A high-variance model gives wildly different answers each time, because it's chasing the random noise that happens to be in each particular sample.

Variance is the model's jumpiness — how much its learned function dances around as the training data is reshuffled. A degree-12 polynomial fit to three different samples of the same curve produces three dramatically different wiggly monsters, each contorted to hit its own sample's noise. The signal (the true curve) is the same across samples; only the noise differs — so a model that fits the noise is, by definition, fitting something that won't reappear, and it varies enormously from sample to sample.

High variance: the same model, different every time

Press Resample to draw a fresh training set from the same true curve and refit. With low complexity, the fitted line barely moves between samples — stable, low variance. Crank the complexity up and hit Resample repeatedly: the high-degree fit thrashes around violently, a totally different curve each time. That instability is variance.

[Interactive simulation. Controls: complexity (degree), Resample. Readout: spread of fits.]
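The Resample experiment can be approximated in a few lines. This is a sketch under assumed settings (20 points per sample, 200 resamples, noise 0.2, a quadratic truth); `resampled_fits` is a hypothetical helper, and "spread" here means the variance of the predictions across resamples, averaged over a grid.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
grid = np.linspace(-0.9, 0.9, 50)   # points at which the fits are compared

def resampled_fits(degree, n_points=20, n_resamples=200, noise_sd=0.2):
    # Refit the same model family on many fresh samples of the same curve.
    preds = np.empty((n_resamples, grid.size))
    for i in range(n_resamples):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        y = 0.5 * x**2 + rng.normal(0.0, noise_sd, size=n_points)
        preds[i] = Polynomial.fit(x, y, deg=degree)(grid)
    return preds

spread = {}
for degree in (1, 9):
    preds = resampled_fits(degree)
    # Variance across resamples at each grid point, averaged over the grid.
    spread[degree] = float(preds.var(axis=0).mean())
    print(f"degree {degree}: spread of fits = {spread[degree]:.4f}")
```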

Notice the cruel symmetry with bias. Bias is being too rigid — the model ignores the data and imposes its own shape (underfitting). Variance is being too flexible — the model takes the data too seriously, including its noise (overfitting). And unlike bias, variance does shrink with more data: with enough points, the noise starts to average out, and even a flexible model can't be fooled by it as easily. That asymmetry — more data kills variance but not bias — will matter enormously.

Common misconception: "Variance is just another word for the model being wrong." Variance specifically measures inconsistency — how much the model changes when retrained on different samples — not how wrong any single fit is. A high-variance model might fit your one dataset beautifully; the problem is that this beautiful fit is a fluke of this sample's noise and would look completely different on the next sample. It's wrong in an unstable way.
What does the "variance" of a model-fitting procedure measure?

Chapter 4: The Decomposition — Error in Three Pieces

Now the beautiful result that ties it together. For regression with squared error, the expected test error at any point splits exactly into three non-overlapping pieces. This isn't a vague analogy — it's an algebraic identity:

Test Error = Bias² + Variance + Irreducible Noise

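Let's earn each piece. We average over two sources of randomness: the random training set (which determines which model we learn) and the random noise on the test point. Define the average model as the one you'd get by training on infinitely many datasets and averaging their predictions: the "typical" model your procedure produces. Then the test error decomposes cleanly:

Bias² is the squared distance between the average model's prediction and the true function.
Variance is the expected squared distance between the model you actually trained and the average model.
Irreducible noise is the variance of the label noise itself, which no model can predict.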

The reason these add cleanly (no cross-terms) is a tidy fact: the noise is independent of everything, and the deviation-from-average has zero mean by construction, so when you expand the squared error, the cross-terms vanish in expectation. Three sources of error, perfectly separated, summing to the total. (We won't grind the full algebra here, but it's two applications of "for independent A and B with A centered at zero, the expected square of their sum is the sum of their expected squares.")
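For readers who want the algebra, here is one way to write it out. The notation is an assumption introduced for this sketch, not taken from the lesson: labels are y = f(x) + ε with zero-mean noise of variance σ², the model trained on dataset D is written \hat f_D, and the average model is \bar f.

```latex
% Assumed notation:
%   y = f(x) + \varepsilon,  E[\varepsilon] = 0,  Var(\varepsilon) = \sigma^2
%   \hat f_D(x): prediction of the model trained on dataset D
%   \bar f(x) = E_D[\hat f_D(x)]: prediction of the average model
\begin{align*}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}_{D,\varepsilon}\Big[\big(\varepsilon
     + (f(x) - \bar f(x))
     + (\bar f(x) - \hat f_D(x))\big)^2\Big] \\
  &= \underbrace{\sigma^2}_{\text{irreducible noise}}
   + \underbrace{\big(f(x) - \bar f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \bar f(x))^2\big]}_{\text{variance}}
\end{align*}
% The cross-terms vanish in expectation:
%   E[\varepsilon \cdot (\text{anything depending only on } D, x)] = 0
%     because the test noise is independent of D and has zero mean;
%   (f(x) - \bar f(x)) \, E_D[\bar f(x) - \hat f_D(x)] = 0
%     because \bar f is by definition the mean of \hat f_D over D.
```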

Why this decomposition is so powerful. It turns the vague worry "is my model too simple or too complex?" into two separately diagnosable quantities. High bias and low variance? Your model is too simple: add complexity. Low bias and high variance? Too complex: simplify or get more data. In a simulation where the true function is known, you can even estimate each: train on many resampled datasets, and the spread of the predictions around their own average is the variance, while the squared distance from that average to the truth is the bias². On real data the truth is unknown, so you fall back on comparing training and validation error. Diagnosis before treatment.
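The estimation recipe can be tried in a simulation where the truth is known. The sketch below assumes a quadratic truth, a straight-line model (so that bias is visibly nonzero), and 2000 resampled training sets; every one of those settings is illustrative. It estimates the three pieces separately and compares their sum with a direct Monte Carlo estimate of test error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise_sd, degree, n_points, n_trials = 0.2, 1, 20, 2000
grid = np.linspace(-0.9, 0.9, 50)   # test locations
truth = 0.5 * grid**2               # assumed true function on the grid

# Train on many independent datasets and record each model's predictions.
preds = np.empty((n_trials, grid.size))
for i in range(n_trials):
    x = rng.uniform(-1.0, 1.0, size=n_points)
    y = 0.5 * x**2 + rng.normal(0.0, noise_sd, size=n_points)
    preds[i] = Polynomial.fit(x, y, deg=degree)(grid)

average_model = preds.mean(axis=0)
bias_sq = float(np.mean((average_model - truth) ** 2))   # average model vs truth
variance = float(np.mean(preds.var(axis=0)))             # spread around the average
noise = noise_sd**2                                      # known here by construction

# Direct estimate: each trained model predicts fresh noisy labels.
y_test = truth + rng.normal(0.0, noise_sd, size=preds.shape)
test_error = float(np.mean((y_test - preds) ** 2))

print(f"bias^2 {bias_sq:.4f} + variance {variance:.4f} + noise {noise:.4f}"
      f" = {bias_sq + variance + noise:.4f}")
print(f"direct test error estimate: {test_error:.4f}")
```

The two printed totals should agree up to Monte Carlo error, which is the decomposition at work.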
The three pieces, summed

Slide model complexity and watch the stacked components of test error. Bias² (blue) falls as complexity rises — flexible models represent the truth better. Variance (purple) climbs — flexible models fit noise. Noise (gray) is a flat floor. Their sum is the total test error — the U-shaped curve.

[Interactive simulation. Control: complexity. Readout: bias² + variance + noise = total test error.]
Common misconception: "If I build a perfect model, test error goes to zero." Never — the irreducible noise is a floor you cannot cross. If the data has inherent randomness (and real data always does), even an oracle that knows the true function exactly will make errors on individual points, because the points themselves are noisy. The best you can ever do is drive bias and variance to zero and be left sitting on the noise floor. Chasing below it means you're fitting noise — overfitting, by definition.
The test error decomposes into three pieces. Which one can NO model, however good, ever reduce?

Chapter 5: The Tradeoff — Finding the Sweet Spot

Now the central practical insight, the one that guided model selection for decades. Bias and variance pull in opposite directions as you change complexity. Make the model simpler: bias rises, variance falls. Make it more complex: bias falls, variance rises. You cannot minimize both at once by tuning complexity; you must trade one against the other. And since test error is bias² plus variance on top of a fixed noise floor, it's minimized somewhere in the middle: the famous U-shaped curve, with a sweet spot at the bottom.

This is the bias-variance tradeoff, and it's a lens you'll use forever. Too far left (too simple): high bias dominates, the model underfits, both errors high. Too far right (too complex): high variance dominates, the model overfits, test error high despite tiny training error. The sweet spot is where the rising variance and falling bias balance — the model complex enough to capture the signal, simple enough to ignore the noise.
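In practice the sweet spot is found by sweeping complexity and watching held-out error. A minimal sketch, assuming polynomial degree as the complexity knob, 30 training points, and a separate validation sample (all hypothetical settings):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n, noise_sd=0.2):
    # Noisy samples of an assumed quadratic truth.
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 0.5 * x**2 + rng.normal(0.0, noise_sd, size=n)

x_train, y_train = sample(30)
x_val, y_val = sample(2000)

degrees = list(range(1, 13))
train_err, val_err = [], []
for degree in degrees:
    fit = Polynomial.fit(x_train, y_train, deg=degree)
    train_err.append(float(np.mean((fit(x_train) - y_train) ** 2)))
    val_err.append(float(np.mean((fit(x_val) - y_val) ** 2)))

# The chosen complexity is wherever held-out error bottoms out.
best_degree = degrees[int(np.argmin(val_err))]
for d, tr, va in zip(degrees, train_err, val_err):
    marker = "  <- lowest validation error" if d == best_degree else ""
    print(f"degree {d:2d}: train {tr:.4f}  val {va:.4f}{marker}")
```

Training error only ever falls along the sweep, so it cannot be used to pick the degree; the validation column is the one that turns back up.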

This is the full picture — play with all of it. The lab below ties every chapter together: the data and fit on the left, the error anatomy on the right. As you slide complexity, watch the left-panel curve go from rigid (underfit) to perfect to deranged (overfit), while the right panel shows exactly why in terms of bias and variance. The lowest point of the red test-error curve is the complexity you'd actually choose.

Bias-Variance Lab

Left: noisy data + the fitted curve at your chosen complexity. Right: the error curves — training (always falling), test (U-shaped), with the optimal complexity marked. Drag the complexity slider and watch both panels move together. Press Resample to see the fit's variance directly.

[Interactive lab. Controls: complexity (degree), Resample.]

One sobering takeaway: the sweet spot depends on how much data you have. With little data, variance is dangerous, so you should prefer simpler models (accept some bias to avoid wild variance). With abundant data, variance is tamed, so you can afford more complexity and drive bias down. "How complex should my model be?" has no universal answer — it's a function of your data budget. (No quiz — the lab is the test. If you can predict where the test-error curve bottoms out before sliding there, you understand the tradeoff.)

Chapter 6: Fighting Variance — More Data & Regularization

Suppose you've diagnosed your model as high-variance — it overfits, the test error far exceeds the training error. You don't have to retreat to a simpler model (and pay in bias). You have two more powerful tools that attack variance directly while keeping your model's expressiveness.

Tool 1: more data

As we noted, variance shrinks as data grows. The intuition: variance comes from fitting the noise of your particular sample, and with more points, the noise increasingly cancels out (random ups and downs average toward zero), so the model can't be misled by it. A degree-12 polynomial that thrashes wildly on 15 points becomes quite stable on 15,000 — there's no room left to wiggle between so many constraints. More data is the cleanest fix because it reduces variance without raising bias. The catch is simply that data costs money and time.
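The 15 versus 15,000 comparison is easy to check numerically. This sketch keeps the degree fixed at 12 and varies only the training-set size; the quadratic truth, the noise level, the 30 resamples, and the helper `fit_spread` are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
grid = np.linspace(-0.9, 0.9, 50)

def fit_spread(n_points, degree=12, n_resamples=30, noise_sd=0.2):
    # Variance of the fitted curve across fresh training sets of a given size.
    preds = np.empty((n_resamples, grid.size))
    for i in range(n_resamples):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        y = 0.5 * x**2 + rng.normal(0.0, noise_sd, size=n_points)
        preds[i] = Polynomial.fit(x, y, deg=degree)(grid)
    return float(preds.var(axis=0).mean())

spread = {n: fit_spread(n) for n in (15, 15_000)}
for n, s in spread.items():
    print(f"n = {n:6d}: spread of degree-12 fits = {s:.6f}")
```

The model family is identical in both runs, so its bias is unchanged; only the variance moves.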

Tool 2: regularization

When you can't get more data, regularization is the workhorse. The idea: add a penalty to the cost that discourages the model from using large parameter values. A model with huge weights can swing violently to hit every point (high variance); penalizing weight size keeps the fitted function smooth, resisting the urge to chase noise. You're deliberately introducing a little bias (the model can't fit quite as freely) in exchange for a large reduction in variance — and since test error is their sum, that's often a winning trade.
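One common form of this penalty is ridge (L2) regularization, which adds λ times the squared weight norm to the squared-error cost. The lesson does not name a specific penalty, so treat ridge as an assumed choice here. The sketch fits a degree-12 polynomial at several strengths; for simplicity it penalizes the intercept too, which real implementations often avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
degree, noise_sd = 12, 0.2

def features(x):
    # Polynomial features 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    # Minimizes ||Phi w - y||^2 + lam * ||w||^2 by solving the equivalent
    # augmented least-squares problem, which also handles lam = 0.
    p = Phi.shape[1]
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(p)])
    b = np.concatenate([y, np.zeros(p)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = 0.5 * x_train**2 + rng.normal(0.0, noise_sd, size=20)
x_test = rng.uniform(-1.0, 1.0, size=2000)
y_test = 0.5 * x_test**2 + rng.normal(0.0, noise_sd, size=2000)

results = {}
for lam in (0.0, 1e-3, 1e-1, 10.0):
    w = ridge_fit(features(x_train), y_train, lam)
    train_mse = float(np.mean((features(x_train) @ w - y_train) ** 2))
    test_mse = float(np.mean((features(x_test) @ w - y_test) ** 2))
    results[lam] = (float(np.linalg.norm(w)), train_mse, test_mse)
    print(f"lambda {lam:7.3f}: |w| {results[lam][0]:10.3f}  "
          f"train {train_mse:.4f}  test {test_mse:.4f}")
```

As λ grows, the weight norm shrinks and training error rises; that is the deliberate bias being introduced. Whether test error improves depends on λ, which is why it is tuned on held-out data.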

Regularization tames the wiggle

A high-degree polynomial fit to noisy data. Slide the regularization strength up: the curve relaxes from a frantic, overfit wiggle (left, zero regularization) to a smooth, sensible fit (middle) — and if you over-do it, to an over-smoothed near-line (right, too much regularization, now underfitting). There's a sweet spot here too. Watch the test error.

[Interactive simulation. Control: regularization strength λ. Readout: test error.]
Regularization is a bias-variance dial. Strength zero = full flexibility = low bias, high variance (overfit). Strength enormous = forced simplicity = high bias, low variance (underfit). In between lies a sweet spot, exactly like the complexity tradeoff — but now you can keep a powerful, expressive model and dial in just enough smoothing. This is why modern huge models lean on regularization (weight decay, dropout, early stopping) instead of shrinking: keep the expressiveness, control the variance.
Common misconception: "Regularization makes the model worse because it stops it from fitting the data." It stops the model from fitting the training data as tightly — on purpose. That slight worsening of training fit is the price for a large gain in test performance, because the model stops memorizing noise. Judged by training error, regularization looks harmful; judged by test error (the thing that matters), it usually helps. Never evaluate a regularizer on the training set.
How does regularization improve a high-variance (overfitting) model?

Chapter 7: Double Descent — When the Textbook Curve Breaks

Everything so far is the classical story, and it ruled machine learning for decades: test error is U-shaped, so don't make your model too big. Then deep learning arrived with models having millions or billions of parameters — far more than training examples — and they generalized beautifully. By the textbook curve, they should have been catastrophic overfitting disasters. They weren't. Something was missing from the picture.

That something is double descent, one of the most surprising discoveries in modern machine learning. When you plot test error against model size past the point where the classical curve stops, a second act appears. Test error follows the familiar U — down, then up — rising to a sharp peak right at the interpolation threshold: the point where the model has just barely enough parameters to fit the training data exactly. And then, as you add even more parameters into the overparameterized regime, the test error descends again — often to a new minimum lower than the classical sweet spot. Two descents, with a dangerous peak between them.

The double descent curve — real, simulated live

This is genuine min-norm linear regression on random features, computed in your browser — not a hand-drawn cartoon. Slide the number of parameters across the interpolation threshold (marked, where params = training points). Watch test error rise to a spike there, then fall again as the model becomes hugely overparameterized. The left valley is the classical sweet spot; the right valley is the deep-learning regime.

[Interactive simulation. Control: number of parameters.]

Look at what happens right at the peak (parameters ≈ training points). Here the model has exactly enough capacity to interpolate the data and no more — so it's forced into the one and only contorted function that threads every noisy point, with no freedom to be smooth. That single forced solution is maximally sensitive to noise: variance explodes, and test error spikes. It's the worst of both worlds — complex enough to chase all the noise, constrained enough to have no gentler option.
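The spike can be reproduced offline with a small min-norm regression experiment. This is not the page's browser simulation: it assumes 200 isotropic Gaussian input coordinates of which the model sees only the first `p`, 40 training points, and `np.linalg.pinv` to obtain the minimum-norm least-squares weights. The helper `trial` and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_total, noise_sd = 40, 500, 200, 0.5
# Assumed true weights, spread evenly over all coordinates (norm close to 1).
beta = rng.normal(size=n_total) / np.sqrt(n_total)

def trial(p):
    # One fresh dataset; the model uses only the first p coordinates.
    X = rng.normal(size=(n_train, n_total))
    y = X @ beta + rng.normal(0.0, noise_sd, size=n_train)
    X_te = rng.normal(size=(n_test, n_total))
    y_te = X_te @ beta + rng.normal(0.0, noise_sd, size=n_test)
    # pinv gives ordinary least squares when p < n_train and the
    # minimum-norm interpolating solution when p > n_train.
    w = np.linalg.pinv(X[:, :p]) @ y
    return float(np.mean((X_te[:, :p] @ w - y_te) ** 2))

param_counts = [5, 10, 20, 30, 38, 40, 42, 60, 100, 200]
test_err = {p: float(np.mean([trial(p) for _ in range(30)])) for p in param_counts}

for p in param_counts:
    note = "  <- interpolation threshold" if p == n_train else ""
    print(f"params {p:3d}: test MSE {test_err[p]:10.3f}{note}")
```

Reading down the printout, error climbs as `p` approaches 40, spikes there, and comes back down as `p` keeps growing.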

But push past it, into genuine overparameterization, and something liberating happens: now there are many different parameter settings that all fit the training data perfectly. The model gets to choose among them — and, remarkably, the training procedure tends to pick a smooth, simple one. That's the subject of the next chapter, and it's the key to why bigger can be better.

Common misconception: "Double descent means the bias-variance tradeoff was wrong." Not wrong — incomplete. The classical U-curve is exactly right up to the interpolation threshold, and the bias-variance decomposition still holds. Double descent reveals that "number of parameters" is a flawed measure of true model complexity in the overparameterized regime — a billion-parameter model that's implicitly kept smooth is, in the ways that matter, simpler than its parameter count suggests. The tradeoff is real; we were just measuring complexity wrong.
In the double descent curve, where does test error reach its dangerous peak?

Chapter 8: Why Bigger Wins — Implicit Regularization

The puzzle of the second descent: in the overparameterized regime there are infinitely many models that fit the training data perfectly (zero training error). Most of them are horrible — wild, overfit functions. So why does training a huge model land on a good one instead of a terrible one? The answer is a subtle and beautiful idea: implicit regularization.

Here's the mechanism, made concrete for linear models trained with squared error. When many solutions fit the data, gradient descent starting from zero doesn't pick an arbitrary one: it converges to the minimum-norm solution, the one with the smallest weights among all perfect fits. And small weights, as we saw in Chapter 6, mean a smooth function. So the optimizer is secretly regularizing, not because you added a penalty, but because the geometry of gradient descent inherently prefers the gentlest solution that works. The model is overparameterized in raw count, but the effective function it learns is simple.
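This claim is easy to verify for an underdetermined least-squares problem. The sketch below uses random Gaussian data (20 equations, 100 unknowns, purely illustrative), runs plain gradient descent from zero, and compares the result with the minimum-norm solution computed by `np.linalg.pinv`. It also builds a second perfect fit to show that other interpolating solutions exist and have larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                        # far more parameters than data points
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||X w - y||^2, starting from zero.
w = np.zeros(p)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(2000):
    w -= step * (X.T @ (X @ w - y))

# Minimum-norm interpolating solution, computed directly.
w_min_norm = np.linalg.pinv(X) @ y

# Another perfect fit: add a direction that X cannot see (its null space).
v = rng.normal(size=p)
v -= np.linalg.pinv(X) @ (X @ v)      # strip the row-space component
w_other = w_min_norm + v

print("GD matches min-norm:   ", np.allclose(w, w_min_norm, atol=1e-8))
print("both fit the data:     ", np.allclose(X @ w, y), np.allclose(X @ w_other, y))
print("norms (GD, alternative):", np.linalg.norm(w), np.linalg.norm(w_other))
```

Starting from zero matters: the gradient always lies in the row space of `X`, so the iterates never pick up a null-space component like `v`.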

The resolution of the paradox. A billion-parameter network can represent monstrously complex functions — but gradient descent, left to its own devices, doesn't choose them. Among all the ways to fit the data, it gravitates toward simple, smooth ones. So the capacity is enormous but the realized complexity is modest. This is why "number of parameters" mismeasures complexity past the interpolation threshold, and why modern deep learning — vastly overparameterized, often trained with little explicit regularization — generalizes at all. The optimizer is a quiet regularizer.

This reframes the practical advice of the whole lesson. In the classical regime (modest models, scarce data), the bias-variance tradeoff rules: tune complexity to the sweet spot, regularize, don't go too big. In the modern regime (massive models, lots of data and compute), the lesson flips: go big, push past the interpolation peak into the second descent, and let implicit (and a little explicit) regularization keep the realized function smooth. Both are true; which applies depends on where you are on the curve. Knowing both regimes — and that the peak between them is a place to avoid — is what modern fluency looks like.

Common misconception: "Double descent means regularization is obsolete, so just make everything huge." Implicit regularization helps, but explicit regularization (weight decay, dropout, early stopping) still improves the overparameterized regime and, as the source figures show, can erase the interpolation peak entirely. More data still helps as well. The lesson isn't "size replaces care"; it's "size, plus the right regularization, plus enough data." Bigger is a tool, not a free lunch.
In the overparameterized regime, why does a huge model generalize well even though many of its possible solutions would overfit terribly?

Chapter 9: Connections & Cheat Sheet

Overfitting is no longer a mystery to you — it's a quantity you can decompose, diagnose, and control. And you've seen both the classical theory and the modern twist that the classical theory couldn't explain. That two-regime picture is genuinely current understanding, not just textbook history.

The whole lesson on one page

| Concept | What it means |
| --- | --- |
| Training vs test error | You minimize training error; you care about test error. Their difference is the generalization gap. |
| Underfitting | High training error: model too simple (high bias). |
| Overfitting | Low training error, high test error: model too complex (high variance). |
| Bias | Error of the best model in the family, even with infinite data. Fixed by more flexibility, NOT more data. |
| Variance | How much the fit changes across training samples. Fixed by more data or regularization. |
| Decomposition | Test error = bias² + variance + irreducible noise. |
| Tradeoff | Simpler → more bias, less variance. Complex → less bias, more variance. Sweet spot in the middle (classical U-curve). |
| Regularization | Penalize large weights → trade a little bias for much less variance. |
| Double descent | Test error peaks at the interpolation threshold (params ≈ data), then descends again beyond it. Overparameterized models can generalize. |
| Implicit regularization | Gradient descent prefers minimum-norm (smooth) solutions, so huge models stay effectively simple. |

Diagnosing a model in code

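```python
import numpy as np

def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def diagnose(train_err, val_err, acceptable, gap_threshold):
    """The practical bias-variance diagnosis: compare train vs validation error.

    `acceptable` and `gap_threshold` are tolerances you choose for your problem.
    """
    if train_err > acceptable:
        # HIGH BIAS (underfitting): model too simple.
        # → add features / complexity, train longer, reduce regularization
        return "high bias"
    if val_err - train_err > gap_threshold:
        # HIGH VARIANCE (overfitting): big generalization gap.
        # → more data, add regularization, simplify, OR go big (double descent)
        return "high variance"
    return "ok"

# Usage with any fitted model exposing predict(); the thresholds are placeholders:
# train_err = mse(model.predict(X_train), y_train)
# val_err   = mse(model.predict(X_val),   y_val)
# print(diagnose(train_err, val_err, acceptable=0.1, gap_threshold=0.05))

# A learning curve (error vs training-set size) tells bias from variance:
#  curves converge high  → high bias (more data won't help)
#  big persistent gap    → high variance (more data WILL help)
```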

Where to go next

You can now teach this. Test error splits into bias squared (too-simple error, unfixable by data), variance (instability across samples, fixed by data or regularization), and irreducible noise (the floor). Trading bias against variance gives the classical U-curve and its sweet spot. But the story continues beyond it: test error peaks at the interpolation threshold and then falls again, because gradient descent's implicit preference for smooth, minimum-norm solutions keeps overparameterized models effectively simple. The full anatomy of generalization, classical and modern.

"All models are wrong, but some are useful." — George Box. Bias is how wrong; variance is how unreliably wrong; and the art is making the unavoidable wrongness as useful as the noise floor allows.