Language Modeling from Scratch · CS336 · Lecture 9

Scaling Laws I: The Basics

You have 10,000 H100s for a month. Should you train a bigger model or feed it more data? The answer is neither obvious nor arbitrary — it is a predictable mathematical consequence of power-law scaling. This lesson derives the C = 6ND compute rule, the joint loss formula L(N,D), the Chinchilla compute-optimal allocation, and the famous D = 20N rule — with worked numbers throughout.

Prerequisites: CS336 Lec 2 (FLOPs, 6ND rule), basic logarithms. Chapter 5 takes one single-variable derivative; nothing else needs calculus.

Chapter 0: The Budget Problem

Imagine this: your friend hands you 10,000 H100 GPUs for a month and says "build the best open-source language model you can." Your infra team is ready, your dataset is assembled. The last question — the one that will determine whether you win or lose — is: which model do I train?

Specifically: given a fixed compute budget C, do you train a large model on fewer tokens, or a small model on more tokens? Intuition alone will not settle this. GPT-3 (175B params) was trained on 300B tokens, fewer than 2 tokens per parameter. Is that right? Too few? Too many? The answer from Chinchilla (2022) is that GPT-3 was massively undertrained. At GPT-3's budget of about 3×10²³ FLOPs, the Chinchilla rule of roughly 20 tokens per parameter points to a model of about 50B parameters trained on about 1T tokens. Chinchilla itself (70B parameters, 1.4T tokens) used roughly twice GPT-3's compute, matched the budget of the 280B-parameter Gopher, and outperformed both.

The core question of this lesson. Given a FLOPs budget C, what is the optimal number of parameters N* and training tokens D* to minimize validation loss? This is the compute-optimal question, and scaling laws are the tool that answers it.

Before we can answer it, we need to understand the empirical phenomenon that makes it answerable at all: scaling laws. These are clean, predictable power-law relationships between compute/data/parameters and model loss. They were not derived from first principles — they were discovered empirically, and they have held across orders of magnitude of scale.

Here is the key empirical finding, shown as a log-log plot. On log scales, loss decreases as a straight line with compute, parameters, or data. That straight line on a log-log plot is the signature of a power law.

Interactive figure. Loss vs Compute: the straight line on log-log axes. A power law L = A·C^(−β) is a straight line on log-log axes, and empirically β ≈ 0.05 for compute. Controls: exponent β (default 0.050), constant A (default 2.0).

You double the compute budget (2C). By roughly what factor does the loss improve if the exponent β = 0.05?

Loss drops to zero: double compute = perfect model.
Loss halves exactly (factor of 2 improvement).
Loss improves by factor 2^0.05 ≈ 1.035: only 3.5% better per doubling.
Loss stays constant: compute alone doesn’t help without more data.
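The arithmetic behind this quiz can be checked in a few lines. This is a minimal sketch that assumes the pure power law L = A·C^(−β) from this chapter, with no irreducible floor; the helper names are illustrative and not part of the lecture.

```python
def loss_ratio_per_doubling(beta: float) -> float:
    """Factor by which loss is multiplied when compute doubles, for L = A * C**(-beta)."""
    return 2.0 ** (-beta)


def compute_multiplier_for(loss_reduction: float, beta: float) -> float:
    """Compute multiplier k needed to cut loss by the given fraction (0.10 means 10%)."""
    # (1 - r) = k**(-beta)  =>  k = (1 - r)**(-1 / beta)
    return (1.0 - loss_reduction) ** (-1.0 / beta)


beta = 0.05
ratio = loss_ratio_per_doubling(beta)
needed = compute_multiplier_for(0.10, beta)
print(f"Loss after doubling compute: {ratio:.4f}x the original")
print(f"Compute multiplier for a 10% lower loss: {needed:.1f}x")
```

With β this small, each doubling buys only a few percent, so a 10% reduction in loss takes several doublings of compute.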

Chapter 1: Why Power Laws?

A power law is a relationship of the form y = A · x^−α. On a log-log plot, this becomes log(y) = log(A) − α·log(x), which is a straight line with slope −α. Any time you see a straight line on a log-log plot, you are looking at a power law.

Why should neural network loss follow a power law with data? Here is the cleanest intuition, drawn from statistics. Consider estimating the mean of a Gaussian: you observe x₁, x₂, ..., x_n ~ N(μ, σ²) and estimate μ̂ = (∑x_i) / n. The squared error is E[(μ̂ − μ)²] = σ² / n. That is a power law with exponent 1. On a log-log plot: log(Error) = −log(n) + 2·log(σ) — a straight line with slope −1.
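The 1/n rate is easy to confirm by simulation. The sketch below assumes σ = 2, a true mean of 0, and 500 repeated experiments per sample size; all of these are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
sample_sizes = np.array([10, 100, 1_000, 10_000])
n_trials = 500  # repeated experiments per sample size

mse = []
for n in sample_sizes:
    # Each row is one experiment: n draws from N(0, sigma^2), so the true mean is 0.
    draws = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n))
    mu_hat = draws.mean(axis=1)
    mse.append(np.mean(mu_hat ** 2))  # empirical E[(mu_hat - mu)^2]
mse = np.array(mse)

# On log-log axes the error should be a line with slope -1 and intercept 2*log10(sigma).
slope, intercept = np.polyfit(np.log10(sample_sizes), np.log10(mse), 1)
print(f"fitted slope     = {slope:.3f} (theory: -1)")
print(f"fitted intercept = {intercept:.3f} (theory: {2 * np.log10(sigma):.3f})")
```

The fitted slope lands near −1 up to Monte Carlo noise, which is the same straight-line signature the scaling-law plots show, only with a much steeper exponent.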

Why neural nets have smaller exponents. Classical estimation on a 1D problem gives slope −1. But neural networks learn in very high-dimensional spaces. For a nonparametric function f(x) in d dimensions, the error scales as n^(−1/d). Language has intrinsic dimension d ≫ 1, so the slope is much shallower. Empirically, the exponent for language model data scaling is around −0.095 (Kaplan 2020), consistent with an effective intrinsic dimensionality of roughly 10.

The same logic extends to model parameters. Larger models can represent more complex functions. As N grows, the approximation error from limited model capacity shrinks as a power law. And for compute, since C ≈ 6ND (as we derive in Ch 2), power laws in N and D induce a power law in C as well.

python — verifying power laws with numpy
import numpy as np

# Suppose we have (compute, loss) pairs from training runs
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss    = np.array([3.20, 2.95, 2.74, 2.55, 2.38])

# Take logs: power law becomes linear
log_C = np.log10(compute)
log_L = np.log10(loss)

# Fit a line: log(L) = log(A) - beta * log(C)
slope, intercept = np.polyfit(log_C, log_L, 1)
print(f"Exponent beta = {-slope:.3f}")  # ~0.032 for these illustrative numbers
print(f"Constant   A  = {10**intercept:.3f}")

# From Kaplan+ 2020: beta_compute ~ 0.050,  A ~ 2.3
# From Chinchilla:   beta_compute ~ 0.032  (different fitting methodology)

Kaplan 2020 empirical exponents. For large language models trained on the WebText corpus: L(N) = (8.8×10¹³ / N)^0.076; L(D) = (5.4×10¹³ / D)^0.095; L(C) ≈ A · C^−0.050. The parameter trend held from under 10³ to over 10⁹ non-embedding parameters: six orders of magnitude.

Why do neural networks have shallower scaling law exponents than classical 1/n statistics?

Neural networks are worse at using data than classical estimators.
Neural networks require regularization, which flattens the curve.
Language lives in high intrinsic dimension d, so the rate is n^(−1/d) rather than n⁻¹, giving a much shallower slope.
Neural networks converge to a non-zero irreducible loss regardless of data.

Chapter 2: Deriving C = 6ND

Before we can reason about scaling laws, we need to count FLOPs. The standard formula is C ≈ 6ND, where N is the number of parameters and D is the number of training tokens. Let's derive it from scratch, following the same accounting from CS336 Lec 2.

Consider one transformer with N parameters processing D total tokens. The key insight is that every parameter participates in a multiply-accumulate (MAC) operation for each token. A MAC counts as 2 FLOPs (one multiply, one add). So the forward pass costs roughly 2ND FLOPs.

Why the backward pass is 2× more expensive. Backpropagation computes two things: (1) the gradient of the loss with respect to each weight (∇_WL), and (2) the gradient with respect to the input activations (∇_xL, needed for the next layer back). Each of these requires a pass through the weight matrix, so each costs ~2ND FLOPs. Total backward: 4ND. Combined: forward 2ND + backward 4ND = 6ND.
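The 2 + 4 = 6 accounting can be made concrete for a single linear layer. This is a sketch for one matrix multiply y = x·W with no bias; the layer shape and token count are arbitrary illustrative values.

```python
def linear_layer_flops(tokens: int, d_in: int, d_out: int) -> dict:
    """FLOPs for y = x @ W, with x of shape (tokens, d_in) and W of shape (d_in, d_out)."""
    params = d_in * d_out
    forward = 2 * tokens * d_in * d_out       # y = x @ W: one multiply + one add per MAC
    grad_weights = 2 * tokens * d_in * d_out  # dL/dW = x.T @ dL/dy
    grad_inputs = 2 * tokens * d_in * d_out   # dL/dx = dL/dy @ W.T
    return {
        "params": params,
        "forward": forward,
        "backward": grad_weights + grad_inputs,
        "total": forward + grad_weights + grad_inputs,
    }


tokens = 1024
flops = linear_layer_flops(tokens=tokens, d_in=4096, d_out=4096)
per_param_per_token = flops["total"] / (flops["params"] * tokens)
print(per_param_per_token)  # 6.0
```

Summing this over every weight matrix in the network is what turns the per-layer count into C ≈ 6ND.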

Let's verify with a worked example. GPT-3 has N = 175 billion parameters and was trained on D = 300 billion tokens. C = 6 × 175×10⁹ × 300×10⁹ = 6 × 5.25×10²² = 3.15×10²³ FLOPs. OpenAI reported ~3.14×10²³ FLOPs — our formula is accurate to within 0.3%.

C ≈ 6 · N · D

The 6ND formula omits attention's O(T²) quadratic cost and embedding layer FLOPs. For large models at moderate context length (GPT-3 scale with T ≤ 4096, say), these corrections are a few percent of total FLOPs; they matter more for small models and long contexts. At context length T = 128K (long-context models), attention costs can become comparable to the 6ND term, but for the standard scaling law analysis 6ND is the right number to use.
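A rough size for the attention correction can be estimated as follows. The sketch assumes a per-token forward cost of 2N + 2·n_layer·n_ctx·d_model, an approximation in the style of Kaplan et al. 2020 rather than an exact count, and plugs in the published GPT-3 175B shape (96 layers, d_model = 12288). The function name is illustrative.

```python
def attention_overhead(n_params: float, n_layer: int, d_model: int, n_ctx: int) -> float:
    """Ratio of context-dependent attention FLOPs to parameter FLOPs, per token.

    Assumes forward cost per token ~ 2*N + 2*n_layer*n_ctx*d_model (an approximation).
    The same ratio applies to training, since backward scales both terms alike.
    """
    param_flops = 2 * n_params
    attn_flops = 2 * n_layer * n_ctx * d_model
    return attn_flops / param_flops


# GPT-3-like shape: 175B parameters, 96 layers, d_model = 12288.
overheads = {}
for n_ctx in (2_048, 4_096, 131_072):
    overheads[n_ctx] = attention_overhead(175e9, 96, 12_288, n_ctx)
    print(f"n_ctx = {n_ctx:>7}: attention adds about {100 * overheads[n_ctx]:.1f}% on top of 6ND")
```

Under this approximation the correction is a percent or two at GPT-3's context length and approaches the size of the 6ND term itself at 128K tokens.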

python — compute budget calculator
def compute_budget(N_params, D_tokens):
    """Estimate training FLOPs from 6ND rule."""
    return 6 * N_params * D_tokens

def tokens_from_budget(C_flops, N_params):
    """How many tokens fit in budget C given model size N?"""
    return C_flops / (6 * N_params)

# GPT-3 check
C_gpt3 = compute_budget(175e9, 300e9)
print(f"GPT-3 compute: {C_gpt3:.2e} FLOPs")   # 3.15e23

# Chinchilla (70B, 1.4T tokens)
C_chinchilla = compute_budget(70e9, 1.4e12)
print(f"Chinchilla:    {C_chinchilla:.2e} FLOPs")  # 5.88e23 ≈ 2× GPT-3

# Verification: same budget as GPT-3, optimal allocation
D_optimal = tokens_from_budget(C_gpt3, 70e9)
print(f"Optimal D for 70B, same C: {D_optimal:.2e}")  # ~7.5e11 ≈ 750B tokens

Why does the backward pass cost 4ND FLOPs rather than 2ND (the same as forward)?

Backprop uses higher-precision arithmetic, doubling the FLOP count.
Backprop must compute both ∇_WL (weight gradients) and ∇_xL (input gradients), each costing ~2ND, for 4ND total.
Backprop runs the forward pass twice: once for weights, once for activations.
The optimizer step (Adam) accounts for the extra 2ND.

Chapter 3: The Joint Loss Formula L(N, D)

Now we can write down the key formula. In the form used by Chinchilla (Hoffmann et al. 2022), which builds on the joint laws of Kaplan et al. 2020 and Rosenfeld et al. 2020, the validation loss of a transformer depends on both parameter count N and data tokens D according to:

L(N, D) = E + A/N^α + B/D^β

Each term has a clear meaning. E is the irreducible entropy — the minimum achievable loss from the Bayes-optimal predictor on this data distribution. It is a constant that no amount of compute can reduce. A/N^α is the model capacity term: with only N parameters, the model cannot perfectly fit the data, and this approximation error shrinks as a power law in N. B/D^β is the data term: with only D tokens, the model cannot see every pattern, and this estimation error shrinks as a power law in D.

The decomposition is additive. This is an empirical choice, not derived from theory. The Rosenfeld 2020 paper uses the same form. What's remarkable is how well it fits — the additive formula with just 5 constants (E, A, α, B, β) predicts losses across 5+ orders of magnitude in N and D.

From Chinchilla (Approach 3, the parametric fit): E ≈ 1.69, A ≈ 406.4, α ≈ 0.34, B ≈ 410.7, β ≈ 0.28. Kaplan 2020 fit a different joint form with different constants (and less careful LR scheduling), but the ingredients are the same: a power law in N and a power law in D.
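To see the three terms at work, here is a minimal sketch that evaluates the formula with the constants quoted above. The function name and the choice of example models are illustrative; the predictions are only as good as the fit and are not measured losses.

```python
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, alpha: float = 0.34,
                    B: float = 410.7, beta: float = 0.28) -> dict:
    """Evaluate L(N, D) = E + A / N**alpha + B / D**beta and return each term."""
    capacity = A / N ** alpha   # shrinks as the model grows
    data = B / D ** beta        # shrinks as the token count grows
    return {"E": E, "capacity": capacity, "data": data, "total": E + capacity + data}


examples = {"GPT-3 (175B, 300B tokens)": (175e9, 300e9),
            "Chinchilla (70B, 1.4T tokens)": (70e9, 1.4e12)}
results = {name: chinchilla_loss(N, D) for name, (N, D) in examples.items()}
for name, r in results.items():
    print(f"{name}: capacity={r['capacity']:.3f} data={r['data']:.3f} total={r['total']:.3f}")
```

For the GPT-3 shape the data term is several times larger than the capacity term, which is the formula's way of saying the model was short on tokens.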

Interactive figure. L(N,D) loss surface: trade N for D at fixed compute. With the compute budget fixed, a marker moves along the IsoFLOP curve C = 6ND as compute is split between parameters N and tokens D, showing the resulting loss. Controls: log₁₀(C) in FLOPs (default 10²²), fraction of C allocated to N (default 50%).

In L(N,D) = E + A/N^α + B/D^β, what does the term E represent?

The expected loss at initialization before any training.
The irreducible entropy: the minimum achievable loss that no model can beat, regardless of size or data.
The loss contributed by the embedding layers, which don’t scale like other parameters.
The early-stopping term that prevents overfitting.

Chapter 4: IsoFLOP Curves

Here is a powerful geometric way to visualize the compute-optimal question. An IsoFLOP curve is the set of all (N, D) pairs that share the same compute budget C = 6ND. On a log-log plot of N vs D, this is a straight line with slope −1.

Now imagine evaluating L(N, D) at every point along an IsoFLOP curve. You will get a U-shaped curve (on a linear plot of L vs N): too few parameters and the model has insufficient capacity; too many parameters and you've spent too much of your budget on N, leaving too few tokens D for training. There is a sweet spot in the middle — the compute-optimal allocation.

The Chinchilla Approach 2 recipe (IsoFLOP profiles). Pick five or ten compute budgets. For each budget C, run many training runs varying N (and setting D = C/(6N)). Fit a parabola to loss as a function of log N. Its minimum is N*(C). Repeat for all budgets and fit a power law through the minima to get N*(C). Chinchilla found N*(C) ∝ C^0.50 and D*(C) ∝ C^0.50 to within a couple of hundredths: both scale roughly as the square root of compute.
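The recipe above can be sketched end to end if the training runs are replaced by a stand-in. The code below is not the Chinchilla pipeline: it assumes the parametric loss L(N, D) with the Chapter 3 constants in place of real runs, and it centres each sweep on the 20-tokens-per-parameter guess. Both are assumptions made only so the example runs.

```python
import numpy as np

E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28  # stand-in for real training runs


def loss(N, D):
    return E + A / N ** ALPHA + B / D ** BETA


def isoflop_minimum(C: float, n_points: int = 25) -> float:
    """Sweep N at fixed C, fit a parabola to loss vs log10(N), return its vertex as N*."""
    N_center = np.sqrt(C / 120.0)                 # rough guess from D = 20N
    N = N_center * np.logspace(-1, 1, n_points)   # one decade either side
    D = C / (6.0 * N)                             # stay on the IsoFLOP curve
    c2, c1, _ = np.polyfit(np.log10(N), loss(N, D), 2)
    return 10 ** (-c1 / (2.0 * c2))               # vertex of the fitted parabola


budgets = np.array([1e19, 1e20, 1e21, 1e22])
N_opt = np.array([isoflop_minimum(C) for C in budgets])

# Fit N*(C) = k * C**a through the per-budget minima.
a, log_k = np.polyfit(np.log10(budgets), np.log10(N_opt), 1)
print(f"fitted exponent a = {a:.3f}")
```

Because the stand-in loss uses α = 0.34 and β = 0.28, the recovered exponent should sit near β/(α+β) ≈ 0.45 rather than exactly 0.5; with real runs the exponent is whatever the data gives.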

The Kaplan 2020 paper found a different exponent: N*(C) ∝ C^0.73 (more compute → much bigger model). A key difference was the LR schedule: Kaplan's runs used a schedule that was not matched to each run's length, so shorter runs were evaluated before their learning rate had decayed; Chinchilla set the cosine schedule length to match each run's token budget. With schedules matched to run length, model and data scale roughly equally with compute. Kaplan's bias toward larger models was largely a methodological artifact.

Interactive figure. IsoFLOP: loss vs model size along a fixed-compute curve. Each curve shows loss L(N, D) with D = C/(6N) for a fixed C; the minimum of each curve is the compute-optimal N*. Control: log₁₀(C) in FLOPs (default 10²²).

Why does the IsoFLOP curve have a minimum rather than monotonically decreasing with N?

Large models overfit on fewer tokens and the regularization penalty increases loss.
Larger models require more LR warm-up, wasting early training steps.
For fixed C, increasing N forces D = C/(6N) to shrink. At large N, the data starvation penalty B/D^β dominates and increases loss even as the capacity term A/N^α falls.
Transformer attention becomes a bottleneck at large N, causing loss to increase.

Chapter 5: Deriving the Compute-Optimal Allocation

We want to minimize L(N, D) subject to the constraint C = 6ND. This is a constrained optimization. Substitute D = C/(6N) into L:

L(N) = E + A/N^α + B/(C/(6N))^β

Take the derivative with respect to N, set it to zero. After algebra (which we'll spare — the key steps are: bring the β power through the denominator, differentiate, set dL/dN = 0), the compute-optimal N* satisfies:

N* = G · (C/6)^(β/(α+β))

where G is a constant derived from A, B, α, β. Since D* = C/(6N*), the data exponent is α/(α+β). With the Chinchilla parametric constants (α = 0.34, β = 0.28), these exponents are about 0.45 for N* and 0.55 for D*, close to the ≈0.5 that Chinchilla's IsoFLOP fits give for both. In round numbers: as compute doubles, optimal model size and optimal token count each grow by roughly √2 ≈ 1.41.

Misconception: bigger is always better. Many practitioners pre-Chinchilla thought: "I should train the biggest model I can on whatever data I have." This is wrong under compute-optimal logic. A 7B model trained on 200B tokens beats a 70B model trained on 20B tokens at the same compute budget. The smaller-trained-longer model wins because it uses its compute budget efficiently across both dimensions.
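The 7B-versus-70B comparison can be checked against the joint loss formula. This sketch assumes the Chinchilla parametric constants from Chapter 3; the numbers are predictions of that fit, not measured losses.

```python
import math


def predicted_loss(N: float, D: float,
                   E: float = 1.69, A: float = 406.4, alpha: float = 0.34,
                   B: float = 410.7, beta: float = 0.28) -> float:
    return E + A / N ** alpha + B / D ** beta


configs = {
    "7B on 200B tokens": (7e9, 200e9),
    "70B on 20B tokens": (70e9, 20e9),
}

flops = {name: 6 * N * D for name, (N, D) in configs.items()}
losses = {name: predicted_loss(N, D) for name, (N, D) in configs.items()}

for name in configs:
    print(f"{name}: C = {flops[name]:.2e} FLOPs, predicted loss = {losses[name]:.3f}")
```

Both configurations cost the same 6ND, but the 70B model's data term is so large at 20B tokens that its predicted loss comes out clearly higher.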

Here is the derivation in more detail. We have the first-order condition dL/dN = 0:

−α · A · N^(−α−1) + β · B · (6/C)^β · N^(β−1) = 0

Rearranging: αA / N^(α+1) = βB (6/C)^β / N^(1−β). This gives N^(α+β) = [(αA) / (βB)] · (C/6)^β, and so:

N* = [(αA) / (βB)]^(1/(α+β)) · (C/6)^(β/(α+β))
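For reference, the matching expressions for D* and for the ratio D*/N* follow by substituting this N* into D = C/(6N). The block below is a worked restatement in LaTeX, using only the symbols already defined in this chapter.

```latex
\begin{aligned}
N^* &= \left[\frac{\alpha A}{\beta B}\right]^{\frac{1}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \\
D^* &= \frac{C}{6N^*}
     = \left[\frac{\beta B}{\alpha A}\right]^{\frac{1}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \\
\frac{D^*}{N^*} &= \left[\frac{\beta B}{\alpha A}\right]^{\frac{2}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\alpha-\beta}{\alpha+\beta}}.
\end{aligned}
```

The two compute exponents sum to 1, as they must for C = 6ND to hold, and the ratio is independent of C only when α = β.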

python — compute-optimal allocator
def chinchilla_optimal(C_flops,
                        alpha=0.34, beta=0.28,
                        A=406.4, B=410.7):
    """Returns (N*, D*) for a given compute budget C (FLOPs)."""
    # From the first-order condition of L(N, D=C/(6N)):
    G_num = alpha * A
    G_den = beta  * B
    G = (G_num / G_den) ** (1 / (alpha + beta))
    exponent = beta / (alpha + beta)    # ~0.45 for Chinchilla constants
    N_star = G * (C_flops / 6) ** exponent
    D_star = C_flops / (6 * N_star)
    return N_star, D_star

# GPT-3 compute budget
C = 3.15e23
N_star, D_star = chinchilla_optimal(C)
print(f"N* = {N_star/1e9:.1f}B params")   # ~25B
print(f"D* = {D_star/1e12:.1f}T tokens")   # ~2.1T
print(f"D*/N* ratio = {D_star/N_star:.1f}")  # ~87 with these parametric constants (IsoFLOP fits give ~20)

With Chinchilla constants, how do N* and D* each scale with compute C?

N* scales as C^0.73 (fast), D* scales as C^0.27 (slow): big models dominate.
N* stays constant; only D* grows with compute: always train small models on more data.
Both N* and D* scale as roughly C^0.5: model and data should both grow with about the square root of compute.
N* and D* both scale linearly with C: doubling compute means double the model and double the data.

Chapter 6: The D = 20N Rule, Derived

From the first-order condition, we have N* and D* as functions of C. What is the ratio D*/N*? When both exponents are close to 0.5, as in Chinchilla's IsoFLOP fits, it is almost constant across compute budgets. The rule of thumb drawn from the Chinchilla paper is:

D* ≈ 20 · N*

Let's verify with a worked example. The Chinchilla model itself: 70B parameters, 1.4T tokens. D/N = 1.4×10¹² / 70×10⁹ = 20. Exactly. LLaMA 65B: trained on 1.4T tokens. D/N = 1.4T / 65B = 21.5. LLaMA-2 70B: 2T tokens on 70B params. D/N = 28.6 — slightly overtrained for inference efficiency (we'll discuss why below).

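Misconception: 20 tokens per parameter is a hard rule. It's not. The ratio depends on which fit you use and on the compute budget. Chinchilla's IsoFLOP fits give roughly 20 across the budgets they studied, while the closed form from Chapter 5 with the parametric constants gives a ratio that grows with compute, from about 40 at C = 10²⁰ to nearly 100 at C = 10²⁴. "20×" is a good rule of thumb, but what really matters is fitting and using a compute-optimal formula for your own setup, not treating D = 20N as gospel.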

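Where the ratio comes from. Dividing D* by N* in the closed form gives D*/N* ∝ (C/6)^((α−β)/(α+β)). If α = β, the ratio is exactly constant in C, with its value set by A and B. With α = 0.34 and β = 0.28, the capacity term decays slightly faster, so the optimum shifts toward slightly fewer parameters and more data as compute grows (N* ∝ C^0.45, D* ∝ C^0.55). The ≈20 figure itself comes from Chinchilla's empirical IsoFLOP fits, where both exponents are close to 0.5.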
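How much the ratio moves with compute depends on the constants you plug in. The sketch below evaluates D*/N* from the Chapter 5 closed form with the parametric constants (α = 0.34, β = 0.28, A = 406.4, B = 410.7). Treat the output as a property of that particular fit, not as a correction to the 20× rule of thumb.

```python
def optimal_ratio(C: float, alpha: float = 0.34, beta: float = 0.28,
                  A: float = 406.4, B: float = 410.7) -> float:
    """D*/N* from the first-order condition of L(N, D = C/(6N))."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_star = G * (C / 6.0) ** (beta / (alpha + beta))
    D_star = C / (6.0 * N_star)
    return D_star / N_star


ratios = {exp: optimal_ratio(10.0 ** exp) for exp in range(20, 25)}
for exp, r in ratios.items():
    print(f"C = 1e{exp}: D*/N* = {r:.0f}")
```

Because α > β, the ratio climbs slowly with compute. Setting alpha equal to beta in the call makes it flat, which is the regime in which a single tokens-per-parameter number is exact.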

Model	N (params)	D (tokens)	D/N	vs Chinchilla
GPT-3	175B	300B	1.7	Massively undertrained
Chinchilla	70B	1.4T	20	Optimal
LLaMA 65B	65B	1.4T	21.5	Near-optimal
LLaMA 2 70B	70B	2.0T	28.6	Intentionally overtrained
LLaMA 3 70B	70B	15T	214	Hugely overtrained (inference)

LLaMA 3's 70B model trained on 15T tokens is overtrained by 10× relative to compute-optimal. This is intentional: once deployed, inference runs millions of times. A smaller, cheaper model (even if it required extra training compute) saves far more inference FLOPs than the extra training cost. Compute-optimal for training ≠ compute-optimal for deployment.
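The training-versus-deployment trade-off can be put in numbers. This is a rough sketch under stated assumptions: quality is judged by the parametric loss L(N, D) from Chapter 3, inference costs about 2N FLOPs per token (the forward-pass count from Chapter 2), and the 70B and 10B model sizes are chosen only for illustration.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28


def loss(N: float, D: float) -> float:
    return E + A / N ** ALPHA + B / D ** BETA


def tokens_to_match(N_small: float, target_loss: float) -> float:
    """Tokens a smaller model needs to reach target_loss under the parametric fit."""
    data_term = target_loss - E - A / N_small ** ALPHA
    if data_term <= 0:
        raise ValueError("model too small to reach this loss at any token count")
    return (B / data_term) ** (1.0 / BETA)


N_big, D_big = 70e9, 1.4e12   # reference model
N_small = 10e9                # hypothetical smaller model
D_small = tokens_to_match(N_small, loss(N_big, D_big))

extra_training = 6 * N_small * D_small - 6 * N_big * D_big
saving_per_token = 2 * (N_big - N_small)      # inference FLOPs saved per served token
break_even_tokens = extra_training / saving_per_token

print(f"10B model needs {D_small:.2e} training tokens to match the 70B model")
print(f"extra training cost: {extra_training:.2e} FLOPs")
print(f"break-even after {break_even_tokens:.2e} inference tokens")
```

Past the break-even point every additional served token favours the smaller model, which is the logic behind training far beyond 20 tokens per parameter.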

A new model has 1.4B parameters. According to the Chinchilla D = 20N rule, how many tokens should it be trained on?

1.4 billion tokens (1 token per parameter).
7 billion tokens (5 tokens per parameter, like older GPT-2 era models).
28 billion tokens (20 × 1.4B = 28B tokens).
1.4 trillion tokens (1000 tokens per parameter, like LLaMA 3).

Chapter 7: Showcase: Compute-Optimal Allocator

This is the payoff. The simulator below implements the full Chinchilla compute-optimal framework. Set a compute budget (in log₁₀ FLOPs), and it shows you: (a) the IsoFLOP curve in (N, D) space, (b) the optimal N* and D* via the first-order condition, (c) the predicted final loss L(N*, D*), and (d) what happens if you deviate from optimality by going bigger or smaller.

Interactive figure. Chinchilla Compute-Optimal Allocator, full simulation. For a chosen compute budget it shows the optimal N* and D*, and a U-curve of how loss degrades as the model size moves away from N*. Controls: log₁₀(C) FLOPs budget (default 10²²), model size N as a percentage of N* (default 100%).

Key insight from this simulation. Near the optimum the U-curve is shallow. Under the parametric fit at C = 10²³, moving N by 10× in either direction raises the predicted loss by only about 0.08 to 0.09 (roughly 4%), with the 10×-smaller model marginally worse. A model with 10N* and D*/10 is starved of data, so the B/D^β term dominates. A model with N*/10 and 10D* has seen 10× more tokens than optimal but is limited by capacity, so the A/N^α term dominates. Because the loss penalty is similar on both sides while inference cost scales with N, the practical takeaway is: if you must err, err toward smaller models trained longer.

For a compute budget of C = 10²³ FLOPs, approximately what are the compute-optimal N* and D*?

N* ≈ 175B params, D* ≈ 300B tokens (same as GPT-3).
N* ≈ 15–30B params, D* ≈ 0.6–1.1T tokens (the range spanned by the parametric closed form and the D = 20N rule).
N* ≈ 1T params, D* ≈ 100B tokens (massive model, little data).
N* ≈ 7B params, D* ≈ 140B tokens (same as a small open-source model).

Chapter 8: Hyperparameters & Practical Use

Beyond the N vs D tradeoff, scaling laws let you answer hyperparameter questions at small scale before paying big-scale costs. The scaling law design procedure has three steps: (1) train several small models with different hyperparameters, (2) fit a scaling law for each configuration, (3) extrapolate to the target scale and pick the winning hyperparameter.

Scaling law as a hyperparameter oracle. Want to know if Transformers beat LSTMs at 70B scale without spending $10M training both? Train 10M and 100M parameter versions of each, fit their scaling laws, extrapolate the lines to 70B. The crossing point (if any) tells you which is better at scale, and you've answered the question for <$10K.
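The extrapolate-and-compare step might look like the following. Everything numeric here is made up for illustration: the two "architectures" are synthetic power laws with hypothetical constants, and the helper names are not from the lecture.

```python
import numpy as np


def fit_power_law(n_params, losses):
    """Least-squares fit of log10(L) = log10(A) - alpha * log10(N). Returns (A, alpha)."""
    slope, intercept = np.polyfit(np.log10(n_params), np.log10(losses), 1)
    return 10 ** intercept, -slope


def predict(A: float, alpha: float, n_params: float) -> float:
    return A * n_params ** (-alpha)


# Hypothetical small-scale results for two candidate architectures.
sizes = np.array([1e7, 3e7, 1e8, 3e8])
loss_arch_a = 14.0 * sizes ** -0.070   # worse at small scale, steeper slope
loss_arch_b = 9.0 * sizes ** -0.050    # better at small scale, shallower slope

A_a, alpha_a = fit_power_law(sizes, loss_arch_a)
A_b, alpha_b = fit_power_law(sizes, loss_arch_b)

target = 70e9
pred_a, pred_b = predict(A_a, alpha_a, target), predict(A_b, alpha_b, target)

# The fitted lines cross where A_a * N**-alpha_a == A_b * N**-alpha_b.
crossover = (A_a / A_b) ** (1.0 / (alpha_a - alpha_b))

print(f"predicted loss at 70B: arch A = {pred_a:.3f}, arch B = {pred_b:.3f}")
print(f"lines cross near N = {crossover:.2e} parameters")
```

In this synthetic case architecture B looks better at every size that was actually trained, yet the extrapolation picks A at 70B. That reversal is the case where fitting slopes pays off, and also the case where extrapolation error matters most.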

The Kaplan 2020 paper did exactly this for several hyperparameter questions. Key findings:

Transformers vs LSTMs: Transformers have better scaling exponents — their loss curve has a steeper slope on log-log plots. At small scale they are comparable; at 10B+ params, Transformers win decisively.
Adam vs SGD: Nearly identical scaling laws. Optimizer choice has little effect on the asymptotic scaling behavior (though it can shift the constant A).
Depth vs width: Width and depth are largely interchangeable above a 2-layer minimum. Adding layers past 6 has diminishing returns. Most of scaling comes from N, not depth specifically.
Batch size: There is a critical batch size B_crit, set roughly by the gradient noise scale. Below B_crit, increasing the batch size cuts the number of steps almost proportionally; above it, extra examples per step give diminishing returns. Crucially, B_crit grows as the loss decreases (Kaplan fit B_crit(L) ≈ B*/L^(1/α_B)), so bigger models at lower loss can use larger batches.
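The batch-size finding can be turned into a quick estimate. The sketch assumes the power-law form B_crit(L) = B*/L^(1/α_B) with B* ≈ 2×10⁸ tokens and α_B ≈ 0.21, values quoted from memory of Kaplan et al. 2020; treat them as assumptions and refit them for your own setup.

```python
def critical_batch_size(loss: float, b_star: float = 2e8, alpha_b: float = 0.21) -> float:
    """Kaplan-style estimate of the critical batch size in tokens at a given loss."""
    return b_star / loss ** (1.0 / alpha_b)


estimates = {L: critical_batch_size(L) for L in (4.0, 3.0, 2.5, 2.0)}
for L, b in estimates.items():
    print(f"loss = {L:.1f}: B_crit ~ {b:.2e} tokens")
```

The estimate rises steeply as loss falls, which is why large runs late in training can use batches of millions of tokens while small models cannot.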

Interactive figure. Extrapolation: fit small runs, predict the big run. A power law is fit to small-scale runs (dots) and the line is extrapolated to predict loss at large scale. Controls: measurement noise (default 0.010), true exponent β (default 0.050).

Important limitation: scaling laws can be wrong. The Kaplan paper predicted "bigger models scale better than longer training." Chinchilla showed this was a methodological artifact of early LR stopping. Always: (1) validate your fitting methodology on held-out small runs before extrapolating, (2) use multiple fitting methods, (3) treat the constant as having large uncertainty. The exponent is usually stable; the constant (offset) is not.
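One way to act on this advice is to hold out the largest small runs and bootstrap the fit. The sketch below uses synthetic data with an assumed true exponent of 0.05 and log-space noise of 0.01; the held-out split and the 10²⁴ FLOPs target are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_A, true_beta, noise = 12.0, 0.05, 0.01   # noise = std of the log10(loss) error

# Synthetic "runs": nine budgets from 1e17 to 1e21 FLOPs.
log_C = np.linspace(17, 21, 9)
log_L = np.log10(true_A) - true_beta * log_C + rng.normal(0.0, noise, log_C.size)

# Fit on the six smallest runs, hold out the three largest.
fit_x, fit_y = log_C[:6], log_L[:6]
hold_x, hold_y = log_C[6:], log_L[6:]
slope, intercept = np.polyfit(fit_x, fit_y, 1)
holdout_error = np.abs(slope * hold_x + intercept - hold_y)   # in log10(loss) units

# Bootstrap the fitted runs to see how uncertain the extrapolation is.
target_log_C = 24.0
preds = []
for _ in range(1000):
    idx = rng.integers(0, fit_x.size, fit_x.size)
    if np.unique(idx).size < 2:
        continue  # a line needs at least two distinct budgets
    s, b = np.polyfit(fit_x[idx], fit_y[idx], 1)
    preds.append(10 ** (s * target_log_C + b))
lo, hi = np.percentile(preds, [2.5, 97.5])

print(f"fitted exponent: {-slope:.3f}")
print(f"max held-out error (log10 loss): {holdout_error.max():.3f}")
print(f"95% bootstrap interval for loss at 1e24 FLOPs: [{lo:.3f}, {hi:.3f}]")
```

If the held-out error is much larger than the fit residuals, or the bootstrap interval at the target is too wide to separate the options you care about, more small runs are needed before trusting the extrapolation.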

What is the practical scaling law design procedure for hyperparameter selection?

Train at full scale with grid search over hyperparameters; pick the best.
Use theoretical derivations (like the MuP transfer) to set hyperparameters without any training runs.
Train small models under each hyperparameter setting, fit a scaling law for each, extrapolate to target scale, pick the winner.
Copy hyperparameters from the best published model at a similar scale; scaling doesn’t affect the optimal choice.

Chapter 9: Connections & Cheat Sheet

Scaling laws sit at the heart of modern LLM development. Here is the conceptual map of how this lesson connects to the rest of the course and to the broader ML landscape.

Concept	Formula/Rule	Where Used
Compute budget	C ≈ 6ND	Every training decision
Data scaling law	L(D) = E + B/D^β	Dataset size planning
Model scaling law	L(N) = E + A/N^α	Architecture search
Joint law	L(N,D) = E + A/N^α + B/D^β	Compute-optimal allocation
Chinchilla rule	D* ≈ 20 · N*	Training token budgeting
Inference override	D/N up to 200+	Deployed small models
Hyperparameter transfer	Fit at small scale, extrapolate	Architecture, optimizer choices

Cheat sheet: given a compute budget C (FLOPs).
1. Compute-optimal model size: N* ≈ √(C/120), which follows from C = 6ND with D = 20N.
2. Token budget: D* = C/(6N*) ≈ 20N*.
3. If inference matters: reduce N* by 2–10× and increase D* by the same factor, which keeps C fixed.
4. Predict final loss: L = E + A/N*^α + B/D*^β.
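The cheat sheet fits in one small function. This is a sketch of the rule of thumb only: it assumes C = 6ND and a fixed tokens-per-parameter ratio, and the `shrink` argument is an illustrative name for the inference-driven adjustment in step 3.

```python
import math


def rule_of_thumb_allocation(C: float, tokens_per_param: float = 20.0,
                             shrink: float = 1.0) -> tuple[float, float]:
    """Split C FLOPs into (N, D) using C = 6*N*D and D = tokens_per_param * N.

    shrink > 1 divides N by that factor and multiplies D by it, keeping C fixed.
    """
    N = math.sqrt(C / (6.0 * tokens_per_param)) / shrink
    D = C / (6.0 * N)
    return N, D


C = 5.88e23  # the budget of a 70B model on 1.4T tokens
N_opt, D_opt = rule_of_thumb_allocation(C)
N_inf, D_inf = rule_of_thumb_allocation(C, shrink=4.0)

print(f"rule of thumb:       N = {N_opt / 1e9:.0f}B, D = {D_opt / 1e12:.2f}T")
print(f"4x smaller, same C:  N = {N_inf / 1e9:.1f}B, D = {D_inf / 1e12:.2f}T")
```

At this budget the default call recovers the Chinchilla shape of about 70B parameters and 1.4T tokens.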

What this lesson did not cover: (a) scaling laws for fine-tuning (the data/task interaction is different), (b) scaling laws for chain-of-thought reasoning (emerges non-smoothly), (c) scaling laws for inference-time compute (Snell+ 2024), (d) how data quality shifts the constant A in the loss formula, and (e) architectural variants like Mixture of Experts (which change the N→FLOPs relationship). These are covered in CS336 Lec 10 (inference), Lec 11 (scaling details), and the MoE lesson.

Related CS336 Lessons

Lec 8: Pipeline & FSDP — How to actually run these huge models
Lec 2: Resource Accounting — C = 6ND derivation in detail
Lec 4: MoE — When N and active-N decouple

Broader Connections

Bias-Variance Tradeoff — The theoretical underpinning of the E + A/N + B/D form
Hestness+ 2017 — Earliest systematic neural scaling study
Kaplan+ 2020 — The paper that started modern scaling law practice
Hoffmann+ 2022 (Chinchilla) — The compute-optimal correction

"The exciting thing about scaling laws is not just that they tell you what to do — it's that they tell you what you cannot know until you've done the experiment at small scale." — Tatsu Hashimoto, CS336 Lec 9, 2025