Language Modeling from Scratch · CS336 · Lecture 9

Scaling Laws I: The Basics

You have 10,000 H100s for a month. Should you train a bigger model or feed it more data? The answer is neither obvious nor arbitrary — it is a predictable mathematical consequence of power-law scaling. This lesson derives the C = 6ND compute rule, the joint loss formula L(N,D), the Chinchilla compute-optimal allocation, and the famous D = 20N rule — with worked numbers throughout.

Prerequisites: CS336 Lec 2 (FLOPs, 6ND rule), basic logarithms. No calculus required.

Chapter 0: The Budget Problem

Imagine this: your friend hands you 10,000 H100 GPUs for a month and says "build the best open-source language model you can." Your infra team is ready, your dataset is assembled. The last question — the one that will determine whether you win or lose — is: which model do I train?

Specifically: given a fixed compute budget C, do you train a large model on fewer tokens, or a small model on more tokens? This is not a question you can answer by intuition. GPT-3 (175B params) was trained on 300B tokens, about 1.7 tokens per parameter. Is that right? Too few? Too many? The answer from Chinchilla (Hoffmann et al., 2022) is: GPT-3 was massively undertrained. Chinchilla itself, a 70B-parameter model trained on 1.4T tokens, used the same compute as the 280B-parameter Gopher and outperformed both Gopher and GPT-3. Under the D ≈ 20N rule derived later, GPT-3's own budget of about 3.15×10^23 FLOPs would have been better spent on roughly 50B parameters and 1T tokens.

The core question of this lesson. Given a FLOPs budget C, what is the optimal number of parameters N* and training tokens D* to minimize validation loss? This is the compute-optimal question, and scaling laws are the tool that answers it.

Before we can answer it, we need to understand the empirical phenomenon that makes it answerable at all: scaling laws. These are clean, predictable power-law relationships between compute/data/parameters and model loss. They were not derived from first principles — they were discovered empirically, and they have held across orders of magnitude of scale.

Here is the key empirical finding, shown as a log-log plot. On log scales, loss decreases as a straight line with compute, parameters, or data. That straight line on a log-log plot is the signature of a power law.

[Interactive figure: Loss vs Compute, the straight line on log-log axes. A power law L = A·C^(−β) is a straight line on log-log axes, and empirically β ≈ 0.05 for compute. Sliders: exponent β (default 0.050), constant A (default 2.0).]
You double the compute budget (2C). By roughly what factor does the loss improve if the exponent β = 0.05?
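The arithmetic behind this question can be sketched in a few lines. The helper name `loss_ratio` is hypothetical, and the sketch assumes the pure power law L = A·C^(−β) from the plot, with no irreducible offset.

```python
def loss_ratio(compute_factor: float, beta: float = 0.05) -> float:
    """Ratio L(k*C) / L(C) for a pure power law L = A * C**(-beta)."""
    return compute_factor ** (-beta)

ratio_2x = loss_ratio(2.0)
ratio_10x = loss_ratio(10.0)
halving_factor = 2.0 ** (1 / 0.05)   # compute multiplier needed to halve the loss

print(f"2x compute  -> loss multiplied by {ratio_2x:.4f}")
print(f"10x compute -> loss multiplied by {ratio_10x:.4f}")
print(f"compute multiplier to halve the loss: {halving_factor:.3g}")
```

Under this assumption, doubling compute lowers the loss by only about 3.4%, and halving the loss takes about a million times more compute. That is why small exponents make the allocation question so important.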

Chapter 1: Why Power Laws?

A power law is a relationship of the form y = A · x^(−α). On a log-log plot, this becomes log(y) = log(A) − α·log(x), which is a straight line with slope −α. Any time you see a straight line on a log-log plot, you are looking at a power law.

Why should neural network loss follow a power law with data? Here is the cleanest intuition, drawn from statistics. Consider estimating the mean of a Gaussian: you observe x_1, x_2, ..., x_n ~ N(μ, σ^2) and estimate μ̂ = (∑ x_i) / n. The squared error is E[(μ̂ − μ)^2] = σ^2 / n. That is a power law with exponent 1. On a log-log plot: log(Error) = −log(n) + 2·log(σ), a straight line with slope −1.
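The σ^2 / n claim is easy to check by simulation. The following is a minimal sketch; the values of `mu`, `sigma`, `trials`, and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 1.5, 2.0, 4000
sample_sizes = np.array([10, 30, 100, 300, 1000])

mse = []
for n in sample_sizes:
    # `trials` independent datasets, each with n Gaussian observations
    samples = rng.normal(mu, sigma, size=(trials, n))
    mu_hat = samples.mean(axis=1)
    mse.append(np.mean((mu_hat - mu) ** 2))
mse = np.array(mse)

# Straight-line fit in log-log space: log10(mse) = intercept + slope * log10(n)
slope, intercept = np.polyfit(np.log10(sample_sizes), np.log10(mse), 1)
print(f"fitted slope   = {slope:.3f} (theory: -1)")
print(f"fitted sigma^2 = {10**intercept:.2f} (true value: {sigma**2})")
```

The fitted slope should land close to −1 and the fitted constant close to σ^2, up to Monte Carlo noise.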

Why neural nets have smaller exponents. Classical estimation on a 1D problem gives slope −1. But neural networks learn in very high-dimensional spaces. For a nonparametric function f(x) in d dimensions, the error scales as n^(−1/d). Language has intrinsic dimension d ≫ 1, so the slope is much shallower. Empirically, the log-log slope for language model data scaling is around −0.095 (Kaplan 2020), consistent with an effective intrinsic dimensionality of roughly 10.
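To make the n^(−1/d) reading concrete, here is a small sketch. Both helper names are hypothetical, and the mapping from exponent to dimension is the heuristic described above, not a measured property of language.

```python
def implied_dimension(exponent: float) -> float:
    """Effective dimension d such that error ~ n**(-1/d) has the given exponent."""
    return 1.0 / exponent

def data_factor_to_halve_error(d: float) -> float:
    """Multiplier on n needed to halve an error that scales as n**(-1/d)."""
    return 2.0 ** d

d_eff = implied_dimension(0.095)
print(f"effective dimension ~ {d_eff:.1f}")
print(f"data multiplier to halve the error: d=1 -> {data_factor_to_halve_error(1):.0f}x, "
      f"d={d_eff:.1f} -> {data_factor_to_halve_error(d_eff):.0f}x")
```

In one dimension, twice the data halves the error. At an effective dimension near 10, the same improvement takes over a thousand times more data.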

The same logic extends to model parameters. Larger models can represent more complex functions. As N grows, the approximation error caused by limited model capacity shrinks as a power law. And for compute, since C ≈ 6ND (as we derive in Ch 2), any power law in N or D induces a power law in C as well.

python — verifying power laws with numpy
import numpy as np

# Suppose we have (compute, loss) pairs from training runs
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss    = np.array([3.20, 2.95, 2.74, 2.55, 2.38])

# Take logs: power law becomes linear
log_C = np.log10(compute)
log_L = np.log10(loss)

# Fit a line: log(L) = log(A) - beta * log(C)
slope, intercept = np.polyfit(log_C, log_L, 1)
print(f"Exponent beta = {-slope:.3f}")  # ~0.032 for these illustrative numbers
print(f"Constant   A  = {10**intercept:.3f}")

# From Kaplan+ 2020: beta_compute ~ 0.050,  A ~ 2.3
# From Chinchilla:   beta_compute ~ 0.032  (different fitting methodology)
Kaplan 2020 empirical exponents. For large language models trained on the WebText2 corpus: L(N) = (8.8×10^13 / N)^0.076; L(D) = (5.4×10^13 / D)^0.095; L(C) ≈ A · C^(−0.050). Kaplan et al. report these trends for models from 768 to 1.5 billion non-embedding parameters, about six orders of magnitude.
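The single-variable laws can be evaluated directly. This sketch uses the constants quoted above; the function names are hypothetical. It assumes N counts non-embedding parameters and that each law is applied where the other resource is not the bottleneck.

```python
N_C, ALPHA_N = 8.8e13, 0.076   # constants quoted above for L(N)
D_C, ALPHA_D = 5.4e13, 0.095   # constants quoted above for L(D)

def kaplan_loss_from_params(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

def kaplan_loss_from_tokens(n_tokens: float) -> float:
    return (D_C / n_tokens) ** ALPHA_D

def scale_factor_for_loss_drop(fraction: float, exponent: float) -> float:
    """How much N (or D) must grow to cut loss by `fraction` under a pure power law."""
    return (1.0 - fraction) ** (-1.0 / exponent)

print(f"L(N=1e9)  = {kaplan_loss_from_params(1e9):.2f}")
print(f"L(D=1e10) = {kaplan_loss_from_tokens(1e10):.2f}")
print(f"10% lower loss needs {scale_factor_for_loss_drop(0.10, ALPHA_N):.1f}x params")
print(f"10% lower loss needs {scale_factor_for_loss_drop(0.10, ALPHA_D):.1f}x tokens")
```

With these exponents, a 10% loss reduction costs about 4× the parameters or about 3× the tokens.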
Why do neural networks have shallower scaling law exponents than classical 1/n statistics?

Chapter 2: Deriving C = 6ND

Before we can reason about scaling laws, we need to count FLOPs. The standard formula is C ≈ 6ND, where N is the number of parameters and D is the number of training tokens. Let's derive it from scratch, following the same accounting from CS336 Lec 2.

Consider one transformer with N parameters processing D total tokens. The key insight is that every parameter participates in a multiply-accumulate (MAC) operation for each token. A MAC counts as 2 FLOPs (one multiply, one add). So the forward pass costs roughly 2ND FLOPs.

Why the backward pass costs twice the forward pass. Backpropagation computes two things per layer: (1) the gradient of the loss with respect to the weights (∇_W L), and (2) the gradient with respect to the input activations (∇_x L, which is passed back to the preceding layer). Each is a matrix multiplication of the same size as the forward one, so each costs ~2ND FLOPs. Total backward: 4ND. Combined: forward 2ND + backward 4ND = 6ND.
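The accounting can be checked on a single linear layer. This is a sketch for one weight matrix only; it ignores biases, nonlinearities, and attention, and the function name is hypothetical.

```python
def linear_layer_flops(tokens: int, d_in: int, d_out: int) -> dict:
    """FLOPs for y = x @ W with x: (tokens, d_in), W: (d_in, d_out)."""
    matmul = 2 * tokens * d_in * d_out   # one multiply + one add per (token, weight) pair
    return {
        "forward": matmul,        # y  = x @ W
        "grad_weights": matmul,   # dW = x.T @ dy
        "grad_inputs": matmul,    # dx = dy @ W.T
    }

tokens, d_in, d_out = 1024, 512, 2048
flops = linear_layer_flops(tokens, d_in, d_out)
n_params = d_in * d_out
total = sum(flops.values())

print(f"forward : {flops['forward']:.3e}")
print(f"backward: {flops['grad_weights'] + flops['grad_inputs']:.3e}")
print(f"total   : {total:.3e}  vs  6*N*D = {6 * n_params * tokens:.3e}")
```

Summing this over every weight matrix in the model gives the 6ND rule, with N the total parameter count and D the number of tokens.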

Let's verify with a worked example. GPT-3 has N = 175 billion parameters and was trained on D = 300 billion tokens. C = 6 × (175×10^9) × (300×10^9) = 6 × 5.25×10^22 = 3.15×10^23 FLOPs. OpenAI reported ~3.14×10^23 FLOPs, so our formula is accurate to within 0.3%.

C ≈ 6 · N · D

The 6ND formula omits attention's O(T^2) cost and embedding-layer FLOPs. In Kaplan's accounting, the context-dependent term is about T/(12·d_model) of the parameter FLOPs. That is a few percent when d_model is in the thousands and the context length is T ≤ 4096 (about 4% at d_model = 8192, T = 4096), and more for small models. For long-context models (T = 128K), attention costs become significant, but for standard scaling law analysis, 6ND is the right number to use.
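A rough size for the attention correction can be sketched as follows. The accounting is an assumption borrowed from Kaplan et al. 2020: forward FLOPs per token ≈ 2N + 2·n_layer·n_ctx·d_model, with N ≈ 12·n_layer·d_model^2. The function name and the example shapes are illustrative.

```python
def attention_overhead(n_ctx: int, d_model: int) -> float:
    """Context-dependent FLOPs as a fraction of parameter FLOPs.

    Assumes forward FLOPs per token ~ 2N + 2 * n_layer * n_ctx * d_model
    with N ~ 12 * n_layer * d_model**2, so n_layer cancels out.
    """
    return n_ctx / (12 * d_model)

for n_ctx, d_model in [(2048, 12288), (4096, 4096), (4096, 8192), (131072, 8192)]:
    overhead = attention_overhead(n_ctx, d_model)
    print(f"T={n_ctx:>6}, d_model={d_model:>5}: attention adds {overhead:.1%} on top of 6ND")
```

Under this accounting the correction is around 1% for a GPT-3-shaped model at T = 2048. It exceeds 100% at T = 128K for d_model = 8192, which is where 6ND stops being a good estimate.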

python — compute budget calculator
def compute_budget(N_params, D_tokens):
    """Estimate training FLOPs from 6ND rule."""
    return 6 * N_params * D_tokens

def tokens_from_budget(C_flops, N_params):
    """How many tokens fit in budget C given model size N?"""
    return C_flops / (6 * N_params)

# GPT-3 check
C_gpt3 = compute_budget(175e9, 300e9)
print(f"GPT-3 compute: {C_gpt3:.2e} FLOPs")   # 3.15e23

# Chinchilla (70B, 1.4T tokens)
C_chinchilla = compute_budget(70e9, 1.4e12)
print(f"Chinchilla:    {C_chinchilla:.2e} FLOPs")  # 5.88e23 ≈ 2× GPT-3

# Tokens a 70B model could see on GPT-3's budget (a fixed-N check, not the optimal split)
D_same_budget = tokens_from_budget(C_gpt3, 70e9)
print(f"D for 70B at GPT-3's C: {D_same_budget:.2e}")  # 7.50e+11 = 750B tokens
Why does the backward pass cost 4ND FLOPs rather than 2ND (the same as forward)?

Chapter 3: The Joint Loss Formula L(N, D)

Now we can write down the key formula. In the parametric form fitted by Hoffmann et al. 2022 (Chinchilla), which builds on Kaplan et al. 2020, the validation loss of a transformer depends on both parameter count N and data tokens D according to:

L(N, D) = E + A/N^α + B/D^β

Each term has a clear meaning. E is the irreducible entropy: the minimum achievable loss from the Bayes-optimal predictor on this data distribution. It is a constant that no amount of compute can reduce. A/N^α is the model capacity term: with only N parameters, the model cannot perfectly fit the data, and this approximation error shrinks as a power law in N. B/D^β is the data term: with only D tokens, the model cannot see every pattern, and this estimation error shrinks as a power law in D.

The decomposition is additive. This is an empirical choice, not derived from theory. The Rosenfeld 2020 paper uses the same form. What's remarkable is how well it fits — the additive formula with just 5 constants (E, A, α, B, β) predicts losses across 5+ orders of magnitude in N and D.

From Chinchilla (Approach 3, the parametric fit): E ≈ 1.69, A ≈ 406.4, α ≈ 0.34, B ≈ 410.7, β ≈ 0.28. The Kaplan 2020 constants are slightly different (they used less careful LR scheduling), but the form is the same.
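Plugging the constants into the formula gives a feel for the size of each term. This is a minimal sketch; the function name is hypothetical, and the predicted losses come from the fitted formula, not from measured validation losses of these models.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28   # constants quoted above

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

for name, n, d in [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    capacity_term = A / n**ALPHA
    data_term = B / d**BETA
    print(f"{name:<10} capacity={capacity_term:.3f}  data={data_term:.3f}  "
          f"L={chinchilla_loss(n, d):.3f}")
```

For the GPT-3 shape the data term is about four times the capacity term. In the formula's language, that is what "undertrained" means.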

[Interactive figure: L(N,D) loss surface, trading N for D at fixed compute. With the compute budget fixed, a red dot moves along the IsoFLOP curve C = 6ND as compute is split between parameters N and tokens D. Sliders: log10(C) in FLOPs (default 10^22), fraction of C given to N with the rest to D (default 50%).]
In L(N,D) = E + A/N^α + B/D^β, what does the term E represent?

Chapter 4: IsoFLOP Curves

Here is a powerful geometric way to visualize the compute-optimal question. An IsoFLOP curve is the set of all (N, D) pairs that share the same compute budget C = 6ND. On a log-log plot of N vs D, this is a straight line with slope −1.

Now imagine evaluating L(N, D) at every point along an IsoFLOP curve. Plotting L against log N, you get a U-shaped curve: too few parameters and the model has insufficient capacity; too many parameters and you've spent too much of your budget on N, leaving too few tokens D for training. There is a sweet spot in the middle, the compute-optimal allocation.

The Chinchilla method 2 recipe. Pick five or ten compute budgets. For each budget C, run many training runs varying N (and setting D = C/(6N)). Fit a parabola to loss as a function of log N. The minimum is N*(C). Do this for all budgets, and you get the N*(C) power law. Chinchilla found N*(C) and D*(C) both scaling approximately as C^0.5 (fitted exponents 0.49 and 0.51), that is, roughly as the square root of compute.
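The recipe can be sketched on synthetic data. Here the "training runs" are stand-ins: losses are generated from the parametric L(N, D) formula with the Chapter 3 constants, not from real training. The budget and grid are arbitrary illustrative choices.

```python
import numpy as np

E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28   # parametric constants from Chapter 3

def loss_on_isoflop(n_params, c_flops):
    """L(N, D) with D fixed by the budget: D = C / (6 N)."""
    n_tokens = c_flops / (6 * n_params)
    return E + A / n_params**ALPHA + B / n_tokens**BETA

c_flops = 1e21
n_grid = np.logspace(8, 11, 61)              # 100M to 100B parameters
losses = loss_on_isoflop(n_grid, c_flops)    # stand-in for measured final losses

n_best_grid = n_grid[np.argmin(losses)]

# Fit a parabola to loss vs log10(N) and read off its vertex.
a2, a1, a0 = np.polyfit(np.log10(n_grid), losses, 2)
n_best_parabola = 10 ** (-a1 / (2 * a2))

print(f"grid minimum:    N* ~ {n_best_grid:.2e}")
print(f"parabola vertex: N* ~ {n_best_parabola:.2e}")
```

Repeating this for several budgets and fitting a line to log N* against log C would give the exponent of N*(C). With real runs, the parabola also smooths out run-to-run noise that a plain argmin would chase.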

The Kaplan 2020 paper found a different exponent: N*(C) ∝ C^0.73 (more compute → much bigger model). The key difference was the LR schedule: Kaplan's comparisons effectively used runs stopped before their cosine schedule had finished decaying, which penalizes runs on fewer tokens; Chinchilla matched the cosine schedule length to each run's token budget. With schedules matched this way, model and data scale roughly equally with compute. Kaplan's bias toward larger models was largely a methodological artifact.

[Interactive figure: IsoFLOP, loss vs model size along a fixed-compute curve. Each curve shows loss L(N, D) where D = C/(6N) for a fixed C. The minimum of each curve is the compute-optimal N*. Slider: log10(C) in FLOPs (default 10^22).]
Why does the IsoFLOP curve have a minimum rather than monotonically decreasing with N?

Chapter 5: Deriving the Compute-Optimal Allocation

We want to minimize L(N, D) subject to the constraint C = 6ND. This is a constrained optimization. Substitute D = C/(6N) into L:

L(N) = E + A/N^α + B/(C/(6N))^β = E + A·N^(−α) + B·(6N/C)^β

Take the derivative with respect to N, set it to zero. After algebra (which we'll spare — the key steps are: bring the β power through the denominator, differentiate, set dL/dN = 0), the compute-optimal N* satisfies:

N* = G · (C/6)^(β/(α+β))

where G = [(αA)/(βB)]^(1/(α+β)) is a constant and β/(α+β) is the exponent; D* = C/(6N*) then scales as C^(α/(α+β)). With the Chinchilla constants (α = 0.34, β = 0.28), the exponents are about 0.45 for N* and 0.55 for D*, so both scale approximately as C^0.5. That is: as compute doubles, optimal model size and optimal token count each grow by roughly √2 ≈ 1.41.

Misconception: bigger is always better. Many practitioners pre-Chinchilla thought: "I should train the biggest model I can on whatever data I have." This is wrong under compute-optimal logic. A 7B model trained on 200B tokens beats a 70B model trained on 20B tokens at the same compute budget. The smaller-trained-longer model wins because it uses its compute budget efficiently across both dimensions.
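The 7B versus 70B comparison can be checked against the parametric formula. This is a sketch using the Chapter 3 constants; the numbers are the formula's predictions, not measured results.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28   # Chinchilla constants from Chapter 3

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

small = (7e9, 200e9)    # 7B parameters, 200B tokens
large = (70e9, 20e9)    # 70B parameters, 20B tokens

for name, (n, d) in [("7B / 200B", small), ("70B / 20B", large)]:
    print(f"{name}: C = {train_flops(n, d):.2e} FLOPs, predicted L = {loss(n, d):.3f}")
```

Both configurations cost 8.4×10^21 FLOPs, and the formula predicts a loss of roughly 2.15 for the smaller model against roughly 2.31 for the larger one.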

Here is the derivation in more detail. We have the first-order condition dL/dN = 0:

−α·A·N^(−α−1) + β·B·(6/C)^β·N^(β−1) = 0

Rearranging: α·A / N^(α+1) = β·B·(6/C)^β / N^(1−β). This gives N^(α+β) = (αA)/(βB) · (C/6)^β, and so:

N* = [(αA)/(βB)]^(1/(α+β)) · (C/6)^(β/(α+β))
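For completeness, the same first-order condition also fixes D* and the ratio D*/N*. The following sketch follows directly from the expression for N* and the constraint D = C/(6N); it introduces no new assumptions.

```latex
\begin{aligned}
D^* &= \frac{C}{6N^*}
     = \left[\frac{\beta B}{\alpha A}\right]^{\frac{1}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \\
\frac{D^*}{N^*} &= \left[\frac{\beta B}{\alpha A}\right]^{\frac{2}{\alpha+\beta}}
       \left(\frac{C}{6}\right)^{\frac{\alpha-\beta}{\alpha+\beta}}.
\end{aligned}
```

With α = 0.34 and β = 0.28 the exponents are β/(α+β) ≈ 0.45 for N*, α/(α+β) ≈ 0.55 for D*, and (α−β)/(α+β) ≈ 0.10 for the ratio. The ratio is exactly constant in C only when α = β.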
python — compute-optimal allocator
def chinchilla_optimal(C_flops,
                        alpha=0.34, beta=0.28,
                        A=406.4, B=410.7):
    """Returns (N*, D*) for a given compute budget C (FLOPs)."""
    # From the first-order condition of L(N, D=C/(6N)):
    G_num = alpha * A
    G_den = beta  * B
    G = (G_num / G_den) ** (1 / (alpha + beta))
    exponent = beta / (alpha + beta)    # ~0.45 for Chinchilla constants
    N_star = G * (C_flops / 6) ** exponent
    D_star = C_flops / (6 * N_star)
    return N_star, D_star

# GPT-3 compute budget
C = 3.15e23
N_star, D_star = chinchilla_optimal(C)
print(f"N* = {N_star/1e9:.1f}B params")   # ~24.5B
print(f"D* = {D_star/1e12:.1f}T tokens")   # ~2.1T
print(f"D*/N* ratio = {D_star/N_star:.1f}")  # ~87 with these constants
With Chinchilla constants, how do N* and D* each scale with compute C?

Chapter 6: The D = 20N Rule, Derived

From the first-order condition, we found N* and D* as functions of C. What is the ratio D*/N*? If both scaled exactly as C^0.5, the ratio would be the same at every compute budget. Chinchilla's IsoFLOP-based estimates come very close to that, and they imply the rule of thumb:

D* ≈ 20 · N*

Let's verify with a worked example. The Chinchilla model itself: 70B parameters, 1.4T tokens. D/N = 1.4×10^12 / (70×10^9) = 20. Exactly. LLaMA 65B: trained on 1.4T tokens. D/N = 1.4T / 65B = 21.5. LLaMA-2 70B: 2T tokens on 70B params. D/N = 28.6, slightly overtrained for inference efficiency (we'll discuss why below).

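Misconception: 20 tokens per parameter is a hard rule. It's not. In Chinchilla's IsoFLOP-based estimates the ratio stays near 20 (roughly 18 to 22 across C = 10^20 to 10^24), but plugging the parametric constants (α = 0.34, β = 0.28) into the closed form gives a ratio that grows with compute and sits well above 20. "20×" is a good rule of thumb, but what really matters is fitting and using a compute-optimal formula for your own setup, not treating D = 20N as gospel.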

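Where the constant ratio comes from. If α = β (capacity and data terms decay at the same rate), the closed form gives N* ∝ C^0.5 and D* ∝ C^0.5, so D*/N* is a constant set by A and B. With α = 0.34 and β = 0.28, the capacity term decays slightly faster, so the optimum shifts toward data: N* ∝ C^0.45, D* ∝ C^0.55, and D*/N* ∝ C^0.10. The closed form with these constants therefore does not reproduce 20 (it gives about 87 at GPT-3's budget). The ≈ 20 figure comes from the IsoFLOP fits, whose exponents are almost exactly 0.5.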
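How the tokens-per-parameter ratio behaves depends on which fit is used, and a short sketch makes that visible. It uses the parametric constants from Chapter 3 for the closed form, and the D = 20N rule with C = 6ND for the rule of thumb; both function names are hypothetical.

```python
ALPHA, BETA, A, B = 0.34, 0.28, 406.4, 410.7   # parametric constants from Chapter 3

def closed_form_optimum(c_flops):
    """(N*, D*) from the first-order condition of L(N, C / (6N))."""
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_star = g * (c_flops / 6) ** (BETA / (ALPHA + BETA))
    return n_star, c_flops / (6 * n_star)

def rule_of_thumb(c_flops, tokens_per_param=20):
    """(N, D) with D = 20 N and 6 N D = C, i.e. N = sqrt(C / 120)."""
    n = (c_flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

for exp10 in range(20, 25):
    c = 10.0 ** exp10
    n_cf, d_cf = closed_form_optimum(c)
    n_rt, d_rt = rule_of_thumb(c)
    print(f"C=1e{exp10}: closed form N*={n_cf:.2e}, D*/N*={d_cf / n_cf:5.1f} | "
          f"20N rule N={n_rt:.2e}, D={d_rt:.2e}")
```

With these constants the closed-form ratio drifts from about 40 at C = 10^20 to about 98 at C = 10^24, while the rule of thumb holds it at 20 by construction. The gap is a reminder that the fitted constants carry real uncertainty.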

Model        | N (params) | D (tokens) | D/N  | vs Chinchilla
GPT-3        | 175B       | 300B       | 1.7  | Massively undertrained
Chinchilla   | 70B        | 1.4T       | 20   | Optimal
LLaMA 65B    | 65B        | 1.4T       | 21.5 | Near-optimal
LLaMA 2 70B  | 70B        | 2.0T       | 28.6 | Intentionally overtrained
LLaMA 3 70B  | 70B        | 15T        | 214  | Hugely overtrained (inference)

LLaMA 3's 70B model trained on 15T tokens is overtrained by 10× relative to compute-optimal. This is intentional: once deployed, inference runs millions of times. A smaller, cheaper model (even if it required extra training compute) saves far more inference FLOPs than the extra training cost. Compute-optimal for training ≠ compute-optimal for deployment.

A new model has 1.4B parameters. According to the Chinchilla D = 20N rule, how many tokens should it be trained on?
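The rule-of-thumb arithmetic for a given model size can be sketched as below. The helper names are hypothetical, and the ratio of 20 is the rule of thumb from this chapter, not an exact optimum.

```python
TOKENS_PER_PARAM = 20   # rule-of-thumb ratio from this chapter

def tokens_for_model(n_params: float) -> float:
    """Training tokens suggested by D = 20 N."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Training compute from the 6ND rule."""
    return 6 * n_params * n_tokens

n_params = 1.4e9
n_tokens = tokens_for_model(n_params)
c_flops = training_flops(n_params, n_tokens)
print(f"N = {n_params:.1e} params -> D = {n_tokens:.1e} tokens, C = {c_flops:.2e} FLOPs")
```

For a 1.4B-parameter model this gives 28B tokens and a training budget of about 2.35×10^20 FLOPs.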

Chapter 7: Showcase: Compute-Optimal Allocator

This is the payoff. The simulator below implements the full Chinchilla compute-optimal framework. Set a compute budget (in log10 FLOPs), and it shows you: (a) the IsoFLOP curve in (N, D) space, (b) the optimal N* and D* via the first-order condition, (c) the predicted final loss L(N*, D*), and (d) what happens if you deviate from optimality by going bigger or smaller.

[Interactive figure: Chinchilla Compute-Optimal Allocator, full simulation. Set a compute budget and see the optimal N* and D*. The U-curve shows how loss degrades when you deviate, and the "model size" slider explores non-optimal allocations. Sliders: log10(C) FLOPs budget (default 10^22), your N as a percentage of N* (default 100%).]
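Key insight from this simulation. Moving away from the optimum in either direction costs loss, but the two directions fail differently. A model with 10N* parameters and D*/10 tokens is starved of data: its B/D^β term dominates. A model with N*/10 parameters and 10D* tokens is overtrained: it has seen 10× more tokens than optimal, and its A/N^α term dominates. How lopsided the U-curve looks depends on the reference point. Measured from the D = 20N allocation at C = 10^22, going 10× larger costs roughly three times as much loss as going 10× smaller under the L(N, D) formula; measured from the closed-form optimum, the two penalties are of similar size. The practical takeaway: if you must err, err toward smaller models trained longer, which are also cheaper at inference time.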
For a compute budget of C = 10^23 FLOPs with Chinchilla constants, approximately what are N* and D*?

Chapter 8: Hyperparameters & Practical Use

Beyond the N vs D tradeoff, scaling laws let you answer hyperparameter questions at small scale before paying big-scale costs. The scaling law design procedure has three steps: (1) train several small models with different hyperparameters, (2) fit a scaling law for each configuration, (3) extrapolate to the target scale and pick the winning hyperparameter.

Scaling law as a hyperparameter oracle. Want to know if Transformers beat LSTMs at 70B scale without spending $10M training both? Train 10M and 100M parameter versions of each, fit their scaling laws, extrapolate the lines to 70B. The crossing point (if any) tells you which is better at scale, and you've answered the question for <$10K.
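The oracle idea can be sketched with synthetic numbers. The two "architectures" below are hypothetical, and their losses are generated from made-up power laws, not from real training runs. The sketch only shows the mechanics of fitting in log-log space and extrapolating.

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Least-squares line in log-log space; returns (log10_A, exponent)."""
    slope, intercept = np.polyfit(np.log10(n_params), np.log10(losses), 1)
    return intercept, -slope

def predict(log10_a, exponent, n_params):
    return 10 ** log10_a * n_params ** (-exponent)

# Synthetic small-scale results for two hypothetical architectures (not real measurements).
sizes = np.array([1e7, 3e7, 1e8])
runs = {
    "arch_a": 24.0 * sizes ** -0.10,   # worse at small scale, steeper slope
    "arch_b": 12.0 * sizes ** -0.07,   # better at small scale, shallower slope
}

target = 70e9
predictions = {}
for name, losses in runs.items():
    log10_a, exponent = fit_power_law(sizes, losses)
    predictions[name] = predict(log10_a, exponent, target)
    print(f"{name}: exponent={exponent:.3f}, predicted loss at 70B={predictions[name]:.3f}")

winner = min(predictions, key=predictions.get)
print("predicted winner at target scale:", winner)
```

In this toy setup the architecture that looks worse at 100M parameters is predicted to win at 70B because its line is steeper. Only extrapolation can reveal that kind of reversal, and it is also the kind of prediction most sensitive to fitting error.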

The Kaplan 2020 paper did exactly this for several hyperparameter questions. Key findings:

[Interactive figure: Extrapolation, fit small runs and predict the big run. A power law is fitted to small-scale runs (shown as dots), and the line is extrapolated to predict loss at large scale. Sliders: measurement noise (default 0.010), true exponent β (default 0.050).]
Important limitation: scaling laws can be wrong. The Kaplan paper predicted "bigger models scale better than longer training." Chinchilla showed this was a methodological artifact of early LR stopping. Always: (1) validate your fitting methodology on held-out small runs before extrapolating, (2) use multiple fitting methods, (3) treat the constant as having large uncertainty. The exponent is usually stable; the constant (offset) is not.
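Point (1) can be sketched on synthetic data: fit on the smallest runs, then check the prediction on held-out larger runs before trusting a long extrapolation. The true constant, exponent, and noise level below are arbitrary illustrative values, and the noise is assumed to be Gaussian in log10(loss).

```python
import numpy as np

rng = np.random.default_rng(0)
true_a, true_beta, noise = 30.0, 0.050, 0.010    # synthetic ground truth, not measured values

compute = np.logspace(17, 21, 9)
log_loss = np.log10(true_a) - true_beta * np.log10(compute)
log_loss_noisy = log_loss + rng.normal(0.0, noise, size=compute.shape)

# Fit on the 6 smallest runs, hold out the 3 largest.
fit_x, fit_y = np.log10(compute[:6]), log_loss_noisy[:6]
slope, intercept = np.polyfit(fit_x, fit_y, 1)

held_out_pred = intercept + slope * np.log10(compute[6:])
held_out_err = np.abs(held_out_pred - log_loss_noisy[6:])

print(f"fitted exponent: {-slope:.3f} (true {true_beta})")
print(f"fitted constant: {10**intercept:.1f} (true {true_a})")
print("held-out |log10 error|:", np.round(held_out_err, 3))
```

The intercept is extrapolated all the way back to log10(C) = 0, so small slope errors turn into large errors in the constant. If the held-out errors are much larger than the noise in the fitted runs, that is a sign the fitting procedure should not be trusted further out.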
What is the practical scaling law design procedure for hyperparameter selection?

Chapter 9: Connections & Cheat Sheet

Scaling laws sit at the heart of modern LLM development. Here is the conceptual map of how this lesson connects to the rest of the course and to the broader ML landscape.

Concept                 | Formula/Rule                     | Where Used
Compute budget          | C ≈ 6ND                          | Every training decision
Data scaling law        | L(D) = E + B/D^β                 | Dataset size planning
Model scaling law       | L(N) = E + A/N^α                 | Architecture search
Joint law               | L(N,D) = E + A/N^α + B/D^β       | Compute-optimal allocation
Chinchilla rule         | D* ≈ 20 · N*                     | Training token budgeting
Inference override      | D/N up to 200+                   | Deployed small models
Hyperparameter transfer | Fit at small scale, extrapolate  | Architecture, optimizer choices
Cheat sheet: given a compute budget C (FLOPs).
1. Compute-optimal N* ≈ (C/6)^0.5 × constant; under D = 20N this is N* ≈ sqrt(C/120).
2. D* = C/(6N*) ≈ 20N*.
3. If inference cost matters: shrink N below N* by 2–10× and train on correspondingly more tokens.
4. Predict the final loss: L = E + A/N*^α + B/D*^β. The two reducible terms shrink as C grows; only E remains.

What this lesson did not cover: (a) scaling laws for fine-tuning (the data/task interaction is different), (b) scaling laws for chain-of-thought reasoning (emerges non-smoothly), (c) scaling laws for inference-time compute (Snell+ 2024), (d) how data quality shifts the constant A in the loss formula, and (e) architectural variants like Mixture of Experts (which change the N→FLOPs relationship). These are covered in CS336 Lec 10 (inference), Lec 11 (scaling details), and the MoE lesson.

Broader Connections

  • Bias-Variance Tradeoff — The theoretical underpinning of the E + A/N + B/D form
  • Hestness+ 2017 — Earliest systematic neural scaling study
  • Kaplan+ 2020 — The paper that started modern scaling law practice
  • Hoffmann+ 2022 (Chinchilla) — The compute-optimal correction
"The exciting thing about scaling laws is not just that they tell you what to do — it's that they tell you what you cannot know until you've done the experiment at small scale." — Tatsu Hashimoto, CS336 Lec 9, 2025