You have 10,000 H100s for a month. Should you train a bigger model or feed it more data? The answer is neither obvious nor arbitrary — it is a predictable mathematical consequence of power-law scaling. This lesson derives the C = 6ND compute rule, the joint loss formula L(N,D), the Chinchilla compute-optimal allocation, and the famous D = 20N rule — with worked numbers throughout.
Imagine this: your friend hands you 10,000 H100 GPUs for a month and says "build the best open-source language model you can." Your infra team is ready, your dataset is assembled. The last question — the one that will determine whether you win or lose — is: which model do I train?
Specifically: given a fixed compute budget C, do you train a large model on fewer tokens, or a small model on more tokens? This is not a question you can answer by intuition. GPT-3 (175B params) was trained on 300B tokens, barely 2 tokens per parameter. Is that right? Too few? Too many? The answer from Chinchilla (2022) is that GPT-3 was substantially undertrained. Chinchilla itself, 70B parameters trained on 1.4T tokens, used about 5.9×10^23 FLOPs by the 6ND rule derived below. That is roughly twice GPT-3's budget and the same budget as the 280B-parameter Gopher, and Chinchilla outperformed both of those much larger models. The lesson: for a fixed budget, a smaller model trained on far more tokens wins.
Before we can answer it, we need to understand the empirical phenomenon that makes it answerable at all: scaling laws. These are clean, predictable power-law relationships between compute/data/parameters and model loss. They were not derived from first principles — they were discovered empirically, and they have held across orders of magnitude of scale.
Here is the key empirical finding, shown as a log-log plot. On log scales, loss decreases as a straight line with compute, parameters, or data. That straight line on a log-log plot is the signature of a power law.
Drag the exponent slider. A power law $L = A \cdot C^{-\beta}$ is a straight line on log-log axes, and empirically β ≈ 0.05 for compute.
A power law is a relationship of the form $y = A \cdot x^{-\alpha}$. On a log-log plot, this becomes $\log(y) = \log(A) - \alpha \log(x)$, which is a straight line with slope −α. Any time you see a straight line on a log-log plot, you are looking at a power law.
Why should neural network loss follow a power law with data? Here is the cleanest intuition, drawn from statistics. Consider estimating the mean of a Gaussian: you observe $x_1, x_2, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ and estimate $\hat{\mu} = \frac{1}{n}\sum_i x_i$. The expected squared error is $\mathbb{E}[(\hat{\mu} - \mu)^2] = \sigma^2 / n$. That is a power law in n with exponent 1. On a log-log plot: $\log(\text{Error}) = -\log(n) + 2\log(\sigma)$, a straight line with slope −1.
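The Gaussian-mean argument above can be checked numerically. The sketch below is a small Monte Carlo experiment, not part of the original lesson: the values of `mu`, `sigma`, the sample sizes, and the trial count are arbitrary choices made for illustration. It estimates the squared error at several sample sizes and fits a line in log-log space.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
mu, sigma = 2.0, 3.0                     # arbitrary illustrative values
sample_sizes = np.array([10, 30, 100, 300, 1000, 3000])
n_trials = 1000                          # independent repetitions per sample size

mse = []
for n in sample_sizes:
    samples = rng.normal(mu, sigma, size=(n_trials, n))
    mu_hat = samples.mean(axis=1)        # one estimate of the mean per trial
    mse.append(np.mean((mu_hat - mu) ** 2))
mse = np.array(mse)

# Straight-line fit in log-log space: log(mse) = log(const) + slope * log(n)
slope, intercept = np.polyfit(np.log10(sample_sizes), np.log10(mse), 1)
print(f"fitted slope    = {slope:.2f}  (theory: -1)")
print(f"fitted constant = {10**intercept:.2f}  (theory: sigma^2 = {sigma**2:.2f})")
```

The fitted slope should land close to −1 and the constant close to σ², up to Monte Carlo noise.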
The same logic extends to model parameters. Larger models can represent more complex functions. As N grows, the approximation error caused by limited model capacity shrinks as a power law. And for compute, since C ≈ 6ND (as we derive in Ch 2), any power law in N or D induces a power law in C as well.
python — verifying power laws with numpy

```python
import numpy as np

# Suppose we have (compute, loss) pairs from training runs
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = np.array([3.20, 2.95, 2.74, 2.55, 2.38])

# Take logs: power law becomes linear
log_C = np.log10(compute)
log_L = np.log10(loss)

# Fit a line: log(L) = log(A) - beta * log(C)
slope, intercept = np.polyfit(log_C, log_L, 1)
print(f"Exponent beta = {-slope:.3f}")      # ~0.032 for these illustrative points
print(f"Constant A = {10**intercept:.3f}")  # ~12.0 for these illustrative points

# From Kaplan+ 2020: beta_compute ~ 0.050, A ~ 2.3
# From Chinchilla: beta_compute ~ 0.032 (different fitting methodology)
```
Before we can reason about scaling laws, we need to count FLOPs. The standard formula is C ≈ 6ND, where N is the number of parameters and D is the number of training tokens. Let's derive it from scratch, following the same accounting from CS336 Lec 2.
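Consider one transformer with N parameters processing D total tokens. The key insight is that every parameter participates in one multiply-accumulate (MAC) operation for each token. A MAC counts as 2 FLOPs (one multiply, one add), so the forward pass costs roughly 2ND FLOPs. The backward pass costs about twice the forward pass, because each layer must compute gradients with respect to both its inputs and its weights, which is roughly 4ND FLOPs. Forward plus backward gives C ≈ 2ND + 4ND = 6ND.
Let's verify with a worked example. GPT-3 has N = 175 billion parameters and was trained on D = 300 billion tokens. $C = 6 \times 175 \times 10^{9} \times 300 \times 10^{9} = 6 \times 5.25 \times 10^{22} = 3.15 \times 10^{23}$ FLOPs. OpenAI reported ~$3.14 \times 10^{23}$ FLOPs, so our formula is accurate to within about 0.3%.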
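The 6ND formula omits attention's $O(T^2)$ quadratic cost and embedding layer FLOPs. For models whose width is large relative to their context length these corrections are small: Kaplan et al. 2020 estimate the per-token forward cost as $2N + 2\,n_{\text{layer}}\,n_{\text{ctx}}\,d_{\text{attn}}$ and treat the second term as minor when $d_{\text{model}} \gg n_{\text{ctx}}/12$. For context length T = 128K (long-context models), attention costs can become significant, but for the standard scaling law analysis, 6ND is the right number to use.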
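To get a feel for when the attention correction matters, here is a rough sketch. It rests on two assumptions that are not from this lesson: the per-token forward cost follows Kaplan et al.'s approximation 2N + 2·n_layer·n_ctx·d_attn with d_attn = d_model, and the non-embedding parameter count is N ≈ 12·n_layer·d_model². The function name and the three configurations are illustrative only.

```python
def attention_fraction(n_layer, d_model, n_ctx):
    """Share of per-token forward FLOPs spent on context-dependent attention."""
    n_params = 12 * n_layer * d_model**2        # assumed non-embedding parameter count
    param_flops = 2 * n_params                  # the 2N term behind 6ND
    attn_flops = 2 * n_layer * n_ctx * d_model  # context-dependent term (d_attn = d_model)
    return attn_flops / (param_flops + attn_flops)

# Illustrative configurations (layer counts and widths are assumptions)
gpt3_like_short = attention_fraction(n_layer=96, d_model=12288, n_ctx=2048)
gpt3_like_long = attention_fraction(n_layer=96, d_model=12288, n_ctx=131072)
small_model = attention_fraction(n_layer=24, d_model=2048, n_ctx=4096)

print(f"wide model, 2K context:   {gpt3_like_short:.1%}")   # 1.4%
print(f"wide model, 128K context: {gpt3_like_long:.1%}")    # 47.1%
print(f"narrow model, 4K context: {small_model:.1%}")       # 14.3%
```

Under these assumptions the ratio of attention to parameter FLOPs is n_ctx / (12·d_model), so the correction depends on width relative to context length rather than on parameter count alone.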
python — compute budget calculator

```python
def compute_budget(N_params, D_tokens):
    """Estimate training FLOPs from 6ND rule."""
    return 6 * N_params * D_tokens

def tokens_from_budget(C_flops, N_params):
    """How many tokens fit in budget C given model size N?"""
    return C_flops / (6 * N_params)

# GPT-3 check
C_gpt3 = compute_budget(175e9, 300e9)
print(f"GPT-3 compute: {C_gpt3:.2e} FLOPs")     # 3.15e+23

# Chinchilla (70B, 1.4T tokens)
C_chinchilla = compute_budget(70e9, 1.4e12)
print(f"Chinchilla: {C_chinchilla:.2e} FLOPs")  # 5.88e+23, about 1.9x GPT-3

# Tokens a 70B model could see within GPT-3's budget (affordable, not necessarily optimal)
D_at_70b = tokens_from_budget(C_gpt3, 70e9)
print(f"D for 70B at GPT-3's C: {D_at_70b:.2e}")  # 7.50e+11, i.e. 750B tokens
```
Now we can write down the key formula. Following the parametric fit in Chinchilla (2022), which refines the joint scaling analysis of Kaplan et al. 2020, the validation loss of a transformer depends on both parameter count N and data tokens D according to:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Each term has a clear meaning. E is the irreducible entropy: the minimum achievable loss from the Bayes-optimal predictor on this data distribution. It is a constant that no amount of compute can reduce. $A/N^{\alpha}$ is the model capacity term: with only N parameters, the model cannot perfectly fit the data, and this approximation error shrinks as a power law in N. $B/D^{\beta}$ is the data term: with only D tokens, the model cannot see every pattern, and this estimation error shrinks as a power law in D.
From Chinchilla's parametric fit (Approach 3 in the paper): E ≈ 1.69, A ≈ 406.4, α ≈ 0.34, B ≈ 410.7, β ≈ 0.28. Kaplan et al. 2020 fit a differently structured joint formula with different constants, but it captures the same qualitative picture: separate power-law penalties for limited parameters and limited data.
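As a quick sanity check on the joint formula, the sketch below evaluates it with the constants quoted above for two (N, D) pairs mentioned earlier in the lesson. The function name `chinchilla_loss` is ours, and the outputs are predictions of the fitted formula, not measured losses of the real models.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Predicted loss E + A/N^alpha + B/D^beta with the rounded constants above."""
    return E + A / N**alpha + B / D**beta

loss_gpt3_shape = chinchilla_loss(175e9, 300e9)        # GPT-3-sized allocation
loss_chinchilla_shape = chinchilla_loss(70e9, 1.4e12)  # Chinchilla-sized allocation

print(f"175B params, 300B tokens: {loss_gpt3_shape:.2f}")        # about 2.00
print(f"70B params, 1.4T tokens:  {loss_chinchilla_shape:.2f}")  # about 1.94
```

For the 175B / 300B allocation the data term (about 0.25) dominates the capacity term (about 0.06), which is what "undertrained" looks like in this formula.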
Fix compute budget (slider). The red dot moves along the IsoFLOP curve C = 6ND as you split compute between parameters N and tokens D. Watch the loss.
Here is a powerful geometric way to visualize the compute-optimal question. An IsoFLOP curve is the set of all (N, D) pairs that share the same compute budget C = 6ND. On a log-log plot of N vs D, this is a straight line with slope −1.
Now imagine evaluating L(N, D) at every point along an IsoFLOP curve. You will get a U-shaped curve (on a linear plot of L vs N): too few parameters and the model has insufficient capacity; too many parameters and you've spent too much of your budget on N, leaving too few tokens D for training. There is a sweet spot in the middle — the compute-optimal allocation.
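The U-shape can be reproduced with a simple sweep. This sketch assumes the joint loss formula and rounded Chinchilla constants used in this lesson; the budget `C = 1e22` and the grid bounds are arbitrary illustrative choices.

```python
import numpy as np

def loss(N, D, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Joint loss formula with the rounded constants quoted in this lesson."""
    return E + A / N**alpha + B / D**beta

C = 1e22                              # fixed compute budget in FLOPs (illustrative)
N_grid = np.logspace(8, 12, 401)      # candidate model sizes, 100M to 1T parameters
D_grid = C / (6 * N_grid)             # tokens implied by the IsoFLOP constraint C = 6ND
L_grid = loss(N_grid, D_grid)

i_best = int(np.argmin(L_grid))
print(f"best N on grid: {N_grid[i_best]:.2e} params")
print(f"tokens at best: {D_grid[i_best]:.2e}")
print(f"loss at edges vs minimum: {L_grid[0]:.3f}, {L_grid[-1]:.3f} vs {L_grid[i_best]:.3f}")
```

Both ends of the sweep have visibly higher loss than the interior minimum: the left end is capacity-starved and the right end is data-starved.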
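Chinchilla's fits give $N^*(C) \propto C^{a}$ and $D^*(C) \propto C^{b}$ with a and b both close to 0.5. The Kaplan 2020 paper found a different exponent: $N^*(C) \propto C^{0.73}$ (more compute → much bigger model). A key difference identified by the Chinchilla authors was the LR schedule: Kaplan's runs did not match the schedule length to each run's token count, so shorter runs were evaluated before the learning rate had fully decayed, whereas Chinchilla matched the cosine cycle to each run's training length. With matched LR schedules, both model and data scale roughly equally with compute. Kaplan's bias toward larger models was largely a methodological artifact.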
Each curve shows loss L(N, D) where D = C/(6N) for a fixed C. The minimum of each curve is the compute-optimal N*. Adjust the budget to see how N* shifts.
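We want to minimize L(N, D) subject to the constraint C = 6ND. This is a constrained optimization. Substitute D = C/(6N) into L:

$$L(N) = E + \frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta}$$

Take the derivative with respect to N and set it to zero. The key steps are: bring the β power through the denominator, differentiate, and set dL/dN = 0. The compute-optimal N* then satisfies:

$$N^*(C) = G\left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad D^*(C) = \frac{C}{6N^*} = G^{-1}\left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}$$

where $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$ is a constant derived from A, B, α, β, and β/(α+β) is the exponent for N*. With the Chinchilla constants (α = 0.34, β = 0.28), the exponents work out to $N^* \propto C^{0.45}$ and $D^* \propto C^{0.55}$, both close to $C^{0.5}$. That is: as compute doubles, optimal model size and optimal token count each grow by roughly √2 ≈ 1.41 (more precisely, by about 1.37 and 1.46).
Here is the derivation in more detail. We have the first-order condition dL/dN = 0:

$$\frac{dL}{dN} = -\frac{\alpha A}{N^{\alpha+1}} + \beta B\left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0$$

Rearranging: $\alpha A / N^{\alpha+1} = \beta B\,(6/C)^{\beta} / N^{1-\beta}$. This gives $N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}$, and so:

$$N^* = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}\left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}$$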
python — compute-optimal allocator

```python
def chinchilla_optimal(C_flops, alpha=0.34, beta=0.28, A=406.4, B=410.7):
    """Returns (N*, D*) for a given compute budget C (FLOPs)."""
    # From the first-order condition of L(N, D=C/(6N)):
    G_num = alpha * A
    G_den = beta * B
    G = (G_num / G_den) ** (1 / (alpha + beta))
    exponent = beta / (alpha + beta)  # ~0.45 for Chinchilla constants
    N_star = G * (C_flops / 6) ** exponent
    D_star = C_flops / (6 * N_star)
    return N_star, D_star

# GPT-3 compute budget
C = 3.15e23
N_star, D_star = chinchilla_optimal(C)
print(f"N* = {N_star/1e9:.1f}B params")      # ~24.5B with these rounded constants
print(f"D* = {D_star/1e12:.1f}T tokens")     # ~2.1T
print(f"D*/N* ratio = {D_star/N_star:.1f}")  # ~87

# Note: the rounded parametric constants favor a smaller model than the
# ~70B / 1.4T (D/N ~ 20) configuration from Chinchilla's IsoFLOP analysis.
```
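One way to gain confidence in the closed form is to check the first-order condition numerically. The sketch below re-implements the formula with the same rounded constants (the helper names are ours) and confirms that the derivative of the loss along the IsoFLOP curve is approximately zero at N*, and that N* scales with the expected exponent.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def loss_on_isoflop(N, C):
    """Loss along the IsoFLOP curve: D is tied to N through C = 6ND."""
    D = C / (6 * N)
    return E + A / N**ALPHA + B / D**BETA

def n_star_closed_form(C):
    """Closed-form minimizer from the first-order condition."""
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    return G * (C / 6) ** (BETA / (ALPHA + BETA))

C = 3.15e23
N_star = n_star_closed_form(C)

# Central finite difference of the loss with respect to log(N), evaluated at N*
eps = 1e-4
dL_dlogN = (loss_on_isoflop(N_star * (1 + eps), C)
            - loss_on_isoflop(N_star * (1 - eps), C)) / (2 * eps)

# Growth factor of N* per 10x compute; should equal 10 ** (beta / (alpha + beta))
growth_per_decade = n_star_closed_form(10 * C) / N_star

print(f"N* = {N_star:.3e}")
print(f"dL/dlogN at N* = {dL_dlogN:.2e}  (should be near 0)")
print(f"N* growth per 10x compute = {growth_per_decade:.2f}")
```

A derivative near zero, together with higher loss at both 0.5·N* and 2·N*, indicates that the closed form sits at the bottom of the U-curve.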
From the Chinchilla first-order condition, we found N* and D* both as functions of C. What is the ratio D*/N*? When the two scaling exponents are nearly equal, it is almost constant across compute budgets. The Chinchilla results imply roughly:

$$D^* \approx 20\,N^*$$

Let's verify with a worked example. The Chinchilla model itself: 70B parameters, 1.4T tokens. $D/N = 1.4 \times 10^{12} / (70 \times 10^{9}) = 20$. Exactly. LLaMA 65B: trained on 1.4T tokens. D/N = 1.4T / 65B = 21.5. LLaMA-2 70B: 2T tokens on 70B params. D/N = 28.6, slightly overtrained for inference efficiency (we'll discuss why below).
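If you take the 20-tokens-per-parameter rule at face value, combining D = 20N with C = 6ND gives N = sqrt(C / 120). The sketch below applies that to Chinchilla's own budget and to the opening scenario. The hardware numbers are assumptions for illustration, not figures from this lesson: roughly 989 TFLOP/s dense BF16 peak per H100, 40% utilization, and a 30-day run.

```python
import math

def allocate_with_ratio(C_flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = tokens_per_param * N for (N, D)."""
    N = math.sqrt(C_flops / (6 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Chinchilla's own budget, 6 * 70e9 * 1.4e12 FLOPs, recovers 70B / 1.4T
N_chin, D_chin = allocate_with_ratio(6 * 70e9 * 1.4e12)
print(f"Chinchilla budget: N = {N_chin/1e9:.0f}B, D = {D_chin/1e12:.1f}T")

# Opening scenario: 10,000 H100s for a month (all hardware figures are assumptions)
n_gpus = 10_000
peak_flops_per_gpu = 989e12    # assumed dense BF16 peak, FLOP/s
utilization = 0.40             # assumed model FLOPs utilization
seconds = 30 * 24 * 3600       # assumed 30-day run
C_cluster = n_gpus * peak_flops_per_gpu * utilization * seconds

N_cluster, D_cluster = allocate_with_ratio(C_cluster)
print(f"Cluster budget: {C_cluster:.2e} FLOPs")
print(f"20:1 allocation: N = {N_cluster/1e9:.0f}B params, D = {D_cluster/1e12:.1f}T tokens")
```

Changing the utilization assumption moves the answer by a square-root factor: halving the available FLOPs shrinks both N and D by about 1.41×.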
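The near-constant ratio follows from the exponents. If α = β, then N* and D* both scale as $C^{0.5}$ and D*/N* is a fixed constant set by A, B, α, and β. With α = 0.34 and β = 0.28 the exponents are only approximately equal: the capacity term decays slightly faster, which means slightly fewer parameters and slightly more data as compute grows ($N^* \propto C^{0.45}$, $D^* \propto C^{0.55}$), so the optimal ratio drifts upward slowly, as $C^{(\alpha-\beta)/(\alpha+\beta)} \approx C^{0.10}$. The D/N ≈ 20 figure is the rule of thumb from Chinchilla's empirical IsoFLOP analysis at the budgets it studied. Plugging the rounded constants above into the closed form gives a larger ratio, so treat 20 as a guideline rather than an exact consequence of these five numbers.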
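How the optimal ratio behaves across budgets can be checked directly from the closed-form solution of the first-order condition. The sketch below assumes that closed form and the rounded constants quoted in this lesson; the budgets and the equal-exponent comparison case (α = β = 0.31) are illustrative choices.

```python
import numpy as np

def optimal_ratio(C, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """D*/N* from the closed-form minimizer of E + A/N^alpha + B/D^beta on C = 6ND."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_star = G * (C / 6) ** (beta / (alpha + beta))
    D_star = C / (6 * N_star)
    return D_star / N_star

budgets = np.array([1e20, 1e22, 1e24])

# Rounded Chinchilla constants: the ratio drifts upward with compute
ratios = optimal_ratio(budgets)
drift = np.polyfit(np.log10(budgets), np.log10(ratios), 1)[0]

# Hypothetical equal exponents: the ratio is the same at every budget
ratios_equal = optimal_ratio(budgets, alpha=0.31, beta=0.31)

for C, r, r_eq in zip(budgets, ratios, ratios_equal):
    print(f"C = {C:.0e}: D*/N* = {r:7.1f}   (equal exponents: {r_eq:.1f})")
print(f"log-log drift of the ratio = {drift:.3f}")  # (alpha - beta) / (alpha + beta)
```

The drift exponent is small, so over a few orders of magnitude of compute a single tokens-per-parameter rule of thumb remains a reasonable first approximation.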
| Model | N (params) | D (tokens) | D/N | vs Chinchilla |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | Massively undertrained |
| Chinchilla | 70B | 1.4T | 20 | Optimal |
| LLaMA 65B | 65B | 1.4T | 21.5 | Near-optimal |
| LLaMA 2 70B | 70B | 2.0T | 28.6 | Intentionally overtrained |
| LLaMA 3 70B | 70B | 15T | 214 | Hugely overtrained (inference) |
LLaMA 3's 70B model trained on 15T tokens is overtrained by 10× relative to compute-optimal. This is intentional: once deployed, inference runs millions of times. A smaller, cheaper model (even if it required extra training compute) saves far more inference FLOPs than the extra training cost. Compute-optimal for training ≠ compute-optimal for deployment.
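The training-versus-deployment tradeoff can be made concrete with a break-even estimate. The sketch below rests on several assumptions that go beyond the lesson: inference costs about 2N FLOPs per token (forward pass only, ignoring attention), the joint loss formula with the rounded Chinchilla constants is trusted well outside its fitted range, and "same quality" means the same predicted loss. The 10B comparison model is hypothetical.

```python
E, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

def loss(N, D):
    """Joint loss formula with the rounded constants quoted in this lesson."""
    return E + A / N**ALPHA + B / D**BETA

def tokens_to_match(N, target_loss):
    """Training tokens a model of size N needs to reach target_loss."""
    gap = target_loss - E - A / N**ALPHA
    if gap <= 0:
        raise ValueError("model is too small to reach this loss at any token count")
    return (B / gap) ** (1 / BETA)

N_big, D_big = 70e9, 1.4e12                  # reference configuration
target = loss(N_big, D_big)

N_small = 10e9                               # hypothetical smaller deployed model
D_small = tokens_to_match(N_small, target)   # overtrain it to the same predicted loss

extra_training = 6 * N_small * D_small - 6 * N_big * D_big
saving_per_token = 2 * N_big - 2 * N_small   # assumed 2N FLOPs per generated token
break_even_tokens = extra_training / saving_per_token

print(f"small model needs D = {D_small:.2e} tokens (D/N = {D_small/N_small:.0f})")
print(f"extra training compute = {extra_training:.2e} FLOPs")
print(f"break-even after {break_even_tokens:.2e} inference tokens")
```

Past the break-even point, every additional served token favors the smaller, longer-trained model, which is the logic behind D/N ratios far above 20.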
This is the payoff. The simulator below implements the full Chinchilla compute-optimal framework. Set a compute budget (in log10 FLOPs), and it shows you: (a) the IsoFLOP curve in (N, D) space, (b) the optimal N* and D* via the first-order condition, (c) the predicted final loss L(N*, D*), and (d) what happens if you deviate from optimality by going bigger or smaller.
Set a compute budget and see the optimal N* and D*. The U-curve shows how loss degrades when you deviate. The "model size" slider lets you explore the non-optimal allocations.
Beyond the N vs D tradeoff, scaling laws let you answer hyperparameter questions at small scale before paying big-scale costs. The scaling law design procedure has three steps: (1) train several small models with different hyperparameters, (2) fit a scaling law for each configuration, (3) extrapolate to the target scale and pick the winning hyperparameter.
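The three-step procedure can be sketched on synthetic data. Everything below is made up for illustration: the two configurations, their underlying power laws, the noise level, and the target scale are hypothetical, and a pure power law with no irreducible term is assumed for the fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: "train" small models. Here we simulate losses from hypothetical
# power laws L = A * C^(-beta) with 0.5% multiplicative noise.
small_compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
true_laws = {"config_a": (30.0, 0.055), "config_b": (20.0, 0.045)}
runs = {
    name: A * small_compute ** (-beta) * (1 + 0.005 * rng.standard_normal(small_compute.size))
    for name, (A, beta) in true_laws.items()
}

# Step 2: fit a power law per configuration (a line in log-log space).
fits = {
    name: np.polyfit(np.log10(small_compute), np.log10(losses), 1)
    for name, losses in runs.items()
}

# Step 3: extrapolate to the target scale and pick the lowest predicted loss.
target_compute = 1e24
predicted = {
    name: 10 ** (slope * np.log10(target_compute) + intercept)
    for name, (slope, intercept) in fits.items()
}
winner = min(predicted, key=predicted.get)

for name in true_laws:
    print(f"{name}: fitted beta = {-fits[name][0]:.3f}, predicted loss at 1e24 = {predicted[name]:.3f}")
print(f"winner at target scale: {winner}")
```

In this toy setup the two configurations are nearly tied at the smallest budgets and their fitted lines cross, so the ranking at the target scale comes from the slopes. That is why the fit, not the smallest run, should drive the decision.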
The Kaplan 2020 paper did exactly this for several hyperparameter questions. Key findings:
Fit a power law to small-scale runs (shown as dots). The line extrapolates to predict loss at large scale. Adjust noise to see how robust the extrapolation is.
Scaling laws sit at the heart of modern LLM development. Here is the conceptual map of how this lesson connects to the rest of the course and to the broader ML landscape.
| Concept | Formula/Rule | Where Used |
|---|---|---|
| Compute budget | C ≈ 6ND | Every training decision |
| Data scaling law | $L(D) = E + B/D^{\beta}$ | Dataset size planning |
| Model scaling law | $L(N) = E + A/N^{\alpha}$ | Architecture search |
| Joint law | $L(N,D) = E + A/N^{\alpha} + B/D^{\beta}$ | Compute-optimal allocation |
| Chinchilla rule | D* ≈ 20 · N* | Training token budgeting |
| Inference override | D/N up to 200+ | Deployed small models |
| Hyperparameter transfer | Fit at small scale, extrapolate | Architecture, optimizer choices |
What this lesson did not cover: (a) scaling laws for fine-tuning (the data/task interaction is different), (b) scaling laws for chain-of-thought reasoning (emerges non-smoothly), (c) scaling laws for inference-time compute (Snell+ 2024), (d) how data quality shifts the constant A in the loss formula, and (e) architectural variants like Mixture of Experts (which change the N→FLOPs relationship). These are covered in CS336 Lec 10 (inference), Lec 11 (scaling details), and the MoE lesson.
"The exciting thing about scaling laws is not just that they tell you what to do — it's that they tell you what you cannot know until you've done the experiment at small scale." — Tatsu Hashimoto, CS336 Lec 9, 2025