You have a $10M GPU budget for your next model. Scaling laws predict the optimal N and D — but only if you fit them correctly. This lesson shows you how: IsoFLOP profiling, parametric fitting via Huber loss, learning-rate transfer with muP, data-constrained scaling when tokens repeat, and how CerebrasGPT, MiniCPM, and DeepSeek ran real scaling studies. Derive everything from scratch, predict before you pay.
Your team just secured funding. The compute budget is locked: 10,000 H100-days, roughly 10^23 FLOPs. You need to decide: how many parameters, and how many tokens? Get it wrong by 2× in either direction and you waste millions of dollars training a suboptimal model.
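As a sanity check on that budget arithmetic, here is a minimal sketch that converts GPU-days into FLOPs. The peak throughput (about 1e15 dense BF16 FLOP/s per H100) and the 40% utilization figure are assumptions chosen for illustration, not numbers from this lesson.

```python
SECONDS_PER_DAY = 86_400

def gpu_days_to_flops(gpu_days, peak_flops_per_s=1e15, mfu=0.4):
    """Total training FLOPs delivered by `gpu_days` of accelerator time.

    peak_flops_per_s and mfu (model FLOPs utilization) are illustrative
    assumptions, not measured values.
    """
    return gpu_days * SECONDS_PER_DAY * peak_flops_per_s * mfu

budget_flops = gpu_days_to_flops(10_000)
print(f"{budget_flops:.2e} FLOPs")
```

Under these assumptions the budget comes out at a few times 10^23 FLOPs, the same order of magnitude as the figure quoted above. Halving the assumed utilization halves the budget, so it is worth measuring on your own cluster before planning the study.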
CS336 Lecture 9 gave you the answer in principle: train compute-optimally, N* and D* ∝ C^0.5, the famous D = 20N rule. But that lesson glossed over a crucial question: how do you actually know those exponents? Someone had to fit them. And the fit requires a series of small, carefully designed training runs: a scaling study. This lesson is about how to run that study correctly.
The core challenge: fitting L(N,D) requires training models across a range of compute budgets, but each of those models must be trained to its optimal point (not early-stopped). That's expensive. Done naively, fitting the scaling law costs as much as training the big model. Done cleverly — using IsoFLOP profiles, WSD learning rate schedules, and muP hyperparameter transfer — it costs 1–5% of the big run.
Here's the concrete workflow we'll build toward: (1) Define a small-model training API with fixed hyperparameters. (2) Run a grid of models at several FLOPs budgets, sweeping model size N while keeping compute C fixed. (3) Fit the parametric formula L(N,D) to the observations using Huber loss. (4) Extrapolate to your target budget. (5) Read off N*, D*, and the predicted loss. That's it — and it genuinely works, as CerebrasGPT, MiniCPM, and DeepSeek demonstrated.
The lesson also covers a subtler problem: what if you can't keep buying fresh data? When unique tokens run out and you must repeat your corpus, the effective token count degrades — and the scaling law needs a correction term. We'll derive that too.
Figure: cost of the scaling study as a fraction of the target budget (log10 FLOPs). The study fits L(N,D) from small runs so the big run is spent wisely.
Recall from Lecture 9 the joint scaling law proposed by Chinchilla (Hoffmann et al., 2022):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
Here E is the irreducible entropy of language (the loss a perfect model still achieves, roughly 1.69 nats for English text), A and B are constants that depend on the data distribution, and α, β are the scaling exponents. The Chinchilla paper reports α = 0.34, β = 0.28, A = 406.4, B = 410.7, E = 1.69.
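To get a feel for the magnitudes, the sketch below plugs the reported constants into the formula for Chinchilla's own configuration (70B parameters, 1.4T tokens). The function name `chinchilla_loss` is just a label for this example.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss (nats per token) from L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / N**alpha + B / D**beta

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
loss_70b = chinchilla_loss(70e9, 1.4e12)
param_term = 406.4 / 70e9**0.34     # capacity-limited part of the loss
data_term = 410.7 / 1.4e12**0.28    # data-limited part of the loss

print(f"predicted loss: {loss_70b:.2f}")
print(f"param term: {param_term:.3f}, data term: {data_term:.3f}")
```

Most of the predicted loss is the irreducible term E. The two reducible terms are each well below 0.2 nats at this scale, which is why small fitting errors in α and β matter so much when you extrapolate.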
But where do these five numbers come from? You observe a set of (N, D, L) triples from your training runs. You want to find the (E, A, B, α, β) that best explains those observations. This is a nonlinear least-squares fitting problem.
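The Huber loss with threshold δ is:

$$\text{Huber}_\delta(r) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \tfrac{1}{2}\delta \right) & \text{if } |r| > \delta \end{cases}$$

To fit L(N,D), you minimize:

$$\min_{E, A, B, \alpha, \beta} \; \sum_{i} \text{Huber}_\delta\!\left( \log \hat{L}(N_i, D_i) - \log L_i \right)$$

where $\hat{L}(N_i, D_i) = E + A / N_i^{\alpha} + B / D_i^{\beta}$ is the prediction for run i and $L_i$ is its observed loss.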
Note the log-space residual: you fit log L, not L. This is because the power-law formula spans orders of magnitude — a raw residual on L would be dominated by high-loss (small-compute) points. In log space, all scales contribute equally.
In practice, you use scipy.optimize.minimize with L-BFGS-B and bounds to prevent negative parameters. Here's the full fitting code:
```python
# Fit the scaling law with Huber loss
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    # Huber loss on log-space residuals; delta is the outlier threshold.
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def loss_pred(params, N, D):
    E, A, B, alpha, beta = params
    return E + A * N**(-alpha) + B * D**(-beta)

def objective(params, N_arr, D_arr, L_arr):
    # Fit in log-space: minimize sum of Huber(log L_hat - log L_obs).
    L_hat = loss_pred(params, N_arr, D_arr)
    resid = np.log(L_hat) - np.log(L_arr)
    return np.sum(huber(resid))

# Observed (N, D, L) triples from your runs. The values below are synthetic
# placeholders generated from the Chinchilla constants so the script runs.
N_arr = np.array([1e8, 1e8, 3e8, 3e8, 1e9, 1e9, 3e9, 3e9])
D_arr = np.array([2e9, 2e10, 6e9, 6e10, 2e10, 2e11, 6e10, 6e11])
L_arr = loss_pred([1.69, 406.4, 410.7, 0.34, 0.28], N_arr, D_arr)

# Bounds keep every parameter positive: alpha/beta in (0, 1), E in (0, 3).
bounds = [(0.01, 3.0),   # E
          (1.0, 1e5),    # A
          (1.0, 1e5),    # B
          (0.05, 0.99),  # alpha
          (0.05, 0.99)]  # beta
x0 = [1.69, 406.4, 410.7, 0.34, 0.28]  # Chinchilla values as a warm start

result = minimize(objective, x0, args=(N_arr, D_arr, L_arr),
                  method='L-BFGS-B', bounds=bounds)
E, A, B, alpha, beta = result.x
```
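A note on conditioning: A and B are in the hundreds while α and β are below 1, which makes the direct parameterization awkward for the optimizer. Hoffmann et al. describe optimizing a = log A, b = log B, e = log E instead, computing log L̂ with a log-sum-exp and restarting from a grid of initializations. The sketch below shows only that reparameterized objective; the function names are illustrative, and it can be handed to `scipy.optimize.minimize` in the same way as the objective above.

```python
import numpy as np
from scipy.special import logsumexp

def log_loss_pred(theta, N, D):
    """log(E + A/N^alpha + B/D^beta) with theta = (a, b, e, alpha, beta),
    where a = log A, b = log B, e = log E."""
    a, b, e, alpha, beta = theta
    N = np.asarray(N, dtype=float)
    D = np.asarray(D, dtype=float)
    terms = np.stack([a - alpha * np.log(N),    # log(A / N^alpha)
                      b - beta * np.log(D),     # log(B / D^beta)
                      np.full_like(N, e)])      # log(E)
    return logsumexp(terms, axis=0)

def huber(r, delta=1e-3):
    return np.where(np.abs(r) <= delta, 0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def objective_log_params(theta, N, D, L_obs):
    return np.sum(huber(log_loss_pred(theta, N, D) - np.log(L_obs)))

# Consistency check against the direct formula at the Chinchilla constants.
theta_ref = np.array([np.log(406.4), np.log(410.7), np.log(1.69), 0.34, 0.28])
N = np.array([1e8, 1e9, 1e10])
D = np.array([2e9, 2e10, 2e11])
direct = 1.69 + 406.4 * N**-0.34 + 410.7 * D**-0.28
assert np.allclose(np.exp(log_loss_pred(theta_ref, N, D)), direct)
```

Because the logs keep all five parameters on a similar numeric scale, the bounds on A and B become unnecessary: any real a and b correspond to positive A and B.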
Before you can fit L(N,D), you need the data: many (N, D, L) triples, each from a fully trained model. But "fully trained" means trained to compute-optimality — and doing that naively requires running each model until its loss stops improving, which takes O(n²) total compute for n models.
The key trick is IsoFLOP profiling (Chinchilla Method 2). Instead of training models for an unknown duration, you pick a target compute budget C and train every model in your sweep for exactly D = C/(6N) tokens. This ensures that every point in your sweep lies on the same IsoFLOP surface — the constraint C = 6ND is satisfied by construction.
Here's how to extract the optimal N from an IsoFLOP slice. At fixed C, train models with N in {10M, 30M, 100M, 300M, 1B} each for D = C/(6N) tokens. Plot loss vs log(N). The curve is U-shaped: too small an N and the model lacks capacity; too large an N and D is tiny and the model is data-starved. The minimum is N*(C).
Doing this for 3–5 budgets gives 3–5 (C, N*) pairs. Since Chinchilla theory predicts N* ∝ C^0.5, you fit a line on log-log axes to get the exponent and the constant of proportionality. That plus a similar fit for D* gives your full scaling recipe.
```python
# IsoFLOP sweep: train at fixed compute and find the best N on the grid
def isoflop_sweep(C_budget, N_sizes, train_fn):
    """
    C_budget: total FLOPs (e.g. 1e20)
    N_sizes:  list of model sizes to try
    train_fn: function(N, D) -> final_loss
    Returns:  (N_opt, loss_opt, all_results)
    """
    results = []
    for N in N_sizes:
        D = C_budget / (6 * N)   # tokens so that FLOPs = 6ND = C_budget
        if D < 1e6:              # skip runs with too few tokens (unstable)
            continue
        L = train_fn(N, D)
        results.append((N, D, L))
    if not results:
        raise ValueError("no model size leaves enough tokens at this budget")
    # Find the minimum-loss N among the sizes that were actually trained
    best = min(results, key=lambda x: x[2])
    return best[0], best[2], results

# Example with simulated loss using the Chinchilla constants
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def sim_loss(N, D):
    return E + A * N**(-alpha) + B * D**(-beta)

# The grid must bracket the minimum, so it extends past 1e9 for C = 1e21.
N_grid = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9, 1e10]
N_opt, L_opt, pts = isoflop_sweep(1e21, N_grid, sim_loss)
# N_opt is the best grid point (3e9 here, loss about 2.34). The continuous
# optimum for these constants is near 1.8e9, between 1e9 and 3e9.
```
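A grid search only returns the best size you happened to train. A common refinement is to fit a parabola to loss versus log N around the bottom of the U and read off its vertex. The sketch below does this with `np.polyfit` on simulated losses; `parabola_minimum` is a hypothetical helper name, and the simulated loss reuses the Chinchilla constants purely as a stand-in for real runs.

```python
import numpy as np

def parabola_minimum(N_values, losses):
    """Fit loss = c2*x^2 + c1*x + c0 with x = log10(N); return N at the vertex."""
    x = np.log10(np.asarray(N_values, dtype=float))
    c2, c1, c0 = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    if c2 <= 0:
        raise ValueError("slice is not U-shaped; widen the N grid")
    return 10 ** (-c1 / (2 * c2))

def sim_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # Stand-in for a real training run (assumption: Chinchilla constants).
    return E + A * N**(-alpha) + B * D**(-beta)

C = 1e21
N_grid = [3e8, 1e9, 3e9, 1e10]
losses = [sim_loss(N, C / (6 * N)) for N in N_grid]
N_star = parabola_minimum(N_grid, losses)
print(f"interpolated N* = {N_star:.2e}")
```

The interpolated minimum lands between the 1e9 and 3e9 grid points. With real, noisy runs you would restrict the fit to the few points nearest the bottom of the curve, since the U is only approximately parabolic.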
Figure: IsoFLOP curves. Each curve holds total FLOPs C = 6ND constant, and the minimum of each curve gives the optimal N*(C). The minimum shifts right as C grows.
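Once each budget has its N*, the log-log line fit mentioned above takes a few lines. This is a sketch on synthetic (C, N*) pairs that were generated to satisfy D* = 20 N* exactly; real minima will scatter around the line, and `fit_power_law` is an illustrative name.

```python
import numpy as np

def fit_power_law(C_values, N_opt_values):
    """Least-squares line in log10 space: log N* = a * log C + log k."""
    slope, intercept = np.polyfit(np.log10(C_values),
                                  np.log10(N_opt_values), deg=1)
    return slope, 10 ** intercept

# Synthetic IsoFLOP minima (assumption: exactly D* = 20 N*, so N* = sqrt(C/120)).
C_values = np.array([1e18, 1e19, 1e20, 1e21])
N_opt_values = np.sqrt(C_values / 120.0)

a, k = fit_power_law(C_values, N_opt_values)

# Extrapolate two orders of magnitude beyond the largest study budget.
C_target = 1e23
N_target = k * C_target ** a
D_target = C_target / (6 * N_target)
print(f"exponent a = {a:.3f}")
print(f"N* = {N_target:.2e}, D* = {D_target:.2e}")
```

The extrapolation is only as good as the exponent. An error of 0.02 in a changes N* at 100× the largest study budget by roughly 10%, so report a confidence interval (for example by bootstrapping the budgets) alongside the point estimate.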
The original Chinchilla paper actually proposed three different methods to estimate compute-optimal allocation. They give similar answers — but each has different cost and reliability tradeoffs. Understanding all three is essential if you're designing your own scaling study.
| Method | What you do | What you fit | Cost |
|---|---|---|---|
| Method 1: Envelope | Train many models at many (N,D) combos; for each compute budget C, take the run with the lowest loss | Plot D* vs N* on log-log; fit D* ∝ N* | High (all combos) |
| Method 2: IsoFLOP | For each compute budget C, sweep N and set D = C/(6N); find optimal N*(C) | Plot log C vs log N*; fit N* ∝ C^a | Medium |
| Method 3: Parametric fit | Train a grid of (N,D) pairs fully; fit L(N,D) = E + A/N^α + B/D^β | Differentiate analytically to get N*(C), D*(C) | Low (once fitted) |
Method 1 (the lower envelope): Overlay the loss-versus-compute curves of all your training runs and take their lower envelope. For each compute budget C, pick the run with the lowest loss, which gives one (N*, D*) pair per budget. You're asking "given this much compute, which model size and token count reached the best loss?" Plot D* vs N* on a log-log scatter and fit a line: log D* = m log N* + b. If m ≈ 1 you get D* ∝ N (the 20N rule). CerebrasGPT and MiniCPM used Method 1 and found D*/N ratios from 20× to 192×, depending on data quality.
Method 2 — IsoFLOP minima: Covered in Chapter 2. DeepSeek used this approach. The benefit: each point in your grid is guaranteed to be at a well-defined compute budget, making the fit cleaner. The cost: you need multiple complete sweeps (not just one), and each sweep uses a different budget.
Method 3 — the joint fit: The most principled. Fit all five parameters simultaneously from all your (N, D, L) observations. MiniCPM used this as their "primary" method. The result: they found D*/N ratios of 192 — far above GPT-3's 1.7 — suggesting that with better data (The Pile, RedPajama, high-quality filtered web), optimal allocation favors much more data per parameter than the original Chinchilla estimated.
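The "differentiate analytically" step of Method 3 has a closed form. Minimizing E + A/N^α + B/D^β subject to C = 6ND gives N* = G·(C/6)^(β/(α+β)) and D* = (C/6)^(α/(α+β))/G, with G = (αA/(βB))^(1/(α+β)). The sketch below evaluates it with the Chinchilla constants as defaults; substitute your own fitted values.

```python
def compute_optimal(C, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Closed-form minimizer of E + A/N^alpha + B/D^beta subject to C = 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = (C / 6.0) ** (alpha / (alpha + beta)) / G
    L_opt = E + A / N_opt**alpha + B / D_opt**beta
    return N_opt, D_opt, L_opt

def loss_on_isoflop(N, C, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # Loss along the constraint D = C / (6N), used to check the optimum.
    D = C / (6.0 * N)
    return E + A / N**alpha + B / D**beta

N_opt, D_opt, L_opt = compute_optimal(1e21)
print(f"N* = {N_opt:.2e}, D* = {D_opt:.2e}, loss = {L_opt:.3f}")
```

With these constants the exponent on C is β/(α+β) ≈ 0.45 for N* and about 0.55 for D*, so N* at C = 1e21 comes out near 1.8e9. The exponents are fully determined by α and β, which is why Method 3 is cheap once the fit exists and fragile if the fit is off.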
The theory is clean. Running the actual study is full of gotchas. This chapter covers the practical workflow used by CerebrasGPT, MiniCPM, and DeepSeek — three papers that published enough detail to replicate.
Step 1 — define a training API. Before sweeping model sizes, fix every hyperparameter that isn't being studied: optimizer (AdamW), β1 = 0.9, β2 = 0.95, weight decay = 0.1, gradient clipping = 1.0, learning rate schedule, vocab size, tokenizer. This ensures that observed differences in loss come from N and D, not confounders.
Step 2: set model aspect ratios. At each size N, you need to pick depth d and width w. The scaling law treats models as a single number N, but real models have shapes. Most labs fix the aspect ratio (d/w) and scale both together. Since N ≈ 12·d·w² for a standard Transformer (excluding embeddings), a fixed ratio means width ∝ N^(1/3) and depth ∝ N^(1/3), roughly. CerebrasGPT used a standard "square" ratio. MiniCPM explicitly fixed the ratio and only varied overall scale.
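To turn a target N into a concrete shape, you can invert the usual non-embedding parameter count N ≈ 12 · depth · width². The sketch below is one way to do it; the width-to-depth ratio of 128 and the head dimension of 64 are assumptions for illustration, and `shape_for_params` is a hypothetical helper.

```python
def shape_for_params(N_target, width_per_layer=128, head_dim=64):
    """Pick (depth, width) with width/depth close to width_per_layer
    and 12 * depth * width^2 close to N_target."""
    # With depth = width / r, N = 12 * width^3 / r, so width = (N * r / 12)^(1/3).
    width_exact = (N_target * width_per_layer / 12.0) ** (1.0 / 3.0)
    # Round width to a whole number of attention heads.
    width = max(head_dim, head_dim * round(width_exact / head_dim))
    depth = max(1, round(width / width_per_layer))
    n_params = 12 * depth * width**2
    return depth, width, n_params

for target in (1e8, 1e9, 1e10):
    depth, width, n_params = shape_for_params(target)
    print(f"target {target:.0e}: depth={depth}, width={width}, "
          f"params={n_params:.2e}")
```

Rounding means the realized parameter count differs slightly from the target. Feed the realized count, not the target, into the scaling-law fit.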
Step 3 — choose the learning rate schedule carefully. Standard cosine decay requires the final number of training tokens to be set at the start. But in a scaling study, you want to explore different D values from one training run. The WSD (Warmup-Stable-Decay) schedule, introduced by MiniCPM and also used by DeepSeek, solves this. The LR warms up, stays flat for a long stable phase, then decays in a short final phase (≈10% of total tokens).
Here's the WSD schedule in code:
```python
# WSD learning rate schedule
import math

def wsd_lr(step, total_steps, lr_max, warmup_frac=0.01, decay_frac=0.10):
    """
    Warmup-Stable-Decay: linear warmup, flat stable phase, cosine decay to zero.
    Key: the decay phase is only ~10% of total steps, so rerunning it is cheap.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))  # avoid division by zero
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return lr_max * step / warmup_steps                 # linear warmup
    elif step < decay_start:
        return lr_max                                       # stable phase
    else:
        # Cosine decay from decay_start to total_steps
        progress = (step - decay_start) / max(1, total_steps - decay_start)
        return lr_max * (1 + math.cos(math.pi * progress)) / 2

# Checkpoint at the end of the stable phase, then resume with different decay windows.
# Each resume is a cheap extra data point for the scaling law fit.
```
Step 4 — validate the fit. Once fitted, test whether the law predicts held-out models. Train a model at a compute budget you did NOT include in the fit. Compare predicted vs actual loss. If the error is <1%, you can trust the extrapolation. If it's >5%, something is wrong — either the power-law form breaks at that scale, or your small-scale runs had numerical issues.
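The held-out check can be written down as a small gate. The 1% and 5% thresholds come from the paragraph above; the "borderline" band between them, the function name, and the example numbers are assumptions of this sketch.

```python
def check_extrapolation(predicted_loss, actual_loss):
    """Relative error of the scaling-law prediction on a held-out run."""
    rel_err = abs(predicted_loss - actual_loss) / actual_loss
    if rel_err < 0.01:
        verdict = "trust the extrapolation"
    elif rel_err > 0.05:
        verdict = "investigate before scaling up"
    else:
        verdict = "borderline: add held-out points"
    return rel_err, verdict

# Hypothetical numbers: the law predicted 2.10, the held-out run reached 2.11.
rel_err, verdict = check_extrapolation(predicted_loss=2.10, actual_loss=2.11)
print(f"relative error {rel_err:.2%}: {verdict}")
```

The most informative held-out run sits at a budget larger than anything in the fit, because that tests extrapolation rather than interpolation.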
Suppose you've fitted your scaling law from models up to 1B parameters, and you want to train a 7B model. You know the optimal N and D. But what learning rate and batch size should you use? The scaling law for L(N,D) says nothing about hyperparameters. You need separate rules for those.
Batch size scaling. There's a well-known empirical rule (from Kaplan 2020 and later MiniCPM): the optimal batch size grows as loss decreases. Specifically:

$$B_{\text{opt}}(L) = B_0 \, L^{-k}$$

where B0 and k are fitted from small-scale runs. The intuition: early in training, the gradient signal is large relative to its noise, so even small batches point in a good direction. As loss decreases, the signal shrinks relative to the noise, and larger batches are needed to average that noise out. So optimal batch size is not a constant; it is a function of where you are in training.
Learning rate scaling (the naive approach). DeepSeek's strategy: assume LR is roughly scale-invariant (because most Transformer hyperparameters are), run a small-scale sweep to find optimal LR, and transfer that LR directly to the big model. In practice, DeepSeek found LR needs only a mild adjustment with scale: the optimal LR decreases very slowly as N grows, roughly LR ∝ N^(-0.1).
Learning rate scaling (the principled approach, muP). Standard parameterization (SP) sets weight init ∝ 1/√n and LR = const. But as width n grows, the effective learning signal to each neuron changes, making the optimal LR a function of width. The maximum update parameterization (muP) corrects this so that the optimal LR stays constant across width. Chapter 6 derives this carefully.
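Both empirical rules are one-liners once the constants are fitted. The sketch below applies B_opt = B0 · L^(-k) and LR ∝ N^(-0.1); every constant in it (B0, k, the proxy learning rate) is a made-up placeholder, not a value from any of the papers.

```python
def optimal_batch_size(loss, B0, k):
    """B_opt = B0 * loss**(-k); B0 and k come from a small-scale fit."""
    return B0 * loss ** (-k)

def transfer_lr(lr_proxy, N_proxy, N_target, exponent=-0.1):
    """Scale a tuned LR from a proxy model to a target size, LR ∝ N**exponent."""
    return lr_proxy * (N_target / N_proxy) ** exponent

# Hypothetical fit: B0 = 2e6 tokens, k = 4.
batch_early = optimal_batch_size(loss=2.5, B0=2e6, k=4)
batch_late = optimal_batch_size(loss=2.0, B0=2e6, k=4)

# Hypothetical proxy: LR 3e-4 tuned at 1B parameters, transferred to 7B.
lr_7b = transfer_lr(lr_proxy=3e-4, N_proxy=1e9, N_target=7e9)

print(f"batch grows by {batch_late / batch_early:.2f}x as loss falls 2.5 -> 2.0")
print(f"LR at 7B: {lr_7b:.2e}")
```

A 7× increase in parameters lowers the learning rate by less than 20% under the N^(-0.1) rule, which is why a direct transfer often works tolerably well over small scale gaps.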
Figure: optimal LR vs width. Under standard (SP) parameterization, optimal LR drifts down with width. Under muP, it stays flat. The size of the SP drift is what makes transferring hyperparameters from small proxy models unreliable without muP.
The dream: tune your hyperparameters on a tiny 10M parameter "proxy model," then transfer them directly to a 70B model — no re-tuning, no expensive sweeps at large scale. That's the promise of the Maximum Update Parametrization (muP).
The key insight behind muP is that standard neural network parameterizations cause the magnitude of activations and updates to change as you increase width. When activations blow up or shrink with width, the optimal learning rate must change to compensate. muP corrects the parameterization so that both activations and gradient updates remain O(1) as width n → ∞, making the optimal LR width-independent.
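Deriving the init rule (A1). Consider a deep linear network $h_l = W_l h_{l-1}$. If the entries of $W_l$ are drawn i.i.d. from $\mathcal{N}(0, \sigma^2)$, then by random matrix theory the operator norm satisfies $\|W_l\| \approx \sigma(\sqrt{n_{l-1}} + \sqrt{n_l})$. To keep $\|h_l\| = \Theta(\sqrt{n_l})$ given $\|h_{l-1}\| = \Theta(\sqrt{n_{l-1}})$, the operator norm must be $\Theta(\sqrt{n_l / n_{l-1}})$, so we need:

$$\sigma = \Theta\!\left( \sqrt{\frac{n_l}{n_{l-1}}} \cdot \frac{1}{\sqrt{n_{l-1}} + \sqrt{n_l}} \right)$$

For square layers ($n_l = n_{l-1} = n$), this simplifies to σ = Θ(1/√n), the standard He init. So standard init satisfies A1. ✓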
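You can check the square-layer case numerically. The sketch below draws Gaussian weight matrices with σ = 1/√n and measures the largest singular value and the output activation norm. The widths and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for n in (256, 1024):
    sigma = 1.0 / np.sqrt(n)
    W = rng.normal(0.0, sigma, size=(n, n))
    h_prev = rng.normal(0.0, 1.0, size=n)       # norm is about sqrt(n)
    op_norm = np.linalg.norm(W, 2)              # largest singular value
    ratio = np.linalg.norm(W @ h_prev) / np.sqrt(n)
    results[n] = (op_norm, ratio)
    print(f"n={n}: ||W|| = {op_norm:.2f}, ||h_l|| / sqrt(n) = {ratio:.2f}")
```

The operator norm stays close to σ(√n + √n) = 2 at both widths, and the output norm stays close to √n. Both quantities are Θ(1) in the right units as width grows, which is what condition A1 asks for.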
Deriving the LR rule (A2). The gradient update to $W_l$ (for SGD) is:

$$\Delta W_l = -\eta_l \, \nabla_{h_l}\ell \; h_{l-1}^{\top}$$

This is a rank-1 outer product, so its operator norm is $\|\Delta W_l\| = \eta_l \, \|\nabla_{h_l}\ell\| \, \|h_{l-1}\|$. For the activation update $\Delta h_l = \Delta W_l \, h_{l-1}$ to be $\Theta(\sqrt{n_l})$, we need:

$$\eta_l = \Theta\!\left( \frac{n_l}{n_{l-1}} \right)$$

For a square layer with $n_l = n_{l-1} = n$, this gives $\eta_l = \Theta(1)$: the optimal learning rate should be constant with width. Standard parameterization uses a single constant η, which works in this case. But for layers with different fan-in and fan-out (like the attention projection from $d_{\text{head}} \cdot H$ to $d_{\text{model}}$), the ratio $n_l / n_{l-1}$ changes with scale, so the LR must be adjusted per-layer.
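Two facts used in this derivation are easy to verify numerically: the operator norm of a rank-1 update is the product of the two vector norms, and the induced activation change points along the gradient scaled by ‖h_{l-1}‖². The sketch below checks both on random stand-in vectors and then applies the η_l = Θ(n_l / n_{l-1}) rule as a per-layer multiplier; `sgd_lr_multiplier` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, eta = 512, 2048, 0.1
h_prev = rng.normal(size=n_in)     # stand-in for h_{l-1}
grad_h = rng.normal(size=n_out)    # stand-in for the gradient w.r.t. h_l

delta_W = -eta * np.outer(grad_h, h_prev)          # rank-1 SGD update
op_norm = np.linalg.norm(delta_W, 2)
assert np.isclose(op_norm,
                  eta * np.linalg.norm(grad_h) * np.linalg.norm(h_prev))

# The activation change is the gradient direction scaled by ||h_{l-1}||^2.
delta_h = delta_W @ h_prev
assert np.allclose(delta_h, -eta * grad_h * (h_prev @ h_prev))

def sgd_lr_multiplier(fan_in, fan_out):
    """Per-layer SGD learning-rate multiplier from eta_l = Theta(fan_out / fan_in)."""
    return fan_out / fan_in

for fan_in, fan_out in [(1024, 1024), (512, 2048), (2048, 512)]:
    print(f"{fan_in} -> {fan_out}: multiplier {sgd_lr_multiplier(fan_in, fan_out)}")
```

Square layers get a multiplier of 1, while expanding and contracting layers get multipliers above and below 1. Note that this is the SGD rule; Adam normalizes the update, so its per-layer scaling differs.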
What does muP actually change in the code?
```python
# muP vs Standard Parametrization (SP) differences
import math
import torch.nn as nn

# Standard Parametrization (SP)
#   Embedding: init std = 1.0 (or 0.02 in GPT-2)
#   All weight matrices: init std = 1/sqrt(fan_in)
#   Learning rate: uniform η for all layers

# muP changes:
#   1. Hidden weight matrices keep init std = 1/sqrt(fan_in), which already
#      satisfies A1; the output (readout) layer uses the smaller std = 1/fan_in
#   2. Per-layer LR: scale by 1/width_mult for hidden layers (Adam version)
#   3. Output logit scaling: multiply by 1/width_mult, i.e. ∝ 1/d
#      (prevents logit explosion)

class MuPLinear(nn.Linear):
    """Hidden-layer linear module; width_mult = width / base_width."""

    def __init__(self, fan_in, fan_out, width_mult=1.0):
        super().__init__(fan_in, fan_out)
        self.width_mult = width_mult
        nn.init.normal_(self.weight, std=1.0 / math.sqrt(fan_in))  # hidden-layer init

    def lr_scale(self):
        # This per-layer scale multiplies the global LR; it shrinks with width.
        return 1.0 / self.width_mult

# CerebrasGPT values: scale_emb=10, lr=6e-3, init_base=0.08
# MiniCPM values:     scale_emb=12, lr=0.01, init_std=0.1
# Key: these are stable across 9M → 13B parameter range (CerebrasGPT)
```
This is the payoff. You're running a real scaling study. Below you can configure the study parameters — number of IsoFLOP budgets, model sizes to sweep, and the true underlying scaling law — and watch the entire fitting + prediction pipeline execute live. The tool fits L(N,D) from the simulated training runs and predicts the optimal N and D for any target budget.
Simulated training runs (dots) at several IsoFLOP budgets. Fitted parabolas show L(N) at each budget. The orange envelope traces the optimal N*(C) across budgets. The prediction box shows the fitted parameters and forecasted loss at the target budget.
The Chinchilla formula assumes you have unlimited unique tokens. But what if you don't? LLaMA 3 trained on 15T tokens — and the high-quality filtered web is perhaps 5T unique tokens. That means some data gets repeated 2–3 times. The simple scaling law breaks down, and you need a correction.
Repeating data isn't as bad as it sounds, but it's not free either. A model trained on 2 epochs of 5T tokens is not equivalent to one trained on 10T unique tokens. The second epoch has diminishing returns because the model has already memorized the low-loss patterns. The correction term is empirically well-studied by Muennighoff et al. (2023):

$$L(N, D_{\text{eff}}) = E + \frac{A}{N^{\alpha}} + \frac{B}{D_{\text{eff}}^{\beta}}$$

where $D_{\text{eff}}$ is not the raw token count D but an effective token count that accounts for diminishing returns from repeated data:

$$D_{\text{eff}} = D_{\text{unique}} \cdot f\!\left( \frac{D}{D_{\text{unique}}} \right)$$

Here D is total tokens (counting repeats), $D_{\text{unique}}$ is unique tokens in the corpus, and f is a diminishing-returns function. Empirically, f(r) ≈ r^0.7 for moderate repetition (r = 1 to 4 epochs). So 4 epochs of 1T tokens is equivalent to about 4^0.7 ≈ 2.64 epochs in terms of effective tokens, a 34% discount.
Gadre et al. (DataComp-LM, 2024) propose an equivalent "penalty" view: overtraining on a corpus with r > 1 incurs an expected loss penalty ΔL that grows predictably with r and can be estimated from small-scale experiments. This lets you evaluate "is it worth sourcing more unique data vs repeating what I have?"
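To put a number on that penalty, you can compare the data term of the scaling law at the effective token count against the same term at the raw token count. The sketch below uses the r^0.7 approximation and the Chinchilla values of B and β as stand-ins for your own fit; the N-dependent term cancels in the difference.

```python
def data_term(D, B=410.7, beta=0.28):
    """Data-limited part of the loss, B / D^beta."""
    return B / D**beta

def repetition_penalty(D_unique, epochs, repeat_exp=0.7):
    """Extra loss from `epochs` passes over D_unique tokens, compared with
    training on the same total number of fresh tokens."""
    D_total = D_unique * epochs
    D_eff = D_unique * epochs**repeat_exp   # assumed r^0.7 discount
    return data_term(D_eff) - data_term(D_total)

penalties = {e: repetition_penalty(1e12, e) for e in (1, 2, 4, 8)}
for epochs, penalty in penalties.items():
    print(f"{epochs} epochs of 1T tokens: penalty = {penalty:.4f} nats")
```

At 4 epochs of 1T tokens the penalty is on the order of 0.015 nats. Whether that justifies sourcing more unique data depends on what the same loss improvement would cost in extra compute.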
```python
# Effective token count with repetition discount
def effective_tokens(D_unique, D_total, repeat_exp=0.7):
    """
    D_unique:   unique tokens in corpus
    D_total:    tokens actually trained on (may be > D_unique)
    repeat_exp: empirical discount exponent (~0.7 from Muennighoff et al.)
    Returns D_eff: effective token count for the scaling law
    """
    if D_total <= D_unique:
        return D_total              # under one epoch: every token is fresh
    r = D_total / D_unique          # repetition ratio (1.0 = one epoch)
    f = r ** repeat_exp             # diminishing-returns factor
    return D_unique * f             # effective tokens

# Example: 1T unique tokens, train for 4 epochs (4T total)
D_eff = effective_tokens(1e12, 4e12)
# D_eff = 1e12 * 4**0.7 ≈ 2.64e12, versus 4e12 raw tokens: a 34% discount.

# To decide whether it is worth sourcing more data:
# if Cost(more data) < Cost(running extra epochs), usually yes.
# At 2 epochs the discount is about 19% (2**0.7 ≈ 1.62), so a single repeat
# is cheap but not free.
```
Figure: loss vs epochs for a chosen model size N and unique token count D_unique. The teal curve shows loss under the diminishing-returns correction. The orange dashed curve shows the naive scaling law (ignoring repetition). The two diverge past 1 epoch.
You can now run a complete, principled scaling study. Here's the full recipe and the connections to what comes next.
| Challenge | Solution | Used by |
|---|---|---|
| LR changes with scale | muP parametrization (LR stays constant across width) | CerebrasGPT, MiniCPM |
| Each training run needs full decay | WSD schedule — checkpoint at stable phase, replay decay | MiniCPM, DeepSeek |
| Batch size unknown at large scale | Fit B_opt ∝ L^(-k) from small runs | MiniCPM, Kaplan 2020 |
| Fit is sensitive to outlier runs | Huber loss in log-space for fitting | Chinchilla, Hoffmann 2022 |
| Data is scarce, must repeat | D_eff = D_unique · r^0.7 correction | Muennighoff 2023, LLaMA 3 |
| Need D*/N estimate without full study | Use D = 20N as baseline; adjust based on data quality | GPT-3 era → LLaMA 3 went to 93N |
Key numbers to remember:
| Quantity | Chinchilla (2022) | LLaMA 3 (2024) | MiniCPM (2024) |
|---|---|---|---|
| D* rule | ≈ 20N | ≈ 93N | ≈ 192N |
| α (param exponent) | 0.34 | not published | fitted |
| β (data exponent) | 0.28 | not published | fitted |
| Method | All 3 | IsoFLOP (Method 2) | Method 1 + 3 |
| muP | No | No | Yes |
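The tokens-per-parameter rules in this table translate directly into an allocation. With C = 6ND and D = k·N, you get N = √(C / (6k)). The sketch below applies this to the 10^23 FLOP budget from the opening scenario; `allocate` is an illustrative name.

```python
import math

def allocate(C, tokens_per_param):
    """Solve C = 6*N*D with D = tokens_per_param * N."""
    N = math.sqrt(C / (6.0 * tokens_per_param))
    return N, tokens_per_param * N

allocations = {ratio: allocate(1e23, ratio) for ratio in (20, 93, 192)}
for ratio, (N, D) in allocations.items():
    print(f"D/N = {ratio:>3}: N = {N:.2e} params, D = {D:.2e} tokens")
```

At the same budget, moving from 20 to 192 tokens per parameter shrinks the model from about 29B to about 9B parameters. That is why the choice of ratio, and the data quality behind it, matters as much as the budget itself.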
What comes next:
"Before you spend $10M, spend $100K to find out what the $10M will buy you."
— the spirit of scaling law methodology