Scaling Laws II: Details & Methodology — Language Modeling from Scratch (CS336 L11)

Chapter 0: The $10M Question

Your team just secured funding. The compute budget is locked: 10,000 H100-days — roughly 10²³ FLOPs. You need to decide: how many parameters, and how many tokens? Get it wrong by 2× in either direction and you waste millions of dollars training a suboptimal model.

CS336 Lecture 9 gave you the answer in principle: train compute-optimally, N* and D* ∝ C^0.5, the famous D = 20N rule. But that lesson glossed over a crucial question: how do you actually know those exponents? Someone had to fit them. And the fit requires a series of small, carefully designed training runs — a scaling study. This lesson is about how to run that study correctly.

The core challenge: fitting L(N,D) requires training models across a range of compute budgets, but each of those models must be trained to its optimal point (not early-stopped). That's expensive. Done naively, fitting the scaling law costs as much as training the big model. Done cleverly — using IsoFLOP profiles, WSD learning rate schedules, and muP hyperparameter transfer — it costs 1–5% of the big run.

The goal of this lesson. Learn to run a scaling study that predicts your big model's loss (and optimal N, D, LR, batch) from cheap small runs. Every technique we cover is motivated by the same question: how do I get a reliable prediction while spending as little compute as possible?

Here's the concrete workflow we'll build toward: (1) Define a small-model training API with fixed hyperparameters. (2) Run a grid of models at several FLOPs budgets, sweeping model size N while keeping compute C fixed. (3) Fit the parametric formula L(N,D) to the observations using Huber loss. (4) Extrapolate to your target budget. (5) Read off N*, D*, and the predicted loss. That's it — and it genuinely works, as CerebrasGPT, MiniCPM, and DeepSeek demonstrated.

The lesson also covers a subtler problem: what if you can't keep buying fresh data? When unique tokens run out and you must repeat your corpus, the effective token count degrades — and the scaling law needs a correction term. We'll derive that too.

[Interactive figure: Study budget: how much does fitting cost vs training? The study fits L(N,D) from small runs so the big run is spent wisely. Controls: target budget (log₁₀ C, FLOPs) and study fraction (%).]
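To make the study-fraction tradeoff concrete, here is a minimal sketch of the arithmetic. It rests on illustrative assumptions that are not from the lecture: every IsoFLOP budget gets the same number of runs, a run at budget C_i costs exactly C_i FLOPs, and budgets are spaced by a constant factor. The helper name `isoflop_budgets` and its parameters are hypothetical.

```python
def isoflop_budgets(target_flops, study_fraction, n_budgets=4,
                    runs_per_budget=6, spacing=10.0):
    """Return IsoFLOP budgets (largest first) whose sweeps fit the study allowance."""
    allowance = target_flops * study_fraction
    # Budgets form a geometric series: largest, largest/spacing, ...
    geometric_sum = sum(spacing ** -j for j in range(n_budgets))
    largest = allowance / (runs_per_budget * geometric_sum)
    return [largest * spacing ** -j for j in range(n_budgets)]

budgets = isoflop_budgets(1e23, 0.03)      # 3% of a 1e23-FLOP target
total = 6 * sum(budgets)                   # 6 runs per budget (the default)
print([f"{b:.2e}" for b in budgets], f"total={total:.2e}")
```

Under these assumptions a 3% allowance on a 10²³ target puts the largest study budget near 4.5 × 10²⁰ FLOPs, a bit more than two orders of magnitude below the target.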

Why can't you simply train one model at each scale and use early-stopped checkpoints to estimate the scaling law?

(a) Early stopping always underestimates the final loss, biasing the fit.
(b) The Chinchilla formula only applies when D ≥ 20N, which early stops violate.
(c) The loss at an early checkpoint reflects a suboptimal data-to-parameter ratio, not the model's true capacity, so it doesn't lie on the L(N,D) surface you're trying to fit.
(d) Learning rate warmup hasn't finished, so the early loss is artificially high.

Chapter 1: Fitting L(N,D) — The Parametric Form

Recall from Lecture 9 the joint scaling law proposed by Chinchilla (Hoffmann et al., 2022):

L(N, D) = E + A · N^−α + B · D^−β

Here E is the irreducible entropy of language (the loss a perfect model still achieves, roughly 1.69 nats for English text), A and B are constants that depend on the data distribution, and α, β are the scaling exponents. The Chinchilla paper reports α = 0.34, β = 0.28, A = 406.4, B = 410.7, E = 1.69.

But where do these five numbers come from? You observe a set of (N, D, L) triples from your training runs. You want to find the (E, A, B, α, β) that best explains those observations. This is a nonlinear least-squares fitting problem.

Why not ordinary least squares? Ordinary least squares minimizes the sum of squared residuals. But scaling-law training runs produce outliers — runs that diverged, underflowed, or hit numerical issues. A single bad run can drag the fit wildly. The solution is Huber loss, which is quadratic for small residuals but linear for large ones, making the fit robust to outliers.

The Huber loss with threshold δ is:

H_δ(r) = r²/2 if |r| ≤ δ else δ(|r| − δ/2)

To fit L(N,D), you minimize:

∑_i H_δ( log L̂(N_i,D_i) − log L_i )

Note the log-space residual: you fit log L, not L. This is because the power-law formula spans orders of magnitude — a raw residual on L would be dominated by high-loss (small-compute) points. In log space, all scales contribute equally.
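A small numeric sketch of why Huber resists outliers: what matters for the fit is the gradient of the loss with respect to each residual, and Huber caps that at δ. The residual values below are made up for illustration, and `huber_grad` is a helper name introduced here.

```python
def huber(r, delta):
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def huber_grad(r, delta):
    """Derivative of huber: the residual itself, clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))

delta = 1e-3
typical, outlier = 5e-4, 0.5     # 0.5 mimics a diverged run (log-space residual)

for name, r in [("typical", typical), ("outlier", outlier)]:
    # Under squared loss the pull on the fit equals r; under Huber it is clipped.
    print(f"{name}: squared-loss pull={r:.4f}  huber pull={huber_grad(r, delta):.4f}")
```

With squared loss the diverged run pulls on the parameters 1000 times harder than the typical run; with Huber its pull is capped at δ, only twice the typical one.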

In practice, you use scipy.optimize.minimize with L-BFGS-B and bounds to prevent negative parameters. Here's the full fitting code:

python — fit scaling law with Huber loss
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    # Huber loss in log-space. delta controls outlier threshold.
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def loss_pred(params, N, D):
    E, A, B, alpha, beta = params
    return E + A * N**(-alpha) + B * D**(-beta)

def objective(params, N_arr, D_arr, L_arr):
    # Fit in log-space: minimize sum Huber(log_pred - log_obs)
    L_hat = loss_pred(params, N_arr, D_arr)
    resid = np.log(L_hat) - np.log(L_arr)
    return np.sum(huber(resid))

# Bounds: all positive, alpha/beta in (0,1), E in (0, 3)
bounds = [(0.01, 3.0),   # E
          (1.0, 1e5),    # A
          (1.0, 1e5),    # B
          (0.05, 0.99),  # alpha
          (0.05, 0.99)]  # beta

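x0 = [1.69, 406.4, 410.7, 0.34, 0.28]  # Chinchilla as warm start

# N_arr, D_arr, L_arr hold one (N, D, L) triple per finished run.
# Synthetic placeholder data so the script runs end to end; replace with real runs.
rng = np.random.default_rng(0)
N_arr = np.repeat([1e8, 3e8, 1e9, 3e9], 4)
D_arr = np.tile([2e9, 6e9, 2e10, 6e10], 4)
L_arr = loss_pred([1.8, 350.0, 450.0, 0.33, 0.29], N_arr, D_arr)
L_arr = L_arr * np.exp(rng.normal(0.0, 0.005, size=L_arr.shape))

result = minimize(objective, x0, args=(N_arr, D_arr, L_arr),
                  method='L-BFGS-B', bounds=bounds)
if not result.success:
    print("fit did not converge:", result.message)
E, A, B, alpha, beta = result.x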

How many runs do you need? The Chinchilla paper used over 400 training runs, with model sizes from 70M to over 16B parameters. In practice, MiniCPM and DeepSeek used 10–30 runs. Five free parameters need at minimum 5 runs to fit, but 20–30 gives a much more stable fit. The key is spanning at least 2 orders of magnitude in N and D.

Why do we fit log L instead of L directly when minimizing the Huber objective?

(a) Because the Huber loss only converges when inputs are in (0,1).
(b) To make the fit linear so we can use ordinary least squares.
(c) Loss spans orders of magnitude across model sizes; fitting in log-space weights small-scale and large-scale observations equally and matches the power-law structure of the formula.
(d) Because L(N,D) can be negative before the fit converges.

Chapter 2: IsoFLOP Profiling

Before you can fit L(N,D), you need the data: many (N, D, L) triples, each from a fully trained model. Here "fully trained" means the run completes its learning-rate schedule for its own token budget D, rather than being cut off partway. Done naively with a cosine schedule, every (N, D) pair is a separate from-scratch run, so covering n model sizes at n token budgets takes n² runs.

The key trick is IsoFLOP profiling (Chinchilla Method 2). Instead of training models for an unknown duration, you pick a target compute budget C and train every model in your sweep for exactly D = C/(6N) tokens. This ensures that every point in your sweep lies on the same IsoFLOP surface — the constraint C = 6ND is satisfied by construction.

IsoFLOP = constant compute per experiment. You choose 3–5 FLOPs budgets (e.g., C = 10¹⁹, 10²⁰, 10²¹). For each budget, you train 5–10 models of different sizes N — each for exactly C/(6N) tokens. You find the minimum-loss N on each budget curve. Connect the minima: that's your optimal-N trajectory.

Here's how to extract the optimal N from an IsoFLOP slice. At fixed C, train models with N in {10M, 30M, 100M, 300M, 1B} each for D = C/(6N) tokens. Plot loss vs log(N). The curve is U-shaped: too small an N and the model lacks capacity; too large an N and D is tiny and the model is data-starved. The minimum is N*(C).

Doing this for 3–5 budgets gives 3–5 (C, N*) pairs. Fit a line to log N* versus log C: the slope is the scaling exponent (Chinchilla theory predicts about 0.5) and the intercept gives the constant of proportionality. That plus a similar fit for D* gives your full scaling recipe.
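The log-log line fit can be sketched in a few lines of plain Python. The (C, N*) pairs below are synthetic, generated from an assumed N* = 0.09 · C^0.5 so the recovered exponent can be checked; in a real study they would be the minima read off your IsoFLOP curves. `fit_power_law` is a name introduced for this sketch.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space. Returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    a = sxy / sxx                      # slope = scaling exponent
    k = math.exp(my - a * mx)          # intercept = constant of proportionality
    return a, k

# Synthetic stand-ins for the minima of three IsoFLOP curves.
budgets = [1e19, 1e20, 1e21]
n_star = [0.09 * c ** 0.5 for c in budgets]

a, k = fit_power_law(budgets, n_star)
n_target = k * 1e23 ** a               # extrapolate two orders of magnitude
print(f"exponent a={a:.3f}, N*(1e23)={n_target:.3e}")
```

With only three to five points the exponent is weakly constrained, so it is worth checking how much the extrapolated N* moves when one point is dropped.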

python — IsoFLOP sweep: train and find optimal N
import numpy as np

def isoflop_sweep(C_budget, N_sizes, train_fn):
    """
    C_budget: total FLOPs (e.g. 1e20)
    N_sizes:  list of model sizes to try
    train_fn: function(N, D) -> final_loss
    Returns: (N_opt, loss_opt, all_results)
    """
    results = []
    for N in N_sizes:
        D = C_budget / (6 * N)   # Tokens so FLOPs = 6ND = C_budget
        if D < 1e6:               # Skip if too few tokens (unstable)
            continue
        L = train_fn(N, D)
        results.append((N, D, L))

    # Find minimum-loss N
    best = min(results, key=lambda x: x[2])
    return best[0], best[2], results

# Example with simulated loss using Chinchilla constants
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
def sim_loss(N, D):
    return E + A*N**(-alpha) + B*D**(-beta)

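N_grid = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9, 1e10]  # wide enough to bracket the minimum
N_opt, L_opt, pts = isoflop_sweep(1e21, N_grid, sim_loss)
# N_opt = 3e9 on this grid (loss ≈ 2.336) for C = 1e21. The continuous
# optimum for these constants is N* ≈ 1.8e9, between the 1e9 and 3e9 points.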
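A grid sweep only returns the best size that was actually tried. One common refinement, sketched here under the same simulated Chinchilla loss, is to fit a parabola in log₁₀ N through the grid minimum and its two neighbours and take the vertex as N*. The helper `parabola_vertex` is a name introduced for this sketch.

```python
import math

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28   # Chinchilla constants

def sim_loss(N, D):
    return E + A * N ** -ALPHA + B * D ** -BETA

def parabola_vertex(p0, p1, p2):
    """x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return x1 - 0.5 * num / den

C = 1e21
grid = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9, 1e10]
pts = [(math.log10(N), sim_loss(N, C / (6 * N))) for N in grid]

i = min(range(len(pts)), key=lambda j: pts[j][1])   # index of the grid minimum
assert 0 < i < len(pts) - 1, "minimum sits on the grid edge; widen the sweep"
n_star = 10 ** parabola_vertex(pts[i - 1], pts[i], pts[i + 1])
print(f"grid minimum N={grid[i]:.0e}, interpolated N*={n_star:.2e}")
```

The edge check matters in practice: if the lowest loss is at the smallest or largest size tried, the sweep has not bracketed the minimum and the interpolation is meaningless.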

[Interactive figure: IsoFLOP curves: find the loss minimum at each compute budget. Each curve holds total FLOPs C = 6ND constant; the minimum of each curve gives the optimal N*(C), and it shifts right as C grows. Controls: highlighted budget (log₁₀ C), exponent α, exponent β.]

In an IsoFLOP sweep at C = 10²¹ FLOPs, you try N = 1B parameters. How many tokens D does that model train on?

(a) D = 10²¹ tokens (the full budget).
(b) D = 20 × 1B = 20B tokens (the D=20N rule).
(c) D = 10²¹ / (6 × 10⁹) ≈ 167B tokens.
(d) D = √(10²¹) ≈ 3.2 × 10¹⁰ tokens.

Chapter 3: Three Chinchilla Methods

The original Chinchilla paper actually proposed three different methods to estimate compute-optimal allocation. They give similar answers — but each has different cost and reliability tradeoffs. Understanding all three is essential if you're designing your own scaling study.

Method	What you do	What you fit	Cost
Method 1: Envelope	Train several model sizes over several token horizons; at each compute level C, take the run with the lowest loss	Plot log C vs log N* (and log D*); fit N* ∝ C^a, D* ∝ C^b	High (all combos)
Method 2: IsoFLOP	For each compute budget C, sweep N and set D=C/(6N); find optimal N*(C)	Plot log C vs log N; fit N ∝ C^a	Medium
Method 3: Parametric fit	Train a grid of (N,D) pairs fully; fit L(N,D)=E+A/N^α+B/D^β	Differentiate analytically to get N(C), D(C)	Low (once fitted)

Method 3 is the most powerful. Once you've fitted the five parameters (E, A, B, α, β), you can compute optimal N and D for any future compute budget analytically — no new training runs needed. Methods 1 and 2 only give you the optimal point at the budgets you ran. But Method 3 requires that the power-law form actually fits your data well, which should always be verified.
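Method 3's analytic step can be sketched directly. Minimizing L(N, D) subject to C = 6ND gives N* = G · (C/6)^(β/(α+β)) and D* = (C/6)/N*, with G = (αA/(βB))^(1/(α+β)). The code below plugs in the constants quoted in Chapter 1; with your own fit you would substitute your five parameters.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28   # values quoted in Chapter 1

def loss(N, D):
    return E + A * N ** -ALPHA + B * D ** -BETA

def compute_optimal(C):
    """Minimize L(N, D) subject to C = 6*N*D. Returns (N_opt, D_opt)."""
    G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    N_opt = G * (C / 6.0) ** (BETA / (ALPHA + BETA))
    D_opt = (C / 6.0) / N_opt
    return N_opt, D_opt

for C in (1e21, 1e23):
    N_opt, D_opt = compute_optimal(C)
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, "
          f"D*/N*={D_opt / N_opt:.0f}, L*={loss(N_opt, D_opt):.3f}")
```

Note that with these particular constants the exponents are β/(α+β) ≈ 0.45 for N* and α/(α+β) ≈ 0.55 for D*, close to but not exactly the 0.5 quoted as a rule of thumb, so the implied D*/N* ratio drifts upward with compute.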

Method 1 (the lower envelope): Train each model size over several token horizons and record loss against compute. The "lower envelope" means: at each compute level C, look across all training runs and take the one with the lowest loss; its size and token count are N*(C) and D*(C). You're asking "given this much compute, which (N, D) pair got the most out of it?" Fit lines on log-log axes: log N* = a log C + const and log D* = b log C + const. If a ≈ b ≈ 0.5 you get D* ∝ N (the 20N rule). CerebrasGPT and MiniCPM used Method 1 and found D*/N ratios from 20× to 192×, depending on data quality.

Method 2 (IsoFLOP minima): Covered in Chapter 2. DeepSeek used this approach. The benefit: each point in your grid sits at a well-defined compute budget, making the fit cleaner. The cost: you need a complete sweep at every budget, not just one.

Method 3 (the joint fit): The most principled. Fit all five parameters simultaneously from all your (N, D, L) observations. MiniCPM used this as their "primary" method. The result: they found D*/N ratios of 192, far above both Chinchilla's ≈ 20 and GPT-3's 1.7, suggesting that with better data (The Pile, RedPajama, high-quality filtered web), optimal allocation favors much more data per parameter than the original Chinchilla estimated.

Why do different methods give different answers? They're measuring slightly different things. Method 1 finds the best-performing run at each compute level along the observed training curves. Method 2 finds the N that minimizes loss for a fixed C. Method 3 finds the optimal (N, D) pair analytically from the fitted surface. When the power law is a perfect fit, all three agree. When it isn't (when there's a "kink" in the scaling at some scale, or when data quality changes), they diverge. This is a calibration check: if your three methods give wildly different D*/N ratios, your data or fitting has a problem.

MiniCPM's Method 3 fit gives D*/N = 192. The original Chinchilla gives D*/N ≈ 20. What is the most likely explanation for this 10× difference?

(a) Chinchilla's math was wrong; the correct ratio was always 192.
(b) Higher-quality or larger-scale data gives a different optimal balance: better data per token means more tokens are worth training on before the model saturates its capacity.
(c) MiniCPM's models were too small to fit the 20N rule accurately.
(d) The WSD learning rate schedule artificially inflates the optimal D.

Chapter 4: Running a Scaling Study

The theory is clean. Running the actual study is full of gotchas. This chapter covers the practical workflow used by CerebrasGPT, MiniCPM, and DeepSeek — three papers that published enough detail to replicate.

Step 1 — define a training API. Before sweeping model sizes, fix every hyperparameter that isn't being studied: optimizer (AdamW), β₁ = 0.9, β₂ = 0.95, weight decay = 0.1, gradient clipping = 1.0, learning rate schedule, vocab size, tokenizer. This ensures that observed differences in loss come from N and D, not confounders.

Step 2: set model aspect ratios. At each size N, you need to pick depth d and width w. The scaling law treats models as a single number N, but real models have shapes. Most labs fix the aspect ratio (d/w) and scale both together. Since a standard Transformer has N ≈ 12 · d · w² non-embedding parameters, a fixed ratio means width ∝ N^(1/3) and depth ∝ N^(1/3). CerebrasGPT used a standard "square" ratio. MiniCPM explicitly fixed the ratio and only varied overall scale.
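A minimal sketch of turning a target parameter count into a shape. It assumes the usual non-embedding count of 12 · width² parameters per block (4w² for attention plus 8w² for a 4× MLP) and expresses the fixed shape as width per layer, the inverse of the d/w ratio above. The default of 128 and the rounding to multiples of 64 are illustrative choices, not values from any of the cited papers.

```python
def transformer_shape(n_params, width_per_layer=128, multiple=64):
    """Pick (depth, width) with width/depth ~ width_per_layer and
    12 * depth * width**2 ~ n_params. Returns (depth, width, actual_params)."""
    # Substituting depth = width / width_per_layer into N = 12 * depth * width^2
    # gives width^3 = N * width_per_layer / 12.
    width = (n_params * width_per_layer / 12.0) ** (1.0 / 3.0)
    width = max(multiple, multiple * round(width / multiple))  # hardware-friendly
    depth = max(1, round(width / width_per_layer))
    return depth, width, 12 * depth * width ** 2

for target in (1e8, 1e9, 8e9):
    depth, width, actual = transformer_shape(target)
    print(f"target={target:.0e}: depth={depth}, width={width}, params={actual:.2e}")
```

Because of rounding, the realized parameter count differs from the target by a few percent; for the scaling fit, record the actual N rather than the nominal one.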

Stability across scales. A nasty practical problem: the optimal learning rate for a 30M model may be very different from the optimal LR for a 7B model. If you scale up using the wrong LR, your big model trains at a non-optimal point, making extrapolation unreliable. This is exactly the problem muP (Chapter 6) solves.

Step 3: choose the learning rate schedule carefully. Standard cosine decay requires the final number of training tokens to be set at the start. But in a scaling study, you want to explore different D values from one training run. The WSD (Warmup-Stable-Decay) schedule, introduced by MiniCPM, solves this; DeepSeek LLM uses a closely related multi-step schedule with the same long constant phase. The LR warms up, stays flat for a long stable phase, then decays in a short final phase (≈10% of total tokens).

WSD key property. Because the LR is constant throughout the stable phase, you can save checkpoints along it and branch a short decay from each one. Every branch ends at a different total token count, so one training run produces multiple (D, L) data points, dramatically reducing scaling study cost. With cosine, you'd need to retrain from scratch for each D.

Here's the WSD schedule in code:

python — WSD learning rate schedule
import math

def wsd_lr(step, total_steps, lr_max, warmup_frac=0.01, decay_frac=0.10):
    """
    Warmup-Stable-Decay: warmup → flat stable → cosine decay.
    Key: decay phase is only ~10% of total, saving compute.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start  = int(total_steps * (1.0 - decay_frac))

    if step < warmup_steps:
        return lr_max * step / warmup_steps          # linear warmup
    elif step < decay_start:
        return lr_max                                  # stable phase
    else:
        # Cosine decay from decay_start to total_steps
        progress = (step - decay_start) / (total_steps - decay_start)
        return lr_max * (1 + math.cos(math.pi * progress)) / 2

# Checkpoint at end of stable phase → resume with different decay windows
# Each resume is a cheap extra data point for the scaling law fit
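To see what checkpoint-and-branch buys, here is a rough token-cost comparison for one model size. It assumes each target token count D needs its own from-scratch run under cosine, while under WSD all targets share one warmup-plus-stable trunk and each adds only a decay branch of decay_frac · D tokens. The target values are hypothetical.

```python
def cosine_cost(token_counts):
    """Cosine needs a separate from-scratch run for every target token count."""
    return sum(token_counts)

def wsd_cost(token_counts, decay_frac=0.10):
    """One shared warmup+stable trunk, plus a short decay branch per target."""
    trunk = (1.0 - decay_frac) * max(token_counts)   # stable phase of the longest run
    branches = sum(decay_frac * d for d in token_counts)
    return trunk + branches

targets = [20e9, 40e9, 80e9, 160e9]      # hypothetical D values for one model size
print(f"cosine: {cosine_cost(targets):.3e} tokens")
print(f"WSD:    {wsd_cost(targets):.3e} tokens")
```

For these four targets the cosine route trains on 300B tokens and the WSD route on 174B; the saving grows as more targets are packed along the same trunk.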

Step 4 — validate the fit. Once fitted, test whether the law predicts held-out models. Train a model at a compute budget you did NOT include in the fit. Compare predicted vs actual loss. If the error is <1%, you can trust the extrapolation. If it's >5%, something is wrong — either the power-law form breaks at that scale, or your small-scale runs had numerical issues.
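The held-out check reduces to a relative error and two thresholds. In this sketch the "borderline" bucket for errors between 1% and 5% is an assumption, since the rule of thumb above only covers the two extremes, and the example losses are hypothetical.

```python
def validate_prediction(predicted_loss, actual_loss, ok=0.01, bad=0.05):
    """Relative error of a held-out prediction, bucketed by the thresholds above."""
    rel_err = abs(predicted_loss - actual_loss) / actual_loss
    if rel_err < ok:
        verdict = "trust"
    elif rel_err > bad:
        verdict = "investigate"
    else:
        verdict = "borderline"     # assumption: no rule is given for 1-5%
    return rel_err, verdict

rel_err, verdict = validate_prediction(2.336, 2.350)   # hypothetical held-out run
print(f"relative error={rel_err:.2%}, verdict={verdict}")
```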

Why does the WSD learning rate schedule make scaling studies cheaper compared to standard cosine decay?

(a) WSD converges faster, so total training steps are fewer.
(b) WSD eliminates the need for warmup, saving early-step compute.
(c) One WSD training run produces multiple usable (D, L) data points by checkpointing at the stable phase and replaying different decay windows; each replay is a new point at very low additional cost.
(d) WSD uses a smaller batch size during the stable phase, reducing total memory consumption.

Chapter 5: LR & Batch Size Scaling

Suppose you've fitted your scaling law from models up to 1B parameters, and you want to train a 7B model. You know the optimal N and D. But what learning rate and batch size should you use? The scaling law for L(N,D) says nothing about hyperparameters. You need separate rules for those.

Batch size scaling. There's a well-known empirical rule (from Kaplan 2020 and later MiniCPM): the optimal batch size grows as loss decreases. Specifically:

B_opt ≈ B₀ · L^−k

where B₀ and k are fitted from small-scale runs. The intuition: early in training, the gradient signal is large compared with its sample-to-sample noise, so even a small batch points in a useful direction. As loss decreases, the true gradient shrinks while the noise does not, so more samples are needed to average the noise out and larger batches keep paying off. So optimal batch size is not a constant; it's a function of where you are in training.

Batch size scales with compute budget, not model size. MiniCPM found B_opt ∝ L⁻⁴ approximately. Since L decreases with more compute, larger runs use larger batches. This is why GPT-3 used a batch of 3.2M tokens and small models use 256K — it's not arbitrary, it's optimal.
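Fitting B₀ and k is another straight-line fit in log space. The (loss, batch size) pairs below are hypothetical, generated from B = 2 × 10⁸ · L⁻⁴ so that the recovered exponent can be checked; real pairs would come from batch-size sweeps at several small scales.

```python
import math

# Hypothetical (loss, best batch size in tokens) pairs from small-scale sweeps.
losses = [3.6, 3.2, 2.9, 2.6]
batches = [2e8 * L ** -4 for L in losses]

# Straight-line fit of log B against log L: slope = -k, intercept = log B0.
x = [math.log(L) for L in losses]
y = [math.log(b) for b in batches]
mx, my = sum(x) / len(x), sum(y) / len(y)
slope = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
k, B0 = -slope, math.exp(my - slope * mx)

target_loss = 2.2          # loss predicted for the big run by the L(N, D) fit
B_target = B0 * target_loss ** -k
print(f"k={k:.2f}, B0={B0:.2e}, B_opt(L={target_loss})={B_target:.2e} tokens")
```

The target loss fed into this rule is itself a prediction from the L(N, D) fit, so errors compound; with an exponent near 4, a 2% error in predicted loss shifts the batch size by roughly 8%.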

Learning rate scaling — the naive approach. DeepSeek's strategy: assume LR is roughly scale-invariant (because most Transformer hyperparameters are), run a small-scale sweep to find optimal LR, and transfer that LR directly to the big model. In practice, DeepSeek found LR needs only a mild adjustment with scale: the optimal LR decreases very slowly as N grows, roughly LR ∝ N^−0.1.

Learning rate scaling: the principled approach (muP). Standard parameterization (SP) sets weight init ∝ 1/√n and LR = const. But as width n grows, the effective learning signal to each neuron changes, making the optimal LR a function of width. The maximum update parameterization (muP) corrects this so that the optimal LR stays constant across width. Chapter 6 derives this carefully.

[Interactive figure: LR vs width: standard parameterization vs muP. Under standard (SP) parameterization, optimal LR drifts down with width; under muP, it stays flat. Controls: SP LR exponent (about −0.5 for SGD, about 0 for Adam), minimum width (log₁₀ n).]

Why does optimal batch size grow as training loss decreases during a long run?

(a) GPU memory becomes available as the model compresses, allowing larger batches.
(b) Gradient clipping is less active at low loss, so larger batches don't diverge.
(c) At low loss, the true gradient is small relative to per-example gradient noise, so more samples per step are needed to get an accurate gradient direction, and larger batches keep paying off.
(d) Learning rate warmup has completed by then, unlocking higher batch sizes.

Chapter 6: muP: Scale-Invariant Hyperparameter Tuning

The dream: tune your hyperparameters on a tiny 10M parameter "proxy model," then transfer them directly to a 70B model — no re-tuning, no expensive sweeps at large scale. That's the promise of the Maximum Update Parametrization (muP).

The key insight behind muP is that standard neural network parameterizations cause the magnitude of activations and updates to change as you increase width. When activations blow up or shrink with width, the optimal learning rate must change to compensate. muP corrects the parameterization so that both activations and gradient updates remain O(1) as width n → ∞, making the optimal LR width-independent.

Two conditions for scale invariance. muP requires: (A1) activations at initialization should remain Θ(1) — the network starts in a well-conditioned regime regardless of width. (A2) after one gradient step, the change in activations should also be Θ(1) — updates neither explode nor vanish. Standard parameterization satisfies A1 but NOT A2 for large widths.

Deriving the init rule (A1). Consider a deep linear network h_l = W_l h_{l-1}. If the entries of W_l are i.i.d. N(0, σ²), then by random matrix theory the operator norm ||W_l||* → σ(√n_{l-1} + √n_l). To keep ||h_l|| = Θ(√n_l) when ||h_{l-1}|| = Θ(√n_{l-1}), the operator norm must be Θ(√(n_l/n_{l-1})), so we need:

σ = Θ( (1/√n_{l-1}) · min( 1, √(n_l/n_{l-1}) ) )

For square layers (n_l = n_{l-1} = n), this simplifies to σ = Θ(1/√n), the same width scaling as the standard He init. So standard init satisfies A1. ✓

Deriving the LR rule (A2). The gradient update to W_l (for SGD) is:

ΔW_l = −η_l ∇_{h_l}ℓ · h_l-1^T

This is a rank-1 outer product, so its operator norm is ||ΔW_l||* = η_l · ||∇_{h_l}ℓ|| · ||h_{l-1}||. Here ||h_{l-1}|| = Θ(√n_{l-1}) by A1, and ||∇_{h_l}ℓ|| = Θ(1/√n_l) because a Θ(√n_l) change in h_l should move the loss by Θ(1). For the activation update Δh_l = ΔW_l · h_{l-1} to be Θ(√n_l), we need:

||ΔW_l||* · ||h_{l-1}|| = η_l · n_{l-1}/√n_l = Θ(√n_l) ⇒ η_l = Θ( n_l / n_{l-1} )

For a square layer with n_l = n_{l-1} = n, this gives η_l = Θ(1): the optimal SGD learning rate is constant with width, which is what standard parameterization already uses. But for layers whose fan-in and fan-out scale differently (like the embedding and unembedding layers, where the vocabulary dimension stays fixed while d_model grows), the ratio n_l/n_{l-1} changes with scale, so the LR must be adjusted per-layer.

Adam muP rule. With Adam, the update magnitude is O(1) per coordinate regardless of gradient magnitude (due to the 1/√v normalization). The muP rule for Adam simplifies: set LR ∝ 1/n_l-1 (the fan-in), not n_l/n_l-1. In practice for a standard Transformer with equal width d throughout: LR_muP ∝ 1/d, vs LR_SP = const. This is the one significant difference.
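The two per-layer rules can be written as LR multipliers relative to a small proxy model. This is a sketch of the scaling only: `base_fan_in` (the proxy's width) and both function names are introduced here for illustration, and a real implementation would apply these multipliers through the optimizer's per-parameter settings.

```python
def sgd_lr_mult(fan_in, fan_out):
    """SGD rule from the derivation above: eta_l scales as fan_out / fan_in."""
    return fan_out / fan_in

def adam_lr_mult(fan_in, base_fan_in):
    """Adam rule: eta_l scales as 1 / fan_in, normalized to 1.0 at the proxy width."""
    return base_fan_in / fan_in

base_width = 256                      # hypothetical proxy-model width
for width in (256, 1024, 4096):
    print(f"width={width}: SGD x{sgd_lr_mult(width, width):.2f}, "
          f"Adam x{adam_lr_mult(width, base_width):.4f}")
```

So an Adam learning rate of 0.01 tuned on hidden layers at width 256 would become 0.01 × 256/4096 = 6.25 × 10⁻⁴ for the same layers at width 4096, while the base value you report and transfer stays 0.01.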

What does muP actually change in the code?

python — muP vs Standard Parametrization differences
import torch.nn as nn

# Standard Parametrization (SP)
# Embedding: init std = 1.0  (or 0.02 in GPT-2)
# All weight matrices: init std = 1/sqrt(fan_in)
# Learning rate: uniform η for all layers

# muP changes (Adam version):
# 1. Hidden weight matrices: init std stays 1/sqrt(fan_in), as derived for A1
# 2. Per-layer LR: scale by 1/width for hidden layers
# 3. Output logit scaling: multiply by 1/d (prevents logit explosion)

class MuPLinear(nn.Linear):
    def __init__(self, fan_in, fan_out, width_mult=1.0):
        # width_mult = width / base_width of the small proxy model
        super().__init__(fan_in, fan_out)
        self.width_mult = width_mult
        nn.init.normal_(self.weight, std=fan_in ** -0.5)  # Θ(1/sqrt(fan_in)) init

    def lr_scale(self):
        # This per-layer scale multiplies the global LR
        return 1.0 / self.width_mult   # LR shrinks with width

# CerebrasGPT values: scale_emb=10, lr=6e-3, init_base=0.08
# MiniCPM values:     scale_emb=12, lr=0.01, init_std=0.1
# Key: these are stable across 9M → 13B parameter range (CerebrasGPT)

What muP is NOT robust to. The derivation assumes a simple deep linear network with standard gradient descent. Real LLMs deviate. Some deviations are harmless: SwiGLU and squared ReLU activations (they have the same optimal LR) and variable batch sizes (muP doesn't model batch). Two are not: RMSNorm with learnable gains (breaks muP; gains must be removed or fixed at 1) and strong weight decay ≥ 0.1 (the most significant failure in practice). If you use weight decay = 0.1, the value fixed in the Chapter 4 training API, muP's LR transfer degrades noticeably, so verify transfer empirically before relying on it.

Under muP, what happens to the optimal base (global) learning rate as you double the model width?

(a) It doubles (width-proportional scaling).
(b) It stays roughly constant; that is the entire point of muP.
(c) It halves (1/width scaling, as in the Adam rule).
(d) It follows a square-root rule: LR ∝ 1/√width.

Chapter 7: Showcase: IsoFLOP Explorer

This is the payoff. You're running a real scaling study. Below you can configure the study parameters — number of IsoFLOP budgets, model sizes to sweep, and the true underlying scaling law — and watch the entire fitting + prediction pipeline execute live. The tool fits L(N,D) from the simulated training runs and predicts the optimal N and D for any target budget.

What to explore. (1) Add noise to see how stable the fit is. (2) Reduce the number of budgets from 5 to 2 — does the extrapolation still work? (3) Change α and β and watch the optimal-N trajectory shift. (4) Set a large target budget and see how far the extrapolation reaches beyond the observed range.

[Interactive figure: IsoFLOP fitting pipeline: scatter of runs → parametric fit → extrapolation. Simulated training runs (dots) at several IsoFLOP budgets; fitted parabolas show L(N) at each budget; the orange envelope traces the optimal N*(C) across budgets; the prediction box shows the fitted parameters and forecasted loss at the target budget. Controls: target budget (log₁₀ C), observation noise σ, number of budgets in the study (2–5).]

Chapter 8: Data-Constrained Scaling

The Chinchilla formula assumes you have unlimited unique tokens. But what if you don't? LLaMA 3 trained on 15T tokens — and the high-quality filtered web is perhaps 5T unique tokens. That means some data gets repeated 2–3 times. The simple scaling law breaks down, and you need a correction.

Repeating data isn't as bad as it sounds, but it's not free either. A model trained on 2 epochs of 5T tokens is not equivalent to one trained on 10T unique tokens. The second epoch has diminishing returns because the model has already memorized the low-loss patterns. The correction term is empirically well-studied by Muennighoff et al. (2023):

L(N, D_eff) = E + A · N^−α + B · D_eff^−β

where D_eff is not the raw token count D but an effective token count that accounts for diminishing returns from repeated data:

D_eff = D_unique · f( D / D_unique )

Here D is total tokens (counting repeats), D_unique is unique tokens in the corpus, and f is a diminishing-returns function. Empirically, f(r) ≈ r^0.7 for moderate repetition (r = 1 to 4 epochs). So 4 epochs of 1T tokens is equivalent to about 4^0.7 ≈ 2.64 epochs in terms of effective tokens — a 34% discount.

The crossover point. For very small repetition (r < 1.5), D_eff ≈ D: at r = 1.5 the formula gives about 1.33 effective epochs, an 11% discount. For r ≥ 4, the discount grows: 4 epochs ≈ 2.64 effective, 8 epochs ≈ 4.29 effective. After about 4 epochs the marginal gain per extra repeat is very low. The practical takeaway: 1–2 epochs of high-quality data is nearly always better than 4+ epochs of lower-quality data.

Gadre et al. (DataComp-LM, 2024) propose an equivalent "penalty" view: overtraining on a corpus with r > 1 incurs an expected loss penalty ΔL that grows predictably with r and can be estimated from small-scale experiments. This lets you evaluate "is it worth sourcing more unique data vs repeating what I have?"

python — effective token count with repetition discount
import numpy as np

def effective_tokens(D_unique, D_total, repeat_exp=0.7):
    """
    D_unique: unique tokens in corpus
    D_total:  tokens actually trained on (may be > D_unique)
    repeat_exp: empirical discount exponent (~0.7 from Muennighoff et al)
    Returns D_eff: effective token count for scaling law
    """
    r = D_total / D_unique          # repetition ratio (1.0 = one epoch)
    f = r ** repeat_exp             # diminishing returns factor
    return D_unique * f             # effective tokens

# Example: 1T unique tokens, train for 4 epochs (4T total)
D_eff = effective_tokens(1e12, 4e12)
# D_eff = 1e12 * 4^0.7 = 1e12 * 2.64 = 2.64e12
# vs naive: 4T actual tokens. Effective discount: 34%

# To decide: is it worth sourcing more data?
# Cost(more data) < Cost(running extra epochs) → usually yes
# But: at r = 2 the discount is still modest (2^0.7 ≈ 1.62 effective epochs, 19% off)
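To connect the discount back to loss, the sketch below plugs D_eff into L(N, D) using the Chinchilla constants from Chapter 1 and the r^0.7 simplification from this chapter, and also inverts the discount to ask how many epochs are needed to match a given amount of fresh data. `epochs_to_match` is a helper introduced here, and the model size and token counts are illustrative.

```python
def effective_tokens(d_unique, d_total, repeat_exp=0.7):
    return d_unique * (d_total / d_unique) ** repeat_exp

def epochs_to_match(d_unique, d_fresh_equiv, repeat_exp=0.7):
    """Epochs over d_unique needed for D_eff to equal d_fresh_equiv fresh tokens."""
    return (d_fresh_equiv / d_unique) ** (1.0 / repeat_exp)

def loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A * N ** -alpha + B * D ** -beta

N = 1e9
fresh = loss(N, 4e12)                                  # 4T unique tokens
repeated = loss(N, effective_tokens(1e12, 4e12))       # 1T tokens, 4 epochs
print(f"fresh={fresh:.4f}, repeated={repeated:.4f}")
print(f"epochs to match 4T fresh: {epochs_to_match(1e12, 4e12):.2f}")
```

Under this discount, matching 4T fresh tokens from a 1T corpus takes a little over 7 epochs rather than 4, which is the compute cost to weigh against sourcing more data.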

[Interactive figure: Data-constrained scaling: loss vs training epochs. The teal curve shows loss vs epochs under the diminishing-returns correction; the orange dashed curve shows the naive scaling law (ignoring repetition); the two diverge past 1 epoch. Controls: model size N (log₁₀ params), D_unique (log₁₀ tokens), repeat exponent in f(r) = r^exp.]

Your corpus has 2T unique high-quality tokens. You plan to train for 8 epochs (16T total). Using the r^0.7 discount, what is D_eff?

(a) 16T: repetition doesn't reduce effectiveness.
(b) 8T: each epoch after the first counts for half.
(c) 2T × 8^0.7 ≈ 2T × 4.29 ≈ 8.6T effective tokens.
(d) 2T: only the first epoch is truly unique.

Chapter 9: Connections & Cheat Sheet

You can now run a complete, principled scaling study. Here's the full recipe and the connections to what comes next.

Scaling study recipe (5 steps).

Fix your training API: tokenizer, optimizer, weight decay, gradient clipping. Use WSD LR schedule.
Choose 3–5 IsoFLOP budgets spanning 1–2 orders of magnitude below your target.
For each budget, sweep 5–8 model sizes. Set D = C/(6N) for each and train for exactly that many tokens with the WSD schedule, including its decay phase.
Fit L(N,D) = E + A/N^α + B/D^β using Huber loss on log-space residuals via L-BFGS-B.
Predict N*, D*, and L* for your target budget. Validate on one held-out run. If error <1%, you're good.

Challenge	Solution	Used by
LR changes with scale	muP parametrization (LR stays constant across width)	CerebrasGPT, MiniCPM
Each training run needs full decay	WSD schedule — checkpoint at stable phase, replay decay	MiniCPM, DeepSeek
Batch size unknown at large scale	Fit B_opt ∝ L^−k from small runs	MiniCPM, Kaplan 2020
Fit is sensitive to outlier runs	Huber loss in log-space for fitting	Chinchilla (Hoffmann et al., 2022)
Data is scarce — must repeat	D_eff = D_unique · r^0.7 correction	Muennighoff 2023, LLaMA 3
Need D*/N estimate without full study	Use D = 20N as baseline; adjust based on data quality	GPT-3 era → LLaMA 3 went to 93N

Caveats: what scaling laws cannot tell you. Scaling laws predict language modeling loss at a given compute budget. They do not predict: (1) downstream task performance (MMLU, HumanEval, etc.), because the loss-to-downstream correlation breaks down on some tasks; (2) emergent capabilities that appear suddenly at scale without warning; (3) the effect of training data quality (same formula, different A/B constants); (4) performance after RLHF/GRPO fine-tuning. Always validate predictions against downstream metrics, not just loss.

Key numbers to remember:

Quantity	Chinchilla (2022)	LLaMA 3 (2024)	MiniCPM (2024)
D*/N rule	≈ 20N	≈ 93N	≈ 192N
α (param exponent)	0.34	not published	fitted
β (data exponent)	0.28	not published	fitted
Method	All 3	IsoFLOP (Method 2)	Method 1 + 3
muP	No	No	Yes

What comes next:

CS336 Lec 9 — The basics: C=6ND, L(N,D), IsoFLOP curves, D=20N rule derivation.
CS336 Lec 12 — Evaluation: how to measure whether your model is actually good downstream.
Model Selection — Connecting scaling predictions to architecture and hyperparameter choices.

"Before you spend $10M, spend $100K to find out what the $10M will buy you."
— the spirit of scaling law methodology