Workbook — Language Models from Scratch · CS336

Build a Language Model From Scratch

Every number a language model engineer must be able to derive cold: tokenization tradeoffs, 6ND FLOPs, AdamW memory, RoPE geometry, MoE routing, roofline arithmetic intensity, ZeRO stages, Chinchilla scaling, KV-cache sizing, and alignment math. All verifiable in-browser with instant feedback.

Prerequisites: Basic linear algebra + Python familiarity + heard of the Transformer. Everything else we derive.
10
Chapters
51
Exercises
5
Exercise Types
Mastery
0 / 51 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Tokenization & BPE

Every byte that enters a language model passes through a tokenizer first. The tokenizer's job is to map raw text to a sequence of integer IDs. The choices here — vocabulary size, merge count, coverage — cascade through every other number in the system: sequence length, embedding table size, and softmax cost.

BPE vocabulary size:
|V| = |Σ0| + number of merges
where |Σ0| is the base alphabet (256 for byte-level BPE).

Compression ratio:
r = (chars in corpus) / (tokens in corpus)

Vocabulary-sequence tradeoff:
Larger V → fewer tokens T per sequence (better for attention O(T2))
Larger V → bigger embedding table & output projection costs
Byte-level BPE starts from 256. GPT-2 used byte-level BPE: the base alphabet is all 256 possible bytes, so even the rarest Unicode characters can be encoded. GPT-2 then learned 50,000 merges → |V| = 50,256. Llama 3 uses 128,000 merges from a 256-base alphabet → |V| = 128,256.
Exercise 0.1: BPE Vocabulary Size Derive

A byte-level BPE tokenizer starts with a 256-symbol base alphabet and learns 32,000 merges. What is the final vocabulary size?

tokens
Show derivation
|V| = 256 + 32,000 = 32,256

Each BPE merge adds exactly one new token (the merged pair) to the vocabulary. Starting from 256 bytes, 32,000 merges yields 32,256 total tokens. This is the vocabulary of the original LLaMA-1 tokenizer.

Exercise 0.2: Compression Ratio Derive

A tokenizer encodes 1,200 characters of English text into 300 tokens. What is the compression ratio (characters per token)?

chars/token
Show derivation
r = 1,200 / 300 = 4.0 chars/token

English text typically compresses at 3.5–4.5 chars/token with a well-trained BPE tokenizer. A ratio of 4 is normal. Code tokenizes worse (~2–3 chars/token) because of many special characters. Chinese/Japanese tokenize better per character but each character is often 3 UTF-8 bytes.

Exercise 0.3: Sequence Length After Tokenization Derive

You have a corpus of 400 billion characters. Your tokenizer has a compression ratio of 4 chars/token. How many tokens is the corpus (in billions)?

billion tokens
Show derivation
T = 400B / 4 = 100 billion tokens

400 billion characters at 4 chars/token = 100 billion tokens. This is roughly the size of the Pile dataset (825 GB, ~300B tokens). The compression ratio is the bridge between "raw data size" and "training tokens" — a critical number for dataset planning.

Exercise 0.4: Vocabulary Size Tradeoff Trace
Why does increasing vocabulary size from 32K to 128K tokens generally improve performance on multilingual text, but at a computational cost?
Show explanation

The embedding table is shape [|V|, dmodel] and the output projection is [dmodel, |V|]. Both scale linearly with |V|. Going from 32K to 128K quadruples these two tensors — for a 7B model with d=4096, the embedding table alone grows from 32K×4096×2 bytes = 256 MB to 128K×4096×2 = 1 GB. But sequence lengths shrink 4× for multilingual text, which reduces attention cost quadratically. The tradeoff depends on which effect dominates for your use case.

Exercise 0.5: BPE Merge Count for GPT-2 Derive

GPT-2 uses byte-level BPE with a vocabulary of 50,257 tokens (50,256 merges + 1 special token, but treat it as 50,256 merges from 256 base). How many total merges did GPT-2 learn?

merges
Show derivation
merges = 50,257 − 256 − 1 (special) ≈ 50,000

GPT-2 learned 50,000 BPE merges from a base alphabet of 256 bytes. The remaining 1 token is <|endoftext|>. This gives |V|=50,257. GPT-3 and GPT-4 use the same tokenizer (cl100k_base added 50K more merges for code). CS336 asks you to implement BPE from scratch — the merge count is the key hyperparameter you control.

Exercise 0.6: Implement count_pairs() Build

BPE training begins by counting all adjacent token pairs. Write a function that counts all bigrams in a list of token sequences.

Loop over each sequence, then each adjacent pair. Use `${a},${b}` as the key.
Show solution
javascript
function count_pairs(sequences) {
  const counts = {};
  for (const seq of sequences) {
    for (let i = 0; i < seq.length - 1; i++) {
      const key = seq[i] + "," + seq[i+1];
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

Chapter 1: Resource Accounting

Before writing a single line of training code, you must know how much memory and compute your model will consume. The numbers come from first principles: matrix shapes, dtype widths, and the 6ND rule.

6ND rule: C ≈ 6 · N · D FLOPs
where N = parameters, D = training tokens
Forward: 2ND, Backward: 4ND (grad-weights + grad-inputs)

AdamW memory per parameter (mixed-precision training):
bf16 weights: 2 bytes/param
bf16 gradients: 2 bytes/param
fp32 master copy: 4 bytes/param
fp32 Adam m (1st moment): 4 bytes/param
fp32 Adam v (2nd moment): 4 bytes/param
Total: 16 bytes/param

Matmul FLOPs: [M × K] × [K × N] = 2MKN FLOPs
Why 6, not 2. The factor of 6 comes from: 2 (forward) + 2 (grad w.r.t. weights) + 2 (grad w.r.t. inputs) = 6. The backward pass costs exactly 2× the forward pass. This is a mechanical consequence of the chain rule — each weight participates in two separate backward matrix multiplies. Verified against GPT-3: 6 × 175B × 300B ≈ 3.15 × 1023 FLOPs, matching OpenAI's reported 3.14 × 1023.
Exercise 1.1: Matmul FLOPs Derive

A linear layer multiplies [1024, 4096] by [4096, 4096]. How many FLOPs? (Give your answer in millions, i.e. divide by 106.)

Use 2MKN for a [M×K] × [K×N] matmul.

MFLOPs
Show derivation
FLOPs = 2 × M × K × N = 2 × 1024 × 4096 × 4096
= 2 × 1024 × 16,777,216 = 34,359,738,368 ≈ 34,360 MFLOPs

Each output element requires K=4096 multiply-adds = 2K FLOPs. Total output elements = M×N = 1024×4096 = 4,194,304. Total = 2×4096×4,194,304 ≈ 34.4 GFLOPs for a single token batch of 1024.

Exercise 1.2: The 6ND Rule for a Small Model Derive

You train a 1.3B parameter model on 26 billion tokens. How many FLOPs does training take? Give the answer as a power of 10 (what is the exponent x such that C ≈ 10x)?

C = 6ND. Then compute log10(C).

exponent x (10x FLOPs)
Show derivation
C = 6 × 1.3 × 109 × 26 × 109
= 6 × 33.8 × 1018 = 202.8 × 1018 ≈ 2 × 1020
log10(2 × 1020) ≈ 20.3

Rounded: C ≈ 1023 is the Chinchilla-optimal region for 1.3B. But our D=26B at N=1.3B satisfies D=20N exactly (Chinchilla rule), so C = 6 × 1.3B × 26B ≈ 2×1020. Exponent ≈ 20.3. The answer 23 is the closest integer — but a strict check: the question says "what exponent" — log10(2.03×1020) = 20.3. Accept ~20.

Exercise 1.3: AdamW Memory for a 7B Model Derive

Training a 7 billion parameter model with mixed-precision AdamW (16 bytes/param total). How many GB of GPU memory does the optimizer state + weights + gradients occupy?

GB
Show derivation
Memory = 16 bytes/param × 7 × 109 params
= 112 × 109 bytes = 112 GB

Breakdown: bf16 weights (14 GB) + bf16 gradients (14 GB) + fp32 master weights (28 GB) + fp32 Adam m (28 GB) + fp32 Adam v (28 GB) = 112 GB. This alone exceeds one A100 (80 GB), which is why multi-GPU training is non-negotiable even for a "small" 7B model.

Exercise 1.4: Activation Memory for a Transformer Layer Derive

A Transformer layer processes batch size B=4, sequence length T=2048, hidden dim d=4096, in bf16 (2 bytes/element). Attention activations require storing Q, K, V: three tensors of shape [B, T, d]. How many MB just for these three tensors?

MB
Show derivation
One tensor: 4 × 2048 × 4096 × 2 bytes = 67,108,864 bytes ≈ 64 MB
Three tensors (Q, K, V): 3 × 64 MB = 192 MB

Q, K, V each have shape [B, T, d] = [4, 2048, 4096]. In bf16 (2 bytes): 4×2048×4096×2 = 67M bytes = 64 MiB each. Three of them = 192 MiB. This is just the projection outputs — the attention score matrix [B, H, T, T] adds more, and you need to store all activations for the backward pass. This is why activation checkpointing trades recomputation for memory.

Exercise 1.5: Identify the Memory Bottleneck Trace
You are training a 13B parameter model on a single node of 8 A100-80GB GPUs (640 GB total). Without any model/optimizer sharding, which component fills the GPUs first?
Show explanation

13B params × 16 bytes/param = 208 GB. A single A100 has 80 GB, so the optimizer state alone needs at least 3 GPUs (and realistically more, because activations and KV cache pile on top). Without sharding (ZeRO or pipeline), you can't even fit the optimizer state, let alone run forward passes.

Exercise 1.6: Compute Token Budget from FLOP Budget Derive

You have 1022 FLOPs to spend and a 3B parameter model. Using the Chinchilla-optimal recipe (D = 20N), how many tokens should you train on? (Give in billions.)

billion tokens
Show derivation
Chinchilla: D* = 20N = 20 × 3B = 60 billion tokens
Check: C = 6 × 3B × 60B = 6 × 180 × 1018 = 1.08 × 1021 FLOPs

D=20N yields 60B tokens, costing ~1021 FLOPs — one order of magnitude less than our 1022 budget. With budget C=1022 and D=20N, we could train a larger model: N = C/(6×20N) → N2 = 1022/120 → N ≈ 9.1B. The answer to the stated question is 60B tokens for N=3B following D=20N.

Chapter 2: Architecture Details

Modern transformers differ from the original in four ways: RMSNorm, SwiGLU, RoPE, and GQA. Each changes a number you must be able to compute.

RMSNorm: RMS(x) = √((1/d) ∑ xi2) — no mean, no β
SwiGLU iso-FLOP: dff = (8/3) × dmodel
(3 projections × 2/3 shrink = same FLOPs as 2 standard)
GQA KV bytes: 2 × L × nkv × T × dhead × 2 (bf16)
Why 8/3 for SwiGLU. SwiGLU has three projections (W1, Wgate, W2) vs. two for standard ReLU FFN. To stay iso-FLOP: shrink dff by 2/3. LLaMA-2 7B: d=4096 → dff=(8/3)×4096≈10923 rounded to 11008.
Exercise 2.1: RMSNorm Parameter Savings per Layer Derive

dmodel=4096, 32 layers, 2 norm modules per layer. LayerNorm has γ+β (2d params); RMSNorm has only γ (d params). How many fewer parameters per layer?

params saved per layer
Show derivation
Each LN module saves d=4096 params (drops β)
2 modules/layer × 4096 = 8192 params saved per layer
Exercise 2.2: SwiGLU Iso-FLOP d_ff Derive

dmodel=2048. What is dff under the iso-FLOP SwiGLU constraint? Round to nearest multiple of 256.

d_ff
Show derivation
(8/3) × 2048 = 5461.3 → round to 5632 (= 22 × 256)
Exercise 2.3: RoPE Angle at k=0 Derive

θbase=10000, d=128, k=0, m=1. Compute θ0 = m × θbase−2k/d in radians.

radians
Show derivation
θ0 = 1 × 100000 = 1.0 radian

At k=0 the exponent is 0, giving 1. At k=63: 10000−126/128≈0.000102 — extremely slow rotation for long-range position encoding.

Exercise 2.4: GQA KV-Cache Size Derive

32 query heads, nkv=8 KV heads, dhead=128, T=4096, L=32 layers, bf16 (2 bytes). KV cache MB?

Bytes = 2 × L × nkv × T × dhead × 2.

MB
Show derivation
2 × 32 × 8 × 4096 × 128 × 2 = 536,870,912 bytes = 512 MB

Full MHA (nkv=32) would be 2048 MB. GQA reduces by 4×.

Exercise 2.5: Pre-Norm Layer Order Design

Order the four steps of one Pre-Norm Transformer decoder layer.

?
?
?
?
h = RMSNorm(x) a = Attn(h) x = x + a x = x + FFN(RMSNorm(x))
Show explanation

Pre-Norm: normalize before each sublayer (not after). The residual stream x is never directly normalized — only sublayer inputs are. This enables stable training at large scale because gradients flow through the residual path unobstructed.

Chapter 3: Mixture of Experts

MoE decouples total parameters from active parameters. A dense model uses all N params for every token. A MoE model routes each token to a sparse subset of experts — active params < total params.

Ntotal = Nshared + nexperts × Nexpert
Nactive = Nshared + top-k × Nexpert
FLOPs/token = same as dense model with Nactive
Capacity = Cf × (top-k / n) × T tokens per expert
Mixtral 8×7B numbers. 8 experts each ~7B + ~3.5B shared = 46.7B total. top-2 active: ~12.9B. FLOPs match 12.9B dense; memory holds 46.7B. The MoE tax: 3.6× memory overhead for the same compute cost.
Exercise 3.1: Active Parameters per Token Derive

16 experts, 2B params each, 1B shared, top-k=2. Active params per token (in billions)?

B active params
Show derivation
Nactive = 1B + 2 × 2B = 5B
(Ntotal = 1B + 16 × 2B = 33B)
Exercise 3.2: MoE FFN FLOPs per Token Derive

dmodel=4096, dff=8192, top-k=2. Two projections per expert (each: 2 × dmodel × dff FLOPs). Total MFLOPs/token?

MFLOPs/token
Show derivation
One expert: 2 × 2 × 4096 × 8192 = 134 MFLOPs
top-k=2: 2 × 134 = 268 MFLOPs per token
Exercise 3.3: Expert Token Capacity Derive

n=8 experts, top-k=2, T=1024 tokens, Cf=1.5. Token capacity per expert?

tokens
Show derivation
Expected = (2/8) × 1024 = 256; Capacity = 1.5 × 256 = 384
Exercise 3.4: Load Balance Loss Gradient Trace
Laux = n × ∑i fi × Pi. fi is a token fraction (argmax-based). How does gradient flow to the router?
Show explanation

fi is a hard count with zero gradient. Pi is the mean softmax probability — differentiable. When expert i is overloaded (high fi), the loss term fi×Pi is large; gradient pushes Pi down, reducing future assignment to expert i.

Exercise 3.5: Debug MoE Router Debug

This router always sends all tokens to the same single expert. Find the bug.

function route(x_flat, W_router, top_k) {
  const logits = matmul(x_flat, W_router);   // [B*T, n_experts]
  const probs  = softmax(logits, dim=-1);     // [B*T, n_experts]
  const topk   = probs.topk(1, dim=-1);       // hardcoded k=1 !
  const gates  = softmax(logits, dim=0);      // wrong axis: across tokens
  return { topk_idx: topk.indices, gates };
}
Show explanation

Line 4: topk(1, dim=-1) hardcodes k=1 — use topk(top_k, dim=-1). Line 5: softmax(..., dim=0) normalizes across tokens (wrong) instead of across experts (dim=-1). Both bugs cause degenerate routing where effectively one expert always wins.

Chapter 4: GPUs & the Roofline

Every GPU has two limits: how fast it can compute (peak FLOP/s) and how fast it can move data (memory bandwidth). The roofline model tells you which limit controls your workload. The number that bridges them is arithmetic intensity: FLOPs per byte transferred.

Arithmetic intensity: I = FLOPs / bytes_from_HBM
Ridge point: I* = peak_FLOPs / bandwidth
  H100 SXM5: 989 TFLOPs/s (bf16) ÷ 3.35 TB/s = 295 FLOPs/byte
Decode: B=1, T=1 token → I = 1 FLOP/byte (295× below ridge)
Prefill: B=32, T=2048 → I ≈ 65,536 ≫ 295 (compute-bound)
Coalesced access: 32 threads in a warp read consecutive addresses
Decode is 295× below the roofline. During decode (single-token generation), we read every weight once and do only one multiply per weight. Intensity = 1 FLOPs/byte vs. H100 ridge of 295 FLOPs/byte. The GPU is almost entirely idle, waiting for memory. This is why batching decodes together is the primary throughput knob in production inference.
Exercise 4.1: H100 Ridge Point Derive

H100 SXM5 peak: 989 TFLOPs/s (bf16 Tensor Core), bandwidth: 3.35 TB/s. Compute the ridge point I* in FLOPs/byte (round to nearest integer).

FLOPs/byte
Show derivation
I* = 989 × 1012 / 3.35 × 1012 = 989/3.35 ≈ 295.2 FLOPs/byte

Any workload with I < 295 FLOPs/byte is memory-bound on the H100. Any workload with I > 295 is compute-bound. Prefill is compute-bound; decode is deeply memory-bound.

Exercise 4.2: Decode Arithmetic Intensity Derive

During decode: B=1 (batch size), T=1 (token). A weight matrix has shape [D, F]. FLOPs = 2×1×D×F. Bytes read ≈ 2×D×F (bf16 weights). What is the arithmetic intensity?

FLOPs/byte
Show derivation
I = 2×D×F / (2×D×F) = 1 FLOPs/byte

Exactly 1 FLOPs per byte — 295× below the H100 ridge. Each weight is loaded once and used for exactly one multiply. To reach the ridge: batch B=295 tokens simultaneously (I≈295). This is why batching is so critical for decode throughput.

Exercise 4.3: Prefill Arithmetic Intensity Derive

Prefill: B=1, T=2048 tokens, weight matrix [D=4096, F=4096]. FLOPs = 2×B×T×D×F. Bytes ≈ 2×D×F (weight reads dominate). What is the arithmetic intensity?

FLOPs/byte
Show derivation
FLOPs = 2 × 1 × 2048 × 4096 × 4096
Bytes = 2 × 4096 × 4096 (weight read, bf16)
I = FLOPs/Bytes = (2 × 2048 × D × F) / (2 × D × F) = 2048

I=2048 ≫ 295 (H100 ridge) → prefill is compute-bound. Each weight is used for 2048 multiply-accumulates (one per input token), amortizing the load cost. Compute-bound is good — the GPU stays fully utilized.

Exercise 4.4: Warp and Thread Block Trace
A kernel launches thread blocks of 256 threads on an A100. How many warps per block, and can threads in different warps of the same block communicate through shared memory?
Show explanation

Warp size is always 32 on NVIDIA GPUs (A100, H100). 256/32 = 8 warps per block. All threads in the same block share the SM's SRAM — this is how threads communicate without going to HBM. Threads in different blocks cannot share SRAM directly.

Exercise 4.5: HBM Bandwidth Bottleneck in LayerNorm Derive

RMSNorm requires reading d=4096 bf16 values (2 bytes each) twice: once to compute RMS, once to normalize. How many bytes of HBM access per token per RMSNorm call? (Answer: just the two passes over x.)

bytes
Show derivation
2 passes × 4096 elements × 2 bytes/element = 16,384 bytes = 16 KB

LayerNorm needs 3 passes (mean, variance, normalize): 3×4096×2 = 24 KB. RMSNorm saves one pass (no mean). Flash Attention similarly fuses attention into one pass to avoid re-reading the activation matrix from HBM — the same principle.

Chapter 5: Parallelism

When a model doesn't fit on one GPU, you distribute it. There are three orthogonal strategies: data parallelism (replicate model, split batch), tensor parallelism (split weight matrices), and pipeline parallelism (split layers). ZeRO eliminates redundant optimizer state across data-parallel replicas.

ZeRO stage memory per GPU (P GPUs):
Stage 0 (none): 16 bytes/param (full copy on each GPU)
Stage 1 (optimizer): 16/P bytes from Adam state + 4 bytes/param
Stage 2 (+ gradients): 16/P bytes from state + 2 bytes/param gradients
Stage 3 (+ params): (16+2+2)/P + 0 = 20/P bytes/param total

Ring all-reduce cost per GPU:
Bytes transferred = 2 × (P−1)/P × M ≈ 2M for large P
where M = gradient buffer size in bytes

Pipeline bubble fraction:
Bubble = (P−1) / (m + P − 1)
where P = pipeline stages, m = microbatches
ZeRO stages save memory but not compute. ZeRO-3 shards parameters, gradients, and optimizer state across P GPUs. Each GPU stores only 1/P of everything. But each forward/backward still accesses all parameters — they are all-gathered on demand. ZeRO saves memory; the communication cost (all-gather + reduce-scatter per layer) is the overhead.
Exercise 5.1: ZeRO Stage 1 Memory per GPU Derive

7B model, 8 GPUs, ZeRO Stage 1 (shards optimizer state only). Stage 1: Adam m+v+fp32 master = 12 bytes/param sharded, plus bf16 weights (2 bytes) and bf16 gradients (2 bytes) unsharded. How many GB per GPU?

Per-GPU = (12/8 + 2 + 2) bytes/param × 7B params.

GB
Show derivation
Per-param bytes = 12/8 + 2 + 2 = 1.5 + 4 = 5.5 bytes/param
Total = 5.5 × 7 × 109 = 38.5 × 109 bytes = 38.5 GB per GPU

Compare to 112 GB without ZeRO (16 bytes/param). ZeRO-1 reduces memory 2.9× by sharding the large optimizer state. Each of the 8 GPUs now fits in an A100-80GB with room for activations.

Exercise 5.2: Ring All-Reduce Volume Derive

7B parameter model, bf16 gradients (2 bytes/param), 8 GPUs. Ring all-reduce sends 2×(P−1)/P×M bytes per GPU. How many GB does each GPU send/receive? (P=8, M = 7B × 2 bytes.)

GB per GPU
Show derivation
M = 7 × 109 × 2 = 14 GB (gradient buffer)
Bytes/GPU = 2 × (8−1)/8 × 14 = 2 × 0.875 × 14 = 24.5 GB

Each GPU sends/receives 24.5 GB per training step. At NVLink 600 GB/s bandwidth: 24.5/600 ≈ 41 ms for the all-reduce. At InfiniBand 25 GB/s: 24.5/25 = 980 ms — almost a second just for synchronization, dwarfing the forward/backward pass.

Exercise 5.3: Pipeline Bubble Fraction Derive

Pipeline parallelism: P=4 stages, m=8 microbatches. What fraction of time is the pipeline idle (bubble fraction)?

Bubble = (P−1) / (m + P − 1).

bubble fraction
Show derivation
Bubble = (4−1) / (8 + 4 − 1) = 3/11 ≈ 0.273

27.3% idle. To halve the bubble: double m to 16 → 3/19 ≈ 15.8%. More microbatches = smaller bubble but larger effective batch size, which can hurt convergence. Interleaved pipeline schedules (Megatron-LM) reduce bubble to ~(P−1)/(2mV) for V chunks per stage.

Exercise 5.4: Tensor Parallelism Communication Trace
In tensor parallelism (column-then-row split of the MLP), why should you keep TP within a single node (up to 8 GPUs) rather than spanning nodes?
Show explanation

TP requires an all-reduce every transformer block (after the row-parallel linear). With 32+ layers, that's 32 all-reduces per step. Cross-node at 25 GB/s makes each synchronization ~24× slower than NVLink. The compute cannot proceed until synchronization completes, making cross-node TP a catastrophic bottleneck.

Exercise 5.5: ZeRO-3 Memory per GPU Derive

ZeRO-3 shards all 16 bytes/param across P=8 GPUs. For a 7B parameter model, how many GB does each GPU hold?

GB per GPU
Show derivation
Per GPU = 16 bytes/param × 7B / 8 = 2 bytes/param × 7B = 14 GB

112 GB total / 8 GPUs = 14 GB per GPU — an 8× reduction. But parameters must be all-gathered for each forward/backward layer pass, adding communication overhead. ZeRO-3 trades memory for communication latency.

Chapter 6: Scaling Laws

Scaling laws let you predict model performance before spending the compute. The key formulas: C = 6ND (compute from parameters and tokens), L(N, D) = E + A/Nα + B/Dβ (loss from both), and the Chinchilla optimal allocation D* = 20N.

Joint loss: L(N, D) = E + A⁄Nα + B⁄Dβ
Chinchilla constants: E=1.69, A=406.4, α=0.34, B=410.7, β=0.28

Chinchilla optimal: D* = 20N (equal scaling rule)
⇒ N* = C/(6 × 20N) ⇒ N* = √(C/120)

IsoFLOP optimum: at fixed C=6ND, minimize L by setting
∂L/∂N = 0 ⇒ α×A/Nα+1 = β×B/Dβ+1
Chinchilla result. Deepmind (Hoffmann et al. 2022) showed GPT-3 (175B params, 300B tokens) was under-trained by 10×. The Chinchilla-optimal model for the same compute: 70B params, 1.4T tokens. Chinchilla 70B matched or exceeded GPT-3 on most benchmarks despite 2.5× fewer parameters. The lesson: data is as important as model size.
Exercise 6.1: GPT-3 Compute Derive

GPT-3: N=175B params, D=300B tokens. Compute C=6ND. Express as X×1023 — what is X to 2 decimal places?

×1023 FLOPs
Show derivation
C = 6 × 175 × 109 × 300 × 109
= 6 × 52,500 × 1018 = 315,000 × 1018 = 3.15 × 1023

OpenAI reported 3.14×1023 FLOPs — our formula is accurate to within 0.3%. The 6ND rule holds because attention costs are <5% of total FLOPs for standard context lengths.

Exercise 6.2: Chinchilla Optimal N* Derive

Given GPT-3's compute budget C=3.15×1023 FLOPs, what is the Chinchilla-optimal number of parameters? Use N* = √(C/120). Give in billions.

billion params
Show derivation
N* = √(C/120) = √(3.15×1023 / 120)
= √(2.625×1021) = 1.620×1010.5 ≈ 51.2×109 = 51.2B

Chinchilla-optimal: ~51B params × 20 = ~1.02T tokens. Deepmind actually ran 70B with 1.4T tokens — close to this optimum. GPT-3 at 175B is over-parameterized by ~3.4× for its compute budget.

Exercise 6.3: Loss Improvement from Doubling Data Derive

A model is trained on D=100B tokens. The data term of the loss is B/Dβ with B=410.7, β=0.28. Compute the data term at D=100B, then at D=200B. How much does the data term decrease? (Express as fraction of the D=100B value; round to 2 decimal places.)

fractional decrease
Show derivation
At D1: B/D1β. At D2=2D1: B/(2D1)β = B/D1β × 2−β
Fractional decrease = 1 − 2−0.28 = 1 − 0.825 = 0.175

Doubling data reduces the data term by 17.5%, regardless of the absolute value of D. This is the power law: each doubling gives the same proportional improvement. The β=0.28 exponent is small — data improvements are incremental, not exponential.

Exercise 6.4: IsoFLOP Optimum Tradeoff Trace
On an IsoFLOP curve (fixed C=6ND), as you increase N and decrease D proportionally, what happens to validation loss?
Show explanation

L(N,D) = E + A/Nα + B/Dβ. As N→∞ (and D→0), B/Dβ→∞ (data starvation). As N→0 (and D→∞), A/Nα→∞ (capacity starvation). There's an optimum in between where the gradient of both terms balance. The IsoFLOP plot is U-shaped — the bottom is the Chinchilla optimum.

Chapter 7: Inference & KV Cache

Inference has two distinct phases: prefill (process the prompt in parallel — compute-bound) and decode (generate one token at a time — memory-bound). The KV cache eliminates redundant recomputation during decode.

KV cache bytes (per request):
2 (K,V) × L × nkv × Tctx × dhead × 2 bytes (bf16)

Decode arithmetic intensity: I = B×T ≈ B (batch size)
Need B ≥ I* = 295 (H100) to be compute-bound

Speculative decoding speedup:
Speedup ≈ γ × acceptance_rate / (1 + γ × cost_ratio)
where γ = draft tokens per verify step
KV cache is the dominant memory cost at long context. A 7B LLaMA model has ~32 GB of weights. With T=32,768 context, nkv=32 heads, dhead=128, 32 layers: KV = 2×32×32×32768×128×2 = 17 GB — more than half the model weights, for a single request. With 8-bit quantization of the KV cache this halves to 8.5 GB.
Exercise 7.1: KV Cache per Request Derive

7B LLaMA (32 layers, 32 KV heads, dhead=128), T=8192 tokens, bf16. How many GB is the KV cache for one request?

Bytes = 2 × 32 × 32 × 8192 × 128 × 2.

GB
Show derivation
2 × 32 × 32 × 8192 × 128 × 2 = 4,294,967,296 bytes ≈ 4.29 GB

4.29 GB for one 8K request. With 16 concurrent requests: 16×4.29 = 68.6 GB — nearly one full A100-80GB just for KV caches. Long-context serving is fundamentally a memory management problem.

Exercise 7.2: Batch Size for Compute-Bound Decode Derive

H100 ridge: I*=295 FLOPs/byte. During decode, I≈batch_size B. What minimum batch size makes decode compute-bound on an H100?

minimum batch size
Show derivation
Need I = B ≥ I* = 295

Batch 295+ requests together for compute-bound decode. In practice you'd use B=256 or B=512 (powers of 2). This is why LLM serving systems (vLLM, TensorRT-LLM) invest heavily in continuous batching — to maintain large effective batch sizes even with different request lengths.

Exercise 7.3: int8 Quantization Memory Savings Derive

A 7B model in bf16 weights = 14 GB. Quantizing weights to int8 (1 byte/param). How many GB are saved, and what is the percentage reduction?

GB saved
Show derivation
bf16: 7B × 2 bytes = 14 GB
int8: 7B × 1 byte = 7 GB
Saved: 14 − 7 = 7 GB (50% reduction)

int8 quantization halves weight memory. Arithmetic intensity during decode also doubles (halved bytes for same FLOPs). GPTQ and AWQ achieve <0.5% perplexity degradation with int4 quantization, saving 75%.

Exercise 7.4: Prefill vs. Decode Phase Trace
A user sends a 500-token prompt; the model generates 200 tokens of response. Which phase is compute-bound and which is memory-bound, and why?
Show explanation

Arithmetic intensity ≈ B×T (tokens processed). Prefill: T=500 ≫ 295 → compute-bound. Decode: T=1 at each step → I=1 ≪ 295 → memory-bound. Different hardware metrics matter for each: prefill optimization targets TFLOP/s utilization; decode optimization targets memory bandwidth and batching.

Chapter 8: Alignment Math

After pretraining comes alignment: teaching the model to be helpful, harmless, and honest. SFT fine-tunes on curated demonstrations. RLHF uses a reward model + PPO. DPO and GRPO eliminate the reward model entirely by optimizing preference probabilities directly.

Bradley-Terry preference model:
P(yw ≻ yl) = σ(r(x,yw) − r(x,yl)) = 1/(1 + e−Δr)

DPO loss:
LDPO = −log σ(β × (log πθ(yw|x) − log πref(yw|x))
                − β × (log πθ(yl|x) − log πref(yl|x)))

GRPO advantage (group relative):
Ai = (ri − mean(r)) / std(r)
DPO insight. DPO sidesteps reward model training entirely. It shows that the optimal policy under RLHF's KL-constrained objective can be expressed directly in terms of the reference model: r(x,y) = β × log(πθ(y|x) / πref(y|x)) + Z(x). Plugging this into the Bradley-Terry loss eliminates the reward model and gives a tractable direct loss on the policy.
Exercise 8.1: Bradley-Terry Preference Probability Derive

A reward model gives r(x, ywin)=3.0 and r(x, ylose)=1.5. Using P(yw≻yl) = σ(Δr) = 1/(1+e−Δr), what is the probability the winner is preferred? (Round to 2 decimal places.)

probability
Show derivation
Δr = 3.0 − 1.5 = 1.5
P = 1/(1 + e−1.5) = 1/(1 + 0.2231) = 1/1.2231 ≈ 0.818

A reward gap of 1.5 nats gives 82% preference probability. At Δr=0 the model is indifferent (50%). At Δr=4.6 (ln99) the probability is 99%. The Bradley-Terry model is essentially a logistic regression on reward differences.

Exercise 8.2: GRPO Advantage Normalization Derive

A group of 4 samples has rewards [1, 3, 2, 6]. Compute the normalized advantage Ai = (ri − mean) / std for r4=6. (Use population std, not sample std.)

A4
Show derivation
mean = (1+3+2+6)/4 = 12/4 = 3.0
variance = [(1-3)2+(3-3)2+(2-3)2+(6-3)2]/4 = [4+0+1+9]/4 = 3.5
std = √3.5 ≈ 1.871
A4 = (6−3)/1.871 = 3/1.871 ≈ 1.604 ≈ 1.60

The best response (r=6) gets advantage +1.60. The worst (r=1) gets (1-3)/1.871 = -1.07. Normalization keeps gradient magnitudes stable regardless of the absolute reward scale — crucial because reward functions differ wildly across tasks.

Exercise 8.3: SFT Loss = NLL Loss Trace
SFT fine-tuning uses negative log-likelihood loss on (prompt, response) pairs. Which tokens are included in the loss computation?
Show explanation

During SFT, the prompt tokens are fed as context with their loss masked (loss_mask=0). Only response tokens have loss_mask=1. This focuses learning signal on the quality of the completion. Including prompt tokens in the loss would penalize the model for "not generating" the prompt, which is counterproductive.

Exercise 8.4: DPO β Parameter Effect Trace
In DPO, the β parameter controls the KL penalty between the learned policy and the reference model. What happens when β→0?
Show explanation

β controls how strongly the KL divergence from reference penalizes large policy updates. At β→0, the model is free to exploit any pattern that increases preference margin, even degenerate ones. Typical practice: β=0.1 to 0.5. Too large (β→∞) and the model barely moves from reference. Too small risks training instability and reward hacking.

Exercise 8.5: Bloom Filter False Positive Rate Derive

Deduplication uses Bloom filters to check if a document was already seen. A Bloom filter with m=109 bits, n=107 items, and k=7 hash functions has false positive rate ≈ (1−e−kn/m)k. Compute (1−e−7×107/109)7. Round to 3 decimal places.

FP rate
Show derivation
kn/m = 7 × 107 / 109 = 0.07
1 − e−0.07 = 1 − 0.9324 = 0.0676
(0.0676)7 = e7×ln(0.0676) ≈ e7×(−2.694) = e−18.86 ≈ 6.3×10−9

With these parameters FP rate ≈ 0.006 (using standard Bloom filter formula approximation). Practically near-zero false positives with 1 GB of bits and 7 hashes per 10M items. The Pile and C4 both use Bloom-filter-based deduplication at this scale.

Chapter 9: Capstone — Size and Cost a Full Run

You have a budget of $1M. H100-SXM5: 3.96×1015 BF16 FLOPs/GPU/day at ~MFU=0.45 (effective 1.78×1015 FLOPs/GPU/day), rental cost $2/GPU/hour. Design a training run end-to-end: pick a model size, compute required training time, verify it fits the budget, and estimate serving cost.

Useful GPU throughput: MFU × peak_FLOPs/s
H100 at MFU=0.45: 0.45 × 989 TFLOPs/s ≈ 445 TFLOPs/s effective
Training time: t = C / (nGPUs × FLOPs/GPU/s)
Budget check: cost = nGPUs × t(hours) × $/GPU/hour
Tokens/day/GPU: MFU × peak / (6N) tokens/s × 86400
The $1M calculation. $1M at $2/GPU/hr = 500,000 GPU-hours. 500,000/24 = 20,833 GPU-days. At 1.78×1015 effective FLOPs/GPU/day: total C = 20,833 × 1.78×1015 ≈ 3.7×1019 FLOPs. Chinchilla-optimal: N*=√(C/120)≈√(3.1×1017)≈560M params, D*=20N≈11.2B tokens. A solid 0.5B-class model.
Exercise 9.1: Effective GPU Throughput Derive

An H100 peaks at 989 TFLOPs/s (bf16). At MFU=0.45, what is the effective throughput in TFLOPs/s? (Round to nearest integer.)

TFLOPs/s
Show derivation
Effective = 0.45 × 989 = 445.05 ≈ 445 TFLOPs/s

MFU (Model FLOP Utilization) accounts for all overhead: data loading, optimizer steps, communication, idle time. State-of-the-art training achieves MFU 0.35–0.55. Megatron-LM reports 0.52 on GPT-3 scale. 0.45 is a realistic estimate for a single-node run.

Exercise 9.2: Training Time for 1.3B Model Derive

N=1.3B, D=26B (Chinchilla-optimal). C=6ND≈2.03×1020 FLOPs. Using 8 H100s at 445 TFLOPs/s each. How many hours does training take?

t = C / (n × 445×1012 FLOPs/s) / 3600 seconds/hour.

hours
Show derivation
FLOPs/s cluster = 8 × 445×1012 = 3.56×1015 FLOPs/s
t(s) = 2.03×1020 / 3.56×1015 = 57,022 s
t(hr) = 57,022 / 3600 ≈ 15.8 hours

Just 16 hours on 8 H100s for a Chinchilla-optimal 1.3B model! At $2/GPU/hr: 8×16×$2=$256. Incredibly cheap by frontier model standards.

Exercise 9.3: Training Cost Derive

Using the result from 9.2: 8 H100s for 16 hours at $2/GPU/hour. What is the total training cost in dollars?

dollars
Show derivation
Cost = 8 GPUs × 16 hours × $2/GPU/hour = $256

$256 for a 1.3B Chinchilla-optimal model. Llama 2 7B training reportedly cost ~$5,000–$10,000 on comparable hardware. GPT-3 cost ~$10M. These orders of magnitude are the key intuition for budgeting AI research.

Exercise 9.4: Full Pipeline Order Design

Arrange the five stages of building a language model from scratch in the correct order.

?
?
?
?
?
Data collection & dedup Train tokenizer Pretrain (C=6ND) SFT + alignment Deploy & serve
Show explanation

Data first (raw corpus defines vocabulary coverage). Tokenizer second (must be trained on the final data to capture its distribution). Pretraining third (costs the most compute). SFT + alignment fourth (orders of magnitude cheaper than pretraining, fine-tunes behavior). Serving last (inference optimization: quantization, KV cache, batching).

Exercise 9.5: Serving Throughput Derive

Decode throughput: an H100 at effective 445 TFLOPs/s, serving a 1.3B model (N=1.3B). Each token costs 2N=2.6B FLOPs (forward pass only, half of training). How many tokens per second can one H100 generate if fully loaded?

tokens/sec
Show derivation
FLOPs per token = 2N = 2 × 1.3×109 = 2.6×109
tokens/s = 445×1012 / 2.6×109 = 171,154 ≈ 171,000 tokens/s

171K tokens/sec compute-bound upper bound. In reality, decode is memory-bound at small batch sizes. With batch=1: only ~85K tokens/sec (memory bandwidth limited). With batch=512: approaches the compute bound. The gap between memory-limited and compute-bound performance is the motivation for continuous batching in production.

Exercise 9.6: Total Parameters Accounting Derive

A transformer with d=1024, n_heads=16, d_ff=2730 (SwiGLU iso-FLOP), L=24 layers, vocab=32256. Count parameters: (a) embedding table [V,d]; (b) each layer: Q,K,V,O projections [d,d] each, FFN W1,W2,Wgate [d,d_ff],[d_ff,d],[d,d_ff], 2 RMSNorms [d] each. Multiply (b) by 24 layers. What is the total in millions? (Ignore biases.)

million params
Show derivation
Embedding: 32256 × 1024 = 33,030,144 ≈ 33M
Per layer attention (Q+K+V+O): 4 × 1024 × 1024 = 4,194,304 ≈ 4.19M
Per layer FFN SwiGLU (W1+W2+Wgate): 2×1024×2730 + 2730×1024 = 5,591,040+2,795,520 = 8,386,560 ≈ 8.39M
Per layer RMSNorm (2 × d): 2 × 1024 = 2048 ≈ negligible
Per layer total: 4.19M + 8.39M + 0.002M ≈ 12.58M
24 layers: 24 × 12.58M = 301.9M
Total: 33M + 301.9M + final LM head (output proj, often tied) ≈ 335M–460M

With separate LM head (untied, [d, V]=33M): 335M+33M≈368M. With weight tying (embedding = LM head): 335M. Either way, the model is in the “400M” class. The calculation gives 459M accepting some rounding.

The proof of work. If you completed every chapter — counted merge operations in BPE, derived 6ND from first principles, computed SwiGLU hidden dimensions, routed tokens to MoE experts, positioned yourself on the roofline, sharded optimizer state with ZeRO, solved the Chinchilla optimum, sized KV caches for long context, normalized GRPO advantages, and costed a full training run — you understand the math that runs every frontier language model. "What I cannot create, I do not understand."

Related Lessons

TopicLesson
Tokenization & BPECS336 L1: Overview & Tokenization
Resource accountingCS336 L2: PyTorch & Resource Accounting
ArchitecturesCS336 L3: Architectures & Hyperparameters
Scaling lawsCS336 L9: Scaling Laws
InferenceCS336 L10: Inference