CS336: Build a Language Model From Scratch Workbook

Chapter 0: Tokenization & BPE

Every byte that enters a language model passes through a tokenizer first. The tokenizer's job is to map raw text to a sequence of integer IDs. The choices here — vocabulary size, merge count, coverage — cascade through every other number in the system: sequence length, embedding table size, and softmax cost.

BPE vocabulary size:
|V| = |Σ₀| + number of merges
where |Σ₀| is the base alphabet (256 for byte-level BPE).

Compression ratio:
r = (chars in corpus) / (tokens in corpus)

Vocabulary-sequence tradeoff:
Larger V → fewer tokens T per sequence (better for attention O(T²))
Larger V → bigger embedding table & output projection costs

Byte-level BPE starts from 256. GPT-2 used byte-level BPE: the base alphabet is all 256 possible bytes, so even the rarest Unicode characters can be encoded. GPT-2 then learned 50,000 merges → |V| = 50,256. Llama 3 uses 128,000 merges from a 256-base alphabet → |V| = 128,256.

Exercise 0.1: BPE Vocabulary Size Derive

A byte-level BPE tokenizer starts with a 256-symbol base alphabet and learns 32,000 merges. What is the final vocabulary size?

tokens

Show derivation

|V| = 256 + 32,000 = 32,256

Each BPE merge adds exactly one new token (the merged pair) to the vocabulary. Starting from 256 bytes, 32,000 merges yields 32,256 total tokens. This is the vocabulary of the original LLaMA-1 tokenizer.

Exercise 0.2: Compression Ratio Derive

A tokenizer encodes 1,200 characters of English text into 300 tokens. What is the compression ratio (characters per token)?

chars/token

Show derivation

r = 1,200 / 300 = 4.0 chars/token

English text typically compresses at 3.5–4.5 chars/token with a well-trained BPE tokenizer. A ratio of 4 is normal. Code tokenizes worse (~2–3 chars/token) because of many special characters. Chinese/Japanese tokenize better per character but each character is often 3 UTF-8 bytes.

Exercise 0.3: Sequence Length After Tokenization Derive

You have a corpus of 400 billion characters. Your tokenizer has a compression ratio of 4 chars/token. How many tokens is the corpus (in billions)?

billion tokens

Show derivation

T = 400B / 4 = 100 billion tokens

400 billion characters at 4 chars/token = 100 billion tokens. This is roughly the size of the Pile dataset (825 GB, ~300B tokens). The compression ratio is the bridge between "raw data size" and "training tokens" — a critical number for dataset planning.

Exercise 0.4: Vocabulary Size Tradeoff Trace

Why does increasing vocabulary size from 32K to 128K tokens generally improve performance on multilingual text, but at a computational cost?

Because the tokenizer itself takes longer to run with more tokens. Larger vocabulary encodes rare scripts and words as single tokens (shorter sequences), but the embedding table and output linear layer both grow by 4×, increasing memory and output projection FLOPs proportionally. Larger vocabulary requires more training data to learn each token embedding. Larger vocabularies cause attention to become quadratically more expensive.

Show explanation

The embedding table is shape [|V|, d_model] and the output projection is [d_model, |V|]. Both scale linearly with |V|. Going from 32K to 128K quadruples these two tensors — for a 7B model with d=4096, the embedding table alone grows from 32K×4096×2 bytes = 256 MB to 128K×4096×2 = 1 GB. But sequence lengths shrink 4× for multilingual text, which reduces attention cost quadratically. The tradeoff depends on which effect dominates for your use case.

Exercise 0.5: BPE Merge Count for GPT-2 Derive

GPT-2 uses byte-level BPE with a vocabulary of 50,257 tokens (50,256 merges + 1 special token, but treat it as 50,256 merges from 256 base). How many total merges did GPT-2 learn?

merges

Show derivation

merges = 50,257 − 256 − 1 (special) ≈ 50,000

GPT-2 learned 50,000 BPE merges from a base alphabet of 256 bytes. The remaining 1 token is <|endoftext|>. This gives |V|=50,257. GPT-3 and GPT-4 use the same tokenizer (cl100k_base added 50K more merges for code). CS336 asks you to implement BPE from scratch — the merge count is the key hyperparameter you control.

Exercise 0.6: Implement count_pairs() Build

BPE training begins by counting all adjacent token pairs. Write a function that counts all bigrams in a list of token sequences.

Loop over each sequence, then each adjacent pair. Use `${a},${b}` as the key.

Show solution

javascript
function count_pairs(sequences) {
  const counts = {};
  for (const seq of sequences) {
    for (let i = 0; i < seq.length - 1; i++) {
      const key = seq[i] + "," + seq[i+1];
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

Chapter 1: Resource Accounting

Before writing a single line of training code, you must know how much memory and compute your model will consume. The numbers come from first principles: matrix shapes, dtype widths, and the 6ND rule.

6ND rule: C ≈ 6 · N · D FLOPs
where N = parameters, D = training tokens
Forward: 2ND, Backward: 4ND (grad-weights + grad-inputs)

AdamW memory per parameter (mixed-precision training):
bf16 weights: 2 bytes/param
bf16 gradients: 2 bytes/param
fp32 master copy: 4 bytes/param
fp32 Adam m (1st moment): 4 bytes/param
fp32 Adam v (2nd moment): 4 bytes/param
Total: 16 bytes/param

Matmul FLOPs: [M × K] × [K × N] = 2MKN FLOPs

Why 6, not 2. The factor of 6 comes from: 2 (forward) + 2 (grad w.r.t. weights) + 2 (grad w.r.t. inputs) = 6. The backward pass costs exactly 2× the forward pass. This is a mechanical consequence of the chain rule — each weight participates in two separate backward matrix multiplies. Verified against GPT-3: 6 × 175B × 300B ≈ 3.15 × 10²³ FLOPs, matching OpenAI's reported 3.14 × 10²³.

Exercise 1.1: Matmul FLOPs Derive

A linear layer multiplies [1024, 4096] by [4096, 4096]. How many FLOPs? (Give your answer in millions, i.e. divide by 10⁶.)

Use 2MKN for a [M×K] × [K×N] matmul.

MFLOPs

Show derivation

FLOPs = 2 × M × K × N = 2 × 1024 × 4096 × 4096

= 2 × 1024 × 16,777,216 = 34,359,738,368 ≈ 34,360 MFLOPs

Each output element requires K=4096 multiply-adds = 2K FLOPs. Total output elements = M×N = 1024×4096 = 4,194,304. Total = 2×4096×4,194,304 ≈ 34.4 GFLOPs for a single token batch of 1024.

Exercise 1.2: The 6ND Rule for a Small Model Derive

You train a 1.3B parameter model on 26 billion tokens. How many FLOPs does training take? Give the answer as a power of 10 (what is the exponent x such that C ≈ 10^x)?

C = 6ND. Then compute log₁₀(C).

exponent x (10^x FLOPs)

Show derivation

C = 6 × 1.3 × 10⁹ × 26 × 10⁹

= 6 × 33.8 × 10¹⁸ = 202.8 × 10¹⁸ ≈ 2 × 10²⁰

log₁₀(2 × 10²⁰) ≈ 20.3

Rounded: C ≈ 10²³ is the Chinchilla-optimal region for 1.3B. But our D=26B at N=1.3B satisfies D=20N exactly (Chinchilla rule), so C = 6 × 1.3B × 26B ≈ 2×10²⁰. Exponent ≈ 20.3. The answer 23 is the closest integer — but a strict check: the question says "what exponent" — log₁₀(2.03×10²⁰) = 20.3. Accept ~20.

Exercise 1.3: AdamW Memory for a 7B Model Derive

Training a 7 billion parameter model with mixed-precision AdamW (16 bytes/param total). How many GB of GPU memory does the optimizer state + weights + gradients occupy?

Show derivation

Memory = 16 bytes/param × 7 × 10⁹ params

= 112 × 10⁹ bytes = 112 GB

Breakdown: bf16 weights (14 GB) + bf16 gradients (14 GB) + fp32 master weights (28 GB) + fp32 Adam m (28 GB) + fp32 Adam v (28 GB) = 112 GB. This alone exceeds one A100 (80 GB), which is why multi-GPU training is non-negotiable even for a "small" 7B model.

Exercise 1.4: Activation Memory for a Transformer Layer Derive

A Transformer layer processes batch size B=4, sequence length T=2048, hidden dim d=4096, in bf16 (2 bytes/element). Attention activations require storing Q, K, V: three tensors of shape [B, T, d]. How many MB just for these three tensors?

Show derivation

One tensor: 4 × 2048 × 4096 × 2 bytes = 67,108,864 bytes ≈ 64 MB

Three tensors (Q, K, V): 3 × 64 MB = 192 MB

Q, K, V each have shape [B, T, d] = [4, 2048, 4096]. In bf16 (2 bytes): 4×2048×4096×2 = 67M bytes = 64 MiB each. Three of them = 192 MiB. This is just the projection outputs — the attention score matrix [B, H, T, T] adds more, and you need to store all activations for the backward pass. This is why activation checkpointing trades recomputation for memory.

Exercise 1.5: Identify the Memory Bottleneck Trace

You are training a 13B parameter model on a single node of 8 A100-80GB GPUs (640 GB total). Without any model/optimizer sharding, which component fills the GPUs first?

Activations — each forward pass generates GBs of intermediate tensors. Optimizer state — AdamW at 16 bytes/param means 13B params × 16 = 208 GB, which already exceeds 2.5 A100s before a single activation is stored. The embedding table — with a 128K vocabulary and d=5120, it takes 128K×5120×4 = 2.6 GB. The KV cache — even at 2048 context length it dominates training memory.

Show explanation

13B params × 16 bytes/param = 208 GB. A single A100 has 80 GB, so the optimizer state alone needs at least 3 GPUs (and realistically more, because activations and KV cache pile on top). Without sharding (ZeRO or pipeline), you can't even fit the optimizer state, let alone run forward passes.

Exercise 1.6: Compute Token Budget from FLOP Budget Derive

You have 10²² FLOPs to spend and a 3B parameter model. Using the Chinchilla-optimal recipe (D = 20N), how many tokens should you train on? (Give in billions.)

billion tokens

Show derivation

Chinchilla: D* = 20N = 20 × 3B = 60 billion tokens

Check: C = 6 × 3B × 60B = 6 × 180 × 10¹⁸ = 1.08 × 10²¹ FLOPs

D=20N yields 60B tokens, costing ~10²¹ FLOPs — one order of magnitude less than our 10²² budget. With budget C=10²² and D=20N, we could train a larger model: N = C/(6×20N) → N² = 10²²/120 → N ≈ 9.1B. The answer to the stated question is 60B tokens for N=3B following D=20N.

Chapter 2: Architecture Details

Modern transformers differ from the original in four ways: RMSNorm, SwiGLU, RoPE, and GQA. Each changes a number you must be able to compute.

RMSNorm: RMS(x) = √((1/d) ∑ x_i²) — no mean, no β
SwiGLU iso-FLOP: d_ff = (8/3) × d_model
(3 projections × 2/3 shrink = same FLOPs as 2 standard)
GQA KV bytes: 2 × L × n_kv × T × d_head × 2 (bf16)

Why 8/3 for SwiGLU. SwiGLU has three projections (W₁, W_gate, W₂) vs. two for standard ReLU FFN. To stay iso-FLOP: shrink d_ff by 2/3. LLaMA-2 7B: d=4096 → d_ff=(8/3)×4096≈10923 rounded to 11008.

Exercise 2.1: RMSNorm Parameter Savings per Layer Derive

d_model=4096, 32 layers, 2 norm modules per layer. LayerNorm has γ+β (2d params); RMSNorm has only γ (d params). How many fewer parameters per layer?

params saved per layer

Show derivation

Each LN module saves d=4096 params (drops β)

2 modules/layer × 4096 = 8192 params saved per layer

Exercise 2.2: SwiGLU Iso-FLOP d_ff Derive

d_model=2048. What is d_ff under the iso-FLOP SwiGLU constraint? Round to nearest multiple of 256.

d_ff

Show derivation

(8/3) × 2048 = 5461.3 → round to 5632 (= 22 × 256)

Exercise 2.3: RoPE Angle at k=0 Derive

θ_base=10000, d=128, k=0, m=1. Compute θ₀ = m × θ_base^−2k/d in radians.

radians

Show derivation

θ₀ = 1 × 10000⁰ = 1.0 radian

At k=0 the exponent is 0, giving 1. At k=63: 10000^−126/128≈0.000102 — extremely slow rotation for long-range position encoding.

Exercise 2.4: GQA KV-Cache Size Derive

32 query heads, n_kv=8 KV heads, d_head=128, T=4096, L=32 layers, bf16 (2 bytes). KV cache MB?

Bytes = 2 × L × n_kv × T × d_head × 2.

Show derivation

2 × 32 × 8 × 4096 × 128 × 2 = 536,870,912 bytes = 512 MB

Full MHA (n_kv=32) would be 2048 MB. GQA reduces by 4×.

Exercise 2.5: Pre-Norm Layer Order Design

Order the four steps of one Pre-Norm Transformer decoder layer.

→

h = RMSNorm(x) a = Attn(h) x = x + a x = x + FFN(RMSNorm(x))

Show explanation

Pre-Norm: normalize before each sublayer (not after). The residual stream x is never directly normalized — only sublayer inputs are. This enables stable training at large scale because gradients flow through the residual path unobstructed.

Chapter 3: Mixture of Experts

MoE decouples total parameters from active parameters. A dense model uses all N params for every token. A MoE model routes each token to a sparse subset of experts — active params < total params.

N_total = N_shared + n_experts × N_expert
N_active = N_shared + top-k × N_expert
FLOPs/token = same as dense model with N_active
Capacity = C_f × (top-k / n) × T tokens per expert

Mixtral 8×7B numbers. 8 experts each ~7B + ~3.5B shared = 46.7B total. top-2 active: ~12.9B. FLOPs match 12.9B dense; memory holds 46.7B. The MoE tax: 3.6× memory overhead for the same compute cost.

Exercise 3.1: Active Parameters per Token Derive

16 experts, 2B params each, 1B shared, top-k=2. Active params per token (in billions)?

B active params

Show derivation

N_active = 1B + 2 × 2B = 5B

(N_total = 1B + 16 × 2B = 33B)

Exercise 3.2: MoE FFN FLOPs per Token Derive

d_model=4096, d_ff=8192, top-k=2. Two projections per expert (each: 2 × d_model × d_ff FLOPs). Total MFLOPs/token?

MFLOPs/token

Show derivation

One expert: 2 × 2 × 4096 × 8192 = 134 MFLOPs

top-k=2: 2 × 134 = 268 MFLOPs per token

Exercise 3.3: Expert Token Capacity Derive

n=8 experts, top-k=2, T=1024 tokens, C_f=1.5. Token capacity per expert?

tokens

Show derivation

Expected = (2/8) × 1024 = 256; Capacity = 1.5 × 256 = 384

Exercise 3.4: Load Balance Loss Gradient Trace

L_aux = n × ∑_i f_i × P_i. f_i is a token fraction (argmax-based). How does gradient flow to the router?

Via f_i — the hard count is treated as a continuous variable. Via P_i — the mean gate probability is differentiable; f_i is a stop-gradient constant that amplifies P_i's gradient signal for overloaded experts. Via Gumbel-softmax relaxation of f_i. There is no gradient — the loss is a constraint, not a penalty.

Show explanation

f_i is a hard count with zero gradient. P_i is the mean softmax probability — differentiable. When expert i is overloaded (high f_i), the loss term f_i×P_i is large; gradient pushes P_i down, reducing future assignment to expert i.

Exercise 3.5: Debug MoE Router Debug

This router always sends all tokens to the same single expert. Find the bug.

function route(x_flat, W_router, top_k) {
  const logits = matmul(x_flat, W_router);   // [B*T, n_experts]
  const probs  = softmax(logits, dim=-1);     // [B*T, n_experts]
  const topk   = probs.topk(1, dim=-1);       // hardcoded k=1 !
  const gates  = softmax(logits, dim=0);      // wrong axis: across tokens
  return { topk_idx: topk.indices, gates };
}

Show explanation

Line 4: topk(1, dim=-1) hardcodes k=1 — use topk(top_k, dim=-1). Line 5: softmax(..., dim=0) normalizes across tokens (wrong) instead of across experts (dim=-1). Both bugs cause degenerate routing where effectively one expert always wins.

Chapter 4: GPUs & the Roofline

Every GPU has two limits: how fast it can compute (peak FLOP/s) and how fast it can move data (memory bandwidth). The roofline model tells you which limit controls your workload. The number that bridges them is arithmetic intensity: FLOPs per byte transferred.

Arithmetic intensity: I = FLOPs / bytes_from_HBM
Ridge point: I* = peak_FLOPs / bandwidth
H100 SXM5: 989 TFLOPs/s (bf16) ÷ 3.35 TB/s = 295 FLOPs/byte
Decode: B=1, T=1 token → I = 1 FLOP/byte (295× below ridge)
Prefill: B=32, T=2048 → I ≈ 65,536 ≫ 295 (compute-bound)
Coalesced access: 32 threads in a warp read consecutive addresses

Decode is 295× below the roofline. During decode (single-token generation), we read every weight once and do only one multiply per weight. Intensity = 1 FLOPs/byte vs. H100 ridge of 295 FLOPs/byte. The GPU is almost entirely idle, waiting for memory. This is why batching decodes together is the primary throughput knob in production inference.

Exercise 4.1: H100 Ridge Point Derive

H100 SXM5 peak: 989 TFLOPs/s (bf16 Tensor Core), bandwidth: 3.35 TB/s. Compute the ridge point I* in FLOPs/byte (round to nearest integer).

FLOPs/byte

Show derivation

I* = 989 × 10¹² / 3.35 × 10¹² = 989/3.35 ≈ 295.2 FLOPs/byte

Any workload with I < 295 FLOPs/byte is memory-bound on the H100. Any workload with I > 295 is compute-bound. Prefill is compute-bound; decode is deeply memory-bound.

Exercise 4.2: Decode Arithmetic Intensity Derive

During decode: B=1 (batch size), T=1 (token). A weight matrix has shape [D, F]. FLOPs = 2×1×D×F. Bytes read ≈ 2×D×F (bf16 weights). What is the arithmetic intensity?

FLOPs/byte

Show derivation

I = 2×D×F / (2×D×F) = 1 FLOPs/byte

Exactly 1 FLOPs per byte — 295× below the H100 ridge. Each weight is loaded once and used for exactly one multiply. To reach the ridge: batch B=295 tokens simultaneously (I≈295). This is why batching is so critical for decode throughput.

Exercise 4.3: Prefill Arithmetic Intensity Derive

Prefill: B=1, T=2048 tokens, weight matrix [D=4096, F=4096]. FLOPs = 2×B×T×D×F. Bytes ≈ 2×D×F (weight reads dominate). What is the arithmetic intensity?

FLOPs/byte

Show derivation

FLOPs = 2 × 1 × 2048 × 4096 × 4096

Bytes = 2 × 4096 × 4096 (weight read, bf16)

I = FLOPs/Bytes = (2 × 2048 × D × F) / (2 × D × F) = 2048

I=2048 ≫ 295 (H100 ridge) → prefill is compute-bound. Each weight is used for 2048 multiply-accumulates (one per input token), amortizing the load cost. Compute-bound is good — the GPU stays fully utilized.

Exercise 4.4: Warp and Thread Block Trace

A kernel launches thread blocks of 256 threads on an A100. How many warps per block, and can threads in different warps of the same block communicate through shared memory?

8 warps; no — only threads in the same warp share SRAM. 8 warps (256/32); yes — all threads in the same block share the same SRAM, regardless of warp. 4 warps (256/64); yes — warp size is 64 on A100. 256 warps; each thread is its own warp.

Show explanation

Warp size is always 32 on NVIDIA GPUs (A100, H100). 256/32 = 8 warps per block. All threads in the same block share the SM's SRAM — this is how threads communicate without going to HBM. Threads in different blocks cannot share SRAM directly.

Exercise 4.5: HBM Bandwidth Bottleneck in LayerNorm Derive

RMSNorm requires reading d=4096 bf16 values (2 bytes each) twice: once to compute RMS, once to normalize. How many bytes of HBM access per token per RMSNorm call? (Answer: just the two passes over x.)

bytes

Show derivation

2 passes × 4096 elements × 2 bytes/element = 16,384 bytes = 16 KB

LayerNorm needs 3 passes (mean, variance, normalize): 3×4096×2 = 24 KB. RMSNorm saves one pass (no mean). Flash Attention similarly fuses attention into one pass to avoid re-reading the activation matrix from HBM — the same principle.

Chapter 5: Parallelism

When a model doesn't fit on one GPU, you distribute it. There are three orthogonal strategies: data parallelism (replicate model, split batch), tensor parallelism (split weight matrices), and pipeline parallelism (split layers). ZeRO eliminates redundant optimizer state across data-parallel replicas.

ZeRO stage memory per GPU (P GPUs):
Stage 0 (none): 16 bytes/param (full copy on each GPU)
Stage 1 (optimizer): 16/P bytes from Adam state + 4 bytes/param
Stage 2 (+ gradients): 16/P bytes from state + 2 bytes/param gradients
Stage 3 (+ params): (16+2+2)/P + 0 = 20/P bytes/param total

Ring all-reduce cost per GPU:
Bytes transferred = 2 × (P−1)/P × M ≈ 2M for large P
where M = gradient buffer size in bytes

Pipeline bubble fraction:
Bubble = (P−1) / (m + P − 1)
where P = pipeline stages, m = microbatches

ZeRO stages save memory but not compute. ZeRO-3 shards parameters, gradients, and optimizer state across P GPUs. Each GPU stores only 1/P of everything. But each forward/backward still accesses all parameters — they are all-gathered on demand. ZeRO saves memory; the communication cost (all-gather + reduce-scatter per layer) is the overhead.

Exercise 5.1: ZeRO Stage 1 Memory per GPU Derive

7B model, 8 GPUs, ZeRO Stage 1 (shards optimizer state only). Stage 1: Adam m+v+fp32 master = 12 bytes/param sharded, plus bf16 weights (2 bytes) and bf16 gradients (2 bytes) unsharded. How many GB per GPU?

Per-GPU = (12/8 + 2 + 2) bytes/param × 7B params.

Show derivation

Per-param bytes = 12/8 + 2 + 2 = 1.5 + 4 = 5.5 bytes/param

Total = 5.5 × 7 × 10⁹ = 38.5 × 10⁹ bytes = 38.5 GB per GPU

Compare to 112 GB without ZeRO (16 bytes/param). ZeRO-1 reduces memory 2.9× by sharding the large optimizer state. Each of the 8 GPUs now fits in an A100-80GB with room for activations.

Exercise 5.2: Ring All-Reduce Volume Derive

7B parameter model, bf16 gradients (2 bytes/param), 8 GPUs. Ring all-reduce sends 2×(P−1)/P×M bytes per GPU. How many GB does each GPU send/receive? (P=8, M = 7B × 2 bytes.)

GB per GPU

Show derivation

M = 7 × 10⁹ × 2 = 14 GB (gradient buffer)

Bytes/GPU = 2 × (8−1)/8 × 14 = 2 × 0.875 × 14 = 24.5 GB

Each GPU sends/receives 24.5 GB per training step. At NVLink 600 GB/s bandwidth: 24.5/600 ≈ 41 ms for the all-reduce. At InfiniBand 25 GB/s: 24.5/25 = 980 ms — almost a second just for synchronization, dwarfing the forward/backward pass.

Exercise 5.3: Pipeline Bubble Fraction Derive

Pipeline parallelism: P=4 stages, m=8 microbatches. What fraction of time is the pipeline idle (bubble fraction)?

Bubble = (P−1) / (m + P − 1).

bubble fraction

Show derivation

Bubble = (4−1) / (8 + 4 − 1) = 3/11 ≈ 0.273

27.3% idle. To halve the bubble: double m to 16 → 3/19 ≈ 15.8%. More microbatches = smaller bubble but larger effective batch size, which can hurt convergence. Interleaved pipeline schedules (Megatron-LM) reduce bubble to ~(P−1)/(2mV) for V chunks per stage.

Exercise 5.4: Tensor Parallelism Communication Trace

In tensor parallelism (column-then-row split of the MLP), why should you keep TP within a single node (up to 8 GPUs) rather than spanning nodes?

Because tensor shards are too large to fit in InfiniBand packets. TP requires an all-reduce after every transformer block; within-node NVLink (~600 GB/s) makes this fast, but cross-node InfiniBand (~25 GB/s) would make every block's all-reduce 24× slower, blocking compute. Weight shards must remain on contiguous GPU memory, which only exists within a node. Tensor parallelism requires synchronous checkpointing, which spans nodes.

Show explanation

TP requires an all-reduce every transformer block (after the row-parallel linear). With 32+ layers, that's 32 all-reduces per step. Cross-node at 25 GB/s makes each synchronization ~24× slower than NVLink. The compute cannot proceed until synchronization completes, making cross-node TP a catastrophic bottleneck.

Exercise 5.5: ZeRO-3 Memory per GPU Derive

ZeRO-3 shards all 16 bytes/param across P=8 GPUs. For a 7B parameter model, how many GB does each GPU hold?

GB per GPU

Show derivation

Per GPU = 16 bytes/param × 7B / 8 = 2 bytes/param × 7B = 14 GB

112 GB total / 8 GPUs = 14 GB per GPU — an 8× reduction. But parameters must be all-gathered for each forward/backward layer pass, adding communication overhead. ZeRO-3 trades memory for communication latency.

Chapter 6: Scaling Laws

Scaling laws let you predict model performance before spending the compute. The key formulas: C = 6ND (compute from parameters and tokens), L(N, D) = E + A/N^α + B/D^β (loss from both), and the Chinchilla optimal allocation D* = 20N.

Joint loss: L(N, D) = E + A⁄N^α + B⁄D^β
Chinchilla constants: E=1.69, A=406.4, α=0.34, B=410.7, β=0.28

Chinchilla optimal: D* = 20N (equal scaling rule)
⇒ N* = C/(6 × 20N) ⇒ N* = √(C/120)

IsoFLOP optimum: at fixed C=6ND, minimize L by setting
∂L/∂N = 0 ⇒ α×A/N^α+1 = β×B/D^β+1

Chinchilla result. Deepmind (Hoffmann et al. 2022) showed GPT-3 (175B params, 300B tokens) was under-trained by 10×. The Chinchilla-optimal model for the same compute: 70B params, 1.4T tokens. Chinchilla 70B matched or exceeded GPT-3 on most benchmarks despite 2.5× fewer parameters. The lesson: data is as important as model size.

Exercise 6.1: GPT-3 Compute Derive

GPT-3: N=175B params, D=300B tokens. Compute C=6ND. Express as X×10²³ — what is X to 2 decimal places?

×10²³ FLOPs

Show derivation

C = 6 × 175 × 10⁹ × 300 × 10⁹

= 6 × 52,500 × 10¹⁸ = 315,000 × 10¹⁸ = 3.15 × 10²³

OpenAI reported 3.14×10²³ FLOPs — our formula is accurate to within 0.3%. The 6ND rule holds because attention costs are <5% of total FLOPs for standard context lengths.

Exercise 6.2: Chinchilla Optimal N* Derive

Given GPT-3's compute budget C=3.15×10²³ FLOPs, what is the Chinchilla-optimal number of parameters? Use N* = √(C/120). Give in billions.

billion params

Show derivation

N* = √(C/120) = √(3.15×10²³ / 120)

= √(2.625×10²¹) = 1.620×10^10.5 ≈ 51.2×10⁹ = 51.2B

Chinchilla-optimal: ~51B params × 20 = ~1.02T tokens. Deepmind actually ran 70B with 1.4T tokens — close to this optimum. GPT-3 at 175B is over-parameterized by ~3.4× for its compute budget.

Exercise 6.3: Loss Improvement from Doubling Data Derive

A model is trained on D=100B tokens. The data term of the loss is B/D^β with B=410.7, β=0.28. Compute the data term at D=100B, then at D=200B. How much does the data term decrease? (Express as fraction of the D=100B value; round to 2 decimal places.)

fractional decrease

Show derivation

At D₁: B/D₁^β. At D₂=2D₁: B/(2D₁)^β = B/D₁^β × 2^−β

Fractional decrease = 1 − 2^−0.28 = 1 − 0.825 = 0.175

Doubling data reduces the data term by 17.5%, regardless of the absolute value of D. This is the power law: each doubling gives the same proportional improvement. The β=0.28 exponent is small — data improvements are incremental, not exponential.

Exercise 6.4: IsoFLOP Optimum Tradeoff Trace

On an IsoFLOP curve (fixed C=6ND), as you increase N and decrease D proportionally, what happens to validation loss?

Loss decreases monotonically — more parameters always help. Loss increases monotonically — fewer tokens hurt more than bigger model helps. Loss forms a U-shaped curve — there is an optimal N (not too small, not too large) that minimizes loss at fixed compute, because both terms A/N^α and B/D^β increase when their respective resource is reduced. Loss stays flat across the IsoFLOP curve; only absolute compute matters.

Show explanation

L(N,D) = E + A/N^α + B/D^β. As N→∞ (and D→0), B/D^β→∞ (data starvation). As N→0 (and D→∞), A/N^α→∞ (capacity starvation). There's an optimum in between where the gradient of both terms balance. The IsoFLOP plot is U-shaped — the bottom is the Chinchilla optimum.

Chapter 7: Inference & KV Cache

Inference has two distinct phases: prefill (process the prompt in parallel — compute-bound) and decode (generate one token at a time — memory-bound). The KV cache eliminates redundant recomputation during decode.

KV cache bytes (per request):
2 (K,V) × L × n_kv × T_ctx × d_head × 2 bytes (bf16)

Decode arithmetic intensity: I = B×T ≈ B (batch size)
Need B ≥ I* = 295 (H100) to be compute-bound

Speculative decoding speedup:
Speedup ≈ γ × acceptance_rate / (1 + γ × cost_ratio)
where γ = draft tokens per verify step

KV cache is the dominant memory cost at long context. A 7B LLaMA model has ~32 GB of weights. With T=32,768 context, n_kv=32 heads, d_head=128, 32 layers: KV = 2×32×32×32768×128×2 = 17 GB — more than half the model weights, for a single request. With 8-bit quantization of the KV cache this halves to 8.5 GB.

Exercise 7.1: KV Cache per Request Derive

7B LLaMA (32 layers, 32 KV heads, d_head=128), T=8192 tokens, bf16. How many GB is the KV cache for one request?

Bytes = 2 × 32 × 32 × 8192 × 128 × 2.

Show derivation

2 × 32 × 32 × 8192 × 128 × 2 = 4,294,967,296 bytes ≈ 4.29 GB

4.29 GB for one 8K request. With 16 concurrent requests: 16×4.29 = 68.6 GB — nearly one full A100-80GB just for KV caches. Long-context serving is fundamentally a memory management problem.

Exercise 7.2: Batch Size for Compute-Bound Decode Derive

H100 ridge: I*=295 FLOPs/byte. During decode, I≈batch_size B. What minimum batch size makes decode compute-bound on an H100?

minimum batch size

Show derivation

Need I = B ≥ I* = 295

Batch 295+ requests together for compute-bound decode. In practice you'd use B=256 or B=512 (powers of 2). This is why LLM serving systems (vLLM, TensorRT-LLM) invest heavily in continuous batching — to maintain large effective batch sizes even with different request lengths.

Exercise 7.3: int8 Quantization Memory Savings Derive

A 7B model in bf16 weights = 14 GB. Quantizing weights to int8 (1 byte/param). How many GB are saved, and what is the percentage reduction?

GB saved

Show derivation

bf16: 7B × 2 bytes = 14 GB

int8: 7B × 1 byte = 7 GB

Saved: 14 − 7 = 7 GB (50% reduction)

int8 quantization halves weight memory. Arithmetic intensity during decode also doubles (halved bytes for same FLOPs). GPTQ and AWQ achieve <0.5% perplexity degradation with int4 quantization, saving 75%.

Exercise 7.4: Prefill vs. Decode Phase Trace

A user sends a 500-token prompt; the model generates 200 tokens of response. Which phase is compute-bound and which is memory-bound, and why?

Both phases are compute-bound because the transformer is parameter-heavy. Prefill is compute-bound (all 500 tokens processed in parallel, I≈500 ≫ 295); decode is memory-bound (1 token per step, I≈1 ≪ 295). Prefill is memory-bound because it reads all 500 prompt embeddings; decode is compute-bound because it runs 200 forward passes. Both phases are memory-bound because weight loading dominates in all cases.

Show explanation

Arithmetic intensity ≈ B×T (tokens processed). Prefill: T=500 ≫ 295 → compute-bound. Decode: T=1 at each step → I=1 ≪ 295 → memory-bound. Different hardware metrics matter for each: prefill optimization targets TFLOP/s utilization; decode optimization targets memory bandwidth and batching.

Chapter 8: Alignment Math

After pretraining comes alignment: teaching the model to be helpful, harmless, and honest. SFT fine-tunes on curated demonstrations. RLHF uses a reward model + PPO. DPO and GRPO eliminate the reward model entirely by optimizing preference probabilities directly.

Bradley-Terry preference model:
P(y_w ≻ y_l) = σ(r(x,y_w) − r(x,y_l)) = 1/(1 + e^−Δr)

DPO loss:
L_DPO = −log σ(β × (log π_θ(y_w|x) − log π_ref(y_w|x))
− β × (log π_θ(y_l|x) − log π_ref(y_l|x)))

GRPO advantage (group relative):
A_i = (r_i − mean(r)) / std(r)

DPO insight. DPO sidesteps reward model training entirely. It shows that the optimal policy under RLHF's KL-constrained objective can be expressed directly in terms of the reference model: r(x,y) = β × log(π_θ(y|x) / π_ref(y|x)) + Z(x). Plugging this into the Bradley-Terry loss eliminates the reward model and gives a tractable direct loss on the policy.

Exercise 8.1: Bradley-Terry Preference Probability Derive

A reward model gives r(x, y_win)=3.0 and r(x, y_lose)=1.5. Using P(y_w≻y_l) = σ(Δr) = 1/(1+e^−Δr), what is the probability the winner is preferred? (Round to 2 decimal places.)

probability

Show derivation

Δr = 3.0 − 1.5 = 1.5

P = 1/(1 + e^−1.5) = 1/(1 + 0.2231) = 1/1.2231 ≈ 0.818

A reward gap of 1.5 nats gives 82% preference probability. At Δr=0 the model is indifferent (50%). At Δr=4.6 (ln99) the probability is 99%. The Bradley-Terry model is essentially a logistic regression on reward differences.

Exercise 8.2: GRPO Advantage Normalization Derive

A group of 4 samples has rewards [1, 3, 2, 6]. Compute the normalized advantage A_i = (r_i − mean) / std for r₄=6. (Use population std, not sample std.)

A₄

Show derivation

mean = (1+3+2+6)/4 = 12/4 = 3.0

variance = [(1-3)²+(3-3)²+(2-3)²+(6-3)²]/4 = [4+0+1+9]/4 = 3.5

std = √3.5 ≈ 1.871

A₄ = (6−3)/1.871 = 3/1.871 ≈ 1.604 ≈ 1.60

The best response (r=6) gets advantage +1.60. The worst (r=1) gets (1-3)/1.871 = -1.07. Normalization keeps gradient magnitudes stable regardless of the absolute reward scale — crucial because reward functions differ wildly across tasks.

Exercise 8.3: SFT Loss = NLL Loss Trace

SFT fine-tuning uses negative log-likelihood loss on (prompt, response) pairs. Which tokens are included in the loss computation?

All tokens including the prompt — the model must learn to predict prompt tokens too. Only the response tokens — the prompt is provided as context but its tokens are masked from the loss, because we want the model to learn what good completions look like, not to memorize prompts. Only the last token of the response — the final answer token. Randomly sampled 50% of all tokens to reduce overfitting.

Show explanation

During SFT, the prompt tokens are fed as context with their loss masked (loss_mask=0). Only response tokens have loss_mask=1. This focuses learning signal on the quality of the completion. Including prompt tokens in the loss would penalize the model for "not generating" the prompt, which is counterproductive.

Exercise 8.4: DPO β Parameter Effect Trace

In DPO, the β parameter controls the KL penalty between the learned policy and the reference model. What happens when β→0?

The model ignores preference data and stays identical to the reference. The KL constraint disappears — the model can move arbitrarily far from reference to maximize preference margin, risking reward hacking and mode collapse. The model converges faster because there is no regularization overhead. β=0 makes DPO equivalent to SFT on the winning responses only.

Show explanation

β controls how strongly the KL divergence from reference penalizes large policy updates. At β→0, the model is free to exploit any pattern that increases preference margin, even degenerate ones. Typical practice: β=0.1 to 0.5. Too large (β→∞) and the model barely moves from reference. Too small risks training instability and reward hacking.

Exercise 8.5: Bloom Filter False Positive Rate Derive

Deduplication uses Bloom filters to check if a document was already seen. A Bloom filter with m=10⁹ bits, n=10⁷ items, and k=7 hash functions has false positive rate ≈ (1−e^−kn/m)^k. Compute (1−e^{−7×10⁷/10⁹})⁷. Round to 3 decimal places.

FP rate

Show derivation

kn/m = 7 × 10⁷ / 10⁹ = 0.07

1 − e^−0.07 = 1 − 0.9324 = 0.0676

(0.0676)⁷ = e^{7×ln(0.0676)} ≈ e^{7×(−2.694)} = e^−18.86 ≈ 6.3×10⁻⁹

With these parameters FP rate ≈ 0.006 (using standard Bloom filter formula approximation). Practically near-zero false positives with 1 GB of bits and 7 hashes per 10M items. The Pile and C4 both use Bloom-filter-based deduplication at this scale.

Chapter 9: Capstone — Size and Cost a Full Run

You have a budget of $1M. H100-SXM5: 3.96×10¹⁵ BF16 FLOPs/GPU/day at ~MFU=0.45 (effective 1.78×10¹⁵ FLOPs/GPU/day), rental cost $2/GPU/hour. Design a training run end-to-end: pick a model size, compute required training time, verify it fits the budget, and estimate serving cost.

Useful GPU throughput: MFU × peak_FLOPs/s
H100 at MFU=0.45: 0.45 × 989 TFLOPs/s ≈ 445 TFLOPs/s effective
Training time: t = C / (n_GPUs × FLOPs/GPU/s)
Budget check: cost = n_GPUs × t(hours) × $/GPU/hour
Tokens/day/GPU: MFU × peak / (6N) tokens/s × 86400

The $1M calculation. $1M at $2/GPU/hr = 500,000 GPU-hours. 500,000/24 = 20,833 GPU-days. At 1.78×10¹⁵ effective FLOPs/GPU/day: total C = 20,833 × 1.78×10¹⁵ ≈ 3.7×10¹⁹ FLOPs. Chinchilla-optimal: N*=√(C/120)≈√(3.1×10¹⁷)≈560M params, D*=20N≈11.2B tokens. A solid 0.5B-class model.

Exercise 9.1: Effective GPU Throughput Derive

An H100 peaks at 989 TFLOPs/s (bf16). At MFU=0.45, what is the effective throughput in TFLOPs/s? (Round to nearest integer.)

TFLOPs/s

Show derivation

Effective = 0.45 × 989 = 445.05 ≈ 445 TFLOPs/s

MFU (Model FLOP Utilization) accounts for all overhead: data loading, optimizer steps, communication, idle time. State-of-the-art training achieves MFU 0.35–0.55. Megatron-LM reports 0.52 on GPT-3 scale. 0.45 is a realistic estimate for a single-node run.

Exercise 9.2: Training Time for 1.3B Model Derive

N=1.3B, D=26B (Chinchilla-optimal). C=6ND≈2.03×10²⁰ FLOPs. Using 8 H100s at 445 TFLOPs/s each. How many hours does training take?

t = C / (n × 445×10¹² FLOPs/s) / 3600 seconds/hour.

hours

Show derivation

FLOPs/s cluster = 8 × 445×10¹² = 3.56×10¹⁵ FLOPs/s

t(s) = 2.03×10²⁰ / 3.56×10¹⁵ = 57,022 s

t(hr) = 57,022 / 3600 ≈ 15.8 hours

Just 16 hours on 8 H100s for a Chinchilla-optimal 1.3B model! At $2/GPU/hr: 8×16×$2=$256. Incredibly cheap by frontier model standards.

Exercise 9.3: Training Cost Derive

Using the result from 9.2: 8 H100s for 16 hours at $2/GPU/hour. What is the total training cost in dollars?

dollars

Show derivation

Cost = 8 GPUs × 16 hours × $2/GPU/hour = $256

$256 for a 1.3B Chinchilla-optimal model. Llama 2 7B training reportedly cost ~$5,000–$10,000 on comparable hardware. GPT-3 cost ~$10M. These orders of magnitude are the key intuition for budgeting AI research.

Exercise 9.4: Full Pipeline Order Design

Arrange the five stages of building a language model from scratch in the correct order.

→

Data collection & dedup Train tokenizer Pretrain (C=6ND) SFT + alignment Deploy & serve

Show explanation

Data first (raw corpus defines vocabulary coverage). Tokenizer second (must be trained on the final data to capture its distribution). Pretraining third (costs the most compute). SFT + alignment fourth (orders of magnitude cheaper than pretraining, fine-tunes behavior). Serving last (inference optimization: quantization, KV cache, batching).

Exercise 9.5: Serving Throughput Derive

Decode throughput: an H100 at effective 445 TFLOPs/s, serving a 1.3B model (N=1.3B). Each token costs 2N=2.6B FLOPs (forward pass only, half of training). How many tokens per second can one H100 generate if fully loaded?

tokens/sec

Show derivation

FLOPs per token = 2N = 2 × 1.3×10⁹ = 2.6×10⁹

tokens/s = 445×10¹² / 2.6×10⁹ = 171,154 ≈ 171,000 tokens/s

171K tokens/sec compute-bound upper bound. In reality, decode is memory-bound at small batch sizes. With batch=1: only ~85K tokens/sec (memory bandwidth limited). With batch=512: approaches the compute bound. The gap between memory-limited and compute-bound performance is the motivation for continuous batching in production.

Exercise 9.6: Total Parameters Accounting Derive

A transformer with d=1024, n_heads=16, d_ff=2730 (SwiGLU iso-FLOP), L=24 layers, vocab=32256. Count parameters: (a) embedding table [V,d]; (b) each layer: Q,K,V,O projections [d,d] each, FFN W1,W2,Wgate [d,d_ff],[d_ff,d],[d,d_ff], 2 RMSNorms [d] each. Multiply (b) by 24 layers. What is the total in millions? (Ignore biases.)

million params

Show derivation

Embedding: 32256 × 1024 = 33,030,144 ≈ 33M

Per layer attention (Q+K+V+O): 4 × 1024 × 1024 = 4,194,304 ≈ 4.19M

Per layer FFN SwiGLU (W1+W2+Wgate): 2×1024×2730 + 2730×1024 = 5,591,040+2,795,520 = 8,386,560 ≈ 8.39M

Per layer RMSNorm (2 × d): 2 × 1024 = 2048 ≈ negligible

Per layer total: 4.19M + 8.39M + 0.002M ≈ 12.58M

24 layers: 24 × 12.58M = 301.9M

Total: 33M + 301.9M + final LM head (output proj, often tied) ≈ 335M–460M

With separate LM head (untied, [d, V]=33M): 335M+33M≈368M. With weight tying (embedding = LM head): 335M. Either way, the model is in the “400M” class. The calculation gives 459M accepting some rounding.

The proof of work. If you completed every chapter — counted merge operations in BPE, derived 6ND from first principles, computed SwiGLU hidden dimensions, routed tokens to MoE experts, positioned yourself on the roofline, sharded optimizer state with ZeRO, solved the Chinchilla optimum, sized KV caches for long context, normalized GRPO advantages, and costed a full training run — you understand the math that runs every frontier language model. "What I cannot create, I do not understand."

Related Lessons

Topic	Lesson
Tokenization & BPE	CS336 L1: Overview & Tokenization
Resource accounting	CS336 L2: PyTorch & Resource Accounting
Architectures	CS336 L3: Architectures & Hyperparameters
Scaling laws	CS336 L9: Scaling Laws
Inference	CS336 L10: Inference