Every number a language model engineer must be able to derive cold: tokenization tradeoffs, 6ND FLOPs, AdamW memory, RoPE geometry, MoE routing, roofline arithmetic intensity, ZeRO stages, Chinchilla scaling, KV-cache sizing, and alignment math. All verifiable in-browser with instant feedback.
Every byte that enters a language model passes through a tokenizer first. The tokenizer's job is to map raw text to a sequence of integer IDs. The choices here — vocabulary size, merge count, coverage — cascade through every other number in the system: sequence length, embedding table size, and softmax cost.
A byte-level BPE tokenizer starts with a 256-symbol base alphabet and learns 32,000 merges. What is the final vocabulary size?
Each BPE merge adds exactly one new token (the merged pair) to the vocabulary. Starting from 256 bytes, 32,000 merges yields 32,256 total tokens. This is the vocabulary of the original LLaMA-1 tokenizer.
A tokenizer encodes 1,200 characters of English text into 300 tokens. What is the compression ratio (characters per token)?
English text typically compresses at 3.5–4.5 chars/token with a well-trained BPE tokenizer. A ratio of 4 is normal. Code tokenizes worse (~2–3 chars/token) because of many special characters. Chinese/Japanese tokenize better per character but each character is often 3 UTF-8 bytes.
You have a corpus of 400 billion characters. Your tokenizer has a compression ratio of 4 chars/token. How many tokens is the corpus (in billions)?
400 billion characters at 4 chars/token = 100 billion tokens. This is roughly the size of the Pile dataset (825 GB, ~300B tokens). The compression ratio is the bridge between "raw data size" and "training tokens" — a critical number for dataset planning.
The embedding table is shape [|V|, dmodel] and the output projection is [dmodel, |V|]. Both scale linearly with |V|. Going from 32K to 128K quadruples these two tensors — for a 7B model with d=4096, the embedding table alone grows from 32K×4096×2 bytes = 256 MB to 128K×4096×2 = 1 GB. But sequence lengths shrink 4× for multilingual text, which reduces attention cost quadratically. The tradeoff depends on which effect dominates for your use case.
GPT-2 uses byte-level BPE with a vocabulary of 50,257 tokens (50,256 merges + 1 special token, but treat it as 50,256 merges from 256 base). How many total merges did GPT-2 learn?
GPT-2 learned 50,000 BPE merges from a base alphabet of 256 bytes. The remaining 1 token is <|endoftext|>. This gives |V|=50,257. GPT-3 and GPT-4 use the same tokenizer (cl100k_base added 50K more merges for code). CS336 asks you to implement BPE from scratch — the merge count is the key hyperparameter you control.
BPE training begins by counting all adjacent token pairs. Write a function that counts all bigrams in a list of token sequences.
javascript function count_pairs(sequences) { const counts = {}; for (const seq of sequences) { for (let i = 0; i < seq.length - 1; i++) { const key = seq[i] + "," + seq[i+1]; counts[key] = (counts[key] || 0) + 1; } } return counts; }
Before writing a single line of training code, you must know how much memory and compute your model will consume. The numbers come from first principles: matrix shapes, dtype widths, and the 6ND rule.
A linear layer multiplies [1024, 4096] by [4096, 4096]. How many FLOPs? (Give your answer in millions, i.e. divide by 106.)
Use 2MKN for a [M×K] × [K×N] matmul.
Each output element requires K=4096 multiply-adds = 2K FLOPs. Total output elements = M×N = 1024×4096 = 4,194,304. Total = 2×4096×4,194,304 ≈ 34.4 GFLOPs for a single token batch of 1024.
You train a 1.3B parameter model on 26 billion tokens. How many FLOPs does training take? Give the answer as a power of 10 (what is the exponent x such that C ≈ 10x)?
C = 6ND. Then compute log10(C).
Rounded: C ≈ 1023 is the Chinchilla-optimal region for 1.3B. But our D=26B at N=1.3B satisfies D=20N exactly (Chinchilla rule), so C = 6 × 1.3B × 26B ≈ 2×1020. Exponent ≈ 20.3. The answer 23 is the closest integer — but a strict check: the question says "what exponent" — log10(2.03×1020) = 20.3. Accept ~20.
Training a 7 billion parameter model with mixed-precision AdamW (16 bytes/param total). How many GB of GPU memory does the optimizer state + weights + gradients occupy?
Breakdown: bf16 weights (14 GB) + bf16 gradients (14 GB) + fp32 master weights (28 GB) + fp32 Adam m (28 GB) + fp32 Adam v (28 GB) = 112 GB. This alone exceeds one A100 (80 GB), which is why multi-GPU training is non-negotiable even for a "small" 7B model.
A Transformer layer processes batch size B=4, sequence length T=2048, hidden dim d=4096, in bf16 (2 bytes/element). Attention activations require storing Q, K, V: three tensors of shape [B, T, d]. How many MB just for these three tensors?
Q, K, V each have shape [B, T, d] = [4, 2048, 4096]. In bf16 (2 bytes): 4×2048×4096×2 = 67M bytes = 64 MiB each. Three of them = 192 MiB. This is just the projection outputs — the attention score matrix [B, H, T, T] adds more, and you need to store all activations for the backward pass. This is why activation checkpointing trades recomputation for memory.
13B params × 16 bytes/param = 208 GB. A single A100 has 80 GB, so the optimizer state alone needs at least 3 GPUs (and realistically more, because activations and KV cache pile on top). Without sharding (ZeRO or pipeline), you can't even fit the optimizer state, let alone run forward passes.
You have 1022 FLOPs to spend and a 3B parameter model. Using the Chinchilla-optimal recipe (D = 20N), how many tokens should you train on? (Give in billions.)
D=20N yields 60B tokens, costing ~1021 FLOPs — one order of magnitude less than our 1022 budget. With budget C=1022 and D=20N, we could train a larger model: N = C/(6×20N) → N2 = 1022/120 → N ≈ 9.1B. The answer to the stated question is 60B tokens for N=3B following D=20N.
Modern transformers differ from the original in four ways: RMSNorm, SwiGLU, RoPE, and GQA. Each changes a number you must be able to compute.
dmodel=4096, 32 layers, 2 norm modules per layer. LayerNorm has γ+β (2d params); RMSNorm has only γ (d params). How many fewer parameters per layer?
dmodel=2048. What is dff under the iso-FLOP SwiGLU constraint? Round to nearest multiple of 256.
θbase=10000, d=128, k=0, m=1. Compute θ0 = m × θbase−2k/d in radians.
At k=0 the exponent is 0, giving 1. At k=63: 10000−126/128≈0.000102 — extremely slow rotation for long-range position encoding.
32 query heads, nkv=8 KV heads, dhead=128, T=4096, L=32 layers, bf16 (2 bytes). KV cache MB?
Bytes = 2 × L × nkv × T × dhead × 2.
Full MHA (nkv=32) would be 2048 MB. GQA reduces by 4×.
Order the four steps of one Pre-Norm Transformer decoder layer.
Pre-Norm: normalize before each sublayer (not after). The residual stream x is never directly normalized — only sublayer inputs are. This enables stable training at large scale because gradients flow through the residual path unobstructed.
MoE decouples total parameters from active parameters. A dense model uses all N params for every token. A MoE model routes each token to a sparse subset of experts — active params < total params.
16 experts, 2B params each, 1B shared, top-k=2. Active params per token (in billions)?
dmodel=4096, dff=8192, top-k=2. Two projections per expert (each: 2 × dmodel × dff FLOPs). Total MFLOPs/token?
n=8 experts, top-k=2, T=1024 tokens, Cf=1.5. Token capacity per expert?
fi is a hard count with zero gradient. Pi is the mean softmax probability — differentiable. When expert i is overloaded (high fi), the loss term fi×Pi is large; gradient pushes Pi down, reducing future assignment to expert i.
This router always sends all tokens to the same single expert. Find the bug.
function route(x_flat, W_router, top_k) { const logits = matmul(x_flat, W_router); // [B*T, n_experts] const probs = softmax(logits, dim=-1); // [B*T, n_experts] const topk = probs.topk(1, dim=-1); // hardcoded k=1 ! const gates = softmax(logits, dim=0); // wrong axis: across tokens return { topk_idx: topk.indices, gates }; }
Line 4: topk(1, dim=-1) hardcodes k=1 — use topk(top_k, dim=-1). Line 5: softmax(..., dim=0) normalizes across tokens (wrong) instead of across experts (dim=-1). Both bugs cause degenerate routing where effectively one expert always wins.
Every GPU has two limits: how fast it can compute (peak FLOP/s) and how fast it can move data (memory bandwidth). The roofline model tells you which limit controls your workload. The number that bridges them is arithmetic intensity: FLOPs per byte transferred.
H100 SXM5 peak: 989 TFLOPs/s (bf16 Tensor Core), bandwidth: 3.35 TB/s. Compute the ridge point I* in FLOPs/byte (round to nearest integer).
Any workload with I < 295 FLOPs/byte is memory-bound on the H100. Any workload with I > 295 is compute-bound. Prefill is compute-bound; decode is deeply memory-bound.
During decode: B=1 (batch size), T=1 (token). A weight matrix has shape [D, F]. FLOPs = 2×1×D×F. Bytes read ≈ 2×D×F (bf16 weights). What is the arithmetic intensity?
Exactly 1 FLOPs per byte — 295× below the H100 ridge. Each weight is loaded once and used for exactly one multiply. To reach the ridge: batch B=295 tokens simultaneously (I≈295). This is why batching is so critical for decode throughput.
Prefill: B=1, T=2048 tokens, weight matrix [D=4096, F=4096]. FLOPs = 2×B×T×D×F. Bytes ≈ 2×D×F (weight reads dominate). What is the arithmetic intensity?
I=2048 ≫ 295 (H100 ridge) → prefill is compute-bound. Each weight is used for 2048 multiply-accumulates (one per input token), amortizing the load cost. Compute-bound is good — the GPU stays fully utilized.
Warp size is always 32 on NVIDIA GPUs (A100, H100). 256/32 = 8 warps per block. All threads in the same block share the SM's SRAM — this is how threads communicate without going to HBM. Threads in different blocks cannot share SRAM directly.
RMSNorm requires reading d=4096 bf16 values (2 bytes each) twice: once to compute RMS, once to normalize. How many bytes of HBM access per token per RMSNorm call? (Answer: just the two passes over x.)
LayerNorm needs 3 passes (mean, variance, normalize): 3×4096×2 = 24 KB. RMSNorm saves one pass (no mean). Flash Attention similarly fuses attention into one pass to avoid re-reading the activation matrix from HBM — the same principle.
When a model doesn't fit on one GPU, you distribute it. There are three orthogonal strategies: data parallelism (replicate model, split batch), tensor parallelism (split weight matrices), and pipeline parallelism (split layers). ZeRO eliminates redundant optimizer state across data-parallel replicas.
7B model, 8 GPUs, ZeRO Stage 1 (shards optimizer state only). Stage 1: Adam m+v+fp32 master = 12 bytes/param sharded, plus bf16 weights (2 bytes) and bf16 gradients (2 bytes) unsharded. How many GB per GPU?
Per-GPU = (12/8 + 2 + 2) bytes/param × 7B params.
Compare to 112 GB without ZeRO (16 bytes/param). ZeRO-1 reduces memory 2.9× by sharding the large optimizer state. Each of the 8 GPUs now fits in an A100-80GB with room for activations.
7B parameter model, bf16 gradients (2 bytes/param), 8 GPUs. Ring all-reduce sends 2×(P−1)/P×M bytes per GPU. How many GB does each GPU send/receive? (P=8, M = 7B × 2 bytes.)
Each GPU sends/receives 24.5 GB per training step. At NVLink 600 GB/s bandwidth: 24.5/600 ≈ 41 ms for the all-reduce. At InfiniBand 25 GB/s: 24.5/25 = 980 ms — almost a second just for synchronization, dwarfing the forward/backward pass.
Pipeline parallelism: P=4 stages, m=8 microbatches. What fraction of time is the pipeline idle (bubble fraction)?
Bubble = (P−1) / (m + P − 1).
27.3% idle. To halve the bubble: double m to 16 → 3/19 ≈ 15.8%. More microbatches = smaller bubble but larger effective batch size, which can hurt convergence. Interleaved pipeline schedules (Megatron-LM) reduce bubble to ~(P−1)/(2mV) for V chunks per stage.
TP requires an all-reduce every transformer block (after the row-parallel linear). With 32+ layers, that's 32 all-reduces per step. Cross-node at 25 GB/s makes each synchronization ~24× slower than NVLink. The compute cannot proceed until synchronization completes, making cross-node TP a catastrophic bottleneck.
ZeRO-3 shards all 16 bytes/param across P=8 GPUs. For a 7B parameter model, how many GB does each GPU hold?
112 GB total / 8 GPUs = 14 GB per GPU — an 8× reduction. But parameters must be all-gathered for each forward/backward layer pass, adding communication overhead. ZeRO-3 trades memory for communication latency.
Scaling laws let you predict model performance before spending the compute. The key formulas: C = 6ND (compute from parameters and tokens), L(N, D) = E + A/Nα + B/Dβ (loss from both), and the Chinchilla optimal allocation D* = 20N.
GPT-3: N=175B params, D=300B tokens. Compute C=6ND. Express as X×1023 — what is X to 2 decimal places?
OpenAI reported 3.14×1023 FLOPs — our formula is accurate to within 0.3%. The 6ND rule holds because attention costs are <5% of total FLOPs for standard context lengths.
Given GPT-3's compute budget C=3.15×1023 FLOPs, what is the Chinchilla-optimal number of parameters? Use N* = √(C/120). Give in billions.
Chinchilla-optimal: ~51B params × 20 = ~1.02T tokens. Deepmind actually ran 70B with 1.4T tokens — close to this optimum. GPT-3 at 175B is over-parameterized by ~3.4× for its compute budget.
A model is trained on D=100B tokens. The data term of the loss is B/Dβ with B=410.7, β=0.28. Compute the data term at D=100B, then at D=200B. How much does the data term decrease? (Express as fraction of the D=100B value; round to 2 decimal places.)
Doubling data reduces the data term by 17.5%, regardless of the absolute value of D. This is the power law: each doubling gives the same proportional improvement. The β=0.28 exponent is small — data improvements are incremental, not exponential.
L(N,D) = E + A/Nα + B/Dβ. As N→∞ (and D→0), B/Dβ→∞ (data starvation). As N→0 (and D→∞), A/Nα→∞ (capacity starvation). There's an optimum in between where the gradient of both terms balance. The IsoFLOP plot is U-shaped — the bottom is the Chinchilla optimum.
Inference has two distinct phases: prefill (process the prompt in parallel — compute-bound) and decode (generate one token at a time — memory-bound). The KV cache eliminates redundant recomputation during decode.
7B LLaMA (32 layers, 32 KV heads, dhead=128), T=8192 tokens, bf16. How many GB is the KV cache for one request?
Bytes = 2 × 32 × 32 × 8192 × 128 × 2.
4.29 GB for one 8K request. With 16 concurrent requests: 16×4.29 = 68.6 GB — nearly one full A100-80GB just for KV caches. Long-context serving is fundamentally a memory management problem.
H100 ridge: I*=295 FLOPs/byte. During decode, I≈batch_size B. What minimum batch size makes decode compute-bound on an H100?
Batch 295+ requests together for compute-bound decode. In practice you'd use B=256 or B=512 (powers of 2). This is why LLM serving systems (vLLM, TensorRT-LLM) invest heavily in continuous batching — to maintain large effective batch sizes even with different request lengths.
A 7B model in bf16 weights = 14 GB. Quantizing weights to int8 (1 byte/param). How many GB are saved, and what is the percentage reduction?
int8 quantization halves weight memory. Arithmetic intensity during decode also doubles (halved bytes for same FLOPs). GPTQ and AWQ achieve <0.5% perplexity degradation with int4 quantization, saving 75%.
Arithmetic intensity ≈ B×T (tokens processed). Prefill: T=500 ≫ 295 → compute-bound. Decode: T=1 at each step → I=1 ≪ 295 → memory-bound. Different hardware metrics matter for each: prefill optimization targets TFLOP/s utilization; decode optimization targets memory bandwidth and batching.
After pretraining comes alignment: teaching the model to be helpful, harmless, and honest. SFT fine-tunes on curated demonstrations. RLHF uses a reward model + PPO. DPO and GRPO eliminate the reward model entirely by optimizing preference probabilities directly.
A reward model gives r(x, ywin)=3.0 and r(x, ylose)=1.5. Using P(yw≻yl) = σ(Δr) = 1/(1+e−Δr), what is the probability the winner is preferred? (Round to 2 decimal places.)
A reward gap of 1.5 nats gives 82% preference probability. At Δr=0 the model is indifferent (50%). At Δr=4.6 (ln99) the probability is 99%. The Bradley-Terry model is essentially a logistic regression on reward differences.
A group of 4 samples has rewards [1, 3, 2, 6]. Compute the normalized advantage Ai = (ri − mean) / std for r4=6. (Use population std, not sample std.)
The best response (r=6) gets advantage +1.60. The worst (r=1) gets (1-3)/1.871 = -1.07. Normalization keeps gradient magnitudes stable regardless of the absolute reward scale — crucial because reward functions differ wildly across tasks.
During SFT, the prompt tokens are fed as context with their loss masked (loss_mask=0). Only response tokens have loss_mask=1. This focuses learning signal on the quality of the completion. Including prompt tokens in the loss would penalize the model for "not generating" the prompt, which is counterproductive.
β controls how strongly the KL divergence from reference penalizes large policy updates. At β→0, the model is free to exploit any pattern that increases preference margin, even degenerate ones. Typical practice: β=0.1 to 0.5. Too large (β→∞) and the model barely moves from reference. Too small risks training instability and reward hacking.
Deduplication uses Bloom filters to check if a document was already seen. A Bloom filter with m=109 bits, n=107 items, and k=7 hash functions has false positive rate ≈ (1−e−kn/m)k. Compute (1−e−7×107/109)7. Round to 3 decimal places.
With these parameters FP rate ≈ 0.006 (using standard Bloom filter formula approximation). Practically near-zero false positives with 1 GB of bits and 7 hashes per 10M items. The Pile and C4 both use Bloom-filter-based deduplication at this scale.
You have a budget of $1M. H100-SXM5: 3.96×1015 BF16 FLOPs/GPU/day at ~MFU=0.45 (effective 1.78×1015 FLOPs/GPU/day), rental cost $2/GPU/hour. Design a training run end-to-end: pick a model size, compute required training time, verify it fits the budget, and estimate serving cost.
An H100 peaks at 989 TFLOPs/s (bf16). At MFU=0.45, what is the effective throughput in TFLOPs/s? (Round to nearest integer.)
MFU (Model FLOP Utilization) accounts for all overhead: data loading, optimizer steps, communication, idle time. State-of-the-art training achieves MFU 0.35–0.55. Megatron-LM reports 0.52 on GPT-3 scale. 0.45 is a realistic estimate for a single-node run.
N=1.3B, D=26B (Chinchilla-optimal). C=6ND≈2.03×1020 FLOPs. Using 8 H100s at 445 TFLOPs/s each. How many hours does training take?
t = C / (n × 445×1012 FLOPs/s) / 3600 seconds/hour.
Just 16 hours on 8 H100s for a Chinchilla-optimal 1.3B model! At $2/GPU/hr: 8×16×$2=$256. Incredibly cheap by frontier model standards.
Using the result from 9.2: 8 H100s for 16 hours at $2/GPU/hour. What is the total training cost in dollars?
$256 for a 1.3B Chinchilla-optimal model. Llama 2 7B training reportedly cost ~$5,000–$10,000 on comparable hardware. GPT-3 cost ~$10M. These orders of magnitude are the key intuition for budgeting AI research.
Arrange the five stages of building a language model from scratch in the correct order.
Data first (raw corpus defines vocabulary coverage). Tokenizer second (must be trained on the final data to capture its distribution). Pretraining third (costs the most compute). SFT + alignment fourth (orders of magnitude cheaper than pretraining, fine-tunes behavior). Serving last (inference optimization: quantization, KV cache, batching).
Decode throughput: an H100 at effective 445 TFLOPs/s, serving a 1.3B model (N=1.3B). Each token costs 2N=2.6B FLOPs (forward pass only, half of training). How many tokens per second can one H100 generate if fully loaded?
171K tokens/sec compute-bound upper bound. In reality, decode is memory-bound at small batch sizes. With batch=1: only ~85K tokens/sec (memory bandwidth limited). With batch=512: approaches the compute bound. The gap between memory-limited and compute-bound performance is the motivation for continuous batching in production.
A transformer with d=1024, n_heads=16, d_ff=2730 (SwiGLU iso-FLOP), L=24 layers, vocab=32256. Count parameters: (a) embedding table [V,d]; (b) each layer: Q,K,V,O projections [d,d] each, FFN W1,W2,Wgate [d,d_ff],[d_ff,d],[d,d_ff], 2 RMSNorms [d] each. Multiply (b) by 24 layers. What is the total in millions? (Ignore biases.)
With separate LM head (untied, [d, V]=33M): 335M+33M≈368M. With weight tying (embedding = LM head): 335M. Either way, the model is in the “400M” class. The calculation gives 459M accepting some rounding.
| Topic | Lesson |
|---|---|
| Tokenization & BPE | CS336 L1: Overview & Tokenization |
| Resource accounting | CS336 L2: PyTorch & Resource Accounting |
| Architectures | CS336 L3: Architectures & Hyperparameters |
| Scaling laws | CS336 L9: Scaling Laws |
| Inference | CS336 L10: Inference |