Your model "scores 80% on MMLU" — congratulations. But does that mean it's ready to deploy? Does it measure what you care about? Did you accidentally train on the test set? This lesson tears apart the entire evaluation stack: perplexity from cross-entropy, benchmark mechanics, LLM-as-judge biases, test-time compute scaling, and the contamination problem. You'll leave knowing exactly what a benchmark score means — and what it doesn't.
You open a model release blog post. In the table: "MMLU 84.2%, GSM8K 91.4%, HumanEval 78.6%." The previous SOTA was 79.1% on MMLU. Your model beat it. Celebration? Not so fast.
What does MMLU 84.2% actually mean? It means the model answered 84.2% of 14,042 multiple-choice questions correctly, across 57 subjects, using 5-shot prompting, scored by comparing the log-likelihood of each answer letter. But the baseline might have used 0-shot prompting, a different tokenizer, or a different answer-extraction method. And if your training data happened to include text about those 57 subjects that overlaps with the questions (for example, Wikipedia pages written before the benchmark was published), the model might have memorized the answers.
This chapter sets up the problem. We'll trace exactly what a benchmark score measures — and the five ways it can be misleading before you even start analyzing the model's actual behavior.
Here's the concrete framework we'll use throughout the lesson. Every evaluation has four moving parts:
Each of these steps introduces choices that can swing a benchmark score by 5–20 percentage points — without changing the model at all. Let's build intuition for how sensitive scores are to these choices.
Adjust the number of few-shot examples and answer format to see how the same model can appear better or worse on the same benchmark questions. This simulates the empirical observation that few-shot prompting can shift MMLU scores by 5–15%.
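To make the "same questions, different prompt" point concrete, here is a minimal sketch of a k-shot prompt builder. The layout (`A.`/`B.` option lines and an `Answer:` suffix) and the names `format_item` and `build_prompt` are illustrative assumptions, not the format of any particular evaluation harness.

```python
from typing import Optional, Sequence, Tuple

LETTERS = "ABCD"

def format_item(question: str, choices: Sequence[str],
                answer: Optional[str] = None) -> str:
    """Render one multiple-choice item; the answer stays blank for the test question."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)

def build_prompt(shots: Sequence[Tuple[str, Sequence[str], str]],
                 test_item: Tuple[str, Sequence[str]], k: int) -> str:
    """Prepend the first k solved examples; k=0 gives a zero-shot prompt."""
    blocks = [format_item(q, c, a) for q, c, a in shots[:k]]
    blocks.append(format_item(*test_item))
    return "\n\n".join(blocks)

# Hypothetical toy items, only to show the two prompt shapes
shots = [("2+2=?", ["3", "4", "5", "6"], "B")]
test_item = ("3+3=?", ["5", "6", "7", "8"])
zero_shot = build_prompt(shots, test_item, k=0)
one_shot = build_prompt(shots, test_item, k=1)
```

Changing `k`, the option labels, or the `Answer:` suffix changes what the model conditions on, which is one reason two reported scores on the same benchmark may not be comparable.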
Before benchmarks existed, there was perplexity, the original LM evaluation metric. It answers a clean question: how surprised is the model by a held-out text corpus? If the model perfectly predicted every next token, perplexity would be 1 (zero surprise). If it picked uniformly at random from a 50,000-token vocabulary, perplexity would be 50,000 (the surprise of pure guessing). Real models land somewhere in between.
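Let's derive it from scratch. A language model assigns a probability to a sequence of tokens $x_1, x_2, \ldots, x_T$ via the chain rule:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

Taking the log turns the product into a sum. Negating and averaging over the $T$ tokens gives the negative log-likelihood (NLL), also called cross-entropy loss:

$$\mathrm{NLL} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})$$

This is exactly what you minimize during training. Perplexity is just NLL in a more interpretable unit:

$$\mathrm{PPL} = \exp(\mathrm{NLL})$$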
Worked example. Suppose a model assigns probabilities [0.5, 0.3, 0.2] to the three tokens of "cats are cool." The NLL is −(log 0.5 + log 0.3 + log 0.2)/3 = −(−0.693 + −1.204 + −1.609)/3 = 1.169 nats. The perplexity is exp(1.169) = 3.22. At each step the model is as confused as choosing uniformly from ~3.2 options.
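The arithmetic above can be checked in a few lines. This is a minimal sketch in plain Python; the helper name `perplexity` is an illustrative choice, and the second call reproduces the uniform-guessing case over a 50,000-entry vocabulary.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability), using natural logs (nats)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# The three-token "cats are cool" example
ppl_example = perplexity([0.5, 0.3, 0.2])

# Uniform guessing over 50,000 tokens: every token gets probability 1/50,000
ppl_uniform = perplexity([1 / 50_000] * 4)

print(round(ppl_example, 2), round(ppl_uniform))
```

Because perplexity is the exponential of an average, the sequence length drops out in the uniform case: the result is the vocabulary size no matter how many tokens are scored.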
```python
# Compute perplexity from model logits
import torch
import torch.nn.functional as F

def compute_perplexity(model, input_ids, device='cuda'):
    """
    input_ids: LongTensor of shape (1, T), a single tokenized document
    Returns: scalar perplexity (float)
    """
    model.eval()
    input_ids = input_ids.to(device)  # assumes the model already lives on `device`
    with torch.no_grad():
        # Forward pass: get logits for all positions
        logits = model(input_ids).logits               # (1, T, vocab_size)
        # Shift: the logits at position t predict token t+1
        shift_logits = logits[:, :-1, :].contiguous()  # (1, T-1, V)
        shift_labels = input_ids[:, 1:].contiguous()   # (1, T-1)
        # Mean cross-entropy over positions = mean NLL = log(perplexity)
        nll = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction='mean'
        )
    return torch.exp(nll).item()

# Reported perplexities (test sets and tokenizers differ,
# so these numbers are not directly comparable):
# GPT-2 (1.5B):  PPL ≈ 17.48 (out-of-domain)
# GPT-3 (175B):  PPL ≈ 20.50 (out-of-domain, worse!)
# LLaMA-2 7B:    PPL ≈ 5.47  (trained on more diverse data)
```
Why does PPL sometimes go up with more parameters? Often because the comparison is not apples-to-apples: PPL is tokenizer-dependent. A tokenizer with 32K BPE merges splits the same sentence into more tokens than one with 100K merges, and each of those shorter tokens is "easier" to predict since it carries less information. The same total likelihood is then averaged over more tokens, so per-token PPL comes out lower. This makes cross-model PPL comparisons meaningless without matching tokenizers.
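A small numeric sketch of the tokenizer effect, under an idealized assumption: two models assign the same total log-likelihood to a sentence but split it into different numbers of tokens. All numbers below are hypothetical. Bits-per-byte, a commonly used tokenizer-independent normalization, is included for contrast; it is not something the lesson itself relies on.

```python
import math

def per_token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    """Per-token perplexity: exp of the NLL averaged over tokens."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Normalize by the byte length of the raw text instead of the token count."""
    return total_nll_nats / (n_bytes * math.log(2))

# Hypothetical sentence: 60 bytes long, total NLL of 30 nats under both models
total_nll, n_bytes = 30.0, 60

ppl_small_vocab = per_token_ppl(total_nll, n_tokens=15)  # finer split, more tokens
ppl_large_vocab = per_token_ppl(total_nll, n_tokens=10)  # coarser split, fewer tokens
bpb = bits_per_byte(total_nll, n_bytes)                  # identical for both models
```

Here the two models are equally good at predicting the text, yet the one with more tokens reports the lower per-token perplexity (exp(2) versus exp(3)).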
A language model produces a probability distribution over the next token at every step. Drag the sharpness slider to see how the distribution shape maps to perplexity. A "sharp" distribution concentrates probability on a few tokens; a "flat" one spreads it across many.
Since GPT-3, the field moved away from perplexity toward downstream task benchmarks — multiple-choice and short-answer tests that claim to measure something more meaningful than "how surprised is the model by random text." The most famous is MMLU.
MMLU (Massive Multitask Language Understanding) was created by graduate students who scraped online sources: textbooks, lecture notes, practice exams. It covers 57 subjects — from elementary mathematics to professional law — with 14,042 four-choice multiple-choice questions. Here's a sample question:
GPT-3 (175B) scored 43.9% on MMLU, above random chance (25%) but far from expert level. GPT-4 scored 86.4%. Claude 3 Opus scored 88.2%. The benchmark is becoming saturated, with top models scoring above 90%, so it's losing discriminative power for frontier models. Enter MMLU-Pro, GPQA, and Humanity's Last Exam (HLE), each harder than the last.
| Benchmark | Questions | Difficulty | Human expert accuracy | GPT-4 accuracy (approx.) |
|---|---|---|---|---|
| MMLU | 14,042 | Undergrad | 89% | 86% |
| MMLU-Pro | 12,000 | Undergrad+ | ~75% | ~55% |
| GPQA | 448 | PhD-level | 65% | 39% |
| HLE | 2,500 | Extreme | >90% | <5% |
Notice the pattern: each tier is progressively harder for AI but remains solvable for human experts. This is intentional — you want a benchmark to be discriminative: it should separate models that are actually more capable. Once a benchmark is saturated, it stops being useful as a signal.
When you run a multiple-choice benchmark, you have a choice: ask the model to generate an answer, or score it by log-likelihood of each option. These two methods are not equivalent — they can disagree by 10+ percentage points on the same questions.
Log-likelihood scoring: for each answer choice (A), (B), (C), (D), you compute the log-probability the model assigns to that choice token given the question as context. The answer with the highest log-probability wins. No generation needed — you just do a forward pass and read the logits for the answer tokens.
For example, given the question above about SF6:
```python
# Score an MMLU item by choice log-likelihood
import torch
import torch.nn.functional as F

def score_mcq(model, tokenizer, question, choices):
    """
    question: str, the full question text with (A)/(B)/(C)/(D) options
    choices: list of str, e.g. ['sp', 'sp2', 'sp3', 'sp3d2']
    Returns: int index of the highest-probability choice
    """
    log_probs = []
    prompt_base = question + "\nThe answer is:"
    base_ids = tokenizer(prompt_base, return_tensors='pt').input_ids
    for i in range(len(choices)):
        # Score the answer letter, often a single token like " A"
        answer = " " + "ABCD"[i]
        ans_ids = tokenizer(answer, add_special_tokens=False,
                            return_tensors='pt').input_ids
        full_ids = torch.cat([base_ids, ans_ids], dim=1)
        with torch.no_grad():
            logits = model(full_ids).logits          # (1, L, V)
        # Logits at position t predict token t+1, so the answer tokens are
        # predicted by the ans_len positions just before the last one
        ans_len = ans_ids.shape[1]
        answer_logits = logits[0, -ans_len-1:-1, :]  # (ans_len, V)
        log_p = F.log_softmax(answer_logits, dim=-1)
        # Sum over all answer tokens, in case the letter splits into several
        token_lp = log_p.gather(1, ans_ids[0].unsqueeze(1)).sum().item()
        log_probs.append(token_lp)
    return log_probs.index(max(log_probs))  # argmax
```
Generation scoring: prompt the model, sample a completion, then extract the answer from the generated text. This is more realistic — it tests whether the model can actually produce the right answer in a free-form context — but it requires a parsing step and is sensitive to the model's tendency to "talk around" the answer instead of stating it directly.
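The parsing step is where generation scoring often goes wrong, so a small sketch helps. The two regular expressions and the name `extract_choice` below are illustrative assumptions, not the extraction rules of any specific benchmark harness; real harnesses use their own patterns.

```python
import re
from typing import Optional

# Hypothetical patterns, tried in order:
# 1) "answer is (D)" / "Answer: B", where only the keyword is case-insensitive
# 2) a line consisting of just the letter, e.g. "C" or "(B)"
_PATTERNS = [
    re.compile(r"(?i:answer\s*(?:is|:))\s*\(?([ABCD])\b"),
    re.compile(r"^\(?([ABCD])\)?[.:]?\s*$", re.MULTILINE),
]

def extract_choice(text: str) -> Optional[str]:
    """Return the chosen letter, or None if the model never commits to one."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

direct = extract_choice("The answer is (D).")
bare = extract_choice("Let me think.\nC")
evasive = extract_choice("The answer is a matter of debate among chemists.")
```

An output that "talks around" the answer yields `None`, which most setups would count as wrong. That is one mechanism by which generation scoring and log-likelihood scoring can disagree on the same question.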
Adjust the four log-likelihoods to see which choice wins. Toggle the correct answer to see what "correct prediction" means. This mirrors exactly what happens in MMLU, GPQA, and HellaSwag evaluation.
A model's first answer isn't always its best. If you ask it to reason step-by-step, or if you sample multiple answers and take the most common one, accuracy can improve dramatically — without changing any weights. This is test-time compute scaling: spending more FLOPs at inference to extract better answers from the same model.
There are three main strategies. First, chain-of-thought (CoT) prompting: instead of asking for the answer directly, ask the model to show its work. "Let's think step by step." This forces the model to decompose the problem into intermediate steps, each of which is easier to get right. On GSM8K (grade-school math), CoT boosted GPT-4 from ~55% to ~91%.
Second, self-consistency / majority voting: sample K independent reasoning chains, extract the final answer from each, and return the answer that appeared most often. This works because different reasoning paths can lead to the same correct answer, while wrong answers tend to be scattered across many incorrect values.
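The math for majority vote accuracy as a function of single-sample accuracy p and number of samples K (taking K odd to avoid ties, and treating each sample as independently right or wrong):

$$P(\text{correct}) = \sum_{j \ge K/2} \binom{K}{j}\, p^{j} (1-p)^{K-j}$$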
Let's compute this for a concrete example. Say p = 0.6 (each individual sample correct 60% of the time). With K = 1: 60%. With K = 5: P = C(5,3)·0.6³·0.4² + C(5,4)·0.6⁴·0.4 + C(5,5)·0.6⁵ = 10·0.216·0.16 + 5·0.1296·0.4 + 1·0.0778 = 0.346 + 0.259 + 0.078 = 68.3%. With K = 25: ~85%. The gains are real but diminishing.
Third strategy: best-of-N (BoN). Generate N answers and pick the one with the highest score according to a verifier or reward model. If each sample is independently correct with probability p, then P(at least one correct in N) = 1 − (1−p)ᴺ. This reaches near-100% accuracy very quickly even for moderate p, but it requires a reliable verifier.
Worked example: p = 0.3 (hard problem, 30% single-sample accuracy). BoN with N = 10: 1 − 0.7¹⁰ = 1 − 0.028 = 97.2%. Majority vote with K = 10: much lower, since p < 0.5 means the vote goes wrong most of the time. The right strategy depends on whether you have a good verifier.
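Both formulas are easy to evaluate directly. The sketch below assumes the simplified model used in the text: samples are independent, each is simply right or wrong, and the best-of-N verifier is perfect. For even K it counts only a strict majority as a win, which is one possible tie-breaking choice. The function names are illustrative.

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """P(a strict majority of k independent samples is correct)."""
    need = k // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(need, k + 1))

def best_of_n_accuracy(p: float, n: int) -> float:
    """P(at least one of n samples is correct), assuming a perfect verifier."""
    return 1 - (1 - p)**n

mv_5 = majority_vote_accuracy(0.6, 5)      # the K = 5 worked example
bon_10 = best_of_n_accuracy(0.3, 10)       # the N = 10 worked example
mv_hard = majority_vote_accuracy(0.3, 9)   # voting when p < 0.5

print(f"{mv_5:.3f} {bon_10:.3f} {mv_hard:.3f}")
```

In this binary model, voting with p < 0.5 pushes accuracy below the single-sample rate, while best-of-N keeps improving with N. Real self-consistency can do better than the binary model suggests, because wrong answers are usually spread over many different values rather than agreeing with each other.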
```python
# Self-consistency majority vote
import re
from collections import Counter

import torch

def majority_vote(model, tokenizer, problem, K=10, temperature=0.7):
    """
    problem: str, math word problem text
    K: int, number of reasoning chains to sample
    Returns: str, most common final answer across the K samples
    """
    answers = []
    prompt = problem + "\n\nLet's think step by step.\n"
    ids = tokenizer(prompt, return_tensors='pt').input_ids
    for _ in range(K):
        with torch.no_grad():
            output = model.generate(
                ids, max_new_tokens=256,
                temperature=temperature, do_sample=True
            )
        # Decode only the newly generated tokens
        text = tokenizer.decode(output[0][ids.shape[1]:],
                                skip_special_tokens=True)
        # Extract the final numeric answer (GSM8K convention: "#### 42");
        # samples that never produce this marker are skipped
        m = re.search(r'####\s*([0-9,\-\.]+)', text)
        if m:
            answers.append(m.group(1).replace(',', ''))
    if not answers:
        return "unknown"
    return Counter(answers).most_common(1)[0][0]
```
Set the single-sample accuracy p and the number of samples K (or N). See how majority voting and best-of-N diverge — one needs p > 0.5 to help, the other always helps given a reliable verifier.
For open-ended tasks — "write a poem about entropy," "debug this code," "explain quantum entanglement to a 10-year-old" — there's no ground-truth answer to compare against. Human raters are expensive and slow. The natural solution: use a powerful LLM as the judge.
In pairwise preference evaluation, you show the judge two model responses (A and B) and ask which is better. You run this for thousands of question pairs across many models and compute win rates. The model with the highest win rate against a reference model (usually GPT-4) wins.
This is how AlpacaEval works: 805 instructions, judge is GPT-4 turbo, metric is win rate against GPT-4-preview outputs. And it's how Chatbot Arena works at scale: real users vote on pairs of anonymous models; Elo scores are computed from the pairwise wins.
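The math is the Bradley-Terry model. Given model A with rating $r_A$ and model B with rating $r_B$:

$$P(A \text{ wins}) = \frac{\exp(r_A)}{\exp(r_A) + \exp(r_B)}$$

Given a dataset of N pairwise comparisons, where comparison $n$ was won by model $w_n$ over model $l_n$, find $r_i$ for each model $i$ by maximizing the log-likelihood:

$$\max_{r} \; \sum_{n=1}^{N} \log \frac{\exp(r_{w_n})}{\exp(r_{w_n}) + \exp(r_{l_n})}$$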
This can be solved by gradient ascent or Newton's method. The resulting ratings give a consistent global ranking from local pairwise comparisons — even when models haven't directly faced each other.
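As a rough illustration of the gradient-ascent route, the sketch below fits ratings under the win probability P(A wins) = exp(r_A) / (exp(r_A) + exp(r_B)). The function name, learning rate, step count, and toy match list are all assumptions chosen for the example; production leaderboards use more careful optimizers and report confidence intervals.

```python
import math

def fit_bradley_terry(matches, lr=1.0, steps=2000):
    """matches: list of (winner, loser) names. Returns {name: rating}."""
    models = sorted({name for pair in matches for name in pair})
    r = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for winner, loser in matches:
            # P(winner beats loser) under the current ratings
            p_win = 1.0 / (1.0 + math.exp(r[loser] - r[winner]))
            # Gradient of log P(winner beats loser) w.r.t. each rating
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for m in models:
            r[m] += lr * grad[m] / len(matches)
        # Ratings are only identified up to a constant shift, so recenter
        mean = sum(r.values()) / len(r)
        for m in models:
            r[m] -= mean
    return r

# Hypothetical votes: A vs B went 7-3, B vs C went 8-2, A and C never met
matches = ([("A", "B")] * 7 + [("B", "A")] * 3 +
           [("B", "C")] * 8 + [("C", "B")] * 2)
ratings = fit_bradley_terry(matches)
p_a_beats_c = 1.0 / (1.0 + math.exp(ratings["C"] - ratings["A"]))
```

Even though A and C never played each other in this toy data, the fitted ratings imply a win probability for that pairing, which is the "global ranking from local comparisons" property described above.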
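Now for the bad news: LLM judges are systematically biased. There are three well-documented failure modes: position bias (the judge favors a response because of where it appears in the prompt, regardless of quality), length bias (longer responses are rated higher), and self-preference bias (the judge favors outputs from its own model family).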
Adjust the position bias (how much the judge prefers response A regardless of quality) and length bias (how much extra length is rewarded). Watch how the measured win rate diverges from the true quality difference.
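One way to reason about position bias is a toy mixture model, sketched below. It assumes (purely for illustration) that with probability `b` the judge ignores content and picks whichever response is shown first, and otherwise picks the truly better response. The function name and all parameter values are hypothetical.

```python
def measured_win_rate(q: float, b: float, a_shown_first: float) -> float:
    """
    q: true probability that model A's response is the better one
    b: probability the judge ignores content and picks the first response
    a_shown_first: fraction of comparisons in which A is shown first
    """
    return b * a_shown_first + (1 - b) * q

# Two equally good models (q = 0.5) and a judge with 20% position bias
always_first = measured_win_rate(q=0.5, b=0.2, a_shown_first=1.0)
randomized = measured_win_rate(q=0.5, b=0.2, a_shown_first=0.5)

# A genuinely better model (q = 0.7) with randomized order
better_randomized = measured_win_rate(q=0.7, b=0.2, a_shown_first=0.5)
```

Under this model, always showing A first inflates its win rate (about 0.6 for two equal models), while randomizing the order removes the skew. Randomization does not recover the true gap, though: a real 0.7 win rate is still compressed toward 0.5 (about 0.66 here).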
You're an ML engineer deciding which evaluation strategy to run for a new instruction-tuned model. You have a budget of N evaluations (inference calls). How should you allocate them across benchmark tasks? This showcase lets you configure a full eval suite and see the expected coverage, compute cost, and reliability of your evaluation.
The canvas below shows five benchmark categories. Drag the sliders to allocate your evaluation budget. The bars show: expected accuracy improvement from voting (left), expected cost in FLOPs (center), and coverage of real-world use cases (right). The goal is to find the mix that maximizes coverage while staying within budget.
Allocate evaluation calls across benchmark types. Each type has a different cost (tokens per call) and coverage (fraction of real use cases it addresses). The total bar turns red when you exceed budget. Click any bar to see benchmark details.
Machine learning has a cardinal rule: never train on your test set. Before the foundation-model era, this was enforced by explicit data splits: ImageNet train/val/test are disjoint, and SQuAD splits are manually separated. In the foundation-model era, the rule is routinely broken by accident.
Modern LLMs train on trillions of tokens scraped from the Internet. MMLU questions were posted on Reddit study forums. GSM8K solutions were discussed on math help sites. HellaSwag text came from ActivityNet captions. The training corpus almost certainly contains text that appears verbatim or near-verbatim in the test benchmarks. This is contamination, and it silently inflates benchmark scores.
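A common first check for verbatim or near-verbatim overlap is n-gram matching between a test item and training documents. The sketch below is a simplified illustration: it tokenizes on whitespace, uses a short n for the toy strings, and the function names are made up for this example. Real decontamination pipelines differ in tokenization, n-gram length, and thresholds.

```python
def ngrams(text: str, n: int) -> set:
    """All lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Hypothetical strings standing in for a benchmark item and two scraped pages
item = "what is the hybridization of sulfur in sulfur hexafluoride"
forum_post = "Study help: What is the hybridization of sulfur in sulfur hexafluoride anyone know"
unrelated = "the weather in the mountains changes quickly during the spring season"

leaked = overlap_fraction(item, forum_post, n=5)
clean = overlap_fraction(item, unrelated, n=5)
```

Exact n-gram matching only catches verbatim copies. Paraphrased or translated versions of a question slip through, which is why overlap checks give a lower bound on contamination rather than a clean bill of health.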
How much does contamination matter? A 2023 study tested models on original MMLU versus a "shuffled answers" variant. Clean models scored the same on both. Contaminated models scored 5–15 percentage points higher on the original than on the shuffled variant, because they had memorized specific answer letters rather than the reasoning. That gap is the contamination-inflated bonus.
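The shuffled-answers idea can be sketched in a few lines: permute the options of each item, track where the correct answer moved, and compare accuracy before and after. The helper names below are illustrative, and this is not the procedure of the study mentioned above, only a minimal version of the same idea.

```python
import random

def shuffle_choices(choices, answer_idx, rng):
    """Return (shuffled_choices, new_answer_idx) for one multiple-choice item."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The correct option now sits wherever its original index landed
    return shuffled, order.index(answer_idx)

def contamination_gap(acc_original: float, acc_shuffled: float) -> float:
    """A positive gap suggests reliance on memorized answer positions."""
    return acc_original - acc_shuffled

rng = random.Random(0)  # fixed seed so the shuffled benchmark is reproducible
choices = ["sp", "sp2", "sp3", "sp3d2"]
shuffled, new_idx = shuffle_choices(choices, answer_idx=3, rng=rng)

# Hypothetical accuracies, only to show how the gap is read
gap = contamination_gap(acc_original=0.84, acc_shuffled=0.74)
```

A model that understands the question should be indifferent to the permutation, so its gap should be near zero up to sampling noise.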
There's a subtler form of contamination: benchmark adaptation. A lab knows MMLU is used for rankings. They include "MMLU-style" questions in their instruction-tuning data — not the exact questions, but the same format, subjects, and difficulty level. The model learns to be good at the benchmark format, not at chemistry or law. This is gaming the metric, not improving the underlying capability.
Adjust the fraction of test questions that appear in training data and the memorization strength. The canvas shows the "true" accuracy (what the model would score on novel questions) vs the reported accuracy (inflated by memorization). The gap is the contamination bonus.
MMLU tests knowledge. HellaSwag tests commonsense completion. But most real users aren't asking their LLM to pick (A), (B), (C), or (D). They're asking it to write an email, debug their code, plan a trip, or help them understand a medical report. These tasks require instruction following — the ability to produce a useful, well-formatted response to an open-ended request.
Evaluating instruction following is fundamentally harder than evaluating multiple-choice. There's no single correct answer. Length, format, tone, and completeness all matter. Three approaches have emerged: synthetic constraints, pairwise preferences, and human-sourced tasks.
IFEval (Instruction-Following Eval) adds verifiable constraints to instructions: "Write a response in exactly 3 paragraphs," "Include the word 'sustainability' at least twice," "Do not use the word 'however'." These constraints can be checked programmatically — no human or LLM judge needed. The downside: the constraints feel artificial, and a model could satisfy them while producing a useless response.
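The three example constraints can each be verified with a few lines of code. The checkers below are a sketch written for this lesson, not IFEval's actual implementation; in particular, the paragraph rule (blocks separated by blank lines) and the whole-word, case-insensitive matching are assumptions.

```python
import re

def has_exact_paragraphs(response: str, n: int) -> bool:
    """Paragraphs are assumed to be blocks separated by blank lines."""
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n

def count_word(response: str, word: str) -> int:
    """Case-insensitive whole-word count."""
    return len(re.findall(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE))

def check_response(response: str) -> dict:
    """Evaluate the three example constraints from the text."""
    return {
        "three_paragraphs": has_exact_paragraphs(response, 3),
        "sustainability_twice": count_word(response, "sustainability") >= 2,
        "no_however": count_word(response, "however") == 0,
    }

good = "Sustainability matters.\n\nCities plan for sustainability.\n\nThat is all."
bad = "However, this is a single paragraph."
good_report = check_response(good)
bad_report = check_response(bad)
```

Note that `good` passes every check while saying almost nothing, which is the limitation described above: constraint satisfaction is verifiable, but it is not the same as usefulness.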
AlpacaEval uses 805 real instructions from various sources and measures win rate against GPT-4's responses, as judged by GPT-4. Circular? Yes — but it correlates 0.97 with human preference on a smaller validation set. The key limitation: verbose models inflate their win rate because the GPT-4 judge has a length bias.
Chatbot Arena solves this by crowdsourcing: real people type prompts, see two anonymous model responses, and vote. Elo scores are computed from millions of pairwise votes. It has two major advantages over static benchmarks: inputs are live (adversarial red-teamers keep finding edge cases), and new models can be added without rerunning everything. It also has two major disadvantages: it's slow (a new model's rating takes weeks to converge), and it reflects the preferences of whoever shows up on the website (tech-savvy, English-speaking, curious about AI).
| Benchmark | Type | Judge | Key metric | Limitation |
|---|---|---|---|---|
| IFEval | Instruction following | Programmatic | Constraint satisfaction rate | Artificial constraints |
| AlpacaEval | Instruction following | GPT-4 | Win rate vs GPT-4 | Length + self-preference bias |
| WildBench | Instruction following | GPT-4 + checklist | Win rate (checklist-weighted) | Still GPT-4 dependent |
| Chatbot Arena | Open-ended | Humans | Elo rating | Slow, biased sample |
| SWE-bench | Coding (agent) | Unit tests | % issues resolved | Narrow domain |
| MLE-bench | ML (agent) | Kaggle metrics | Medal rate | Requires training infra |
Evaluating safety is a special case. Safety benchmarks like HarmBench (510 harmful behaviors) and AIR-Bench (5,694 prompts across 314 risk categories) measure whether a model refuses harmful requests. But safety is contextual: what's harmful in one cultural or legal context may be acceptable in another. Jailbreaking techniques (e.g., GCG, which automatically optimizes adversarial suffixes) can bypass safety training, and these attacks transfer from open-weight models to closed API models. A model that scores 0% refusal on jailbreaks isn't necessarily "less safe" if it's an API model that the attacker can't fine-tune.
Percy's takeaway from Lecture 12: "There is no one true evaluation. Choose your evaluation based on what question you are trying to answer. Always look at individual instances and predictions — not just the aggregate number." Let's make this concrete with a synthesis and cheat sheet.
| Concept | Formula / Key number | Main caveat |
|---|---|---|
| Perplexity | $\mathrm{PPL} = \exp(\mathrm{NLL}) = \exp\left(-\tfrac{1}{T}\sum_{t} \log p(x_t \mid x_{<t})\right)$ | Tokenizer-dependent; can't compare across different tokenizers |
| MC scoring | Predict $\arg\max_c \log p(c \mid \text{question})$ | Disagrees with generation scoring; prompt-format sensitive |
| Majority vote | $P(\text{correct}) = \sum_{j \ge K/2} \binom{K}{j}\, p^{j} (1-p)^{K-j}$ | Requires p > 0.5 to help; gains are diminishing |
| Best-of-N | $P(\text{correct}) = 1 - (1-p)^{N}$ | Requires a reliable verifier; always helps given verifier |
| Bradley-Terry Elo | $P(A \text{ wins}) = \exp(r_A) / (\exp(r_A) + \exp(r_B))$ | Assumes transitivity; slow to converge; input distribution bias |
| Contamination | Shuffled-answer test: clean model scores same; contaminated scores lower | Doesn't catch semantic (near-duplicate) contamination |
The three most dangerous misconceptions in evaluation:
What's next after Lecture 12? Lecture 13 covers data — the other side of the training equation. How do you curate a trillion-token corpus, handle multilingual data, and deal with the quality-quantity tradeoff? And since contamination depends on what's in your training data, the two lectures are intimately connected.
"There is no free lunch. Every evaluation trades off cost, coverage, validity, and realism. The art is in choosing the tradeoffs that match your actual question."
— Percy Liang (paraphrased, CS336 Lecture 12)