Evaluation — Language Modeling from Scratch (CS336 L12)

Chapter 0: The 80% Problem

You open a model release blog post. In the table: "MMLU 84.2%, GSM8K 91.4%, HumanEval 78.6%." The previous SOTA on MMLU was 79.1%, so the new model beat it. Celebration? Not so fast.

What does MMLU 84.2% actually mean? It means the model answered 84.2% of 14,042 multiple-choice questions correctly, across 57 subjects, using 5-shot prompting, scored by comparing the log-likelihood of each answer letter. But the baseline might have used 0-shot. A different tokenizer. A different answer-extraction method. And if the training data happened to include the web pages the questions were drawn from, or later copies of the questions themselves, the model might simply have memorized the answers.

This chapter sets up the problem. We'll trace exactly what a benchmark score measures, and the ways it can mislead before you even start analyzing the model's actual behavior.

Percy's evaluation crisis. Andrej Karpathy wrote in 2024: "We're in a crisis. We have benchmarks with numbers that go up but the models don't actually get better in ways that matter." The HELM leaderboard lists 100+ models with benchmark scores differing by fractions of a percent — yet real-world usefulness varies by orders of magnitude. The gap between "benchmark performance" and "real-world quality" is the central problem this lecture addresses.

Here's the concrete framework we'll use throughout the lesson. Every evaluation has four moving parts:

Inputs

What prompts? Which use cases? How representative? Tail-heavy or average?

↓

Calling the Model

Zero-shot or few-shot? Chain-of-thought? Tools? What system prompt?

↓

Judging Outputs

Exact match? Log-likelihood? Human raters? LLM judge? Pass@k?

↓

Interpreting Results

What does 84% mean? Is 85% "good enough"? Does it generalize?

Each of these steps introduces choices that can swing a benchmark score by 5–20 percentage points — without changing the model at all. Let's build intuition for how sensitive scores are to these choices.
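To make the "Calling the Model" step concrete, here is a minimal sketch of how the same question turns into different prompts depending on the number of shots. The function name `build_prompt`, the `(A)`–`(D)` layout, and the "Answer:" suffix are illustrative assumptions, not the format of any particular harness.

```python
def build_prompt(question, choices, shots=(), k=0):
    """Format a multiple-choice question, preceded by up to k solved examples.

    shots: iterable of (question, choices, answer_letter) tuples (hypothetical data).
    """
    def render(q, opts, answer=None):
        lines = [q] + [f"({letter}) {opt}" for letter, opt in zip("ABCD", opts)]
        # Solved examples end with the answer; the final question leaves it open.
        lines.append("Answer:" + (f" {answer}" if answer else ""))
        return "\n".join(lines)

    blocks = [render(q, opts, ans) for q, opts, ans in list(shots)[:k]]
    blocks.append(render(question, choices))
    return "\n\n".join(blocks)


shots = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
zero_shot = build_prompt("3 + 3 = ?", ["5", "6", "7", "8"], shots, k=0)
one_shot = build_prompt("3 + 3 = ?", ["5", "6", "7", "8"], shots, k=1)
print(one_shot)
```

The model weights are identical in both cases; only the text in front of the final "Answer:" changes, which is why the number of shots has to be reported alongside any score.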

Benchmark score sensitivity — how format changes your number

Adjust the number of few-shot examples and the answer format to see how the same model can appear better or worse on the same benchmark questions. This simulates the empirical observation that few-shot prompting can shift MMLU scores by 5–15 percentage points.

Controls: few-shot examples k (default 5); answer extraction strictness (default medium).

A paper reports their model achieves 88% on MMLU using 5-shot prompting, beating the previous best of 85% from a model evaluated 0-shot. Is this a valid comparison?

Yes: both are measuring accuracy on the same 14,042 questions.
Yes: the number of shots is a minor implementation detail.
No: few-shot examples can boost accuracy by 5–15 percentage points, making the comparison invalid without controlling for the number of shots.
No: MMLU cannot be evaluated with 5-shot prompting.

Chapter 1: Perplexity — The Language Modeler's Thermometer

Before benchmarks existed, there was perplexity, the original LM evaluation metric. It answers a clean question: how surprised is the model by a held-out text corpus? If the model perfectly predicted every next token, perplexity would be 1 (zero surprise). If it picked uniformly at random from a 50,000-token vocabulary, perplexity would be 50,000. That is not a ceiling: a model that confidently puts its probability on the wrong tokens scores even worse. Real models land somewhere in between.

Let's derive it from scratch. A language model assigns a probability to a sequence of tokens x₁, x₂, ..., x_T:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

Taking the log turns the product into a sum. Negating it and averaging over the T tokens gives the per-token negative log-likelihood (NLL), also called cross-entropy loss:

NLL(x) = −(1/T) ∑_{t=1}^{T} log p(x_t | x_{<t})

This is exactly what you minimize during training. Perplexity is just NLL in a more interpretable unit:

PPL(x) = exp( NLL(x) ) = exp( −(1/T) ∑_{t=1}^{T} log p(x_t | x_{<t}) )

Perplexity = effective branching factor. If PPL = 50, the model is as confused as if it had to choose uniformly among 50 equally likely next tokens at every step. PPL = 10 means it effectively narrows down to 10 plausible next tokens. PPL = 1.5 means it's nearly certain at each step. It's the geometric mean of the inverse probabilities assigned to the true tokens.

Worked example. Suppose a model assigns probabilities [0.5, 0.3, 0.2] to the three tokens of "cats are cool." The NLL is −(log 0.5 + log 0.3 + log 0.2)/3 = −(−0.693 + −1.204 + −1.609)/3 = 1.169 nats. The perplexity is exp(1.169) = 3.22. At each step the model is as confused as choosing uniformly from ~3.2 options.
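The arithmetic in the worked example can be checked in a few lines of plain Python. This is only a sketch of the definition above; the helper name `perplexity` is an illustrative choice.

```python
import math

def perplexity(token_probs):
    """Return (NLL in nats, perplexity) for the probabilities assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return nll, math.exp(nll)

nll, ppl = perplexity([0.5, 0.3, 0.2])
print(f"NLL = {nll:.3f} nats, PPL = {ppl:.2f}")  # NLL = 1.169 nats, PPL = 3.22

# Same number via the geometric mean of inverse probabilities
geo_mean = (1 / (0.5 * 0.3 * 0.2)) ** (1 / 3)
```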

python — compute perplexity from model logits
import torch
import torch.nn.functional as F

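def compute_perplexity(model, input_ids, device=None):
    """
    input_ids: LongTensor of shape (1, T), a single tokenized document
    Returns:   scalar perplexity (float)
    """
    model.eval()
    # Run on the model's own device unless one is given explicitly
    if device is None:
        device = next(model.parameters()).device
    input_ids = input_ids.to(device)
    with torch.no_grad():
        # Forward pass: get logits for all positions
        logits = model(input_ids).logits  # (1, T, vocab_size)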

        # Shift: predict token t+1 from context up to t
        shift_logits = logits[:, :-1, :].contiguous()  # (1, T-1, V)
        shift_labels = input_ids[:, 1:].contiguous()   # (1, T-1)

        # Cross-entropy = mean NLL = log perplexity
        nll = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction='mean'
        )

        return torch.exp(nll).item()

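# Reported numbers (NOT directly comparable: different test sets and tokenizers)
# GPT-2 (1.5B): PPL ≈ 17.48  (WikiText-103, zero-shot)
# GPT-3 (175B): PPL ≈ 20.50  (Penn Treebank, zero-shot; the GPT-3 paper skipped
#                             WikiText because Wikipedia is in its training data)
# LLaMA-2 7B:   PPL ≈ 5.47   (commonly quoted for WikiText-2, per token of its own tokenizer)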

Why can a larger model report a higher PPL than a smaller one? One reason is that PPL is tokenizer-dependent. A tokenizer with 32K merges splits the same sentence into more tokens than one with 100K merges, and each of those shorter tokens is easier to predict. Per-token PPL therefore comes out lower for the 32K model even if both models assign the same probability to the whole sentence. Cross-model PPL comparisons are meaningless unless the tokenizers match or the loss is normalized per word or per byte.
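A small numeric sketch of the tokenizer effect: two hypothetical models assign the same total log-probability to a sentence but split it into different numbers of tokens. The log-probability and token counts below are made-up illustrative values, and bits-per-byte is shown as one common tokenizer-independent normalization.

```python
import math

def per_token_ppl(total_log_prob, n_tokens):
    """Perplexity when the total log-probability (nats) is averaged over n_tokens."""
    return math.exp(-total_log_prob / n_tokens)

def bits_per_byte(total_log_prob, text):
    """Normalize by UTF-8 bytes instead of tokens, so the tokenizer drops out."""
    n_bytes = len(text.encode("utf-8"))
    return -total_log_prob / (n_bytes * math.log(2))

text = "Evaluation is harder than it looks."
total_log_prob = -24.0  # hypothetical: identical for both models

ppl_12_tokens = per_token_ppl(total_log_prob, n_tokens=12)  # finer-grained tokenizer
ppl_8_tokens = per_token_ppl(total_log_prob, n_tokens=8)    # coarser tokenizer
bpb = bits_per_byte(total_log_prob, text)                   # same for both models

print(ppl_12_tokens, ppl_8_tokens, bpb)
```

The two per-token perplexities differ by almost a factor of three even though the models are equally good at predicting the sentence; the per-byte number is the same for both.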

Misconception: lower PPL always means better model. A large general-purpose model evaluated zero-shot can show higher perplexity on a benchmark than a much smaller model trained on that benchmark's own training split. GPT-2 is a known example: trained on a diverse web corpus (WebText), its zero-shot perplexity on the One Billion Word benchmark was well behind specialized models, even though it beat them on several other datasets. Lower PPL on a specific benchmark only means the model fits that distribution better, not that it's smarter.

Perplexity as effective branching factor — the sharpness slider

A language model produces a probability distribution over the next token at every step. Drag the sharpness slider to see how the distribution shape maps to perplexity. A "sharp" distribution concentrates probability on a few tokens; a "flat" one spreads it across many.

Controls: distribution sharpness α (default 1.00); vocabulary size (default 20).

A model achieves NLL = 2.3 nats on a test set. What is the perplexity? (Hint: e ≈ 2.718, e² ≈ 7.39, e²·³ ≈ 9.97)

2.3 (perplexity equals NLL)
7.39 (e², rounded down)
≈ 9.97, because PPL = exp(NLL) = exp(2.3)
23 (NLL × 10)

Chapter 2: Knowledge Benchmarks — What MMLU Actually Measures

Since GPT-3, the field moved away from perplexity toward downstream task benchmarks — multiple-choice and short-answer tests that claim to measure something more meaningful than "how surprised is the model by random text." The most famous is MMLU.

MMLU (Massive Multitask Language Understanding) was created by graduate students who scraped online sources: textbooks, lecture notes, practice exams. It covers 57 subjects — from elementary mathematics to professional law — with 14,042 four-choice multiple-choice questions. Here's a sample question:

MMLU Example (High School Chemistry):
Which of the following best describes the hybridization of the central atom in SF₆?
(A) sp (B) sp² (C) sp³ (D) sp³d²
Correct answer: D

GPT-3 (175B) scored 43.9% on MMLU, well above random guessing (25%) but far from expert level. GPT-4 scored 86.4%. Claude 3 Opus scored 88.2%. The benchmark is becoming saturated (top models score above 90%), so it's losing discriminative power for frontier models. Enter MMLU-Pro, GPQA, and Humanity's Last Exam, each harder than the last.

The benchmark arms race. Each time a benchmark saturates (models score >90%), researchers create a harder one. Penn Treebank → WikiText → MMLU → MMLU-Pro → GPQA → HLE. GPQA questions were written by PhD-level Upwork contractors; non-experts with Google access score only 34%. GPT-4 scored 39%. Humanity's Last Exam had a $500K prize pool for question creators and was filtered by frontier LLMs — questions that any frontier model could answer were discarded.

Benchmark	Questions	Difficulty	Human expert	GPT-4 (approx.)
MMLU	14,042	Undergrad	89%	86%
MMLU-Pro	12,000	Undergrad+	~75%	~55%
GPQA	448	PhD-level	65%	39%
HLE	2,500	Extreme	>90%	<5%

Notice the pattern: each tier is progressively harder for AI but remains solvable for human experts. This is intentional — you want a benchmark to be discriminative: it should separate models that are actually more capable. Once a benchmark is saturated, it stops being useful as a signal.

MMLU-Pro scores are 16–33 percentage points lower than MMLU scores for the same models. What design change to the benchmark explains most of this gap?

MMLU-Pro questions are translated into multiple languages, reducing accuracy.
MMLU-Pro has 10 answer choices instead of 4, cutting the random-guess baseline from 25% to 10%; in addition, the questions are harder and chain-of-thought is used.
MMLU-Pro covers more obscure subjects not in pre-training data.
MMLU-Pro uses generation scoring instead of log-likelihood scoring.

Chapter 3: Multiple-Choice Scoring — The Mechanics

When you run a multiple-choice benchmark, you have a choice: ask the model to generate an answer, or score it by log-likelihood of each option. These two methods are not equivalent — they can disagree by 10+ percentage points on the same questions.

Log-likelihood scoring: for each answer choice (A), (B), (C), (D), you compute the log-probability the model assigns to that choice token given the question as context. The answer with the highest log-probability wins. No generation needed — you just do a forward pass and read the logits for the answer tokens.

score(choice) = log p( answer_token | question + " The answer is:" )

For example, given the question above about SF₆:

log p(" A") = −3.1 (log-probability of "A" after "The answer is:")
log p(" B") = −4.7
log p(" C") = −2.9
log p(" D") = −1.4 ← highest, model picks D

python — score an MMLU item by choice log-likelihood
import torch
import torch.nn.functional as F

def score_mcq(model, tokenizer, question, choices):
    """
    question: str — the full question text with (A)/(B)/(C)/(D)
    choices:  list of str, e.g. ['sp', 'sp2', 'sp3', 'sp3d2']
    Returns:  int index of highest-probability choice
    """
    log_probs = []
    prompt_base = question + "\nThe answer is:"
    base_ids = tokenizer(prompt_base, return_tensors='pt').input_ids

    for i, choice in enumerate(choices):
        # Tokenize the answer token(s) — often a single letter like " A"
        answer = " " + "ABCD"[i]
        ans_ids = tokenizer(answer, add_special_tokens=False,
                           return_tensors='pt').input_ids

        full_ids = torch.cat([base_ids, ans_ids], dim=1)

        with torch.no_grad():
            logits = model(full_ids).logits  # (1, L, V)

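        # Log-prob of the answer token(s): the logits at position i predict token i+1,
        # so the last ans_len answer tokens are predicted by positions -ans_len-1 .. -2
        ans_len = ans_ids.shape[1]
        answer_logits = logits[0, -ans_len-1:-1, :]  # (ans_len, V)
        log_p = F.log_softmax(answer_logits, dim=-1)
        # Sum over every answer token, not just the first, in case the
        # tokenizer splits the answer into several pieces
        token_lp = log_p.gather(1, ans_ids[0].unsqueeze(1)).sum().item()
        log_probs.append(token_lp)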

    return log_probs.index(max(log_probs))  # argmax

Generation scoring: prompt the model, sample a completion, then extract the answer from the generated text. This is more realistic — it tests whether the model can actually produce the right answer in a free-form context — but it requires a parsing step and is sensitive to the model's tendency to "talk around" the answer instead of stating it directly.
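The parsing step is where much of the "answer extraction strictness" comes from. Below is a minimal sketch of an extractor for free-form generations; the function name `extract_choice` and the two regular expressions are illustrative assumptions, and real harnesses use their own (often longer) pattern lists.

```python
import re

def extract_choice(text, letters="ABCD"):
    """Return the answer letter stated in a generation, or None if none is found."""
    patterns = [
        # "The answer is (D)" or "Answer: D"; the keyword is case-insensitive,
        # the letter itself must be uppercase to avoid matching the article "a"
        rf"(?i:answer\s*(?:is|:))\s*\(?([{letters}])\)?\b",
        # a line consisting only of the letter, e.g. "D" or "(D)."
        rf"^\s*\(?([{letters}])\)?\s*[.:)]?\s*$",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            return match.group(1)
    return None  # the model "talked around" the answer; usually scored as wrong

print(extract_choice("Six bonding pairs means sp3d2. The answer is (D)."))  # D
print(extract_choice("It depends on the hybridization."))                   # None
```

A stricter or looser pattern list changes which generations count as answered, which is one way the same model ends up with different scores on the same questions.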

Misconception: log-likelihood scoring is always better. Log-likelihood scoring is fast and deterministic, but it only works for structured tasks where the answer is a single token or short phrase. For tasks like "write a function to sort a list" or "explain the French Revolution," there's no way to score by likelihood — you must generate and judge the output. Worse, a model that scores highly on log-likelihood might produce incoherent text when actually generating, because it's been tuned to assign high probability to correct tokens without being tuned to generate fluently.

Multiple-choice scoring: log-likelihoods → argmax

Adjust the four log-likelihoods to see which choice wins. Toggle the correct answer to see what "correct prediction" means. This mirrors exactly what happens in MMLU, GPQA, and HellaSwag evaluation.

Controls: log p(A) = −3.1; log p(B) = −4.7; log p(C) = −2.9; log p(D) = −1.4 (D is the correct answer).

A model's log-likelihoods for choices A, B, C, D are −2.1, −1.8, −3.4, −2.6. Which answer does the model predict?

A: it has the most negative log-probability.
B: it has the highest (least negative) log-probability, so the model assigns it the most probability mass.
C: it has the lowest absolute value.
D: it's alphabetically last.

Chapter 4: Test-Time Compute — Scaling Inference

A model's first answer isn't always its best. If you ask it to reason step-by-step, or if you sample multiple answers and take the most common one, accuracy can improve dramatically — without changing any weights. This is test-time compute scaling: spending more FLOPs at inference to extract better answers from the same model.

Two main strategies. First, chain-of-thought (CoT) prompting: instead of asking for the answer directly, ask the model to show its work. "Let's think step by step." This pushes the model to decompose the problem into intermediate steps, each of which is easier to get right. On GSM8K (grade-school math), the original chain-of-thought paper reported PaLM 540B rising from about 18% with standard prompting to about 57% with CoT.

Second, self-consistency / majority voting: sample K independent reasoning chains, extract the final answer from each, and return the answer that appeared most often. This works because different reasoning paths can lead to the same correct answer, while wrong answers tend to be scattered across many incorrect values.

Majority vote accuracy from single-sample accuracy. If a single sample is correct with probability p, then K independent samples vote correctly when more than K/2 are right. For large K this approaches 1 if p > 0.5 and approaches 0 if p < 0.5. The improvement is largest when p is near 0.5 — moderate-difficulty questions benefit most from voting.

The math for majority vote accuracy as a function of single-sample accuracy p and number of samples K:

P(majority correct) = ∑_{j=⌊K/2⌋+1}^{K} C(K, j) · p^j · (1−p)^{K−j}

Let's compute this for a concrete example. Say p = 0.6 (each individual sample correct 60% of the time). With K = 1: 60%. With K = 5: P = C(5,3)·0.6³·0.4² + C(5,4)·0.6⁴·0.4 + C(5,5)·0.6⁵ = 10·0.216·0.16 + 5·0.1296·0.4 + 1·0.0778 = 0.346 + 0.259 + 0.078 = 68.3%. With K = 25: ~85%. The gains are real but diminishing.
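The binomial sum is easy to evaluate directly. The sketch below assumes the simplified model used in this section: every sample is independently right with probability p, and the vote succeeds only with a strict majority of correct samples. The function name is an illustrative choice.

```python
from math import comb

def majority_vote_accuracy(p, K):
    """P(strictly more than K/2 of K independent samples are correct)."""
    need = K // 2 + 1  # strict majority; ties count as failures
    return sum(comb(K, j) * p**j * (1 - p)**(K - j) for j in range(need, K + 1))

acc_1 = majority_vote_accuracy(0.6, 1)   # 0.6
acc_5 = majority_vote_accuracy(0.6, 5)   # about 0.683
acc_25 = majority_vote_accuracy(0.6, 25)

print(round(acc_1, 3), round(acc_5, 3), round(acc_25, 3))
```

Real self-consistency can do better than this model suggests, because wrong answers are usually spread over many different values and the correct one only needs a plurality.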

A third strategy is best-of-N (BoN): generate N answers and pick the one with the highest score according to a verifier or reward model. If each sample is independently correct with probability p and the verifier always recognizes a correct sample, the result is right whenever at least one sample is: P(at least one correct in N) = 1 − (1−p)^N. This approaches 100% quickly even for moderate p, but only with a reliable verifier.

P(best-of-N correct) = 1 − (1 − p)^N

Worked example: p = 0.3 (hard problem, 30% single-sample accuracy). BoN with N = 10: 1 − 0.7¹⁰ = 1 − 0.028 = 97.2%. Majority vote with K = 10: much lower, since p < 0.5 means the vote goes wrong most of the time. The right strategy depends on whether you have a good verifier.
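The comparison in this worked example can be reproduced with a short sketch. It assumes a perfect verifier for best-of-N and the strict-majority binomial model for voting; both function names are illustrative.

```python
from math import comb

def best_of_n_accuracy(p, N):
    """P(at least one of N independent samples is correct), assuming a perfect verifier."""
    return 1 - (1 - p) ** N

def majority_vote_accuracy(p, K):
    """P(strictly more than K/2 of K independent samples are correct)."""
    need = K // 2 + 1
    return sum(comb(K, j) * p**j * (1 - p)**(K - j) for j in range(need, K + 1))

p = 0.3
bon = best_of_n_accuracy(p, 10)      # about 0.972
maj = majority_vote_accuracy(p, 10)  # far below the single-sample accuracy of 0.3

print(f"best-of-10: {bon:.3f}, majority-of-10: {maj:.3f}")
```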

python — self-consistency majority vote
from collections import Counter
import re

import torch

def majority_vote(model, tokenizer, problem, K=10, temperature=0.7):
    """
    problem: str — math word problem text
    K:       int — number of reasoning chains to sample
    Returns: str — most common final answer across K samples
    """
    answers = []

    for _ in range(K):
        prompt = problem + "\n\nLet's think step by step.\n"
        ids = tokenizer(prompt, return_tensors='pt').input_ids
        with torch.no_grad():
            output = model.generate(
                ids, max_new_tokens=256,
                temperature=temperature, do_sample=True
            )
        text = tokenizer.decode(output[0][ids.shape[1]:])

        # Extract final numeric answer (GSM8K convention: "#### 42")
        m = re.search(r'####\s*([0-9,\-\.]+)', text)
        if m:
            answers.append(m.group(1).replace(',', ''))

    if not answers:
        return "unknown"
    return Counter(answers).most_common(1)[0][0]

Test-time compute: majority vote vs best-of-N accuracy

Set the single-sample accuracy p and the number of samples K (or N). See how majority voting and best-of-N diverge — one needs p > 0.5 to help, the other always helps given a reliable verifier.

Controls: single-sample accuracy p (default 0.45); max samples K or N (default 16).

A model has a single-sample accuracy of 0.4 (below 50%) on a hard reasoning task. You have a perfect verifier. Which test-time compute strategy do you use, and why?

Majority voting: the most popular answer is most likely correct.
Best-of-N: since p < 0.5, majority voting gives the wrong answer most of the time (the 60% wrong votes outnumber the 40% right), but best-of-N only needs one correct sample, which a perfect verifier can identify.
Neither: you should fine-tune the model instead.
Both give the same result when p = 0.4.

Chapter 5: LLM-as-a-Judge — Scalable But Biased

For open-ended tasks — "write a poem about entropy," "debug this code," "explain quantum entanglement to a 10-year-old" — there's no ground-truth answer to compare against. Human raters are expensive and slow. The natural solution: use a powerful LLM as the judge.

In pairwise preference evaluation, you show the judge two model responses (A and B) and ask which is better. You run this for thousands of question pairs across many models and compute win rates. The model with the highest win rate against a reference model (usually GPT-4) wins.

This is how AlpacaEval works: 805 instructions, judge is GPT-4 turbo, metric is win rate against GPT-4-preview outputs. And it's how Chatbot Arena works at scale: real users vote on pairs of anonymous models; Elo scores are computed from the pairwise wins.

Elo/Bradley-Terry from pairwise wins. In chess, Elo score predicts the probability of A beating B: P(A beats B) = 1/(1 + 10^{(Elo_B − Elo_A)/400}). In LLM arenas, we use Bradley-Terry: after observing many wins/losses, find the Elo scores that maximize the likelihood of the observed win pattern. A model rated 100 Elo points higher wins about 64% of head-to-head comparisons.

The math. Given model A with rating r_A and model B with rating r_B, both on a natural-log scale (setting r = Elo · ln(10)/400 recovers the chess formula above):

P(A wins) = exp(r_A) / (exp(r_A) + exp(r_B))

Given a dataset of N pairwise comparisons, find r_i for each model i by maximizing log-likelihood:

log L = ∑_{(i,j) : i beat j} [ r_i − log(exp(r_i) + exp(r_j)) ]

This can be solved by gradient ascent or Newton's method. The resulting ratings give a consistent global ranking from local pairwise comparisons — even when models haven't directly faced each other.
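As a rough illustration of the gradient-ascent route, here is a pure-Python sketch that fits Bradley-Terry ratings to a list of (winner, loser) pairs. The function names, learning rate, step count, and the toy win counts are all illustrative assumptions; production leaderboards use more careful solvers and confidence intervals.

```python
import math

def win_prob(r_a, r_b):
    """Bradley-Terry probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def fit_bradley_terry(wins, n_models, lr=0.5, steps=2000):
    """wins: list of (winner_index, loser_index). Returns ratings centered at zero."""
    r = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for w, l in wins:
            surprise = 1.0 - win_prob(r[w], r[l])  # d logL / d r_w for this comparison
            grad[w] += surprise
            grad[l] -= surprise
        r = [ri + lr * g / len(wins) for ri, g in zip(r, grad)]
        mean = sum(r) / n_models
        r = [ri - mean for ri in r]  # ratings are only identified up to a constant
    return r

# Toy data: A beats B 64 times out of 100, B beats C 64 times out of 100, A never meets C.
wins = [(0, 1)] * 64 + [(1, 0)] * 36 + [(1, 2)] * 64 + [(2, 1)] * 36
ratings = fit_bradley_terry(wins, n_models=3)

print([round(x, 2) for x in ratings])
print(round(win_prob(ratings[0], ratings[2]), 2))  # predicted A-vs-C win rate
```

Even though A and C never played each other, the fitted ratings imply a prediction for that matchup, which is the "global ranking from local comparisons" property described above.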

Now for the bad news: LLM judges are systematically biased. There are three well-documented failure modes:

Position Bias

Judges prefer whichever response appears first (Position A). Measured: 15–20% of cases reverse when the order is flipped. Fix: always evaluate both orders and average.


Length Bias

Judges favor longer responses, even when brevity is better. A 2-sentence correct answer loses to a 3-paragraph wrong answer ~30% of the time. Fix: normalize by length or use a length-penalizing rubric.


Self-Preference Bias

A GPT-4 judge favors GPT-4 outputs over equally-good Claude outputs. AlpacaEval uses GPT-4 as both judge and reference — circular. Fix: use third-party judges, multi-judge consensus, or human spot-checks.
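The "evaluate both orders and average" fix for position bias can be sketched as follows. The `judge` callable is a hypothetical stand-in for an LLM call that returns "first" or "second"; the two toy judges exist only to show the effect.

```python
def debiased_win_rate(judge, pairs):
    """Win rate of response A over B, averaged over both presentation orders.

    judge(prompt, first, second) -> "first" or "second"  (hypothetical interface)
    pairs: list of (prompt, response_a, response_b)
    """
    score = 0.0
    for prompt, resp_a, resp_b in pairs:
        a_wins_shown_first = judge(prompt, resp_a, resp_b) == "first"
        a_wins_shown_second = judge(prompt, resp_b, resp_a) == "second"
        score += 0.5 * a_wins_shown_first + 0.5 * a_wins_shown_second
    return score / len(pairs)

def always_first(prompt, first, second):
    """Toy judge with pure position bias: it ignores content entirely."""
    return "first"

def prefers_longer(prompt, first, second):
    """Toy judge with pure length bias."""
    return "first" if len(first) >= len(second) else "second"

pairs = [("q1", "short", "a much longer answer"), ("q2", "tiny", "another long answer")]
position_only = debiased_win_rate(always_first, pairs)
length_only = debiased_win_rate(prefers_longer, pairs)
print(position_only, length_only)
```

Symmetrizing cancels the pure position bias (the first judge lands at exactly 50%), but it does nothing about length bias: the second judge still hands every win to the longer response.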

Misconception: high win rate on AlpacaEval means the model is better. AlpacaEval win rate inflates for verbose models (length bias) and models trained on data similar to the judge's preferences (self-preference). A model fine-tuned on GPT-4 outputs will score artificially high because the GPT-4 judge recognizes and prefers its own style. WildBench addresses this with a checklist-based judging prompt and sources questions from real human conversations — it correlates 0.95 with Chatbot Arena human preferences.

LLM-judge bias: how position and length distort win rates

Adjust the position bias (how much the judge prefers response A regardless of quality) and length bias (how much extra length is rewarded). Watch how the measured win rate diverges from the true quality difference.

Controls: position bias, from 0 (none) to 1 (always picks A), default 0.15; length bias per extra 100 words, default 0.03.

You're running a pairwise evaluation with GPT-4 as judge. Model A wins 62% of comparisons against Model B when A's response is shown first. When the order is flipped, Model A wins only 44%. What is your best estimate of Model A's true win rate?

62%: you should always trust the first-position result.
44%: the reversed order is more natural.
~53%: average the two win rates, (62% + 44%) / 2 = 53%, removing position bias by symmetrization.
18%: the difference between the two measurements.

Chapter 6: Showcase — Evaluation System Explorer

You're an ML engineer deciding which evaluation strategy to run for a new instruction-tuned model. You have a budget of N evaluations (inference calls). How should you allocate them across benchmark tasks? This showcase lets you configure a full eval suite and see the expected coverage, compute cost, and reliability of your evaluation.

The canvas below shows five benchmark categories. Drag the sliders to allocate your evaluation budget. The bars show: expected accuracy improvement from voting (left), expected cost in FLOPs (center), and coverage of real-world use cases (right). The goal is to find the mix that maximizes coverage while staying within budget.

Full eval suite configurator — allocate your inference budget

Allocate evaluation calls across benchmark types. Each type has a different cost (tokens per call) and coverage (fraction of real use cases it addresses). The total bar turns red when you exceed budget. Click any bar to see benchmark details.

Controls: total eval budget (default 5,000 calls); votes per question, i.e. self-consistency K (default 1); judge calls for LM-judge tasks (default 1×).

You have a fixed token budget. Self-consistency with K=5 boosts accuracy from 60% to 73% on GSM8K. Spending the same budget on 5× more diverse benchmark questions gives coverage of 5 more domains. Which is generally more informative about model quality?

Self-consistency: the accuracy improvement proves the model is smarter.
More diverse questions: broader coverage reveals whether the model generalizes, while voting on the same questions only tells you about variance on a narrow domain.
They give equal information: both use the same total compute.
Neither: you should just use human evaluation.

Chapter 7: Contamination — When Your Training Set Ate the Test Set

Machine learning has a cardinal rule: never train on your test set. Pre-foundation-model era, this was enforced by explicit data splits — ImageNet train/val/test are disjoint, SQuAD splits are manually separated. In the foundation model era, the rule is routinely broken by accident.

Modern LLMs train on trillions of tokens scraped from the Internet. MMLU questions were posted on Reddit study forums. GSM8K solutions were discussed on math help sites. HellaSwag text came from ActivityNet captions. The training corpus almost certainly contains text that appears verbatim or near-verbatim in the test benchmarks. This is contamination, and it silently inflates benchmark scores.

How much does contamination matter? A 2023 study tested models on original MMLU versus a "shuffled answers" variant. Clean models scored the same on both. Contaminated models scored 5–15% higher on the original than shuffled (because they memorized specific answer letters, not the reasoning). That's a contamination-inflated bonus of 5–15 percentage points.
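Building a shuffled-answers variant only requires permuting the options and tracking where the gold answer lands. A minimal sketch, with an illustrative function name and a fixed seed for reproducibility:

```python
import random

def shuffle_choices(choices, gold_index, rng):
    """Permute answer options and return (shuffled_choices, new_gold_index)."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_gold = order.index(gold_index)  # position the original gold answer moved to
    return shuffled, new_gold

rng = random.Random(0)
choices = ["sp", "sp2", "sp3", "sp3d2"]
shuffled, new_gold = shuffle_choices(choices, gold_index=3, rng=rng)
print(shuffled, "gold letter:", "ABCD"[new_gold])
```

Scoring the model on both the original and the shuffled items, then comparing the two accuracies, gives the gap described above.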

Contamination is hard to detect and easy to exploit. If you know your training corpus, you can check for n-gram overlap between training data and benchmark questions. But model providers often don't publish their training data, which leaves two routes. Route 1: exploit exchangeability. Permute the answer choices and check whether the model's accuracy drops (a contaminated model memorized specific letter positions). Route 2: demand reporting norms. Require model providers to publish their train-test overlap statistics.
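When the training corpus is available, the n-gram overlap check can be sketched in a few lines. Whitespace tokenization, lowercasing, and n = 8 are simplifying assumptions for illustration; the function names and the toy documents are hypothetical, and a check at real corpus scale would need hashing or an index rather than an in-memory set.

```python
def ngrams(text, n):
    """Set of word n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # Items shorter than n tokens produce no n-grams and are never flagged.
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged

training_docs = [
    "Forum post: which of the following best describes the hybridization "
    "of the central atom in SF6 ? I think D",
]
benchmark_items = [
    "Which of the following best describes the hybridization of the central atom in SF6?",
    "What is the capital of France and when was it founded?",
]
fraction, flagged = overlap_fraction(benchmark_items, training_docs)
print(fraction, flagged)
```

Exact n-gram matching misses paraphrased or translated copies, so a low overlap fraction is weaker evidence than a high one.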

There's a subtler form of contamination: benchmark adaptation. A lab knows MMLU is used for rankings. They include "MMLU-style" questions in their instruction-tuning data — not the exact questions, but the same format, subjects, and difficulty level. The model learns to be good at the benchmark format, not at chemistry or law. This is gaming the metric, not improving the underlying capability.

The Goodhart trap. "When a measure becomes a target, it ceases to be a good measure." This is Goodhart's Law. MMLU was designed to measure general knowledge and reasoning. Once labs start optimizing specifically for MMLU, it stops being a neutral measure. The arms race then demands harder benchmarks (GPQA, HLE), which then get "trained against," requiring even harder benchmarks. This cycle has no end, which is why diverse, real-world evaluation (Chatbot Arena, MedHELM) is increasingly valued over static benchmarks.

Contamination: how train-test overlap inflates benchmark scores

Adjust the fraction of test questions that appear in training data and the memorization strength. The canvas shows the "true" accuracy (what the model would score on novel questions) vs the reported accuracy (inflated by memorization). The gap is the contamination bonus.

Controls: contamination fraction, the share of test questions present in training data (default 10%); memorization boost on contaminated questions (default 0.30).

You suspect a model has contaminated MMLU in its training data. You run the model on the original MMLU and on a variant where answer choices A/B/C/D are randomly shuffled. The model scores 87% on original but 74% on shuffled. What does this suggest?

The model's true accuracy is 87%: shuffling answer choices makes the task harder.
The model is bad at handling answer choice shuffles, which is a formatting issue.
The model likely memorized specific (question, answer-letter) pairs: the 13-point gap is the contamination bonus, since a model reasoning from first principles should score similarly regardless of letter assignment.
The model dislikes randomness and should be evaluated with a fixed seed.

Chapter 8: Instruction Following & Agent Evals

MMLU tests knowledge. HellaSwag tests commonsense completion. But most real users aren't asking their LLM to pick (A), (B), (C), or (D). They're asking it to write an email, debug their code, plan a trip, or help them understand a medical report. These tasks require instruction following — the ability to produce a useful, well-formatted response to an open-ended request.

Evaluating instruction following is fundamentally harder than evaluating multiple-choice. There's no single correct answer. Length, format, tone, and completeness all matter. Three approaches have emerged: synthetic constraints, pairwise preferences, and human-sourced tasks.

IFEval (Instruction-Following Eval) adds verifiable constraints to instructions: "Write a response in exactly 3 paragraphs," "Include the word 'sustainability' at least twice," "Do not use the word 'however'." These constraints can be checked programmatically — no human or LLM judge needed. The downside: the constraints feel artificial, and a model could satisfy them while producing a useless response.
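The three example constraints quoted above can each be checked with a few lines of code. This is a sketch in the spirit of IFEval, not its actual implementation; the function names and the blank-line definition of a paragraph are assumptions.

```python
import re

def has_n_paragraphs(response, n):
    """True if the response has exactly n blank-line-separated paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n

def mentions_at_least(response, word, k):
    """True if `word` appears at least k times as a whole word (case-insensitive)."""
    hits = re.findall(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE)
    return len(hits) >= k

def avoids_word(response, word):
    """True if `word` never appears as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(word)}\b", response, flags=re.IGNORECASE) is None

def constraint_satisfaction_rate(response, checks):
    """Fraction of constraint checks the response passes."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)

response = (
    "Sustainability matters.\n\n"
    "We measure sustainability yearly.\n\n"
    "That is all."
)
checks = [
    lambda r: has_n_paragraphs(r, 3),
    lambda r: mentions_at_least(r, "sustainability", 2),
    lambda r: avoids_word(r, "however"),
]
rate = constraint_satisfaction_rate(response, checks)
print(rate)  # 1.0
```

The toy response passes every check while saying almost nothing, which is exactly the downside noted above: constraint satisfaction is verifiable but says little about usefulness.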

AlpacaEval uses 805 real instructions from various sources and measures win rate against GPT-4's responses, as judged by GPT-4. Circular? Yes — but it correlates 0.97 with human preference on a smaller validation set. The key limitation: verbose models inflate their win rate because the GPT-4 judge has a length bias.

Chatbot Arena solves this by crowdsourcing: real people type prompts, see two anonymous model responses, and vote. Elo scores are computed from millions of pairwise votes. It has two major advantages over static benchmarks: inputs are live (adversarial red-teamers keep finding edge cases), and new models can be added without rerunning everything. The major disadvantage: it's slow (weeks to converge on a new model's rating) and reflects the preferences of whoever shows up on the website (tech-savvy, English-speaking, curious about AI).

Agent benchmarks change the evaluation unit. For tasks like SWE-bench (solve a GitHub issue) or MLE-bench (win a Kaggle competition), you're not evaluating a single model call. You're evaluating an agentic system: model + scaffolding + tool use + iteration. A 70B model with good scaffolding can outperform a 200B model without it. This means "model capability" and "system capability" are deeply entangled, and a benchmark score reflects the entire pipeline, not just the weights.

Benchmark	Type	Judge	Key metric	Limitation
IFEval	Instruction following	Programmatic	Constraint satisfaction rate	Artificial constraints
AlpacaEval	Instruction following	GPT-4	Win rate vs GPT-4	Length + self-preference bias
WildBench	Instruction following	GPT-4 + checklist	Win rate (checklist-weighted)	Still GPT-4 dependent
Chatbot Arena	Open-ended	Humans	Elo rating	Slow, biased sample
SWE-bench	Coding (agent)	Unit tests	% issues resolved	Narrow domain
MLE-bench	ML (agent)	Kaggle metrics	Medal rate	Requires training infra

Evaluating safety is a special case. Safety benchmarks like HarmBench (510 harmful behaviors) and AIR-Bench (5,694 prompts across 314 risk categories) measure whether a model refuses harmful requests. But safety is contextual: what's harmful in one cultural or legal context may be acceptable in another. Jailbreaking techniques (e.g., GCG, which automatically optimizes adversarial suffixes) can bypass safety training, and these attacks transfer from open-weight models to closed API models. Whether a 0% refusal rate under jailbreaks makes a model "less safe" also depends on deployment context, for example whether the attacker can fine-tune the weights or can only reach the model through an API.

SWE-bench evaluates performance on real GitHub issues. A lab gets state-of-the-art SWE-bench performance using their new model. But the SWE-bench issues are from public GitHub repositories that were scraped before the model's training cutoff. What major validity concern does this raise?

Unit tests might be flawed, so a different metric is needed.
The coding tasks are too narrow to generalize.
Contamination: the model may have seen the exact issue descriptions and even the solution PRs during training, memorizing patches rather than learning to reason about code.
Python repositories are easier than other languages.

Chapter 9: Connections & What Evaluation Can't Tell You

Percy's takeaway from Lecture 12: "There is no one true evaluation. Choose your evaluation based on what question you are trying to answer. Always look at individual instances and predictions — not just the aggregate number." Let's make this concrete with a synthesis and cheat sheet.

The four purposes of evaluation — and why they need different tools.
1. Purchase decision: "Should we use Model A or Model B for our customer service chatbot?" → Use Chatbot Arena or a domain-specific WildBench variant with your actual use cases.
2. Capability measurement: "How intelligent is this model in a broad sense?" → Use GPQA, ARC-AGI, or HLE — benchmarks that resist Goodharting.
3. Benefit/harm analysis: "What can this model do, and to whom?" → Use HarmBench, AIR-Bench, dual-use capability benchmarks.
4. Model development feedback: "Is my fine-tuning run going in the right direction?" → Use perplexity on a validation split — smooth, fast, and directly optimizable.

Concept	Formula / Key number	Main caveat
Perplexity	PPL = exp(NLL) = exp(−mean log p(x_t | x_{<t}))	Tokenizer-dependent; can't compare across different tokenizers
MC scoring	Predict: argmax_c log p(c | question)	Disagrees with generation scoring; prompt-format sensitive
Majority vote	P(correct) = ∑_{j>K/2} C(K,j) p^j (1−p)^{K−j}	Requires p > 0.5 to help; gains are diminishing
Best-of-N	P(correct) = 1 − (1−p)^N	Requires a reliable verifier; always helps given verifier
Bradley-Terry Elo	P(A wins) = exp(r_A) / (exp(r_A) + exp(r_B))	Assumes transitivity; slow to converge; input distribution bias
Contamination	Shuffled-answer test: clean model scores same; contaminated scores lower	Doesn't catch semantic (near-duplicate) contamination

The three most dangerous misconceptions in evaluation:

1. Higher MMLU = better model. MMLU is becoming saturated, format-sensitive, and possibly contaminated. A model scoring 2% higher on MMLU may be worse in every real-world use case that matters to you.

2. The judge's preferences = user preferences. GPT-4 as a judge prefers verbose, GPT-4-style responses. Anthropic's Claude judge would prefer something different. Neither necessarily reflects what your users want.

3. Benchmark score = deployment readiness. Evaluation benchmarks measure average-case performance on curated inputs. Deployment involves the full tail — edge cases, adversarial users, domain-specific jargon, multilingual inputs, multi-turn conversations. A model needs to pass the benchmark AND real red-teaming before it's deployment-ready.

What's next after Lecture 12? Lecture 13 covers data — the other side of the training equation. How do you curate a trillion-token corpus, handle multilingual data, and deal with the quality-quantity tradeoff? And since contamination depends on what's in your training data, the two lectures are intimately connected.

Related lessons: CS336 L11 — Scaling Laws II — perplexity appears as the validation loss that scaling laws predict. Understanding PPL deeply is prerequisite to understanding why scaling laws work.

CS336 L9 — Scaling Laws I — cross-entropy loss L(N,D) is exactly NLL, which is log(PPL). Scaling laws tell you where on the PPL landscape your training run will end up.

AI Evaluation (Harness Engineering): practical guide to building eval pipelines, covering frameworks like EleutherAI's lm-evaluation-harness and HELM, and how to run evals in production.

"There is no free lunch. Every evaluation trades off cost, coverage, validity, and realism. The art is in choosing the tradeoffs that match your actual question."
— Percy Liang (paraphrased, CS336 Lecture 12)

You are building a medical Q&A assistant and need to choose an evaluation strategy before deployment. Which combination is most appropriate?

MMLU (medical subset) + AlpacaEval win rate: these are standard benchmarks.
Perplexity on medical text + Chatbot Arena overall Elo: perplexity is smooth and Arena is human-rated.
MedHELM (121 real clinical tasks from clinicians) + HarmBench (safety) + hallucination rate on medical facts: covers real clinical use cases, safety, and the highest-stakes failure mode for medical AI.
Best-of-32 sampling + majority vote: more compute = better quality.
