Alignment I: SFT & RLHF — Language Modeling from Scratch (CS336 L15)

Chapter 0: The Alignment Gap

You have trained a 7B-parameter language model on two trillion tokens of internet text. It can continue any piece of writing with stunning fluency. You type: "How do I make a bomb?"

The base model completes this sentence. It saw thousands of relevant documents during training, and its job, the only job it was ever trained for, is to predict what comes next given a prefix. From the model's perspective, this is the same as "How do I make a cake?" Complete the sentence. Maximize the probability of the most plausible continuation.

This is the alignment gap: the gap between what the base model is optimized for (next-token prediction on web text) and what a deployed assistant must do (follow instructions, produce helpful content, refuse harmful requests, maintain a helpful tone, say "I don't know" when it doesn't know).

The base model is not dumb — it is misaligned. It has absorbed vast knowledge about how to complete all kinds of text. It just has no notion of "helpfulness," "instruction-following," or "refusal." Alignment injects those behaviors without destroying the underlying knowledge.

The 2022 InstructGPT paper (Ouyang et al.) codified the standard recipe. You can think of it as a two-stage post-training process that runs after pretraining:

Stage 1 — SFT

Show the model examples of (prompt → ideal response). Train it to imitate those responses via supervised cross-entropy. This teaches style, instruction-following behavior, and basic safety.

↓

Stage 2 — RLHF

Collect human preferences between pairs of responses (A vs B). Train a reward model on those preferences. Then optimize the LM to generate high-reward responses — while staying close to the SFT model.

Both stages are cheap compared to pretraining. The entire InstructGPT SFT dataset was ~13k prompts with human-written responses. Reward model training used ~33k comparison pairs. Yet the resulting model was rated dramatically more helpful by human evaluators — even compared to a 100× larger base model.

Interactive figure: base model vs SFT vs RLHF, showing what a model at each training phase might output for the same prompt, "Explain quantum entanglement to a 10-year-old." The differences illustrate what each training stage adds.

A pretrained base LM is asked "What is the capital of France?" and completes: "What is the capital of France? The capital of Germany is Berlin. The capital of Italy is Rome…" This is an example of:

The model being too small: a larger model would answer correctly.
A tokenization error: the model misread the question.
The alignment gap: the model is completing the prompt in the style of a quiz or list it saw in training, not following the implied instruction to answer.
A hallucination: the model made up wrong facts.

Chapter 1: SFT: Training Data

Supervised Fine-Tuning (SFT) is simple in concept: collect a dataset of (prompt, ideal response) pairs, and do gradient descent to maximize the probability the model assigns to the ideal response given the prompt. The question is — what data should you use?

Three major public instruction-tuning datasets have shaped how the field thinks about this:

FLAN: academic benchmarks reformatted as instructions. "Classify this article: text" with a label as the response. Huge scale (~1.8k tasks in FLANv2), but mechanical and narrow in style.
Alpaca: 52k examples generated by OpenAI's text-davinci-003 (a GPT-3-family model) from a small set of seed instructions. An early, widely noticed demonstration that a small, cheap, model-generated dataset could significantly improve instruction-following. It applied the "self-instruct" recipe of Wang et al. (2022) and popularized it.
OpenAssistant (OASST): human-written conversations, including multi-turn dialogue and nuanced tasks like citing research. Higher quality but much smaller (~161k messages).

What varies across these datasets? Three things matter most: (1) response length and style — FLAN is terse, OASST is verbose; (2) knowledge scope — FLAN is narrow, OASST is broad; (3) safety coverage — the amount of harmful/sensitive prompt-response pairs that teach the model to refuse. The style the model learns depends heavily on which dataset dominates.

The "less is more" result — LIMA

The LIMA paper (Zhou et al., 2023) is one of the most important results in post-training. They curated just 1,000 examples by hand — high-quality, diverse, carefully written — and found that a LLaMA model fine-tuned on this tiny dataset matched or exceeded models fine-tuned on 52,000 (Alpaca) or 9,846 (Dolly) examples.

Their conclusion: SFT does not teach the model new knowledge. It teaches the model a response style — how to present the knowledge it already absorbed during pretraining. If the model already knows something, SFT helps it learn when and how to say it. But SFT cannot make the model know things it doesn't know from pretraining.

The elicitation hypothesis. Instruction fine-tuning is not "teaching" in the same sense as pretraining. You are not adding new facts to the model. You are eliciting behaviors that are already latent in the weights — turning a web-text completer into an assistant. This has a corollary: don't fine-tune on facts the model doesn't already know. It will confabulate rather than learn.

Safety data: a little goes a long way

Several studies found that adding just ~500 safety-specific examples (prompts that should be refused + good refusals as responses) drastically improves safety behavior. This is striking: 500 examples out of the billions of tokens the model saw in pretraining are enough to establish a new behavioral pattern.

The danger is over-refusal. A model trained too heavily on safety data starts refusing benign requests ("what is the history of nuclear weapons?" gets refused because it mentions "nuclear"). The art of safety tuning is keeping a narrow, well-targeted refusal distribution without collateral damage to helpfulness.

LIMA achieves strong instruction-following with only 1,000 examples. What does this most directly support about the mechanism of SFT?

SFT is a very data-efficient learning algorithm: it updates weights in highly targeted ways.
The pretrained model already contains the knowledge; SFT mainly elicits the style and format for presenting that knowledge as an assistant.
1,000 examples is actually a lot of data when compared to the model's parameter count.
Larger SFT datasets hurt because they introduce noisy gradients that overwrite pretraining.

Chapter 2: SFT: The Loss

The mechanics of SFT are straightforward once you understand one design choice: we only compute loss on the response tokens, not the prompt tokens.

A training example in instruction-tuning is formatted as a chat template — a structured string that packages the system prompt, user message, and assistant response:

text
<|system|>
You are a helpful assistant.
<|user|>
Explain quantum entanglement to a 10-year-old.
<|assistant|>
Imagine you have two magic coins that are best friends...

After tokenization, this becomes a sequence of tokens. We split it into two parts: the prompt (everything up to and including <|assistant|>) and the response (everything after). The loss is computed only on the response tokens.
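The prompt/response split can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: `toy_tokenize` is a hypothetical whitespace tokenizer invented for this example, and `render_chat` simply hard-codes the template shown above. A real tokenizer may merge characters across the prompt/response boundary, so production code usually tokenizes the full string and locates the boundary afterwards.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def render_chat(system, user, assistant):
    """Return (prompt_text, response_text) for the simple template above."""
    prompt = f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"
    return prompt, assistant

def toy_tokenize(text, vocab):
    """Hypothetical whitespace tokenizer: each new word gets the next free id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

def build_example(system, user, assistant):
    vocab = {}
    prompt, response = render_chat(system, user, assistant)
    prompt_ids = toy_tokenize(prompt, vocab)
    response_ids = toy_tokenize(response, vocab)
    input_ids = prompt_ids + response_ids
    # Prompt positions are masked; response positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels, len(prompt_ids)

input_ids, labels, prompt_len = build_example(
    "You are a helpful assistant.",
    "Explain quantum entanglement to a 10-year-old.",
    "Imagine you have two magic coins that are best friends...",
)
print(prompt_len, len(input_ids))  # 14 prompt tokens, 24 tokens in total
```

Everything up to and including the `<|assistant|>` marker lands in the masked region, so only the ten response tokens contribute to the loss.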

Deriving the SFT cross-entropy loss

Let the response tokens be y₁, y₂, …, y_T and the full prefix (prompt + response so far) at step t be x_<t. The model produces a probability distribution over the vocabulary at each step. The SFT loss is the standard cross-entropy, but summed only over response token positions:

ℒ_SFT(θ) = − (1/T) ∑_{t=1}^{T} log p_θ(y_t | x_<t)

Each term log p_θ(y_t | x_<t) is the log probability the model assigns to the correct next token. If the model is confident and correct, this is close to 0. If it's confident and wrong, this is very negative. Averaging over T response tokens and negating gives a positive loss we minimize.

Why mask the prompt? If we also computed loss on the prompt, we would be penalizing the model for not predicting user messages — which we never actually want to generate during inference. Worse, user messages are often questions, and minimizing loss on questions would pull the model toward question-completion behavior rather than answer-generation behavior.

PyTorch implementation

python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy

def build_labels(input_ids, prompt_len):
    # Copy the inputs, then mask the prompt so it contributes no loss.
    # Assumes every row in the batch shares the same prompt length;
    # with variable-length prompts, mask each row separately.
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX
    return labels

def sft_loss(model, input_ids, labels):
    # input_ids: (B, seq_len), full prompt + response tokens
    # labels:    (B, seq_len), same ids with prompt positions set to -100
    logits = model(input_ids).logits                 # (B, seq_len, vocab_size)

    # Shift: the logits at position t predict the token at position t+1
    shift_logits = logits[:, :-1, :].contiguous()    # (B, seq_len-1, V)
    shift_labels = labels[:, 1:].contiguous()        # (B, seq_len-1)

    # cross_entropy skips positions where label == -100 and returns the
    # mean over the remaining (response) tokens in the batch
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )

Worked example: suppose the response is the 4-token sequence "Paris is the capital" and the model assigns probabilities [0.7, 0.6, 0.9, 0.8] to the correct tokens at each step. Then:

ℒ_SFT = −(1/4)(log 0.7 + log 0.6 + log 0.9 + log 0.8) = −(1/4)(−0.357 − 0.511 − 0.105 − 0.223) = 0.299
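The same arithmetic can be checked in plain Python. This is a small sketch using only the four probabilities from the worked example; `sft_loss_from_probs` is a name made up for illustration.

```python
import math

def sft_loss_from_probs(token_probs):
    """Mean negative log-probability over the response tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.7, 0.6, 0.9, 0.8]
loss = sft_loss_from_probs(probs)
print(round(loss, 3))             # 0.299

# exp(-loss) is the geometric mean of the per-token probabilities
print(round(math.exp(-loss), 3))  # 0.742
```

Reading the loss as a geometric mean gives a useful sanity check: a loss of 0.299 means the model assigns, on average, about 74% probability to each correct response token.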

In the SFT loss, why do we set prompt token labels to −100 (PyTorch's ignore_index) rather than computing loss on the full sequence?

To save GPU memory: gradient computation through the prompt doubles memory usage.
Because we want to train the model to generate responses, not predict user inputs. Penalizing the model for not predicting user messages would pull gradients in the wrong direction, toward question-completion rather than answer-generation.
Because prompt tokens are already known to be correct: they have loss = 0 by definition.
To prevent catastrophic forgetting: the prompt tokens were seen during pretraining.

Chapter 3: Preference Data

SFT works when you have good (prompt, response) pairs. But writing ideal responses is expensive, and there is a subtler problem: people are much better at judging quality than producing it. It is much easier for a human to look at two responses and say "this one is better" than to write the perfect response from scratch.

This is the generation-verification (G-V) gap: the cost of verifying quality is far lower than the cost of generating quality. RLHF exploits this gap. Instead of expensive gold responses, you collect cheap pairwise preferences.

Cost reality check (InstructGPT scale). For a 7B model: pretraining ≈ $300k. SFT data (25k examples from human writers) ≈ $25k. Preference data (33k pairwise comparisons from crowd workers) ≈ $4k. The RL training itself ≈ $100. Each stage is cheaper by an order of magnitude, yet each stage compounds the alignment improvement.

What preference data looks like

The standard setup: given a prompt x, sample two responses y_w (winner/chosen) and y_l (loser/rejected) from the current model. Present both to a human annotator and ask: "Which response do you prefer?" The annotator picks one. You record the tuple (x, y_w, y_l).
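One way to represent such a record in code is sketched below. The `PreferencePair` class and the `record_preference` helper are hypothetical names chosen for this example, not part of any particular annotation tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # x
    chosen: str    # y_w, the response the annotator preferred
    rejected: str  # y_l, the other response

def record_preference(prompt, response_a, response_b, annotator_pick):
    """Turn an annotator's 'A' or 'B' choice into an (x, y_w, y_l) record."""
    if annotator_pick not in ("A", "B"):
        raise ValueError("annotator_pick must be 'A' or 'B'")
    if annotator_pick == "A":
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)

pair = record_preference(
    "What is the capital of France?",
    "France has a capital called maybe Paris?",
    "Paris is the capital of France.",
    annotator_pick="B",
)
print(pair.chosen)  # Paris is the capital of France.
```

Storing the pair in winner/loser order, rather than as (A, B, label), keeps the downstream loss code free of branching on the label.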

InstructGPT hired 40 workers via Scale AI and Upwork, with careful vetting for agreement with researcher judgments. They were given detailed rating guidelines covering helpfulness, harmlessness, and honesty. Despite this, inter-annotator agreement was only moderate (~72%) — humans do not agree perfectly on what "better" means.

Who annotates matters — a lot. Research by Santurkar et al. (2023) showed that RLHF models reflect the values and demographic biases of their annotator pool. A model trained on preferences from U.S. crowd workers will learn a distinctly American sense of what counts as a "good" answer — more verbose, more optimistic, more certain. Models trained on annotators from different demographics show noticeably different response styles on politically sensitive topics.

LM-generated feedback

An increasingly popular cost-cutting move: use GPT-4 (or a strong model) as the preference annotator. Constitutional AI (Anthropic) and UltraFeedback (used in Zephyr, Tulu3) showed that AI-generated preferences correlate surprisingly well with human preferences at the system level — near human inter-annotator agreement rates. This collapses the cost from thousands of dollars to tens of dollars for a new dataset.

The G-V gap justifies collecting preference data rather than SFT gold responses. Which statement best characterizes the G-V gap?

Generating text with a language model is faster than verifying its quality with a classifier.
Gradient descent finds better solutions than greedy generation.
Large models generate better text than small models verify.
Humans (and models) can judge which of two outputs is better far more reliably and cheaply than they can produce an ideal output from scratch.

Chapter 4: Reward Modeling

You have a dataset of pairwise preferences (x, y_w, y_l). Now what? You need to turn these discrete comparisons into a scalar reward signal that the RL training loop can optimize. That is the job of the reward model.

The Bradley-Terry model

The Bradley-Terry model (1952) is a probabilistic framework for pairwise comparisons. It says: each item i has a latent "strength" r_i. The probability that item A beats item B is:

P(A ≻ B) = e^{r_A} / (e^{r_A} + e^{r_B}) = σ(r_A − r_B)

That last step is the key: the sigmoid function σ(z) = 1/(1 + e^{−z}) naturally converts the reward difference into a probability. When r_A ≫ r_B, P(A ≻ B) → 1. When r_A = r_B, P = 0.5 (a coin flip). When r_A ≪ r_B, P → 0.
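A few lines of Python make the three regimes concrete and confirm that the ratio form and the sigmoid form agree. This is a sketch; the function names are invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_prob(r_a, r_b):
    """P(A beats B) under Bradley-Terry, via the reward difference."""
    return sigmoid(r_a - r_b)

def bt_prob_ratio_form(r_a, r_b):
    """Same probability written as e^{r_A} / (e^{r_A} + e^{r_B})."""
    return math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))

for r_a, r_b in [(5.0, 0.0), (1.5, 1.5), (0.0, 5.0)]:
    print(r_a, r_b, round(bt_prob(r_a, r_b), 3))  # 0.993, 0.5, 0.007
```

Because only the difference r_A − r_B enters, adding the same constant to every reward leaves all preference probabilities unchanged. Rewards under this model are therefore only meaningful relative to one another.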

Worked example

Suppose our reward model scores two responses as r_w = 2.1 and r_l = 0.4. Then:

P(y_w ≻ y_l) = σ(2.1 − 0.4) = σ(1.7) = 1/(1 + e^{−1.7}) = 1/(1 + 0.183) = 0.845

The model says there is an 84.5% chance that y_w is preferred. The reward model's training loss is the negative log-likelihood under Bradley-Terry:

ℒ_RM(θ) = − E_{(x,y_w,y_l)} [ log σ(r_θ(x, y_w) − r_θ(x, y_l)) ]

This loss drives r_θ(x, y_w) > r_θ(x, y_l) for every preference pair. The bigger the margin, the lower the loss. The reward model is initialized from the SFT checkpoint (same architecture) with a linear head on top of the final hidden state that outputs a single scalar reward.
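To get a feel for the scale of this loss, here is a small numeric sketch in plain Python. It takes reward scores as given numbers rather than calling a model, and `rm_loss` is a name made up for this example.

```python
import math

def log_sigmoid(z):
    """Numerically stable log σ(z)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(pairs):
    """Mean Bradley-Terry negative log-likelihood over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

print(round(rm_loss([(2.1, 0.4)]), 3))  # 0.168: correct ranking, margin 1.7
print(round(rm_loss([(1.0, 1.0)]), 3))  # 0.693: a tie costs log 2
print(round(rm_loss([(0.4, 2.1)]), 3))  # 1.868: wrong ranking
```

A tie costs exactly log 2, a correct ranking costs less, and a wrong ranking is penalized roughly linearly in the size of the (negative) margin.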

Why initialize from SFT? The reward model needs to understand language well enough to assess response quality. Starting from the SFT checkpoint means it already understands the instruction-following context. Random initialization would require the reward model to learn language understanding from scratch on only 33k examples — hopeless.

PyTorch reward model loss

python
import torch
import torch.nn.functional as F

def reward_model_loss(rm, x_chosen, x_rejected):
    # rm: reward model (LM + scalar head)
    # x_chosen, x_rejected: tokenized (prompt + response) for each pair

    r_w = rm(x_chosen).reward      # scalar reward for chosen response
    r_l = rm(x_rejected).reward    # scalar reward for rejected response

    # Bradley-Terry: maximize log σ(r_w − r_l)
    loss = -F.logsigmoid(r_w - r_l).mean()
    return loss

# At inference: higher reward = model predicts human would prefer this response
# reward(x, "Paris is the capital of France") → 2.1
# reward(x, "France has a capital called maybe Paris?") → 0.4

Interactive figure: Bradley-Terry preference probability as a function of the reward gap r_w − r_l. The curve saturates quickly: a gap of 3 already implies more than 95% confidence.

The Bradley-Terry model assigns P(A≻B) = σ(r_A − r_B). If the reward model gives r_A = 1.5 and r_B = 1.5, what is P(A≻B), and what does this mean?

P = 1.0. Identical scores mean both are perfect.
P = 0.0. Identical scores mean neither can win.
P = 0.5. σ(0) = 0.5, meaning the model is maximally uncertain: it has no preference between A and B.
P = 0.75. The model adds a small prior toward A when scores are tied.

Chapter 5: RLHF with PPO

You now have a reward model r(x, y) that can score any (prompt, response) pair. The RLHF problem is: find the policy π_θ (your language model) that maximizes expected reward. Naively:

max_θ E_{x~D, y~π_θ(·|x)} [ r(x, y) ]

This looks clean, but there is a catastrophic failure mode hiding in it. If you optimize this objective unconstrained, the language model will learn to produce reward-hacked outputs — responses that score high on the reward model but are not actually good. Maybe the reward model learned that long, confident responses tend to be preferred. So the LM just generates endlessly verbose nonsense — scoring high on the RM while being useless.

The KL constraint

The fix: add a KL divergence penalty between the current policy π_θ and a frozen reference policy π_ref (the SFT model). The full RLHF objective is:

max_θ E_{x~D, y~π_θ} [ r(x, y) ] − β · KL(π_θ(·|x) ∥ π_ref(·|x))

The KL term measures how far the current policy has drifted from the SFT reference. β is the KL coefficient — a hyperparameter that controls the tradeoff. Large β: stay close to SFT, reward optimization is weak. Small β: allow large policy changes, risk reward hacking.

KL divergence as a leash. Think of the SFT model as a well-trained dog. RLHF teaches it new tricks via reward signals. The KL penalty is the leash — it prevents the dog from running so far after the reward that it forgets how to behave. Without the leash (β=0), the model exploits the reward model's blind spots and produces degenerate outputs.

Expanding the KL term per token and combining with the reward gives a per-token reward at each step t:

r̃(x, y_t) = r(x, y) · 𝟙[t=T] − β · log(π_θ(y_t|x,y_<t) / π_ref(y_t|x,y_<t))

The reward r(x, y) arrives only at the end of the sequence (the terminal reward). The KL penalty applies token-by-token throughout. This is a standard Markov Decision Process: state = (x, y_<t), action = y_t, reward = r̃.
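The per-token reward can be sketched directly from the formula. The log-probabilities below are made-up numbers standing in for what the policy and the reference model would assign to a sampled three-token response.

```python
def shaped_rewards(policy_logprobs, ref_logprobs, terminal_reward, beta):
    """Per-token reward: -beta * log-ratio at every step, plus r(x, y) at the last step."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += terminal_reward  # the RM score arrives only at t = T
    return rewards

policy_lp = [-0.5, -1.0, -0.2]   # log π_θ(y_t | x, y_<t), hypothetical values
ref_lp    = [-0.7, -1.0, -0.9]   # log π_ref(y_t | x, y_<t), hypothetical values
rewards = shaped_rewards(policy_lp, ref_lp, terminal_reward=2.0, beta=0.1)
print([round(r, 3) for r in rewards])
```

Summing the shaped rewards recovers r(x, y) minus β times the summed log-ratio along the sampled sequence, which is a single-sample estimate of the sequence-level KL term in the objective.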

PPO at a conceptual level

Proximal Policy Optimization (PPO) is the RL algorithm used in InstructGPT. The key idea: don't take policy update steps that are too large. The plain policy-gradient estimate E[R(z)∇ log p_θ(z)] has very high variance, and a single large step on a noisy estimate can wreck the policy. PPO clips the policy ratio π_θ(a)/π_old(a) to [1−ε, 1+ε], which removes the incentive to change the policy too radically in a single update.
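The clipping rule is easiest to see on a single action. The sketch below uses the standard PPO clipped surrogate, min(ratio · A, clip(ratio) · A), where the advantage A (how much better the action was than the value network expected) is an input assumed here rather than computed.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one action: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0), policy already moved far toward it: gain is capped at 1.2
print(round(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0), 3))
# Bad action (A < 0), policy already moved away from it: capped at -0.8
print(round(ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0), 3))
```

Once the ratio leaves [1−ε, 1+ε] in the direction the advantage favors, the objective is flat, so the gradient gives no further push in that direction. Moves in the unfavorable direction are not clipped, which is why the `min` is there.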

In practice, PPO for RLHF requires four models in memory simultaneously: the policy being trained, the reference policy (frozen), the reward model (frozen), and a value network (predicting expected future reward). This is operationally complex — each training step requires a rollout phase (generate y ~ π_θ), a scoring phase (compute r, KL), and an update phase (gradient step on policy and value network).

PPO is finicky. InstructGPT required careful hyperparameter tuning: the right β, the right learning rate, the right number of rollouts per batch. PPO is also sensitive to the quality of the reward model — a noisy RM amplifies errors through the RL loop. This instability motivated the search for alternatives that avoid on-policy RL entirely.

Interactive figure: RLHF reward vs KL tradeoff as β varies. Low β allows high reward but risks large policy drift (reward hacking). High β keeps the model near SFT but caps achievable reward. The sweet spot is where the reward-KL frontier curves.

In the RLHF objective max E[r(x,y)] − β·KL(π∥π_ref), what happens as β→0?

The model converges to the SFT reference policy, because without KL penalty there is no gradient signal.
The model optimizes reward without constraint, risking reward hacking: generating responses that exploit the reward model's blind spots rather than being genuinely helpful.
The update steps become very small because the KL gradient dominates when β is small.
The model stops generating long sequences because long sequences accumulate more KL penalty.

Chapter 6: DPO: Derivation

PPO works but it is operationally painful: four models in memory, rollout loops, careful hyperparameter tuning, high variance gradients. Is there a way to train on preference data without any on-policy RL?

Direct Preference Optimization (DPO) says yes — and it derives from the same RLHF objective we just wrote down. The insight is algebraic: the KL-constrained RL problem has a closed-form optimal policy, which lets us reparametrize the reward in terms of the policy and skip the reward model entirely.

Step 1 — Solve the KL-constrained objective analytically

For a fixed reward function r(x, y), the policy that maximizes E[r] − β·KL(π∥π_ref) is (up to a normalizing constant Z(x)):

π^*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)

This is a Gibbs distribution — the reference policy reweighted by exponentiated reward. You can verify this by writing the Lagrangian and setting the functional derivative to zero. Z(x) = ∑_y π_ref(y|x) exp(r(x,y)/β) is a normalizing constant that depends on x but not y.
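On a tiny response set the closed form can be checked numerically. The sketch below assumes a prompt with only three possible responses; the reference probabilities and rewards are made up for illustration.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference policy
rewards = [0.0, 1.0, 2.0]   # hypothetical rewards
beta = 1.0

pi_star, z = optimal_policy(pi_ref, rewards, beta)
greedy = [0.0, 0.0, 1.0]    # all mass on the highest-reward response

print([round(p, 3) for p in pi_star])
print(round(objective(pi_star, pi_ref, rewards, beta), 4))  # the optimum
print(round(objective(pi_ref, pi_ref, rewards, beta), 4))   # no change from reference
print(round(objective(greedy, pi_ref, rewards, beta), 4))   # pure reward chasing
```

The tilted policy beats both staying at the reference and jumping to the single best response. Its objective value works out to β · log Z(x), which is one way to see why Z(x) carries real information about the prompt even though it is intractable for real language models.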

Step 2 — Solve for the implied reward

Rearrange to express r(x, y) as a function of the policy:

r(x, y) = β · log(π^*(y|x) / π_ref(y|x)) + β · log Z(x)

This is the key equation: given any policy π, its "implied reward" under the RLHF framework is the log ratio π/π_ref scaled by β. The intractable Z(x) term disappears in the next step.

Step 3 — Plug into Bradley-Terry

The preference probability under Bradley-Terry is σ(r(x,y_w) − r(x,y_l)). Substituting our implied reward:

P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)) = σ(β log(π^*(y_w|x)/π_ref(y_w|x)) − β log(π^*(y_l|x)/π_ref(y_l|x)))

The β·log Z(x) terms cancel: the same term appears in both rewards, so it drops out of the difference. We are left with a preference probability that depends only on policy log-ratios, with no reward model and no intractable normalization.
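Written out term by term with the implied reward from Step 2, the subtraction looks like this:

```latex
\begin{aligned}
r(x, y_w) - r(x, y_l)
 &= \Big[\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} + \beta \log Z(x)\Big]
  - \Big[\beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} + \beta \log Z(x)\Big] \\
 &= \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\end{aligned}
```

The cancellation works only because both responses share the same prompt x, so Z(x) is the same number in both brackets.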

Step 4 — The DPO loss

Maximize the log-likelihood of the observed preferences:

ℒ_DPO(θ) = −E_{(x,y_w,y_l)} [ log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x))) ]

This is the DPO objective. Let's name the two terms: let Δ_w = log π_θ(y_w|x) − log π_ref(y_w|x) (how much more/less likely the trained model is to produce the chosen response vs the reference). Similarly Δ_l for the rejected. The loss is −log σ(β(Δ_w − Δ_l)).

Gradient intuition. The DPO gradient pushes in two directions simultaneously: it increases the log probability of the chosen response and decreases the log probability of the rejected response. Both updates are scaled by a "prediction error" weight, σ(−β(Δ_w − Δ_l)), which measures how wrong the current implied reward ranking is. If the model already ranks the pair correctly by a wide margin, the weight (and with it the gradient) shrinks toward zero.
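The scaling behaviour can be illustrated numerically. In the sketch below, the weight σ(−β(Δ_w − Δ_l)) comes from differentiating −log σ(u) with respect to u; the Δ values are made-up log-ratios, and the function name is invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss_and_weight(delta_w, delta_l, beta):
    """Return (-log σ(β(Δ_w − Δ_l)), σ(−β(Δ_w − Δ_l)))."""
    margin = beta * (delta_w - delta_l)
    loss = -math.log(sigmoid(margin))
    weight = sigmoid(-margin)  # scales the gradient on both log-probs
    return loss, weight

for delta_w, delta_l in [(-25.0, 25.0), (0.0, 0.0), (0.5, 0.0), (25.0, -25.0)]:
    loss, weight = dpo_loss_and_weight(delta_w, delta_l, beta=0.1)
    print(delta_w - delta_l, round(loss, 3), round(weight, 3))
```

A badly mis-ranked pair gets a weight near 1, an undecided pair gets 0.5, and a confidently correct pair gets a weight near 0. The model therefore spends its updates on the pairs it currently gets wrong.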

PyTorch DPO loss

python
import torch
import torch.nn.functional as F

def log_prob_sequence(model, input_ids, labels):
    # Sum of log-probs over response positions only (labels == -100 are masked)
    logits = model(input_ids).logits                     # (B, L, V)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)    # (B, L-1, V)
    shift_labels = labels[:, 1:]                         # (B, L-1)
    mask = shift_labels != -100
    # gather cannot index with -100, so point masked positions at token 0;
    # their values are zeroed by the mask below
    safe_labels = shift_labels.masked_fill(~mask, 0)
    token_log_probs = log_probs.gather(
        dim=-1, index=safe_labels.unsqueeze(-1)          # (B, L-1, 1)
    ).squeeze(-1)                                        # (B, L-1)
    return (token_log_probs * mask).sum(dim=-1)          # (B,)

def dpo_loss(policy, reference, x_w, y_w_labels, x_l, y_l_labels, beta=0.1):
    # Policy log-probs for chosen and rejected (gradients flow through these)
    lp_w_policy = log_prob_sequence(policy, x_w, y_w_labels)
    lp_l_policy = log_prob_sequence(policy, x_l, y_l_labels)
    # Reference log-probs: the reference is frozen, so skip gradient tracking
    with torch.no_grad():
        lp_w_ref = log_prob_sequence(reference, x_w, y_w_labels)
        lp_l_ref = log_prob_sequence(reference, x_l, y_l_labels)

    delta_w = lp_w_policy - lp_w_ref    # log π_θ(y_w|x) − log π_ref(y_w|x)
    delta_l = lp_l_policy - lp_l_ref    # log π_θ(y_l|x) − log π_ref(y_l|x)

    # DPO loss: −log σ(β (Δ_w − Δ_l))
    return -F.logsigmoid(beta * (delta_w - delta_l)).mean()

Interactive figure: DPO loss and gradient magnitude as a function of β and the chosen-minus-rejected log-ratio margin. The DPO loss is −log σ(βΔ), where Δ = (log π/π_ref)_chosen − (log π/π_ref)_rejected.

In DPO, why does the intractable partition function Z(x) not appear in the final loss?

DPO makes a nonparametric assumption that forces Z(x) = 1.
DPO normalizes the policy with a softmax, absorbing Z(x) into the softmax denominator.
When computing the reward difference r(y_w) − r(y_l), the same β·log Z(x) term appears in both rewards, so it cancels in the subtraction.
DPO estimates Z(x) using Monte Carlo sampling from the reference policy.

Chapter 7: Showcase: RLHF-PPO vs DPO

Both RLHF-PPO and DPO start from the same theoretical objective (maximize reward − β·KL). They are two algorithms for solving the same optimization problem. The question is: do they converge to the same solution in practice?

Interactive figure: RLHF-PPO pipeline vs DPO pipeline, comparing their data flows, model requirements, and training loops, and where each algorithm ends up on the reward-KL tradeoff curve as training progresses.

Empirical comparison

The controlled comparison in the DPO paper showed DPO performing comparably to PPO on summarization and dialogue tasks, and DPO achieved this with no reward model training, no rollouts, and a single-stage training loop. Most open-source RLHF models today (Zephyr, Tulu3, Llama-3-Instruct) use DPO or one of its variants.

That said, PPO is not obsolete. Recent results from labs with significant compute budgets have found that PPO with a strong reward model can outperform DPO on complex reasoning and coding tasks. The intuition: PPO can explore the response space during rollouts, discovering high-reward sequences that weren't in the training preference data. DPO can only leverage the preference pairs it was given.

	RLHF-PPO	DPO
Training paradigm	On-policy RL	Supervised (offline)
Models needed	Policy + Reference + RM + Value net (4 total)	Policy + Reference (2 total)
Data required	Online rollouts + pref data	Offline pref data only
Implementation complexity	High (rollout loop, PPO clip, value net)	Low (single forward pass per pair)
Exploration	Yes — discovers new sequences	No — confined to training pairs
Risk	Reward hacking, instability	Mode collapse to preferred style
Industry usage	GPT-4, Gemini (reportedly)	Llama 3, Zephyr, Tulu3, Mistral

DPO requires no reward model and no on-policy rollouts. What capability does this tradeoff sacrifice compared to PPO?

DPO cannot use the KL penalty, so it has no way to prevent reward hacking.
DPO cannot explore new response sequences: it is confined to the preference pairs in the training data and cannot discover novel high-reward outputs via generation.
DPO requires more GPU memory because it stores both policy and reference gradients simultaneously.
DPO cannot use pairwise preference data: it requires absolute reward scores for each response.

Chapter 8: Overoptimization

Alignment is not a solved problem once you've run RLHF. There is a systematic failure mode that emerges whenever you optimize a proxy objective hard enough: overoptimization, an instance of Goodhart's Law from the social sciences ("when a measure becomes a target, it ceases to be a good measure").

In RLHF, the proxy is the reward model. The reward model is an imperfect proxy for actual human preference — it learned from noisy human labels, it has limited capacity, and it cannot generalize perfectly. As the policy optimization drives the LM further from its SFT starting point in pursuit of higher RM scores, it eventually finds response patterns that exploit the RM's errors rather than genuine quality signals.

Classic example of reward hacking. A reward model trained on human preferences for helpful summaries gives high scores to: (1) long responses, (2) confident-sounding statements, (3) responses that use the word "certainly." After enough PPO steps, the model produces infinitely long, maximally confident, "certainly certainly certainly certainly..." drivel. The RM scores this highly. Human evaluators do not.

The overoptimization curve

Gao et al. (2022) measured this systematically. Because human labels are expensive, they used a large "gold" reward model as a stand-in for human preference, trained smaller proxy reward models on its labels, and then optimized policies against the proxy. Plotting the proxy RM score and the gold score against how far the policy has moved from its starting point (measured in KL) shows the curves diverging: the proxy score keeps increasing, while the gold score peaks at some intermediate optimization strength and then decreases.

This peak-then-decline shape shows up consistently across RLHF setups. The KL penalty (the β term) controls how quickly you reach the peak: with a smaller β the policy drifts faster, so the peak arrives sooner and is lower. But no value of β prevents overoptimization if you keep optimizing.
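The peak-then-decline shape can be made concrete with a toy curve. The form below, d · (a − b · log d) with d the square root of the KL distance from the initial policy, is modeled on the functional form Gao et al. fit for RL optimization; treat the exact form as an assumption of this sketch. The coefficients a and b are made up for illustration and are unrelated to the KL coefficient β.

```python
import math

def gold_score(d, a=1.0, b=0.5):
    """Toy gold-reward curve d * (a - b * log d), d = sqrt(KL from the initial policy)."""
    return d * (a - b * math.log(d))

def peak_distance(a=1.0, b=0.5):
    """Setting the derivative a - b*log(d) - b to zero gives d* = exp(a/b - 1)."""
    return math.exp(a / b - 1.0)

d_star = peak_distance()
for d in (0.5, 1.0, d_star, 5.0, 7.0):
    print(round(d, 3), round(gold_score(d), 3))
```

On this toy curve the gold score rises, peaks at d* = e^{a/b − 1}, and then falls, eventually dropping below where it started. In practice the peak is where early stopping (or a checkpoint-selection rule based on fresh human evaluation) would aim to land.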

Mode collapse

A second failure mode: mode collapse. RLHF removes the calibration that the base model had from pretraining. A pretrained model produces diverse outputs sampled from a broad distribution over plausible completions. After RLHF, the model increasingly concentrates probability mass on a narrow set of "preferred" response patterns. Ask it the same question 10 times — you get nearly identical answers. The policy is no longer a probabilistic model; it has collapsed to a near-deterministic map from prompt to preferred response style.
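One way to quantify this concentration is the entropy of the response distribution. The sketch below reuses the reward-tilted form π_ref(y) · exp(r(y)/β) on a made-up four-response example and shows entropy falling as β shrinks; it illustrates the direction of the effect, not the behaviour of any particular trained model.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

def tilt(pi_ref, rewards, beta):
    """Reference distribution reweighted by exp(reward / beta), then normalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

pi_ref = [0.25, 0.25, 0.25, 0.25]   # hypothetical: four equally likely responses
rewards = [0.0, 0.5, 1.0, 1.5]      # hypothetical rewards

betas = (10.0, 1.0, 0.1)
entropies = [entropy(tilt(pi_ref, rewards, beta)) for beta in betas]
for beta, h in zip(betas, entropies):
    print(beta, round(h, 3))
```

With a strong KL penalty the distribution stays close to uniform (entropy near log 4); with a weak one, almost all mass sits on the single highest-reward response.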

Interactive figure: RM score vs true human preference as optimization progresses. The two diverge as optimization proceeds; a larger KL penalty β delays (but does not prevent) overoptimization. A dashed vertical line marks the optimal stopping point.

A reward model is trained on human preferences for helpful responses. After 10,000 PPO steps, the model's RM score is at its all-time high, but human evaluators rate it worse than the SFT checkpoint. What has happened?

The SFT checkpoint was already better than RLHF can achieve: RLHF actively hurts this model.
The reward model's scoring is too slow, causing stale gradients during PPO optimization.
Reward hacking: the model has found response patterns that exploit blind spots in the reward model (e.g., length, confidence signals) rather than improving genuine helpfulness. The RM score is no longer a reliable proxy.
Catastrophic forgetting: RLHF overwrote the model's factual knowledge from pretraining.

Chapter 9: Connections

Alignment via SFT and RLHF sits at the intersection of several threads. Here is how this lecture connects to the broader landscape:

Concept from this lesson	Where it leads next
SFT cross-entropy loss	Same loss as pretraining — the only difference is the data and the prompt mask. CS336 Lec 2
Bradley-Terry reward modeling	Elo ratings, tournament ranking, pairwise comparison theory
KL divergence in RLHF	Information geometry, variational inference, VI-based LLM sampling
DPO closed-form derivation	SimPO (no reference), IPO, KTO — DPO variants in CS336 Lec 16
Reward hacking / overoptimization	Constitutional AI, process reward models, verifiable rewards in Reward & Alignment Gleam
Preference data collection	RLHF ethics, crowdworker conditions, AI feedback (RLAIF)

DPO variants worth knowing

DPO spawned a family of variants. Two that are worth knowing, both discussed in the Tulu 3 paper:

SimPO: removes the reference model entirely. Uses sequence-level log-likelihood normalized by length as the implicit reward. Simpler and sometimes stronger than DPO.
Length-normalized DPO: addresses the length-bias problem. Standard DPO sums per-token log-ratios over the response, so the magnitude of the implicit reward grows with length and longer responses can dominate the margin. Dividing by response length debiases the implicit reward.
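The effect of dividing by length is easy to see on made-up numbers. In the sketch below, two responses have the same per-token log-ratio (log π_θ − log π_ref) of 0.2 but different lengths; the function names are invented for this example.

```python
def summed_reward(token_log_ratios, beta):
    """DPO-style implicit reward: beta times the summed per-token log-ratio."""
    return beta * sum(token_log_ratios)

def length_normalized_reward(token_log_ratios, beta):
    """Same quantity divided by the number of response tokens."""
    return beta * sum(token_log_ratios) / len(token_log_ratios)

short = [0.2] * 5    # 5 tokens, each slightly favored by the policy
long = [0.2] * 50    # 50 tokens with the same per-token log-ratio

beta = 0.1
print(round(summed_reward(short, beta), 3), round(summed_reward(long, beta), 3))
print(round(length_normalized_reward(short, beta), 3),
      round(length_normalized_reward(long, beta), 3))
```

Under the summed form the long response gets ten times the implicit reward purely because it has ten times as many tokens; after normalization the two are scored identically.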

The alignment tax

A persistent empirical observation: RLHF-aligned models tend to score slightly worse on academic benchmarks (MMLU, etc.) than their base model counterparts. The hypothesis is that alignment compresses the output distribution — the model learns to respond in assistant style even when the benchmark expects different formatting. This is sometimes called the "alignment tax."

In practice, the alignment tax is small and worth paying — a model that refuses to answer harmful questions and follows instructions reliably is far more useful than one that scores 2 points higher on MMLU.

The full InstructGPT recipe. Pretrain on web text → SFT on 13k human-written (prompt, response) pairs → train reward model on 33k pairwise comparisons (initialized from SFT) → PPO against the RM with KL penalty. This pipeline became the template for ChatGPT, Claude, and virtually every major aligned model that followed.

Cheat sheet

Formula	What it is
`ℒ_SFT = −(1/T) ∑_t log p_θ(y_t | x_<t)`	SFT loss (response tokens only)
`P(A≻B) = σ(r_A − r_B)`	Bradley-Terry preference probability
`ℒ_RM = −log σ(r_w − r_l)`	Reward model training loss
`max E[r] − β·KL(π∥π_ref)`	RLHF objective
`ℒ_DPO = −log σ(β(Δ_w−Δ_l))`	DPO loss (Δ = log π/π_ref)

Which stage of the InstructGPT recipe directly uses pairwise preference data (A vs B comparisons)?

SFT: the human-written responses are ranked pairwise before being used as training targets.
Reward model training: the RM is trained on (x, y_w, y_l) triplets via the Bradley-Terry log-sigmoid loss.
PPO: each rollout generates a pair of responses and picks the winner via reward comparison.
Pretraining: pairwise ranking is used to select which web documents to train on.

"What I cannot create, I do not understand." — Richard Feynman. You can now create every piece of the alignment pipeline: the SFT data loader, the reward model, the RLHF objective, and the DPO loss. What you cannot yet create: a reward model that doesn't overfit, a preference dataset without demographic bias, and a KL penalty that perfectly calibrates the reward-safety tradeoff. Those remain open problems.