Your pretrained LM completes text — it does not follow instructions, refuse harmful requests, or even say "I don't know." Alignment is the engineering problem of turning a raw next-token predictor into a helpful, safe assistant. This lesson derives the full post-training pipeline from first principles: Supervised Fine-Tuning (SFT) with prompt-masking and cross-entropy, the Bradley-Terry preference model (derive P(A≻B)=σ(rA−rB)), the RLHF reward-minus-KL objective (why the KL term is non-negotiable), PPO at a conceptual level (InstructGPT recipe), and DPO — derive the closed-form loss that skips the reward model entirely. Five interactive canvases, worked numerical examples, and PyTorch code for every step.
You have trained a 7B-parameter language model on two trillion tokens of internet text. It can continue any piece of writing with stunning fluency. You type: "How do I make a bomb?"
The base model completes this sentence. It saw thousands of relevant documents during training, and its job (the only job it was ever trained for) is to predict what comes next given a prefix. From the model's perspective, this is no different from "How do I make a cake?": complete the sentence by assigning high probability to the most plausible continuation.
This is the alignment gap: the gap between what the base model is optimized for (next-token prediction on web text) and what a deployed assistant must do (follow instructions, produce helpful content, refuse harmful requests, maintain a helpful tone, say "I don't know" when it doesn't know).
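The 2022 InstructGPT paper (Ouyang et al.) codified the standard recipe. You can think of it as a two-stage post-training process that runs after pretraining: supervised fine-tuning (SFT) on human-written demonstrations, followed by reinforcement learning from human feedback (RLHF), which trains a reward model on human preferences and then optimizes the policy against it.
Both stages are cheap compared to pretraining. The entire InstructGPT SFT dataset was ~13k prompts with human-written responses. Reward model training used ~33k prompts, each with several model outputs ranked by labelers. Yet human evaluators preferred the outputs of the resulting 1.3B model to those of the 175B GPT-3, a base model more than 100× larger.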
Click each stage to see what a model at that training phase might output for the prompt "Explain quantum entanglement to a 10-year-old." The differences illustrate what each training stage adds.
Supervised Fine-Tuning (SFT) is simple in concept: collect a dataset of (prompt, ideal response) pairs, and do gradient descent to maximize the probability the model assigns to the ideal response given the prompt. The question is — what data should you use?
Three major public instruction-tuning datasets have shaped how the field thinks about this:
The LIMA paper (Zhou et al., 2023) is one of the most important results in post-training. They curated just 1,000 examples by hand (high-quality, diverse, carefully written) and found that a LLaMA model fine-tuned on this tiny dataset matched or exceeded models fine-tuned on 52,000 (Alpaca) or roughly 15,000 (Dolly) examples.
Their conclusion: SFT does not teach the model new knowledge. It teaches the model a response style — how to present the knowledge it already absorbed during pretraining. If the model already knows something, SFT helps it learn when and how to say it. But SFT cannot make the model know things it doesn't know from pretraining.
Several studies found that adding just ~500 safety-specific examples (prompts that should be refused + good refusals as responses) drastically improves safety behavior. This is striking: 500 examples out of the billions of tokens the model saw in pretraining are enough to establish a new behavioral pattern.
The danger is over-refusal. A model trained too heavily on safety data starts refusing benign requests ("what is the history of nuclear weapons?" gets refused because it mentions "nuclear"). The art of safety tuning is keeping a narrow, well-targeted refusal distribution without collateral damage to helpfulness.
The mechanics of SFT are straightforward once you understand one design choice: we only compute loss on the response tokens, not the prompt tokens.
A training example in instruction-tuning is formatted as a chat template — a structured string that packages the system prompt, user message, and assistant response:
```text
<|system|>
You are a helpful assistant.
<|user|>
Explain quantum entanglement to a 10-year-old.
<|assistant|>
Imagine you have two magic coins that are best friends...
```
After tokenization, this becomes a sequence of tokens. We split it into two parts: the prompt (everything up to and including <|assistant|>) and the response (everything after). The loss is computed only on the response tokens.
Let the response tokens be y1, y2, …, yT and the full prefix (prompt + response so far) at step t be x<t. The model produces a probability distribution over the vocabulary at each step. The SFT loss is the standard cross-entropy, but summed only over response token positions:
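Written out in the notation just introduced, and consistent with the summary table at the end of this lesson, the loss can be stated as follows (the explicit 1/T averaging is taken from that table):

```latex
\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x_{<t}\right)
```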
Each term log pθ(yt | x<t) is the log probability the model assigns to the correct next token. If the model is confident and correct, this is close to 0. If it's confident and wrong, this is very negative. Averaging over T response tokens and negating gives a positive loss we minimize.
```python
import torch
import torch.nn.functional as F

def build_labels(input_ids, prompt_len):
    # Copy input_ids and mask the prompt positions out of the loss.
    # -100 is the default ignore_index of F.cross_entropy.
    # Assumes every row in the batch shares the same prompt length.
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100
    return labels

def sft_loss(model, input_ids, labels):
    # input_ids: (B, seq_len), full prompt + response tokens
    # labels:    (B, seq_len), same, but prompt positions set to -100
    logits = model(input_ids).logits  # (B, seq_len, vocab_size)

    # Shift: predict token t+1 from position t
    shift_logits = logits[:, :-1, :].contiguous()  # (B, seq_len-1, V)
    shift_labels = labels[:, 1:].contiguous()      # (B, seq_len-1)

    # cross_entropy skips positions where label == -100 and
    # averages over the remaining (response) tokens in the batch
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```
Worked example: suppose the response is the 4-token sequence "Paris is the capital" and the model assigns probabilities [0.7, 0.6, 0.9, 0.8] to the correct tokens at each step. Then:
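The arithmetic for this worked example can be checked with a few lines of plain Python. This is only a sketch of the calculation: the variable names are illustrative, and the probabilities are the four values quoted above.

```python
import math

# Probabilities the model assigns to the correct token at each response step
# (the four values from the worked example).
token_probs = [0.7, 0.6, 0.9, 0.8]

# Per-token log-probabilities: each is <= 0 and closer to 0 when the model
# is confident and correct.
token_log_probs = [math.log(p) for p in token_probs]

# SFT loss: negate and average over the T response tokens.
sft_loss_value = -sum(token_log_probs) / len(token_log_probs)

# Equivalent view: -log of the whole-sequence probability, divided by T.
sequence_prob = math.prod(token_probs)
same_loss = -math.log(sequence_prob) / len(token_probs)

print([round(lp, 3) for lp in token_log_probs])
print(round(sft_loss_value, 3))
```

The per-token log-probabilities come out to roughly −0.357, −0.511, −0.105 and −0.223, so the loss is about 0.299. The second token (probability 0.6) contributes the most to it.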
SFT works when you have good (prompt, response) pairs. But writing ideal responses is expensive, and there is a subtler problem: people are much better at judging quality than producing it. It is much easier for a human to look at two responses and say "this one is better" than to write the perfect response from scratch.
This is the generation-verification (G-V) gap: the cost of verifying quality is far lower than the cost of generating quality. RLHF exploits this gap. Instead of expensive gold responses, you collect cheap pairwise preferences.
The standard setup: given a prompt x, sample two responses yw (winner/chosen) and yl (loser/rejected) from the current model. Present both to a human annotator and ask: "Which response do you prefer?" The annotator picks one. You record the tuple (x, yw, yl).
InstructGPT hired 40 workers via Scale AI and Upwork, with careful vetting for agreement with researcher judgments. They were given detailed rating guidelines covering helpfulness, harmlessness, and honesty. Despite this, inter-annotator agreement was only moderate (~72%) — humans do not agree perfectly on what "better" means.
An increasingly popular cost-cutting move: use GPT-4 (or a strong model) as the preference annotator. Constitutional AI (Anthropic) and UltraFeedback (used in Zephyr, Tulu3) showed that AI-generated preferences correlate surprisingly well with human preferences at the system level — near human inter-annotator agreement rates. This collapses the cost from thousands of dollars to tens of dollars for a new dataset.
You have a dataset of pairwise preferences (x, yw, yl). Now what? You need to turn these discrete comparisons into a scalar reward signal that the RL training loop can optimize. That is the job of the reward model.
The Bradley-Terry model (1952) is a probabilistic framework for pairwise comparisons. It says: each item i has a latent "strength" ri. The probability that item A beats item B is:
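In the standard exponential parameterization of Bradley-Terry, which matches the sigmoid form used throughout this lesson, the probability can be written as:

```latex
P(A \succ B) = \frac{e^{r_A}}{e^{r_A} + e^{r_B}}
             = \frac{1}{1 + e^{-(r_A - r_B)}}
             = \sigma(r_A - r_B)
```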
That last step is the key: the sigmoid function σ(z) = 1/(1 + e^(−z)) naturally converts the reward difference into a probability. When rA ≫ rB, P(A≻B) → 1. When rA = rB, P = 0.5 (a coin flip). When rA ≪ rB, P → 0.
Suppose our reward model scores two responses as rw = 2.1 and rl = 0.4. Then:
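A quick numerical check of this example, as a minimal sketch in plain Python (the helper `sigmoid` and the variable names are illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a reward difference to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

r_w, r_l = 2.1, 0.4  # reward model scores for the chosen / rejected response

# Bradley-Terry: probability that the chosen response is preferred.
p_chosen_preferred = sigmoid(r_w - r_l)

# Probability of the opposite outcome; the two must sum to 1.
p_rejected_preferred = sigmoid(r_l - r_w)

print(round(p_chosen_preferred, 4))
```

With a reward gap of 1.7, the preference probability comes out to about 0.8455.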
The model says there is roughly an 84.6% chance (σ(1.7) ≈ 0.8455) that yw is preferred. The reward model's training loss is the negative log-likelihood under Bradley-Terry:
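In symbols, with D denoting the dataset of preference tuples (a symbol assumed here for illustration), this negative log-likelihood reads:

```latex
\mathcal{L}_{\text{RM}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
```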
This loss drives rθ(x, yw) > rθ(x, yl) for every preference pair. The bigger the margin, the lower the loss. The reward model is initialized from the SFT checkpoint (same architecture) with a linear head on top of the final hidden state that outputs a single scalar reward.
```python
import torch
import torch.nn.functional as F

def reward_model_loss(rm, x_chosen, x_rejected):
    # rm: reward model (LM + scalar head)
    # x_chosen, x_rejected: tokenized (prompt + response) for each pair
    r_w = rm(x_chosen).reward    # scalar reward for chosen response
    r_l = rm(x_rejected).reward  # scalar reward for rejected response

    # Bradley-Terry: maximize log σ(r_w − r_l)
    loss = -F.logsigmoid(r_w - r_l).mean()
    return loss

# At inference: higher reward = model predicts human would prefer this response
# reward(x, "Paris is the capital of France") → 2.1
# reward(x, "France has a capital called maybe Paris?") → 0.4
```
Adjust the reward difference (rw − rl) to see how the Bradley-Terry model converts it to a preference probability. Notice how the curve saturates quickly — a gap of 3 already implies >95% confidence.
You now have a reward model r(x, y) that can score any (prompt, response) pair. The RLHF problem is: find the policy πθ (your language model) that maximizes expected reward. Naively:
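One way to write this naive objective, assuming prompts x are drawn from a prompt dataset D (a symbol introduced here for illustration) and responses y are sampled from the policy:

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r(x, y) \big]
```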
This looks clean, but there is a catastrophic failure mode hiding in it. If you optimize this objective unconstrained, the language model will learn to produce reward-hacked outputs — responses that score high on the reward model but are not actually good. Maybe the reward model learned that long, confident responses tend to be preferred. So the LM just generates endlessly verbose nonsense — scoring high on the RM while being useless.
The fix: add a KL divergence penalty between the current policy πθ and a frozen reference policy πref (the SFT model). The full RLHF objective is:
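In the same notation as the summary table at the end of this lesson, and again assuming prompts are drawn from a dataset D, the objective can be written as:

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}} \Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
\Big]
```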
The KL term measures how far the current policy has drifted from the SFT reference. β is the KL coefficient — a hyperparameter that controls the tradeoff. Large β: stay close to SFT, reward optimization is weak. Small β: allow large policy changes, risk reward hacking.
Expanding the KL term per token and combining with the reward gives a per-token reward at each step t:
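A common way to write this per-token reward, with T the index of the final response token and the indicator placing the reward model score on that last token only:

```latex
\tilde{r}_t
  = \mathbb{1}[t = T] \; r(x, y)
  \;-\; \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}
```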
The reward r(x, y) arrives only at the end of the sequence (the terminal reward). The KL penalty applies token-by-token throughout. This is a standard Markov Decision Process: state = (x, y<t), action = yt, reward = r̃.
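The reward structure described above can be sketched in a few lines of plain Python. The function name `shaped_rewards` and the toy log-probabilities are illustrative assumptions; a real implementation would operate on batched tensors.

```python
def shaped_rewards(terminal_reward, policy_logprobs, ref_logprobs, beta):
    """Per-token rewards: a KL penalty at every step, plus the reward
    model score added on the final token only."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    # -beta * log(pi_theta / pi_ref) for each sampled token
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # The terminal reward r(x, y) arrives only at the end of the sequence.
    rewards[-1] += terminal_reward
    return rewards

# Toy 3-token response: log-probs of the sampled tokens under each model.
example = shaped_rewards(
    terminal_reward=2.1,
    policy_logprobs=[-0.2, -0.5, -0.1],
    ref_logprobs=[-0.4, -0.5, -0.3],
    beta=0.1,
)
print(example)
```

Summing the per-token rewards recovers r(x, y) minus β times the sequence-level log-ratio, which is a single-sample estimate of the KL-penalized objective.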
Proximal Policy Optimization (PPO) is the RL algorithm used in InstructGPT. The key idea: don't take policy update steps that are too large. Directly optimizing E[R(z)∇ log pθ(z)] has catastrophically high variance. PPO clips the policy ratio πθ(a)/πold(a) to stay in [1−ε, 1+ε], preventing updates that change the policy too radically in a single step.
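The clipping idea can be illustrated for a single action with the standard PPO clipped surrogate. This is a minimal sketch, not the InstructGPT implementation: the `advantage` input is an assumption here (in practice it is typically estimated with the help of the value network), and ε = 0.2 is just a commonly used default.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)                   # pi_theta(a) / pi_old(a)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))   # keep in [1-eps, 1+eps]
    # Taking the min makes the objective pessimistic: the policy gets no extra
    # credit for pushing the ratio beyond the clip range in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped_gain = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)

# Ratio 0.5 with a negative advantage: the objective is clipped at 0.8 * advantage.
clipped_decrease = ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0)

# Ratio 1.5 with a negative advantage: no clipping, the full penalty applies.
full_penalty = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)

print(capped_gain, clipped_decrease, full_penalty)
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective becomes flat, so the gradient stops pushing the policy further in that step.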
In practice, PPO for RLHF requires four models in memory simultaneously: the policy being trained, the reference policy (frozen), the reward model (frozen), and a value network (predicting expected future reward). This is operationally complex — each training step requires a rollout phase (generate y ~ πθ), a scoring phase (compute r, KL), and an update phase (gradient step on policy and value network).
Drag β to see how it shapes the optimization landscape. Low β allows high reward but risks large policy drift (reward hacking). High β keeps the model near SFT but caps achievable reward. The sweet spot is where the reward-KL frontier curves.
PPO works but it is operationally painful: four models in memory, rollout loops, careful hyperparameter tuning, high variance gradients. Is there a way to train on preference data without any on-policy RL?
Direct Preference Optimization (DPO) says yes — and it derives from the same RLHF objective we just wrote down. The insight is algebraic: the KL-constrained RL problem has a closed-form optimal policy, which lets us reparametrize the reward in terms of the policy and skip the reward model entirely.
For a fixed reward function r(x, y), the policy that maximizes E[r] − β·KL(π∥πref) is (up to a normalizing constant Z(x)):
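Writing π* for this optimal policy (the star is notation assumed here), the closed form is:

```latex
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\left( \frac{r(x, y)}{\beta} \right)
```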
This is a Gibbs distribution — the reference policy reweighted by exponentiated reward. You can verify this by writing the Lagrangian and setting the functional derivative to zero. Z(x) = ∑y πref(y|x) exp(r(x,y)/β) is a normalizing constant that depends on x but not y.
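For a tiny discrete response space, the closed form can be checked numerically. The sketch below uses a made-up three-response example (the reference probabilities, rewards, and β are arbitrary illustrative values) and compares the KL-regularized objective at the Gibbs policy against two alternatives.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Reference policy reweighted by exp(r / beta), then normalized by Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(policy, ref_probs, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a discrete policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

ref_probs = [0.5, 0.3, 0.2]   # pi_ref over three candidate responses
rewards = [0.0, 1.0, 2.0]     # r(x, y) for each response
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)
best = objective(pi_star, ref_probs, rewards, beta)
at_reference = objective(ref_probs, ref_probs, rewards, beta)   # no drift, no KL cost
greedy = objective([0.0, 0.0, 1.0], ref_probs, rewards, beta)   # all mass on top reward

print([round(p, 3) for p in pi_star])
print(round(best, 3), round(at_reference, 3), round(greedy, 3))
```

The Gibbs policy scores higher than both staying at the reference and greedily picking the top-reward response, and its objective value equals β·log Z(x).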
Rearrange to express r(x, y) as a function of the policy:
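Taking logarithms of the Gibbs form of the optimal policy π* (notation assumed here for the optimum) and solving for the reward gives:

```latex
\log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{r(x, y)}{\beta} - \log Z(x)
\quad\Longrightarrow\quad
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
```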
This is the key equation: given any policy π, its "implied reward" under the RLHF framework is the log ratio π/πref scaled by β. The intractable Z(x) term disappears in the next step.
The preference probability under Bradley-Terry is σ(r(x,yw) − r(x,yl)). Substituting our implied reward:
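Spelled out step by step, with π* denoting the optimal policy (notation assumed here) and the implied reward r = β log(π*/πref) + β log Z(x) plugged into both terms:

```latex
\begin{aligned}
P(y_w \succ y_l \mid x)
  &= \sigma\big( r(x, y_w) - r(x, y_l) \big) \\
  &= \sigma\!\left(
       \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x)
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)
     \right) \\
  &= \sigma\!\left(
       \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
     \right)
\end{aligned}
```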
The Z(x) terms cancel! The same β log Z(x) appears in the implied reward of both responses, so it drops out when one reward is subtracted from the other. We are left with a preference probability that depends only on policy log-ratios: no reward model, no intractable normalization.
Maximize the log-likelihood of the observed preferences:
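Replacing the optimal policy with the trainable policy πθ and negating the log-likelihood gives a loss to minimize. Here D denotes the preference dataset, a symbol assumed for illustration:

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right) \right]
```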
This is the DPO objective. Let's name the two terms: let Δw = log πθ(yw|x) − log πref(yw|x) (how much more/less likely the trained model is to produce the chosen response vs the reference). Similarly Δl for the rejected. The loss is −log σ(β(Δw − Δl)).
```python
import torch
import torch.nn.functional as F

def log_prob_sequence(model, input_ids, labels):
    # Sum log-probs over response token positions only
    logits = model(input_ids).logits                   # (B, L, V)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # (B, L-1, V)

    # Prompt positions carry label == -100, which is not a valid gather index,
    # so build the mask first and replace those labels with a dummy index 0.
    shift_labels = labels[:, 1:]                       # (B, L-1)
    mask = shift_labels != -100
    safe_labels = shift_labels.masked_fill(~mask, 0)

    token_log_probs = log_probs.gather(
        dim=-1, index=safe_labels.unsqueeze(-1)        # (B, L-1, 1)
    ).squeeze(-1)                                      # (B, L-1)
    return (token_log_probs * mask).sum(dim=-1)        # (B,), sum over response

def dpo_loss(policy, reference, x_w, y_w_labels, x_l, y_l_labels, beta=0.1):
    # Policy log-probs for the chosen and rejected responses
    lp_w_policy = log_prob_sequence(policy, x_w, y_w_labels)
    lp_l_policy = log_prob_sequence(policy, x_l, y_l_labels)

    # The reference model is frozen, so it needs no gradients
    with torch.no_grad():
        lp_w_ref = log_prob_sequence(reference, x_w, y_w_labels)
        lp_l_ref = log_prob_sequence(reference, x_l, y_l_labels)

    delta_w = lp_w_policy - lp_w_ref  # log π_θ(yw) - log π_ref(yw)
    delta_l = lp_l_policy - lp_l_ref  # log π_θ(yl) - log π_ref(yl)

    # DPO loss: -log σ(β * (delta_w - delta_l))
    return -F.logsigmoid(beta * (delta_w - delta_l)).mean()
```
The DPO loss is −log σ(βΔ) where Δ = (log π/πref)chosen − (log π/πref)rejected. Adjust β and the current margin to see the loss and its gradient magnitude.
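The same relationship can be explored numerically. The sketch below evaluates the loss −log σ(βΔ) and the magnitude of its derivative with respect to the margin Δ, which works out to β·σ(−βΔ). The function names and the β and margin values are illustrative.

```python
import math

def softplus(a: float) -> float:
    """Numerically stable log(1 + e^a)."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpo_loss_and_grad(margin: float, beta: float):
    """Return (-log sigma(beta * margin), |d loss / d margin|)."""
    z = beta * margin
    loss = softplus(-z)                   # -log sigma(z) = log(1 + e^{-z})
    grad_magnitude = beta * sigmoid(-z)   # shrinks as the margin grows
    return loss, grad_magnitude

beta = 0.1
for margin in (-10.0, 0.0, 10.0, 50.0):
    loss, grad = dpo_loss_and_grad(margin, beta)
    print(f"margin={margin:6.1f}  loss={loss:.4f}  |grad|={grad:.4f}")
```

At zero margin the loss is log 2 ≈ 0.693 and the gradient magnitude is β/2. Pairs the policy already ranks correctly by a wide margin contribute little gradient, while misranked pairs (negative margin) are weighted most heavily.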
Both RLHF-PPO and DPO start from the same theoretical objective (maximize reward − β·KL). They are two algorithms for solving the same optimization problem. The question is: do they converge to the same solution in practice?
Toggle between the two pipelines to see their data flows, model requirements, and training loops. Use the slider to adjust where on the reward–KL tradeoff curve each algorithm ends up.
The controlled comparison in the DPO paper showed comparable performance between DPO and PPO on summarization and dialogue tasks, and DPO achieved this with no reward model training, no rollouts, and a single-stage training loop. Most open-source RLHF models today (Zephyr, Tulu3, Llama-3-Instruct) use DPO or one of its variants.
That said, PPO is not obsolete. Recent results from labs with significant compute budgets have found that PPO with a strong reward model can outperform DPO on complex reasoning and coding tasks. The intuition: PPO can explore the response space during rollouts, discovering high-reward sequences that weren't in the training preference data. DPO can only leverage the preference pairs it was given.
| | RLHF-PPO | DPO |
|---|---|---|
| Training paradigm | On-policy RL | Supervised (offline) |
| Models needed | Policy + Reference + RM + Value net (4 total) | Policy + Reference (2 total) |
| Data required | Online rollouts + pref data | Offline pref data only |
| Implementation complexity | High (rollout loop, PPO clip, value net) | Low (single forward pass per pair) |
| Exploration | Yes — discovers new sequences | No — confined to training pairs |
| Risk | Reward hacking, instability | Mode collapse to preferred style |
| Industry usage | GPT-4, Gemini (reportedly) | Llama 3, Zephyr, Tulu3, Mistral |
Alignment is not a solved problem once you've run RLHF. There is a systematic failure mode that emerges whenever you optimize a proxy objective hard enough: overoptimization, also called Goodhart's Law in the social sciences ("when a measure becomes a target, it ceases to be a good measure").
In RLHF, the proxy is the reward model. The reward model is an imperfect proxy for actual human preference — it learned from noisy human labels, it has limited capacity, and it cannot generalize perfectly. As the policy optimization drives the LM further from its SFT starting point in pursuit of higher RM scores, it eventually finds response patterns that exploit the RM's errors rather than genuine quality signals.
Gao et al. (2022) measured this systematically: take a reward model trained on human preferences, then optimize a policy against it for increasing numbers of steps. Plot RM score vs human preference score over time. The curves diverge: RM score increases monotonically, human preference score peaks around some intermediate optimization strength and then decreases.
This peak-then-decline shape shows up consistently across RLHF systems. The KL penalty (the β term) controls how quickly you reach the peak: with a smaller β the policy drifts faster, so the peak arrives sooner and is lower. But no value of β prevents overoptimization indefinitely.
A second failure mode: mode collapse. RLHF removes the calibration that the base model had from pretraining. A pretrained model produces diverse outputs sampled from a broad distribution over plausible completions. After RLHF, the model increasingly concentrates probability mass on a narrow set of "preferred" response patterns. Ask it the same question 10 times — you get nearly identical answers. The policy is no longer a probabilistic model; it has collapsed to a near-deterministic map from prompt to preferred response style.
Watch how RM score and human preference diverge as optimization proceeds. Adjust β to see how the KL penalty delays (but does not prevent) overoptimization. The dashed vertical line marks the optimal stopping point.
Alignment via SFT and RLHF sits at the intersection of several threads. Here is how this lecture connects to the broader landscape:
| Concept from this lesson | Where it leads next |
|---|---|
| SFT cross-entropy loss | Same loss as pretraining — the only difference is the data and the prompt mask. CS336 Lec 2 |
| Bradley-Terry reward modeling | Elo ratings, tournament ranking, pairwise comparison theory |
| KL divergence in RLHF | Information geometry, variational inference, VI-based LLM sampling |
| DPO closed-form derivation | SimPO (no reference), IPO, KTO — DPO variants in CS336 Lec 16 |
| Reward hacking / overoptimization | Constitutional AI, process reward models, verifiable rewards in Reward & Alignment Gleam |
| Preference data collection | RLHF ethics, crowdworker conditions, AI feedback (RLAIF) |
DPO spawned a family of variants, two of which are worth knowing from the Tulu 3 paper:
A persistent empirical observation: RLHF-aligned models tend to score slightly worse on academic benchmarks (MMLU, etc.) than their base model counterparts. The hypothesis is that alignment compresses the output distribution — the model learns to respond in assistant style even when the benchmark expects different formatting. This is sometimes called the "alignment tax."
In practice, the alignment tax is small and worth paying — a model that refuses to answer harmful questions and follows instructions reliably is far more useful than one that scores 2 points higher on MMLU.
| Formula | What it is |
|---|---|
| ℒSFT = −(1/T) ∑ log pθ(yt \| x<t) | SFT loss (response tokens only) |
| P(A≻B) = σ(rA − rB) | Bradley-Terry preference probability |
| ℒRM = −log σ(rw − rl) | Reward model training loss |
| max E[r] − β·KL(π∥πref) | RLHF objective |
| ℒDPO = −log σ(β(Δw − Δl)) | DPO loss (Δ = log π/πref) |