Language Modeling from Scratch · CS336 · Lecture 15

Alignment I: SFT & RLHF

Your pretrained LM completes text — it does not follow instructions, refuse harmful requests, or even say "I don't know." Alignment is the engineering problem of turning a raw next-token predictor into a helpful, safe assistant. This lesson derives the full post-training pipeline from first principles: Supervised Fine-Tuning (SFT) with prompt-masking and cross-entropy, the Bradley-Terry preference model (derive P(A≻B)=σ(rA−rB)), the RLHF reward-minus-KL objective (why the KL term is non-negotiable), PPO at a conceptual level (InstructGPT recipe), and DPO — derive the closed-form loss that skips the reward model entirely. Five interactive canvases, worked numerical examples, and PyTorch code for every step.

Prerequisites: CS336 Lec 12 (cross-entropy loss, log-likelihood). Basic probability (sigmoid function, KL divergence concept).
10 Chapters · 5 Live Canvases · Derived From First Principles

Chapter 0: The Alignment Gap

You have trained a 7B-parameter language model on two trillion tokens of internet text. It can continue any piece of writing with stunning fluency. You type: "How do I make a bomb?"

The base model completes this sentence. It found thousands of relevant documents during training, and its job — the only job it was ever trained for — is to predict what comes next given a prefix. From the model's perspective, this is the same as "How do I make a cake?" Complete the sentence. Maximize the probability of the most plausible continuation.

This is the alignment gap: the gap between what the base model is optimized for (next-token prediction on web text) and what a deployed assistant must do (follow instructions, produce helpful content, refuse harmful requests, maintain a helpful tone, say "I don't know" when it doesn't know).

The base model is not dumb — it is misaligned. It has absorbed vast knowledge about how to complete all kinds of text. It just has no notion of "helpfulness," "instruction-following," or "refusal." Alignment injects those behaviors without destroying the underlying knowledge.

The 2022 InstructGPT paper (Ouyang et al.) codified the standard recipe. You can think of it as a two-stage post-training process that runs after pretraining:

Stage 1 — SFT
Show the model examples of (prompt → ideal response). Train it to imitate those responses via supervised cross-entropy. This teaches style, instruction-following behavior, and basic safety.
Stage 2 — RLHF
Collect human preferences between pairs of responses (A vs B). Train a reward model on those preferences. Then optimize the LM to generate high-reward responses — while staying close to the SFT model.

Both stages are cheap compared to pretraining. The InstructGPT SFT dataset contained roughly 13k training prompts with human-written responses, and the reward model was trained on human rankings collected for roughly 33k prompts. Yet human evaluators rated the resulting model as dramatically more helpful, even when a 1.3B-parameter InstructGPT model was compared against the 175B-parameter GPT-3 base model, which is more than 100× larger.

[Interactive canvas: Base model vs SFT vs RLHF. Same prompt, three different completions: what a model at each training stage might output for the prompt "Explain quantum entanglement to a 10-year-old." The differences illustrate what each training stage adds.]

A pretrained base LM is asked "What is the capital of France?" and completes: "What is the capital of France? The capital of Germany is Berlin. The capital of Italy is Rome…" This is an example of:

Chapter 1: SFT: Training Data

Supervised Fine-Tuning (SFT) is simple in concept: collect a dataset of (prompt, ideal response) pairs, and do gradient descent to maximize the probability the model assigns to the ideal response given the prompt. The question is — what data should you use?

Three major public instruction-tuning datasets have shaped how the field thinks about this:

What varies across these datasets? Three things matter most: (1) response length and style — FLAN is terse, OASST is verbose; (2) knowledge scope — FLAN is narrow, OASST is broad; (3) safety coverage — the amount of harmful/sensitive prompt-response pairs that teach the model to refuse. The style the model learns depends heavily on which dataset dominates.

The "less is more" result — LIMA

The LIMA paper (Zhou et al., 2023) is one of the most important results in post-training. They curated just 1,000 examples by hand — high-quality, diverse, carefully written — and found that a LLaMA model fine-tuned on this tiny dataset matched or exceeded models fine-tuned on 52,000 (Alpaca) or 9,846 (Dolly) examples.

Their conclusion: SFT does not teach the model new knowledge. It teaches the model a response style — how to present the knowledge it already absorbed during pretraining. If the model already knows something, SFT helps it learn when and how to say it. But SFT cannot make the model know things it doesn't know from pretraining.

The elicitation hypothesis. Instruction fine-tuning is not "teaching" in the same sense as pretraining. You are not adding new facts to the model. You are eliciting behaviors that are already latent in the weights — turning a web-text completer into an assistant. This has a corollary: don't fine-tune on facts the model doesn't already know. It will confabulate rather than learn.

Safety data: a little goes a long way

Several studies found that adding just ~500 safety-specific examples (prompts that should be refused + good refusals as responses) drastically improves safety behavior. This is striking: 500 examples out of the billions of tokens the model saw in pretraining are enough to establish a new behavioral pattern.

The danger is over-refusal. A model trained too heavily on safety data starts refusing benign requests ("what is the history of nuclear weapons?" gets refused because it mentions "nuclear"). The art of safety tuning is keeping a narrow, well-targeted refusal distribution without collateral damage to helpfulness.

LIMA achieves strong instruction-following with only 1,000 examples. What does this most directly support about the mechanism of SFT?

Chapter 2: SFT: The Loss

The mechanics of SFT are straightforward once you understand one design choice: we only compute loss on the response tokens, not the prompt tokens.

A training example in instruction-tuning is formatted as a chat template — a structured string that packages the system prompt, user message, and assistant response:

text
<|system|>
You are a helpful assistant.
<|user|>
Explain quantum entanglement to a 10-year-old.
<|assistant|>
Imagine you have two magic coins that are best friends...

After tokenization, this becomes a sequence of tokens. We split it into two parts: the prompt (everything up to and including <|assistant|>) and the response (everything after). The loss is computed only on the response tokens.
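The prompt/response split can be sketched in a few lines of plain Python. The token ids below are made up for illustration (no real tokenizer is involved), and `build_labels` is a hypothetical helper name; the only convention taken from the text is that prompt positions are excluded from the loss, here by marking them with -100.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross_entropy skips by default

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Hypothetical token ids: the prompt ends with the <|assistant|> marker (id 3 here)
prompt_ids = [1, 57, 902, 3]
response_ids = [411, 88, 17, 2]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)  # [1, 57, 902, 3, 411, 88, 17, 2]
print(labels)     # [-100, -100, -100, -100, 411, 88, 17, 2]
```

The model still reads the full sequence as input; only the labels are masked, so the prompt conditions the prediction without contributing loss terms.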

Deriving the SFT cross-entropy loss

Let the response tokens be y1, y2, …, yT and the full prefix (prompt + response so far) at step t be x<t. The model produces a probability distribution over the vocabulary at each step. The SFT loss is the standard cross-entropy, but summed only over response token positions:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid x_{<t})$$

Each term log pθ(yt | x<t) is the log probability the model assigns to the correct next token. If the model is confident and correct, this is close to 0. If it's confident and wrong, this is very negative. Averaging over T response tokens and negating gives a positive loss we minimize.

Why mask the prompt? If we also computed loss on the prompt, we would be penalizing the model for not predicting user messages — which we never actually want to generate during inference. Worse, user messages are often questions, and minimizing loss on questions would pull the model toward question-completion behavior rather than answer-generation behavior.

PyTorch implementation

python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    # input_ids: (B, seq_len), full prompt + response tokens
    # labels:    (B, seq_len), copy of input_ids with prompt positions set to -100
    # -100 is the default ignore_index of F.cross_entropy
    # Assumes a Hugging Face-style model whose output exposes .logits

    logits = model(input_ids).logits           # (B, seq_len, vocab_size)

    # Shift: the logits at position t predict the token at position t+1
    shift_logits = logits[:, :-1, :].contiguous()   # (B, seq_len-1, V)
    shift_labels = labels[:, 1:].contiguous()        # (B, seq_len-1)

    # Mean over non-ignored (response) positions only
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100
    )
    return loss

# Build labels: copy input_ids, mask prompt positions.
# This assumes every row in the batch has the same prompt_len; with
# variable-length prompts, mask each row separately (and mask padding too).
labels = input_ids.clone()
labels[:, :prompt_len] = -100  # mask prompt tokens from loss
loss = sft_loss(model, input_ids, labels)

Worked example: suppose the response is the 4-token sequence "Paris is the capital" and the model assigns probabilities [0.7, 0.6, 0.9, 0.8] to the correct tokens at each step. Then:

$$\mathcal{L}_{\text{SFT}} = -\tfrac{1}{4}(\log 0.7 + \log 0.6 + \log 0.9 + \log 0.8) = -\tfrac{1}{4}(-0.357 - 0.511 - 0.105 - 0.223) = 0.299$$
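The arithmetic in this worked example is easy to check with the standard library. This is a minimal sketch; `mean_nll` is a hypothetical helper name, and the second call uses made-up probabilities to show how the loss moves when the model is more confident in the correct tokens.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood over the response tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.7, 0.6, 0.9, 0.8]   # probabilities assigned to the correct tokens
loss = mean_nll(probs)
print(round(loss, 3))          # 0.299

# A model that is more confident in the correct tokens gets a lower loss
confident_loss = mean_nll([0.99, 0.99, 0.99, 0.99])
print(confident_loss < loss)   # True
```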
In the SFT loss, why do we set prompt token labels to −100 (PyTorch's ignore_index) rather than computing loss on the full sequence?

Chapter 3: Preference Data

SFT works when you have good (prompt, response) pairs. But writing ideal responses is expensive, and there is a subtler problem: people are much better at judging quality than producing it. It is much easier for a human to look at two responses and say "this one is better" than to write the perfect response from scratch.

This is the generation-verification (G-V) gap: the cost of verifying quality is far lower than the cost of generating quality. RLHF exploits this gap. Instead of expensive gold responses, you collect cheap pairwise preferences.

Cost reality check (rough, illustrative figures at roughly InstructGPT data scale). For a 7B model: pretraining ≈ $300k. SFT data (25k examples from human writers) ≈ $25k. Preference data (33k pairwise comparisons from crowd workers) ≈ $4k. The RL training itself ≈ $100. Each stage is roughly an order of magnitude cheaper than the one before it, yet each stage compounds the alignment improvement.

What preference data looks like

The standard setup: given a prompt x, sample two responses yw (winner/chosen) and yl (loser/rejected) from the current model. Present both to a human annotator and ask: "Which response do you prefer?" The annotator picks one. You record the tuple (x, yw, yl).
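One way to hold such a record in code is a small immutable data class. This is an illustrative sketch: the class name, field names, and the `record_preference` helper are assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str     # x
    chosen: str     # y_w, the response the annotator preferred
    rejected: str   # y_l, the other response

def record_preference(prompt, response_a, response_b, annotator_picks_a):
    """Turn one annotation (A vs B) into an (x, y_w, y_l) record."""
    if annotator_picks_a:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)

pair = record_preference(
    "What is the capital of France?",
    "France has a capital called maybe Paris?",
    "Paris is the capital of France.",
    annotator_picks_a=False,
)
print(pair.chosen)    # Paris is the capital of France.
print(pair.rejected)  # France has a capital called maybe Paris?
```

Storing the winner and loser in fixed fields, rather than storing A, B, and a label, keeps the downstream loss code free of branching.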

InstructGPT hired 40 workers via Scale AI and Upwork, with careful vetting for agreement with researcher judgments. They were given detailed rating guidelines covering helpfulness, harmlessness, and honesty. Despite this, inter-annotator agreement was only moderate (~72%) — humans do not agree perfectly on what "better" means.

Who annotates matters — a lot. Research by Santurkar et al. (2023) showed that RLHF models reflect the values and demographic biases of their annotator pool. A model trained on preferences from U.S. crowd workers will learn a distinctly American sense of what counts as a "good" answer — more verbose, more optimistic, more certain. Models trained on annotators from different demographics show noticeably different response styles on politically sensitive topics.

LM-generated feedback

An increasingly popular cost-cutting move: use GPT-4 (or a strong model) as the preference annotator. Constitutional AI (Anthropic) and UltraFeedback (used in Zephyr, Tulu3) showed that AI-generated preferences correlate surprisingly well with human preferences at the system level — near human inter-annotator agreement rates. This collapses the cost from thousands of dollars to tens of dollars for a new dataset.

The G-V gap justifies collecting preference data rather than SFT gold responses. Which statement best characterizes the G-V gap?

Chapter 4: Reward Modeling

You have a dataset of pairwise preferences (x, yw, yl). Now what? You need to turn these discrete comparisons into a scalar reward signal that the RL training loop can optimize. That is the job of the reward model.

The Bradley-Terry model

The Bradley-Terry model (1952) is a probabilistic framework for pairwise comparisons. It says: each item i has a latent "strength" ri. The probability that item A beats item B is:

$$P(A \succ B) = \frac{e^{r_A}}{e^{r_A} + e^{r_B}} = \sigma(r_A - r_B)$$

That last step is the key: the sigmoid function σ(z) = 1/(1+e−z) naturally converts the reward difference into a probability. When rA ≫ rB, P(A≻B) → 1. When rA = rB, P = 0.5 (a coin flip). When rA ≪ rB, P → 0.
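The identity follows from dividing numerator and denominator by e^(rA), which gives 1/(1 + e^(−(rA − rB))). A short numerical check, as a sketch with hypothetical function names and arbitrary example rewards:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_ratio_form(r_a, r_b):
    """Bradley-Terry probability written as e^rA / (e^rA + e^rB)."""
    return math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))

# The two forms agree for any pair of rewards
for r_a, r_b in [(2.1, 0.4), (1.5, 1.5), (-1.0, 2.0)]:
    print(r_a, r_b, round(bt_ratio_form(r_a, r_b), 4), round(sigmoid(r_a - r_b), 4))

# Only the difference matters: shifting both rewards changes nothing
print(math.isclose(bt_ratio_form(2.1, 0.4), bt_ratio_form(12.1, 10.4)))  # True
```

The shift invariance is worth noting: Bradley-Terry rewards are only identified up to an additive constant, so the absolute scale of a reward model's output carries no meaning by itself.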

Worked example

Suppose our reward model scores two responses as rw = 2.1 and rl = 0.4. Then:

$$P(y_w \succ y_l) = \sigma(2.1 - 0.4) = \sigma(1.7) = \frac{1}{1 + e^{-1.7}} = \frac{1}{1 + 0.183} = 0.845$$

The model says there is an 84.5% chance that yw is preferred. The reward model's training loss is the negative log-likelihood under Bradley-Terry:

$$\mathcal{L}_{\text{RM}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

This loss drives rθ(x, yw) > rθ(x, yl) for every preference pair. The bigger the margin, the lower the loss. The reward model is initialized from the SFT checkpoint (same architecture) with a linear head on top of the final hidden state that outputs a single scalar reward.
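To get a feel for the scale of this loss, here is a minimal sketch that evaluates it on hand-picked scalar rewards. The function names are hypothetical, and the rewards are plain numbers rather than outputs of a real reward model.

```python
import math

def log_sigmoid(z):
    """Numerically stable log σ(z)."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def rm_loss(pairs):
    """pairs: list of (r_chosen, r_rejected) scalar rewards."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

well_ranked = rm_loss([(2.1, 0.4)])   # chosen scored higher by 1.7
tie = rm_loss([(1.0, 1.0)])           # no preference: loss is log 2
misranked = rm_loss([(0.4, 2.1)])     # rejected scored higher by 1.7

print(round(well_ranked, 3), round(tie, 3), round(misranked, 3))
```

The loss is small when the chosen response wins by a margin, equals log 2 at a tie, and grows roughly linearly in the margin when the pair is misranked.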

Why initialize from SFT? The reward model needs to understand language well enough to assess response quality. Starting from the SFT checkpoint means it already understands the instruction-following context. Random initialization would require the reward model to learn language understanding from scratch from only ~33k labeled prompts, which is hopeless.

PyTorch reward model loss

python
import torch
import torch.nn.functional as F

def reward_model_loss(rm, x_chosen, x_rejected):
    # rm: reward model (LM + scalar head)
    # x_chosen, x_rejected: tokenized (prompt + response) for each pair

    r_w = rm(x_chosen).reward      # scalar reward for chosen response
    r_l = rm(x_rejected).reward    # scalar reward for rejected response

    # Bradley-Terry: maximize log σ(r_w − r_l)
    loss = -F.logsigmoid(r_w - r_l).mean()
    return loss

# At inference: higher reward = model predicts human would prefer this response
# reward(x, "Paris is the capital of France") → 2.1
# reward(x, "France has a capital called maybe Paris?") → 0.4

[Interactive canvas: Bradley-Terry preference probability as a function of the reward gap (rw − rl). The curve saturates quickly: a gap of 3 already implies >95% confidence.]

The Bradley-Terry model assigns P(A≻B) = σ(rA − rB). If the reward model gives rA = 1.5 and rB = 1.5, what is P(A≻B), and what does this mean?

Chapter 5: RLHF with PPO

You now have a reward model r(x, y) that can score any (prompt, response) pair. The RLHF problem is: find the policy πθ (your language model) that maximizes expected reward. Naively:

$$\max_\theta \; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y) \right]$$

This looks clean, but there is a catastrophic failure mode hiding in it. If you optimize this objective unconstrained, the language model will learn to produce reward-hacked outputs — responses that score high on the reward model but are not actually good. Maybe the reward model learned that long, confident responses tend to be preferred. So the LM just generates endlessly verbose nonsense — scoring high on the RM while being useless.

The KL constraint

The fix: add a KL divergence penalty between the current policy πθ and a frozen reference policy πref (the SFT model). The full RLHF objective is:

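$$\max_\theta \; \mathbb{E}_{x \sim D}\Big[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y) \right] - \beta \cdot \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$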

The KL term measures how far the current policy has drifted from the SFT reference. β is the KL coefficient — a hyperparameter that controls the tradeoff. Large β: stay close to SFT, reward optimization is weak. Small β: allow large policy changes, risk reward hacking.

KL divergence as a leash. Think of the SFT model as a well-trained dog. RLHF teaches it new tricks via reward signals. The KL penalty is the leash — it prevents the dog from running so far after the reward that it forgets how to behave. Without the leash (β=0), the model exploits the reward model's blind spots and produces degenerate outputs.
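The tradeoff can be made concrete on a toy problem with three candidate responses for a single prompt. Everything here is an illustrative assumption: the distributions, the rewards, and the helper names. Real policies are distributions over token sequences, not three-element lists.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]        # pi_ref over three candidate responses
rewards = [0.0, 1.0, 3.0]          # reward-model scores
greedy = [0.001, 0.001, 0.998]     # nearly all mass on the top-reward response
moderate = [0.2, 0.3, 0.5]         # shifts toward reward, stays near reference

for beta in (0.0, 0.5, 5.0):
    print(beta,
          round(rlhf_objective(greedy, reference, rewards, beta), 3),
          round(rlhf_objective(moderate, reference, rewards, beta), 3))
```

With β = 0 the greedy policy scores best because nothing penalizes drift. With a large β the KL term dominates and the policy that stays near the reference scores higher.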

Expanding the KL term per token and combining with the reward gives a per-token reward at each step t:

$$\tilde{r}(x, y_t) = r(x, y) \cdot \mathbb{1}[t = T] - \beta \cdot \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$

The reward r(x, y) arrives only at the end of the sequence (the terminal reward). The KL penalty applies token-by-token throughout. This is a standard Markov Decision Process: state = (x, y<t), action = yt, reward = r̃.
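A minimal sketch of this per-token reward, assuming the per-token log-probabilities of the sampled response under both models are already available. The numbers and the function name are made up for illustration.

```python
def per_token_rewards(policy_logps, ref_logps, terminal_reward, beta):
    """Per-token reward: -beta * log-ratio at every step, plus r(x, y) at the last step."""
    last = len(policy_logps) - 1
    rewards = []
    for t, (lp, lp_ref) in enumerate(zip(policy_logps, ref_logps)):
        reward_t = -beta * (lp - lp_ref)   # KL penalty term for token t
        if t == last:
            reward_t += terminal_reward    # reward model score arrives at the end
        rewards.append(reward_t)
    return rewards

policy_logps = [-0.5, -1.0, -0.2]   # log pi_theta(y_t | x, y_<t)
ref_logps = [-0.7, -1.0, -0.9]      # log pi_ref(y_t | x, y_<t)
shaped = per_token_rewards(policy_logps, ref_logps, terminal_reward=2.0, beta=0.1)
print([round(r, 3) for r in shaped])

# The per-token rewards sum to r(x, y) - beta * (sequence-level log-ratio)
log_ratio = sum(policy_logps) - sum(ref_logps)
print(abs(sum(shaped) - (2.0 - 0.1 * log_ratio)) < 1e-9)  # True
```

Summed over the sequence, the log-ratio of a sampled response is a single-sample estimate of the KL term in the objective, which is why the per-token form and the sequence-level objective agree in expectation.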

PPO at a conceptual level

Proximal Policy Optimization (PPO) is the RL algorithm used in InstructGPT. The key idea: don't take policy update steps that are too large. The plain policy-gradient (REINFORCE) estimator E[R(z)∇ log pθ(z)] has very high variance, and a single large step can wreck the policy. PPO clips the policy ratio πθ(a)/πold(a) to stay in [1−ε, 1+ε], which removes the incentive for updates that change the policy too radically in a single step.
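The clipping rule can be sketched for a single action. This follows the standard PPO clipped surrogate, which multiplies the ratio by an advantage estimate and takes the minimum of the clipped and unclipped terms; the advantage and the value ε = 0.2 are assumptions here, since the text above only describes the ratio clipping.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Single-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0): the payoff stops growing once the ratio passes 1 + eps,
# so there is no incentive to keep raising its probability in this update
print(ppo_clipped_objective(1.1, 1.0))
print(ppo_clipped_objective(1.5, 1.0))
print(ppo_clipped_objective(3.0, 1.0))

# Bad action (A < 0): the payoff stops improving once the ratio falls below 1 - eps
print(ppo_clipped_objective(0.9, -1.0))
print(ppo_clipped_objective(0.5, -1.0))
```

Taking the minimum makes the objective pessimistic: clipping only ever removes gains from moving the ratio far from 1, it never hides a loss.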

In practice, PPO for RLHF requires four models in memory simultaneously: the policy being trained, the reference policy (frozen), the reward model (frozen), and a value network (predicting expected future reward). This is operationally complex — each training step requires a rollout phase (generate y ~ πθ), a scoring phase (compute r, KL), and an update phase (gradient step on policy and value network).

PPO is finicky. InstructGPT required careful hyperparameter tuning: the right β, the right learning rate, the right number of rollouts per batch. PPO is also sensitive to the quality of the reward model — a noisy RM amplifies errors through the RL loop. This instability motivated the search for alternatives that avoid on-policy RL entirely.

[Interactive canvas: RLHF reward vs KL tradeoff as a function of β (KL penalty weight). Low β allows high reward but risks large policy drift (reward hacking). High β keeps the model near SFT but caps achievable reward. The sweet spot is where the reward-KL frontier curves.]

In the RLHF objective max E[r(x,y)] − β·KL(π∥πref), what happens as β→0?

Chapter 6: DPO: Derivation

PPO works but it is operationally painful: four models in memory, rollout loops, careful hyperparameter tuning, high variance gradients. Is there a way to train on preference data without any on-policy RL?

Direct Preference Optimization (DPO) says yes — and it derives from the same RLHF objective we just wrote down. The insight is algebraic: the KL-constrained RL problem has a closed-form optimal policy, which lets us reparametrize the reward in terms of the policy and skip the reward model entirely.

Step 1 — Solve the KL-constrained objective analytically

For a fixed reward function r(x, y), the policy that maximizes E[r] − β·KL(π∥πref) is (up to a normalizing constant Z(x)):

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

This is a Gibbs distribution — the reference policy reweighted by exponentiated reward. You can verify this by writing the Lagrangian and setting the functional derivative to zero. Z(x) = ∑y πref(y|x) exp(r(x,y)/β) is a normalizing constant that depends on x but not y.
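An alternative to the Lagrangian argument, sketched here for a single fixed prompt x, is to rewrite the objective as a KL divergence to the Gibbs distribution. The steps below use only the definitions of π* and Z(x) given above.

```latex
\begin{aligned}
\mathbb{E}_{y \sim \pi}\left[ r(x, y) \right] - \beta \, \mathrm{KL}\big(\pi \,\Vert\, \pi_{\text{ref}}\big)
  &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{r(x, y)}{\beta} \right] \\
  &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\!\big(r(x, y)/\beta\big)} - \log Z(x) \right] \\
  &= -\beta \, \mathrm{KL}\big(\pi \,\Vert\, \pi^*\big) + \beta \log Z(x)
\end{aligned}
```

Since Z(x) does not depend on π and the KL divergence is non-negative, the objective is maximized exactly when KL(π ∥ π*) = 0, that is, when π = π*.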

Step 2 — Solve for the implied reward

Rearrange to express r(x, y) as a function of the policy:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

This is the key equation: given any policy π, its "implied reward" under the RLHF framework is the log ratio π/πref scaled by β. The intractable Z(x) term disappears in the next step.
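Steps 1 and 2 can be checked numerically on a toy problem where the response space has only three elements, so Z(x) is a three-term sum. The distributions, rewards, and function names below are illustrative assumptions.

```python
import math

def optimal_policy(ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta); returns (pi*, Z)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def implied_reward(pi, ref, z, beta):
    """r(y) = beta * log(pi(y) / pi_ref(y)) + beta * log Z."""
    return [beta * math.log(p / q) + beta * math.log(z) for p, q in zip(pi, ref)]

ref = [0.5, 0.3, 0.2]       # pi_ref over three candidate responses
rewards = [0.0, 1.0, 3.0]   # r(x, y) for each response
beta = 0.5

pi_star, z = optimal_policy(ref, rewards, beta)
recovered = implied_reward(pi_star, ref, z, beta)

print([round(p, 4) for p in pi_star])    # mass shifts toward high reward
print([round(r, 6) for r in recovered])  # matches the original rewards
```

For a real language model the sum defining Z(x) runs over all possible responses, which is why it is called intractable in the text.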

Step 3 — Plug into Bradley-Terry

The preference probability under Bradley-Terry is σ(r(x,yw) − r(x,yl)). Substituting our implied reward:

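$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$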

The Z(x) terms cancel: both rewards carry the same additive term β log Z(x), which depends only on the prompt x, so it drops out of the difference. We are left with a preference probability that depends only on policy log-ratios, with no reward model and no intractable normalization.
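Written out term by term, using the implied reward from Step 2 for each response, the cancellation looks like this:

```latex
\begin{aligned}
r(x, y_w) - r(x, y_l)
  &= \left[ \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) \right]
   - \left[ \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} + \beta \log Z(x) \right] \\
  &= \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
   - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\end{aligned}
```

The cancellation relies on both responses sharing the same prompt x. It is also the reason Bradley-Terry fits so well here: the preference probability depends only on reward differences.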

Step 4 — The DPO loss

Maximize the log-likelihood of the observed preferences:

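$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$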

This is the DPO objective. Let's name the two terms: let Δw = log πθ(yw|x) − log πref(yw|x) (how much more/less likely the trained model is to produce the chosen response vs the reference). Similarly Δl for the rejected. The loss is −log σ(β(Δw − Δl)).
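A worked numerical example of this loss for a single pair, using made-up sequence log-probabilities. The function name is hypothetical, and the sketch is written for readability rather than numerical robustness at extreme margins.

```python
import math

def dpo_loss_from_logps(lp_w_policy, lp_w_ref, lp_l_policy, lp_l_ref, beta=0.1):
    """-log σ(β(Δw − Δl)) for one (chosen, rejected) pair of sequence log-probs."""
    delta_w = lp_w_policy - lp_w_ref   # log-ratio for the chosen response
    delta_l = lp_l_policy - lp_l_ref   # log-ratio for the rejected response
    z = beta * (delta_w - delta_l)
    return math.log1p(math.exp(-z))    # equals -log σ(z)

# At initialization the policy equals the reference: both deltas are 0, loss = log 2
at_init = dpo_loss_from_logps(-14.0, -14.0, -13.0, -13.0)

# After training: chosen became more likely (+2 nats), rejected less likely (-2 nats)
improved = dpo_loss_from_logps(-12.0, -14.0, -15.0, -13.0)

print(round(at_init, 4), round(improved, 4))
```

Note that the rejected response starts out more likely than the chosen one under the reference (−13 vs −14), yet the loss at initialization is still log 2. DPO measures change relative to the reference, not absolute likelihood.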

Gradient intuition. The DPO gradient pushes in two directions simultaneously: it increases the log probability of the chosen response and decreases the log probability of the rejected response. Both updates are scaled by a "prediction error" weight σ(β(Δl − Δw)), which measures how badly the current implied reward misranks the pair. If the model already ranks the pair correctly by a wide margin, the weight is close to zero and the gradient is small.
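For reference, the gradient can be written out explicitly. With z = β(Δw − Δl), the derivative of −log σ(z) with respect to z is −σ(−z), and only the policy terms in Δw and Δl depend on θ:

```latex
\nabla_\theta \mathcal{L}_{\text{DPO}}(\theta)
  = -\beta \, \mathbb{E}_{(x, y_w, y_l)}\Big[
      \sigma\big(\beta(\Delta_l - \Delta_w)\big)
      \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big)
    \Big]
```

The weight σ(β(Δl − Δw)) is 0.5 when the policy equals the reference, approaches 1 when the pair is badly misranked, and approaches 0 as the margin in favor of the chosen response grows.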

PyTorch DPO loss

python
import torch
import torch.nn.functional as F

def log_prob_sequence(model, input_ids, labels):
    # Sum log-probs over response token positions only
    logits = model(input_ids).logits               # (B, L, V)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # (B, L-1, V)
    shift_labels = labels[:, 1:]                   # (B, L-1)
    # Mask prompt positions (label == -100)
    mask = shift_labels != -100
    # gather cannot index with -100, so point masked positions at token 0;
    # their values are zeroed out by the mask below
    safe_labels = shift_labels.masked_fill(~mask, 0)
    token_log_probs = log_probs.gather(
        dim=-1, index=safe_labels.unsqueeze(-1)    # (B, L-1, 1)
    ).squeeze(-1)                                  # (B, L-1)
    return (token_log_probs * mask).sum(dim=-1)     # (B,), sum over response

def dpo_loss(policy, reference, x_w, y_w_labels, x_l, y_l_labels, beta=0.1):
    # Policy log-probs for chosen and rejected responses (gradients flow here)
    lp_w_policy = log_prob_sequence(policy, x_w, y_w_labels)
    lp_l_policy = log_prob_sequence(policy, x_l, y_l_labels)
    # The reference model is frozen, so no gradients are needed
    with torch.no_grad():
        lp_w_ref = log_prob_sequence(reference, x_w, y_w_labels)
        lp_l_ref = log_prob_sequence(reference, x_l, y_l_labels)

    delta_w = lp_w_policy - lp_w_ref    # log π_θ(yw) - log π_ref(yw)
    delta_l = lp_l_policy - lp_l_ref    # log π_θ(yl) - log π_ref(yl)

    # DPO loss: -log σ(β * (delta_w - delta_l))
    loss = -F.logsigmoid(beta * (delta_w - delta_l)).mean()
    return loss

[Interactive canvas: DPO margin → loss. The DPO loss is −log σ(βΔ), where Δ = (log π/πref)chosen − (log π/πref)rejected. The canvas shows the loss and its gradient magnitude as functions of β and the current margin Δ.]

In DPO, why does the intractable partition function Z(x) not appear in the final loss?

Chapter 7: Showcase: RLHF-PPO vs DPO

Both RLHF-PPO and DPO start from the same theoretical objective (maximize reward − β·KL). They are two algorithms for solving the same optimization problem. The question is: do they converge to the same solution in practice?

[Interactive canvas: RLHF-PPO pipeline vs DPO pipeline. Compares the data flows, model requirements, and training loops of the two pipelines, and shows where on the reward-KL tradeoff curve each algorithm ends up as training progresses.]

Empirical comparison

The controlled comparison in the DPO paper reported that DPO matched or exceeded PPO on summarization and dialogue tasks, and it did so with no reward model training, no rollouts, and a single-stage training loop. Many open-weight post-trained models (Zephyr, Tulu3, Llama-3-Instruct) use DPO or one of its variants.

That said, PPO is not obsolete. Recent results from labs with significant compute budgets have found that PPO with a strong reward model can outperform DPO on complex reasoning and coding tasks. The intuition: PPO can explore the response space during rollouts, discovering high-reward sequences that weren't in the training preference data. DPO can only leverage the preference pairs it was given.

| | RLHF-PPO | DPO |
|---|---|---|
| Training paradigm | On-policy RL | Supervised (offline) |
| Models needed | Policy + Reference + RM + Value net (4 total) | Policy + Reference (2 total) |
| Data required | Online rollouts + pref data | Offline pref data only |
| Implementation complexity | High (rollout loop, PPO clip, value net) | Low (single forward pass per pair) |
| Exploration | Yes: discovers new sequences | No: confined to training pairs |
| Risk | Reward hacking, instability | Mode collapse to preferred style |
| Industry usage | GPT-4, Gemini (reportedly) | Llama 3, Zephyr, Tulu3, Mistral |

DPO requires no reward model and no on-policy rollouts. What capability does this tradeoff sacrifice compared to PPO?

Chapter 8: Overoptimization

Alignment is not a solved problem once you've run RLHF. There is a systematic failure mode that emerges whenever you optimize a proxy objective hard enough: overoptimization, an instance of Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure").

In RLHF, the proxy is the reward model. The reward model is an imperfect proxy for actual human preference — it learned from noisy human labels, it has limited capacity, and it cannot generalize perfectly. As the policy optimization drives the LM further from its SFT starting point in pursuit of higher RM scores, it eventually finds response patterns that exploit the RM's errors rather than genuine quality signals.

Classic example of reward hacking. A reward model trained on human preferences for helpful summaries gives high scores to: (1) long responses, (2) confident-sounding statements, (3) responses that use the word "certainly." After enough PPO steps, the model produces infinitely long, maximally confident, "certainly certainly certainly certainly..." drivel. The RM scores this highly. Human evaluators do not.

The overoptimization curve

Gao et al. (2022) measured this systematically. Because human labels are expensive, they used a large "gold" reward model as a stand-in for true human preference, trained smaller proxy reward models on its labels, and then optimized a policy against a proxy with increasing strength. When proxy RM score and gold score are plotted against the amount of optimization, the curves diverge: the proxy score keeps increasing, while the gold score peaks at some intermediate optimization strength and then decreases.

This peak-then-decline shape has been observed repeatedly across RLHF setups. The KL penalty (the β term) controls how quickly you reach the peak: smaller β means faster divergence, so the peak arrives sooner and is lower. But no value of β rules out overoptimization if you optimize for long enough.

Mode collapse

A second failure mode: mode collapse. RLHF tends to degrade the calibration that the base model had from pretraining. A pretrained model produces diverse outputs sampled from a broad distribution over plausible completions. After RLHF, the model increasingly concentrates probability mass on a narrow set of "preferred" response patterns. Ask it the same question 10 times and you get nearly identical answers. The policy behaves less like a calibrated probabilistic model and more like a near-deterministic map from prompt to preferred response style.
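One way to see the concentration effect is to measure the entropy of the closed-form KL-regularized optimum from Chapter 6, π*(y) ∝ πref(y)·exp(r(y)/β), as β shrinks. This is a toy sketch over four candidate responses with made-up rewards; it uses the analytic optimum as a stand-in for actual RLHF training, which is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tilted_policy(ref, rewards, beta):
    """Reference policy reweighted by exp(reward / beta), then normalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

ref = [0.25, 0.25, 0.25, 0.25]   # diverse reference: uniform over 4 responses
rewards = [0.0, 0.5, 1.0, 2.0]   # hypothetical reward-model scores

entropies = []
for beta in (10.0, 1.0, 0.1):
    h = entropy(tilted_policy(ref, rewards, beta))
    entropies.append(h)
    print(beta, round(h, 3))
```

As β decreases, the optimization pressure grows and the entropy falls from nearly log 4 toward 0: almost all probability mass lands on the single highest-reward response.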

[Interactive canvas: Overoptimization. RM score vs true human preference as optimization progresses. The two curves diverge as optimization proceeds. The β (KL penalty weight) control shows how the KL penalty delays, but does not prevent, overoptimization. A dashed vertical line marks the optimal stopping point.]

A reward model is trained on human preferences for helpful responses. After 10,000 PPO steps, the model's RM score is at its all-time high, but human evaluators rate it worse than the SFT checkpoint. What has happened?

Chapter 9: Connections

Alignment via SFT and RLHF sits at the intersection of several threads. Here is how this lecture connects to the broader landscape:

| Concept from this lesson | Where it leads next |
|---|---|
| SFT cross-entropy loss | Same loss as pretraining; the only difference is the data and the prompt mask (CS336 Lec 2) |
| Bradley-Terry reward modeling | Elo ratings, tournament ranking, pairwise comparison theory |
| KL divergence in RLHF | Information geometry, variational inference, VI-based LLM sampling |
| DPO closed-form derivation | SimPO (no reference), IPO, KTO; DPO variants in CS336 Lec 16 |
| Reward hacking / overoptimization | Constitutional AI, process reward models, verifiable rewards in Reward & Alignment Gleam |
| Preference data collection | RLHF ethics, crowdworker conditions, AI feedback (RLAIF) |

DPO variants worth knowing

DPO spawned a family of variants, two of which are worth knowing from the Tulu 3 paper:

The alignment tax

A persistent empirical observation: RLHF-aligned models tend to score slightly worse on academic benchmarks (MMLU, etc.) than their base model counterparts. The hypothesis is that alignment compresses the output distribution — the model learns to respond in assistant style even when the benchmark expects different formatting. This is sometimes called the "alignment tax."

In practice, the alignment tax is small and worth paying — a model that refuses to answer harmful questions and follows instructions reliably is far more useful than one that scores 2 points higher on MMLU.

The full InstructGPT recipe. Pretrain on web text → SFT on 13k human-written (prompt, response) pairs → train reward model (initialized from SFT) on human rankings for 33k prompts → PPO against the RM with KL penalty. This pipeline became the template for ChatGPT, Claude, and virtually every major aligned model that followed.

Cheat sheet

| Formula | What it is |
|---|---|
| $\mathcal{L}_{\text{SFT}} = -\frac{1}{T}\sum_t \log p_\theta(y_t \mid x_{<t})$ | SFT loss (response tokens only) |
| $P(A \succ B) = \sigma(r_A - r_B)$ | Bradley-Terry preference probability |
| $\mathcal{L}_{\text{RM}} = -\log \sigma(r_w - r_l)$ | Reward model training loss |
| $\max \mathbb{E}[r] - \beta \cdot \mathrm{KL}(\pi \Vert \pi_{\text{ref}})$ | RLHF objective |
| $\mathcal{L}_{\text{DPO}} = -\log \sigma(\beta(\Delta_w - \Delta_l))$ | DPO loss ($\Delta = \log \pi_\theta / \pi_{\text{ref}}$) |

Which stage of the InstructGPT recipe directly uses pairwise preference data (A vs B comparisons)?

"What I cannot create, I do not understand." (Richard Feynman) You can now create every piece of the alignment pipeline: the SFT data loader, the reward model, the RLHF objective, and the DPO loss. What you cannot yet create: a reward model that doesn't overfit, a preference dataset without demographic bias, and a KL penalty that perfectly calibrates the reward-safety tradeoff. Those remain open problems.