Language Modeling from Scratch · CS336 · Lecture 17

Alignment III: Policy Gradients & the RL Frontier

The foundations and frontier of RL for language models — from first principles. Lectures 15 & 16 gave you PPO, DPO, and GRPO. This lesson goes deeper: why does the policy gradient theorem hold? What is an LM generation as an RL trajectory? Why do baselines not bias the gradient? What does the PPO clip actually do geometrically? And where is RL-for-LMs headed next? Full derivations, five live canvases, and a CS336 series capstone.

Prerequisites: CS336 Lec 15 (PPO, DPO) + CS336 Lec 16 (GRPO, RLVR). Probability expectation. Basic calculus (chain rule).
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: An LM Generation Is a Trajectory

You have a language model. You ask it: "What is 3x + 7 = 22?" It writes a response, token by token. At the end you check: is the answer correct? If yes, you want more of that. If no, you want less. How do you actually train toward that objective?

The first step is recognizing that generation is a sequential decision process — the kind reinforcement learning was built to handle. But RL is usually explained with robots and game controllers, not text generators. Let's pin down exactly what the RL framing means for an LM.

The mapping. State s = the prompt plus all response tokens generated so far. Action a = generate the next token (a choice from the vocabulary). Trajectory = the full generation: s0 → a1 → s1 → a2 → … → aT. Reward R = a single scalar at the end (did the response satisfy the user or pass the verifier?). Policy πθ(a|s) = the LM itself — a probability distribution over next tokens given context.

Three LM-specific features make this RL problem unusual. First, the transition is deterministic: given state s and action a (a token), the next state s′ = concat(s, a). No environment stochasticity. Second, rewards are sparse and terminal: the reward is 0 at every step except the last, where it reflects the quality of the entire response. Third, the state space is vast but structured — unlike a robot in physical space, every "state" is just a string of tokens, and the model can compute log-probabilities exactly.

s0 = prompt
"Sort the list [3, 1, 2]. Respond with the sorted list."
↓ action a1 = token "["
s1 = prompt + "["
State grows by one token.
↓ action a2 = token "1"
s2 = prompt + "[1"
Deterministic transition: concatenate.
↓ … T tokens later …
sT = full response
R = 1 if "[1, 2, 3]" is correct, else 0.

The objective is to find policy parameters θ that maximize the expected reward: J(θ) = E(s,a) ∼ πθ[R(s, a)], where the expectation is over both sampled prompts s and sampled response tokens a. Everything in Lectures 16 and 17 is about how to compute ∇θJ efficiently.

Misconception: "RL for LMs is just supervised learning with reweighting." Partially true at one update step, but the distribution changes. SFT trains on a fixed dataset. REINFORCE trains on samples from the current policy — as the policy improves, its sample distribution shifts, exposing new states. This non-stationarity is what makes RL harder than SFT, and why on-policy freshness matters.
LM generation as a trajectory — animate a token-by-token rollout

Watch a language model generate a response token by token. Each step is a state-action pair. The reward appears only at the end. Click Play or Step to advance.

In the RL framing of LM generation, what is the "action" at each step?

Chapter 1: The Policy Gradient Theorem

We want to compute ∇θJ(θ) — the gradient of expected reward with respect to policy parameters. The challenge: we can't differentiate through a discrete sample. The solution is the log-derivative trick, also called the score function estimator or REINFORCE identity.

The derivation — three lines

For notational simplicity, let a denote the entire response (treating the sequence as one compound action — valid for the bandit / terminal-reward setting CS336 uses):

J(θ) = Es∼D, a∼πθ(·|s)[R(s, a)] = ∑s p(s) ∑a πθ(a|s) R(s, a)

Take the gradient with respect to θ. p(s) doesn't depend on θ, and R(s, a) is a scalar reward that doesn't depend on θ (the verifier is fixed). So the gradient hits only πθ(a|s):

θ J = ∑s p(s) ∑a R(s, a) · ∇θ πθ(a|s)

Now apply the log-derivative identity: ∇ p = p · ∇ log p (just the chain rule — d/dθ log f = (1/f)(df/dθ) rearranged). Multiply and divide by πθ(a|s):

θ J = ∑s p(s) ∑a πθ(a|s) · R(s, a) · ∇θ log πθ(a|s)
= Es∼D, a∼πθ [ R(s, a) · ∇θ log πθ(a|s) ]
Why this is powerful. The gradient is now an expectation over the policy. We can estimate it by sampling: generate a response a from the current policy, compute R(a), multiply by ∇ log πθ(a|s), and average over a batch. R(a) is never differentiated — it's just a scalar weight. This lets us use any reward function, even a non-differentiable verifier or a human rater.

What ∇ log πθ(a|s) means in practice

For an autoregressive LM, the full-sequence log-probability decomposes as a sum over tokens: log πθ(a|s) = ∑t=1T log πθ(at|s, a<t). This sum is fully differentiable — it's exactly the same computation as the SFT cross-entropy loss (with sign flipped and weighted by R).

θ log πθ(a|s) = ∑t=1Tθ log πθ(at | s, a<t)

So the REINFORCE update is: for each token at in the response, compute the cross-entropy gradient at that position, and scale it by R(s, a). High-reward responses push their token probabilities up. Zero-reward responses produce no gradient. This is "reward-weighted SFT" — and it's the foundation of every algorithm in this CS336 alignment series.

python
import torch

def policy_gradient_loss(model, prompt_ids, response_ids, reward):
    # reward: scalar (e.g. 1.0 if correct, 0.0 if wrong)
    # response_ids: (T,) token ids for the response

    # Forward pass over prompt + response
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits  # (1, L, vocab)

    # Log-probs at each response token position
    prompt_len = prompt_ids.shape[0]
    resp_logits = logits[0, prompt_len-1:-1, :]  # shifted: predict response tokens
    log_probs = torch.log_softmax(resp_logits, dim=-1)
    token_log_probs = log_probs.gather(1, response_ids.unsqueeze(1)).squeeze(1)

    # REINFORCE loss: -R * sum_t log π(a_t | s, a_<t)
    # Negative because PyTorch minimizes, we want to maximize J
    seq_log_prob = token_log_probs.sum()
    loss = -reward * seq_log_prob
    return loss
The policy gradient theorem gives ∇J = E[R(a) · ∇ log π(a|s)]. Why doesn't the gradient need to pass through R(a) — even if R involves a non-differentiable verifier?

Chapter 2: REINFORCE & Its Variance Problem

The policy gradient theorem gives us an unbiased gradient estimator. We just sample (s, a) ~ π, compute R(s, a) · ∇ log π(a|s), and average over a batch. This is REINFORCE (Williams, 1992). It works — in theory. In practice, the variance is often catastrophically high.

Why variance is catastrophically high

For verifiable rewards in math (R ∈ {0, 1}), consider a problem where the model has a 20% chance of being correct. In a batch of 8 rollouts, you might get 0 correct, producing a zero gradient for the entire batch. Or you might get 1 correct, producing a large gradient from a single noisy sample. The signal-to-noise ratio is terrible.

More precisely: the variance of the REINFORCE estimator scales with E[R2] − (E[R])2. When rewards are sparse (most zero), E[R] is small. But when one reward fires, R2 = R = 1, so the variance term is dominated by the first moment. You need many samples to get a stable gradient estimate.

Misconception: "More rollouts always fix variance." More rollouts reduce variance by 1/N (standard error). But if the reward is very sparse — say, 1% correctness — you need hundreds of rollouts per prompt just to get one positive sample. This is computationally prohibitive at LM scale. Variance reduction techniques (baselines, advantages) are more efficient than brute-force sampling.

The two-state worked example

Percy's example from Lecture 17 makes the problem concrete. Two states:

In s1, a1 is better (11 > 9). In s2, a2 is better (2 > 0). But naive REINFORCE sees reward 9 for (s1, a2) — much higher than reward 2 for (s2, a2). Without a baseline, it pushes up a2 in s1 strongly (R=9), even though a1 is better there. The absolute reward magnitude swamps the relative signal.

Naive variance: std([11, 9, 0, 2]) = 4.74

Exact Python: torch.std(torch.tensor([11., 9, 0, 2])) = 4.743. The gradient weights are all positive and very different in scale. Convergence requires averaging out this noise.

REINFORCE variance — gradient spread over N samples

Each dot is an independent gradient estimate from one rollout. Adjust N (samples) and reward sparsity to see how gradient variance changes. The spread of dots shows variance; the center is the true gradient direction.

Samples N 4
Reward sparsity (0=all wrong, 1=all right) 0.30
In the two-state example, naive REINFORCE pushes up action a2 in state s1 strongly (reward = 9), even though a1 is better. Why does this happen, and what does it reveal about REINFORCE?

Chapter 3: Baselines & Advantage

The variance problem is solved by baseline subtraction. Instead of using the raw reward R(s, a) as the gradient weight, we use R(s, a) − b(s), where b(s) is any function of the state (not the action). The key theorem: this does not bias the gradient.

Why baselines don't introduce bias — the proof

We need to show that Ea∼π[b(s) · ∇ log π(a|s)] = 0 for any b(s). Expand the expectation:

Ea∼π[b(s) · ∇ log π(a|s)] = b(s) · ∑a π(a|s) · ∇ log π(a|s)
= b(s) · ∑a π(a|s) · (∇ π(a|s) / π(a|s))
= b(s) · ∑a ∇ π(a|s) = b(s) · ∇ a π(a|s)
= b(s) · ∇ 1 = b(s) · 0 = 0

The sum ∑a π(a|s) = 1 for all θ, so its gradient is zero. Therefore subtracting any b(s) from R doesn't change the expected gradient. It can only reduce variance.

Proven. Any state-dependent baseline b(s) is valid. The optimal baseline that minimizes variance is b*(s) = E[(∥∇ log π(a|s)∥2) R(s,a) | s] / E[∥∇ log π(a|s)∥2 | s] — but this requires computing the gradient before we know the optimal b, so in practice we use simpler choices.

The best practical baseline: the value function

The best practical choice is b(s) = V(s) = Ea∼π[R(s, a)] — the expected reward from state s. Then the gradient weight becomes:

R(s, a) − V(s) = Q(s, a) − V(s) ≡ A(s, a)

This is the advantage function: how much better is action a compared to the average action from state s? Returning to the two-state example with b(s1) = 10, b(s2) = 1:

Advantaged rewards: [11−10, 9−10, 0−1, 2−1] = [1, −1, −1, 1]
Baselined variance: std([1, −1, −1, 1]) = 1.15

Variance dropped from 4.74 to 1.15 — a 4× reduction. And now the gradient signal is correct: a1 in s1 gets +1 (it's above average), a2 in s1 gets −1 (below average). Same for s2. The relative advantage within each state is now what drives the update.

Advantage is relative, not absolute. A reward of 9 in state s1 where the average is 10 is BAD. A reward of 2 in state s2 where the average is 1 is GOOD. The advantage function captures this relativity. Every modern RL algorithm — PPO, GRPO, actor-critic — uses advantage estimates instead of raw rewards.
Gradient push: REINFORCE vs baselined — a 2D policy toy

Two actions at two states. Arrows show gradient direction (where the policy is being pushed). No baseline: raw reward magnitudes. With baseline: advantage-weighted. Watch how the baseline reveals the correct relative ranking within each state.

We showed that E[b(s) · ∇ log π(a|s)] = 0 by using ∑a ∇ π(a|s) = ∇ 1 = 0. What is the precise mathematical condition that b(s) must satisfy for this identity to hold?

Chapter 4: Generalized Advantage Estimation (GAE)

The advantage A(s, a) = R(s, a) − V(s) is conceptually clean, but requires a good estimate of V(s). In the LM bandit setting (one terminal reward R at the end of the sequence), V(s) is just the expected reward from the current prompt-response prefix. But how do you estimate V(s) in practice?

The trade-off: bias vs variance

Two extreme approaches. Monte Carlo: use the actual sampled reward R as the estimate of Q(s, a), and subtract a separate estimate of V(s). This is unbiased (given infinite samples), but high-variance. TD(0): use the one-step TD target rt + γ V(st+1) as the Q estimate. This is low-variance (bootstrapping from a learned V), but biased if V is wrong.

Generalized Advantage Estimation (GAE) by Schulman et al. (2015) interpolates between these extremes with a parameter λ ∈ [0, 1]. First define the TD residual (a one-step advantage estimate):

δt = rt + γ V(st+1) − V(st)

Then the GAE advantage is an exponentially-weighted sum of TD residuals:

AtGAE(γ,λ) = ∑k=0 (γλ)k δt+k

When λ = 0: At = δt (TD(0) — pure bootstrapping, low variance, high bias). When λ = 1: At = ∑k γk rt+k − V(st) (Monte Carlo return minus baseline, high variance, low bias). In practice λ = 0.95–0.99.

The LM bandit simplification

In CS336's setting, there is only one reward R at the end (rt = 0 for t < T, rT = R). The transition is deterministic (no stochastic environment). With γ = 1 and the terminal reward:

ATbandit = R − V(s0)

Where s0 is the initial prompt state and V(s0) = E[R | prompt]. This is exactly what GRPO computes as the group mean μ — a Monte Carlo estimate of V(s) from G parallel rollouts. GAE with λ=1, γ=1, and V approximated by the group mean is GRPO's advantage.

GRPO is a special case of GAE. When γ = λ = 1 (no discounting, full Monte Carlo) and V(s) ≈ μ (group mean), GAE reduces to: Ai = ri − μ. GRPO additionally normalizes by σ to control the gradient scale. The GRPO advantage Ai = (ri − μ)/σ is GAE with group-mean baseline and scale normalization.
MethodλBaselineBiasVariance
Vanilla REINFORCE1.0None (b=0)NoneVery high
REINFORCE + mean1.0Mean rewardNoneModerate
GRPO1.0Group mean μSmall (σ norm)Low-moderate
PPO (GAE)0.95Learned Vφ(s)LowLow
TD(0)0.0Bootstrapped VHighVery low
python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    # rewards: (T,) — per-step rewards (usually 0 except last)
    # values:  (T+1,) — V(s_t) estimates, V(s_T+1) = 0 (terminal)
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0

    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t+1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae

    returns = advantages + values[:-1]   # targets for value function update
    return advantages, returns

# LM bandit special case (gamma=lambda=1, all rewards at terminal step)
def grpo_advantage(group_rewards):
    # group_rewards: (G,) tensor of rewards from one prompt
    mu = group_rewards.mean()
    sigma = group_rewards.std() + 1e-4
    return (group_rewards - mu) / sigma
GAE has a parameter λ that interpolates between TD(0) (λ=0) and Monte Carlo (λ=1). For LM training with GRPO, why is using λ=1 (Monte Carlo) a natural choice?

Chapter 5: PPO Clip: Why It Works

We have advantages. Now we update the policy. The simplest approach: gradient ascent on E[A · ∇ log π(a|s)]. But this can cause catastrophically large updates — if the policy changes too much in one step, the rollout data becomes stale (generated by a very different old policy), and the gradient estimate is no longer valid.

The importance ratio

When we collect rollouts from policy πold and update policy πθ, the rollout distribution is off-policy. We can correct for this with importance sampling. The importance ratio measures how much the new policy weights a response relative to the old one:

r(θ) = πθ(a|s) / πold(a|s)

For a full response, this is a product of per-token ratios: r(θ) = ∏tθ(at|s, a<t) / πold(at|s, a<t)]. In practice computed as exp(logπnew − logπold) for numerical stability.

The unclipped surrogate objective is: LIS(θ) = E[r(θ) · A]. If r(θ) is close to 1 (policy hasn't changed much), this recovers the standard policy gradient. But r(θ) can grow large if πθ diverges from πold — and then the objective becomes unstable.

The PPO clip

Proximal Policy Optimization (PPO) clips the ratio to stay near 1. The clipped objective for a single sample is:

Lclip(θ) = min( r(θ) · A,   clip(r(θ), 1−ε, 1+ε) · A )

Where ε is typically 0.2 (or 0.01 for GRPO per the CS336 code). Work through the two cases:

Trust region intuition. PPO is an approximation to Trust Region Policy Optimization (TRPO), which constrains the policy update to stay within a KL ball of the old policy. The clip is a simpler, differentiable surrogate: it says "don't let the ratio stray more than ε from 1." In practice, PPO with clipping achieves similar stability to TRPO at much lower computational cost.

Worked numerical example

Setting: ε = 0.2. One response. Advantage A = +0.5 (good response). Current ratio r = 1.35 (new policy increased this response's probability by 35%).

Unclipped: r · A = 1.35 × 0.5 = 0.675
Clipped: clip(1.35, 0.8, 1.2) × 0.5 = 1.2 × 0.5 = 0.60
Lclip = min(0.675, 0.60) = 0.60

The clip activated and reduced the objective — preventing us from extracting too much gradient from this response. Now same example with A = −0.5 (bad response, r = 0.65):

Unclipped: 0.65 × (−0.5) = −0.325
Clipped: clip(0.65, 0.8, 1.2) × (−0.5) = 0.8 × (−0.5) = −0.40
Lclip = min(−0.325, −0.40) = −0.40

For negative advantages, the min picks the more negative (pessimistic) value — the clipped version. This enforces a lower bound on how much we suppress the bad response in one step.

PPO clip curve — ratio r vs objective Lclip

Drag the advantage slider (positive or negative) and ε to see the clip curve. The flat region shows where the clip activates and the gradient goes to zero — the policy will not update further in that direction.

Advantage A 0.8
Clip ε 0.20
PPO clips the importance ratio r(θ) at [1−ε, 1+ε]. What does it mean geometrically when the clip is active and the gradient of Lclip with respect to θ is zero?

Chapter 6: On-Policy vs Off-Policy

Every policy gradient algorithm must answer a fundamental question: when you collect a batch of rollouts and perform multiple gradient steps on them, at what point do the rollouts become "too old" to use? This is the on-policy vs off-policy distinction.

On-policy: freshness at a cost

A purely on-policy algorithm uses only rollouts from the current policy πθ — generated moments before the update. It discards them after a single gradient step. REINFORCE is perfectly on-policy. This is safe (the gradient estimate is exactly correct) but computationally wasteful: LM inference is expensive, and throwing away rollouts after one step is painful.

Off-policy correction: importance sampling

An off-policy algorithm reuses rollouts collected from an older policy πold. The bias introduced by this is corrected (in expectation) by the importance ratio r(θ) = πθold. As long as πθ stays close to πold, the correction is accurate. If the policies diverge too much, the importance weights become extreme (many values near 0 or infinity), and the correction fails.

GRPO and PPO take a middle ground: collect G rollouts, store them as πold, then take K gradient steps (called "inner steps" or "PPO epochs") using the clipped ratio objective. Typically K = 1–4. After K steps, re-collect fresh rollouts. The clip on r(θ) ensures the policy doesn't drift too far during the K steps — so the off-policy correction remains valid.

Misconception: "More PPO inner steps is always better." More inner steps extract more signal from each rollout batch, but increase off-policy error. The clip bounds the ratio to [1−ε, 1+ε] per update, but K steps can compound: after K=10 steps, the effective drift could be large. DeepSeek-R1 uses K=1 (single step per rollout batch) precisely to avoid this. GRPO in the CS336 code uses K=1 as well.

The credit assignment problem

There is a deeper challenge specific to LMs: a response of T=500 tokens gets a single reward R at the end. How do we assign credit to individual tokens? Token 47 might have been a key logical step that led to the correct answer — but it gets the same weight in the loss as token 48 (filler word). This credit assignment problem is partially addressed by per-token GAE (which PPO uses), but GRPO assigns the same advantage to every token in the response.

Process rewards as a solution. One emerging approach: train a Process Reward Model (PRM) that gives intermediate rewards after each logical step in a chain of thought. This solves credit assignment by providing a dense reward signal — but requires labeled step-level data, which is expensive to collect and harder to verify automatically.
PropertyOn-Policy (pure REINFORCE)Off-Policy + PPO clipGRPO (K=1)
Rollout freshnessEvery stepEvery K stepsEvery step (K=1)
Gradient biasNoneSmall (clipped)Small (σ norm)
Sample efficiencyLow (discard after 1 step)Higher (reuse K times)Medium
ImplementationSimpleComplex (track old policy)Moderate
Credit assignmentSequence-levelPer-token (GAE)Sequence-level
In GRPO-style training, why do we freeze the "old policy" log-probabilities before computing the importance ratio, and what happens if we don't?

Chapter 7: Showcase: GRPO Training Lab

This is the interactive payoff. You're running a miniature GRPO training loop on a sorting task — exactly the task in Percy's Lecture 17 code. The model receives a 3-number sequence and must output the sorted version. Watch gradient descent happen in real time: see the reward signal, the advantage computation, and how the loss and accuracy evolve.

GRPO Training Loop — sorting task simulation

Control the delta mode (how advantages are computed) and loss mode. Press Play to run training. Watch reward and loss curves evolve. Each "step" is one gradient update on a batch of G rollouts per prompt.

Group size G 8
Delta mode
Learning rate 3
What to look for. With raw rewards (no baseline), training is noisy and often fails to converge — the model learns nothing or collapses to a degenerate solution. With centered rewards, convergence is smoother (suboptimal responses get negative signal). With GRPO normalization, the reward scale is consistent across prompts — this helps the most when prompts vary in difficulty.
RL is not easy. Percy's Lecture 17 notes this explicitly: "RL systems are much more complex than pretraining." Even on this toy sorting task, you can observe the model getting stuck in local optima, oscillating, or making no progress on hard prompts. The hyperparameters (learning rate, group size G, delta mode) all matter enormously. Real GRPO at LM scale adds distributed inference, memory management for old models, reward model serving, and KL penalty tuning on top of all this.

Chapter 8: RL vs SFT: When Each Helps

The CS336 series has now covered both SFT (Lec 15) and RL (Lecs 16–17). A natural question: when do you need RL, and when is SFT sufficient? Percy's Lecture 17 summary gives the key insight: "Reinforcement learning is the key to surpassing human abilities — if you can measure it, you can optimize it."

What SFT can and can't do

SFT (Supervised Fine-Tuning) trains the model to imitate a fixed dataset of demonstrations. It is simple, stable, and highly effective — LIMA showed that 1,000 high-quality examples can produce an instruction-following model competitive with RLHF. But SFT is fundamentally bounded by the quality of its demonstrations. If the training data never contains a chain-of-thought that solves a competition math problem correctly, SFT cannot exceed that ceiling. SFT imitates; it cannot exceed its training distribution.

RL can exceed the demonstration ceiling — but only when the reward signal is well-specified. The policy explores, tries things the dataset never contained, and gets rewarded for anything that works. For verifiable tasks (math, code, formal proofs), the reward is exact and exploration is productive. For open-ended tasks (writing quality, helpfulness, creativity), the reward is harder to specify, and RL can lead to reward hacking or mode collapse toward whatever the reward model values.

The key asymmetry. SFT needs good demonstrations — hard to collect at scale, but training is stable. RL needs a good reward signal — easy for verifiable tasks, hard for subjective ones. For math and code, the verifier is free and exact, so RL is clearly superior to SFT at pushing capability beyond the training distribution. For open-ended chat quality, SFT + RLHF (preference RM) is still the dominant approach.

The exploration problem

RL for LMs faces a severe exploration problem. For a 500-token response, the space of possible responses is |V|500 — astronomically large. Most random explorations are wrong. Unlike a robot that can take small steps in a continuous space, an LM makes discrete token choices where one wrong step can cascade into a completely wrong response. This is why RL for LMs works much better starting from a strong pretrained model — the initialization already places the policy in a productive region of the space.

Cold-start RL failure. Training RL from a randomly initialized LM almost never works — the exploration is too sparse to find any reward. DeepSeek-R1-Zero's key engineering decision was to start from the DeepSeek-V3 base model (already trained on trillions of tokens), where the policy already generates coherent text and occasionally solves math problems. The RL then sharpens and extends this existing capability.

Worked comparison: SFT vs GRPO on sorting

In Percy's lecture code, the sorting task with 3 numbers has a clear optimal policy: always output sorted(prompt). SFT with optimal demonstrations would converge immediately. But GRPO starting from random initialization must explore — and the lecture shows it can get stuck in local optima (e.g., always outputting [0,1,2] regardless of input) because some responses get reward > 0 and the policy concentrates on them.

The lesson: RL is a powerful tool, but it requires (1) a good initialization, (2) a well-specified reward, (3) enough exploration diversity (GRPO ensures this by sampling G rollouts per prompt), and (4) careful hyperparameter tuning. SFT is often the right choice when you have good demonstrations; RL is the right choice when you need to push beyond them.

SFTRLHF (learned RM)RLVR (verifier)
When to useGood demonstrations availableSubjective quality (style, helpfulness)Verifiable correctness (math, code)
CeilingTraining data qualityHuman preference qualityNone (can exceed human)
StabilityVery stableModerate (reward hacking)Stable (exact reward)
Data costHigh (demos)High (preference labels)Low (problems + answers)
Reward hackingN/AYes (proxy exploitation)Minimal (verifier is exact)
ExampleAlpaca, LIMAInstructGPT, Claude RLHFDeepSeek-R1, Qwen3-Thinking
Why does RL for LMs typically require starting from a strong pretrained model, rather than training RL from scratch (random initialization)?

Chapter 9: Connections & the RL Frontier

You've reached the end of the CS336 series — 17 lectures from character-level tokenization to the frontier of RL for language models. This chapter connects everything, maps what's known vs open, and points to where the field is heading.

CS336 Series Capstone Cheat Sheet

TopicKey Formula / ConceptLecture
Tokenization (BPE)Merge most frequent pair iteratively; vocab size V controls granularityL1
Transformer architectureAttention: softmax(QKT/√d)V; residual + LayerNorm each blockL2–L3
Training objectiveCross-entropy: −∑ log πθ(xt|x<t); minimize NLL over tokensL4–L6
Scaling lawsL = C/Dα + C/Nβ; Chinchilla: N ≈ D optimallyL7
Efficient attentionFlashAttention: fuse softmax + matmul, O(N) HBM reads; KV cache at inferenceL8–L9
Distributed trainingDP × TP × PP = 3D parallelism; ZeRO shards optimizer stateL10–L12
Inference optimizationSpeculative decoding, quantization, continuous batchingL13–L14
SFT + RLHFDPO loss: −logσ(β(logπ/πref|yw − logπ/πref|yl))L15
RLVR + GRPOAi = (ri−μ)/σ; verifier reward; no learned RM neededL16
Policy gradient theorem∇J = E[R · ∇logπ]; baselines unbiased; PPO clip stabilizesL17

Open frontiers in RL for LMs

Process vs outcome rewards. Current RLVR uses outcome rewards (did the final answer match?). Process rewards reward each logical step. This would solve the credit assignment problem — but requires step-level verification, which is hard to automate. OpenAI's PRM800K and DeepMind's work on process supervision point in this direction.

RL from AI feedback (RLAIF). Instead of human preference labels, use a stronger AI to judge response quality. Constitutional AI (Anthropic) and RLAIF (Google) showed this can approach human-labeled RLHF quality at much lower cost. Combined with RLVR for verifiable tasks and RLAIF for subjective tasks, this suggests a path toward continuous self-improvement.

Multi-turn and agentic RL. Current RL training treats generation as a single-turn bandit. Real-world tasks (coding agents, research assistants) are multi-turn: the model must plan, execute, observe, and revise over many steps. Standard REINFORCE and GRPO don't naturally extend here — you need credit assignment across turns, memory over interaction history, and rewards that evaluate end-to-end task completion.

Limits of RLVR. RLVR is powerful but domain-limited. Math and code have exact verifiers. Most of the knowledge and capabilities we want from LMs — nuanced explanation, creative synthesis, cross-domain reasoning — don't have exact verifiers. The question of how to extend RL training to these domains without learned reward models (which can be hacked) or human labels (which are expensive) is a major open problem.

Percy's closing summary from Lec 17. "Reinforcement learning is the key to surpassing human abilities — if you can measure it, you can optimize it. The policy gradient framework is conceptually clear; you just need baselines to reduce variance. But RL systems are much more complex than pretraining (inference workloads, managing multiple models). The hard part is not the math — it's the engineering."

What comes next in CS336

Two guest lectures close the series: Junyang Lin (Qwen team, Alibaba) on Qwen 3 — the culmination of iterative RLVR, thinking-mode fusion, and efficient alignment at scale. Mike Lewis (Meta FAIR) on Llama — large-scale pretraining, open-weights philosophy, and the role of open models in the ecosystem. Both connect directly to everything derived in this lesson.

Related lessons in this library

← CS336 Lec 15: SFT & RLHF ← CS336 Lec 16: RLVR & GRPO ↻ CS336 Lec 1: Start Over from Zero Policy Gradients (deep dive) RL Algorithms Survey Actor-Critic Methods Reward & Alignment
"What I cannot create, I do not understand."
— Richard Feynman. The site motto. Earning it took 17 lectures.