The foundations and frontier of RL for language models — from first principles. Lectures 15 & 16 gave you PPO, DPO, and GRPO. This lesson goes deeper: why does the policy gradient theorem hold? What is an LM generation as an RL trajectory? Why do baselines not bias the gradient? What does the PPO clip actually do geometrically? And where is RL-for-LMs headed next? Full derivations, five live canvases, and a CS336 series capstone.
You have a language model. You ask it: "What is 3x + 7 = 22?" It writes a response, token by token. At the end you check: is the answer correct? If yes, you want more of that. If no, you want less. How do you actually train toward that objective?
The first step is recognizing that generation is a sequential decision process — the kind reinforcement learning was built to handle. But RL is usually explained with robots and game controllers, not text generators. Let's pin down exactly what the RL framing means for an LM.
Three LM-specific features make this RL problem unusual. First, the transition is deterministic: given state s and action a (a token), the next state s′ = concat(s, a). No environment stochasticity. Second, rewards are sparse and terminal: the reward is 0 at every step except the last, where it reflects the quality of the entire response. Third, the state space is vast but structured — unlike a robot in physical space, every "state" is just a string of tokens, and the model can compute log-probabilities exactly.
The objective is to find policy parameters θ that maximize the expected reward: J(θ) = E(s,a) ∼ πθ[R(s, a)], where the expectation is over both sampled prompts s and sampled response tokens a. Everything in Lectures 16 and 17 is about how to compute ∇θJ efficiently.
Watch a language model generate a response token by token. Each step is a state-action pair. The reward appears only at the end. Click Play or Step to advance.
We want to compute ∇θJ(θ) — the gradient of expected reward with respect to policy parameters. The challenge: we can't differentiate through a discrete sample. The solution is the log-derivative trick, also called the score function estimator or REINFORCE identity.
For notational simplicity, let a denote the entire response (treating the sequence as one compound action — valid for the bandit / terminal-reward setting CS336 uses):
Take the gradient with respect to θ. p(s) doesn't depend on θ, and R(s, a) is a scalar reward that doesn't depend on θ (the verifier is fixed). So the gradient hits only πθ(a|s):
Now apply the log-derivative identity: ∇ p = p · ∇ log p (just the chain rule — d/dθ log f = (1/f)(df/dθ) rearranged). Multiply and divide by πθ(a|s):
For an autoregressive LM, the full-sequence log-probability decomposes as a sum over tokens: log πθ(a|s) = ∑t=1T log πθ(at|s, a<t). This sum is fully differentiable — it's exactly the same computation as the SFT cross-entropy loss (with sign flipped and weighted by R).
So the REINFORCE update is: for each token at in the response, compute the cross-entropy gradient at that position, and scale it by R(s, a). High-reward responses push their token probabilities up. Zero-reward responses produce no gradient. This is "reward-weighted SFT" — and it's the foundation of every algorithm in this CS336 alignment series.
python import torch def policy_gradient_loss(model, prompt_ids, response_ids, reward): # reward: scalar (e.g. 1.0 if correct, 0.0 if wrong) # response_ids: (T,) token ids for the response # Forward pass over prompt + response input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0) logits = model(input_ids).logits # (1, L, vocab) # Log-probs at each response token position prompt_len = prompt_ids.shape[0] resp_logits = logits[0, prompt_len-1:-1, :] # shifted: predict response tokens log_probs = torch.log_softmax(resp_logits, dim=-1) token_log_probs = log_probs.gather(1, response_ids.unsqueeze(1)).squeeze(1) # REINFORCE loss: -R * sum_t log π(a_t | s, a_<t) # Negative because PyTorch minimizes, we want to maximize J seq_log_prob = token_log_probs.sum() loss = -reward * seq_log_prob return loss
The policy gradient theorem gives us an unbiased gradient estimator. We just sample (s, a) ~ π, compute R(s, a) · ∇ log π(a|s), and average over a batch. This is REINFORCE (Williams, 1992). It works — in theory. In practice, the variance is often catastrophically high.
For verifiable rewards in math (R ∈ {0, 1}), consider a problem where the model has a 20% chance of being correct. In a batch of 8 rollouts, you might get 0 correct, producing a zero gradient for the entire batch. Or you might get 1 correct, producing a large gradient from a single noisy sample. The signal-to-noise ratio is terrible.
More precisely: the variance of the REINFORCE estimator scales with E[R2] − (E[R])2. When rewards are sparse (most zero), E[R] is small. But when one reward fires, R2 = R = 1, so the variance term is dominated by the first moment. You need many samples to get a stable gradient estimate.
Percy's example from Lecture 17 makes the problem concrete. Two states:
In s1, a1 is better (11 > 9). In s2, a2 is better (2 > 0). But naive REINFORCE sees reward 9 for (s1, a2) — much higher than reward 2 for (s2, a2). Without a baseline, it pushes up a2 in s1 strongly (R=9), even though a1 is better there. The absolute reward magnitude swamps the relative signal.
Exact Python: torch.std(torch.tensor([11., 9, 0, 2])) = 4.743. The gradient weights are all positive and very different in scale. Convergence requires averaging out this noise.
Each dot is an independent gradient estimate from one rollout. Adjust N (samples) and reward sparsity to see how gradient variance changes. The spread of dots shows variance; the center is the true gradient direction.
The variance problem is solved by baseline subtraction. Instead of using the raw reward R(s, a) as the gradient weight, we use R(s, a) − b(s), where b(s) is any function of the state (not the action). The key theorem: this does not bias the gradient.
We need to show that Ea∼π[b(s) · ∇ log π(a|s)] = 0 for any b(s). Expand the expectation:
The sum ∑a π(a|s) = 1 for all θ, so its gradient is zero. Therefore subtracting any b(s) from R doesn't change the expected gradient. It can only reduce variance.
The best practical choice is b(s) = V(s) = Ea∼π[R(s, a)] — the expected reward from state s. Then the gradient weight becomes:
This is the advantage function: how much better is action a compared to the average action from state s? Returning to the two-state example with b(s1) = 10, b(s2) = 1:
Variance dropped from 4.74 to 1.15 — a 4× reduction. And now the gradient signal is correct: a1 in s1 gets +1 (it's above average), a2 in s1 gets −1 (below average). Same for s2. The relative advantage within each state is now what drives the update.
Two actions at two states. Arrows show gradient direction (where the policy is being pushed). No baseline: raw reward magnitudes. With baseline: advantage-weighted. Watch how the baseline reveals the correct relative ranking within each state.
The advantage A(s, a) = R(s, a) − V(s) is conceptually clean, but requires a good estimate of V(s). In the LM bandit setting (one terminal reward R at the end of the sequence), V(s) is just the expected reward from the current prompt-response prefix. But how do you estimate V(s) in practice?
Two extreme approaches. Monte Carlo: use the actual sampled reward R as the estimate of Q(s, a), and subtract a separate estimate of V(s). This is unbiased (given infinite samples), but high-variance. TD(0): use the one-step TD target rt + γ V(st+1) as the Q estimate. This is low-variance (bootstrapping from a learned V), but biased if V is wrong.
Generalized Advantage Estimation (GAE) by Schulman et al. (2015) interpolates between these extremes with a parameter λ ∈ [0, 1]. First define the TD residual (a one-step advantage estimate):
Then the GAE advantage is an exponentially-weighted sum of TD residuals:
When λ = 0: At = δt (TD(0) — pure bootstrapping, low variance, high bias). When λ = 1: At = ∑k γk rt+k − V(st) (Monte Carlo return minus baseline, high variance, low bias). In practice λ = 0.95–0.99.
In CS336's setting, there is only one reward R at the end (rt = 0 for t < T, rT = R). The transition is deterministic (no stochastic environment). With γ = 1 and the terminal reward:
Where s0 is the initial prompt state and V(s0) = E[R | prompt]. This is exactly what GRPO computes as the group mean μ — a Monte Carlo estimate of V(s) from G parallel rollouts. GAE with λ=1, γ=1, and V approximated by the group mean is GRPO's advantage.
| Method | λ | Baseline | Bias | Variance |
|---|---|---|---|---|
| Vanilla REINFORCE | 1.0 | None (b=0) | None | Very high |
| REINFORCE + mean | 1.0 | Mean reward | None | Moderate |
| GRPO | 1.0 | Group mean μ | Small (σ norm) | Low-moderate |
| PPO (GAE) | 0.95 | Learned Vφ(s) | Low | Low |
| TD(0) | 0.0 | Bootstrapped V | High | Very low |
python import torch def compute_gae(rewards, values, gamma=1.0, lam=0.95): # rewards: (T,) — per-step rewards (usually 0 except last) # values: (T+1,) — V(s_t) estimates, V(s_T+1) = 0 (terminal) T = len(rewards) advantages = torch.zeros(T) gae = 0.0 for t in reversed(range(T)): delta = rewards[t] + gamma * values[t+1] - values[t] gae = delta + gamma * lam * gae advantages[t] = gae returns = advantages + values[:-1] # targets for value function update return advantages, returns # LM bandit special case (gamma=lambda=1, all rewards at terminal step) def grpo_advantage(group_rewards): # group_rewards: (G,) tensor of rewards from one prompt mu = group_rewards.mean() sigma = group_rewards.std() + 1e-4 return (group_rewards - mu) / sigma
We have advantages. Now we update the policy. The simplest approach: gradient ascent on E[A · ∇ log π(a|s)]. But this can cause catastrophically large updates — if the policy changes too much in one step, the rollout data becomes stale (generated by a very different old policy), and the gradient estimate is no longer valid.
When we collect rollouts from policy πold and update policy πθ, the rollout distribution is off-policy. We can correct for this with importance sampling. The importance ratio measures how much the new policy weights a response relative to the old one:
For a full response, this is a product of per-token ratios: r(θ) = ∏t [πθ(at|s, a<t) / πold(at|s, a<t)]. In practice computed as exp(logπnew − logπold) for numerical stability.
The unclipped surrogate objective is: LIS(θ) = E[r(θ) · A]. If r(θ) is close to 1 (policy hasn't changed much), this recovers the standard policy gradient. But r(θ) can grow large if πθ diverges from πold — and then the objective becomes unstable.
Proximal Policy Optimization (PPO) clips the ratio to stay near 1. The clipped objective for a single sample is:
Where ε is typically 0.2 (or 0.01 for GRPO per the CS336 code). Work through the two cases:
Setting: ε = 0.2. One response. Advantage A = +0.5 (good response). Current ratio r = 1.35 (new policy increased this response's probability by 35%).
The clip activated and reduced the objective — preventing us from extracting too much gradient from this response. Now same example with A = −0.5 (bad response, r = 0.65):
For negative advantages, the min picks the more negative (pessimistic) value — the clipped version. This enforces a lower bound on how much we suppress the bad response in one step.
Drag the advantage slider (positive or negative) and ε to see the clip curve. The flat region shows where the clip activates and the gradient goes to zero — the policy will not update further in that direction.
Every policy gradient algorithm must answer a fundamental question: when you collect a batch of rollouts and perform multiple gradient steps on them, at what point do the rollouts become "too old" to use? This is the on-policy vs off-policy distinction.
A purely on-policy algorithm uses only rollouts from the current policy πθ — generated moments before the update. It discards them after a single gradient step. REINFORCE is perfectly on-policy. This is safe (the gradient estimate is exactly correct) but computationally wasteful: LM inference is expensive, and throwing away rollouts after one step is painful.
An off-policy algorithm reuses rollouts collected from an older policy πold. The bias introduced by this is corrected (in expectation) by the importance ratio r(θ) = πθ/πold. As long as πθ stays close to πold, the correction is accurate. If the policies diverge too much, the importance weights become extreme (many values near 0 or infinity), and the correction fails.
GRPO and PPO take a middle ground: collect G rollouts, store them as πold, then take K gradient steps (called "inner steps" or "PPO epochs") using the clipped ratio objective. Typically K = 1–4. After K steps, re-collect fresh rollouts. The clip on r(θ) ensures the policy doesn't drift too far during the K steps — so the off-policy correction remains valid.
There is a deeper challenge specific to LMs: a response of T=500 tokens gets a single reward R at the end. How do we assign credit to individual tokens? Token 47 might have been a key logical step that led to the correct answer — but it gets the same weight in the loss as token 48 (filler word). This credit assignment problem is partially addressed by per-token GAE (which PPO uses), but GRPO assigns the same advantage to every token in the response.
| Property | On-Policy (pure REINFORCE) | Off-Policy + PPO clip | GRPO (K=1) |
|---|---|---|---|
| Rollout freshness | Every step | Every K steps | Every step (K=1) |
| Gradient bias | None | Small (clipped) | Small (σ norm) |
| Sample efficiency | Low (discard after 1 step) | Higher (reuse K times) | Medium |
| Implementation | Simple | Complex (track old policy) | Moderate |
| Credit assignment | Sequence-level | Per-token (GAE) | Sequence-level |
This is the interactive payoff. You're running a miniature GRPO training loop on a sorting task — exactly the task in Percy's Lecture 17 code. The model receives a 3-number sequence and must output the sorted version. Watch gradient descent happen in real time: see the reward signal, the advantage computation, and how the loss and accuracy evolve.
Control the delta mode (how advantages are computed) and loss mode. Press Play to run training. Watch reward and loss curves evolve. Each "step" is one gradient update on a batch of G rollouts per prompt.
The CS336 series has now covered both SFT (Lec 15) and RL (Lecs 16–17). A natural question: when do you need RL, and when is SFT sufficient? Percy's Lecture 17 summary gives the key insight: "Reinforcement learning is the key to surpassing human abilities — if you can measure it, you can optimize it."
SFT (Supervised Fine-Tuning) trains the model to imitate a fixed dataset of demonstrations. It is simple, stable, and highly effective — LIMA showed that 1,000 high-quality examples can produce an instruction-following model competitive with RLHF. But SFT is fundamentally bounded by the quality of its demonstrations. If the training data never contains a chain-of-thought that solves a competition math problem correctly, SFT cannot exceed that ceiling. SFT imitates; it cannot exceed its training distribution.
RL can exceed the demonstration ceiling — but only when the reward signal is well-specified. The policy explores, tries things the dataset never contained, and gets rewarded for anything that works. For verifiable tasks (math, code, formal proofs), the reward is exact and exploration is productive. For open-ended tasks (writing quality, helpfulness, creativity), the reward is harder to specify, and RL can lead to reward hacking or mode collapse toward whatever the reward model values.
RL for LMs faces a severe exploration problem. For a 500-token response, the space of possible responses is |V|500 — astronomically large. Most random explorations are wrong. Unlike a robot that can take small steps in a continuous space, an LM makes discrete token choices where one wrong step can cascade into a completely wrong response. This is why RL for LMs works much better starting from a strong pretrained model — the initialization already places the policy in a productive region of the space.
In Percy's lecture code, the sorting task with 3 numbers has a clear optimal policy: always output sorted(prompt). SFT with optimal demonstrations would converge immediately. But GRPO starting from random initialization must explore — and the lecture shows it can get stuck in local optima (e.g., always outputting [0,1,2] regardless of input) because some responses get reward > 0 and the policy concentrates on them.
The lesson: RL is a powerful tool, but it requires (1) a good initialization, (2) a well-specified reward, (3) enough exploration diversity (GRPO ensures this by sampling G rollouts per prompt), and (4) careful hyperparameter tuning. SFT is often the right choice when you have good demonstrations; RL is the right choice when you need to push beyond them.
| SFT | RLHF (learned RM) | RLVR (verifier) | |
|---|---|---|---|
| When to use | Good demonstrations available | Subjective quality (style, helpfulness) | Verifiable correctness (math, code) |
| Ceiling | Training data quality | Human preference quality | None (can exceed human) |
| Stability | Very stable | Moderate (reward hacking) | Stable (exact reward) |
| Data cost | High (demos) | High (preference labels) | Low (problems + answers) |
| Reward hacking | N/A | Yes (proxy exploitation) | Minimal (verifier is exact) |
| Example | Alpaca, LIMA | InstructGPT, Claude RLHF | DeepSeek-R1, Qwen3-Thinking |
You've reached the end of the CS336 series — 17 lectures from character-level tokenization to the frontier of RL for language models. This chapter connects everything, maps what's known vs open, and points to where the field is heading.
| Topic | Key Formula / Concept | Lecture |
|---|---|---|
| Tokenization (BPE) | Merge most frequent pair iteratively; vocab size V controls granularity | L1 |
| Transformer architecture | Attention: softmax(QKT/√d)V; residual + LayerNorm each block | L2–L3 |
| Training objective | Cross-entropy: −∑ log πθ(xt|x<t); minimize NLL over tokens | L4–L6 |
| Scaling laws | L = C/Dα + C/Nβ; Chinchilla: N ≈ D optimally | L7 |
| Efficient attention | FlashAttention: fuse softmax + matmul, O(N) HBM reads; KV cache at inference | L8–L9 |
| Distributed training | DP × TP × PP = 3D parallelism; ZeRO shards optimizer state | L10–L12 |
| Inference optimization | Speculative decoding, quantization, continuous batching | L13–L14 |
| SFT + RLHF | DPO loss: −logσ(β(logπ/πref|yw − logπ/πref|yl)) | L15 |
| RLVR + GRPO | Ai = (ri−μ)/σ; verifier reward; no learned RM needed | L16 |
| Policy gradient theorem | ∇J = E[R · ∇logπ]; baselines unbiased; PPO clip stabilizes | L17 |
Process vs outcome rewards. Current RLVR uses outcome rewards (did the final answer match?). Process rewards reward each logical step. This would solve the credit assignment problem — but requires step-level verification, which is hard to automate. OpenAI's PRM800K and DeepMind's work on process supervision point in this direction.
RL from AI feedback (RLAIF). Instead of human preference labels, use a stronger AI to judge response quality. Constitutional AI (Anthropic) and RLAIF (Google) showed this can approach human-labeled RLHF quality at much lower cost. Combined with RLVR for verifiable tasks and RLAIF for subjective tasks, this suggests a path toward continuous self-improvement.
Multi-turn and agentic RL. Current RL training treats generation as a single-turn bandit. Real-world tasks (coding agents, research assistants) are multi-turn: the model must plan, execute, observe, and revise over many steps. Standard REINFORCE and GRPO don't naturally extend here — you need credit assignment across turns, memory over interaction history, and rewards that evaluate end-to-end task completion.
Limits of RLVR. RLVR is powerful but domain-limited. Math and code have exact verifiers. Most of the knowledge and capabilities we want from LMs — nuanced explanation, creative synthesis, cross-domain reasoning — don't have exact verifiers. The question of how to extend RL training to these domains without learned reward models (which can be hacked) or human labels (which are expensive) is a major open problem.
Two guest lectures close the series: Junyang Lin (Qwen team, Alibaba) on Qwen 3 — the culmination of iterative RLVR, thinking-mode fusion, and efficient alignment at scale. Mike Lewis (Meta FAIR) on Llama — large-scale pretraining, open-weights philosophy, and the role of open models in the ecosystem. Both connect directly to everything derived in this lesson.
"What I cannot create, I do not understand."
— Richard Feynman. The site motto. Earning it took 17 lectures.