Alignment III: Policy Gradients & the RL Frontier

Chapter 0: An LM Generation Is a Trajectory

You have a language model. You ask it: "What is 3x + 7 = 22?" It writes a response, token by token. At the end you check: is the answer correct? If yes, you want more of that. If no, you want less. How do you actually train toward that objective?

The first step is recognizing that generation is a sequential decision process — the kind reinforcement learning was built to handle. But RL is usually explained with robots and game controllers, not text generators. Let's pin down exactly what the RL framing means for an LM.

The mapping. State s = the prompt plus all response tokens generated so far. Action a = generate the next token (a choice from the vocabulary). Trajectory = the full generation: s₀ → a₁ → s₁ → a₂ → … → a_T. Reward R = a single scalar at the end (did the response satisfy the user or pass the verifier?). Policy π_θ(a|s) = the LM itself — a probability distribution over next tokens given context.

Three LM-specific features make this RL problem unusual. First, the transition is deterministic: given state s and action a (a token), the next state s′ = concat(s, a). No environment stochasticity. Second, rewards are sparse and terminal: the reward is 0 at every step except the last, where it reflects the quality of the entire response. Third, the state space is vast but structured — unlike a robot in physical space, every "state" is just a string of tokens, and the model can compute log-probabilities exactly.

s₀ = prompt

"Sort the list [3, 1, 2]. Respond with the sorted list."

↓ action a₁ = token "["

s₁ = prompt + "["

State grows by one token.

↓ action a₂ = token "1"

s₂ = prompt + "[1"

Deterministic transition: concatenate.

↓ … T tokens later …

s_T = full response

R = 1 if "[1, 2, 3]" is correct, else 0.

The objective is to find policy parameters θ that maximize the expected reward: J(θ) = E_{(s,a) ∼ π_θ}[R(s, a)], where the expectation is over both sampled prompts s and sampled response tokens a. Everything in Lectures 16 and 17 is about how to compute ∇_θJ efficiently.

Misconception: "RL for LMs is just supervised learning with reweighting." Partially true at one update step, but the distribution changes. SFT trains on a fixed dataset. REINFORCE trains on samples from the current policy — as the policy improves, its sample distribution shifts, exposing new states. This non-stationarity is what makes RL harder than SFT, and why on-policy freshness matters.

LM generation as a trajectory — animate a token-by-token rollout

Watch a language model generate a response token by token. Each step is a state-action pair. The reward appears only at the end. Click Play or Step to advance.

In the RL framing of LM generation, what is the "action" at each step?

The entire response generated by the model (all tokens at once). A single token chosen from the vocabulary — the model's next-token prediction sampled from the softmax distribution. The reward signal returned by the verifier at the end of generation. The hidden state vector inside the transformer at the current position.

Chapter 1: The Policy Gradient Theorem

We want to compute ∇_θJ(θ) — the gradient of expected reward with respect to policy parameters. The challenge: we can't differentiate through a discrete sample. The solution is the log-derivative trick, also called the score function estimator or REINFORCE identity.

The derivation — three lines

For notational simplicity, let a denote the entire response (treating the sequence as one compound action — valid for the bandit / terminal-reward setting CS336 uses):

J(θ) = E_{s∼D, a∼π_θ(·|s)}[R(s, a)] = ∑_s p(s) ∑_a π_θ(a|s) R(s, a)

Take the gradient with respect to θ. p(s) doesn't depend on θ, and R(s, a) is a scalar reward that doesn't depend on θ (the verifier is fixed). So the gradient hits only π_θ(a|s):

∇_θ J = ∑_s p(s) ∑_a R(s, a) · ∇_θ π_θ(a|s)

Now apply the log-derivative identity: ∇ p = p · ∇ log p (just the chain rule — d/dθ log f = (1/f)(df/dθ) rearranged). Multiply and divide by π_θ(a|s):

∇_θ J = ∑_s p(s) ∑_a π_θ(a|s) · R(s, a) · ∇_θ log π_θ(a|s)

= E_{s∼D, a∼π_θ} [ R(s, a) · ∇_θ log π_θ(a|s) ]

Why this is powerful. The gradient is now an expectation over the policy. We can estimate it by sampling: generate a response a from the current policy, compute R(a), multiply by ∇ log π_θ(a|s), and average over a batch. R(a) is never differentiated — it's just a scalar weight. This lets us use any reward function, even a non-differentiable verifier or a human rater.

What ∇ log π_θ(a|s) means in practice

For an autoregressive LM, the full-sequence log-probability decomposes as a sum over tokens: log π_θ(a|s) = ∑_t=1^T log π_θ(a_t|s, a_<t). This sum is fully differentiable — it's exactly the same computation as the SFT cross-entropy loss (with sign flipped and weighted by R).

∇_θ log π_θ(a|s) = ∑_t=1^T ∇_θ log π_θ(a_t | s, a_<t)

So the REINFORCE update is: for each token a_t in the response, compute the cross-entropy gradient at that position, and scale it by R(s, a). High-reward responses push their token probabilities up. Zero-reward responses produce no gradient. This is "reward-weighted SFT" — and it's the foundation of every algorithm in this CS336 alignment series.

python
import torch

def policy_gradient_loss(model, prompt_ids, response_ids, reward):
    # reward: scalar (e.g. 1.0 if correct, 0.0 if wrong)
    # response_ids: (T,) token ids for the response

    # Forward pass over prompt + response
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits  # (1, L, vocab)

    # Log-probs at each response token position
    prompt_len = prompt_ids.shape[0]
    resp_logits = logits[0, prompt_len-1:-1, :]  # shifted: predict response tokens
    log_probs = torch.log_softmax(resp_logits, dim=-1)
    token_log_probs = log_probs.gather(1, response_ids.unsqueeze(1)).squeeze(1)

    # REINFORCE loss: -R * sum_t log π(a_t | s, a_<t)
    # Negative because PyTorch minimizes, we want to maximize J
    seq_log_prob = token_log_probs.sum()
    loss = -reward * seq_log_prob
    return loss

The policy gradient theorem gives ∇J = E[R(a) · ∇ log π(a|s)]. Why doesn't the gradient need to pass through R(a) — even if R involves a non-differentiable verifier?

Because R(a) is always 0 or 1, and gradients of constants are zero, so no differentiation is needed. Because we approximate R with a differentiable reward model before computing the gradient. The log-derivative trick moved the gradient so it attaches to log π_θ(a|s), which IS differentiable. R(a) is just a fixed scalar weight on that gradient — sampled before the update, not differentiated. This is why RL can train against any reward function, including exact verifiers. We use a straight-through estimator to treat the discrete sample as continuous during backprop.

Chapter 2: REINFORCE & Its Variance Problem

The policy gradient theorem gives us an unbiased gradient estimator. We just sample (s, a) ~ π, compute R(s, a) · ∇ log π(a|s), and average over a batch. This is REINFORCE (Williams, 1992). It works — in theory. In practice, the variance is often catastrophically high.

Why variance is catastrophically high

For verifiable rewards in math (R ∈ {0, 1}), consider a problem where the model has a 20% chance of being correct. In a batch of 8 rollouts, you might get 0 correct, producing a zero gradient for the entire batch. Or you might get 1 correct, producing a large gradient from a single noisy sample. The signal-to-noise ratio is terrible.

More precisely: the variance of the REINFORCE estimator scales with E[R²] − (E[R])². When rewards are sparse (most zero), E[R] is small. But when one reward fires, R² = R = 1, so the variance term is dominated by the first moment. You need many samples to get a stable gradient estimate.

Misconception: "More rollouts always fix variance." More rollouts reduce variance by 1/N (standard error). But if the reward is very sparse — say, 1% correctness — you need hundreds of rollouts per prompt just to get one positive sample. This is computationally prohibitive at LM scale. Variance reduction techniques (baselines, advantages) are more efficient than brute-force sampling.

The two-state worked example

Percy's example from Lecture 17 makes the problem concrete. Two states:

s₁: action a₁ → reward 11, action a₂ → reward 9.
s₂: action a₁ → reward 0, action a₂ → reward 2.

In s₁, a₁ is better (11 > 9). In s₂, a₂ is better (2 > 0). But naive REINFORCE sees reward 9 for (s₁, a₂) — much higher than reward 2 for (s₂, a₂). Without a baseline, it pushes up a₂ in s₁ strongly (R=9), even though a₁ is better there. The absolute reward magnitude swamps the relative signal.

Naive variance: std([11, 9, 0, 2]) = 4.74

Exact Python: torch.std(torch.tensor([11., 9, 0, 2])) = 4.743. The gradient weights are all positive and very different in scale. Convergence requires averaging out this noise.

REINFORCE variance — gradient spread over N samples

Each dot is an independent gradient estimate from one rollout. Adjust N (samples) and reward sparsity to see how gradient variance changes. The spread of dots shows variance; the center is the true gradient direction.

Samples N 4

Reward sparsity (0=all wrong, 1=all right) 0.30

In the two-state example, naive REINFORCE pushes up action a₂ in state s₁ strongly (reward = 9), even though a₁ is better. Why does this happen, and what does it reveal about REINFORCE?

REINFORCE uses the wrong formula — it should use Q(s,a) not R(s,a). REINFORCE uses absolute reward magnitudes as gradient weights. It has no way to tell "reward 9 in s1 is bad" from "reward 2 in s2 is good" — it only sees the raw values. A baseline that captures per-state expectations would reveal the relative advantage of each action. REINFORCE samples too few rollouts, so the gradient estimate is noisy but unbiased. The log-derivative trick introduces a bias when rewards are large integers.

Chapter 3: Baselines & Advantage

The variance problem is solved by baseline subtraction. Instead of using the raw reward R(s, a) as the gradient weight, we use R(s, a) − b(s), where b(s) is any function of the state (not the action). The key theorem: this does not bias the gradient.

Why baselines don't introduce bias — the proof

We need to show that E_a∼π[b(s) · ∇ log π(a|s)] = 0 for any b(s). Expand the expectation:

E_a∼π[b(s) · ∇ log π(a|s)] = b(s) · ∑_a π(a|s) · ∇ log π(a|s)

= b(s) · ∑_a π(a|s) · (∇ π(a|s) / π(a|s))

= b(s) · ∑_a ∇ π(a|s) = b(s) · ∇ ∑_a π(a|s)

= b(s) · ∇ 1 = b(s) · 0 = 0

The sum ∑_a π(a|s) = 1 for all θ, so its gradient is zero. Therefore subtracting any b(s) from R doesn't change the expected gradient. It can only reduce variance.

Proven. Any state-dependent baseline b(s) is valid. The optimal baseline that minimizes variance is b*(s) = E[(∥∇ log π(a|s)∥²) R(s,a) | s] / E[∥∇ log π(a|s)∥² | s] — but this requires computing the gradient before we know the optimal b, so in practice we use simpler choices.

The best practical baseline: the value function

The best practical choice is b(s) = V(s) = E_a∼π[R(s, a)] — the expected reward from state s. Then the gradient weight becomes:

R(s, a) − V(s) = Q(s, a) − V(s) ≡ A(s, a)

This is the advantage function: how much better is action a compared to the average action from state s? Returning to the two-state example with b(s₁) = 10, b(s₂) = 1:

Advantaged rewards: [11−10, 9−10, 0−1, 2−1] = [1, −1, −1, 1]

Baselined variance: std([1, −1, −1, 1]) = 1.15

Variance dropped from 4.74 to 1.15 — a 4× reduction. And now the gradient signal is correct: a₁ in s₁ gets +1 (it's above average), a₂ in s₁ gets −1 (below average). Same for s₂. The relative advantage within each state is now what drives the update.

Advantage is relative, not absolute. A reward of 9 in state s₁ where the average is 10 is BAD. A reward of 2 in state s₂ where the average is 1 is GOOD. The advantage function captures this relativity. Every modern RL algorithm — PPO, GRPO, actor-critic — uses advantage estimates instead of raw rewards.

Gradient push: REINFORCE vs baselined — a 2D policy toy

Two actions at two states. Arrows show gradient direction (where the policy is being pushed). No baseline: raw reward magnitudes. With baseline: advantage-weighted. Watch how the baseline reveals the correct relative ranking within each state.

We showed that E[b(s) · ∇ log π(a|s)] = 0 by using ∑_a ∇ π(a|s) = ∇ 1 = 0. What is the precise mathematical condition that b(s) must satisfy for this identity to hold?

b(s) must be a constant (not depend on the state s at all). b(s) must be bounded between 0 and 1 to ensure numerical stability. b must not depend on the action a — it can depend on the state s (and anything else that doesn't change with a), so it factors out of the sum over actions. The key step is pulling b(s) outside ∑_a π(a|s)(∇π/π). b(s) must equal the optimal value function V*(s) to ensure zero bias.

Chapter 4: Generalized Advantage Estimation (GAE)

The advantage A(s, a) = R(s, a) − V(s) is conceptually clean, but requires a good estimate of V(s). In the LM bandit setting (one terminal reward R at the end of the sequence), V(s) is just the expected reward from the current prompt-response prefix. But how do you estimate V(s) in practice?

The trade-off: bias vs variance

Two extreme approaches. Monte Carlo: use the actual sampled reward R as the estimate of Q(s, a), and subtract a separate estimate of V(s). This is unbiased (given infinite samples), but high-variance. TD(0): use the one-step TD target r_t + γ V(s_t+1) as the Q estimate. This is low-variance (bootstrapping from a learned V), but biased if V is wrong.

Generalized Advantage Estimation (GAE) by Schulman et al. (2015) interpolates between these extremes with a parameter λ ∈ [0, 1]. First define the TD residual (a one-step advantage estimate):

δ_t = r_t + γ V(s_t+1) − V(s_t)

Then the GAE advantage is an exponentially-weighted sum of TD residuals:

A_t^GAE(γ,λ) = ∑_k=0^∞ (γλ)^k δ_t+k

When λ = 0: A_t = δ_t (TD(0) — pure bootstrapping, low variance, high bias). When λ = 1: A_t = ∑_k γ^k r_t+k − V(s_t) (Monte Carlo return minus baseline, high variance, low bias). In practice λ = 0.95–0.99.

The LM bandit simplification

In CS336's setting, there is only one reward R at the end (r_t = 0 for t < T, r_T = R). The transition is deterministic (no stochastic environment). With γ = 1 and the terminal reward:

A_T^bandit = R − V(s₀)

Where s₀ is the initial prompt state and V(s₀) = E[R | prompt]. This is exactly what GRPO computes as the group mean μ — a Monte Carlo estimate of V(s) from G parallel rollouts. GAE with λ=1, γ=1, and V approximated by the group mean is GRPO's advantage.

GRPO is a special case of GAE. When γ = λ = 1 (no discounting, full Monte Carlo) and V(s) ≈ μ (group mean), GAE reduces to: A_i = r_i − μ. GRPO additionally normalizes by σ to control the gradient scale. The GRPO advantage A_i = (r_i − μ)/σ is GAE with group-mean baseline and scale normalization.

Method	λ	Baseline	Bias	Variance
Vanilla REINFORCE	1.0	None (b=0)	None	Very high
REINFORCE + mean	1.0	Mean reward	None	Moderate
GRPO	1.0	Group mean μ	Small (σ norm)	Low-moderate
PPO (GAE)	0.95	Learned V_φ(s)	Low	Low
TD(0)	0.0	Bootstrapped V	High	Very low

python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    # rewards: (T,) — per-step rewards (usually 0 except last)
    # values:  (T+1,) — V(s_t) estimates, V(s_T+1) = 0 (terminal)
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0

    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t+1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae

    returns = advantages + values[:-1]   # targets for value function update
    return advantages, returns

# LM bandit special case (gamma=lambda=1, all rewards at terminal step)
def grpo_advantage(group_rewards):
    # group_rewards: (G,) tensor of rewards from one prompt
    mu = group_rewards.mean()
    sigma = group_rewards.std() + 1e-4
    return (group_rewards - mu) / sigma

GAE has a parameter λ that interpolates between TD(0) (λ=0) and Monte Carlo (λ=1). For LM training with GRPO, why is using λ=1 (Monte Carlo) a natural choice?

Because λ=1 makes the math simpler, even though it introduces more bias. Because TD methods require dense rewards at each step, but GAE with λ=1 works with any reward schedule. In the LM bandit setting there's only a terminal reward — there are no intermediate rewards to bootstrap from. TD bootstrapping (which helps in dense-reward settings) offers no benefit here. Monte Carlo (λ=1) uses the actual terminal R directly, which is the correct signal. Also, with G group rollouts, you already have low-variance estimates via averaging — so high λ is fine. Because λ=1 ensures the advantage equals exactly zero when the policy is optimal.

Chapter 5: PPO Clip: Why It Works

We have advantages. Now we update the policy. The simplest approach: gradient ascent on E[A · ∇ log π(a|s)]. But this can cause catastrophically large updates — if the policy changes too much in one step, the rollout data becomes stale (generated by a very different old policy), and the gradient estimate is no longer valid.

The importance ratio

When we collect rollouts from policy π_old and update policy π_θ, the rollout distribution is off-policy. We can correct for this with importance sampling. The importance ratio measures how much the new policy weights a response relative to the old one:

r(θ) = π_θ(a|s) / π_old(a|s)

For a full response, this is a product of per-token ratios: r(θ) = ∏_t [π_θ(a_t|s, a_<t) / π_old(a_t|s, a_<t)]. In practice computed as exp(logπ_new − logπ_old) for numerical stability.

The unclipped surrogate objective is: L^IS(θ) = E[r(θ) · A]. If r(θ) is close to 1 (policy hasn't changed much), this recovers the standard policy gradient. But r(θ) can grow large if π_θ diverges from π_old — and then the objective becomes unstable.

The PPO clip

Proximal Policy Optimization (PPO) clips the ratio to stay near 1. The clipped objective for a single sample is:

L^clip(θ) = min( r(θ) · A, clip(r(θ), 1−ε, 1+ε) · A )

Where ε is typically 0.2 (or 0.01 for GRPO per the CS336 code). Work through the two cases:

A > 0 (good response): We want to increase π_θ(a|s) — i.e., r(θ) > 1. The min operator clips at r = 1+ε: once we've increased the probability by ε, stop. Prevents over-committing to one good rollout.
A < 0 (bad response): We want to decrease π_θ(a|s) — i.e., r(θ) < 1. The clip at r = 1−ε prevents us from decreasing too aggressively. The min with A<0 means the clipped (less negative) term is the pessimistic bound.

Trust region intuition. PPO is an approximation to Trust Region Policy Optimization (TRPO), which constrains the policy update to stay within a KL ball of the old policy. The clip is a simpler, differentiable surrogate: it says "don't let the ratio stray more than ε from 1." In practice, PPO with clipping achieves similar stability to TRPO at much lower computational cost.

Worked numerical example

Setting: ε = 0.2. One response. Advantage A = +0.5 (good response). Current ratio r = 1.35 (new policy increased this response's probability by 35%).

Unclipped: r · A = 1.35 × 0.5 = 0.675

Clipped: clip(1.35, 0.8, 1.2) × 0.5 = 1.2 × 0.5 = 0.60

L^clip = min(0.675, 0.60) = 0.60

The clip activated and reduced the objective — preventing us from extracting too much gradient from this response. Now same example with A = −0.5 (bad response, r = 0.65):

Unclipped: 0.65 × (−0.5) = −0.325

Clipped: clip(0.65, 0.8, 1.2) × (−0.5) = 0.8 × (−0.5) = −0.40

L^clip = min(−0.325, −0.40) = −0.40

For negative advantages, the min picks the more negative (pessimistic) value — the clipped version. This enforces a lower bound on how much we suppress the bad response in one step.

PPO clip curve — ratio r vs objective L^clip

Drag the advantage slider (positive or negative) and ε to see the clip curve. The flat region shows where the clip activates and the gradient goes to zero — the policy will not update further in that direction.

Advantage A 0.8

Clip ε 0.20

PPO clips the importance ratio r(θ) at [1−ε, 1+ε]. What does it mean geometrically when the clip is active and the gradient of L^clip with respect to θ is zero?

The policy has converged — the current advantage estimate is exactly zero. The KL divergence between the new and old policy has exceeded the allowed budget, so training stops. The policy has already moved ε away from the old policy in the direction of this response. The clip creates a flat region where further movement in the same direction produces no gradient — a built-in "don't overshoot" mechanism. The policy will continue updating on other responses that haven't yet hit the clip. The ratio r(θ) = 1 exactly, meaning new and old policy are identical — no update needed.

Chapter 6: On-Policy vs Off-Policy

Every policy gradient algorithm must answer a fundamental question: when you collect a batch of rollouts and perform multiple gradient steps on them, at what point do the rollouts become "too old" to use? This is the on-policy vs off-policy distinction.

On-policy: freshness at a cost

A purely on-policy algorithm uses only rollouts from the current policy π_θ — generated moments before the update. It discards them after a single gradient step. REINFORCE is perfectly on-policy. This is safe (the gradient estimate is exactly correct) but computationally wasteful: LM inference is expensive, and throwing away rollouts after one step is painful.

Off-policy correction: importance sampling

An off-policy algorithm reuses rollouts collected from an older policy π_old. The bias introduced by this is corrected (in expectation) by the importance ratio r(θ) = π_θ/π_old. As long as π_θ stays close to π_old, the correction is accurate. If the policies diverge too much, the importance weights become extreme (many values near 0 or infinity), and the correction fails.

GRPO and PPO take a middle ground: collect G rollouts, store them as π_old, then take K gradient steps (called "inner steps" or "PPO epochs") using the clipped ratio objective. Typically K = 1–4. After K steps, re-collect fresh rollouts. The clip on r(θ) ensures the policy doesn't drift too far during the K steps — so the off-policy correction remains valid.

Misconception: "More PPO inner steps is always better." More inner steps extract more signal from each rollout batch, but increase off-policy error. The clip bounds the ratio to [1−ε, 1+ε] per update, but K steps can compound: after K=10 steps, the effective drift could be large. DeepSeek-R1 uses K=1 (single step per rollout batch) precisely to avoid this. GRPO in the CS336 code uses K=1 as well.

The credit assignment problem

There is a deeper challenge specific to LMs: a response of T=500 tokens gets a single reward R at the end. How do we assign credit to individual tokens? Token 47 might have been a key logical step that led to the correct answer — but it gets the same weight in the loss as token 48 (filler word). This credit assignment problem is partially addressed by per-token GAE (which PPO uses), but GRPO assigns the same advantage to every token in the response.

Process rewards as a solution. One emerging approach: train a Process Reward Model (PRM) that gives intermediate rewards after each logical step in a chain of thought. This solves credit assignment by providing a dense reward signal — but requires labeled step-level data, which is expensive to collect and harder to verify automatically.

Property	On-Policy (pure REINFORCE)	Off-Policy + PPO clip	GRPO (K=1)
Rollout freshness	Every step	Every K steps	Every step (K=1)
Gradient bias	None	Small (clipped)	Small (σ norm)
Sample efficiency	Low (discard after 1 step)	Higher (reuse K times)	Medium
Implementation	Simple	Complex (track old policy)	Moderate
Credit assignment	Sequence-level	Per-token (GAE)	Sequence-level

In GRPO-style training, why do we freeze the "old policy" log-probabilities before computing the importance ratio, and what happens if we don't?

We freeze them to save memory — computing two forward passes at once would exceed GPU VRAM. The ratio r = π_θ/π_old must treat π_old as a constant (stop-gradient). If we differentiate through π_old too, the ratio π/π = 1 has zero gradient everywhere — the loss would be identically zero and the model would not update. As the CS336 code shows: with grad on both, w.grad = 0; with no_grad on old, w.grad is nonzero. We freeze old log-probs to prevent numerical overflow when the ratio grows large during training. Freezing old log-probs ensures the clip is exactly symmetric around r=1.

Chapter 7: Showcase: GRPO Training Lab

This is the interactive payoff. You're running a miniature GRPO training loop on a sorting task — exactly the task in Percy's Lecture 17 code. The model receives a 3-number sequence and must output the sorted version. Watch gradient descent happen in real time: see the reward signal, the advantage computation, and how the loss and accuracy evolve.

GRPO Training Loop — sorting task simulation

Control the delta mode (how advantages are computed) and loss mode. Press Play to run training. Watch reward and loss curves evolve. Each "step" is one gradient update on a batch of G rollouts per prompt.

Group size G 8

Delta mode

Learning rate 3

What to look for. With raw rewards (no baseline), training is noisy and often fails to converge — the model learns nothing or collapses to a degenerate solution. With centered rewards, convergence is smoother (suboptimal responses get negative signal). With GRPO normalization, the reward scale is consistent across prompts — this helps the most when prompts vary in difficulty.

RL is not easy. Percy's Lecture 17 notes this explicitly: "RL systems are much more complex than pretraining." Even on this toy sorting task, you can observe the model getting stuck in local optima, oscillating, or making no progress on hard prompts. The hyperparameters (learning rate, group size G, delta mode) all matter enormously. Real GRPO at LM scale adds distributed inference, memory management for old models, reward model serving, and KL penalty tuning on top of all this.

Chapter 8: RL vs SFT: When Each Helps

The CS336 series has now covered both SFT (Lec 15) and RL (Lecs 16–17). A natural question: when do you need RL, and when is SFT sufficient? Percy's Lecture 17 summary gives the key insight: "Reinforcement learning is the key to surpassing human abilities — if you can measure it, you can optimize it."

What SFT can and can't do

SFT (Supervised Fine-Tuning) trains the model to imitate a fixed dataset of demonstrations. It is simple, stable, and highly effective — LIMA showed that 1,000 high-quality examples can produce an instruction-following model competitive with RLHF. But SFT is fundamentally bounded by the quality of its demonstrations. If the training data never contains a chain-of-thought that solves a competition math problem correctly, SFT cannot exceed that ceiling. SFT imitates; it cannot exceed its training distribution.

RL can exceed the demonstration ceiling — but only when the reward signal is well-specified. The policy explores, tries things the dataset never contained, and gets rewarded for anything that works. For verifiable tasks (math, code, formal proofs), the reward is exact and exploration is productive. For open-ended tasks (writing quality, helpfulness, creativity), the reward is harder to specify, and RL can lead to reward hacking or mode collapse toward whatever the reward model values.

The key asymmetry. SFT needs good demonstrations — hard to collect at scale, but training is stable. RL needs a good reward signal — easy for verifiable tasks, hard for subjective ones. For math and code, the verifier is free and exact, so RL is clearly superior to SFT at pushing capability beyond the training distribution. For open-ended chat quality, SFT + RLHF (preference RM) is still the dominant approach.

The exploration problem

RL for LMs faces a severe exploration problem. For a 500-token response, the space of possible responses is |V|⁵⁰⁰ — astronomically large. Most random explorations are wrong. Unlike a robot that can take small steps in a continuous space, an LM makes discrete token choices where one wrong step can cascade into a completely wrong response. This is why RL for LMs works much better starting from a strong pretrained model — the initialization already places the policy in a productive region of the space.

Cold-start RL failure. Training RL from a randomly initialized LM almost never works — the exploration is too sparse to find any reward. DeepSeek-R1-Zero's key engineering decision was to start from the DeepSeek-V3 base model (already trained on trillions of tokens), where the policy already generates coherent text and occasionally solves math problems. The RL then sharpens and extends this existing capability.

Worked comparison: SFT vs GRPO on sorting

In Percy's lecture code, the sorting task with 3 numbers has a clear optimal policy: always output sorted(prompt). SFT with optimal demonstrations would converge immediately. But GRPO starting from random initialization must explore — and the lecture shows it can get stuck in local optima (e.g., always outputting [0,1,2] regardless of input) because some responses get reward > 0 and the policy concentrates on them.

The lesson: RL is a powerful tool, but it requires (1) a good initialization, (2) a well-specified reward, (3) enough exploration diversity (GRPO ensures this by sampling G rollouts per prompt), and (4) careful hyperparameter tuning. SFT is often the right choice when you have good demonstrations; RL is the right choice when you need to push beyond them.

	SFT	RLHF (learned RM)	RLVR (verifier)
When to use	Good demonstrations available	Subjective quality (style, helpfulness)	Verifiable correctness (math, code)
Ceiling	Training data quality	Human preference quality	None (can exceed human)
Stability	Very stable	Moderate (reward hacking)	Stable (exact reward)
Data cost	High (demos)	High (preference labels)	Low (problems + answers)
Reward hacking	N/A	Yes (proxy exploitation)	Minimal (verifier is exact)
Example	Alpaca, LIMA	InstructGPT, Claude RLHF	DeepSeek-R1, Qwen3-Thinking

Why does RL for LMs typically require starting from a strong pretrained model, rather than training RL from scratch (random initialization)?

Because RL requires a value function, and value functions can't be initialized from random weights. Because the PPO clip ratio r = π/π_old is undefined when π_old is random. The token response space is exponentially large (|V|^T). From a random policy, almost no responses earn any reward — the gradient signal is nearly always zero. A strong pretrained model already generates coherent text that occasionally solves tasks, giving RL a foothold. RL then sharpens and extends existing capabilities rather than building them from scratch. Because the verifier reward function can only be computed on grammatically correct responses, which a random model cannot produce.

Chapter 9: Connections & the RL Frontier

You've reached the end of the CS336 series — 17 lectures from character-level tokenization to the frontier of RL for language models. This chapter connects everything, maps what's known vs open, and points to where the field is heading.

CS336 Series Capstone Cheat Sheet

Topic	Key Formula / Concept	Lecture
Tokenization (BPE)	Merge most frequent pair iteratively; vocab size V controls granularity	L1
Transformer architecture	Attention: softmax(QK^T/√d)V; residual + LayerNorm each block	L2–L3
Training objective	Cross-entropy: −∑ log π_θ(x_t\|x_<t); minimize NLL over tokens	L4–L6
Scaling laws	L = C/D^α + C/N^β; Chinchilla: N ≈ D optimally	L7
Efficient attention	FlashAttention: fuse softmax + matmul, O(N) HBM reads; KV cache at inference	L8–L9
Distributed training	DP × TP × PP = 3D parallelism; ZeRO shards optimizer state	L10–L12
Inference optimization	Speculative decoding, quantization, continuous batching	L13–L14
SFT + RLHF	DPO loss: −logσ(β(logπ/π_ref\|y_w − logπ/π_ref\|y_l))	L15
RLVR + GRPO	A_i = (r_i−μ)/σ; verifier reward; no learned RM needed	L16
Policy gradient theorem	∇J = E[R · ∇logπ]; baselines unbiased; PPO clip stabilizes	L17

Open frontiers in RL for LMs

Process vs outcome rewards. Current RLVR uses outcome rewards (did the final answer match?). Process rewards reward each logical step. This would solve the credit assignment problem — but requires step-level verification, which is hard to automate. OpenAI's PRM800K and DeepMind's work on process supervision point in this direction.

RL from AI feedback (RLAIF). Instead of human preference labels, use a stronger AI to judge response quality. Constitutional AI (Anthropic) and RLAIF (Google) showed this can approach human-labeled RLHF quality at much lower cost. Combined with RLVR for verifiable tasks and RLAIF for subjective tasks, this suggests a path toward continuous self-improvement.

Multi-turn and agentic RL. Current RL training treats generation as a single-turn bandit. Real-world tasks (coding agents, research assistants) are multi-turn: the model must plan, execute, observe, and revise over many steps. Standard REINFORCE and GRPO don't naturally extend here — you need credit assignment across turns, memory over interaction history, and rewards that evaluate end-to-end task completion.

Limits of RLVR. RLVR is powerful but domain-limited. Math and code have exact verifiers. Most of the knowledge and capabilities we want from LMs — nuanced explanation, creative synthesis, cross-domain reasoning — don't have exact verifiers. The question of how to extend RL training to these domains without learned reward models (which can be hacked) or human labels (which are expensive) is a major open problem.

Percy's closing summary from Lec 17. "Reinforcement learning is the key to surpassing human abilities — if you can measure it, you can optimize it. The policy gradient framework is conceptually clear; you just need baselines to reduce variance. But RL systems are much more complex than pretraining (inference workloads, managing multiple models). The hard part is not the math — it's the engineering."

What comes next in CS336

Two guest lectures close the series: Junyang Lin (Qwen team, Alibaba) on Qwen 3 — the culmination of iterative RLVR, thinking-mode fusion, and efficient alignment at scale. Mike Lewis (Meta FAIR) on Llama — large-scale pretraining, open-weights philosophy, and the role of open models in the ecosystem. Both connect directly to everything derived in this lesson.

Related lessons in this library

← CS336 Lec 15: SFT & RLHF ← CS336 Lec 16: RLVR & GRPO ↻ CS336 Lec 1: Start Over from Zero Policy Gradients (deep dive) RL Algorithms Survey Actor-Critic Methods Reward & Alignment

"What I cannot create, I do not understand."
— Richard Feynman. The site motto. Earning it took 17 lectures.