Language Modeling from Scratch · CS336 · Lecture 16

Alignment II: RLVR & GRPO

RLHF works when humans judge quality — but what if you can just check the answer? When the reward is verifiable (math solution correct or not, code passes tests or not), you don't need a learned reward model at all. This lesson derives RL with Verifiable Rewards (RLVR) from scratch: why verifiable rewards escape reward-hacking, the REINFORCE gradient and why baselines matter, GRPO (Group Relative Policy Optimization — derive the group-normalized advantage A=(r−μ)/σ, worked 8-sample example), the GRPO clipped objective with KL, why dropping the critic cuts memory and complexity versus PPO, and three landmark RLVR systems (DeepSeek-R1, Kimi K1.5, Qwen 3). Five interactive canvases, full PyTorch code, and hand-derived worked examples for every formula.

Prerequisites: CS336 Lec 15 (RLHF, PPO, DPO, reward models). Basic probability (expectation, variance). Gradient intuition from pretraining.

Chapter 0: The Verifiability Shift

You have an RLHF-trained assistant. It follows instructions, refuses harmful requests, and sounds helpful. But ask it to solve a hard math problem — a competition problem with a numerical answer — and something subtle breaks. The reward model, trained on human preference judgments, learned that confident-sounding responses get high ratings. So the model learns to be confidently wrong.

This is not a small failure. It is structural. The reward model is a proxy — a learned approximation to human judgment. Optimize the policy hard against a proxy and you get reward hacking: the model finds response patterns that score well on the proxy but are not actually correct. For math, the proxy says "sounds good." The verifier says "wrong answer."

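The core insight of RLVR. For tasks where you can automatically check correctness, such as math problems (does the final answer match?), code (do the tests pass?), and formal proofs, you do not need a learned reward model at all. The reward function is the verifier. It is far harder to hack than a learned proxy because it checks the outcome itself rather than a learned impression of quality: either 42 == 42, or it isn't. A verifier is only as exact as the check it implements, though, so incomplete coverage (a thin test suite, for example) can still be gamed.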

This shift — from learned reward models (subjective, hackable) to verifiable rewards (objective, exact) — is what unlocks the reasoning capabilities of models like DeepSeek-R1, Kimi K1.5, and Qwen 3. The key insight from Lec 15: overoptimization is a problem precisely because the reward model is imperfect. Remove the imperfect RM and the problem largely disappears — in the verifiable domain.

RLHF (Lec 15)

Reward = learned RM(x, y). Human raters label preferences (y_w, y_l). RM generalizes those labels to new (x, y) pairs. Subjective, noisy, hackable.

↓ when can we do better?

RLVR (This Lesson)

Reward = verifier(x, y). Math: check if final answer matches. Code: run test suite. Proof: formal checker. Objective and exact wherever the check is complete; there is no learned proxy to exploit, because the verifier is the ground truth.

The recipe is simple in principle: generate multiple responses to a prompt, verify each one, reward the correct ones, penalize the incorrect ones, and update the policy to produce more correct responses. The interesting engineering is in the how — specifically, how to estimate the gradient of expected reward without a learned value function.

RLHF vs RLVR: reward signal comparison

Click a domain to see how the reward signal works. Notice how math/code rewards are binary and exact, while preference rewards are continuous and noisy.

A reward model trained on human preferences gives high scores to confident-sounding responses, causing the policy to produce confident but wrong math answers. This is an example of:

Catastrophic forgetting: RLHF overwrites the model's math knowledge.
Mode collapse: the model stops generating diverse answers.
Reward hacking: the policy has optimized against a proxy (the RM) rather than the true objective (correctness), exploiting the RM's bias toward confident style.
Overparameterization: the reward model is too large and memorizes training preferences.

Chapter 1: The REINFORCE Gradient

We want to maximize the expected reward of our policy. Formally: find θ that maximizes J(θ) = E_{y ∼ π_θ(·|x)}[R(y)], where R(y) is the reward (1 if correct, 0 if not). How do we compute ∇_θJ?

The problem: R(y) is defined on discrete sequences y. Gradients don't flow through discrete samples. We need the log-derivative trick (the resulting estimator is known as REINFORCE, or the score function estimator). The derivation takes four steps:

∇_θ J(θ) = ∇_θ E_{y~π_θ}[R(y)]

= ∇_θ ∑_y π_θ(y|x) R(y)

= ∑_y R(y) · ∇_θ π_θ(y|x)

= ∑_y R(y) · π_θ(y|x) · ∇_θ log π_θ(y|x)

= E_{y~π_θ} [ R(y) · ∇_θ log π_θ(y|x) ]

The step from the third line to the fourth uses ∇_θ log p = (∇_θ p)/p, rearranged as ∇_θ p = p · ∇_θ log p. The gradient is now an expectation over the policy, so we can estimate it with samples.
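
A quick numerical check can make the identity concrete. The sketch below assumes a toy softmax policy over three candidate responses (the logits, rewards, and helper names are illustrative and not part of the lecture) and compares the score-function form of the gradient with a finite-difference estimate of ∇_θ J.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    # J(theta) = sum_y pi(y) * R(y)
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    # E_y[ R(y) * d/dtheta_j log pi(y) ], using the softmax identity
    # d/dtheta_j log pi(k) = 1[k == j] - pi_j. The expectation is taken exactly.
    probs = softmax(logits)
    n = len(logits)
    grad = []
    for j in range(n):
        total = 0.0
        for k in range(n):
            score = (1.0 if k == j else 0.0) - probs[j]
            total += probs[k] * rewards[k] * score
        grad.append(total)
    return grad

def finite_difference_gradient(logits, rewards, h=1e-6):
    # Central differences on J itself; never touches log pi.
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[j] += h
        down[j] -= h
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * h))
    return grad

logits = [0.2, -0.5, 1.0]      # hypothetical policy parameters
rewards = [1.0, 0.0, 0.0]      # only the first response is correct
print(score_function_gradient(logits, rewards))
print(finite_difference_gradient(logits, rewards))
```

The two printed vectors should agree to several decimal places, even though R(y) is never differentiated.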

Intuition. The REINFORCE gradient says: for each sampled response y, compute ∇ log π_θ(y|x) (the gradient that would increase y's probability). Then weight it by R(y). High-reward responses get a large upward push. Zero-reward responses get no push. This is "reward-weighted imitation" — the policy imitates its own good outputs.

The variance problem — and why baselines help

REINFORCE is correct but has catastrophically high variance. The issue: if all sampled responses happen to be correct (R=1 for all), the gradient pushes up all of them equally — even the mediocre ones. If all happen to be wrong (R=0), the gradient is zero and nothing updates.

The fix is a baseline b: replace R(y) with (R(y) − b). Any baseline that doesn't depend on y is valid — it doesn't bias the gradient (because E[∇ log π · b] = b · E[∇ log π] = 0 by the policy gradient identity). But it can massively reduce variance by centering the rewards.

∇_θ J(θ) = E_{y~π_θ} [ (R(y) − b) · ∇_θ log π_θ(y|x) ]

The classic choice of b is a value function V(x) — the expected reward over all possible responses from state x. This is what PPO does: it trains a critic network to estimate V(x), uses (R(y) − V(x)) as the advantage, and this difference has much lower variance because it subtracts out the expected return. But training the value network costs memory and complexity.

Worked numerical example

Suppose we sample 4 responses. R = [1, 0, 0, 1]. Mean R = 0.5. REINFORCE gradient is proportional to [1, 0, 0, 1] — pushes up responses 1 and 4. With baseline b = 0.5: advantages = [0.5, −0.5, −0.5, 0.5] — now pushes up the correct responses AND pushes down the incorrect ones. The gradient signal is richer and variance is lower.

Misconception: REINFORCE with baseline is biased. It is not. The baseline cancels in expectation. E_y~π[(R(y) − b) ∇ log π(y)] = E_y~π[R(y) ∇ log π(y)] − b · E_y~π[∇ log π(y)] = E_y~π[R(y) ∇ log π(y)] − 0. The second term is zero because E[∇ log π] = ∇ E[1] = ∇ 1 = 0. The baseline is provably unbiased.
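
Both claims (same mean, lower variance) can be checked exactly on the 4-response example. The sketch below assumes a uniform softmax policy over the four responses, an assumption added here for illustration, and enumerates every outcome instead of sampling.

```python
def estimator_stats(probs, rewards, baseline):
    # Exact mean and variance, per logit j, of the single-sample estimator
    #   g_j(y) = (R(y) - b) * d/dtheta_j log pi(y)
    # for a softmax policy, where d/dtheta_j log pi(k) = 1[k == j] - pi_j.
    n = len(probs)
    means, variances = [], []
    for j in range(n):
        samples = [
            (rewards[k] - baseline) * ((1.0 if k == j else 0.0) - probs[j])
            for k in range(n)
        ]
        mean = sum(p * g for p, g in zip(probs, samples))
        second_moment = sum(p * g * g for p, g in zip(probs, samples))
        means.append(mean)
        variances.append(second_moment - mean * mean)
    return means, variances

probs = [0.25, 0.25, 0.25, 0.25]   # assumed uniform policy
rewards = [1.0, 0.0, 0.0, 1.0]     # the R = [1, 0, 0, 1] example

mean_plain, var_plain = estimator_stats(probs, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(probs, rewards, baseline=0.5)

print(mean_plain, mean_base)            # identical expected gradients
print(sum(var_plain), sum(var_base))    # total variance drops with the baseline
```

In this toy case the total variance, summed over the four logits, falls from 0.3125 to 0.125 while the expected gradient is unchanged. The variance of an individual component need not fall; only the total is compared here.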

python
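import torch

def reinforce_gradient_estimate(policy, sequences, rewards, baseline=None, prompt_lens=None):
    # Returns a surrogate loss; backprop through it gives the (negated)
    # REINFORCE gradient estimate, so an optimizer step ascends expected reward.
    # sequences: list of B 1-D LongTensors, each holding prompt + response token ids
    # rewards: (B,) tensor of scalar rewards (e.g. 1.0 or 0.0 from verifier)
    # baseline: scalar or (B,) tensor, subtracted from rewards
    # prompt_lens: optional list of B ints; if given, prompt tokens are excluded

    if baseline is None:
        baseline = 0.0  # vanilla REINFORCE

    advantages = rewards - baseline   # (B,) centered rewards, treated as constants

    loss = 0.0
    for i, (ids, adv) in enumerate(zip(sequences, advantages)):
        # log prob of this sequence under current policy
        logits = policy(ids.unsqueeze(0)).logits  # (1, L, V), HF-style output
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
        token_lp = log_probs[0].gather(1, ids[1:].unsqueeze(1)).squeeze(1)  # (L-1,)
        if prompt_lens is not None:
            token_lp = token_lp[prompt_lens[i] - 1:]  # keep response tokens only
        seq_lp = token_lp.sum()  # log-prob summed over the kept tokens

        loss = loss - adv * seq_lp  # REINFORCE: maximize adv * log π

    return loss / len(sequences)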

The REINFORCE gradient uses the log-derivative trick: ∇_θ E[R(y)] = E[R(y) ∇_θ log π_θ(y)]. Why can we estimate this gradient with samples even though the reward R(y) is non-differentiable w.r.t. θ?

We use a straight-through estimator that treats the discrete sample as continuous during backprop.
We approximate R(y) with a differentiable reward model that can be backpropagated through.
The log-derivative trick moves the gradient inside the expectation, attaching it to log π_θ(y), which IS differentiable. R(y) is just a scalar weight on that gradient, not differentiated itself.
We discretize the gradient by rounding R(y) to 0 or 1, which makes it piecewise-constant and differentiable.

Chapter 2: GRPO: The Group Baseline

PPO solves the variance problem by training a separate value network V(x) to serve as the baseline. This is principled but expensive: the value network is typically the same size as the policy (billions of parameters), requires its own optimizer state, and must be trained jointly with the policy, roughly doubling memory and adding significant implementation complexity.

Group Relative Policy Optimization (GRPO) asks: what if we don't train a value network at all? Instead, for each prompt x, we sample a group of G responses — say G=8. We run the verifier on each. Some are correct (reward 1), some wrong (reward 0). The group average reward IS the baseline.

The group as the baseline. If 3 out of 8 responses to the same prompt are correct, the average reward is 0.375. A response with reward 1 gets advantage 1 − 0.375 = 0.625 (above average — push it up). A response with reward 0 gets advantage 0 − 0.375 = −0.375 (below average — push it down). No value network needed — the group itself provides the reference point.

Deriving the GRPO advantage

For a group of G responses {y₁, ..., y_G} sampled from the same prompt x, with rewards {r₁, ..., r_G}, the group mean and standard deviation are:

μ = (1/G) ∑_{i=1}^{G} r_i

σ = √( (1/G) ∑_{i=1}^{G} (r_i − μ)² + ε )

The GRPO advantage for response i is the z-score within the group:

A_i = (r_i − μ) / σ

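The ε (typically 10⁻⁴) prevents division by zero when all responses get the same reward (all correct or all incorrect); in that case every advantage is 0. Dividing by σ rescales the advantages to unit variance within each group, so every prompt with mixed outcomes contributes a gradient of comparable size, however spread out its raw rewards are. One consequence: the rarer outcome in a group gets the larger magnitude. With 1 correct response out of 8, the lone correct response has a z-score of about +2.65, while each wrong response sits near −0.38.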

Worked example: 8 samples, 3 correct

G = 8 responses to "Solve: 3x + 7 = 22." Rewards: [1, 0, 1, 0, 0, 1, 0, 0]. Three correct (x=5), five wrong.

μ = (1+0+1+0+0+1+0+0)/8 = 3/8 = 0.375

σ² = (1/8)[(1−0.375)²×3 + (0−0.375)²×5]

= (1/8)[0.625²×3 + 0.375²×5] = (1/8)[1.172 + 0.703] = 0.234

σ = √0.234 = 0.484

A_correct = (1 − 0.375) / 0.484 = 0.625 / 0.484 = +1.29

A_wrong = (0 − 0.375) / 0.484 = −0.375 / 0.484 = −0.775

The correct responses get a strong positive push (+1.29). The wrong responses get a moderate negative push (−0.775). Compare to REINFORCE without baseline: correct = +1, wrong = 0 — the wrong responses get no signal. The GRPO baseline explicitly suppresses bad responses, not just reinforces good ones.
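
The arithmetic above is easy to reproduce. The sketch below is a minimal pure-Python version of the group z-score, with ε placed inside the square root as in the definition of σ; the function name is illustrative.

```python
import math

def group_advantages(rewards, eps=1e-4):
    # Population statistics (divide by G), eps inside the square root.
    g = len(rewards)
    mu = sum(rewards) / g
    var = sum((r - mu) ** 2 for r in rewards) / g
    sigma = math.sqrt(var + eps)
    return [(r - mu) / sigma for r in rewards]

rewards = [1, 0, 1, 0, 0, 1, 0, 0]           # the 8-sample example
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])      # correct ≈ +1.29, wrong ≈ −0.77

print(group_advantages([0] * 8))              # all wrong: every advantage is 0
```

Because ε is included, the values differ from the hand calculation only in the third decimal place.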

GRPO group baseline — group size G and correct fraction slider

Adjust the group size and number of correct responses. See per-sample advantages computed live, and compare to vanilla REINFORCE (no baseline).

In GRPO, all 8 responses to a prompt are incorrect (reward = 0 for all). What are the advantages, and what does this mean for the gradient update?

Advantages = −1 for all responses. The gradient strongly pushes down all responses.
Advantages = 0 for all responses but with high variance: the gradient is noisy.
Advantages = +1 for all responses because they all tied for the best reward in the group.
Advantages = 0 for all responses (mean=0, each r_i−μ=0). The gradient update is zero, so the model learns nothing from this group. This is a known failure mode of GRPO on very hard prompts.

Chapter 3: GRPO Objective & KL

We have advantages A_i. Now we need a stable policy update rule. Direct gradient ascent on E[A_i ∇ log π] can cause large, destabilizing updates if the advantages are large. GRPO borrows PPO's clipped ratio to prevent this.

The importance ratio

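The responses y_i were sampled from the old policy π_old (before this gradient step). We are updating the new policy π_θ. The ratio r_i(θ) = π_θ(y_i|x) / π_old(y_i|x) measures how much the new policy weights each response relative to the old one. (Written with its argument θ, r_i(θ) is this ratio, not the reward r_i from Chapter 2.) Because the policy is autoregressive, the ratio factorizes into a product of per-token probability ratios over the response. Implementations compute it in log space as exp(log π_θ − log π_old), and many, including the original GRPO formulation, clip each per-token ratio separately rather than the whole-sequence product.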

r_i(θ) = π_θ(y_i|x) / π_old(y_i|x) = ∏_t [ π_θ(y_{i,t} | x, y_{i,<t}) / π_old(y_{i,t} | x, y_{i,<t}) ]

The GRPO clipped objective

The PPO clip trick: if r_i(θ) wanders too far from 1 (the policy has changed a lot since the rollout), clip it. The clipped objective for a single sample is:

L_clip,i(θ) = min( r_i(θ) · A_i, clip(r_i(θ), 1−ε, 1+ε) · A_i )

Here ε is the clip range, typically 0.2 (it is unrelated to the small stability constant ε inside σ). This clip is asymmetric: if A_i > 0 (good response), we clip the ratio at 1+ε so we don't over-increase this response's probability in one step. If A_i < 0 (bad response), we clip at 1−ε so we don't over-decrease.
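
The asymmetry is easiest to see by evaluating the clipped objective on scalars. The sketch below is a plain-Python version of the single-sample formula; the four test points are chosen for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    # min( r * A, clip(r, 1 - eps, 1 + eps) * A ) for one sample
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Good response (A > 0): gains are capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.5, +1.0))   # capped at 1.2
print(clipped_surrogate(0.5, +1.0))   # 0.5: not clipped, the policy can still recover

# Bad response (A < 0): the penalty stops growing once the ratio falls below 1 - eps.
print(clipped_surrogate(0.5, -1.0))   # held at -0.8
print(clipped_surrogate(1.5, -1.0))   # -1.5: not clipped, the mistake is fully penalized
```

Wherever the clipped branch is selected, the objective is constant in the ratio, so that sample contributes no gradient for the step.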

Adding the KL penalty

Additionally, GRPO adds a KL divergence penalty between the current policy and a frozen reference policy π_ref (usually the SFT checkpoint). This prevents the policy from drifting so far that it loses its language quality:

L_GRPO(θ) = E_{x∼D} [ (1/G) ∑_{i=1}^{G} L_clip,i(θ) − β · KL(π_θ(·|x) ∥ π_ref(·|x)) ]

The KL term can be implemented per-token as a log-ratio penalty added to the reward, or as a separate loss term. DeepSeek-R1 uses it as a separate term; some implementations add it as a token-level reward at each step.

Why keep the KL penalty if rewards are verifiable? Verifiable rewards are exact for correctness, but RLVR can still cause the model to degenerate in language quality — producing grammatically broken, repetitive, or incoherent chains of thought. The KL penalty keeps the model's language distribution close to the SFT reference, preserving fluency even as reasoning improves.
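
To see what the KL term measures, it can be computed exactly on a toy distribution. The sketch below assumes two small categorical distributions standing in for π_θ and π_ref (the numbers are made up) and compares two single-sample estimators: the plain log-ratio, and a non-negative variant (ratio − log ratio − 1) that some GRPO implementations use. Treat the choice of estimator as an assumption of this sketch.

```python
import math

def exact_kl(p, q):
    # KL(p || q) for categorical distributions given as probability lists
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_ratio_estimate(p_y, q_y):
    # log( pi_theta(y) / pi_ref(y) ): unbiased, but negative for some samples
    return math.log(p_y / q_y)

def nonnegative_estimate(p_y, q_y):
    # r - log r - 1 with r = pi_ref(y) / pi_theta(y): unbiased and always >= 0
    r = q_y / p_y
    return r - math.log(r) - 1.0

policy = [0.7, 0.2, 0.1]       # stand-in for pi_theta(. | x)
reference = [0.5, 0.3, 0.2]    # stand-in for pi_ref(. | x)

kl = exact_kl(policy, reference)
# Expectations under y ~ pi_theta, computed by enumeration instead of sampling
mean_log_ratio = sum(p * log_ratio_estimate(p, q) for p, q in zip(policy, reference))
mean_nonneg = sum(p * nonnegative_estimate(p, q) for p, q in zip(policy, reference))
print(kl, mean_log_ratio, mean_nonneg)
```

Both estimators average to the exact KL, but only the second is non-negative for every individual sample, which tends to make the penalty less noisy.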

PyTorch GRPO update — full sketch

python
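import torch
import torch.nn.functional as F

def seq_log_prob(model, prompt_ids, response_ids):
    # Summed log-prob of the response tokens given the prompt.
    # Assumes 1-D LongTensors and an HF-style model whose output has .logits.
    ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)       # (1, P+R)
    logits = model(ids).logits[:, :-1]                             # position t predicts token t+1
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)  # (1, P+R-1)
    return token_lp[0, prompt_ids.numel() - 1:].sum()              # response tokens only

def grpo_loss(policy, ref_policy, old_policy, groups, eps=0.2, beta=0.04):
    # groups: list of dicts {prompt_ids, response_ids, reward}
    # Group responses by prompt, compute advantages within each group

    prompt_to_group = {}
    for g in groups:
        key = tuple(g['prompt_ids'].tolist())
        prompt_to_group.setdefault(key, []).append(g)

    total_loss = 0.0
    n_groups = 0

    for key, grp in prompt_to_group.items():
        rewards = torch.tensor([g['reward'] for g in grp], dtype=torch.float32)
        mu = rewards.mean()
        # Population std (divide by G) with ε inside the square root, as in the σ formula
        sigma = torch.sqrt(rewards.var(unbiased=False) + 1e-4)
        advantages = (rewards - mu) / sigma   # group z-scores

        group_loss = 0.0
        for i, g in enumerate(grp):
            ids = g['response_ids']
            adv = advantages[i]

            # Log-probs under current, old, and reference policy
            lp_new = seq_log_prob(policy, g['prompt_ids'], ids)
            with torch.no_grad():   # rollout and reference models are never updated
                lp_old = seq_log_prob(old_policy, g['prompt_ids'], ids)
                lp_ref = seq_log_prob(ref_policy, g['prompt_ids'], ids)

            ratio = torch.exp(lp_new - lp_old)              # importance ratio
            clip_ratio = ratio.clamp(1 - eps, 1 + eps)

            # PPO-clip objective (negative because we maximize)
            policy_loss = -torch.min(ratio * adv, clip_ratio * adv)

            # KL penalty: non-negative estimate exp(d) - d - 1 with d = log π_ref - log π_θ.
            # The naive lp_new - lp_ref has a gradient that ignores π_ref entirely.
            # Real implementations apply this per token to avoid overflow on long sequences.
            d = lp_ref - lp_new
            kl = torch.exp(d) - d - 1
            group_loss = group_loss + policy_loss + beta * kl

        total_loss = total_loss + group_loss / len(grp)   # mean over the G responses
        n_groups += 1

    return total_loss / n_groups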

In the GRPO clipped objective, why do we clip the importance ratio r_i(θ) at [1−ε, 1+ε]?

To prevent the reward from exceeding 1.0 when advantages are large.
To ensure the old and new policies stay identical during training.
To prevent large, destabilizing policy updates when the new policy has drifted significantly from the policy that generated the rollouts. The clip limits how much a single gradient step can change the policy.
To normalize the advantages to the range [−1, +1].

Chapter 4: PPO vs GRPO

GRPO and PPO solve the same problem — policy gradient with variance reduction — but they differ fundamentally in how they compute the baseline. This difference has large practical consequences.

PPO's value network (critic)

PPO trains a separate value network (the critic) V_φ(s) that predicts the expected future reward from any state s. In the LM setting, s = (x, y_<t) — the prompt plus all tokens generated so far. The Generalized Advantage Estimate (GAE) uses this critic to compute per-token advantages:

A_t^PPO = ∑_{k≥t} (γλ)^{k−t} (r_k + γ V(s_{k+1}) − V(s_k))

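In the language model bandit setting (single reward at end of sequence) with γ = λ = 1, the sum telescopes: every intermediate value cancels, and the advantage at token t is just R − V(s_t), which at the first token is R − V(x). The estimate is simple, but producing and training those values still means running a second billion-parameter network, forward and backward, over every rollout.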
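
The simplification can be verified with a few lines of code. The sketch below implements the GAE sum with a backward recursion and evaluates it on a made-up 3-token response whose only reward arrives at the end; the critic values are hypothetical.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    # rewards[t]: reward received after token t
    # values[t]:  critic estimate V(s_t); the value after the last token is 0
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum of residuals
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]   # single terminal reward R = 1
values = [0.3, 0.5, 0.8]    # hypothetical critic outputs V(s_0), V(s_1), V(s_2)

print(gae_advantages(rewards, values))   # matches [R - V(s_t) for each t]
```

With γ = λ = 1 the result equals R − V(s_t) at every position, here approximately [0.7, 0.5, 0.2].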

GRPO's group baseline

GRPO replaces V(x) with the group mean μ = mean({r₁, ..., r_G}). No value network. No critic training. No extra optimizer state. The group of G rollouts provides a Monte Carlo estimate of E[R|x] — and if G is large enough, this estimate is good enough.
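
How large is "large enough"? For binary rewards the group mean is an average of G Bernoulli draws, so its standard error is √(p(1−p)/G). The sketch below assumes a fixed per-sample success probability p (an idealization) and checks the formula against a small simulation.

```python
import math
import random

def baseline_standard_error(p_correct, group_size):
    # Standard deviation of the group-mean baseline around E[R | x] = p_correct
    return math.sqrt(p_correct * (1.0 - p_correct) / group_size)

def simulated_group_means(p_correct, group_size, n_groups, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(n_groups):
        hits = sum(1 for _ in range(group_size) if rng.random() < p_correct)
        means.append(hits / group_size)
    return means

def std(values):
    mu = sum(values) / len(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

for g in (4, 8, 64):
    means = simulated_group_means(0.375, g, n_groups=5000)
    print(g, round(baseline_standard_error(0.375, g), 3), round(std(means), 3))
```

Quadrupling G only halves the noise in the baseline, which is the price GRPO pays for dropping the critic.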

PPO (with critic) vs GRPO (group baseline) — architecture comparison

Toggle between the two architectures to see the full computational graph. Note how GRPO eliminates the critic network entirely.

	PPO (with critic)	GRPO (group baseline)
Baseline	V_φ(x) — learned value network	μ = mean(group rewards)
Extra model	Yes — critic = same size as policy	No — just G rollouts
Memory cost	2× (policy + critic params + optimizer states)	1× (policy only)
Per-token vs per-sequence	Per-token GAE advantages	Per-sequence group advantage
Noise in baseline estimate	Low once the critic is well trained (but it can be inaccurate early on)	Higher (Monte Carlo estimate; shrinks as G grows)
Implementation complexity	High (rollout loop, GAE, critic update)	Low — can write in ~50 lines
Used in	InstructGPT, AlpacaFarm, most RLHF	DeepSeekMath, R1, Kimi K1.5, Qwen3

Is GRPO's baseline "valid"? Not entirely. A baseline keeps the gradient unbiased only if it does not depend on the sampled response y; it may depend on the prompt x. GRPO's group mean is computed per prompt, and although it includes r_i itself, r_i − μ equals (G−1)/G times the leave-one-out advantage, so using the full mean only rescales the gradient by a constant. That part is fine. Dividing by the standard deviation of the group is different: it is not a baseline at all but a per-prompt rescaling that depends on the sampled rewards, and it introduces a small bias. Liu et al. (2025) showed that a corrected version (REINFORCE with leave-one-out) is unbiased. In practice, the bias is small and GRPO works well empirically.
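
The relationship between the group-mean baseline and a leave-one-out baseline can be checked directly. The sketch below (function names are illustrative) computes both on the 8-sample example and confirms that they differ only by the constant factor (G−1)/G.

```python
def mean_baseline_advantages(rewards):
    # r_i minus the mean of the whole group (r_i included)
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

def leave_one_out_advantages(rewards):
    # r_i minus the mean of the other G - 1 rewards, so the baseline
    # for response i does not depend on response i itself
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

rewards = [1, 0, 1, 0, 0, 1, 0, 0]
g = len(rewards)
with_mean = mean_baseline_advantages(rewards)
with_loo = leave_one_out_advantages(rewards)

for a, b in zip(with_mean, with_loo):
    print(round(a, 4), round(b, 4), round((g - 1) / g * b, 4))
```

The first and third columns coincide, so subtracting the full group mean behaves like the leave-one-out estimator with a slightly smaller step size. The division by σ has no such equivalence.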

GRPO eliminates the PPO critic (value network). What is the primary practical benefit of this, and what is the cost?

Benefit: faster inference at deployment. Cost: higher training loss.
Benefit: cuts memory and implementation complexity by ~half (no second billion-parameter model to maintain). Cost: the group-mean baseline is a noisier estimate of the true value function, requiring larger G or more rollouts to achieve the same variance reduction as a well-trained critic.
Benefit: the gradient becomes unbiased. Cost: advantages are harder to interpret.
Benefit: removes the KL penalty requirement. Cost: policy is less stable.

Chapter 5: Reward Design

The verifier provides a binary signal: correct or incorrect. But RLVR practitioners have found that augmenting this signal with format rewards significantly improves training stability and final performance. Reward design is not just about what counts as "correct" — it's about shaping the training signal to elicit the behaviors you want.

DeepSeek-R1-Zero reward components

The R1-Zero recipe uses exactly two reward components:

Accuracy reward: 1.0 if the final answer is correct (matches ground truth), 0.0 otherwise. This is the core RLVR signal.
Format reward: small positive (0.0–0.1) if the model uses the required format, specifically wrapping its chain-of-thought in <think> tags and the final answer in <answer> tags. Responses that skip the structured format forgo this bonus.

No process rewards. No length rewards in R1-Zero. No human preference signals. Just: did you format correctly and did you get the right answer?

Why format rewards? Without format enforcement, RL can produce responses where the "final answer" is ambiguous — the model might say the correct number somewhere in the middle of its chain of thought but then revise it to a wrong number at the end. The <think>/<answer> structure makes the extraction unambiguous, allowing exact reward computation.

The verifier implementation

For math: extract the content of the <answer> tag, normalize the expression (strip whitespace, convert fractions, handle multiple valid forms like "1/2" and "0.5"), and compare to ground truth. Equivalence checking can use a symbolic math library (sympy) for exact comparison.

python
import re
from sympy import simplify, sympify

def math_verifier(response: str, ground_truth: str) -> float:
    # Extract answer from structured format
    m = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
    if not m:
        return 0.0   # no structured answer = wrong

    answer_str = m.group(1).strip()

    try:
        # Symbolic equivalence check: is answer - ground_truth == 0?
        # Caution: sympify evaluates its input with eval, so sandbox untrusted model output.
        diff = simplify(sympify(answer_str) - sympify(ground_truth))
        return 1.0 if diff == 0 else 0.0
    except Exception:
        # Unparseable expression: fall back to exact string match on stripped text
        return 1.0 if answer_str == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    has_think = '<think>' in response and '</think>' in response
    has_answer = '<answer>' in response and '</answer>' in response
    return 0.1 if (has_think and has_answer) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return math_verifier(response, ground_truth) + format_reward(response)
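
For purely numeric answers, the "1/2" versus "0.5" case can also be handled without a symbolic library. The sketch below is a dependency-free alternative that assumes answers are plain decimals or simple fractions; `parse_number` and `numeric_match` are hypothetical helper names and are not part of the verifier above.

```python
from fractions import Fraction

def parse_number(text):
    # Accepts forms such as "0.5", "1/2", " 3 / 6 ", "-2"; returns None if unparseable.
    cleaned = text.strip().replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def numeric_match(answer, ground_truth):
    # Exact rational comparison, so there is no floating-point tolerance to tune.
    a = parse_number(answer)
    b = parse_number(ground_truth)
    return a is not None and b is not None and a == b

print(numeric_match("1/2", "0.5"))       # True
print(numeric_match(" 3 / 6 ", "0.5"))   # True
print(numeric_match("0.51", "1/2"))      # False
print(numeric_match("x + 1", "1"))       # False: not a plain number
```

A check like this could run before the symbolic comparison, leaving the symbolic library for answers that are genuine expressions.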

The effect of format vs correctness reward

Ablation results from the original GRPO paper (DeepSeekMath) showed: correctness reward alone achieves strong results on math benchmarks. Format reward adds a small but consistent improvement, mainly by reducing the frequency of unstructured responses that make answer extraction ambiguous. Process supervision (rewarding intermediate steps, not just the final answer) added further gains on some benchmarks but at significant labeling cost.

Reward design: correctness-only vs correctness + format — training dynamics

Observe how reward shaping affects training stability. Toggle the format reward and observe the variance in the reward signal over time.

In RLVR math training, why is it important to include a format reward (requiring <think>/<answer> tags) in addition to the correctness reward?

Format rewards increase the total reward magnitude, making gradient updates larger and training faster.
Without format rewards, the model refuses to generate chains of thought and answers directly.
The format makes answer extraction unambiguous. Without it, the "correct answer" might appear mid-response and then be revised, making it unclear what counts as the model's final answer for reward computation.
Format rewards prevent the KL penalty from dominating the training signal.

Chapter 6: DeepSeek-R1

DeepSeek-R1 launched in January 2025 and became a social phenomenon: an open model that matched or exceeded OpenAI o1 on reasoning benchmarks, with a surprisingly simple RL recipe. Understanding R1 requires distinguishing two experiments: R1-Zero (pure RL from base model, no SFT) and R1 (RL with SFT warm-start).

R1-Zero: pure RLVR from a base model

The R1-Zero experiment is philosophically striking. Start from DeepSeek-V3 base (a pretrained but not instruction-tuned model). Apply GRPO with only two rewards: accuracy and format. No SFT, no human preference data, no chain-of-thought demonstrations.

What emerged: the model spontaneously developed long chains of thought. It learned to re-examine its work ("Wait, let me check this again..."). It learned to try multiple approaches. It learned to use the <think> block to explore, then commit to an answer. These behaviors were not taught — they were discovered by RL as strategies that increase expected reward.

The "aha moment," and why it may be overstated. The R1 paper described a specific checkpoint where the model suddenly learned to allocate more thinking time to hard problems, which the authors called an "aha moment." Follow-up analyses (the Dr. GRPO paper, Liu et al. 2025) found two nuances: (1) much of the observed CoT length increase may be due to a length bias in GRPO's objective (in the standard formulation the token-level loss is averaged over each response's length, so long incorrect responses are penalized less per token, which inadvertently favors longer outputs); (2) the base model DeepSeek-V3 already exhibits "aha moment" reasoning when prompted appropriately, so R1-Zero may be eliciting this latent capability rather than creating it. The lesson: emergence claims in RL are hard to validate without careful ablations.

R1: the full recipe

R1 improves on R1-Zero with a more careful pipeline. The key differences:

Stage 1: Reasoning SFT Warm-Start

Collect long chain-of-thought examples (from earlier RL checkpoints or distillation). SFT the model on these first. This avoids the early instability of RL from a cold-started base model and ensures readable CoT format from the start.

↓

Stage 2: GRPO RL

Apply GRPO with accuracy + format rewards + a language consistency reward (penalizes mixed-language output — RL on multilingual models naturally drifts to language-switching in the CoT, which hurts readability).

↓

Stage 3: General SFT + RLHF

After reasoning RL, run standard post-training: SFT on a mix of reasoning and non-reasoning data (about 600k reasoning samples collected by rejection sampling from the RL checkpoint, plus about 200k general samples, roughly 800k in total). Then RLHF to align tone and safety.

SFT for reasoning — a little goes a long way

One of the most practically significant findings: for the SFT warm-start, you don't need huge amounts of long CoT data. Even ~1,000 high-quality math + science problems with long CoT responses (sourced from Gemini or R1 itself as teacher) are enough to bootstrap the RL from a stable starting point. The RL then does the heavy lifting of improving accuracy.

Distillation — can small models reason?

After R1 training, DeepSeek used roughly 800,000 curated samples generated with R1 to fine-tune smaller open models: Qwen 2.5 (1.5B–32B) and Llama 3 (8B and 70B). The distilled models showed strong reasoning, often competitive with much larger models on math benchmarks. This demonstrates that imitation of reasoning traces is a powerful and cheap way to transfer RLVR-derived capabilities.

DeepSeek-R1-Zero applies RLVR directly to the base model without SFT warm-start. What does this experiment most directly demonstrate?

That base models are already aligned, so SFT is unnecessary for RLVR.
That MCTS and process reward models are necessary for o1-level reasoning.
That RL with verifiable rewards alone, without any CoT demonstrations, can elicit long chain-of-thought reasoning and self-correction behavior from a capable pretrained model, ending speculation that MCTS or PRMs are necessary for strong reasoning.
That format rewards (thinking tags) are more important than accuracy rewards for eliciting reasoning.

Chapter 7: Showcase: The Full RLVR Loop

This is the complete RLVR training loop — animated. Watch how GRPO actually works: prompt arrives, G completions are sampled, the verifier scores each, advantages are computed within the group, and the policy update nudges probabilities up for winners and down for losers. Over thousands of steps, accuracy climbs and CoT length grows.
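
The same loop can be written as a toy program. The sketch below is a deliberately tiny stand-in, not an LLM training script: the "policy" is a softmax over four candidate answers to a single prompt, the verifier checks the sampled index, and all names and hyperparameters are assumptions chosen for illustration. Each step is fully on-policy, so the importance ratio is 1, clipping never activates, and the KL term is omitted.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards, eps=1e-4):
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sigma = math.sqrt(var + eps)
    return [(r - mu) / sigma for r in rewards]

def train_toy_grpo(n_answers=4, correct=2, group_size=8, steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_answers                 # uniform initial policy
    history = []                               # P(correct answer) before each step
    for _ in range(steps):
        probs = softmax(logits)
        history.append(probs[correct])
        # 1. Rollout: sample G answers from the current policy
        samples = rng.choices(range(n_answers), weights=probs, k=group_size)
        # 2. Verify: binary reward from an exact check
        rewards = [1.0 if y == correct else 0.0 for y in samples]
        # 3. Group-relative advantages (z-scores within the group)
        advantages = group_advantages(rewards)
        # 4. Policy-gradient step, using d/dtheta_j log pi(y) = 1[y == j] - pi_j
        grad = [0.0] * n_answers
        for y, adv in zip(samples, advantages):
            for j in range(n_answers):
                grad[j] += adv * ((1.0 if y == j else 0.0) - probs[j]) / group_size
        logits = [z + lr * g for z, g in zip(logits, grad)]
    history.append(softmax(logits)[correct])
    return history

history = train_toy_grpo()
print(round(history[0], 3), round(history[-1], 3))
```

Groups where every sample is right or every sample is wrong leave the logits untouched, so the probability of the correct answer only moves on mixed groups.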

RLVR training loop — animated GRPO rollout & update

Press Play to animate the GRPO loop. Adjust group size G and watch how the advantage distribution changes. The accuracy curve at the bottom shows training progress over simulated steps.

Reading the animation

The top panel shows the current rollout group: G response bubbles, colored green (correct) or red (incorrect). Below each bubble is its advantage — the z-score within the group. The middle panel shows the policy update direction: green arrows mean "increase this response's probability," red arrows mean "decrease." The bottom panel shows accuracy over training steps — the cumulative fraction of prompts where at least one sampled response is correct, rising as RL improves the policy.

What GRPO cannot learn. GRPO can only improve on prompts where it occasionally gets a correct answer — if reward is 0 for every sample in every group, the advantage is always 0 and the gradient is zero. This means GRPO requires the base model to already have some chance of being correct. For very hard problems (competition-level math beyond the base model's capabilities), GRPO alone cannot bootstrap reasoning from scratch — hence the SFT warm-start in R1.
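
How often does a group carry any signal at all? If each sample is correct independently with probability p (an idealized assumption), a group of G is informative only when it contains at least one correct and one incorrect response. The sketch below computes that probability.

```python
def prob_informative_group(p_correct, group_size):
    # 1 - P(all correct) - P(all wrong)
    return 1.0 - p_correct ** group_size - (1.0 - p_correct) ** group_size

for p in (0.0, 0.01, 0.1, 0.5, 0.9, 0.99):
    print(p, round(prob_informative_group(p, 8), 4), round(prob_informative_group(p, 64), 4))
```

At p = 0.01 and G = 8, fewer than one group in ten produces a non-zero gradient; larger groups help, but only if p is not exactly zero.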

CoT length emergence

One of the most striking observations in R1-Zero training: the average length of the chain-of-thought in the <think> block increases over RL training, even though no explicit length reward is given. The intuition: longer chains of thought give the model more "computation" — more intermediate tokens to reason through — and empirically lead to higher accuracy on hard problems. RL discovers this and learns to generate longer chains when it helps.

Chapter 8: Kimi K1.5 & Qwen 3

DeepSeek-R1 showed that RLVR works. But it is not the only recipe. Two other systems from the same period, Kimi K1.5 (Moonshot AI) and Qwen 3 (Alibaba), achieved comparable results with distinct design choices. Comparing them reveals which aspects of RLVR are essential and which are contingent.

Kimi K1.5: DPO-inspired policy gradient

Kimi takes a different algorithmic route. Instead of GRPO's group z-score, they derive their policy gradient objective from the DPO framework (Lecture 15): assume a nonparametric policy, solve for the implied reward in terms of the log-ratio π/π_ref, then use a squared loss surrogate. The resulting gradient looks like a baselined policy gradient with regularization — similar to GRPO but derived differently.

Key differences from R1:

Length control: Kimi adds an explicit length reward in later training. Among the responses sampled for a prompt, lengths are rescaled between the shortest and the longest; correct responses below the midpoint of that range earn a bonus and those above it a penalty, while incorrect responses are penalized for running long but never rewarded for being short. This steers the model away from "longer = better" and toward "correct and concise = better."
Curriculum learning: dataset is ordered by difficulty (estimated by best-of-8 success rate). Training starts on easier problems and gradually moves to harder ones. Problems the model already solves consistently are downweighted to focus compute on challenging problems.
Code rewards: for code problems, test cases are generated from problems with known solutions, providing a broader verifier coverage than just single-answer math.

Kimi's length reward insight. The length reward acts on both outcomes: incorrect responses are pushed to be short (cut your losses fast), and correct responses are pushed to be short as well (be efficient). Only correct responses can earn a positive length bonus. The weight on this reward starts at 0 and is gradually increased mid-training, because applying it too early disrupts learning before the model has learned to get correct answers at all.
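
One way to realize such a length reward is sketched below. It assumes a formulation in which lengths are rescaled linearly between the shortest and longest response sampled for a prompt; the exact constants and the function name are assumptions of this sketch, so check the K1.5 report before relying on them.

```python
def length_rewards(lengths, correct):
    # lengths: token counts of the responses sampled for one prompt
    # correct: matching list of booleans from the verifier
    shortest, longest = min(lengths), max(lengths)
    if longest == shortest:
        return [0.0] * len(lengths)            # no spread in length: no length signal
    rewards = []
    for n, is_correct in zip(lengths, correct):
        # +0.5 for the shortest response, -0.5 for the longest, linear in between
        lam = 0.5 - (n - shortest) / (longest - shortest)
        # Correct responses get the full signal; incorrect ones can only be penalized
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards

lengths = [100, 200, 300, 400]
correct = [True, False, True, False]
print([round(r, 3) for r in length_rewards(lengths, correct)])
```

The short incorrect response (200 tokens) gets 0 rather than a bonus, so brevity alone is never rewarded. In training this term would be multiplied by a weight that is ramped up from 0.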

Qwen 3: RLVR with minimal data

Qwen 3 (released 2025) achieves state-of-the-art results on reasoning benchmarks using a remarkably small RLVR dataset: 3,995 examples. This is not a typo. The entire RLVR stage uses fewer than 4,000 prompts — selected with extreme care for quality and difficulty.

Qwen 3's data selection criteria:

Exclude problems the model can solve correctly without a chain of thought (too easy — no learning signal).
Exclude problems too similar to the validation set (contamination risk).
Manual filtering for "quality of reasoning" — specifically looking for problems where the model produces a valid-seeming CoT that is nonetheless wrong (these are highest-signal training examples).
Difficulty filtering via best-of-N success rate: keep problems where 0.1 < success_rate < 0.9 (hard enough to learn from, not so hard that all samples fail).
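
The difficulty filter in the last criterion amounts to a few lines of code. The sketch below assumes that each candidate prompt has already been sampled N times and scored by the verifier; the prompt ids and reward lists are made up.

```python
def success_rate(rewards):
    # Fraction of sampled responses the verifier marked correct
    return sum(rewards) / len(rewards)

def filter_by_difficulty(sampled_rewards, low=0.1, high=0.9):
    # Keep prompts that are neither almost always solved nor almost never solved
    kept = []
    for prompt_id, rewards in sampled_rewards.items():
        if low < success_rate(rewards) < high:
            kept.append(prompt_id)
    return kept

sampled_rewards = {
    "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],     # success rate 1.0
    "too_hard": [0, 0, 0, 0, 0, 0, 0, 0],     # success rate 0.0
    "useful": [1, 0, 0, 1, 0, 0, 0, 1],       # success rate 0.375
    "borderline": [1, 1, 1, 1, 1, 1, 1, 0],   # success rate 0.875
}
print(filter_by_difficulty(sampled_rewards))
```

The two prompts that survive are exactly those whose groups would yield non-zero GRPO advantages often enough to be worth the rollouts.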

Thinking mode fusion

Qwen 3 introduced a notable capability: controllable "thinking" vs "non-thinking" mode. During training, they mix: (1) standard instruction data without chains of thought, (2) RLVR-trained reasoning data with explicit <think> blocks. At inference, the model can be prompted with a flag that enables or disables the thinking mode. This gives users explicit control over the latency/accuracy tradeoff — a short answer for simple queries, a long reasoned answer for complex ones.

Kimi K1.5, R1, Qwen 3 — RLVR recipe comparison

Compare the key design choices across the three landmark RLVR systems. Each row is a dimension of the recipe; differences are highlighted.

Qwen 3 achieves strong RLVR results with only 3,995 training examples. What selection criterion is MOST responsible for the high sample efficiency?

Using only the hardest problems, meaning those the model never gets correct even with sampling.
Scaling up the group size G to 64 to compensate for the small dataset.
Using problems generated by a stronger teacher model rather than human-written problems.
Careful difficulty filtering (0.1 < success_rate < 0.9): selecting problems that are neither too easy (no learning signal) nor too hard (all samples fail, all advantages zero). Only these middle-difficulty problems provide non-trivial gradient signal.

Chapter 9: Connections

RLVR sits at the intersection of RL theory, LLM post-training, and test-time compute scaling. Here is how this lesson connects to the broader landscape:

Concept from this lesson	Where it leads
REINFORCE gradient derivation	Policy Gradients Gleam — full REINFORCE, variance analysis, TRPO
PPO review & GRPO comparison	CS336 Lec 15 — RLHF, DPO derivation, overoptimization
GRPO group advantage (group z-score)	RL Algorithms Gleam — actor-critic, GAE, advantage estimation
KL penalty in RLVR	Reward & Alignment Gleam — KL geometry, reward-KL tradeoff
Test-time compute (longer CoT)	CS336 Lec 17 — inference scaling, best-of-N, PRM-guided search
Distillation from R1	CS336 Lec 9 — scaling laws, why distillation is efficient

Limitations of RLVR

RLVR is not a universal recipe. Its key limitations:

Verifiability requirement: only works in domains where you can automatically check correctness. Math and code are the canonical cases. Tasks like "write a good essay" or "explain this concept well" are not directly verifiable.
Base model dependence: GRPO cannot bootstrap reasoning from a model that never gets correct answers in the training distribution. A model with near-zero base accuracy on a task cannot improve via GRPO.
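Length bias: in the standard GRPO formulation the token-level loss is averaged over each response's length, so long incorrect responses are penalized less per token, which nudges the policy toward longer outputs. Separately, dividing by the group std upweights prompts whose reward variance is low, meaning very easy or very hard ones (Liu et al. 2025). Both effects call for careful objective or length-penalty design (as in Kimi).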
Distribution shift: after RLVR, models can show decreased performance on non-RLVR tasks (the reasoning alignment tax, analogous to the general alignment tax from Lec 15). Stage 3 SFT+RLHF in R1 partially recovers this.

Cheat sheet

Formula	What it is
`∇ J = E[R(y) ∇ log π(y)]`	REINFORCE policy gradient (log-derivative trick)
`∇ J = E[(R(y)−b) ∇ log π(y)]`	REINFORCE with baseline (unbiased, lower variance)
`A_i = (r_i−μ) / σ`	GRPO group advantage (group z-score)
`L_clip = min(r(θ)·A, clip(r(θ),1±ε)·A)`	GRPO/PPO clipped objective
`L_GRPO = E[L_clip] − β·KL(π∥π_ref)`	Full GRPO objective with KL regularization

"What I cannot create, I do not understand." — Richard Feynman. You can now implement the full RLVR pipeline: the verifier, the GRPO rollout loop, the group advantage computation, and the clipped policy objective. What you cannot yet create: a verifier for open-ended tasks, a recipe that escapes the base-model floor, or a theory that explains why RL reliably elicits reasoning over other capability-eliciting approaches. Those are open problems.