Language Modeling from Scratch · CS336 · Lecture 16

Alignment II: RLVR & GRPO

RLHF works when humans judge quality — but what if you can just check the answer? When the reward is verifiable (math solution correct or not, code passes tests or not), you don't need a learned reward model at all. This lesson derives RL with Verifiable Rewards (RLVR) from scratch: why verifiable rewards escape reward-hacking, the REINFORCE gradient and why baselines matter, GRPO (Group Relative Policy Optimization — derive the group-normalized advantage A=(r−μ)/σ, worked 8-sample example), the GRPO clipped objective with KL, why dropping the critic cuts memory and complexity versus PPO, and three landmark RLVR systems (DeepSeek-R1, Kimi K1.5, Qwen 3). Five interactive canvases, full PyTorch code, and hand-derived worked examples for every formula.

Prerequisites: CS336 Lec 15 (RLHF, PPO, DPO, reward models). Basic probability (expectation, variance). Gradient intuition from pretraining.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Verifiability Shift

You have an RLHF-trained assistant. It follows instructions, refuses harmful requests, and sounds helpful. But ask it to solve a hard math problem — a competition problem with a numerical answer — and something subtle breaks. The reward model, trained on human preference judgments, learned that confident-sounding responses get high ratings. So the model learns to be confidently wrong.

This is not a small failure. It is structural. The reward model is a proxy — a learned approximation to human judgment. Optimize the policy hard against a proxy and you get reward hacking: the model finds response patterns that score well on the proxy but are not actually correct. For math, the proxy says "sounds good." The verifier says "wrong answer."

The core insight of RLVR. For tasks where you can automatically check correctness — math problems (does the final answer match?), code (do the tests pass?), formal proofs — you do not need a learned reward model at all. The reward function is the verifier. It cannot be hacked because it is exact. Either 42 == 42, or it isn't.

This shift — from learned reward models (subjective, hackable) to verifiable rewards (objective, exact) — is what unlocks the reasoning capabilities of models like DeepSeek-R1, Kimi K1.5, and Qwen 3. The key insight from Lec 15: overoptimization is a problem precisely because the reward model is imperfect. Remove the imperfect RM and the problem largely disappears — in the verifiable domain.

RLHF (Lec 15)
Reward = learned RM(x, y). Human raters label preferences (yw, yl). RM generalizes those labels to new (x, y) pairs. Subjective, noisy, hackable.
↓ when can we do better?
RLVR (This Lesson)
Reward = verifier(x, y). Math: check if final answer matches. Code: run test suite. Proof: formal checker. Objective, exact, cannot be hacked — the verifier is the ground truth.

The recipe is simple in principle: generate multiple responses to a prompt, verify each one, reward the correct ones, penalize the incorrect ones, and update the policy to produce more correct responses. The interesting engineering is in the how — specifically, how to estimate the gradient of expected reward without a learned value function.

RLHF vs RLVR: reward signal comparison

Click a domain to see how the reward signal works. Notice how math/code rewards are binary and exact, while preference rewards are continuous and noisy.

A reward model trained on human preferences gives high scores to confident-sounding responses, causing the policy to produce confident but wrong math answers. This is an example of:

Chapter 1: The REINFORCE Gradient

We want to maximize the expected reward of our policy. Formally: find θ that maximizes J(θ) = Ey ∼ πθ(·|x)[R(y)], where R(y) is the reward (1 if correct, 0 if not). How do we compute ∇θJ?

The problem: R(y) is defined on discrete sequences y. Gradients don't flow through discrete samples. We need the log-derivative trick (also called REINFORCE or the score function estimator). The derivation is three lines:

θ J(θ) = ∇θ Ey~πθ[R(y)]
= ∇θy πθ(y|x) R(y)
= ∑y R(y) · ∇θ πθ(y|x)
= ∑y R(y) · πθ(y|x) · ∇θ log πθ(y|x)
= Ey~πθ [ R(y) · ∇θ log πθ(y|x) ]

The last step uses ∇θ log p = (∇θ p)/p, rearranged as ∇θ p = p · ∇θ log p. The gradient is now an expectation over the policy — we can estimate it with samples.

Intuition. The REINFORCE gradient says: for each sampled response y, compute ∇ log πθ(y|x) (the gradient that would increase y's probability). Then weight it by R(y). High-reward responses get a large upward push. Zero-reward responses get no push. This is "reward-weighted imitation" — the policy imitates its own good outputs.

The variance problem — and why baselines help

REINFORCE is correct but has catastrophically high variance. The issue: if all sampled responses happen to be correct (R=1 for all), the gradient pushes up all of them equally — even the mediocre ones. If all happen to be wrong (R=0), the gradient is zero and nothing updates.

The fix is a baseline b: replace R(y) with (R(y) − b). Any baseline that doesn't depend on y is valid — it doesn't bias the gradient (because E[∇ log π · b] = b · E[∇ log π] = 0 by the policy gradient identity). But it can massively reduce variance by centering the rewards.

θ J(θ) = Ey~πθ [ (R(y) − b) · ∇θ log πθ(y|x) ]

The classic choice of b is a value function V(x) — the expected reward over all possible responses from state x. This is what PPO does: it trains a critic network to estimate V(x), uses (R(y) − V(x)) as the advantage, and this difference has much lower variance because it subtracts out the expected return. But training the value network costs memory and complexity.

Worked numerical example

Suppose we sample 4 responses. R = [1, 0, 0, 1]. Mean R = 0.5. REINFORCE gradient is proportional to [1, 0, 0, 1] — pushes up responses 1 and 4. With baseline b = 0.5: advantages = [0.5, −0.5, −0.5, 0.5] — now pushes up the correct responses AND pushes down the incorrect ones. The gradient signal is richer and variance is lower.

Misconception: REINFORCE with baseline is biased. It is not. The baseline cancels in expectation. Ey~π[(R(y) − b) ∇ log π(y)] = Ey~π[R(y) ∇ log π(y)] − b · Ey~π[∇ log π(y)] = Ey~π[R(y) ∇ log π(y)] − 0. The second term is zero because E[∇ log π] = ∇ E[1] = ∇ 1 = 0. The baseline is provably unbiased.
python
import torch

def reinforce_gradient_estimate(policy, prompts, rewards, baseline=None):
    # prompts: list of input_ids (B, seq_len)
    # rewards: (B,) tensor of scalar rewards (e.g. 1.0 or 0.0 from verifier)
    # baseline: scalar or (B,) tensor — subtracted from rewards

    if baseline is None:
        baseline = 0.0  # vanilla REINFORCE

    advantages = rewards - baseline   # (B,) — reward-weighted signal

    loss = 0.0
    for i, (ids, adv) in enumerate(zip(prompts, advantages)):
        # log prob of this sequence under current policy
        logits = policy(ids.unsqueeze(0)).logits  # (1, L, V)
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_lp = log_probs[0].gather(1, ids[1:].unsqueeze(1)).squeeze(1)
        seq_lp = token_lp.sum()  # sum log-probs over response tokens

        loss = loss - adv * seq_lp  # REINFORCE: maximize adv * log π

    return loss / len(prompts)
The REINFORCE gradient uses the log-derivative trick: ∇θ E[R(y)] = E[R(y) ∇θ log πθ(y)]. Why can we estimate this gradient with samples even though the reward R(y) is non-differentiable w.r.t. θ?

Chapter 2: GRPO: The Group Baseline

PPO solves the variance problem by training a separate value network V(x) to serve as the baseline. This is principled but expensive: the value network has the same size as the policy (billions of parameters), requires its own optimizer state, and must be trained jointly with the policy — doubling memory and adding significant implementation complexity.

Group Relative Policy Optimization (GRPO) asks: what if we don't train a value network at all? Instead, for each prompt x, we sample a group of G responses — say G=8. We run the verifier on each. Some are correct (reward 1), some wrong (reward 0). The group average reward IS the baseline.

The group as the baseline. If 3 out of 8 responses to the same prompt are correct, the average reward is 0.375. A response with reward 1 gets advantage 1 − 0.375 = 0.625 (above average — push it up). A response with reward 0 gets advantage 0 − 0.375 = −0.375 (below average — push it down). No value network needed — the group itself provides the reference point.

Deriving the GRPO advantage

For a group of G responses {y1, ..., yG} sampled from the same prompt x, with rewards {r1, ..., rG}, the group mean and standard deviation are:

μ = (1/G) ∑i=1G ri     σ = √( (1/G) ∑i=1G (ri − μ)2 + ε )

The GRPO advantage for response i is the z-score within the group:

Ai = (ri − μ) / σ

The ε (typically 10−4) prevents division by zero when all responses get the same reward (all correct or all incorrect). Dividing by σ normalizes the advantage scale — a prompt where the group spread is large (some correct, some wrong) produces larger advantage magnitudes than a prompt where all responses are nearly the same (all wrong on a hard problem).

Worked example: 8 samples, 3 correct

G = 8 responses to "Solve: 3x + 7 = 22." Rewards: [1, 0, 1, 0, 0, 1, 0, 0]. Three correct (x=5), five wrong.

μ = (1+0+1+0+0+1+0+0)/8 = 3/8 = 0.375
σ2 = (1/8)[(1−0.375)2×3 + (0−0.375)2×5]
= (1/8)[0.6252×3 + 0.3752×5] = (1/8)[1.172 + 0.703] = 0.234
σ = √0.234 = 0.484
Acorrect = (1 − 0.375) / 0.484 = 0.625 / 0.484 = +1.29
Awrong = (0 − 0.375) / 0.484 = −0.375 / 0.484 = −0.775

The correct responses get a strong positive push (+1.29). The wrong responses get a moderate negative push (−0.775). Compare to REINFORCE without baseline: correct = +1, wrong = 0 — the wrong responses get no signal. The GRPO baseline explicitly suppresses bad responses, not just reinforces good ones.

GRPO group baseline — group size G and correct fraction slider

Adjust the group size and number of correct responses. See per-sample advantages computed live, and compare to vanilla REINFORCE (no baseline).

Group size G 8
Correct responses 3
In GRPO, all 8 responses to a prompt are incorrect (reward = 0 for all). What are the advantages, and what does this mean for the gradient update?

Chapter 3: GRPO Objective & KL

We have advantages Ai. Now we need a stable policy update rule. Direct gradient ascent on E[Ai ∇ log π] can cause large, destabilizing updates if the advantages are large. GRPO borrows PPO's clipped ratio to prevent this.

The importance ratio

The responses yi were sampled from the old policy πold (before this gradient step). We are updating the new policy πθ. The ratio ri(θ) = πθ(yi|x) / πold(yi|x) measures how much the new policy weights each response relative to the old one. In practice, this is the product of per-token probability ratios over the response length.

ri(θ) = πθ(yi|x) / πold(yi|x) = ∏t [ πθ(yi,t|x, yi,<t) / πold(yi,t|x, yi,<t) ]

The GRPO clipped objective

The PPO clip trick: if ri(θ) wanders too far from 1 (the policy has changed a lot since the rollout), clip it. The clipped objective for a single sample is:

Lclip,i(θ) = min( ri(θ) · Ai,   clip(ri(θ), 1−ε, 1+ε) · Ai )

Here ε is typically 0.2. This clip is asymmetric: if Ai > 0 (good response), we clip the ratio at 1+ε so we don't over-increase this response's probability in one step. If Ai < 0 (bad response), we clip at 1−ε so we don't over-decrease.

Adding the KL penalty

Additionally, GRPO adds a KL divergence penalty between the current policy and a frozen reference policy πref (usually the SFT checkpoint). This prevents the policy from drifting so far that it loses its language quality:

LGRPO(θ) = Ex~D Ei=1..G [ Lclip,i(θ) ] − β · KL(πθ(·|x) ∥ πref(·|x))

The KL term can be implemented per-token as a log-ratio penalty added to the reward, or as a separate loss term. DeepSeek-R1 uses it as a separate term; some implementations add it as a token-level reward at each step.

Why keep the KL penalty if rewards are verifiable? Verifiable rewards are exact for correctness, but RLVR can still cause the model to degenerate in language quality — producing grammatically broken, repetitive, or incoherent chains of thought. The KL penalty keeps the model's language distribution close to the SFT reference, preserving fluency even as reasoning improves.

PyTorch GRPO update — full sketch

python
import torch, torch.nn.functional as F

def grpo_loss(policy, ref_policy, old_policy, groups, eps=0.2, beta=0.04):
    # groups: list of dicts {prompt_ids, response_ids, reward}
    # Group responses by prompt, compute advantages within each group

    prompt_to_group = {}
    for g in groups:
        key = tuple(g['prompt_ids'].tolist())
        prompt_to_group.setdefault(key, []).append(g)

    total_loss = 0.0
    n_groups = 0

    for key, grp in prompt_to_group.items():
        rewards = torch.tensor([g['reward'] for g in grp])
        mu = rewards.mean()
        sigma = rewards.std() + 1e-4        # ε for numerical stability
        advantages = (rewards - mu) / sigma   # group z-scores

        for i, g in enumerate(grp):
            ids = g['response_ids']
            adv = advantages[i]

            # Log-probs under current and old policy
            lp_new = seq_log_prob(policy, g['prompt_ids'], ids)
            lp_old = seq_log_prob(old_policy, g['prompt_ids'], ids).detach()
            lp_ref = seq_log_prob(ref_policy, g['prompt_ids'], ids).detach()

            ratio = torch.exp(lp_new - lp_old)              # importance ratio
            clip_ratio = ratio.clamp(1 - eps, 1 + eps)

            # PPO-clip objective (negative because we maximize)
            policy_loss = -torch.min(ratio * adv, clip_ratio * adv)

            # KL penalty: log(π_θ / π_ref)
            kl = lp_new - lp_ref
            total_loss += policy_loss + beta * kl

        n_groups += 1

    return total_loss / n_groups
In the GRPO clipped objective, why do we clip the importance ratio ri(θ) at [1−ε, 1+ε]?

Chapter 4: PPO vs GRPO

GRPO and PPO solve the same problem — policy gradient with variance reduction — but they differ fundamentally in how they compute the baseline. This difference has large practical consequences.

PPO's value network (critic)

PPO trains a separate value network (the critic) Vφ(s) that predicts the expected future reward from any state s. In the LM setting, s = (x, y<t) — the prompt plus all tokens generated so far. The Generalized Advantage Estimate (GAE) uses this critic to compute per-token advantages:

AtPPO = ∑k≥t (γλ)k−t (rk + γ V(sk+1) − V(sk))

In the language model bandit setting (single reward at end of sequence), this simplifies: γ = λ = 1, and the advantage at the terminal step is just R − V(x). But computing V(x) requires a full forward pass through a billion-parameter network for every prefix.

GRPO's group baseline

GRPO replaces V(x) with the group mean μ = mean({r1, ..., rG}). No value network. No critic training. No extra optimizer state. The group of G rollouts provides a Monte Carlo estimate of E[R|x] — and if G is large enough, this estimate is good enough.

PPO (with critic) vs GRPO (group baseline) — architecture comparison

Toggle between the two architectures to see the full computational graph. Note how GRPO eliminates the critic network entirely.

PPO (with critic)GRPO (group baseline)
BaselineVφ(x) — learned value networkμ = mean(group rewards)
Extra modelYes — critic = same size as policyNo — just G rollouts
Memory cost2× (policy + critic params + optimizer states)1× (policy only)
Per-token vs per-sequencePer-token GAE advantagesPer-sequence group advantage
Bias of baselineLow (learned V is good estimate)Higher (Monte Carlo, depends on G)
Implementation complexityHigh (rollout loop, GAE, critic update)Low — can write in ~50 lines
Used inInstructGPT, AlpacaFarm, most RLHFDeepSeekMath, R1, Kimi K1.5, Qwen3
Is GRPO's baseline "valid"? Technically, no. A valid baseline for an unbiased gradient must be state-dependent (depend on x but not y). GRPO's mean is over responses to x, so it is state-dependent — that part is fine. But dividing by the standard deviation of the group is NOT a valid baseline: it changes the scale of the gradient, introducing a small bias. Liu et al. (2025) showed that a corrected version (REINFORCE with leave-one-out) is unbiased. In practice, the bias is small and GRPO works well empirically.
GRPO eliminates the PPO critic (value network). What is the primary practical benefit of this, and what is the cost?

Chapter 5: Reward Design

The verifier provides a binary signal: correct or incorrect. But RLVR practitioners have found that augmenting this signal with format rewards significantly improves training stability and final performance. Reward design is not just about what counts as "correct" — it's about shaping the training signal to elicit the behaviors you want.

DeepSeek-R1-Zero reward components

The R1-Zero recipe uses exactly two reward components:

No process rewards. No length rewards in R1-Zero. No human preference signals. Just: did you format correctly and did you get the right answer?

Why format rewards? Without format enforcement, RL can produce responses where the "final answer" is ambiguous — the model might say the correct number somewhere in the middle of its chain of thought but then revise it to a wrong number at the end. The <think>/<answer> structure makes the extraction unambiguous, allowing exact reward computation.

The verifier implementation

For math: extract the content of the <answer> tag, normalize the expression (strip whitespace, convert fractions, handle multiple valid forms like "1/2" and "0.5"), and compare to ground truth. Equivalence checking can use a symbolic math library (sympy) for exact comparison.

python
import re
from sympy import simplify, sympify

def math_verifier(response: str, ground_truth: str) -> float:
    # Extract answer from structured format
    m = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
    if not m:
        return 0.0   # no structured answer = wrong

    answer_str = m.group(1).strip()

    try:
        # Symbolic equivalence check: is answer - ground_truth == 0?
        diff = simplify(sympify(answer_str) - sympify(ground_truth))
        return 1.0 if diff == 0 else 0.0
    except:
        # Fallback: string comparison after normalization
        return 1.0 if answer_str == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    has_think = '<think>' in response and '</think>' in response
    has_answer = '<answer>' in response and '</answer>' in response
    return 0.1 if (has_think and has_answer) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    return math_verifier(response, ground_truth) + format_reward(response)

The effect of format vs correctness reward

Ablation results from the original GRPO paper (DeepSeekMath) showed: correctness reward alone achieves strong results on math benchmarks. Format reward adds a small but consistent improvement, mainly by reducing the frequency of unstructured responses that make answer extraction ambiguous. Process supervision (rewarding intermediate steps, not just the final answer) added further gains on some benchmarks but at significant labeling cost.

Reward design: correctness-only vs correctness + format — training dynamics

Observe how reward shaping affects training stability. Toggle the format reward and observe the variance in the reward signal over time.

Training steps 100
In RLVR math training, why is it important to include a format reward (requiring <think>/<answer> tags) in addition to the correctness reward?

Chapter 6: DeepSeek-R1

DeepSeek-R1 launched in January 2025 and became a social phenomenon: an open model that matched or exceeded OpenAI o1 on reasoning benchmarks, with a surprisingly simple RL recipe. Understanding R1 requires distinguishing two experiments: R1-Zero (pure RL from base model, no SFT) and R1 (RL with SFT warm-start).

R1-Zero: pure RLVR from a base model

The R1-Zero experiment is philosophically striking. Start from DeepSeek-V3 base (a pretrained but not instruction-tuned model). Apply GRPO with only two rewards: accuracy and format. No SFT, no human preference data, no chain-of-thought demonstrations.

What emerged: the model spontaneously developed long chains of thought. It learned to re-examine its work ("Wait, let me check this again..."). It learned to try multiple approaches. It learned to use the <think> block to explore, then commit to an answer. These behaviors were not taught — they were discovered by RL as strategies that increase expected reward.

The "aha moment" — and why it may be overstated. The R1 paper described a specific checkpoint where the model suddenly learned to allocate more thinking time to hard problems — what they called an "aha moment." Follow-up analyses (Dr. GRPO paper) found two nuances: (1) much of the observed CoT length increase may be due to GRPO's inherent length bias (the group-normalization objective inadvertently incentivizes longer responses, as shown by Liu et al. 2025); (2) the base model DeepSeek-V3 already exhibits "aha moment" reasoning when prompted appropriately — R1-Zero may be eliciting this latent capability rather than creating it. The lesson: emergence claims in RL are hard to validate without careful ablations.

R1: the full recipe

R1 improves on R1-Zero with a more careful pipeline. The key differences:

Stage 1: Reasoning SFT Warm-Start
Collect long chain-of-thought examples (from earlier RL checkpoints or distillation). SFT the model on these first. This avoids the early instability of RL from a cold-started base model and ensures readable CoT format from the start.
Stage 2: GRPO RL
Apply GRPO with accuracy + format rewards + a language consistency reward (penalizes mixed-language output — RL on multilingual models naturally drifts to language-switching in the CoT, which hurts readability).
Stage 3: General SFT + RLHF
After reasoning RL, run standard post-training: SFT on reasoning + non-reasoning data (800k samples for reasoning tasks with R1 as teacher, 200k for general tasks). Then RLHF to align tone and safety.

SFT for reasoning — a little goes a long way

One of the most practically significant findings: for the SFT warm-start, you don't need huge amounts of long CoT data. Even ~1,000 high-quality math + science problems with long CoT responses (sourced from Gemini or R1 itself as teacher) are enough to bootstrap the RL from a stable starting point. The RL then does the heavy lifting of improving accuracy.

Distillation — can small models reason?

After R1 training, DeepSeek generated 800,000 long CoT traces and used them to fine-tune Qwen 2.5 (7B–72B). The distilled models showed strong reasoning — often competitive with much larger models on math benchmarks. This demonstrates that imitation of reasoning traces is a powerful and cheap way to transfer RLVR-derived capabilities.

DeepSeek-R1-Zero applies RLVR directly to the base model without SFT warm-start. What does this experiment most directly demonstrate?

Chapter 7: Showcase: The Full RLVR Loop

This is the complete RLVR training loop — animated. Watch how GRPO actually works: prompt arrives, G completions are sampled, the verifier scores each, advantages are computed within the group, and the policy update nudges probabilities up for winners and down for losers. Over thousands of steps, accuracy climbs and CoT length grows.

RLVR training loop — animated GRPO rollout & update

Press Play to animate the GRPO loop. Adjust group size G and watch how the advantage distribution changes. The accuracy curve at the bottom shows training progress over simulated steps.

Group size G 6
Learning rate (speed) 2

Reading the animation

The top panel shows the current rollout group: G response bubbles, colored green (correct) or red (incorrect). Below each bubble is its advantage — the z-score within the group. The middle panel shows the policy update direction: green arrows mean "increase this response's probability," red arrows mean "decrease." The bottom panel shows accuracy over training steps — the cumulative fraction of prompts where at least one sampled response is correct, rising as RL improves the policy.

What GRPO cannot learn. GRPO can only improve on prompts where it occasionally gets a correct answer — if reward is 0 for every sample in every group, the advantage is always 0 and the gradient is zero. This means GRPO requires the base model to already have some chance of being correct. For very hard problems (competition-level math beyond the base model's capabilities), GRPO alone cannot bootstrap reasoning from scratch — hence the SFT warm-start in R1.

CoT length emergence

One of the most striking observations in R1-Zero training: the average length of the chain-of-thought in the <think> block increases over RL training, even though no explicit length reward is given. The intuition: longer chains of thought give the model more "computation" — more intermediate tokens to reason through — and empirically lead to higher accuracy on hard problems. RL discovers this and learns to generate longer chains when it helps.

Chapter 8: Kimi K1.5 & Qwen 3

DeepSeek-R1 showed that RLVR works. But it is not the only recipe. Two contemporaneous systems — Kimi K1.5 (MoonShot AI) and Qwen 3 (Alibaba) — achieved comparable results with distinct design choices. Comparing them reveals which aspects of RLVR are essential and which are contingent.

Kimi K1.5: DPO-inspired policy gradient

Kimi takes a different algorithmic route. Instead of GRPO's group z-score, they derive their policy gradient objective from the DPO framework (Lecture 15): assume a nonparametric policy, solve for the implied reward in terms of the log-ratio π/πref, then use a squared loss surrogate. The resulting gradient looks like a baselined policy gradient with regularization — similar to GRPO but derived differently.

Key differences from R1:

Kimi's length reward insight. The length reward in Kimi is bidirectional: incorrect responses are incentivized to be shorter than the batch median (cut your losses fast); correct responses are incentivized to be shorter than the median (be efficient). The lambda parameter that controls this reward starts at 0 and is gradually increased mid-training — applying it too early disrupts learning before the model has learned to get correct answers at all.

Qwen 3: RLVR with minimal data

Qwen 3 (released 2025) achieves state-of-the-art results on reasoning benchmarks using a remarkably small RLVR dataset: 3,995 examples. This is not a typo. The entire RLVR stage uses fewer than 4,000 prompts — selected with extreme care for quality and difficulty.

Qwen 3's data selection criteria:

Thinking mode fusion

Qwen 3 introduced a notable capability: controllable "thinking" vs "non-thinking" mode. During training, they mix: (1) standard instruction data without chains of thought, (2) RLVR-trained reasoning data with explicit <think> blocks. At inference, the model can be prompted with a flag that enables or disables the thinking mode. This gives users explicit control over the latency/accuracy tradeoff — a short answer for simple queries, a long reasoned answer for complex ones.

Kimi K1.5, R1, Qwen 3 — RLVR recipe comparison

Compare the key design choices across the three landmark RLVR systems. Each row is a dimension of the recipe; differences are highlighted.

Qwen 3 achieves strong RLVR results with only 3,995 training examples. What selection criterion is MOST responsible for the high sample efficiency?

Chapter 9: Connections

RLVR sits at the intersection of RL theory, LLM post-training, and test-time compute scaling. Here is how this lesson connects to the broader landscape:

Concept from this lessonWhere it leads
REINFORCE gradient derivationPolicy Gradients Gleam — full REINFORCE, variance analysis, TRPO
PPO review & GRPO comparisonCS336 Lec 15 — RLHF, DPO derivation, overoptimization
GRPO group advantage (group z-score)RL Algorithms Gleam — actor-critic, GAE, advantage estimation
KL penalty in RLVRReward & Alignment Gleam — KL geometry, reward-KL tradeoff
Test-time compute (longer CoT)CS336 Lec 17 — inference scaling, best-of-N, PRM-guided search
Distillation from R1CS336 Lec 9 — scaling laws, why distillation is efficient

Limitations of RLVR

RLVR is not a universal recipe. Its key limitations:

Cheat sheet

FormulaWhat it is
∇ J = E[R(y) ∇ log π(y)]REINFORCE policy gradient (log-derivative trick)
∇ J = E[(R(y)−b) ∇ log π(y)]REINFORCE with baseline (unbiased, lower variance)
Ai = (ri−μ) / σGRPO group advantage (group z-score)
Lclip = min(r(θ)·A, clip(r(θ),1±ε)·A)GRPO/PPO clipped objective
LGRPO = E[Lclip] − β·KL(π∥πref)Full GRPO objective with KL regularization
"What I cannot create, I do not understand." — Richard Feynman. You can now implement the full RLVR pipeline: the verifier, the GRPO rollout loop, the group advantage computation, and the clipped policy objective. What you cannot yet create: a verifier for open-ended tasks, a recipe that escapes the base-model floor, or a theory that explains why RL reliably elicits reasoning over other capability-eliciting approaches. Those are open problems.