RLHF works when humans judge quality — but what if you can just check the answer? When the reward is verifiable (math solution correct or not, code passes tests or not), you don't need a learned reward model at all. This lesson derives RL with Verifiable Rewards (RLVR) from scratch: why verifiable rewards escape reward-hacking, the REINFORCE gradient and why baselines matter, GRPO (Group Relative Policy Optimization — derive the group-normalized advantage A=(r−μ)/σ, worked 8-sample example), the GRPO clipped objective with KL, why dropping the critic cuts memory and complexity versus PPO, and three landmark RLVR systems (DeepSeek-R1, Kimi K1.5, Qwen 3). Five interactive canvases, full PyTorch code, and hand-derived worked examples for every formula.
You have an RLHF-trained assistant. It follows instructions, refuses harmful requests, and sounds helpful. But ask it to solve a hard math problem — a competition problem with a numerical answer — and something subtle breaks. The reward model, trained on human preference judgments, learned that confident-sounding responses get high ratings. So the model learns to be confidently wrong.
This is not a small failure. It is structural. The reward model is a proxy — a learned approximation to human judgment. Optimize the policy hard against a proxy and you get reward hacking: the model finds response patterns that score well on the proxy but are not actually correct. For math, the proxy says "sounds good." The verifier says "wrong answer."
This shift — from learned reward models (subjective, hackable) to verifiable rewards (objective, exact) — is what unlocks the reasoning capabilities of models like DeepSeek-R1, Kimi K1.5, and Qwen 3. The key insight from Lec 15: overoptimization is a problem precisely because the reward model is imperfect. Remove the imperfect RM and the problem largely disappears — in the verifiable domain.
The recipe is simple in principle: generate multiple responses to a prompt, verify each one, reward the correct ones, penalize the incorrect ones, and update the policy to produce more correct responses. The interesting engineering is in the how — specifically, how to estimate the gradient of expected reward without a learned value function.
Click a domain to see how the reward signal works. Notice how math/code rewards are binary and exact, while preference rewards are continuous and noisy.
We want to maximize the expected reward of our policy. Formally: find θ that maximizes J(θ) = Ey ∼ πθ(·|x)[R(y)], where R(y) is the reward (1 if correct, 0 if not). How do we compute ∇θJ?
The problem: R(y) is defined on discrete sequences y. Gradients don't flow through discrete samples. We need the log-derivative trick (also called REINFORCE or the score function estimator). The derivation is three lines:
The last step uses ∇θ log p = (∇θ p)/p, rearranged as ∇θ p = p · ∇θ log p. The gradient is now an expectation over the policy — we can estimate it with samples.
REINFORCE is correct but has catastrophically high variance. The issue: if all sampled responses happen to be correct (R=1 for all), the gradient pushes up all of them equally — even the mediocre ones. If all happen to be wrong (R=0), the gradient is zero and nothing updates.
The fix is a baseline b: replace R(y) with (R(y) − b). Any baseline that doesn't depend on y is valid — it doesn't bias the gradient (because E[∇ log π · b] = b · E[∇ log π] = 0 by the policy gradient identity). But it can massively reduce variance by centering the rewards.
The classic choice of b is a value function V(x) — the expected reward over all possible responses from state x. This is what PPO does: it trains a critic network to estimate V(x), uses (R(y) − V(x)) as the advantage, and this difference has much lower variance because it subtracts out the expected return. But training the value network costs memory and complexity.
Suppose we sample 4 responses. R = [1, 0, 0, 1]. Mean R = 0.5. REINFORCE gradient is proportional to [1, 0, 0, 1] — pushes up responses 1 and 4. With baseline b = 0.5: advantages = [0.5, −0.5, −0.5, 0.5] — now pushes up the correct responses AND pushes down the incorrect ones. The gradient signal is richer and variance is lower.
python import torch def reinforce_gradient_estimate(policy, prompts, rewards, baseline=None): # prompts: list of input_ids (B, seq_len) # rewards: (B,) tensor of scalar rewards (e.g. 1.0 or 0.0 from verifier) # baseline: scalar or (B,) tensor — subtracted from rewards if baseline is None: baseline = 0.0 # vanilla REINFORCE advantages = rewards - baseline # (B,) — reward-weighted signal loss = 0.0 for i, (ids, adv) in enumerate(zip(prompts, advantages)): # log prob of this sequence under current policy logits = policy(ids.unsqueeze(0)).logits # (1, L, V) log_probs = torch.log_softmax(logits[:, :-1], dim=-1) token_lp = log_probs[0].gather(1, ids[1:].unsqueeze(1)).squeeze(1) seq_lp = token_lp.sum() # sum log-probs over response tokens loss = loss - adv * seq_lp # REINFORCE: maximize adv * log π return loss / len(prompts)
PPO solves the variance problem by training a separate value network V(x) to serve as the baseline. This is principled but expensive: the value network has the same size as the policy (billions of parameters), requires its own optimizer state, and must be trained jointly with the policy — doubling memory and adding significant implementation complexity.
Group Relative Policy Optimization (GRPO) asks: what if we don't train a value network at all? Instead, for each prompt x, we sample a group of G responses — say G=8. We run the verifier on each. Some are correct (reward 1), some wrong (reward 0). The group average reward IS the baseline.
For a group of G responses {y1, ..., yG} sampled from the same prompt x, with rewards {r1, ..., rG}, the group mean and standard deviation are:
The GRPO advantage for response i is the z-score within the group:
The ε (typically 10−4) prevents division by zero when all responses get the same reward (all correct or all incorrect). Dividing by σ normalizes the advantage scale — a prompt where the group spread is large (some correct, some wrong) produces larger advantage magnitudes than a prompt where all responses are nearly the same (all wrong on a hard problem).
G = 8 responses to "Solve: 3x + 7 = 22." Rewards: [1, 0, 1, 0, 0, 1, 0, 0]. Three correct (x=5), five wrong.
The correct responses get a strong positive push (+1.29). The wrong responses get a moderate negative push (−0.775). Compare to REINFORCE without baseline: correct = +1, wrong = 0 — the wrong responses get no signal. The GRPO baseline explicitly suppresses bad responses, not just reinforces good ones.
Adjust the group size and number of correct responses. See per-sample advantages computed live, and compare to vanilla REINFORCE (no baseline).
We have advantages Ai. Now we need a stable policy update rule. Direct gradient ascent on E[Ai ∇ log π] can cause large, destabilizing updates if the advantages are large. GRPO borrows PPO's clipped ratio to prevent this.
The responses yi were sampled from the old policy πold (before this gradient step). We are updating the new policy πθ. The ratio ri(θ) = πθ(yi|x) / πold(yi|x) measures how much the new policy weights each response relative to the old one. In practice, this is the product of per-token probability ratios over the response length.
The PPO clip trick: if ri(θ) wanders too far from 1 (the policy has changed a lot since the rollout), clip it. The clipped objective for a single sample is:
Here ε is typically 0.2. This clip is asymmetric: if Ai > 0 (good response), we clip the ratio at 1+ε so we don't over-increase this response's probability in one step. If Ai < 0 (bad response), we clip at 1−ε so we don't over-decrease.
Additionally, GRPO adds a KL divergence penalty between the current policy and a frozen reference policy πref (usually the SFT checkpoint). This prevents the policy from drifting so far that it loses its language quality:
The KL term can be implemented per-token as a log-ratio penalty added to the reward, or as a separate loss term. DeepSeek-R1 uses it as a separate term; some implementations add it as a token-level reward at each step.
python import torch, torch.nn.functional as F def grpo_loss(policy, ref_policy, old_policy, groups, eps=0.2, beta=0.04): # groups: list of dicts {prompt_ids, response_ids, reward} # Group responses by prompt, compute advantages within each group prompt_to_group = {} for g in groups: key = tuple(g['prompt_ids'].tolist()) prompt_to_group.setdefault(key, []).append(g) total_loss = 0.0 n_groups = 0 for key, grp in prompt_to_group.items(): rewards = torch.tensor([g['reward'] for g in grp]) mu = rewards.mean() sigma = rewards.std() + 1e-4 # ε for numerical stability advantages = (rewards - mu) / sigma # group z-scores for i, g in enumerate(grp): ids = g['response_ids'] adv = advantages[i] # Log-probs under current and old policy lp_new = seq_log_prob(policy, g['prompt_ids'], ids) lp_old = seq_log_prob(old_policy, g['prompt_ids'], ids).detach() lp_ref = seq_log_prob(ref_policy, g['prompt_ids'], ids).detach() ratio = torch.exp(lp_new - lp_old) # importance ratio clip_ratio = ratio.clamp(1 - eps, 1 + eps) # PPO-clip objective (negative because we maximize) policy_loss = -torch.min(ratio * adv, clip_ratio * adv) # KL penalty: log(π_θ / π_ref) kl = lp_new - lp_ref total_loss += policy_loss + beta * kl n_groups += 1 return total_loss / n_groups
GRPO and PPO solve the same problem — policy gradient with variance reduction — but they differ fundamentally in how they compute the baseline. This difference has large practical consequences.
PPO trains a separate value network (the critic) Vφ(s) that predicts the expected future reward from any state s. In the LM setting, s = (x, y<t) — the prompt plus all tokens generated so far. The Generalized Advantage Estimate (GAE) uses this critic to compute per-token advantages:
In the language model bandit setting (single reward at end of sequence), this simplifies: γ = λ = 1, and the advantage at the terminal step is just R − V(x). But computing V(x) requires a full forward pass through a billion-parameter network for every prefix.
GRPO replaces V(x) with the group mean μ = mean({r1, ..., rG}). No value network. No critic training. No extra optimizer state. The group of G rollouts provides a Monte Carlo estimate of E[R|x] — and if G is large enough, this estimate is good enough.
Toggle between the two architectures to see the full computational graph. Note how GRPO eliminates the critic network entirely.
| PPO (with critic) | GRPO (group baseline) | |
|---|---|---|
| Baseline | Vφ(x) — learned value network | μ = mean(group rewards) |
| Extra model | Yes — critic = same size as policy | No — just G rollouts |
| Memory cost | 2× (policy + critic params + optimizer states) | 1× (policy only) |
| Per-token vs per-sequence | Per-token GAE advantages | Per-sequence group advantage |
| Bias of baseline | Low (learned V is good estimate) | Higher (Monte Carlo, depends on G) |
| Implementation complexity | High (rollout loop, GAE, critic update) | Low — can write in ~50 lines |
| Used in | InstructGPT, AlpacaFarm, most RLHF | DeepSeekMath, R1, Kimi K1.5, Qwen3 |
The verifier provides a binary signal: correct or incorrect. But RLVR practitioners have found that augmenting this signal with format rewards significantly improves training stability and final performance. Reward design is not just about what counts as "correct" — it's about shaping the training signal to elicit the behaviors you want.
The R1-Zero recipe uses exactly two reward components:
<think> tags and the final answer in <answer> tags. Penalizes responses that skip the structured format.No process rewards. No length rewards in R1-Zero. No human preference signals. Just: did you format correctly and did you get the right answer?
<think>/<answer> structure makes the extraction unambiguous, allowing exact reward computation.For math: extract the content of the <answer> tag, normalize the expression (strip whitespace, convert fractions, handle multiple valid forms like "1/2" and "0.5"), and compare to ground truth. Equivalence checking can use a symbolic math library (sympy) for exact comparison.
python import re from sympy import simplify, sympify def math_verifier(response: str, ground_truth: str) -> float: # Extract answer from structured format m = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL) if not m: return 0.0 # no structured answer = wrong answer_str = m.group(1).strip() try: # Symbolic equivalence check: is answer - ground_truth == 0? diff = simplify(sympify(answer_str) - sympify(ground_truth)) return 1.0 if diff == 0 else 0.0 except: # Fallback: string comparison after normalization return 1.0 if answer_str == ground_truth.strip() else 0.0 def format_reward(response: str) -> float: has_think = '<think>' in response and '</think>' in response has_answer = '<answer>' in response and '</answer>' in response return 0.1 if (has_think and has_answer) else 0.0 def total_reward(response: str, ground_truth: str) -> float: return math_verifier(response, ground_truth) + format_reward(response)
Ablation results from the original GRPO paper (DeepSeekMath) showed: correctness reward alone achieves strong results on math benchmarks. Format reward adds a small but consistent improvement, mainly by reducing the frequency of unstructured responses that make answer extraction ambiguous. Process supervision (rewarding intermediate steps, not just the final answer) added further gains on some benchmarks but at significant labeling cost.
Observe how reward shaping affects training stability. Toggle the format reward and observe the variance in the reward signal over time.
DeepSeek-R1 launched in January 2025 and became a social phenomenon: an open model that matched or exceeded OpenAI o1 on reasoning benchmarks, with a surprisingly simple RL recipe. Understanding R1 requires distinguishing two experiments: R1-Zero (pure RL from base model, no SFT) and R1 (RL with SFT warm-start).
The R1-Zero experiment is philosophically striking. Start from DeepSeek-V3 base (a pretrained but not instruction-tuned model). Apply GRPO with only two rewards: accuracy and format. No SFT, no human preference data, no chain-of-thought demonstrations.
What emerged: the model spontaneously developed long chains of thought. It learned to re-examine its work ("Wait, let me check this again..."). It learned to try multiple approaches. It learned to use the <think> block to explore, then commit to an answer. These behaviors were not taught — they were discovered by RL as strategies that increase expected reward.
R1 improves on R1-Zero with a more careful pipeline. The key differences:
One of the most practically significant findings: for the SFT warm-start, you don't need huge amounts of long CoT data. Even ~1,000 high-quality math + science problems with long CoT responses (sourced from Gemini or R1 itself as teacher) are enough to bootstrap the RL from a stable starting point. The RL then does the heavy lifting of improving accuracy.
After R1 training, DeepSeek generated 800,000 long CoT traces and used them to fine-tune Qwen 2.5 (7B–72B). The distilled models showed strong reasoning — often competitive with much larger models on math benchmarks. This demonstrates that imitation of reasoning traces is a powerful and cheap way to transfer RLVR-derived capabilities.
This is the complete RLVR training loop — animated. Watch how GRPO actually works: prompt arrives, G completions are sampled, the verifier scores each, advantages are computed within the group, and the policy update nudges probabilities up for winners and down for losers. Over thousands of steps, accuracy climbs and CoT length grows.
Press Play to animate the GRPO loop. Adjust group size G and watch how the advantage distribution changes. The accuracy curve at the bottom shows training progress over simulated steps.
The top panel shows the current rollout group: G response bubbles, colored green (correct) or red (incorrect). Below each bubble is its advantage — the z-score within the group. The middle panel shows the policy update direction: green arrows mean "increase this response's probability," red arrows mean "decrease." The bottom panel shows accuracy over training steps — the cumulative fraction of prompts where at least one sampled response is correct, rising as RL improves the policy.
One of the most striking observations in R1-Zero training: the average length of the chain-of-thought in the <think> block increases over RL training, even though no explicit length reward is given. The intuition: longer chains of thought give the model more "computation" — more intermediate tokens to reason through — and empirically lead to higher accuracy on hard problems. RL discovers this and learns to generate longer chains when it helps.
DeepSeek-R1 showed that RLVR works. But it is not the only recipe. Two contemporaneous systems — Kimi K1.5 (MoonShot AI) and Qwen 3 (Alibaba) — achieved comparable results with distinct design choices. Comparing them reveals which aspects of RLVR are essential and which are contingent.
Kimi takes a different algorithmic route. Instead of GRPO's group z-score, they derive their policy gradient objective from the DPO framework (Lecture 15): assume a nonparametric policy, solve for the implied reward in terms of the log-ratio π/πref, then use a squared loss surrogate. The resulting gradient looks like a baselined policy gradient with regularization — similar to GRPO but derived differently.
Key differences from R1:
Qwen 3 (released 2025) achieves state-of-the-art results on reasoning benchmarks using a remarkably small RLVR dataset: 3,995 examples. This is not a typo. The entire RLVR stage uses fewer than 4,000 prompts — selected with extreme care for quality and difficulty.
Qwen 3's data selection criteria:
Qwen 3 introduced a notable capability: controllable "thinking" vs "non-thinking" mode. During training, they mix: (1) standard instruction data without chains of thought, (2) RLVR-trained reasoning data with explicit <think> blocks. At inference, the model can be prompted with a flag that enables or disables the thinking mode. This gives users explicit control over the latency/accuracy tradeoff — a short answer for simple queries, a long reasoned answer for complex ones.
Compare the key design choices across the three landmark RLVR systems. Each row is a dimension of the recipe; differences are highlighted.
RLVR sits at the intersection of RL theory, LLM post-training, and test-time compute scaling. Here is how this lesson connects to the broader landscape:
| Concept from this lesson | Where it leads |
|---|---|
| REINFORCE gradient derivation | Policy Gradients Gleam — full REINFORCE, variance analysis, TRPO |
| PPO review & GRPO comparison | CS336 Lec 15 — RLHF, DPO derivation, overoptimization |
| GRPO group advantage (group z-score) | RL Algorithms Gleam — actor-critic, GAE, advantage estimation |
| KL penalty in RLVR | Reward & Alignment Gleam — KL geometry, reward-KL tradeoff |
| Test-time compute (longer CoT) | CS336 Lec 17 — inference scaling, best-of-N, PRM-guided search |
| Distillation from R1 | CS336 Lec 9 — scaling laws, why distillation is efficient |
RLVR is not a universal recipe. Its key limitations:
| Formula | What it is |
|---|---|
∇ J = E[R(y) ∇ log π(y)] | REINFORCE policy gradient (log-derivative trick) |
∇ J = E[(R(y)−b) ∇ log π(y)] | REINFORCE with baseline (unbiased, lower variance) |
Ai = (ri−μ) / σ | GRPO group advantage (group z-score) |
Lclip = min(r(θ)·A, clip(r(θ),1±ε)·A) | GRPO/PPO clipped objective |
LGRPO = E[Lclip] − β·KL(π∥πref) | Full GRPO objective with KL regularization |