Stanford CS 224R · Lecture 18 · The Capstone

Frontiers of Deep RL & How to Do Research

First we'll snap the whole course into one mental model: deep RL is a toolbox, and every algorithm is a different combination of the same parts. Then we'll walk the open frontier — the problems nobody has solved. Then, the meta-skill the course is really about: how to actually do research.

Roadmap: The RL toolbox · 7 open frontiers · How to pick problems · How to handle risk

What You'll Master

This is the last lecture (Chelsea Finn, Stanford). It does three jobs. One: it organizes everything you learned into a single framework so the algorithms stop feeling like a list and start feeling like a design space. Two: it tours the genuine research frontier — seven problems that are wide open. Three: it teaches the thing a course can't put on a problem set — how to choose what to work on, how to de-risk it, and how to share it.

01 · The Unifying Frame

Deep RL Is a Toolbox

By now you've met a dozen algorithms with intimidating names: REINFORCE, PPO, SAC, DQN, AWR, AWAC, IQL, DAgger. They feel like a zoo. Here's the liberating truth: they're all built from the same small set of parts. Every RL algorithm is a choice of what data to use, how to update the policy, how to estimate "goodness," and which optional tools to bolt on. Learn the parts, and every algorithm becomes a recipe you can read — or invent.

The five slots every RL algorithm fills

Think of building an algorithm as filling five slots. Change a slot, get a different (often named) algorithm. That's the entire field on one shelf.

Slot | Options | What it decides
1. Data | Expert demos · stored experience (replay buffer) · fresh policy rollouts | Where the learning signal comes from, and whether it's offline, off-policy, or on-policy.
2. Policy update | Supervised behavior cloning · policy gradient · actor-critic · Q-learning | How the policy actually changes.
3. Reward | Given/annotated · learned from examples or preferences · self-supervised (goal-conditioned) | What "good" means. This is the most fragile slot, as the frontiers will show.
4. Value learning | None · Monte Carlo · TD-learning · n-step returns | How you estimate long-run value to reduce variance or enable bootstrapping.
5. Model class | Gaussian · categorical · diffusion/flow · autoregressive | The neural net family representing the policy.

And then there are optional power tools you snap on when the use-case demands, such as a clipped trust region, an entropy term, target networks, or an asymmetric value loss.

The payoff

Many algorithms simply mix and match these tools for the needs of the use-case. PPO is "policy gradient + a clipped trust region + a value baseline." SAC is "actor-critic + replay buffer + entropy." IQL is "offline + asymmetric value loss + advantage-weighted update." Once you see the slots, you can read a new paper's algorithm by asking: which five choices did they make, and which power tools did they add?
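One way to make the slot view concrete is to write the three recipes quoted above as plain data. This is only a minimal sketch: the `Recipe` class, its field names, and the `shared_tools` helper are hypothetical, and only the choices the text actually states are filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical container: one algorithm as a set of toolbox choices."""
    core: str                # the main choice the text names first
    power_tools: tuple = ()  # the add-ons the text lists after it


# Only what the text states; every other slot is deliberately left out.
RECIPES = {
    "PPO": Recipe("policy gradient", ("clipped trust region", "value baseline")),
    "SAC": Recipe("actor-critic", ("replay buffer", "entropy")),
    "IQL": Recipe("offline", ("asymmetric value loss", "advantage-weighted update")),
}


def shared_tools(a: str, b: str) -> set:
    """Return the add-ons that two named recipes have in common."""
    return set(RECIPES[a].power_tools) & set(RECIPES[b].power_tools)


for name, recipe in RECIPES.items():
    print(f"{name}: {recipe.core} + {' + '.join(recipe.power_tools)}")
```

Reading a new paper then amounts to filling in one more `Recipe` and checking which entries differ from an algorithm you already know.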

Treated as a design space, the whole course fits in these slots, and only a few choices separate REINFORCE from SAC.

02 · The Online Family

The Online Algorithm Map

Zoom into the model-free online algorithms — the spine of the course. Four families, and you can read each one off four questions. This is the single most useful table to have memorized.

Question | Vanilla PG | PPO-like | Off-policy AC | Q-learning (SAC)
What data? | On-policy | Technically off-policy, called on-policy | Off-policy, replay buffer | Off-policy, replay buffer
How off-policy? | n/a | Importance weights | Fit value with TD, sample actions from the policy | Fit the optimal value with TD
Fit a value function? | No | On-policy value | On-policy action-value | Optimal action-value
Estimate goodness via | Reward-to-go | Advantage (or GAE) | TD / n-step returns on the critic | Max over actions of the critic
The axis hiding inside the table

Read it left to right and one quantity drifts: how aggressively you reuse data. Vanilla PG burns each batch once (on-policy, highest variance, simplest). PPO stretches the data a little with importance weights and clipping. Off-policy actor-critic reuses a whole replay buffer. Q-learning reuses everything and even learns the optimal value directly. More reuse means more sample-efficiency but more instability to manage. Every "new" online algorithm is a point on this reuse–stability spectrum.
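The "Estimate goodness via" row of the table can be sketched as three small functions. This is an illustrative sketch, not any library's API: the function names and the default discount `gamma=0.99` are assumptions made here.

```python
def reward_to_go(rewards, gamma=0.99):
    """Vanilla PG: discounted sum of the rewards from each step onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def td_target(reward, next_value, gamma=0.99, done=False):
    """Actor-critic: one-step bootstrapped target r + gamma * V(s')."""
    return reward + (0.0 if done else gamma * next_value)


def q_learning_target(reward, next_q_values, gamma=0.99, done=False):
    """Q-learning: bootstrapped target using the max over next actions."""
    return reward + (0.0 if done else gamma * max(next_q_values))


print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Reward-to-go needs the whole on-policy trajectory. The two bootstrapped targets need only a single stored transition plus a learned critic, which is what lets them draw on a replay buffer.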

Why "is PPO on-policy or off-policy?" is a trick question

PPO uses importance weights, which is technically an off-policy correction — it reuses data from a slightly older policy for a few epochs. But the data is always nearly on-policy (that's what the clip enforces), so everyone calls it on-policy. The honest answer: "technically off-policy, practically on-policy." The table makes peace with this instead of pretending it away.
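The importance weight and the clip can be seen together in the per-sample clipped surrogate term. This is a minimal sketch of the standard PPO clipped objective for a single sample; the function name and the value `eps=0.2` are illustrative choices made here, not taken from the lecture.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate.

    ratio     : pi_new(a|s) / pi_old(a|s), the importance weight
    advantage : advantage estimate for that (s, a)
    eps       : clip range; 0.2 is only an illustrative value
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min makes the objective pessimistic: the policy gets no
    # extra credit for moving the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped.
print(ppo_clip_term(ratio=1.5, advantage=1.0))
# Good action, policy moved away from it: no clipping, full penalty.
print(ppo_clip_term(ratio=0.5, advantage=1.0))
```

Because the incentive vanishes once the ratio leaves the clip range, the data being reused stays close to what the current policy would have collected, which is the "practically on-policy" half of the answer.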

🧮 Reason It Through: Why is the algorithm map secretly a reuse-vs-stability spectrum?
Vanilla PG, PPO, off-policy actor-critic, and Q-learning sit left to right in the table. Explain what single quantity increases across that order, and why "more of it" buys sample-efficiency but costs stability.
Vanilla PG throws each batch away after one update. PPO squeezes a few epochs out of it. Off-policy AC keeps a whole replay buffer. Q-learning reuses everything and learns the optimal value. The trend: data reuse goes up.

The increasing quantity is data reuse (off-policy-ness). Left to right, each method extracts learning from data collected by ever-older policies.

Why reuse buys sample-efficiency: real data (especially robot/human data) is expensive. Reusing each transition many times means fewer fresh rollouts per unit of learning — the whole reason off-policy methods exist.

Why it costs stability: learning from data collected by a different policy is exactly distribution shift. Importance weights can explode; bootstrapped value targets (TD) can diverge; the max in Q-learning over-estimates values for unseen actions. Each step right adds machinery to manage that instability (PPO's clip, target networks, asymmetric losses). So the spectrum is a dial: turn toward reuse for efficiency, and pay for it with stability engineering. Every "new" online algorithm is just a different setting of that dial.
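The over-estimation claim is easy to check numerically. In the toy sketch below, every action has a true value of exactly 0, yet the max over noisy estimates is positive on average. The Gaussian noise, the number of actions, and the function name are all assumptions made for this illustration.

```python
import random


def mean_max_of_noisy_estimates(n_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0.

    Each Q_hat(a) is the true value (0) plus independent Gaussian noise,
    so any positive result is pure over-estimation caused by the max.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials


print("1 action  :", round(mean_max_of_noisy_estimates(n_actions=1), 3))
print("10 actions:", round(mean_max_of_noisy_estimates(n_actions=10), 3))
```

With one action the average stays near 0, and with more actions the bias grows. That is the effect the stability machinery for Q-learning has to counteract.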

03 · The Bigger Map

The Four Data Settings

Step up one more level. Forget specific algorithms; just ask what data are you allowed to use? Two axes — imitation vs. RL, and offline vs. online — carve the whole field into four quadrants. Knowing which quadrant you're in tells you which tools are even available.

The vertical axis is imitation (need expert data, no reward) vs. RL (need reward). The horizontal axis is offline (fixed dataset) vs. online (collect new data). Each quadrant has its own meaning, canonical algorithm, and requirements.
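The two axes can be sketched as a two-question lookup. This is a simplification made here for illustration: it reduces the vertical axis to "do you have a reward?" and the horizontal axis to "can you collect new data?". The function and dictionary names are hypothetical, and the canonical methods are the ones the lecture lists for each quadrant.

```python
def data_setting(has_reward: bool, can_collect_online: bool) -> str:
    """Map the two yes/no questions to one of the four quadrants."""
    family = "RL" if has_reward else "imitation"
    regime = "online" if can_collect_online else "offline"
    return f"{regime} {family}"


# Canonical methods per quadrant, as listed in the lecture.
CANONICAL = {
    "offline imitation": "behavior cloning",
    "online imitation": "DAgger",
    "offline RL": "AWR / AWAC / IQL",
    "online RL": "SAC / PPO",
}

# Logged data with rewards, no way to collect more:
setting = data_setting(has_reward=True, can_collect_online=False)
print(setting, "->", CANONICAL[setting])
```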

Setting | Canonical method | Needs | Key tradeoff
Offline imitation | Behavior cloning | Expert demos. No reward. | Simplest, but can't beat the demonstrator and drifts off-distribution.
Online imitation | DAgger | Expert in the loop. No reward. | Fixes drift by querying the expert on visited states, but needs the expert available online.
Offline RL | AWR, AWAC, IQL | Reward-labeled logged data. Can reuse data from other policies. | No new data needed, but must avoid over-trusting actions absent from the data.
Online RL (off-policy / on-policy) | SAC / PPO | Reward + the ability to collect new rollouts. | Most powerful, most data-hungry, requires online interaction.
One worked reading of the map

You have a pile of logged robot data with rewards but no robot to collect more, and no expert to query. Which quadrant? Offline + RL = offline RL. So your toolbox is AWR/AWAC/IQL, and your central worry is distribution shift — the asymmetric value loss (IQL) exists precisely to learn a good value without querying actions your dataset never tried. The map didn't just classify your problem; it handed you the method and the failure mode to watch.
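The asymmetric value loss mentioned above is an expectile regression loss. The sketch below shows its per-sample form, assuming the usual formulation |tau - 1(u < 0)| * u^2 with u = Q(s, a) - V(s); the function name and the value `tau=0.7` are illustrative choices made here.

```python
def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric squared error between a dataset Q-value and V(s).

    With tau > 0.5, under-estimating V (q_value > v_value) costs more than
    over-estimating it, so V is pulled toward the better actions that
    actually appear in the dataset. No action outside the data is queried.
    """
    diff = q_value - v_value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


print(expectile_loss(q_value=1.0, v_value=0.0))  # V too low: weight tau
print(expectile_loss(q_value=0.0, v_value=1.0))  # V too high: weight 1 - tau
```

Setting `tau=0.5` recovers an ordinary (scaled) squared error, and pushing `tau` toward 1 makes V behave more like a max over in-dataset actions.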

🧰 The course in one sentence
Deep RL is a rich toolbox, and competent practice is mixing and matching tools to fit the constraints of your problem — what data you have, what reward you can write, how much you can afford to collect. You now hold the whole toolbox. The rest of the lecture asks: where does the toolbox run out?
04 · Frontier 1

When Rewards Don't Exist

Everything in the toolbox assumes you have a reward — a number saying how good each outcome was. In games, that's free: the score, the win. But step outside games and the reward often doesn't exist, or arrives far too late to be useful. This is arguably the deepest open problem in the field, because without a reward, none of the machinery turns.

The reward problem, domain by domain

Look how badly we currently paper over it:

Domain | How we fake a reward today | Why it's broken
LLM chatbots | Preference optimization: learn what humans say they prefer. | People reward "what they want to hear." Models learn to agree with you and sound confident, not to be true: sycophancy. Preferences also carry competing objectives, such as personalization vs. polarization.
Robotics | Binary success, or hand-shaped rewards. | How would you score a shirt fold from 0 to 1? Most good tasks have no clean numeric reward, and hand-shaping is brittle and gameable.
YouTube recs | A weighted blend of engagement (clicks) and satisfaction (likes). | The weights are tuned by hand. A human picks the tradeoff, and the system optimizes a number nobody can justify.
Science / learning | Can a machine reward itself for discovering something, or for making a human learn? | We barely know how to define the objective.
The sycophancy trap, made concrete

When you optimize a chatbot on human preferences, you're training it to maximize the chance a human clicks "I like this." But humans don't reliably prefer the true answer — they prefer the one that agrees with them and sounds sure. So preference-optimized models drift toward telling you what you want to hear. The reward is a proxy, and the model exploits the gap between the proxy ("looks good to a rater") and the goal ("is actually correct"). This is the reward-design gap from robotics, reappearing in language.
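The proxy gap can be put in numbers with a Bradley-Terry preference model, a common choice for fitting rewards to pairwise comparisons. The lecture text does not name a specific model, so that choice is an assumption here, as are the function names and the 70% figure.

```python
import math


def bt_preference_prob(reward_a, reward_b):
    """Bradley-Terry: probability that A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_reward_gap(win_rate):
    """Maximum-likelihood reward gap r_A - r_B for two items,
    given the fraction of comparisons that A wins."""
    return math.log(win_rate / (1.0 - win_rate))


# Hypothetical data: raters pick the agreeable-but-wrong answer over the
# correct one in 70% of comparisons.
gap = bt_reward_gap(0.70)
print("learned reward(agreeable) - reward(correct) =", round(gap, 3))
print("implied preference probability =", round(bt_preference_prob(gap, 0.0), 3))
```

The fitted reward is faithful to the raters, and that is the problem: any policy optimized against it is paid to be agreeable, not correct.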

Ask five raters to score the same shirt fold and see how little they agree. That disagreement is exactly why a clean reward function is so hard to write: there is no ground-truth number.

The open question

How do we get reward signals for non-rewarding, non-verifiable domains — without exploitable proxies? Learned rewards, self-supervised goals, and better-than-preference signals are all active research. No one has nailed it.

05 · Frontier 2

Leveraging Prior Knowledge

RL from scratch is wasteful — the world already contains pretrained models, demonstrations, and written human knowledge. How do we pour all of that into an RL agent? The default tool is simple: initialize the model weights from a pretrained model, and seed the replay buffer with offline data. But that default has two cracks worth staring at.

Crack 1 — abstract knowledge has nowhere to go

Weights and buffers absorb experiential data fine. But what about abstract prior knowledge — a hint, a rule of thumb, a fact from a news article? "The floor by the window is slippery." "Customers hate waiting." There's no clean slot in current RL to inject a sentence of advice. Humans learn enormously from being told things; our agents mostly can't.

Crack 2 — do pretrained weights constrain too much?

Initializing from pretrained weights might over-constrain learning — locking the agent into the pretraining distribution so it can never discover genuinely new behavior. There's a provocative alternative (RLPD, Ball et al. 2023): initialize only the replay buffer with offline data, but start the weights fresh. Let the data inform exploration without the weights anchoring the model to old solutions.
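The buffer-only alternative can be sketched structurally: offline transitions go into the replay buffer, while the network parameters come from a fresh random initialization. Everything below is a hypothetical illustration (class name, helper, tuple layout). Uniform sampling over the merged buffer is a simplification made here; the text does not say how offline and online samples are balanced.

```python
import random


class SeededReplayBuffer:
    """Replay buffer that starts out filled with offline transitions."""

    def __init__(self, offline_transitions, seed=0):
        self.data = list(offline_transitions)  # prior knowledge lives here
        self.rng = random.Random(seed)

    def add(self, transition):
        """Append a freshly collected online transition."""
        self.data.append(transition)

    def sample(self, batch_size):
        """Uniform sample over offline and online data together."""
        return [self.rng.choice(self.data) for _ in range(batch_size)]


def fresh_weights(n_params, seed=0):
    """Random initialization: no pretrained checkpoint is loaded."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(n_params)]


# Hypothetical (state, action, reward, next_state) tuples.
offline = [(0, 1, 0.0, 1), (1, 0, 1.0, 2)]
buffer = SeededReplayBuffer(offline)
weights = fresh_weights(4)
buffer.add((2, 1, 0.5, 3))  # one new online transition
batch = buffer.sample(8)
```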

The ambition behind this frontier

The real questions are audacious: How can LLMs go beyond their pretraining to solve problems humans haven't solved? How can robots learn to do tasks faster and more reliably than the humans who demonstrated them? Pretraining gives you a floor of human-level competence. The frontier is using prior knowledge as a launchpad past it, not a ceiling that traps you at it.

06 · Frontier 3

Using World Models

Video generation models (think Veo) have learned astonishing, nuanced world knowledge — how objects fall, pour, deform. They should be a goldmine for RL: a learned simulator you can plan inside. But turning a video model into a usable world model is full of sharp edges.

The out-of-distribution trap

Here's the canonical failure. Suppose you train a world model to predict the next frames given the current state and an action, using demonstrations plus one policy's rollouts. Now you want to evaluate a new policy by asking the model: "if the new policy takes these actions, do good things happen?" But the new policy's actions are out of distribution — the world model never saw them. And small physical inaccuracies in the prediction compound into wildly wrong outcomes (the same compounding-error demon from sim2real). So the world model confidently hallucinates a future that won't happen, and your evaluation is garbage.
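The compounding-error effect shows up even in a one-dimensional toy. In the sketch below the "world" multiplies the state by a fixed gain each step, and the "model" has learned that gain with a 0.01 error. The linear dynamics and all the numbers are assumptions chosen purely for illustration.

```python
def rollout(x0, gain, steps):
    """Roll a scalar linear system x_{t+1} = gain * x_t forward."""
    xs = [x0]
    for _ in range(steps):
        xs.append(gain * xs[-1])
    return xs


def prediction_error(x0=1.0, true_gain=1.05, model_gain=1.06, steps=20):
    """Absolute gap between real and imagined states at every step."""
    real = rollout(x0, true_gain, steps)
    imagined = rollout(x0, model_gain, steps)
    return [abs(a - b) for a, b in zip(real, imagined)]


errors = prediction_error()
print("error after 1 step  :", round(errors[1], 4))
print("error after 20 steps:", round(errors[20], 4))
```

A one-step error that looks negligible grows by more than an order of magnitude over the rollout, because each prediction is fed back in as the next input.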

Two escape routes being explored

Route 1 — broaden the data: train the world model on rollouts from many policies, so new policies' actions are less out-of-distribution. Route 2 — use the model differently: instead of predicting action-conditioned futures (fragile), train it to predict plausible future video from the current state alone (on demos), then run a goal-conditioned policy to reach that imagined future. You sidestep the OOD-action problem by never feeding it untrusted actions.

🌍 Callback to Lecture 16
This is the Sim2Real 4.0 vision from the simulation lecture, arriving from the other direction: a generative simulator (a world model) instead of a hand-built physics engine. Same dream — a cheap, rich model of the world to train and plan in — and the same enemy: out-of-distribution inputs and compounding inaccuracy. The world-model frontier is sim2real's frontier wearing a neural net.
07 · Frontier 4

How to Scale Up

Large-scale RL for LLMs is the most exciting thing happening in the field. And yet, look closely and today's successes share two quiet limitations: they are short-horizon, and they lean heavily on online data.

The question: long horizons with less online data

Can we do large-scale RL with longer horizons and less online data? Two concrete sub-problems:

(1) Accurate value functions at scale. Algorithms like PPO only use the value function to reduce gradient variance — it doesn't have to be very accurate. But actor-critic methods need an accurate value function. Can we train and trust value functions at foundation scale? Mostly unsolved.
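The claim that a variance-reducing baseline need not be accurate can be checked exactly on a toy problem. The sketch below uses a two-action policy with a single logit and computes the mean and variance of the policy-gradient estimate by enumeration. The setup, names, and reward values are all assumptions made for illustration.

```python
import math


def pg_mean_and_variance(theta, rewards, baseline):
    """Exact mean and variance of grad log pi(a) * (r(a) - baseline)
    for a two-action policy with pi(a=1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    probs = (1.0 - p1, p1)
    score = (-p1, 1.0 - p1)  # d/dtheta log pi(a) for a = 0 and a = 1
    samples = [score[a] * (rewards[a] - baseline) for a in (0, 1)]
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    var = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, var


rewards = (1.0, 2.0)
for b in (0.0, 1.2, 1.5):  # none, inaccurate, and exact mean-reward baseline
    mean, var = pg_mean_and_variance(theta=0.0, rewards=rewards, baseline=b)
    print(f"baseline={b}: mean={mean:.4f} variance={var:.4f}")
```

The mean gradient is identical for every baseline, so an inaccurate value estimate only costs some variance reduction. A critic used for bootstrapped targets has no such safety net, which is why its accuracy matters.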

(2) Batch online RL. Tightly interleaving model updates and data collection is impractical for big models — especially when collecting dialog with real users or data on real robots. More practical: a few iterations of "collect a large batch, then update." That batch regime raises new needs, like expressive policies for enough data breadth, possibly collected asynchronously.

🔗 You saw this exact problem last lecture
"Collect a large batch, then update" with "expressive policies for data breadth" is precisely the VLA-RL story from Lecture 17 (offline-as-supervised, diffusion steering, edit policies). Robots and LLMs are hitting the same scaling wall — online RL is too expensive at foundation scale — and converging on the same answer: batched, offline-leaning RL with expressive models.
08 · Frontiers 5–7

Safety, Errors & Evaluation

The last cluster of frontiers is about what happens when these systems meet the real world and real people. Three hard problems.

Frontier 5 — Safety

How should we develop and test AI in safety-critical domains such as medicine, driving, mental-health counseling, and legal and political discourse? The uncomfortable tension: large-scale ML is our most successful tool for open-world situations, but the obvious way to handle unsafe circumstances (collect lots of data of unsafe incidents) has had horrific real-world consequences. So the open questions are how to learn about unsafe situations without collecting unsafe data, and how to explore safely.

Frontier 6 — Inaccuracies & hallucinations

A startling result about human-AI teams

When AI worked alone to diagnose patients, it hit 92% accuracy. Physicians using AI assistance reached only 76% — barely above the 74% they got with no AI at all. The bottleneck isn't the model's accuracy; it's the interface between human and AI. We're not optimizing for good human-AI systems.

Two threads here. (1) Can we optimize for better human-AI systems — can models better estimate and convey their uncertainty? A sobering finding: RLHF post-training actually hurt a model's calibration — the post-trained model's confidence matched its accuracy worse than before. One direction: calibration via verbalized confidence and listing multiple guesses. (2) When humans are not in the loop, how do we reach 99.99% reliability? RL is likely part of the answer, with promising early results.
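"Confidence matched its accuracy worse" can be made measurable with expected calibration error (ECE), a standard binned metric. The text does not say which metric was used, so ECE is an assumption here, and the data below is made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


outcomes = [True, True, True, False]  # 75% accuracy in both cases
calibrated = expected_calibration_error([0.75] * 4, outcomes)
overconfident = expected_calibration_error([0.95] * 4, outcomes)
print("calibrated ECE   :", round(calibrated, 3))
print("overconfident ECE:", round(overconfident, 3))
```

Both hypothetical models are equally accurate; only the second misreports how sure it should be, which is the kind of gap a human teammate cannot see from the answer alone.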

Frontier 7 — Evaluation

The evaluation gap, stated sharply

In supervised learning, you measure accuracy on a held-out validation set — clean and reliable. In RL, there generally aren't any reliable offline metrics. Why? Your data so far was collected under a policy that differs from the learned policy, so it's genuinely hard to evaluate the policy on the states it will actually visit. For generalist agents that must work under many conditions, it's far worse.

So the open questions: Can we build offline metrics that at least rule out bad models, even if they can't estimate performance? And how do we pick representative real-world scenarios to evaluate a generalist online? "How do you decide a policy is good enough to deploy?" has no satisfying answer today.

Why is offline evaluation fundamentally harder in RL than in supervised learning?
You trained a new policy and want to know how good it is, using only your existing logged data — no new rollouts.
09 · The Real Lesson

How to Do RL Research

This is the part a problem set can't teach, and arguably the most valuable thing in the whole course. The frontiers above are unsolved. How do you actually go and solve one? A few hard-won realities first, because they reshape how you should think about the work:

Three uncomfortable truths about research

  1. Less than 1% of research ideas have lasting impact. Many ideas don't even become papers; many papers have small impact.
  2. Research is incremental. Even landmark work (e.g. AlphaFold) builds closely on decades of prior projects and others' advances.
  3. In a world where scale matters, simple ideas have more impact, because they can be scaled.

These aren't discouraging once you internalize them; they tell you how to play the game: place many bets, build on what exists, prefer simplicity.

Part 1 — What to work on

The three filters for a project
  1. You need two things: (a) an important problem, and (b) a plan for how to approach it. "Solve climate change" has (a) but no (b). "A cool algorithm that makes the robot 1% better" has (b) but no (a). You need both. Ask: what will the outcome look like if I succeed?
  2. Are you excited about it? Research is a ton of work. You'll be far more successful on something that genuinely excites you — excitement is fuel, not a luxury.
  3. The brutal-honesty test. If you're brutally honest about why it could fail to solve the problem, does the idea still have a high chance of working? If not, it probably won't. Do this before you fall in love with it.
Idea-driven research
You start with a cool idea, then go find a problem it solves. Risk: there may be no important problem the idea solves. May be hard to publish if the solution looks obvious in hindsight. Goal: get the idea to work.
Problem-driven research
You start with the problem, then find the best solution. You're guaranteed to be working on something important. Goal: solve the problem. Often the safer bet for impact.
"Don't box yourself into one area" — a real story

In year 2 of a PhD, the goal was to build a predictive model and use it to learn robot skills. Problem discovered mid-project: existing video-generation models were terrible. The natural reaction for an ML+robotics person is "not my area." Instead: pivot to first building a better video-generation model. That resulting paper became the basis of a job talk, earned ~1300 citations, and told the community the problem was worth studying. Takeaway: crossing topic boundaries surfaces new problems and brings new ideas. Don't be a perfectionist, either — you can never know a project's impact at the outset.

Part 2 — How to do the work (handling risk)

Front-load the risk — the single most important habit

Recall: <1% of ideas have lasting impact. So your job is to find out fast whether this idea is in that 1%. Front-load the risk, even though it's uncomfortable: before building large-scale infrastructure, run small didactic experiments that test the core unknowns. Design targeted experiments that probe the biggest uncertainty in the fastest possible way. Try lots of ideas, including different problems — you "create luck" by taking many cheap shots. And don't mentally commit to a project before you see signs of life on the core unknown.

Two ways to run the same risky project. "Build first" sinks months of infrastructure before testing the core unknown, and if it fails, it's all wasted. "Front-load risk" tests the unknown on day one: a cheap early kill, or a confident build.
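The comparison is simple expected-value arithmetic. The sketch below assumes, for illustration only, that the cheap test perfectly reveals whether the core idea works; the probabilities and durations are hypothetical.

```python
def expected_days(p_works, test_days, build_days, front_load):
    """Expected time spent on one idea.

    front_load=True : run the cheap test first, build only if it passes.
    front_load=False: build everything, then find out.
    Assumes the cheap test is a perfect predictor (a strong simplification).
    """
    if front_load:
        return test_days + p_works * build_days
    return build_days


# Hypothetical numbers: 10% chance the idea works, 1-day test, 60-day build.
build_first = expected_days(0.10, test_days=1, build_days=60, front_load=False)
test_first = expected_days(0.10, test_days=1, build_days=60, front_load=True)
print("build first    :", build_first, "days per idea")
print("front-load risk:", round(test_first, 1), "days per idea")
```

Under these made-up numbers the same calendar time buys several times as many ideas tried, which is what "creating luck" with many cheap shots amounts to.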

Getting things to work
  1. Start from something that works, then make it incrementally harder. Don't debug ten new things at once.
  2. Simplify. Strip the problem to its core.
  3. Talk to friends, colleagues, advisers. Cheap, high-bandwidth error-correction.
  4. Revisit your assumptions. Things you "knew" to be true at the start may be refuted by your own experiments.
  5. Ask: does it "want" to work? If not at all, it likely won't be impactful. Can you reduce the project's scope to just the parts that do want to work?
Deciding when to pivot — beat the sunk-cost fallacy

Pivoting is usually considered later than it should be, because of sunk cost. The trick is to reframe the decision. "Continue this project vs. quit" is nebulous and anxiety-inducing. "Continue this project vs. work on project B vs. project C…" is far more concrete and less stressful. So spend time actively thinking about other projects — it makes the pivot decision easy and unscary.

Part 3 — How to share the work

Why sharing IS the research

What's the output of research? Not a product or service — it's ideas, knowledge, learnings. If no one knows about the learnings, there was no output. "But it feels like self-promotion." No — you're teaching people and sharing findings. "But it didn't work that well." Still useful; someone else may build on it. "But it's all obvious to me now." After enough research, you know far more than others — that obviousness is the curse of knowledge, not a reason to stay silent. Even companies pour effort into communication.

How to share well: clear writing, visuals, and presentations. Think about your audience and how they'll interpret what you say. When in doubt, assume your audience knows less rather than more — many people appreciate a refresher, and jargon mostly excludes. Practice, practice, practice, and get honest feedback. And for writer's block: break the task down — on pen and paper, jot ideas, then an outline. Think through your own ideas before asking your friend ChatGPT.

Part 4 — Mentorship & confidence

On confidence (the quiet blocker)

If you have mentorship, lean on it — people learn best gradually. And about confidence: there are many reasons to feel you lack it. No one knows the best way to do research right now. No one knows what will be most impactful. Many ideas don't work; many papers get rejected. Veterans will be "smarter" than you in their domain simply because they've thought about it for years — that's no reason to be intimidated. Yet confidence genuinely matters: self-doubt and overthinking really do slow you down. The antidote isn't arrogance; it's accepting the inherent ambiguity and placing your bets anyway.

📐 Design It: De-risk your own RL project in week one
You're excited about an idea: "use a video-generation world model to do offline RL for a manipulation task." It would take ~2 months to build the full pipeline. Apply the front-load-the-risk principle: what's the core unknown, and what's the cheapest day-one experiment that could kill or validate the idea before you build anything big?
Idea: world-model offline RL · Full build: ~2 months · Principle: front-load the risk

Core unknown: can the video world-model predict the consequences of a new policy's actions accurately enough to be useful — given those actions are out-of-distribution and small errors compound? (This is Frontier 3's exact failure mode.) Everything else (the RL loop, the infra) is plumbing; this is the 1% question.

Cheapest day-one experiment: don't build any RL. Take an existing pretrained video model, feed it a handful of off-distribution action sequences, and measure how fast its predictions diverge from real rollouts you already have. One afternoon, no infrastructure. If it diverges immediately, the idea is dead (or you pivot to the goal-conditioned variant that never feeds it untrusted actions). If it holds up for a useful horizon — signs of life — now you build.

Why this is the move: you spent one day, not two months, to test the assumption the whole project rests on. That's front-loading the risk. And notice the fallback (goal-conditioned, predict-from-state-only) was already in your pocket from the world-models frontier — reducing scope to "the part that wants to work."

Checkpoint — you shall not pass
A friend has poured two months into a project and it's not working, but they "can't quit now after all that work." In your own words, what's the cognitive trap, and what's the concrete reframing this lecture offers to make the decision easy?
Model answer
The trap is the sunk-cost fallacy: the two months are already spent and gone, so they shouldn't factor into the forward-looking decision — but emotionally they dominate it, which is why pivots are usually considered later than they should be. The reframing is to stop asking the nebulous, anxiety-inducing question "continue vs. quit," and instead ask a concrete comparative one: "continue this project vs. work on project B vs. project C." Framed as a choice among projects, the decision becomes straightforward and far less stressful. The practical habit that enables it: actively spend time thinking about other projects, so you always have real alternatives to compare against. (And ideally you avoided this situation by front-loading the risk — testing the core unknown cheaply before sinking two months in.)
10 · Consolidate

Cheat Sheet & Send-Off

Idea | The one thing to remember
RL is a toolbox | Five slots (data, policy update, reward, value learning, model class) + power tools. Every algorithm is a combination, not a monolith.
Online algorithm map | Vanilla PG → PPO → off-policy AC → Q-learning is a spectrum of data reuse vs. stability. PPO is "technically off-policy, practically on-policy."
Four data settings | Imitation vs. RL × offline vs. online. The quadrant picks your tools: BC, DAgger, offline RL (IQL), SAC/PPO.
Frontier 1: no reward | Most real domains have no clean reward. Proxies get gamed (LLM sycophancy, hand-tuned rec weights). Defining the objective is the deep problem.
Frontier 2: prior knowledge | Default = init weights + seed buffer. Open: injecting abstract knowledge; not letting pretraining over-constrain (RLPD seeds only the buffer).
Frontier 3: world models | Video models as learned sims. Enemy: out-of-distribution actions + compounding error. Fix: more policies' data, or predict-future-then-reach-it.
Frontier 4: scaling | Today's big-RL is short-horizon & very-online. Open: accurate value functions at scale; batch online RL with expressive policies.
Frontiers 5–7 | Safety (learn unsafe without unsafe data; explore safely); calibration (RLHF hurt it); evaluation (no reliable offline metric, because of distribution shift).
What to work on | Important problem + a plan + genuine excitement + survives the brutal-honesty test. Problem-driven beats idea-driven for guaranteed importance.
How to work | Front-load the risk: test the core unknown with a cheap experiment before building. Start from what works, simplify, ask "does it want to work?", reframe pivots as choosing among projects.
How to share | Unshared learnings = no output. Clear writing/visuals, assume the audience knows less, practice. Think before asking ChatGPT.
The through-line of the whole course

You learned the toolbox (the algorithms), watched it meet the real world (sim2real, VLA fine-tuning), and saw where it runs out (the seven frontiers). The unifying enemy across nearly all of it — distribution shift — connects offline RL, world models, sim2real, and evaluation. The unifying method — mix and match tools to fit your constraints — is how you'll attack whichever frontier you choose. You are now well-equipped to start tackling these challenges yourself.

The full CS 224R Deep-RL arc

🎉 Send-off
This is the last lecture of CS 224R. No one knows the best way to do research right now, and no one knows which idea will matter most — which means the field is genuinely open to you. Pick an important problem, get excited, front-load the risk, and share what you learn. The toolbox is in your hands.