First we'll snap the whole course into one mental model: deep RL is a toolbox, and every algorithm is a different combination of the same parts. Then we'll walk the open frontier — the problems nobody has solved. Then, the meta-skill the course is really about: how to actually do research.
This is the last lecture (Chelsea Finn, Stanford). It does three jobs. One: it organizes everything you learned into a single framework so the algorithms stop feeling like a list and start feeling like a design space. Two: it tours the genuine research frontier — seven problems that are wide open. Three: it teaches the thing a course can't put on a problem set — how to choose what to work on, how to de-risk it, and how to share it.
By now you've met a dozen algorithms with intimidating names: REINFORCE, PPO, SAC, DQN, AWR, AWAC, IQL, DAgger. They feel like a zoo. Here's the liberating truth: they're all built from the same small set of parts. Every RL algorithm is a choice of what data to use, how to update the policy, how to estimate "goodness," and which optional tools to bolt on. Learn the parts, and every algorithm becomes a recipe you can read — or invent.
Think of building an algorithm as filling five slots. Change a slot, get a different (often named) algorithm. That's the entire field on one shelf.
| Slot | Options | What it decides |
|---|---|---|
| 1. Data | Expert demos · stored experience (replay buffer) · fresh policy rollouts | Where the learning signal comes from — and whether it's offline, off-policy, or on-policy. |
| 2. Policy update | Supervised behavior cloning · policy gradient · actor-critic · Q-learning | How the policy actually changes. |
| 3. Reward | Given/annotated · learned from examples or preferences · self-supervised (goal-conditioned) | What "good" means — the most fragile slot, as the frontiers will show. |
| 4. Value learning | None · Monte Carlo · TD-learning · n-step returns | How you estimate long-run value to reduce variance or enable bootstrapping. |
| 5. Model class | Gaussian · categorical · diffusion/flow · autoregressive | The neural net family representing the policy. |
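And then there are optional power tools you snap on when the use case demands, such as a clipped trust region, a value baseline, an entropy term, target networks, or an asymmetric value loss.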
Many algorithms simply mix and match these tools for the needs of the use-case. PPO is "policy gradient + a clipped trust region + a value baseline." SAC is "actor-critic + replay buffer + entropy." IQL is "offline + asymmetric value loss + advantage-weighted update." Once you see the slots, you can read a new paper's algorithm by asking: which five choices did they make, and which power tools did they add?
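To make the "algorithm as a recipe" idea concrete, here is a minimal sketch that stores the three recipes quoted above as sets of parts and looks an algorithm up by its parts. The `Recipe` type and `name_algorithm` function are hypothetical helpers invented for illustration, and the part lists are the one-line summaries from the text, not full specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """An algorithm described as a set of parts (illustrative helper, not a library type)."""
    name: str
    parts: frozenset


# Part lists copied from the prose summaries above; they are deliberately coarse.
RECIPES = [
    Recipe("PPO", frozenset({"policy gradient", "clipped trust region", "value baseline"})),
    Recipe("SAC", frozenset({"actor-critic", "replay buffer", "entropy"})),
    Recipe("IQL", frozenset({"offline", "asymmetric value loss", "advantage-weighted update"})),
]


def name_algorithm(parts):
    """Return the name of the recipe whose parts match exactly, or None if nothing matches."""
    wanted = frozenset(parts)
    for recipe in RECIPES:
        if recipe.parts == wanted:
            return recipe.name
    return None


print(name_algorithm(["entropy", "replay buffer", "actor-critic"]))
```

Reading a new paper then amounts to writing down its set of parts and seeing which known recipe it sits closest to.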
This is the whole course as a design space: only a handful of slot choices separate REINFORCE from SAC.
Zoom into the model-free online algorithms — the spine of the course. Four families, and you can read each one off four questions. This is the single most useful table to have memorized.
| Question | Vanilla PG | PPO-like | Off-policy AC (SAC) | Q-learning (DQN) |
|---|---|---|---|---|
| What data? | On-policy | Technically off-policy, called on-policy | Off-policy, replay buffer | Off-policy, replay buffer |
| How off-policy? | n/a | Importance weights | Fit value with TD, sample actions from the policy | Fit the optimal value with TD |
| Fit a value function? | No | On-policy value | On-policy action-value | Optimal action-value |
| Estimate goodness via | Reward-to-go | Advantage (or GAE) | TD / n-step returns on the critic | Max over actions of the critic |
Read it left to right and one quantity drifts: how aggressively you reuse data. Vanilla PG burns each batch once (on-policy, highest variance, simplest). PPO stretches the data a little with importance weights and clipping. Off-policy actor-critic reuses a whole replay buffer. Q-learning reuses everything and even learns the optimal value directly. More reuse means more sample-efficiency but more instability to manage. Every "new" online algorithm is a point on this reuse–stability spectrum.
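The "estimate goodness via" row of the table can be sketched as three small functions: a Monte Carlo reward-to-go, a one-step TD target, and a Q-learning target that takes the max over next actions. This is a minimal illustration on plain Python floats; the function names and the `done` flag convention are assumptions made for the example.

```python
def reward_to_go(rewards, gamma):
    """Discounted sum of future rewards from each time step (Monte Carlo estimate)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_target(reward, next_value, gamma, done):
    """One-step bootstrapped target r + gamma * V(s'), with no bootstrap at episode end."""
    return reward + (0.0 if done else gamma * next_value)


def q_learning_target(reward, next_q_values, gamma, done):
    """Target for the optimal action-value: r + gamma * max over a' of Q(s', a')."""
    return reward + (0.0 if done else gamma * max(next_q_values))


print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))
```

The first function needs a complete on-policy trajectory. The other two only need a single stored transition, which is what makes replay-buffer reuse possible.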
PPO uses importance weights, which is technically an off-policy correction — it reuses data from a slightly older policy for a few epochs. But the data is always nearly on-policy (that's what the clip enforces), so everyone calls it on-policy. The honest answer: "technically off-policy, practically on-policy." The table makes peace with this instead of pretending it away.
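The importance weight and the clip can be written in a few lines. The sketch below computes the standard per-sample clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), on plain floats. The function names and the default `clip_eps=0.2` are illustrative choices, not values taken from the lecture.

```python
import math


def importance_ratio(new_log_prob, old_log_prob):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_log_prob - old_log_prob)


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage earns no credit beyond the clip boundary.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

Once the ratio leaves the clip range in the direction that would increase the objective, the objective goes flat and the gradient vanishes, which keeps the reused data "practically on-policy."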
The increasing quantity is data reuse (off-policy-ness). Left to right, each method extracts learning from data collected by ever-older policies.
Why reuse buys sample-efficiency: real data (especially robot/human data) is expensive. Reusing each transition many times means fewer fresh rollouts per unit of learning — the whole reason off-policy methods exist.
Why it costs stability: learning from data collected by a different policy is exactly distribution shift. Importance weights can explode; bootstrapped value targets (TD) can diverge; the max in Q-learning can overestimate values for unseen actions. Each step right adds machinery to manage that instability (PPO's clip, target networks, asymmetric losses). So the spectrum is a dial: turn toward reuse for efficiency, and pay for it with stability engineering.
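To see why importance weights can explode, consider a toy case where the new policy is only 10% more (or less) likely than the old one to take each logged action. The trajectory-level weight is the product of the per-step ratios, so it grows or shrinks geometrically with the horizon. The constant per-step ratio is an assumption made purely for illustration.

```python
import math


def trajectory_importance_weight(step_ratios):
    """Product of per-step ratios pi_new(a|s) / pi_old(a|s) along one trajectory."""
    return math.prod(step_ratios)


for horizon in (10, 50, 100):
    up = trajectory_importance_weight([1.1] * horizon)
    down = trajectory_importance_weight([0.9] * horizon)
    print(f"horizon={horizon:3d}  weight(1.1/step)={up:.3g}  weight(0.9/step)={down:.3g}")
```

A handful of trajectories end up carrying almost all of the weight, which is one reason practical methods clip or otherwise bound the ratio.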
Step up one more level. Forget specific algorithms and just ask: what data are you allowed to use? Two axes, imitation vs. RL and offline vs. online, carve the whole field into four quadrants. Knowing which quadrant you're in tells you which tools are even available.
One axis is imitation (needs expert data, no reward) vs. RL (needs reward). The other is offline (fixed dataset) vs. online (collect new data). Each quadrant has a canonical method and its own requirements:
| Setting | Canonical method | Needs | Key tradeoff |
|---|---|---|---|
| Offline imitation | Behavior cloning | Expert demos. No reward. | Simplest, but can't beat the demonstrator and drifts off-distribution. |
| Online imitation | DAgger | Expert in the loop. No reward. | Fixes drift by querying the expert on visited states — but needs the expert available online. |
| Offline RL | AWR, AWAC, IQL | Reward-labeled logged data. Can reuse data from other policies. | No new data needed, but must avoid over-trusting actions absent from the data. |
| Online RL (off-policy / on-policy) | SAC / PPO | Reward + the ability to collect new rollouts. | Most powerful, most data-hungry, requires online interaction. |
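The table can be read as a small decision procedure. The sketch below is one hypothetical way to encode it: the function name and its boolean arguments are invented for illustration, and it assumes that whenever there is no reward, expert demonstrations are available.

```python
def pick_setting(has_reward, can_interact, expert_online=False):
    """Map the two axes to a quadrant and its canonical methods from the table above."""
    if has_reward:
        if can_interact:
            return ("online RL", ["SAC", "PPO"])
        return ("offline RL", ["AWR", "AWAC", "IQL"])
    # No reward: fall back to imitation, which assumes expert demonstrations exist.
    if can_interact and expert_online:
        return ("online imitation", ["DAgger"])
    return ("offline imitation", ["behavior cloning"])


# Logged data with rewards, no way to collect more, no expert to query.
print(pick_setting(has_reward=True, can_interact=False))
```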
You have a pile of logged robot data with rewards but no robot to collect more, and no expert to query. Which quadrant? Offline + RL = offline RL. So your toolbox is AWR/AWAC/IQL, and your central worry is distribution shift — the asymmetric value loss (IQL) exists precisely to learn a good value without querying actions your dataset never tried. The map didn't just classify your problem; it handed you the method and the failure mode to watch.
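The asymmetric value loss mentioned above is, in IQL, an expectile regression: with u = Q(s, a) − V(s), the loss is |τ − 1(u < 0)| · u². A minimal scalar sketch follows. The default `tau=0.7` is an illustrative value, and a real implementation would average this loss over a batch of dataset transitions.

```python
def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric squared error |tau - 1(u < 0)| * u**2 with u = Q(s, a) - V(s)."""
    u = q_value - v_value
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


# Underestimating V (Q above V) is penalized more than overestimating it when tau > 0.5.
print(expectile_loss(q_value=1.0, v_value=0.0), expectile_loss(q_value=0.0, v_value=1.0))
```

With τ above 0.5, V is pulled toward the upper end of the Q-values of actions that appear in the dataset, so it approximates a max without ever evaluating Q on an action the data never tried.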
Everything in the toolbox assumes you have a reward — a number saying how good each outcome was. In games, that's free: the score, the win. But step outside games and the reward often doesn't exist, or arrives far too late to be useful. This is arguably the deepest open problem in the field, because without a reward, none of the machinery turns.
Look how badly we currently paper over it:
| Domain | How we fake a reward today | Why it's broken |
|---|---|---|
| LLM chatbots | Preference optimization: learn what humans say they prefer. | People reward "what they want to hear," so models learn to agree with you and sound confident rather than to be correct (sycophancy). Preferences also carry competing objectives, such as personalization vs. polarization. |
| Robotics | Binary success, or hand-shaped rewards. | How would you score a shirt fold from 0 to 1? Most good tasks have no clean numeric reward, and hand-shaping is brittle and gameable. |
| YouTube recs | A weighted blend of engagement (clicks) and satisfaction (likes). | The weights are tuned by hand. A human picks the tradeoff, and the system optimizes a number nobody can justify. |
| Science / learning | — | Can a machine reward itself for discovering something, or for making a human learn? We barely know how to define the objective. |
When you optimize a chatbot on human preferences, you're training it to maximize the chance a human clicks "I like this." But humans don't reliably prefer the true answer — they prefer the one that agrees with them and sounds sure. So preference-optimized models drift toward telling you what you want to hear. The reward is a proxy, and the model exploits the gap between the proxy ("looks good to a rater") and the goal ("is actually correct"). This is the reward-design gap from robotics, reappearing in language.
Ask five raters to score the same shirt fold and they will rarely agree. That disagreement is exactly why a clean reward function is so hard to write: there is no ground-truth number.
How do we get reward signals for non-rewarding, non-verifiable domains — without exploitable proxies? Learned rewards, self-supervised goals, and better-than-preference signals are all active research. No one has nailed it.
RL from scratch is wasteful — the world already contains pretrained models, demonstrations, and written human knowledge. How do we pour all of that into an RL agent? The default tool is simple: initialize the model weights from a pretrained model, and seed the replay buffer with offline data. But that default has two cracks worth staring at.
Weights and buffers absorb experiential data fine. But what about abstract prior knowledge — a hint, a rule of thumb, a fact from a news article? "The floor by the window is slippery." "Customers hate waiting." There's no clean slot in current RL to inject a sentence of advice. Humans learn enormously from being told things; our agents mostly can't.
Initializing from pretrained weights might over-constrain learning — locking the agent into the pretraining distribution so it can never discover genuinely new behavior. There's a provocative alternative (RLPD, Ball et al. 2023): initialize only the replay buffer with offline data, but start the weights fresh. Let the data inform exploration without the weights anchoring the model to old solutions.
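The "seed the buffer, not the weights" idea can be sketched with an ordinary replay buffer that is filled with offline transitions before any online step is taken. This is only an illustration of that one idea, not the RLPD implementation. The `ReplayBuffer` class and its method names are hypothetical, and a transition is treated as an opaque tuple.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions (illustrative, not a library class)."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def seed_with(self, offline_transitions):
        """Load logged data before online training; the network weights stay untouched."""
        for transition in offline_transitions:
            self.add(transition)

    def sample(self, batch_size, rng=random):
        """Sample without replacement from offline and online data alike."""
        return rng.sample(list(self.storage), min(batch_size, len(self.storage)))

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=10_000)
buffer.seed_with([("s0", "a0", 1.0, "s1"), ("s1", "a1", 0.0, "s2")])  # offline data
buffer.add(("s2", "a2", 0.5, "s3"))  # first online transition
print(len(buffer))
```

One design caveat: with a single FIFO buffer, online data eventually evicts the offline seed. Keeping the two sources in separate buffers and sampling from both is a common alternative.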
The real questions are audacious: How can LLMs go beyond their pretraining to solve problems humans haven't solved? How can robots learn to do tasks faster and more reliably than the humans who demonstrated them? Pretraining gives you a floor of human-level competence. The frontier is using prior knowledge as a launchpad past it, not a ceiling that traps you at it.
Video generation models (think Veo) have learned astonishing, nuanced world knowledge — how objects fall, pour, deform. They should be a goldmine for RL: a learned simulator you can plan inside. But turning a video model into a usable world model is full of sharp edges.
Here's the canonical failure. Suppose you train a world model to predict the next frames given the current state and an action, using demonstrations plus one policy's rollouts. Now you want to evaluate a new policy by asking the model: "if the new policy takes these actions, do good things happen?" But the new policy's actions are out of distribution — the world model never saw them. And small physical inaccuracies in the prediction compound into wildly wrong outcomes (the same compounding-error demon from sim2real). So the world model confidently hallucinates a future that won't happen, and your evaluation is garbage.
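Compounding error is easy to reproduce on a toy system. The sketch below rolls out a one-dimensional linear "world" and a model whose gain is off by 1%, then compares the two trajectories under the same actions. The dynamics, the gains, and the function names are all assumptions chosen for illustration; nothing here is a video model.

```python
def rollout(x0, actions, gain):
    """Roll out x_{t+1} = gain * x_t + a_t and return every visited state."""
    states = [x0]
    for action in actions:
        states.append(gain * states[-1] + action)
    return states


def prediction_errors(x0, actions, true_gain=1.0, model_gain=1.01):
    """Absolute gap between the model's imagined states and the real ones at each step."""
    real = rollout(x0, actions, true_gain)
    imagined = rollout(x0, actions, model_gain)
    return [abs(p - r) for p, r in zip(imagined, real)]


errors = prediction_errors(x0=1.0, actions=[0.1] * 50)
for step in (1, 10, 25, 50):
    print(f"step {step:2d}: error {errors[step]:.3f}")
```

The model's one-step prediction is nearly perfect, yet the imagined trajectory drifts further from reality at every step, because each prediction is fed back in as the next input.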
Route 1 — broaden the data: train the world model on rollouts from many policies, so new policies' actions are less out-of-distribution. Route 2 — use the model differently: instead of predicting action-conditioned futures (fragile), train it to predict plausible future video from the current state alone (on demos), then run a goal-conditioned policy to reach that imagined future. You sidestep the OOD-action problem by never feeding it untrusted actions.
Large-scale RL for LLMs is the most exciting thing happening in the field. And yet, look closely and today's successes share two quiet limitations: they are short-horizon, and they depend heavily on online data.
Can we do large-scale RL with longer horizons and less online data? Two concrete sub-problems:
(1) Accurate value functions at scale. Algorithms like PPO only use the value function to reduce gradient variance — it doesn't have to be very accurate. But actor-critic methods need an accurate value function. Can we train and trust value functions at foundation scale? Mostly unsolved.
(2) Batch online RL. Tightly interleaving model updates and data collection is impractical for big models — especially when collecting dialog with real users or data on real robots. More practical: a few iterations of "collect a large batch, then update." That batch regime raises new needs, like expressive policies for enough data breadth, possibly collected asynchronously.
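The "collect a large batch, then update" regime can be sketched as a short outer loop. Everything here is a placeholder: `collect_batch` and `update` are hypothetical callables supplied by the caller, and updating on all data gathered so far (rather than only the newest batch) is an assumption of this sketch.

```python
def batch_online_rl(policy, collect_batch, update, num_iterations=3, batch_size=1000):
    """Alternate a few times between one large data collection and one policy update."""
    dataset = []
    for _ in range(num_iterations):
        # The policy stays frozen for the whole collection phase.
        dataset.extend(collect_batch(policy, batch_size))
        # One (possibly long) offline-style update on everything gathered so far.
        policy = update(policy, dataset)
    return policy, dataset


# Toy stand-ins: the "policy" is just a version number.
def fake_collect(policy_version, n):
    return [(policy_version, i) for i in range(n)]


def fake_update(policy_version, data):
    return policy_version + 1


final_policy, data = batch_online_rl(0, fake_collect, fake_update, num_iterations=3, batch_size=4)
print(final_policy, len(data))
```

Because the policy changes only a few times, each batch has to cover enough behavior on its own, which is where the need for expressive policies and data breadth comes from.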
The last cluster of frontiers is about what happens when these systems meet the real world and real people. Three hard problems.
How should we develop and test AI in safety-critical domains such as medicine, driving, mental-health counseling, and legal and political discourse? The uncomfortable tension: large-scale ML is our most successful tool for open-world situations, but the obvious way to handle unsafe circumstances, collecting lots of data of unsafe incidents, has had horrific real-world consequences. So two questions remain open: how can a system learn about unsafe situations without unsafe data, and how can it explore safely?
When AI worked alone to diagnose patients, it hit 92% accuracy. Physicians using AI assistance reached only 76% — barely above the 74% they got with no AI at all. The bottleneck isn't the model's accuracy; it's the interface between human and AI. We're not optimizing for good human-AI systems.
Two threads here. (1) Can we optimize for better human-AI systems — can models better estimate and convey their uncertainty? A sobering finding: RLHF post-training actually hurt a model's calibration — the post-trained model's confidence matched its accuracy worse than before. One direction: calibration via verbalized confidence and listing multiple guesses. (2) When humans are not in the loop, how do we reach 99.99% reliability? RL is likely part of the answer, with promising early results.
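"Confidence matches accuracy" can be made measurable. One common metric is expected calibration error (ECE): bucket predictions by stated confidence, then average the gap between mean confidence and accuracy in each bucket, weighted by bucket size. The sketch below is a minimal version of that metric. The text does not say which calibration metric the finding used, so treating ECE as the yardstick is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, 1.0 if is_correct else 0.0))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# A model that says "100% sure" on four answers but gets only two right.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, True, False, False]))
```

A perfectly calibrated model scores 0. A model that is always certain and right half the time scores 0.5.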
In supervised learning, you measure accuracy on a held-out validation set — clean and reliable. In RL, there generally aren't any reliable offline metrics. Why? Your data so far was collected under a policy that differs from the learned policy, so it's genuinely hard to evaluate the policy on the states it will actually visit. For generalist agents that must work under many conditions, it's far worse.
So the open questions: Can we build offline metrics that at least rule out bad models, even if they can't estimate performance? And how do we pick representative real-world scenarios to evaluate a generalist online? "How do you decide a policy is good enough to deploy?" has no satisfying answer today.
This is the part a problem set can't teach, and arguably the most valuable thing in the whole course. The frontiers above are unsolved. How do you actually go and solve one? A few hard-won realities first, because they reshape how you should think about the work:
1. Less than 1% of research ideas have lasting impact. Many ideas don't even become papers; many papers have small impact. 2. Research is incremental. Even landmark work (e.g. AlphaFold) builds closely on decades of prior projects and others' advances. 3. In a world where scale matters, simple ideas have more impact — because they can be scaled. These aren't discouraging once you internalize them; they tell you how to play the game: place many bets, build on what exists, prefer simplicity.
In year 2 of a PhD, the goal was to build a predictive model and use it to learn robot skills. The problem discovered mid-project: existing video-generation models were terrible. The natural reaction for an ML+robotics person is "not my area." Instead, the project pivoted to first building a better video-generation model. The resulting paper became the basis of a job talk, earned ~1300 citations, and told the community the problem was worth studying. Takeaway: crossing topic boundaries surfaces new problems and brings new ideas. Don't be a perfectionist, either: you can never know a project's impact at the outset.
Recall: <1% of ideas have lasting impact. So your job is to find out fast whether this idea is in that 1%. Front-load the risk, even though it's uncomfortable: before building large-scale infrastructure, run small didactic experiments that test the core unknowns. Design targeted experiments that probe the biggest uncertainty in the fastest possible way. Try lots of ideas, including different problems — you "create luck" by taking many cheap shots. And don't mentally commit to a project before you see signs of life on the core unknown.
Compare two ways to run the same risky project. "Build first" sinks months into infrastructure before testing the core unknown, and if the idea fails, it is all wasted. "Front-load risk" tests the unknown on day one: either a cheap early kill or a confident build.
Pivoting is usually considered later than it should be, because of sunk cost. The trick is to reframe the decision. "Continue this project vs. quit" is nebulous and anxiety-inducing. "Continue this project vs. work on project B vs. project C…" is far more concrete and less stressful. So spend time actively thinking about other projects — it makes the pivot decision easy and unscary.
What's the output of research? Not a product or service — it's ideas, knowledge, learnings. If no one knows about the learnings, there was no output. "But it feels like self-promotion." No — you're teaching people and sharing findings. "But it didn't work that well." Still useful; someone else may build on it. "But it's all obvious to me now." After enough research, you know far more than others — that obviousness is the curse of knowledge, not a reason to stay silent. Even companies pour effort into communication.
How to share well: clear writing, visuals, and presentations. Think about your audience and how they'll interpret what you say. When in doubt, assume your audience knows less rather than more: many people appreciate a refresher, and jargon mostly excludes. Practice, practice, practice, and get honest feedback. And for writer's block, break the task down: with pen and paper, jot down ideas, then an outline. Think through your own ideas before asking your friend ChatGPT.
If you have mentorship, lean on it — people learn best gradually. And about confidence: there are many reasons to feel you lack it. No one knows the best way to do research right now. No one knows what will be most impactful. Many ideas don't work; many papers get rejected. Veterans will be "smarter" than you in their domain simply because they've thought about it for years — that's no reason to be intimidated. Yet confidence genuinely matters: self-doubt and overthinking really do slow you down. The antidote isn't arrogance; it's accepting the inherent ambiguity and placing your bets anyway.
Worked example: suppose the project is to evaluate new policies inside a video world model. Core unknown: can the video world model predict the consequences of a new policy's actions accurately enough to be useful, given that those actions are out-of-distribution and small errors compound? (This is Frontier 3's exact failure mode.) Everything else (the RL loop, the infra) is plumbing; this is the 1% question.
Cheapest day-one experiment: don't build any RL. Take an existing pretrained video model, feed it a handful of off-distribution action sequences, and measure how fast its predictions diverge from real rollouts you already have. One afternoon, no infrastructure. If it diverges immediately, the idea is dead (or you pivot to the goal-conditioned variant that never feeds it untrusted actions). If it holds up for a useful horizon — signs of life — now you build.
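The measurement itself fits in a few lines. The sketch below assumes each frame has already been reduced to a flat list of floats (raw pixels or some embedding) and reports how many steps a predicted rollout stays within a tolerance of the real one. The error metric, the tolerance, and the function names are all hypothetical choices for illustration.

```python
def frame_error(predicted_frame, real_frame):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(p - r) for p, r in zip(predicted_frame, real_frame)) / len(real_frame)


def divergence_horizon(predicted_frames, real_frames, tolerance):
    """Number of steps before the per-frame error first exceeds the tolerance."""
    for step, (pred, real) in enumerate(zip(predicted_frames, real_frames)):
        if frame_error(pred, real) > tolerance:
            return step
    return min(len(predicted_frames), len(real_frames))


# Toy rollout: the prediction is fine for two steps, then drifts.
predicted = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]
real = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(divergence_horizon(predicted, real, tolerance=0.5))
```

Running this over a handful of held-out action sequences gives a distribution of horizons. If most of them are shorter than the planning horizon the project needs, that is the early kill signal.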
Why this is the move: you spent one day, not two months, to test the assumption the whole project rests on. That's front-loading the risk. And notice the fallback (goal-conditioned, predict-from-state-only) was already in your pocket from the world-models frontier — reducing scope to "the part that wants to work."
| Idea | The one thing to remember |
|---|---|
| RL is a toolbox | Five slots (data, policy update, reward, value learning, model class) + power tools. Every algorithm is a combination, not a monolith. |
| Online algorithm map | Vanilla PG → PPO → off-policy AC → Q-learning is a spectrum of data reuse vs. stability. PPO is "technically off-policy, practically on-policy." |
| Four data settings | Imitation vs. RL × offline vs. online. The quadrant picks your tools: BC, DAgger, offline RL (IQL), SAC/PPO. |
| Frontier 1: no reward | Most real domains have no clean reward. Proxies get gamed (LLM sycophancy, hand-tuned rec weights). Defining the objective is the deep problem. |
| Frontier 2: prior knowledge | Default = init weights + seed buffer. Open: injecting abstract knowledge; not letting pretraining over-constrain (RLPD seeds only the buffer). |
| Frontier 3: world models | Video models as learned sims. Enemy: out-of-distribution actions + compounding error. Fix: more policies' data, or predict-future-then-reach-it. |
| Frontier 4: scaling | Today's big-RL is short-horizon & very-online. Open: accurate value functions at scale; batch online RL with expressive policies. |
| Frontiers 5–7 | Safety (learn unsafe without unsafe data; explore safely); calibration (RLHF hurt it); evaluation (no reliable offline metric — distribution shift). |
| What to work on | Important problem + a plan + genuine excitement + survives the brutal-honesty test. Problem-driven beats idea-driven for guaranteed importance. |
| How to work | Front-load the risk: test the core unknown with a cheap experiment before building. Start from what works, simplify, ask "does it want to work?", reframe pivots as choosing among projects. |
| How to share | Unshared learnings = no output. Clear writing/visuals, assume the audience knows less, practice. Think before asking ChatGPT. |
You learned the toolbox (the algorithms), watched it meet the real world (sim2real, VLA fine-tuning), and saw where it runs out (the seven frontiers). The unifying enemy across nearly all of it — distribution shift — connects offline RL, world models, sim2real, and evaluation. The unifying method — mix and match tools to fit your constraints — is how you'll attack whichever frontier you choose. You are now well-equipped to start tackling these challenges yourself.