A giant vision-language-action model imitates its way to 80% — and stalls. The last 20% is where robots become useful. This is how reinforcement learning pushes a foundation-scale policy past the plateau, when the model is too big, too weird, and too expensive to train the normal way.
This is an open, active research problem — the lecture (Chelsea Finn, Stanford) covers recent themes and informed opinion, not settled answers. We rebuild each idea from its motivation: why the obvious approach fails, and what clever reframing makes RL tractable on a model with billions of parameters. By the end you can read any 2025–2026 VLA-RL paper and slot it into one of three themes.
The previous lecture asked: can we train robots in simulation and transfer to reality? This one asks a sharper question. We now have robot foundation models — huge pretrained networks that take a camera image and a language instruction and output robot actions. They're trained by imitation: copy thousands of human teleoperation demonstrations. And they're shockingly capable. So what's left to do?
Performance plateaus around 80% success. In unseen rooms, with novel objects, a state-of-the-art VLA succeeds 4 times in 5. That sounds great until you realize: for a robot to act autonomously, you often need 99%+ reliability. A robot that drops the cup one time in five is a toy, not a product. That last 20% — the long tail of weird situations the demos didn't cover — is where all the value lives, and imitation alone can't get there.
This is precisely the story of large language models. Supervised fine-tuning (SFT) on demonstrations gets you a decent model; RL post-training (RLHF, RLVR) is what made them genuinely reliable and capable. Robots are at the same inflection point. Imitation learning is the SFT of robotics; RL fine-tuning is the natural next step. And crucially, the pretrained VLA isn't thrown away — it's an excellent initialization for RL, the same way a pretrained LLM is for RLHF.
So the goal of this entire lecture: take a pretrained, imitation-trained VLA and use reinforcement learning to push it from 80% to 99%+. Simple to state. The reason it's a research frontier is that VLAs are uniquely hostile to RL, as we'll see. (One footnote: DAgger — iteratively collecting expert corrections on the robot's own visited states — also helps, and is often used alongside RL.)
Imitation learning (gold) climbs fast then plateaus near 80% — more demos barely help. RL post-training (blue) starts from that plateau and grinds out the reliability tail. The gap between 80% and 99% is the entire reason this lecture exists.
You can't understand why RL on a VLA is hard until you know exactly what's inside one. The most common design for a robot foundation model is a vision-language-action (VLA) model, and it's built in a very specific way.
The action expert is the part of a VLA that outputs continuous actions. It reflects three deliberate design choices: (1) it uses diffusion or flow matching to model the rich, multimodal distribution over actions; (2) it attends to all the backbone's activations so it sees everything the VLM understood; (3) it's designed to avoid multiple forward passes through the whole backbone: the giant backbone runs once, the small expert iterates. Often, the gradient is not passed back into the backbone, keeping the expensive part frozen.
Every one of these design choices — diffusion action heads, frozen backbones, attending to all activations, predicting chunks of actions at once — was made to serve imitation learning. Each one quietly breaks an assumption that standard RL algorithms rely on. Hold this thought; Chapter 3 is the bill coming due.
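Standard RL was designed for small networks trained from scratch. A VLA violates almost every one of those comfortable assumptions. Two clusters of pain: (1) the model is huge, so gradients are expensive and computation is split between cloud and local hardware; (2) it was pretrained by imitation, so there is no critic, the diffusion head has an intractable likelihood, and action chunking is RL-unfriendly.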
Everything that makes a VLA a great imitation learner — scale, frozen backbone, expressive diffusion head, action chunks — makes it a terrible fit for textbook RL. The rest of this lecture is three different strategies for routing around this, rather than smashing into it head-on. None of them is "just run PPO on the weights."
Before any algorithm, one structural decision dominates everything for large models: do you collect data and update tightly interleaved (online), or do you collect a big batch and train on it separately (offline)?
Suppose you set a hyperparameter wrong — bad learning rate, wrong number of epochs, broken gradient clipping. Online RL: the bug corrupted your tightly-coupled loop, so you must rerun the experiment and recollect all the data. On a real robot, that's hours or days of physical operation, gone. Offline RL: your dataset is still sitting there, untouched — just rerun training on the existing data. For large models where each run is enormously expensive and tuning is constant, this is a massive practical advantage. Offline is much simpler for large models.
This is why the lecture starts with offline RL, even though online is theoretically more powerful. When iteration cost dominates, the method that lets you fix a bug without re-touching the robot wins. Keep this lens: throughout, "is this online or offline, and what does a mistake cost?" is the question that decides what's practical at foundation scale.
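The cost asymmetry can be put into a toy calculation. This is a deliberately crude sketch: the function `robot_hours` and the cost model behind it (every bug forces a full restart, and an online restart recollects everything) are assumptions for illustration, and the numbers are made up.

```python
def robot_hours(paradigm, collect_hours, num_bugs):
    """Total robot time when each training bug forces a restart.

    Assumed cost model (hypothetical): an online restart recollects all data,
    while an offline restart reuses the stored dataset.
    """
    if paradigm == "online":
        return collect_hours * (1 + num_bugs)
    if paradigm == "offline":
        return collect_hours
    raise ValueError(f"unknown paradigm: {paradigm}")

print(robot_hours("online", collect_hours=20, num_bugs=3))   # 80
print(robot_hours("offline", collect_hours=20, num_bugs=3))  # 20
```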
Interactive figure: a bug is injected at a random point in training. Online RL must recollect all the robot data gathered after the bug; offline RL just reruns training on the data it already has.
The honest first question. PPO is the workhorse of RL. Can we just point it at a VLA and turn the crank? People have tried — SimpleVLA-RL fine-tunes OpenVLA with RL; πRL fine-tunes π0.5 with RL.
Yes, PPO can improve a VLA. But it requires a massive amount of online policy rollouts — so massive that many papers don't even report the sample count (never a good sign). And the results so far are limited to simulation-based training, where rollouts are cheap. On a real robot, where every rollout is slow and physical, naive PPO's appetite is a dealbreaker.
PPO is on-policy: it can only learn from data collected by the current policy, then throws it away after a few updates. Every improvement step demands fresh rollouts. In simulation with thousands of parallel envs, fine. On one real robot collecting one slow stream of experience, you'd wait forever. This single fact — on-policy methods waste real-robot data — motivates all three themes that follow. They all find ways to reuse data the way offline/off-policy methods can.
So PPO is the baseline, not the answer. The three themes below are the field's attempts to get RL's benefits without RL's real-world sample cost — by leaning on supervised learning, on cheap latent spaces, and on small auxiliary policies.
Can we formulate RL improvement as a supervised-learning problem? If so, it inherits all the machinery that already scales beautifully to giant models and datasets — the same machinery imitation learning uses. This is the most "VLA-native" approach: it reuses the model's existing strength (supervised fitting) instead of fighting it.
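The recipe has two parts, and neither requires a policy gradient or a tractable action likelihood: (1) fit a value function to Monte Carlo time-to-go targets, which is plain supervised regression; (2) binarize each action's advantage into good or bad and fine-tune the VLA with supervised learning, conditioned on that label.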
The advantage is how much better an action is than the policy's average from that state: the action's value minus the state's value. Positive advantage means "this beat expectation, do more of it." Binarizing it (good vs. bad) throws away magnitude but turns the policy update into a simple supervised target, which is exactly why it scales. This is the same idea behind advantage-weighted methods from the offline-RL literature (AWR, AWAC), simplified to a binary signal.
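A minimal sketch of the labeling step, under stated assumptions: the reward is taken to be -1 per step until the episode ends, with an extra penalty on failure, so the Monte Carlo return is a negative time-to-go. That reward shaping, the `fail_penalty` value, the threshold, and the function names are all illustrative choices, not details from the lecture. The observed return stands in for the action's value, and `values` stands in for the learned baseline.

```python
import numpy as np

def time_to_go_returns(episode_len, success, fail_penalty=50.0):
    """Monte Carlo return at each step, assuming reward -1 per step
    and an extra penalty (assumed) when the episode ends in failure."""
    steps_left = np.arange(episode_len, 0, -1, dtype=float)  # T, T-1, ..., 1
    returns = -steps_left
    if not success:
        returns -= fail_penalty
    return returns

def binarize_advantage(returns, values, threshold=0.0):
    """Advantage = observed return minus value baseline; label 1 if above threshold."""
    advantage = returns - values
    return advantage, (advantage > threshold).astype(int)

# The value function expected about 6 steps to go at the start of both episodes.
values = np.array([-6.0, -5.0, -4.0, -3.0])

fast_success = time_to_go_returns(4, success=True)
failure = time_to_go_returns(4, success=False)

_, good_labels = binarize_advantage(fast_success, values)
_, bad_labels = binarize_advantage(failure, values)
print(good_labels, bad_labels)
```

The labels then become the good/bad conditioning input for supervised fine-tuning; the successful episode that finished faster than expected is marked good, the failed one bad.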
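Put together as a loop (this is Physical Intelligence's π*0.6, "a VLA that learns from experience"): collect a batch of robot experience with the current policy, fit the time-to-go value function, label each action with its binarized advantage, fine-tune the VLA conditioned on those labels, and repeat.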
RL post-training gave a 2× improvement in throughput over imitation-only post-training. The flagship demo: a robot making lattes — reliably, for 13 hours of continuous operation, including collaborating with a person. That 13-hour figure is the whole point: it's the reliability that imitation's 80% plateau could never reach.
Theme 1 is not the final answer. It's a strong, scalable start with clear room to improve: (1) TD updates (bootstrapping from your own value estimates) should beat pure Monte Carlo, even at scale, by propagating value information faster. (2) More powerful policy-improvement methods than binary advantage should help. (3) Online RL should ultimately be more data-efficient and reach higher performance, because it can actively seek out failure modes and test new strategies, at the cost of more infrastructure. Themes 2 and 3 chase exactly that online efficiency.
It's RL because the value function injects reward information. Plain imitation treats every demonstrated action as equally worth copying. Here, the learned value function scores each action by its expected future reward (via time-to-go), and the advantage compares it to the state's baseline. That reward-derived signal is what makes it reinforcement learning, not imitation.
Why binarized-conditioning improves the policy: you fine-tune the VLA to predict actions conditioned on a good/bad token. The model learns two modes: "what good actions look like here" and "what bad actions look like." At deployment you always condition on "good," sampling from the better-than-average action distribution. Mathematically this is a simplified advantage-weighted regression: instead of weighting each action's imitation loss by a continuous function of its advantage, you hard-threshold to a binary weight. You lose magnitude information but gain a clean, scalable supervised target.
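For reference, the contrast can be written out. The notation is an assumption introduced here, not taken from the lecture: $\pi_\theta$ is the policy, $\mathcal{D}$ the collected dataset, $A(s,a)$ the advantage, $\beta$ a temperature, $\epsilon$ a threshold, and $I$ the good/bad indicator. For a diffusion or flow-matching head, $\log \pi_\theta$ should be read as shorthand for the usual denoising or flow-matching training loss, since the exact likelihood is not used.

$$
\text{AWR:}\quad \max_\theta \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\exp\!\left(\tfrac{1}{\beta}A(s,a)\right)\log \pi_\theta(a\mid s)\right]
$$

$$
\text{Binarized conditioning:}\quad I = \mathbb{1}\left[A(s,a) > \epsilon\right],\qquad \max_\theta \; \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log \pi_\theta(a\mid s, I)\right]
$$

At deployment the policy is sampled as $a \sim \pi_\theta(\cdot \mid s, I = 1)$. The continuous weight $\exp(A/\beta)$ is replaced by a single bit, which is what removes the magnitude information.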
Why it scales: both steps are pure supervised learning — exactly what billion-parameter transformers are good at. No policy gradient, no diffusion-likelihood, no critic instability. That's the whole appeal of Theme 1.
Theme 1 fine-tuned the whole VLA (with supervised updates). Theme 2 asks a more radical question: can we improve the VLA without fine-tuning it end-to-end at all? The VLA is huge and fragile; touching its weights is expensive and risky. What if we leave it frozen and learn a small policy on top?
Learn a separate, small Gaussian policy that operates on the VLA's representation — not on the robot's raw action space, and not by changing the VLA's weights. A Gaussian policy has a clean, tractable likelihood, so all of standard off-policy RL (SAC and friends) works out of the box. The trick is choosing the right representation to act on.
Here's the clever part. Recall the action expert is a diffusion model: it turns a noise vector into an action. Different noise vectors lead to different actions. So instead of controlling the action directly, control the noise.
Treat the diffusion policy's input noise as the new "action space." Train a small RL policy whose job is to output the noise vector that, when fed through the frozen VLA, produces a good action. The VLA never changes; you've turned "improve the robot" into "pick better noise" — a low-dimensional, tractable RL problem with a clean Gaussian policy.
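Here is a toy sketch of noise steering. Several things are assumptions made to keep it short: `frozen_vla` is a deterministic stand-in for the frozen denoiser, the reward is a made-up distance to a target action, and a simple elite-averaging update (in the style of the cross-entropy method) is swapped in for SAC. What it does preserve is the structure: the learned policy is a small Gaussian over noise, the buffer stores the noise in the action slot, and the VLA is never modified.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_DIM = 2  # hypothetical; real noise inputs are larger

def frozen_vla(state, noise):
    """Stand-in for the frozen diffusion VLA: maps (state, noise) to an action."""
    return np.tanh(noise + 0.1 * state)

def reward_fn(action):
    """Toy reward: negative squared distance to a target action."""
    target = np.array([0.5, -0.5])
    return -float(np.sum((action - target) ** 2))

# The steering policy is a Gaussian over noise; only its mean is learned here.
noise_mean = np.zeros(NOISE_DIM)
noise_std = 0.5
state = np.zeros(NOISE_DIM)
buffer = []  # stores (state, noise, reward): the noise is the RL action

for _ in range(30):
    noises = noise_mean + noise_std * rng.normal(size=(16, NOISE_DIM))
    rewards = np.array([reward_fn(frozen_vla(state, z)) for z in noises])
    for z, r in zip(noises, rewards):
        buffer.append((state, z, r))
    # Elite averaging stands in for the SAC update (an assumption of this sketch).
    elite = noises[np.argsort(rewards)[-4:]]
    noise_mean = 0.5 * noise_mean + 0.5 * elite.mean(axis=0)

steered = reward_fn(frozen_vla(state, noise_mean))
unsteered = reward_fn(frozen_vla(state, np.zeros(NOISE_DIM)))
print(steered, unsteered)
```

In a real system the update would be an off-policy actor-critic step on the buffered transitions, and the state would be the VLA's observation rather than a fixed vector.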
This noise-steering method, DSRL, learned from just 65 online episodes, or about 10,000 steps. That's roughly 100× more sample-efficient than PPO on the same problem. By acting on the frozen VLA's noise input instead of its weights, you get real-robot RL that actually fits in a real-robot data budget. And afterward, you can distill the improved behavior back into the VLA itself.
A sibling idea (RLT): instead of steering noise, compress the VLA's visual representation into a compact latent, and run RL on top of that compressed representation. Same philosophy — don't do RL on billions of weights or raw pixels; do it on a small, information-rich latent the VLA already computed. The frozen foundation model becomes a feature extractor; RL is cheap on its features.
Theme 3 is the closest to "just do actor-critic RL," but cleverly contained. Keep the frozen VLA as a strong base policy, and learn a small Gaussian policy that edits the base's actions — nudging them toward higher value.
The VLA proposes an action; a small learned edit policy proposes a correction; the corrected action is what runs. Train the edit policy with actor-critic RL to maximize the Q-function. The base VLA stays frozen and keeps the behavior sane; the tiny edit policy does the RL-style optimization. Examples include Probe-Learn-Distill and EXPO.
On its own, an edit policy is fragile, for two textbook-RL reasons. (1) The edit policy lags behind the Q-function — the critic improves faster than the actor can chase it, so the edits are always optimizing a slightly stale target. (2) The edit policy can collapse — the same actor-degeneracy that plagues normal RL. Left alone, the edits drift or blow up.
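EXPO's stabilizer is elegant and very much in the spirit of modern test-time scaling. Instead of trusting the (lagging) edit policy to directly output the best action, generate several candidates and let the freshest Q-function pick: sample N actions from the base VLA, apply the edit policy to each, score the base and edited candidates with the current Q-function, and execute the highest-scoring one.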
Selecting with the freshest Q-function reduces the lag (you're using up-to-the-moment value estimates, not the edit policy's stale guess) and is resilient to edit-policy collapse (even if the edit policy degenerates, best-of-N can still fall back to a good base action). It's arguably a form of test-time scaling: spend more compute at inference (more candidates) to get a better action, exactly like sampling-and-ranking for LLMs. A subtle but important detail: when fitting the Q-function, you also use this on-the-fly best-of-N policy to pick the next-state actions in the Bellman backup — ablations show that skipping this hurts performance badly.
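The selection rule and its use in the Bellman backup can be sketched together. All three components below (`base_policy`, `edit_policy`, `q_function`) are toy stand-ins with hypothetical names and shapes, not EXPO's networks; the point is only the control flow: pool the base and edited candidates, let the current Q-function pick, and reuse the same picker for the next-state action in the TD target.

```python
import numpy as np

rng = np.random.default_rng(0)
ACT_DIM = 2  # hypothetical

def base_policy(state, n):
    """Stand-in for the frozen VLA: n sampled action proposals."""
    return 0.3 * rng.normal(size=(n, ACT_DIM)) + 0.1 * state

def edit_policy(state, actions):
    """Stand-in for the small edit policy: a bounded additive correction."""
    return actions + 0.05 * np.tanh(state - actions)

def q_function(state, actions):
    """Stand-in for the current critic: highest when the action equals the state."""
    return -np.sum((actions - state) ** 2, axis=-1)

def best_of_n_action(state, n=8):
    """Pick the candidate (base or edited) that the current Q-function scores highest."""
    base = base_policy(state, n)
    edited = edit_policy(state, base)
    candidates = np.concatenate([base, edited], axis=0)  # base actions stay eligible
    scores = q_function(state, candidates)
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

def td_target(reward, next_state, done, gamma=0.99):
    """Bellman target whose next-state action comes from the same best-of-N rule."""
    _, next_q = best_of_n_action(next_state)
    return reward + gamma * (1.0 - done) * next_q

state = np.array([0.4, -0.2])
action, score = best_of_n_action(state)
print(action.shape, score, td_target(1.0, state, done=0.0))
```

Because the base actions remain in the candidate pool, a degenerate edit policy cannot make the selected action worse than the best base sample under the current Q-function. Raising `n` is the test-time-scaling knob.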
EXPO-FT fine-tuned a VLA from just ~19 minutes of real-world experience (about 11,000 steps), running ~10× faster than the base, with higher reliability than SFT and DAgger and learning more efficiently than DSRL and HIL-SERL. Ablations confirm the design: remove the edit policy and value maximization stalls; remove the on-the-fly best-of-N in the Bellman backup and performance craters in some environments.
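Worked scenario: you have a diffusion-based VLA and a budget of about 30 minutes of real-robot data. Which method do you reach for? PPO is out immediately: it needs a massive number of online rollouts, and its results so far are limited to simulation. With 30 minutes of real data, you need extreme sample efficiency.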
Best fit: Theme 2 (DSRL) or Theme 3 (EXPO). Both hit your budget — DSRL learned from ~65 episodes, EXPO-FT from ~19 minutes. Since your VLA is diffusion-based, DSRL is especially natural: steer the diffusion noise, keep the VLA frozen, run SAC on the low-dimensional noise space. No end-to-end fine-tuning of a fragile billion-param model.
If training is unstable, go EXPO-style: add a small Gaussian edit policy on the base actions and, crucially, add best-of-N selection with the latest Q-function. That directly attacks the two instabilities (Q-lag and edit-policy collapse) and gives you a test-time-scaling knob: sample more candidates when reliability matters most.
Theme 1 (offline-as-supervised) is the safe, scalable baseline to also run: collect the batch, fit a time-to-go value, advantage-condition, supervised fine-tune. It won't be the most sample-efficient, but a bug costs you a re-train, not a re-collect.
The single most important mental model in this lecture: different noise → different action, and RL just learns to pick good noise. Feed the frozen VLA different noise vectors and it denoises them into different actions, each with a different value. The steering policy's whole job is to find the high-value region of noise, without ever touching the VLA.
Interactive figure. Left: the noise vector you control. Right: the action the frozen VLA produces from it, colored by Q-value (green = good, red = bad).
Notice what you're not doing: you never changed the VLA. You only searched its noise input for the action it already could produce but wasn't reliably choosing. That's why Theme 2 is so sample-efficient — the foundation model's competence is intact; RL only learns a small policy over a small space. The same picture explains Theme 3 (search over edits instead of noise) and contrasts with Theme 1 (re-fit the whole model with supervised advantage targets).
The replay buffer stores the noise in the action slot: buffer.add(state, noise, ...). By making the noise the RL action, you've converted "improve a frozen diffusion VLA" into a small, standard SAC problem over a low-dimensional Gaussian. After training, you can distill the steered behavior back into the VLA so deployment needs no separate steering policy.

| Concept | The one thing to remember |
|---|---|
| The goal | Imitation-trained VLAs plateau at ~80%; autonomy needs 99%+. RL fine-tuning pushes past the plateau, with the VLA as initialization (the robotics analog of RLHF after SFT). |
| VLA anatomy | Pretrained VLM + co-training mixture (robot demos + VLM tasks + human video) + a small diffusion/flow action expert; backbone often frozen. |
| Why RL is hard | Huge (expensive grads, split cloud/local); imitation-pretrained (no critic, diffusion head has intractable likelihood, action chunking is RL-unfriendly). |
| Online vs offline | Offline = collect big batch, then train. A bug costs a re-train, not a re-collect — decisive for expensive large models. |
| Just PPO? | Works but needs massive rollouts; results mostly sim-only. On-policy wastes real-robot data — the motivation for all three themes. |
| Theme 1: offline-as-supervised | Fit a value (time-to-go, Monte Carlo) → binarize advantage → advantage-conditioned supervised fine-tune. Scales like imitation. (π*0.6: 2× throughput, 13-hr lattes.) |
| Theme 2: RL on representation | Freeze the VLA; learn a small Gaussian policy over its noise (DSRL) or compressed features (RLT). ~100× more efficient than PPO; ~65 episodes. |
| Theme 3: edit policies | Small Gaussian policy edits the base VLA's actions via actor-critic. Stabilize with best-of-N selection by the latest Q (a test-time-scaling trick). EXPO-FT: ~19 min of robot time. |
| Diffusion steering insight | Different noise → different action. RL just learns to pick good noise, never touching the VLA. |
Exciting progress: RL substantially improves the performance and speed of state-of-the-art VLAs, with real evidence of reaching deployment-grade reliability. But no satisfying solution yet: online RL should be more efficient and effective than the offline setting, and the reliance on residual/edit policies and latent-space tricks feels unsatisfying compared to cleanly doing RL on the VLA weights directly. This is a wide-open research frontier — you now understand its three load-bearing ideas.