Stanford CS 224R · Lecture 17 · Robot Foundation Models

RL for Robot Foundation Models

A giant vision-language-action model imitates its way to 80% — and stalls. The last 20% is where robots become useful. This is how reinforcement learning pushes a foundation-scale policy past the plateau, when the model is too big, too weird, and too expensive to train the normal way.

VLA fine-tuning · Offline RL as supervised learning · Diffusion steering · Edit policies
Roadmap

What You'll Master

This is an open, active research problem — the lecture (Chelsea Finn, Stanford) covers recent themes and informed opinion, not settled answers. We rebuild each idea from its motivation: why the obvious approach fails, and what clever reframing makes RL tractable on a model with billions of parameters. By the end you can read any 2025–2026 VLA-RL paper and slot it into one of three themes.

01 · Motivation

The 80% Plateau

The previous lecture asked: can we train robots in simulation and transfer to reality? This one asks a sharper question. We now have robot foundation models — huge pretrained networks that take a camera image and a language instruction and output robot actions. They're trained by imitation: copy thousands of human teleoperation demonstrations. And they're shockingly capable. So what's left to do?

The wall every imitation-trained robot hits

Performance plateaus around 80% success. In unseen rooms, with novel objects, a state-of-the-art VLA succeeds 4 times in 5. That sounds great until you realize: for a robot to act autonomously, you often need 99%+ reliability. A robot that drops the cup one time in five is a toy, not a product. That last 20% — the long tail of weird situations the demos didn't cover — is where all the value lives, and imitation alone can't get there.

The exact analogy to LLMs

This is precisely the story of large language models. Supervised fine-tuning (SFT) on demonstrations gets you a decent model; RL post-training (RLHF, RLVR) is what made them genuinely reliable and capable. Robots are at the same inflection point. Imitation learning is the SFT of robotics; RL fine-tuning is the natural next step. And crucially, the pretrained VLA isn't thrown away — it's an excellent initialization for RL, the same way a pretrained LLM is for RLHF.

So the goal of this entire lecture: take a pretrained, imitation-trained VLA and use reinforcement learning to push it from 80% to 99%+. Simple to state. The reason it's a research frontier is that VLAs are uniquely hostile to RL, as we'll see. (One footnote: DAgger — iteratively collecting expert corrections on the robot's own visited states — also helps, and is often used alongside RL.)

Imitation learning (gold) climbs fast then plateaus near 80% — more demos barely help. RL post-training (blue) starts from that plateau and grinds out the reliability tail. The gap between 80% and 99% is the entire reason this lecture exists.

02 · The Object We're Training

Anatomy of a VLA

You can't understand why RL on a VLA is hard until you know exactly what's inside one. The most common design is a vision-language-action (VLA) model, and it's built in a very specific way.

How a VLA is assembled
  1. Start from a pretrained vision-language model (VLM). Take a model that already understands images and text — one that can caption photos and answer questions. It brings broad visual and semantic knowledge for free. (Alternative designs start from a generative video model instead.)
  2. Co-train on a data mixture. Not just robot data — a blend of: robot demonstrations (the imitation-learning core), VLM tasks (Q&A, captioning, detection — to preserve the model's general grounding), and human video (motion prediction — to learn physical dynamics from cheap data).
  3. Bolt on an "action expert." A separate module that produces the actual continuous robot actions, using diffusion or flow matching. It attends to all the activations of the big LLM backbone but is deliberately small.
Definition
Action expert

The part of a VLA that outputs continuous actions. Three deliberate design choices: (1) it uses diffusion or flow matching to model the rich, multimodal distribution over actions; (2) it attends to all the backbone's activations so it sees everything the VLM understood; (3) it's designed to avoid multiple forward passes through the whole backbone — the giant backbone runs once, the small expert iterates. Often, the gradient is not passed back into the backbone, keeping the expensive part frozen.
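
A minimal sketch of design choice (3), under toy assumptions: `backbone` and `expert_step` are hypothetical stand-ins rather than any real VLA's API, and the denoising update is a placeholder. The only thing it illustrates is the call pattern: one expensive backbone pass, then many cheap expert passes.

```python
def sample_action(observation, backbone, expert_step, noise, num_denoise_steps):
    """Run the big backbone once, then iterate the small action expert."""
    features = backbone(observation)          # expensive: exactly one call
    x = noise
    for k in range(num_denoise_steps):
        x = expert_step(x, features, k)       # cheap: one call per denoising step
    return x


# Hypothetical stand-ins that only count how often they are called.
calls = {"backbone": 0, "expert": 0}

def toy_backbone(observation):
    calls["backbone"] += 1
    return observation * 2.0                  # pretend these are VLM activations

def toy_expert_step(x, features, k):
    calls["expert"] += 1
    return x + 0.5 * (features - x)           # move halfway toward the features

action = sample_action(1.0, toy_backbone, toy_expert_step,
                       noise=0.0, num_denoise_steps=10)
```

After the call, `calls` records one backbone pass and ten expert passes, which is the asymmetry the design is built around.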

Why these choices matter for RL (foreshadowing)

Every one of these design choices — diffusion action heads, frozen backbones, attending to all activations, predicting chunks of actions at once — was made to serve imitation learning. Each one quietly breaks an assumption that standard RL algorithms rely on. Hold this thought; Chapter 3 is the bill coming due.

🔗Prerequisite
The action expert uses diffusion / flow matching to turn noise into actions. If that phrase is fuzzy, the key idea you need: a diffusion policy starts from random noise and iteratively "denoises" it into a clean action, guided by the observation. Different starting noise → different action. We'll exploit exactly this in Chapter 7.
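
To make "different starting noise → different action" concrete, here is a toy sketch. It assumes a hand-written, one-dimensional velocity field with two valid action modes; a real action expert learns this field and conditions on the backbone's activations, so `toy_denoise` is purely illustrative.

```python
def toy_denoise(observation, noise, num_steps=10):
    """Integrate a hand-written velocity field from `noise` to an action.

    Hypothetical setup: for this observation there are two equally valid
    actions (say, grasp left or grasp right of the object).
    """
    modes = (observation - 1.0, observation + 1.0)
    # The sign of the starting noise decides which mode the sample flows to.
    target = modes[0] if noise < 0.0 else modes[1]
    x = noise
    for k in range(num_steps):
        t = k / num_steps
        velocity = (target - x) / (1.0 - t)   # straight-line flow toward target
        x += velocity / num_steps             # one Euler step of size 1/num_steps
    return x


left_action = toy_denoise(observation=0.0, noise=-0.3)
right_action = toy_denoise(observation=0.0, noise=0.7)
```

Same observation, two different noise draws, two different (and both valid) actions. The noise is effectively a hidden knob on the policy's output.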
03 · The Bill Comes Due

Why RL on VLAs Is Hard

Standard RL was designed for small networks trained from scratch. A VLA violates almost every one of those comfortable assumptions. Two clusters of pain:

Problem 1 — VLAs are enormous

Gradient updates are expensive at this scale, and the model is often split between cloud and local compute.

Problem 2 — VLAs are pretrained with imitation, not RL

There is no critic to start from, the diffusion action head has an intractable likelihood, and action chunking is unfriendly to standard RL.

The core tension

Everything that makes a VLA a great imitation learner — scale, frozen backbone, expressive diffusion head, action chunks — makes it a terrible fit for textbook RL. The rest of this lecture is three different strategies for routing around this, rather than smashing into it head-on. None of them is "just run PPO on the weights."

Which property of a VLA most directly breaks a vanilla policy-gradient update?
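You want to compute the REINFORCE gradient, which needs the log-probability of the action the policy took. The diffusion (or flow-matching) action head is the culprit: it can sample actions, but it gives you no tractable likelihood for them.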
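
Written out, the REINFORCE gradient makes the dependence on the log-probability explicit. The notation below is a standard textbook form rather than anything specific to the lecture; the denoising chain is written with K steps and the final sample x_0 as the action, which is an assumed convention.

```latex
% REINFORCE: the gradient needs log pi_theta(a_t | s_t) for every action taken.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]

% Gaussian policy: the log-probability is a closed-form expression.
\log \pi_\theta(a \mid s)
  = -\frac{1}{2} \sum_{i} \left[
      \frac{\left(a_i - \mu_{\theta,i}(s)\right)^2}{\sigma_{\theta,i}(s)^2}
      + \log\!\left( 2\pi\, \sigma_{\theta,i}(s)^2 \right) \right]

% Diffusion head: a = x_0 ends a K-step denoising chain, so the likelihood
% is a marginal over every intermediate sample.
\pi_\theta(a \mid s)
  = \int p(x_K) \prod_{k=1}^{K} p_\theta(x_{k-1} \mid x_k, s)\; dx_{1:K},
  \qquad a = x_0
```

The Gaussian case is one line of arithmetic; the diffusion case is an integral over the whole denoising path, which is why the gradient cannot simply be evaluated.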
04 · A Pivotal Choice

Online vs. Offline RL for VLAs

Before any algorithm, one structural decision dominates everything for large models: do you collect data and update tightly interleaved (online), or do you collect a big batch and train on it separately (offline)?

Online RL (e.g. SAC)
Collect ~1 timestep of data, take ~1 gradient step, repeat. Data collection and learning are fused into one tight loop. Classic, sample-efficient in principle.
(Iterated) Offline RL
Collect a huge batch — say 1,000 episodes — then do ~10,000 gradient steps on that fixed dataset. Collection and training are separated into phases. Repeat the whole cycle a few times.
The decisive argument: what happens when you make a mistake?

Suppose you set a hyperparameter wrong — bad learning rate, wrong number of epochs, broken gradient clipping. Online RL: the bug corrupted your tightly-coupled loop, so you must rerun the experiment and recollect all the data. On a real robot, that's hours or days of physical operation, gone. Offline RL: your dataset is still sitting there, untouched — just rerun training on the existing data. For large models where each run is enormously expensive and tuning is constant, this is a massive practical advantage. Offline is much simpler for large models.

This is why the lecture starts with offline RL, even though online is theoretically more powerful. When iteration cost dominates, the method that lets you fix a bug without re-touching the robot wins. Keep this lens: throughout, "is this online or offline, and what does a mistake cost?" is the question that decides what's practical at foundation scale.
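
The two schedules, and what a bug costs in each, can be sketched as a toy event log. This is an illustrative accounting only: the function names are hypothetical, a "collect" event is one timestep in the online schedule but one episode in the offline one, and the rule "data gathered after a buggy update must be recollected" is the simplification the argument above relies on.

```python
def online_schedule(num_iterations):
    """Online RL: one robot step, one gradient step, repeated."""
    return ["collect", "train"] * num_iterations


def offline_schedule(num_episodes, num_grad_steps):
    """One offline RL cycle: collect the whole batch, then train on it."""
    return ["collect"] * num_episodes + ["train"] * num_grad_steps


def collects_to_redo(events, bug_index):
    """Count robot 'collect' events that happened after a buggy training step.

    Data gathered after a corrupted update came from a corrupted policy,
    so in this toy accounting it has to be collected again.
    """
    for i in range(bug_index, len(events)):
        if events[i] == "train":
            return events[i + 1:].count("collect")
    return 0


online_cost = collects_to_redo(online_schedule(1000), bug_index=0)
offline_cost = collects_to_redo(offline_schedule(1000, 10000), bug_index=0)
```

With the bug present from the start, the online log has to redo every collect event after the first, while the offline log has none to redo: all of its data was gathered before any training ran.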


05 · The Obvious Attempt

Can We Just Use PPO?

The honest first question. PPO is the workhorse of RL. Can we just point it at a VLA and turn the crank? People have tried — SimpleVLA-RL fine-tunes OpenVLA with RL; πRL fine-tunes π0.5 with RL.

The answer: yes, but…

Yes, PPO can improve a VLA. But it requires a massive amount of online policy rollouts — so massive that many papers don't even report the sample count (never a good sign). And the results so far are limited to simulation-based training, where rollouts are cheap. On a real robot, where every rollout is slow and physical, naive PPO's appetite is a dealbreaker.

Why PPO is so sample-hungry here

PPO is on-policy: it can only learn from data collected by the current policy, then throws it away after a few updates. Every improvement step demands fresh rollouts. In simulation with thousands of parallel envs, fine. On one real robot collecting one slow stream of experience, you'd wait forever. This single fact — on-policy methods waste real-robot data — motivates all three themes that follow. They all find ways to reuse data the way offline/off-policy methods can.

So PPO is the baseline, not the answer. The three themes below are the field's attempts to get RL's benefits without RL's real-world sample cost — by leaning on supervised learning, on cheap latent spaces, and on small auxiliary policies.

06 · Theme 1

Offline RL as Supervised Learning

Key Theme #1

Can we formulate RL improvement as a supervised-learning problem? If so, it inherits all the machinery that already scales beautifully to giant models and datasets — the same machinery imitation learning uses. This is the most "VLA-native" approach: it reuses the model's existing strength (supervised fitting) instead of fighting it.

The recipe has two parts, and neither requires a policy gradient or a tractable action likelihood:

Offline RL via advantage-conditioned supervised learning
  1. Learn a value function by simple supervised regression. Take a pretrained VLM, condition it on the current images, the language prompt, and episode metadata, and train it to predict time-to-go (how long until the task finishes). Because it's fit on completed demonstrations, the target is known — this is just Monte Carlo value estimation: the actual observed outcome is the label. No bootstrapping, no critic instability, pure supervised learning.
  2. Improve the policy with that value function. For each action in the data, estimate its advantage — was this action better or worse than average from this state? Then binarize it: just a thumbs-up or thumbs-down. Finally, fine-tune the VLA with supervised learning, conditioned on that good/bad label: imitate the good actions, told they're good. The policy never needs a gradient through a reward — it just learns "in states like this, do more of the thumbs-up actions."
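
The two steps above can be sketched in a few lines. This is one plausible construction, not the lecture's exact estimator: it assumes a reward of -1 per step (so the value is minus the time-to-go), a one-step advantage, and a threshold of zero. The function names are hypothetical.

```python
def time_to_go_targets(episode_length):
    """Monte Carlo labels: steps remaining until the episode finished."""
    return [episode_length - 1 - t for t in range(episode_length)]


def binarized_advantages(predicted_time_to_go, threshold=0.0):
    """Label each step good (True) or bad (False).

    Assumption: reward is -1 per step, so V(s) = -time_to_go(s) and
    A_t = r + V(s_{t+1}) - V(s_t) = (ttg_t - ttg_{t+1}) - 1.
    An action is "good" if it cut the predicted time-to-go by more
    than the one step it consumed.
    """
    labels = []
    for t in range(len(predicted_time_to_go) - 1):
        progress = predicted_time_to_go[t] - predicted_time_to_go[t + 1]
        advantage = progress - 1.0
        labels.append(advantage > threshold)
    return labels


targets = time_to_go_targets(4)
labels = binarized_advantages([10.0, 8.0, 8.5, 7.5])
```

In the example, the first action cut the predicted time-to-go by two steps (good), the second made things worse (bad), and the third made exactly the expected progress (not better than average, so bad under a strict threshold).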
Definition
Advantage

How much better an action is than the policy's average from that state — the action's value minus the state's value. Positive advantage = "this beat expectation, do more of it." Binarizing it (good vs. bad) throws away magnitude but makes the policy update a dead-simple supervised classification target, which is exactly why it scales. This is the same idea behind advantage-weighted regression (AWR/AWAC) from the offline-RL literature, simplified to a binary signal.
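
For reference, the definition and its two uses can be written side by side. The symbols β (temperature), ε (threshold), and 𝒟 (dataset) are notation introduced here for illustration; for a diffusion or flow-matching head, log π would be replaced by the head's own training loss, which is an assumption about how the conditioning is implemented.

```latex
% Advantage: action value minus state value.
A(s, a) = Q(s, a) - V(s)

% Advantage-weighted regression: weight each action's imitation loss.
\pi_{\text{AWR}}
  = \arg\max_{\pi}\;
    \mathbb{E}_{(s,a) \sim \mathcal{D}}
    \left[ \exp\!\left( A(s,a) / \beta \right) \log \pi(a \mid s) \right]

% Binarized, advantage-conditioned variant: a good/bad indicator as an input.
I(s, a) = \mathbb{1}\!\left[ A(s,a) > \epsilon \right]
\qquad
\pi_{\text{cond}}
  = \arg\max_{\pi}\;
    \mathbb{E}_{(s,a) \sim \mathcal{D}}
    \left[ \log \pi\!\left(a \mid s,\, I(s,a)\right) \right]

% Deployment: always ask for the "good" mode.
a \sim \pi_{\text{cond}}(\,\cdot \mid s,\ I = 1)
```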

Put together as a loop (this is Physical Intelligence's π*0.6, "a VLA that learns from experience"):

1. Collect
Large batch of rollouts + human interventions on the robot
2. Fit value
Supervised: predict time-to-go (Monte Carlo)
3. Improve policy
Advantage-conditioned supervised fine-tuning of the VLA
↻ repeat (iterated offline RL)
Does it actually work? Yes — lattes.

RL post-training gave a 2× improvement in throughput over imitation-only post-training. The flagship demo: a robot making lattes — reliably, for 13 hours of continuous operation, including collaborating with a person. That 13-hour figure is the whole point: it's the reliability that imitation's 80% plateau could never reach.

Is this the best recipe? The lecturer's honest caveats

No — it's a strong, scalable start with clear room to improve: (1) TD updates (bootstrapping from your own value estimates) should beat pure Monte Carlo, even at scale, by propagating value information faster. (2) More powerful policy-improvement methods than binary advantage should help. (3) Online RL should ultimately be more data-efficient and reach higher performance, because it can actively seek out failure modes and test new strategies — at the cost of more infrastructure. Themes 2 and 3 chase exactly that online efficiency.

🧮 Reason It Through Why does "binarize the advantage + supervised fine-tune" improve the policy?
It looks like plain imitation learning — you're just doing supervised learning on actions from the dataset. So how is it RL at all, and why does it beat imitating everything?
The rollout data contains both good and bad actions (the robot was at 80%, so roughly 1 in 5 episodes ended in failure). Plain imitation would copy the actions from those failures too. The advantage label tells the policy which is which.
By conditioning the supervised target on "this action was good," the model learns a conditional policy. At test time you condition on "good," and it produces the good-action distribution — effectively imitating only the better-than-average actions.

It's RL because the value function injects reward information. Plain imitation treats every demonstrated action as equally worth copying. Here, the learned value function scores each action by its expected future reward (via time-to-go), and the advantage compares it to the state's baseline. That reward-derived signal is what makes it reinforcement learning, not imitation.

Why binarized-conditioning improves the policy: you fine-tune the VLA to predict actions conditioned on a good/bad token. The model learns two modes: "what good actions look like here" and "what bad actions look like." At deployment you always condition on "good," sampling from the better-than-average action distribution. Mathematically this is a simplified advantage-weighted regression: instead of weighting each action's imitation loss by a continuous function of its advantage, you hard-threshold to a binary weight. You lose magnitude information but gain a clean, scalable supervised target.

Why it scales: both steps are pure supervised learning — exactly what billion-parameter transformers are good at. No policy gradient, no diffusion-likelihood, no critic instability. That's the whole appeal of Theme 1.

Checkpoint — you shall not pass
A skeptic says: "Theme 1 is just imitation learning with extra steps — you're still doing supervised learning on dataset actions." In your own words, what makes it genuinely reinforcement learning, and why does it beat plain imitation on the same data?
Model answer
It's reinforcement learning because a value function injects reward information that plain imitation never sees. The value model is trained (on Monte Carlo time-to-go) to predict expected future reward; the advantage then says whether an action beat the state's baseline. Plain imitation copies every action in the data equally, including those from the roughly 20% of episodes that ended in failure. Theme 1 instead conditions the supervised target on a good/bad label, so the model learns "what good actions look like here" and, at deployment, is asked to produce good ones. It's a simplified advantage-weighted regression: same scalable supervised machinery, but the reward-derived advantage steers which actions get reinforced. So it keeps imitation's scalability while gaining RL's ability to prefer better-than-average behavior.
07 · Theme 2

RL on the VLA's Representation

Theme 1 fine-tuned the whole VLA (with supervised updates). Theme 2 asks a more radical question: can we improve the VLA without fine-tuning it end-to-end at all? The VLA is huge and fragile; touching its weights is expensive and risky. What if we leave it frozen and learn a small policy on top?

Key Theme #2

Learn a separate, small Gaussian policy that operates on the VLA's representation — not on the robot's raw action space, and not by changing the VLA's weights. A Gaussian policy has a clean, tractable likelihood, so all of standard off-policy RL (SAC and friends) works out of the box. The trick is choosing the right representation to act on.

Version A — Diffusion steering (the star result)

Here's the clever part. Recall the action expert is a diffusion model: it turns a noise vector into an action. Different noise vectors lead to different actions. So instead of controlling the action directly, control the noise.

Definition
Diffusion steering (DSRL)

Treat the diffusion policy's input noise as the new "action space." Train a small RL policy whose job is to output the noise vector that, when fed through the frozen VLA, produces a good action. The VLA never changes; you've turned "improve the robot" into "pick better noise" — a low-dimensional, tractable RL problem with a clean Gaussian policy.

DSRL — Steering Your Diffusion Policy with Latent-Space RL (CoRL 2025)
  1. Policy sampling: the small steering policy looks at the state and outputs a noise vector. Feed that noise through the frozen VLA to denoise it into an action chunk. Run the chunk in the environment, observe the next state.
  2. Training: store each (state, noise, reward, next state) transition in a replay buffer. Sample minibatches and update a Q-function and the steering policy using SAC. Standard off-policy actor-critic applies because the steering policy is a clean Gaussian over noise.
The headline number

DSRL learned from just 65 online episodes — about 10,000 steps. That's roughly 100× more sample-efficient than PPO on the same problem. By acting on the frozen VLA's noise input instead of its weights, you get real-robot RL that actually fits in a real-robot data budget. And afterward, you can distill the improved behavior back into the VLA itself.

Version B — Compress, then RL

A sibling idea (RLT): instead of steering noise, compress the VLA's visual representation into a compact latent, and run RL on top of that compressed representation. Same philosophy — don't do RL on billions of weights or raw pixels; do it on a small, information-rich latent the VLA already computed. The frozen foundation model becomes a feature extractor; RL is cheap on its features.

💡The unifying move
Both versions share one insight: do RL in a small, well-behaved space the foundation model hands you — its noise input or its compressed features — rather than in the giant, awkward space of its weights or the raw action space. You inherit the VLA's competence for free and only learn the small delta that RL is good at.
08 · Theme 3

Small Edit Policies

Theme 3 is the closest to "just do actor-critic RL," but cleverly contained. Keep the frozen VLA as a strong base policy, and learn a small Gaussian policy that edits the base's actions — nudging them toward higher value.

Key Theme #3

The VLA proposes an action; a small learned edit policy proposes a correction; the corrected action is what runs. Train the edit policy with actor-critic RL to maximize the Q-function. The base VLA stays frozen and keeps the behavior sane; the tiny edit policy does the RL-style optimization. (e.g. Probe-Learn-Distill, and EXPO.)

The instability problem this creates

On its own, an edit policy is fragile, for two textbook-RL reasons. (1) The edit policy lags behind the Q-function — the critic improves faster than the actor can chase it, so the edits are always optimizing a slightly stale target. (2) The edit policy can collapse — the same actor-degeneracy that plagues normal RL. Left alone, the edits drift or blow up.

The fix: best-of-N sampling against the latest Q

EXPO's stabilizer is elegant and very much in the spirit of modern test-time scaling. Instead of trusting the (lagging) edit policy to directly output the best action, generate several candidates and let the freshest Q-function pick:

EXPO — stable RL with expressive policies (ICLR 2026)
  1. Sample multiple candidate actions from the base VLA policy.
  2. For each, sample multiple edited versions from the small edit policy.
  3. From the whole pool of base + edited candidates, pick the one with the highest Q-value — using the latest Q-function, evaluated on the fly.
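
The three steps above can be sketched as a single selection function. This is a toy under stated assumptions: actions are scalars, the edit is additive, and `toy_base`, `toy_edit`, and `toy_q` are hypothetical stand-ins for the frozen VLA, the edit policy, and the learned Q-function.

```python
import random


def select_action(state, base_policy, edit_policy, q_function,
                  num_base=4, num_edits=4, rng=None):
    """Best-of-N over base and edited candidates, ranked by the current Q."""
    if rng is None:
        rng = random.Random(0)
    candidates = []
    for _ in range(num_base):
        base_action = base_policy(state, rng)
        candidates.append(base_action)                    # keep the unedited action
        for _ in range(num_edits):
            edit = edit_policy(state, base_action, rng)
            candidates.append(base_action + edit)         # edited version
    return max(candidates, key=lambda a: q_function(state, a))


# Hypothetical stand-ins for illustration only.
def toy_base(state, rng):
    return rng.gauss(state, 0.5)          # base policy: roughly right, noisy

def toy_edit(state, base_action, rng):
    return rng.gauss(0.0, 0.1)            # small learned correction

def toy_q(state, action):
    return -(action - (state + 1.0)) ** 2  # best action is state + 1


chosen = select_action(0.0, toy_base, toy_edit, toy_q, rng=random.Random(1))
```

Because the unedited base actions stay in the pool, a degenerate edit policy cannot make the choice worse than the best base sample under the current Q.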
Why this works

Selecting with the freshest Q-function reduces the lag (you're using up-to-the-moment value estimates, not the edit policy's stale guess) and is resilient to edit-policy collapse (even if the edit policy degenerates, best-of-N can still fall back to a good base action). It's arguably a form of test-time scaling: spend more compute at inference (more candidates) to get a better action, exactly like sampling-and-ranking for LLMs. A subtle but important detail: when fitting the Q-function, you also use this on-the-fly best-of-N policy to pick the next-state actions in the Bellman backup — ablations show that skipping this hurts performance badly.
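
A minimal sketch of that last detail, assuming a standard one-step Bellman target with discount `gamma`. `sample_candidates` is a hypothetical callable that returns the pool of base and edited actions for a state; target networks, ensembles, and other stabilizers a real implementation would likely need are omitted.

```python
def bellman_target(reward, next_state, done, q_function, sample_candidates,
                   gamma=0.99):
    """One-step target where the next action is chosen by best-of-N under Q."""
    if done:
        return reward                                   # no future value at episode end
    next_candidates = sample_candidates(next_state)     # base + edited actions
    best_next_q = max(q_function(next_state, a) for a in next_candidates)
    return reward + gamma * best_next_q


# Toy usage with hand-written stand-ins (illustrative values only).
target = bellman_target(
    reward=1.0,
    next_state=0.0,
    done=False,
    q_function=lambda s, a: -abs(a),                    # prefers actions near zero
    sample_candidates=lambda s: [-2.0, 0.5, 3.0],
    gamma=0.9,
)
```

The backup uses the same selection rule the robot acts with, so the Q-function is trained to evaluate the policy that actually runs rather than the lagging edit policy alone.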

Results: 19 minutes of robot time

EXPO-FT fine-tuned a VLA from just ~19 minutes of real-world experience (about 11,000 steps), running ~10× faster than the base, with higher reliability than SFT and DAgger and learning more efficiently than DSRL and HIL-SERL. Ablations confirm the design: remove the edit policy and value maximization stalls; remove the on-the-fly best-of-N in the Bellman backup and performance craters in some environments.

📐 Design It Pick a theme for your real-robot budget
You have a pretrained diffusion-based VLA stuck at ~82% on a folding task, ONE real robot, and a budget of about 30 minutes of robot time per training cycle. You cannot afford millions of rollouts. Which of the three themes do you reach for, and why? What's your fallback if it's unstable?
Base
diffusion VLA, 82%
Hardware
1 real robot
Budget
~30 min/cycle

PPO is out immediately: it needs a massive number of online rollouts, and its results so far are limited to simulation. With 30 minutes of real data, you need extreme sample efficiency.

Best fit: Theme 2 (DSRL) or Theme 3 (EXPO). Both hit your budget — DSRL learned from ~65 episodes, EXPO-FT from ~19 minutes. Since your VLA is diffusion-based, DSRL is especially natural: steer the diffusion noise, keep the VLA frozen, run SAC on the low-dimensional noise space. No end-to-end fine-tuning of a fragile billion-param model.

If unstable, go EXPO-style: add a small Gaussian edit policy on the base actions, and crucially add best-of-N selection with the latest Q-function. That directly attacks the two instabilities (Q-lag and edit-policy collapse) and gives you a test-time-scaling knob — sample more candidates when reliability matters most.

Theme 1 (offline-as-supervised) is the safe, scalable baseline to also run: collect the batch, fit a time-to-go value, advantage-condition, supervised fine-tune. It won't be the most sample-efficient, but a bug costs you a re-train, not a re-collect.

09 · Put It Together

Diffusion Steering, Interactive

The single most important mental model in this lecture: different noise → different action, and RL just learns to pick good noise. Drag the noise vector below and watch the frozen VLA denoise it into different actions, each with a different value. The steering policy's whole job is to find the high-value region of noise — without ever touching the VLA.

Left: the noise vector you control. Right: the action the frozen VLA produces from it, colored by Q-value (green = good, red = bad). Drag the noise dot, or let the steering policy climb to the best noise.

What the showcase is teaching

Notice what you're not doing: you never changed the VLA. You only searched its noise input for the action it already could produce but wasn't reliably choosing. That's why Theme 2 is so sample-efficient — the foundation model's competence is intact; RL only learns a small policy over a small space. The same picture explains Theme 3 (search over edits instead of noise) and contrasts with Theme 1 (re-fit the whole model with supervised advantage targets).
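
The same picture as a toy script. To keep it short, it swaps SAC for plain hill climbing on a single noise value, and it assumes a hand-written `frozen_decoder` and `reward`; none of this is the DSRL implementation. It only shows that searching the noise input can improve behavior while the decoder stays fixed.

```python
import math
import random


def frozen_decoder(noise):
    """Hypothetical frozen policy: maps a noise value to an action. Never updated."""
    return math.tanh(noise)


def reward(action):
    """Toy task: the best action is 0.5."""
    return -(action - 0.5) ** 2


def steer_noise(num_iterations=200, step_size=0.3, seed=0):
    """Hill-climb over the noise input, keeping the decoder untouched."""
    rng = random.Random(seed)
    best_noise = 0.0
    best_reward = reward(frozen_decoder(best_noise))
    for _ in range(num_iterations):
        candidate = best_noise + rng.gauss(0.0, step_size)
        candidate_reward = reward(frozen_decoder(candidate))
        if candidate_reward > best_reward:               # keep only improvements
            best_noise, best_reward = candidate, candidate_reward
    return best_noise, best_reward


steered_noise, steered_reward = steer_noise()
```

Only `best_noise` changes during the search. The decoder is the same function at the end as at the start, which mirrors the frozen VLA.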

💻 Implement It Write the DSRL action-sampling step
Fill in one environment step of diffusion steering: the small policy picks noise, the frozen VLA denoises it into an action chunk, you act, and you store the transition for SAC. The key idea: the RL "action" is the noise, not the robot action.
signature
    def dsrl_step(state, steer_policy, frozen_vla, env, buffer):
        """One step: noise -> action via frozen VLA, store (s, noise, r)."""
        # TODO: sample noise from steer_policy, denoise via VLA, act, store
Sanity check
The replay buffer stores (state, NOISE, reward, next_state) — NOT (state, robot_action, ...). SAC updates a Q over (state, noise). The VLA's weights are never in the optimizer.
solution
    import torch

    def dsrl_step(state, steer_policy, frozen_vla, env, buffer):
        """One step: noise -> action via frozen VLA, store (s, noise, r, s', done)."""
        # 1. Steering policy outputs a noise vector (the RL "action").
        #    Gaussian policy -> tractable log-prob, clean for SAC.
        noise = steer_policy.sample(state)  # shape: [noise_dim]

        # 2. Frozen VLA denoises that noise into an action CHUNK.
        #    No gradient through the VLA; it's a fixed function here.
        with torch.no_grad():
            action_chunk = frozen_vla.denoise(state, noise)  # [H, action_dim]

        # 3. Execute the chunk, observe outcome.
        next_state, reward, done = env.step(action_chunk)

        # 4. Store the transition keyed on NOISE, not the robot action.
        buffer.add(state, noise, reward, next_state, done)
        return next_state, reward, done

    # SAC update (elsewhere): Q(state, noise), pi_steer(noise | state).
    # frozen_vla.parameters() are NOT passed to any optimizer.
The whole trick in one line: buffer.add(state, noise, ...). By making the noise the RL action, you've converted "improve a frozen diffusion VLA" into a small, standard SAC problem over a low-dimensional Gaussian. After training, you can distill the steered behavior back into the VLA so deployment needs no separate steering policy.
10 · Consolidate

Cheat Sheet & Outlook

Concept | The one thing to remember
The goal | Imitation-trained VLAs plateau at ~80%; autonomy needs 99%+. RL fine-tuning pushes past the plateau, with the VLA as initialization (the robotics analog of RLHF after SFT).
VLA anatomy | Pretrained VLM + co-training mixture (robot demos + VLM tasks + human video) + a small diffusion/flow action expert; backbone often frozen.
Why RL is hard | Huge (expensive grads, split cloud/local); imitation-pretrained (no critic, diffusion head has intractable likelihood, action chunking is RL-unfriendly).
Online vs offline | Offline = collect big batch, then train. A bug costs a re-train, not a re-collect, which is decisive for expensive large models.
Just PPO? | Works but needs massive rollouts; results mostly sim-only. On-policy wastes real-robot data, which is the motivation for all three themes.
Theme 1: offline-as-supervised | Fit a value (time-to-go, Monte Carlo) → binarize advantage → advantage-conditioned supervised fine-tune. Scales like imitation. (π*0.6: 2× throughput, 13-hr lattes.)
Theme 2: RL on representation | Freeze the VLA; learn a small Gaussian policy over its noise (DSRL) or compressed features (RLT). ~100× more efficient than PPO; ~65 episodes.
Theme 3: edit policies | Small Gaussian policy edits the base VLA's actions via actor-critic. Stabilize with best-of-N selection by the latest Q (a test-time-scaling trick). EXPO-FT: ~19 min of robot time.
Diffusion steering insight | Different noise → different action. RL just learns to pick good noise, never touching the VLA.
The honest outlook (the lecturer's own)

Exciting progress: RL substantially improves the performance and speed of state-of-the-art VLAs, with real evidence of reaching deployment-grade reliability. But no satisfying solution yet: online RL should be more efficient and effective than the offline setting, and the reliance on residual/edit policies and latent-space tricks feels unsatisfying compared to cleanly doing RL on the VLA weights directly. This is a wide-open research frontier — you now understand its three load-bearing ideas.

Where to go next

🔗The thread continues
Lecture 16 made a robot move by crossing the sim-to-real gap. This lecture made a foundation-scale robot reliable by crossing the 80%-to-99% gap with RL. The final lecture zooms all the way out: the complete deep-RL toolbox, the open frontiers (non-verifiable rewards, world models, scaling, safety, evaluation), and how to actually do research in this field.