A policy trained in a perfect dream. A robot deployed in a messy world. This is the story of how we get the dream to survive contact with reality — and every trick the field invented to close the gap.
This lecture (Guanya Shi, CMU) biases toward breadth — the whole landscape of sim2real robot learning. We will not skim. Every method below gets rebuilt from its underlying equation, with the data flow traced and the engineering decision justified. By the end you can look at any sim2real paper and place it on the map.
You want a humanoid robot to do a backflip. The obvious plan: let it try, learn from its mistakes, repeat a million times. But a real humanoid costs as much as a car, and a million falls will turn it into scrap metal long before it learns anything. Reinforcement learning needs tens of millions of trial-and-error steps. No physical robot survives that.
So we cheat. We build a simulator — a video-game version of the robot and the world — and we let a virtual robot fall a billion times for free. When the virtual robot finally masters the backflip, we copy its brain (the trained policy) into the real robot and hope it works.
That last word — hope — is the entire subject of this lecture. The transfer from a simulated policy to a real robot is called sim2real, and the difference between the simulator and reality is the reality gap. Everything we study is a tool for making that gap small enough that hope becomes a guarantee.
Sim2real, defined: train a control policy π entirely in simulation, then deploy it on physical hardware zero-shot (no real-world retraining) or with minimal real-world fine-tuning. The challenge is that the simulator and the real robot never have identical dynamics.
Four advantages make simulation the default substrate for modern robot learning. Each one maps to a hard physical constraint it dissolves:
| Advantage | What it dissolves | Concrete number |
|---|---|---|
| Cheap, fast, scalable | Wall-clock time & money | A quadruped locomotion policy trained for 18 seconds on a 2020 MacBook Pro (M1) using RLtools. Real-world, that's months. |
| Safe | Hardware destruction | The virtual robot can fall off a cliff, slam into walls, blow its motors — cost: zero. |
| Labeled (oracle access) | Perception & measurement cost | In sim you know the exact contact forces, terrain height, friction, and body velocity — ground truth that a real sensor suite can only approximate. |
| No wear & tear | Mechanical degradation | Real gears strip, bearings wear, joints develop slop. A simulated joint is forever new. |
The labeled (oracle access) advantage is underrated. To train a robot to walk over rubble, you need to know the terrain height under each foot. In the real world that requires expensive lidar and still comes out noisy. In a simulator like PyBullet or MuJoCo, the terrain height is a variable you already set. You get a perfect, free, dense label for every quantity in the world. This is why "privileged information" (Section 5) is even possible: it only exists because sim hands it to you.
Here is the whole problem stated formally. A simulator is a function that advances the world one tick:

$$x_{t+1} = f_{\text{sim}}(x_t, u_t, e)$$

Read every symbol slowly, because the entire field lives inside this line:

- $x_t$: the state at tick $t$: joint angles, joint velocities, body pose, body velocity. Think "everything you'd need to know to predict the future."
- $u_t$: the action the policy commands at tick $t$ (e.g. target joint angles).
- $e$: the environment parameters, such as masses, friction, and motor gains.

The real world runs the exact same kind of equation, $x_{t+1} = f_{\text{real}}(x_t, u_t)$, but with a different $f$ and a different $e$ that you don't fully know. The reality gap is precisely the mismatch $f_{\text{sim}} \neq f_{\text{real}}$. Hold onto this equation: every method in this lecture is a different way of attacking the $e$ in this line.
"If we just build a perfect enough simulator, sim2real becomes trivial." False, and it's the trap beginners fall into. A bit-perfect simulator is computationally impossible (you'd need to model every air molecule and every micro-scratch on every gear). More importantly, you don't need perfection — you need a policy that is robust to the imperfection. The field's biggest wins came not from better physics but from smarter training (domain randomization, adaptation). Chasing fidelity is often the wrong axis.
Before we can talk about the gap, we need to know what a robot simulator actually does mechanically. Strip away the marketing and a physics-based simulator is a loop that repeatedly solves Newton's equations for a system of rigid bodies connected by joints, while resolving the nasty bits — contacts and friction — that make robotics hard.
A physics-based simulator is a program where the state evolves according to explicit physical laws (rigid-body dynamics, contact mechanics) rather than learned prediction. Contrast with a world model, a neural network that learns to predict $x_{t+1}$ from data. That is a non-physics-based simulator, and a frontier topic of its own.
One simulation tick (say 1/200th of a second) does roughly this:
1. Convert the action $u_t$ (e.g. target joint angles) into joint torques via the motor model. This is where parameters $e$ like motor gain and torque limits enter.
2. Solve the equations of motion $M(q)\ddot{q} = \tau + \dots$, where $M$ is the mass matrix (it depends on link masses and inertias: more $e$).
3. Resolve contacts and friction (friction and restitution coefficients: still more $e$). This is the hardest, slowest, least-accurate part.
4. Integrate over $\Delta t$ to get $x_{t+1}$.

Notice how many places $e$ sneaks in: motor gains, masses, inertias, friction, restitution, joint damping. Each is a parameter you set in sim and can only estimate in reality. That's the surface area of the reality gap, enumerated.
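The tick described above can be sketched on a toy system. This is a minimal illustration, not how any particular engine is implemented: it assumes a single pendulum joint, a PD motor model with a torque limit, and semi-implicit Euler integration, and it skips contact resolution entirely. The function name `sim_step` and every parameter value are hypothetical.

```python
import numpy as np

def sim_step(x, u, e, dt=1.0 / 200.0):
    """Advance a 1-DoF pendulum one tick. x = (q, qdot), u = target joint angle."""
    q, qd = x
    # 1. Motor model: PD law on the target angle, clipped at the torque limit.
    tau = e["kp"] * (u - q) - e["kd"] * qd
    tau = np.clip(tau, -e["tau_max"], e["tau_max"])
    # 2. Equations of motion: M * qdd = tau - gravity torque - joint damping.
    M = e["mass"] * e["length"] ** 2
    gravity = e["mass"] * 9.81 * e["length"] * np.sin(q)
    qdd = (tau - gravity - e["damping"] * qd) / M
    # 3. Contact resolution is omitted: this toy system never touches anything.
    # 4. Semi-implicit Euler integration over dt.
    qd_next = qd + dt * qdd
    q_next = q + dt * qd_next
    return np.array([q_next, qd_next])

# Hypothetical parameter sets: the "real" one differs only in mass and damping.
e_sim = dict(kp=40.0, kd=2.0, tau_max=15.0, mass=1.0, length=0.5, damping=0.05)
e_real = dict(e_sim, mass=1.1, damping=0.08)

x_sim = x_real = np.array([0.0, 0.0])
for _ in range(200):  # one second at 200 Hz, same command in both worlds
    x_sim = sim_step(x_sim, 0.5, e_sim)
    x_real = sim_step(x_real, 0.5, e_real)
print("sim:", x_sim, "real:", x_real)
```

Every entry of the `e` dictionary is one of the places where a sim value can disagree with the hardware, which is why the same command ends in a slightly different state in the two runs.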
There are many simulators because rendering graphics and computing physics are mature fields. The taxonomy that trips everyone up:
IsaacSim is a simulator (the physics engine). IsaacLab is a robot-learning framework built on top of it (it gives you environments, RL plumbing, randomization tooling). Mixing these up is the #1 vocabulary error. The simulator computes physics; the framework wraps it into something you can train policies in.
| Name | What it is | Why people use it |
|---|---|---|
| MuJoCo (CPU) | Classic, accurate contact-rich physics engine | The research gold standard for accuracy on contacts and articulated bodies. |
| MuJoCo XLA (MJX) | MuJoCo reimplemented in JAX | Runs on GPU/TPU, vectorizes across thousands of envs. |
| MuJoCo Warp | MuJoCo on NVIDIA Warp | Massively parallel GPU physics. |
| mjlab | IsaacLab-style manager API = MuJoCo Warp + IsaacLab − IsaacSim | IsaacLab ergonomics without the IsaacSim dependency. |
| IsaacSim / IsaacLab | NVIDIA simulator + learning framework | GPU-native, thousands of parallel robots, great for locomotion at scale. |
The reason GPU simulators conquered robot learning is one number: thousands of robots in parallel. Instead of one robot collecting one stream of experience, a modern GPU simulator runs 4,096 (or more) copies of the robot simultaneously, each in a slightly different world. The RL algorithm gulps experience thousands of times faster than wall-clock.
Suppose real-time control runs at 50 Hz (50 steps/second). One real robot gives you 50 transitions per second. A GPU simulator running 4,096 envs at 5× real-time gives you 4096 × 50 × 5 = 1,024,000 transitions per second, roughly a 20,000× data rate. A task needing 2 billion transitions takes about 460 days (roughly 1.3 years) of nonstop operation on one real robot, versus about 33 minutes in parallel sim. That ratio is why sim exists.
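The arithmetic can be rechecked in a few lines. This is a back-of-the-envelope sketch using only the numbers stated in the paragraph (50 Hz control, 4,096 envs, 5× real-time, 2 billion transitions); the variable names are illustrative.

```python
control_hz = 50            # transitions per second from one real robot
num_envs = 4096            # parallel simulated environments
speedup = 5                # each env runs at 5x real-time
needed = 2_000_000_000     # transitions the task requires

real_rate = control_hz
sim_rate = num_envs * control_hz * speedup
real_days = needed / real_rate / 86_400   # 86,400 seconds per day
sim_minutes = needed / sim_rate / 60

print(f"sim rate: {sim_rate:,} transitions/s ({sim_rate / real_rate:,.0f}x one robot)")
print(f"one real robot: {real_days:,.0f} days; parallel sim: {sim_minutes:.1f} minutes")
```

The one-robot figure assumes the hardware runs around the clock with no resets, repairs, or battery swaps, so it is a lower bound on real-world collection time.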
The general recipe the whole field follows fits in one picture: physics-based simulator → massively parallel training envs → policy optimization with RL (usually PPO) → real-world deployment, with an optional real2sim arrow feeding real data back to fix the simulator. The rest of this lecture is the contents of the arrows.
Demo: a grid of tiles, each one a parallel environment with its own randomized parameters e. A single shared policy advances across all of them at once, which is what "massively parallel" looks like.
"Sim2real is never easy." The reason splits cleanly into two failure modes, and naming them precisely tells you which tool to reach for.
Parametric mismatch is fixable by tuning — system identification (Section 6) finds the right numbers. Non-parametric mismatch needs augmenting the model — adding a learned residual term (also Section 6) or making the policy robust enough to not care (Section 4). When a deployed policy fails, the first diagnostic question is always: "is this a wrong-number problem or a missing-physics problem?"
Here's the part that surprises people. A 4% error in friction sounds harmless. Why does it make a backflip policy face-plant? Because a policy trained with RL is an exquisitely tuned exploiter. It learns the precise sequence of torques that works in its world. If the real world responds even slightly differently, the errors compound over time: a small slip at step 1 puts the robot in a state slightly off the trajectory it trained on, which it handles slightly wrong, putting it further off, and within a hundred timesteps it's in a state it has never seen, flailing.
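Model the per-step state error as growing by a factor ρ each tick (the dynamics are locally expanding near an aggressive maneuver). Start with a tiny sim/real mismatch of ε0 = 0.001 rad. If ρ = 1.05 (5% growth/step) over a 1-second maneuver at 200 Hz = 200 steps:

$$\varepsilon_{200} = \varepsilon_0\,\rho^{200} = 0.001 \times 1.05^{200} \approx 17\ \text{rad}$$

A one-milliradian mismatch has grown to more than two and a half full rotations of error, far beyond anything the policy saw in training.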
The lesson: in unstable, contact-rich maneuvers, errors don't add — they multiply. This is exactly why open-loop replay of a sim trajectory on a real robot almost never works, and why robustness/adaptation is non-negotiable.
Demo: a policy's planned trajectory (gold) vs. what happens on the real robot (red) as the parameter mismatch increases. A small gap turns into a fall.
There's a second gap nobody warns you about: even with perfect physics, how do you specify the task? For locomotion, "go forward, stay upright" is easy to reward. For dexterous manipulation — "pour the water without spilling," "assemble the part" — writing a reward function by hand is brutally hard. This is why human-data methods (Section 7) exist: they sidestep reward design by imitating demonstrations. Keep this in mind — sim2real isn't only a dynamics problem, it's a task-specification problem too.
The first and most widely-used weapon. The idea is almost insultingly simple, which is why it works so well.
Don't train on one world. Train on a thousand random worlds. If your policy works whether friction is 0.6 or 1.2, whether the robot weighs 4.8 kg or 5.4 kg, whether there's 20 ms of sensor delay or 60 ms — then the real world's particular values are just one more sample it already handles.
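Formally, we randomize e in our governing equation $x_{t+1} = f_{\text{sim}}(x_t, u_t, e)$ and maximize the return averaged over the resulting worlds:

$$\max_\theta \; \mathbb{E}_{e \sim p(e)}\Big[\, \mathbb{E}_{\tau \sim \pi_\theta,\, f_{\text{sim}}(\cdot,\cdot,e)}\big[R(\tau)\big] \Big]$$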
Compare this to ordinary RL, which maximizes return for a single fixed e. The only change is the outer expectation Ee∼p(e) over a distribution of environments p(e). In practice each of the thousands of parallel envs draws its own e at episode start. We train one policy πθ(x) — note: it takes only the state x, not e — and ask it to perform well on average across all of them.
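A minimal sketch of the outer expectation, assuming a toy sliding-block task rather than a real robot: each of 4,096 environments draws its own friction and mass, one shared gain `theta` plays the role of the policy's weights, and the mean return across environments is a Monte Carlo estimate of the randomized objective. The task, the ranges, and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env_params(num_envs):
    """Draw one e per environment at episode start (illustrative ranges)."""
    return {
        "friction": rng.uniform(0.6, 1.2, num_envs),
        "mass_kg": rng.uniform(4.8, 5.4, num_envs),
    }

def rollout_returns(theta, e, steps=200, dt=0.02):
    """Roll the same policy out in every environment; return one return per env."""
    n = e["mass_kg"].shape[0]
    x, v, ret = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(steps):
        u = theta * (1.0 - x)  # shared policy pi_theta(x): it sees x, never e
        accel = (u - e["friction"] * e["mass_kg"] * v) / e["mass_kg"]
        v = v + dt * accel
        x = x + dt * v
        ret -= dt * (1.0 - x) ** 2  # penalize distance from the goal x = 1
    return ret

e = sample_env_params(4096)
# Mean over environments approximates the expectation over p(e).
results = {theta: rollout_returns(theta, e).mean() for theta in (5.0, 20.0, 80.0)}
for theta, score in results.items():
    print(f"theta={theta:5.1f}  estimated E_e[J(theta, e)] = {score:.4f}")
```

An RL algorithm would ascend this averaged score with respect to `theta`; the sketch only evaluates it at three hand-picked values.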
There is no free lunch. Asking one policy to handle every world forces a conservative solution — it can't exploit any particular world's quirks, so it leaves performance on the table. A policy that knew friction was exactly 1.1 could be more aggressive than one hedging across [0.6, 1.2].
Wider randomization is not always better. Widen p(e) too far and you ask the policy to handle worlds that don't exist (friction of 0.05, like wet ice everywhere). It becomes so timid it barely moves: this is over-conservatism. Too narrow, and the real world falls outside the training distribution and the policy fails. Tuning the width of p(e) is the real engineering. A good heuristic: randomize widely enough that reality is comfortably inside the range, no wider.
The original domain-randomization paper randomized perception (textures, lighting, camera position) so a vision policy wouldn't overfit to sim graphics. But the idea generalized to every part of the pipeline:
Real systems that lean on this: Agile But Safe (agile-but-safe.github.io) randomizes e to train one robust quadruped policy; RPL (Robust Perceptive Locomotion) randomizes terrain and dynamics so a humanoid walks over challenging ground it never specifically trained on. In both, the recipe is identical: randomize e, train a single π(x) robust to all of it.
Demo: a policy's success rate vs. real-world friction for different randomization widths. The trade-off: a narrow-trained policy peaks high but is brittle; a wide-trained policy is flatter but survives the unknown real value (dashed line).
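Starting from the single-environment objective $J(\theta) = \mathbb{E}_{\tau \mid e}[R(\tau)]$, show how introducing a distribution $p(e)$ changes what the optimal policy can depend on, and explain why the result is necessarily more conservative than the per-environment optimum.

The policy is $\pi_\theta(x)$: it sees only state, not $e$. So a single set of weights must produce good actions for every $e$ simultaneously. It cannot branch on the world.

Define the per-environment optimum $J^*(e) = \max_\theta \mathbb{E}_{\tau \mid e}[R]$. A single shared $\theta$ can never beat the best-per-$e$ on every $e$ at once, so $\mathbb{E}_e[J(\theta, e)] \le \mathbb{E}_e[J^*(e)]$. The gap is the price of robustness.

The DR objective is $\max_\theta \mathbb{E}_{e \sim p(e)}[J(\theta, e)]$ where $J(\theta, e) = \mathbb{E}_{\tau \sim \pi_\theta,\, f_{\text{sim}}(\cdot,\cdot,e)}[R(\tau)]$.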
Key constraint: the inner policy πθ(x) is shared across all e — the same weights handle every world. Contrast adaptive control (next section), where the policy is πθ(x, e) and can specialize.
The bound: for any fixed θ, J(θ, e) ≤ J*(e) := maxθ' J(θ', e) for every e. Taking expectations, Ee[J(θ,e)] ≤ Ee[J*(e)]. The optimal shared policy maximizes the left side but can rarely achieve equality — equality needs one θ that is simultaneously optimal for every e, which only happens if the worlds want the same strategy.
Conclusion: the DR policy pays an optimality gap Ee[J*(e)] − Ee[J(θ*,e)] ≥ 0 in exchange for needing no knowledge of e at test time. That's the robust-control trade exactly: give up peak performance for guaranteed adequacy across the set. Adaptive control (Section 5) tries to recover that gap by identifying e online.
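The optimality gap can be made concrete on a toy objective. This sketch assumes a hypothetical return $J(\theta, e) = -(\theta - e)^2$, chosen only because the per-environment optimum is obvious ($\theta = e$, so $J^*(e) = 0$); it is not a model of any real task. Under that assumption the best shared $\theta$ is the mean of $e$ and the gap equals the variance of $e$.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.uniform(0.6, 1.2, size=100_000)   # sampled worlds, e.g. a friction range

def J(theta, e):
    """Hypothetical return: best when the policy parameter matches the world."""
    return -(theta - e) ** 2

# Shared policy: one theta for all worlds, chosen to maximize the average return.
thetas = np.linspace(0.6, 1.2, 601)
shared_scores = np.array([J(t, e).mean() for t in thetas])
theta_star = thetas[shared_scores.argmax()]

# Per-world optimum is theta = e, so E_e[J*(e)] = 0 for this toy objective.
gap = 0.0 - shared_scores.max()
print(f"best shared theta = {theta_star:.3f}, optimality gap = {gap:.4f}")
print(f"variance of e     = {e.var():.4f}")
```

In this toy setting, widening the range of `e` increases its variance and therefore the gap, which is the over-conservatism trade-off in miniature.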
Domain randomization throws away information: it builds a policy that ignores e. But what if the policy could figure out which world it's in and specialize? That's adaptive control, and it can recover the optimality gap robustness leaves on the table.
An adaptive policy π(x, e) takes the world parameters as input, so it can be aggressive when friction is high and cautious when it's low. It does not conflict with domain randomization — you do both, giving "robust adaptive control." But it has one fatal problem:
e is unknown in the real world. In sim, you hand the policy the true friction. On the real robot, nobody knows the exact friction, mass distribution, or motor wear. So π(x, e) can't be deployed directly — you don't have e to feed it. The entire teacher–student machinery exists to solve this one problem.
The trick: train a policy that needs e in simulation (where it's free), then distill it into a second policy that infers what it needs from observable history. Three phases:
1. Phase 1 (teacher): train $\pi_{\text{teacher}}(x, e)$ with RL (PPO), giving it privileged information: the true $e$ (contact states, terrain height, friction, applied disturbances) that only the simulator knows. Because it has oracle access, it learns a near-optimal, world-aware policy quickly. Data in: state $x$ + privileged $e$. Data out: action.
2. Phase 2 (student): train $\pi_{\text{student}}(x, o_{1:t})$ using only observable info available on the real robot: a history of proprioception (IMU readings, joint angles/velocities over the last N steps). The student learns to match the teacher's actions. This is the key: the recent history of how the body moved implicitly encodes $e$. A robot carrying extra mass accelerates differently, and the history reveals it. Data in: state + proprioceptive history. Target: teacher's action.
3. Phase 3 (deploy): ship only $\pi_{\text{student}}$. It reads its own proprioceptive history, implicitly estimates the world, and acts like the teacher would, without ever being told $e$.

This is the conceptual crux. You can't measure friction directly, but its effects are observable: push off with a fixed torque and the resulting acceleration depends on mass and friction. A short window of (commanded action, resulting motion) pairs is a fingerprint of $e$. The student is essentially a learned system identifier fused with a controller: it identifies the world from the consequences of its own actions and adapts in real time.
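The "history is a fingerprint of e" claim can be illustrated without any neural network. The sketch below is a deliberately simplified stand-in for the student: it assumes a 1-D body obeying a = F/m with measurement noise and recovers the hidden mass by least squares over a 50-step window. A real student learns this mapping implicitly from data; the explicit estimator, the noise level, and the mass range are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mass = rng.uniform(4.8, 5.4)            # hidden e, never shown to the estimator
forces = rng.uniform(-10.0, 10.0, size=50)   # commanded actions over a short window
# Observed motion: acceleration from a = F / m plus sensor noise.
accels = forces / true_mass + rng.normal(0.0, 0.05, size=50)

# Least-squares fit of a = (1/m) * F over the (action, motion) history.
inv_mass_hat = (forces @ accels) / (forces @ forces)
mass_hat = 1.0 / inv_mass_hat

print(f"true mass = {true_mass:.3f} kg, inferred from history = {mass_hat:.3f} kg")
```

Note that the estimate depends on the commanded forces being varied: a window of near-zero actions would carry almost no information about the mass.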
Phase 2 is just imitation learning. The student regresses onto the teacher's outputs — supervised learning with the teacher as an oracle labeler. (If you've seen DAgger, that's the standard tool here: roll out the student, query the teacher for the correct action at each visited state, train on those labels to avoid distribution shift.)
The student doesn't have to imitate in action space. RMA (Rapid Motor Adaptation) distills in latent space:
- Teacher: encode the privileged $e$ into a latent vector $z = \mu(e)$ (an "environment embedding"), then act via $\pi(x, z)$.
- Student: predict $\hat{z}$ from the proprioceptive history, then reuse the same base policy $\pi(x, \hat{z})$.

Two-stage training is a hassle. Asymmetric actor-critic does it in one stage by exploiting a structural asymmetry: the critic is only used during training, never at deployment. So give the critic privileged info and keep the actor deployable.
| Component | Sees | Why |
|---|---|---|
| Actor π(x, o1:t) | Only real-deployable info: proprioception + short history | It ships to the robot, so it can only use what the robot has. |
| Critic V(x, e) | Privileged info: true e, root velocity, contact forces | It's discarded after training, so privileged access is "free" and gives lower-variance value estimates → better gradients. |
A humanoid loco-manipulation policy. Actor: current proprioception + a 4-step history. Critic: additionally gets root velocity and end-effector force — quantities the robot can't reliably measure but the sim knows. One-stage training, deployable actor, privileged critic. This is asymmetric actor-critic in production.
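The observation split in this example can be sketched as two small functions. This only shows how the inputs might be assembled, not the networks or the training loop; the proprioception size of 12 and the 3-D root velocity and end-effector force are hypothetical dimensions, and the function names are illustrative.

```python
import numpy as np

PROPRIO_DIM = 12   # hypothetical size of one proprioceptive reading
HISTORY = 4        # 4-step history, matching the example in the text

def actor_obs(proprio_now, proprio_hist):
    """Deployable observation: current proprioception plus a short history."""
    return np.concatenate([proprio_now, proprio_hist.ravel()])

def critic_obs(proprio_now, proprio_hist, root_vel, ee_force):
    """Training-only observation: the actor's input plus privileged sim state."""
    return np.concatenate([actor_obs(proprio_now, proprio_hist), root_vel, ee_force])

rng = np.random.default_rng(0)
now = rng.normal(size=PROPRIO_DIM)
hist = rng.normal(size=(HISTORY, PROPRIO_DIM))
a = actor_obs(now, hist)
c = critic_obs(now, hist, root_vel=rng.normal(size=3), ee_force=rng.normal(size=3))
print("actor input:", a.shape, "critic input:", c.shape)
```

Because the critic's input is a superset of the actor's, dropping the critic after training leaves an actor that needs nothing the robot cannot measure.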
Demo: the teacher (gold, sees true mass) adapts instantly. The student (blue) watches a few steps of proprioceptive history, infers the mass, and converges to the teacher's behavior, for each newly drawn hidden mass.
- Domain randomization: ignores $e$, conservative, dead simple, no deployment-time inference. Great when the gap is small or the task tolerates conservatism.
- Learning to adapt: infers $e$ online, recovers performance, but needs a 2-stage pipeline (or asymmetric critic) and enough observable history to identify the world.

The teacher is $\pi(x, e)$, but $e$ is never known on the real robot. So how does the deployed student ever act adaptively? Be specific about what it reads and why that suffices.

The student never reads $e$ directly. Instead it reads a short history of its own proprioception: recent IMU readings, joint angles and velocities. Because the consequences of the student's own commanded actions depend on the hidden world (a heavier robot accelerates less for the same torque; a low-friction floor lets the foot slip), this history is a fingerprint of $e$. The student is effectively a learned system-identifier fused with a controller: it infers the latent world from the motion it just produced, then acts as the privileged teacher would for that world. Concretely (RMA): it predicts a latent $\hat{z}$ from history and feeds it to the shared base policy $\pi(x, \hat{z})$. That's why phase 2 only needs observable inputs: the information about $e$ is implicit in the body's behavior, not handed over explicitly.
The previous two methods take the simulator as given and make the policy cope. This third family does the opposite: use real-world data to make the simulator more like reality, then train in the improved sim. The loop is real → sim → real: collect real data, update the sim, retrain, redeploy.
Real2sim is the process of using real-world measurements to correct a simulator, either by identifying its parameters (fixing parametric gaps) or by adding learned residual terms (fixing non-parametric gaps). Then you do sim2real again on the corrected sim. Hence "real2sim2real."
If the gap is parametric (wrong masses, frictions), the fix is to estimate the true values from real data. That's classic system identification: run the robot, record states and actions, find the e that makes fsim(xt, ut, e) best predict the observed xt+1.
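A minimal sketch of this kind of system identification, assuming a 1-D toy model with two unknown parameters (mass and a viscous drag coefficient). The "real" data here is generated from hypothetical ground-truth values plus noise, since no robot log is available; because this toy model is linear in 1/m and drag/m, ordinary least squares is enough, whereas a full simulator would typically need a nonlinear or sampling-based search.

```python
import numpy as np

DT = 0.02

def f_sim(v, u, mass, drag):
    """Toy simulator step for velocity v under commanded force u."""
    return v + DT * (u - drag * v) / mass

# Stand-in for a real-robot log: hypothetical ground truth plus small noise.
rng = np.random.default_rng(1)
true_e = dict(mass=5.2, drag=0.9)
us = rng.uniform(-5.0, 5.0, size=500)
vs = np.zeros(501)
for t in range(500):
    vs[t + 1] = f_sim(vs[t], us[t], **true_e) + rng.normal(0.0, 1e-4)

# (v_next - v) / DT = u * (1/m) - v * (drag/m): linear in the two unknowns.
A = np.column_stack([us, -vs[:-1]])
b = (vs[1:] - vs[:-1]) / DT
(inv_m, drag_over_m), *_ = np.linalg.lstsq(A, b, rcond=None)
mass_hat, drag_hat = 1.0 / inv_m, drag_over_m / inv_m

print(f"identified mass = {mass_hat:.3f}, drag = {drag_hat:.3f}")
```

The drag estimate is only as good as the velocities the log happens to cover, which is the motivation for choosing informative actions rather than logging whatever the robot was already doing.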
Naive system ID collects whatever data you happen to have, which may be uninformative (you can't estimate friction from a robot standing still). SPI-Active (CoRL'25) adds active exploration: it deliberately picks the actions that will be most informative about the unknown parameters, by maximizing the Fisher Information — a measure of how much a measurement tells you about a parameter. Sampling-based system ID + "go do the experiment that resolves your uncertainty fastest." The robot becomes its own scientist.
If the gap is non-parametric (an effect the sim doesn't model at all), no amount of tuning e helps, because the term is missing. The fix: learn a residual that the sim is wrong by, from real data, and add it back:

$$x_{t+1} = f_{\text{sim}}(x_t, u_t, e) + g_\phi(x_t, u_t)$$
Here gφ is a small neural network trained so that sim-plus-residual matches real trajectories. You keep the cheap, mostly-correct physics engine and let a learned term mop up aerodynamics, cable forces, unmodeled friction — whatever fsim forgot. The pipeline: use real data to augment the simulator, train RL in the augmented sim, deploy.
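The sim-plus-residual idea can be sketched with a linear model standing in for the neural network. Everything here is an illustrative assumption: the toy sim has no drag, the "real" system has a hypothetical quadratic drag term, and the residual is fit by least squares on hand-picked features of state and action rather than trained as a network.

```python
import numpy as np

DT = 0.02

def f_sim(v, u):
    """Idealized sim: unit mass, no drag at all."""
    return v + DT * u

def f_real(v, u):
    """Stand-in for reality with a hypothetical unmodeled quadratic drag."""
    return v + DT * (u - 0.3 * v * np.abs(v))

def features(v, u):
    """Inputs to the residual model g_phi (a linear stand-in for a small net)."""
    return np.column_stack([v, u, v * np.abs(v), np.ones_like(v)])

# "Real" transitions and the one-step error the sim makes on them.
rng = np.random.default_rng(2)
v = rng.uniform(-3.0, 3.0, size=2000)
u = rng.uniform(-5.0, 5.0, size=2000)
residual_target = f_real(v, u) - f_sim(v, u)

phi, *_ = np.linalg.lstsq(features(v, u), residual_target, rcond=None)

def f_aug(v, u):
    """Augmented simulator: physics engine plus learned residual."""
    return f_sim(v, u) + features(v, u) @ phi

v_test, u_test = np.array([2.0, -1.5]), np.array([1.0, 0.5])
err_before = np.abs(f_real(v_test, u_test) - f_sim(v_test, u_test)).max()
err_after = np.abs(f_real(v_test, u_test) - f_aug(v_test, u_test)).max()
print(f"one-step error: sim only = {err_before:.4f}, sim + residual = {err_after:.2e}")
```

The fit is near-exact here only because the feature set happens to contain the missing term; with a real robot the residual model has to be expressive enough to discover it.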
The single most impactful instance of this idea in legged robotics. The problem: real motors/transmissions have complex dynamics — they don't produce exactly the torque you command, due to gearing, friction, and saturation. Sim's idealized motor model is a major reality gap for locomotion.
Train an actuator network $\hat{\tau} = h_\phi(\text{history of position errors and velocities})$ to predict the real torque the motor actually delivers.

Training the actuator net above needs torque labels (a torque sensor), which many robots don't have. The clever fix: make it label-free using RL. Train a residual torque model whose objective is to make the simulated trajectory match the observed real trajectory. You never need a torque sensor, only the (freely observed) joint positions over time. The residual is discovered by trajectory matching, not supervised regression. This is "real2sim by learning dynamics residuals" in its slickest form.
Demo: commanded torque (gold) vs. what an idealized sim motor delivers vs. the real motor (red, with lag and saturation). With the actuator net added, the learned residual pulls the sim curve onto the real one.
Diagnosis: mostly parametric — grass has lower, variable friction than lab floor. But also non-parametric: grass deforms and the contact isn't a clean rigid point, which sim's hard-contact model misses.
Layered fix (this is how production systems do it):
(1) Domain-randomize friction widely — the cheapest, do-it-anyway move. Train across friction [0.3, 1.2] so low-grip grass is inside the distribution. Often this alone fixes the slip.
(2) Add teacher–student adaptation so the policy infers the current grip from its recent slip history and adjusts gait online — grass grip varies patch to patch, so online adaptation beats a fixed robust policy.
(3) If still failing, spend the 10 min of data on real2sim: identify grass friction (system ID) and learn a contact residual for the deformable surface, then retrain. Use real data last, because it's the most expensive step.
The meta-lesson: the methods compose. Robustness is the floor, adaptation recovers performance, and real2sim is the targeted fix when the first two aren't enough.
Recall the quiet killer from Section 3: reward design. For locomotion, "go forward, stay up" is a fine reward. For loco-manipulation and dexterous hands — pour, assemble, open a door — hand-crafting a reward is a nightmare. So why design rewards at all? Humans already know how to do these tasks. Let's learn from human demonstrations.
You can't just copy a human's motion onto a robot. A human and a humanoid have different limb lengths, mass distributions, joint limits, and degrees of freedom. A human's hand trajectory is a statement of intent ("move the cup here"), not a sequence of robot-feasible actions. Naively replaying human motion makes the robot fall over. There is a physics gap between human intents and robot actions.
This is the elegant move. Human data gives you the what (the intended motion). Simulation provides physics grounding: it forces the imitated motion to obey the robot's real dynamics, balance, and contacts. So the pipeline is: retarget human motion to the robot's body (kinematics), then use RL in simulation to learn a controller that tracks that motion while staying physically feasible. Sim turns "human intent" into "robot-executable, balanced action."
Whole-body tracking handles locomotion-like motion. But loco-manipulation and scene interaction — carrying a box, leaning on a wall, doing a "wall flip" — involve the robot and the objects it touches. Retargeting the robot alone breaks the interaction (the hand misses the box).
The fix: jointly retarget the robot and the object, preserving the interaction between them via an interaction mesh — a representation of the spatial relationship (contacts, relative positions) between robot and object. Retargeting then preserves "hand is on the box" even as it scales the human motion to the robot. The result enables parkour and a wall flip with up to 890°/second angular rate — relying only on proprioception. The data-generation engine produces interaction-correct demos that sim-based RL then makes feasible.
Kinematic retargeting can still produce dynamically-questionable references. SPIDER (Scalable Physics-Informed DExterous Retargeting) does dynamics-level retargeting: it frames retargeting as an optimal control problem and solves it with sampling-based methods, so the output is already dynamically feasible. The proof: SPIDER's retargeted trajectories can be replayed open-loop on the real robot — pick up a cup, rotate a lightbulb, unplug a charger — with no closed-loop policy at all. That's only possible because the motion was made physically valid from the start.
Every method so far trains a policy with RL. Which RL algorithm? For sim2real, the empirical answer is blunt and important:
On-policy policy-gradient methods like PPO are extremely effective for sim2real. Despite being "sample-inefficient" in theory, PPO shines here because samples are nearly free — massively parallel sim generates billions of transitions cheaply, so PPO's appetite for fresh on-policy data is a non-issue. Its stability (the clipped trust region) matters far more than its sample count. This is why almost every locomotion and humanoid result you've seen is PPO under the hood.
If you haven't internalized PPO, that's the prerequisite for this whole section — see the companion Policy Gradients lesson. Everything below is a refinement of "run the policy, weight log-probs by advantage, ascend, don't step too far."
| Method | Attacks | How |
|---|---|---|
| SAPG (Split & Aggregate Policy Gradient) | PPO under-uses massively parallel envs | With thousands of envs, naive PPO's single big batch wastes the diversity. SAPG splits envs into chunks, computes gradients per chunk, and aggregates — better leveraging the parallelism for higher throughput and performance. |
| FastTD3 / FastSAC | On-policy wastes old data | Bring off-policy methods (which reuse a replay buffer) into the sim2real regime, tuned to exploit parallel envs. More sample reuse where it helps. |
| FPO / FPO++ (Flow Policy Optimization) | Standard PPO assumes a simple Gaussian policy | Policy-gradient training for flow-matching policies — expressive, multimodal action distributions (think diffusion-style policies) optimized with PG. Lets the policy represent richer behaviors than a unimodal Gaussian. |
| BFM-Zero (ICLR'26) | You must retrain for every new reward | An unsupervised RL approach using a forward–backward representation: it learns a behavioral foundation model that can optimize any user-specified reward at test time, no retraining. A "promptable" controller for humanoids. |
A forward–backward representation is a way to pre-learn the long-run consequences of behaviors such that, given any reward function at test time, you can immediately read off a near-optimal policy without running RL again. BFM-Zero uses it to make a single trained humanoid controller "promptable" with arbitrary rewards, the way an LLM is promptable with arbitrary instructions.
For most sim2real projects, PPO + good domain randomization + a solid reward beats a fancy off-policy method with sloppy randomization. The algorithm is rarely the bottleneck; the training distribution and reward are. These frontier methods earn their keep at the edges — squeezing more from parallel envs, representing multimodal skills, or amortizing across many rewards. Reach for them when PPO is genuinely your limiting factor, not by default.
Step back. The control community has been doing sim2real for decades — they just didn't call it that. Seeing the lineage tells you where the field is going.
Long before deep RL, engineers balanced inverted pendulums and made robots hop using reduced-order models as their "simulator": an inverted-pendulum model, a single-rigid-body model. The controller was online model predictive control (MPC/NMPC) — at every tick, solve an optimization over the simple model to pick the next action.
Sim2Real 1.0 has no pretraining at all. There's no learned policy, no offline training run. It relies 100% on very fast (>100 Hz) online reasoning — re-solving the control problem from scratch every few milliseconds. It's the opposite philosophy from deep RL, which front-loads enormous offline training so that deployment is a cheap forward pass. Both work. That tension — when does "learning" happen, offline or online? — organizes the whole taxonomy.
Guanya Shi's framing places every method on a 2D map:
| Era | Model ("sim") | Learning | Signature |
|---|---|---|---|
| 1.0 | Reduced-order model | Online NMPC, no pretraining | Fast online reasoning, hand-derived models |
| 2.0 | Full simulator | Offline RL training | The "train in sim, deploy" paradigm of this lecture |
| 3.0 | Sim + real2sim correction | Offline RL (RL++) | Close the loop with real data; better algorithms |
| 4.0 | Generative sim / world model | Online + offline combined | Better model × better RL × better online reasoning, fused |
The frontier isn't "pick offline RL or online MPC" — it's both at once, on a model that is itself partly learned (a generative simulator or world model). Imagine a humanoid with a foundation-model controller (offline-trained, promptable like BFM-Zero) that also does fast online correction (MPC-style) against a world model continuously updated from real data (real2sim). Better model × better RL algorithm × better online reasoning, all reinforcing each other.
Figure: the sim2real map, with one node per era. The arrow of progress runs toward the top-right: richer learned models fused with both offline training and online reasoning.
For completeness, things a deeper course would add: sim & real co-training (train on a mix of sim and real data simultaneously), simulation for policy evaluation (use sim to predict real performance before deploying), and differentiable simulation (make fsim differentiable so you can backprop through physics for gradient-based policy optimization).
Here is the entire sim2real recipe as one machine. Run a virtual humanoid through the pipeline: pick how much you randomize, whether you add adaptation, and whether you've done real2sim correction — then deploy to a "real" robot with a hidden, unknown e and watch what survives. This is every section above, composed.
Each toggle shrinks the reality gap a different way: randomization widens the policy's competence so reality falls inside it; adaptation lets the policy re-center on the actual world from observed history; real2sim moves the training distribution itself toward reality. Stacking all three is how real humanoid systems achieve zero-shot transfer. No single trick is enough on a hard task — the gap is closed in layers.
Each environment draws its own e from p(e) at reset.

A uniform p(e) makes no assumption about which value reality takes; it just guarantees reality is inside the support. Some systems use log-uniform for scale parameters (friction, gains) so they're randomized evenly in ratio, not absolute terms. Why randomize latency? Real control loops have variable delay; a policy that never saw delay learns to react instantly and then oscillates on hardware.

Everything, on one page. If you can reconstruct the right column from the left, you've mastered the lecture.
| Concept | The one thing to remember |
|---|---|
| The sim step | xt+1 = fsim(xt, ut, e). Every method attacks the e. Reality gap = fsim ≠ freal. |
| Why sim | Cheap, fast, safe, free labels, no wear. 20,000× data rate via massive parallelism. |
| Two gaps | Parametric (wrong numbers → tune via system ID). Non-parametric (missing physics → learn a residual). |
| Why small gaps kill | Errors compound over time; an aggressive RL policy is a brittle exploiter of its exact training world. |
| Domain randomization | Randomize e ∼ p(e), train one robust π(x). = robust control. Pays an optimality gap; tune the width. |
| Learning to adapt | π(x, e) = adaptive control, but e unknown in real → privileged teacher–student: teacher sees e, student infers it from proprioceptive history. |
| RMA | Distill in latent space: student predicts env embedding ẑ, reuses base policy π(x, ẑ). |
| Asymmetric actor-critic | One stage. Actor sees deployable obs; critic sees privileged e (it's discarded after training). e.g. FALCON. |
| Real2Sim2Real | Use real data to fix the sim: system ID (incl. active exploration, SPI-Active) for numbers; residual nets (actuator net) for missing physics. |
| Human data | Skip reward design: retarget human motion → RL-track in sim for physics grounding. OmniRetarget (interaction mesh), SPIDER (dynamics-level, open-loop replayable). |
| RL algorithm | PPO dominates — samples are free, stability matters. Frontiers: SAPG, FastTD3, FPO (flow policies), BFM-Zero (promptable, FB representation). |
| 1.0 → 4.0 | 1.0 online NMPC on reduced models → 2.0 RL in full sim → 3.0 + real2sim → 4.0 generative sim + online&offline fused. |
Sim2real is the art of making a policy survive the gap between fsim and freal: by making the policy robust to the gap, adaptive to it, or by shrinking the gap itself with real data. The best systems do all three, in layers.