CS224R — Sim2Real Robot Learning: Crossing the Reality Gap

Roadmap

What You'll Master

This lecture (Guanya Shi, CMU) biases toward breadth — the whole landscape of sim2real robot learning. We will not skim. Every method below gets rebuilt from its underlying equation, with the data flow traced and the engineering decision justified. By the end you can look at any sim2real paper and place it on the map.

01 The Promise & The Catch
02 What Is a Simulator?
03 The Reality Gap, Dissected
04 Domain Randomization
05 Learning to Adapt
06 Real2Sim2Real
07 Sim2Real from Human Data
08 RL Algorithms for Sim2Real
09 Sim2Real 1.0 → 4.0
10 The Full Pipeline
11 Cheat Sheet & Connections

01 · Motivation

The Promise & The Catch

You want a humanoid robot to do a backflip. The obvious plan: let it try, learn from its mistakes, repeat a million times. But a real humanoid costs as much as a car, and a million falls will turn it into scrap metal long before it learns anything. Reinforcement learning needs tens of millions of trial-and-error steps. No physical robot survives that.

So we cheat. We build a simulator — a video-game version of the robot and the world — and we let a virtual robot fall a billion times for free. When the virtual robot finally masters the backflip, we copy its brain (the trained policy) into the real robot and hope it works.

That last word, "hope," is the entire subject of this lecture. The transfer from a simulated policy to a real robot is called sim2real, and the difference between the simulator and reality is the reality gap. Everything we study is a tool for making that gap small enough that hope becomes a well-founded expectation.

Definition

Sim2Real Transfer

Train a control policy π entirely in simulation, then deploy it on physical hardware zero-shot (no real-world retraining) or with minimal real-world fine-tuning. The challenge is that the simulator and the real robot never have identical dynamics.

Why simulators are irresistible

Four advantages make simulation the default substrate for modern robot learning. Each one maps to a hard physical constraint it dissolves:

Advantage	What it dissolves	Concrete number
Cheap, fast, scalable	Wall-clock time & money	A quadrotor control policy trained in 18 seconds on a 2020 MacBook Pro (M1) using RLtools. Collecting the same experience in the real world would take months.
Safe	Hardware destruction	The virtual robot can fall off a cliff, slam into walls, blow its motors — cost: zero.
Labeled (oracle access)	Perception & measurement cost	In sim you know the exact contact forces, terrain height, friction, and body velocity — ground truth that a real sensor suite can only approximate.
No wear & tear	Mechanical degradation	Real gears strip, bearings wear, joints develop slop. A simulated joint is forever new.

The "free labels" superpower

This one is underrated. To train a robot to walk over rubble, you need to know the terrain height under each foot. In the real world that requires expensive lidar and still comes out noisy. In a simulator like PyBullet or MuJoCo, the terrain height is a variable you already set. You get a perfect, free, dense label for every quantity in the world. This is why "privileged information" (Section 5) is even possible — it only exists because sim hands it to you.

The catch, in one equation

Here is the whole problem stated formally. A simulator is a function that advances the world one tick:

The simulation step
x_{t+1} = f_sim(x_t, u_t, e)   // next state = physics(current state, action, environment params)

Read every symbol slowly, because the entire field lives inside this line:

x_t — the state at time t: joint angles, joint velocities, body pose, body velocity. Think "everything you'd need to know to predict the future."
u_t — the action (also written a_t): the motor commands the policy sends, usually target joint positions or torques.
f_sim — the physics engine: the rules (Newton's laws, contact models, integrators) that turn state + action into the next state.
e — the environment parameters: mass, friction, motor strength, link inertias, sensor delay, terrain. The hidden knobs of the world.

The real world runs the same kind of equation, x_{t+1} = f_real(x_t, u_t, e_real), but with a different f and a different e that you don't fully know. The reality gap is the mismatch between the two: f_sim ≠ f_real (wrong or missing physics) and e ≠ e_real (wrong numbers). Hold onto this equation: every method in this lecture is a different way of attacking the e, or the f, in this line.

Misconception

"If we just build a perfect enough simulator, sim2real becomes trivial." False, and it's the trap beginners fall into. A bit-perfect simulator is computationally impossible (you'd need to model every air molecule and every micro-scratch on every gear). More importantly, you don't need perfection — you need a policy that is robust to the imperfection. The field's biggest wins came not from better physics but from smarter training (domain randomization, adaptation). Chasing fidelity is often the wrong axis.

🔗 Where this sits

Sim2real is robust + adaptive control wearing a learning hat

If you've seen classical control, the two big families below will feel familiar. Domain randomization is robust control: design one controller that works for a whole range of plants. Learning to adapt is adaptive control: estimate the plant online and adjust. RL didn't invent these ideas — it scaled them to high-dimensional, contact-rich systems that classical methods couldn't touch.

02 · The Substrate

What Is a Simulator?

Before we can talk about the gap, we need to know what a robot simulator actually does mechanically. Strip away the marketing and a physics-based simulator is a loop that repeatedly solves Newton's equations for a system of rigid bodies connected by joints, while resolving the nasty bits — contacts and friction — that make robotics hard.

Definition

Physics-based simulator

A program where the state evolves according to explicit physical laws (rigid-body dynamics, contact mechanics) rather than learned prediction. Contrast with a world model, a neural network that learns to predict x_t+1 from data — that's a non-physics-based simulator, and a frontier topic of its own.

The inner loop, traced

One simulation tick (say 1/200th of a second) does roughly this:

One physics step f_sim(x_t, u_t, e)

Apply actuation. Convert action u_t (e.g. target joint angles) into joint torques via the motor model — this is where parameter e like motor gain and torque limits enters.
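Compute forces. Gravity, Coriolis, and the actuator torques produce accelerations through the equations of motion M(q)q̈ + C(q, q̇)q̇ + g(q) = τ + J_c(q)ᵀ f_c, where M is the mass matrix (it depends on link masses/inertias, so more e), C collects Coriolis and centrifugal terms, g is gravity, τ is the actuator torque, and J_cᵀ f_c maps the contact forces solved in the next step into joint space.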
Detect & resolve contacts. Find which bodies touch the ground or each other, then solve for contact forces that prevent interpenetration and obey friction limits (friction coefficient = yet more e). This is the hardest, slowest, least-accurate part.
Integrate. Step the velocities and positions forward by Δt to get x_t+1.

Notice how many places e sneaks in: motor gains, masses, inertias, friction, restitution, joint damping. Each is a parameter you set in sim and can only estimate in reality. That's the surface area of the reality gap, enumerated.
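The four stages above can be sketched on a deliberately tiny system: a point mass moving vertically above the ground, driven by a PD motor model. Everything here (the `EnvParams` fields, the gains, the projection-style contact handling) is an illustrative assumption, not how MuJoCo or IsaacSim implement a step. The point is only to show where each entry of e enters the loop.

```python
from dataclasses import dataclass


@dataclass
class EnvParams:
    """Environment parameters e. All values are illustrative assumptions."""
    mass: float = 1.0          # kg
    kp: float = 40.0           # motor position gain, N/m
    kd: float = 2.0            # motor damping gain, N*s/m
    force_limit: float = 30.0  # actuator saturation, N
    restitution: float = 0.0   # fraction of speed kept after hitting the ground
    gravity: float = 9.81      # m/s^2


def physics_step(state, target_height, e, dt=1.0 / 200.0):
    """One tick of f_sim for a point mass moving vertically above the ground."""
    z, v = state
    # Apply actuation: a PD motor model turns the position target into a force.
    force = e.kp * (target_height - z) - e.kd * v
    force = max(-e.force_limit, min(e.force_limit, force))
    # Compute forces: Newton's second law gives the acceleration.
    accel = force / e.mass - e.gravity
    # Integrate with semi-implicit Euler (velocity first, then position).
    v = v + accel * dt
    z = z + v * dt
    # Resolve contact: done here as a projection after integration,
    # which is a simplification of what real engines do.
    if z < 0.0:
        z = 0.0
        v = -e.restitution * v if v < 0.0 else v
    return (z, v)


def settle_height(e, target_height=0.5, steps=4000):
    """Roll the loop forward and return the final height."""
    state = (0.0, 0.0)
    for _ in range(steps):
        state = physics_step(state, target_height, e)
    return state[0]


light = settle_height(EnvParams(mass=1.0))
heavy = settle_height(EnvParams(mass=2.0))
print(f"light robot settles at {light:.3f} m, heavy robot at {heavy:.3f} m")
```

The same action (a 0.5 m height target) produces a different resting height when only the mass changes. That is the reality gap in miniature: identical policy output, different e, different outcome.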

The simulator zoo (and a crucial distinction)

There are many simulators because rendering graphics and computing physics are mature fields. The taxonomy that trips everyone up:

Simulator ≠ Framework

IsaacSim is a simulator (the physics engine). IsaacLab is a robot-learning framework built on top of it (it gives you environments, RL plumbing, randomization tooling). Mixing these up is the #1 vocabulary error. The simulator computes physics; the framework wraps it into something you can train policies in.

Name	What it is	Why people use it
MuJoCo (CPU)	Classic, accurate contact-rich physics engine	The research gold standard for accuracy on contacts and articulated bodies.
MuJoCo XLA (MJX)	MuJoCo reimplemented in JAX	Runs on GPU/TPU, vectorizes across thousands of envs.
MuJoCo Warp	MuJoCo on NVIDIA Warp	Massively parallel GPU physics.
mjlab	IsaacLab-style manager API = MuJoCo Warp + IsaacLab − IsaacSim	IsaacLab ergonomics without the IsaacSim dependency.
IsaacSim / IsaacLab	NVIDIA simulator + learning framework	GPU-native, thousands of parallel robots, great for locomotion at scale.

The killer feature: massive parallelism

The reason GPU simulators conquered robot learning is one number: thousands of robots in parallel. Instead of one robot collecting one stream of experience, a modern GPU simulator runs 4,096 (or more) copies of the robot simultaneously, each in a slightly different world. The RL algorithm receives experience thousands of times faster than a single real robot running in real time could supply it.

Worked intuition: the speedup

Suppose real-time control runs at 50 Hz (50 steps/second). One real robot gives you 50 transitions per second. A GPU simulator running 4,096 envs at 5× real-time gives you 4096 × 50 × 5 = 1,024,000 transitions per second, roughly a 20,000× data rate (20,480× exactly). A task needing 2 billion transitions: ~460 days (about 1.3 years) of nonstop operation on one real robot, ~33 minutes in parallel sim. That ratio is why sim exists.
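Carrying out that arithmetic in code keeps the units explicit. The constants are the ones from the worked intuition; only the variable names are introduced here.

```python
CONTROL_HZ = 50                      # real-time control rate, steps per second
NUM_ENVS = 4096                      # parallel simulated robots
REALTIME_FACTOR = 5                  # each env runs 5x faster than real time
TRANSITIONS_NEEDED = 2_000_000_000   # size of the training budget

real_rate = CONTROL_HZ                               # transitions/s, one real robot
sim_rate = NUM_ENVS * CONTROL_HZ * REALTIME_FACTOR   # 1,024,000 transitions/s
speedup = sim_rate / real_rate                       # 20,480x

SECONDS_PER_DAY = 86_400
real_days = TRANSITIONS_NEEDED / real_rate / SECONDS_PER_DAY   # about 463 days
sim_minutes = TRANSITIONS_NEEDED / sim_rate / 60               # about 32.6 minutes

print(f"speedup {speedup:,.0f}x: {real_days:.0f} days on hardware "
      f"vs {sim_minutes:.1f} minutes in sim")
```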

The general recipe the whole field follows fits in one picture: physics-based simulator → massively parallel training envs → policy optimization with RL (usually PPO) → real-world deployment, with an optional real2sim arrow feeding real data back to fix the simulator. The rest of this lecture is the contents of the arrows.

[Interactive figure: a grid of tiles, one per parallel environment, each with its own randomized parameters e. A single shared policy advances all of them at once, which is what "massively parallel" looks like.]

03 · The Enemy

The Reality Gap, Dissected

"Sim2real is never easy." The reason splits cleanly into two failure modes, and naming them precisely tells you which tool to reach for.

Parametric mismatch

The simulator has the right form of physics but the wrong numbers. Real robot mass is 5.2 kg, sim says 5.0 kg. Real friction is 0.8, sim says 1.0. The equations are correct; the constants are off.

Non-parametric mismatch

The simulator doesn't model an effect at all. Aerodynamic drag on a fast quadrotor, fluid forces on a swimming robot, tire deformation on an offroad vehicle, cable dynamics, gear backlash. No knob can fix it — the term is simply missing.

Why this taxonomy matters

Parametric mismatch is fixable by tuning — system identification (Section 6) finds the right numbers. Non-parametric mismatch needs augmenting the model — adding a learned residual term (also Section 6) or making the policy robust enough to not care (Section 4). When a deployed policy fails, the first diagnostic question is always: "is this a wrong-number problem or a missing-physics problem?"

Why a tiny gap causes a total failure

Here's the part that surprises people. A 4% error in friction sounds harmless. Why does it make a backflip policy face-plant? Because a policy trained with RL is an exquisitely tuned exploiter. It learns the precise sequence of torques that works in its world. If the real world responds even slightly differently, the errors compound over time: a small slip at step 1 puts the robot in a state slightly off the trajectory it trained on, which it handles slightly wrong, putting it further off, and within a hundred timesteps it's in a state it has never seen, flailing.

Worked example: compounding drift

Model the per-step state error as growing by a factor ρ each tick (the dynamics are locally expanding near an aggressive maneuver). Start with a tiny sim/real mismatch of ε₀ = 0.001 rad. If ρ = 1.05 (5% growth/step) over a 1-second maneuver at 200 Hz = 200 steps:

ε₂₀₀ = ε₀ · ρ²⁰⁰ = 0.001 × 1.05²⁰⁰ ≈ 0.001 × 17,292 ≈ 17.3 rad // a 0.001 rad seed error explodes to 17 rad — the robot is on the floor

The lesson: in unstable, contact-rich maneuvers, errors don't add — they multiply. This is exactly why open-loop replay of a sim trajectory on a real robot almost never works, and why robustness/adaptation is non-negotiable.
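The multiplicative growth in the worked example is easy to check numerically. The sketch below uses the same ε₀, ρ, and step count; the additive comparison and the helper names are added here purely for contrast.

```python
def compounded_error(eps0, rho, steps):
    """Error after `steps` ticks when each tick multiplies it by rho."""
    return eps0 * rho ** steps


def additive_error(eps0, rho, steps):
    """What you would get if each tick only ADDED the first tick's growth."""
    return eps0 + steps * eps0 * (rho - 1.0)


def steps_until(threshold, eps0=0.001, rho=1.05):
    """First tick at which the compounded error exceeds the threshold."""
    steps = 0
    while compounded_error(eps0, rho, steps) <= threshold:
        steps += 1
    return steps


eps_200 = compounded_error(0.001, 1.05, 200)   # about 17.3 rad
eps_add = additive_error(0.001, 1.05, 200)     # about 0.011 rad
print(f"multiplicative: {eps_200:.1f} rad, additive: {eps_add:.3f} rad")
print(f"error passes 1 rad after {steps_until(1.0)} steps")
```

Under these assumptions the error already exceeds 1 rad well before the maneuver ends, while an additive model would predict an error of about a hundredth of a radian.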

[Interactive figure: a policy's planned trajectory (gold) vs. what happens on the real robot (red) as the sim/real friction gap increases. A small gap grows into a fall.]

The reward-design gap (the quiet killer)

There's a second gap nobody warns you about: even with perfect physics, how do you specify the task? For locomotion, "go forward, stay upright" is easy to reward. For dexterous manipulation — "pour the water without spilling," "assemble the part" — writing a reward function by hand is brutally hard. This is why human-data methods (Section 7) exist: they sidestep reward design by imitating demonstrations. Keep this in mind — sim2real isn't only a dynamics problem, it's a task-specification problem too.

Quick check: which is a non-parametric mismatch?

Your sim-trained drone policy works perfectly indoors but drifts badly outdoors on a windy day.

The real drone's mass was measured as 0.9 kg but set to 1.0 kg in sim.
The simulator models no aerodynamic/wind forces at all.
The motor gain constant in sim is 5% too high.

04 · Method 1 — Robustness

Domain Randomization

The first and most widely-used weapon. The idea is almost insultingly simple, which is why it works so well.

The whole idea in one line

Don't train on one world. Train on a thousand random worlds. If your policy works whether friction is 0.6 or 1.2, whether the robot weighs 4.8 kg or 5.4 kg, whether there's 20 ms of sensor delay or 60 ms — then the real world's particular values are just one more sample it already handles.

Formally, we randomize e in our governing equation:

Domain randomization objective
θ* = arg max_θ E_{e ∼ p(e)} [ E_{τ ∼ π_θ, f_sim(·,·,e)} [ R(τ) ] ]   // maximize expected return, averaged over a distribution of worlds e

Compare this to ordinary RL, which maximizes return for a single fixed e. The only change is the outer expectation E_e∼p(e) over a distribution of environments p(e). In practice each of the thousands of parallel envs draws its own e at episode start. We train one policy π_θ(x) — note: it takes only the state x, not e — and ask it to perform well on average across all of them.
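A minimal sketch of the outer expectation, assuming uniform ranges taken from the numbers quoted earlier in this section. `episode_return` is a hypothetical stand-in for a full rollout of π_θ in f_sim(·,·,e): here the "policy" is a single gain, and the best gain for a given world happens to be 1/friction. Real pipelines draw e inside the simulator framework; this only shows the structure of the estimate.

```python
import random

# Illustrative ranges from the prose; real ranges are task-specific.
PARAM_RANGES = {
    "friction": (0.6, 1.2),
    "mass_kg": (4.8, 5.4),
    "sensor_delay_ms": (20.0, 60.0),
}


def sample_env_params(rng):
    """Draw one world e ~ p(e); called once per environment at episode start."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def episode_return(policy_gain, e):
    """Hypothetical return: peaks when the gain matches this world's friction."""
    return -(policy_gain * e["friction"] - 1.0) ** 2


def dr_objective(policy_gain, num_envs=4096, seed=0):
    """Monte Carlo estimate of E_{e ~ p(e)}[ return ] for one shared policy."""
    rng = random.Random(seed)
    returns = [episode_return(policy_gain, sample_env_params(rng))
               for _ in range(num_envs)]
    return sum(returns) / len(returns)


for gain in (0.5, 1.1, 3.0):
    print(f"gain {gain}: estimated DR objective {dr_objective(gain):.3f}")
```

Note that `episode_return` receives e but the policy parameter does not depend on it: the same `policy_gain` is scored in every sampled world, which is exactly the constraint that makes the result a robust policy rather than an adaptive one.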

🎯 This is robust control

A controller that performs well across a whole set of plants without knowing which plant it faces is the textbook definition of robust control. Domain randomization is robust control where the "set of plants" is a sampled distribution and the controller is a neural net. The policy can't identify the world — it must find one strategy that is good enough everywhere.

What "robust enough everywhere" costs you

There is no free lunch. Asking one policy to handle every world forces a conservative solution — it can't exploit any particular world's quirks, so it leaves performance on the table. A policy that knew friction was exactly 1.1 could be more aggressive than one hedging across [0.6, 1.2].

Misconception: "more randomization is always better"

No. Widen p(e) too far and you ask the policy to handle worlds that don't exist (friction of 0.05, like wet ice everywhere). It becomes so timid it barely moves — this is over-conservatism. Too narrow, and the real world falls outside the training distribution and the policy fails. Tuning the width of p(e) is the real engineering. A good heuristic: randomize widely enough that reality is comfortably inside the range, no wider.

What gets randomized, in practice

The original domain-randomization paper randomized perception (textures, lighting, camera position) so a vision policy wouldn't overfit to sim graphics. But the idea generalized to every part of the pipeline:

Dynamics: mass, inertia, friction, motor strength, joint damping, center-of-mass offset.
Perception: textures, lighting, camera pose, sensor noise.
Latency: action/observation delay (real systems have 10–60 ms; randomize it).
Terrain: slopes, stairs, gaps, friction patches — "terrain randomization" for locomotion.
Disturbances: random pushes, applied forces/torques mid-episode.

Real systems that lean on this: Agile But Safe (agile-but-safe.github.io) randomizes e to train one robust quadruped policy; RPL (Robust Perceptive Locomotion) randomizes terrain and dynamics so a humanoid walks over challenging ground it never specifically trained on. In both, the recipe is identical: randomize e, train a single π(x) robust to all of it.

[Interactive figure: a policy's success rate vs. real-world friction for different randomization widths. A narrow-trained policy peaks high but is brittle; a wide-trained policy is flatter but survives the unknown real value (dashed line).]

🧮 Derive It Why does randomizing e turn one RL problem into robust control?

Starting from the single-environment RL objective J(θ) = E_τ|e[R(τ)], show how introducing a distribution p(e) changes what the optimal policy can depend on, and explain why the result is necessarily more conservative than the per-environment optimum.

The DR policy is π_θ(x) — it sees only state, not e. So a single set of weights must produce good actions for every e simultaneously. It cannot branch on the world.

Define the per-environment optimum J*(e) = max_θ E_τ|e[R]. A single shared θ can never beat the best-per-e on every e at once, so E_e[J(θ,e)] ≤ E_e[J*(e)]. The gap is the price of robustness.

The DR objective is max_θ E_e∼p(e)[ J(θ, e) ] where J(θ, e) = E_{τ∼π_θ, f_sim(·,·,e)}[R(τ)].

Key constraint: the inner policy π_θ(x) is shared across all e — the same weights handle every world. Contrast adaptive control (next section), where the policy is π_θ(x, e) and can specialize.

The bound: for any fixed θ, J(θ, e) ≤ J*(e) := max_θ' J(θ', e) for every e. Taking expectations, E_e[J(θ,e)] ≤ E_e[J*(e)]. The optimal shared policy maximizes the left side but can rarely achieve equality — equality needs one θ that is simultaneously optimal for every e, which only happens if the worlds want the same strategy.

Conclusion: the DR policy pays an optimality gap E_e[J*(e)] − E_e[J(θ*,e)] ≥ 0 in exchange for needing no knowledge of e at test time. That's the robust-control trade: give up peak performance in any single world for good performance on average across the set. (The expectation objective says nothing about the worst-case world; classical robust control optimizes the worst case instead.) Adaptive control (Section 5) tries to recover that gap by identifying e online.
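The bound can be checked numerically on a toy problem. Assume (purely for illustration) a scalar policy parameter θ and a return J(θ, e) = −(θ − e)², so the per-world optimum is θ = e with J*(e) = 0. With e drawn uniformly from [0.6, 1.2], the best shared θ should sit near the mean of e, and the optimality gap should be close to the variance of e (0.03 for this range).

```python
import random


def J(theta, e):
    """Toy return: each world e wants theta == e; J*(e) = 0."""
    return -(theta - e) ** 2


rng = random.Random(0)
worlds = [rng.uniform(0.6, 1.2) for _ in range(2000)]   # samples of e ~ p(e)


def expected_J(theta):
    """Monte Carlo estimate of E_e[J(theta, e)] for one shared theta."""
    return sum(J(theta, e) for e in worlds) / len(worlds)


# Shared (domain-randomized) optimum: one theta for every world.
candidates = [0.6 + 0.005 * i for i in range(121)]
theta_star = max(candidates, key=expected_J)

# Per-world optimum: theta' = e gives J*(e) = 0 in every world.
expected_J_star = sum(J(e, e) for e in worlds) / len(worlds)

gap = expected_J_star - expected_J(theta_star)
print(f"shared theta* = {theta_star:.3f}, optimality gap = {gap:.4f}")
```

The gap shrinks to zero only when p(e) collapses to a single world, which is the "worlds want the same strategy" case from the derivation.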

05 · Method 2 — Adaptation

Learning to Adapt

Domain randomization throws away information: it builds a policy that ignores e. But what if the policy could figure out which world it's in and specialize? That's adaptive control, and it can recover the optimality gap robustness leaves on the table.

The adaptive policy
π(x, e)   // action depends on state AND the environment parameters

An adaptive policy π(x, e) takes the world parameters as input, so it can be aggressive when friction is high and cautious when it's low. It does not conflict with domain randomization — you do both, giving "robust adaptive control." But it has one fatal problem:

The catch that defines the whole method

e is unknown in the real world. In sim, you hand the policy the true friction. On the real robot, nobody knows the exact friction, mass distribution, or motor wear. So π(x, e) can't be deployed directly — you don't have e to feed it. The entire teacher–student machinery exists to solve this one problem.

The teacher–student pipeline, traced end to end

The trick: train a policy that needs e in simulation (where it's free), then distill it into a second policy that infers what it needs from observable history. Three phases:

Privileged teacher → student distillation

Sim — train the teacher. Train π_teacher(x, e) with RL (PPO), giving it privileged information: the true e (contact states, terrain height, friction, applied disturbances) that only the simulator knows. Because it has oracle access, it learns a near-optimal, world-aware policy quickly. Data in: state x + privileged e. Data out: action.
Sim — train the student. Train π_student(x, o_1:t) using only observable info available on the real robot: a history of proprioception (IMU readings, joint angles/velocities over the last N steps). The student learns to match the teacher's actions. This is the key: the recent history of how the body moved implicitly encodes e — a robot carrying extra mass accelerates differently, and the history reveals it. Data in: state + proprioceptive history. Target: teacher's action.
Real — deploy the student. Ship π_student. It reads its own proprioceptive history, implicitly estimates the world, and acts like the teacher would — without ever being told e.

Why the student can infer e from history

This is the conceptual crux. You can't measure friction directly, but its effects are observable: push off with a fixed torque and the resulting acceleration depends on mass and friction. A short window of (commanded action, resulting motion) pairs is a fingerprint of e. The student is essentially a learned system identifier fused with a controller — it identifies the world from the consequences of its own actions and adapts in real time.
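To make the "fingerprint" claim concrete, here is a hand-written estimator standing in for the learned student. It assumes a one-dimensional body obeying Δv = (u / m)·Δt plus small sensor noise, and recovers the hidden mass from a short window of (commanded force, resulting velocity change) pairs by least squares. A real student is a neural network trained by imitation; the model, noise level, and function names below are assumptions for illustration.

```python
import random


def simulate_history(mass, num_steps=50, dt=0.02, noise_std=0.001, seed=0):
    """Return (commanded force, observed velocity change) pairs for a hidden mass."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_steps):
        u = rng.uniform(-5.0, 5.0)                          # commanded action
        dv = (u / mass) * dt + rng.gauss(0.0, noise_std)    # resulting motion
        history.append((u, dv))
    return history


def estimate_mass(history, dt=0.02):
    """Least-squares fit of 1/mass in dv = (1/mass) * u * dt, then invert."""
    num = sum((u * dt) * dv for u, dv in history)
    den = sum((u * dt) ** 2 for u, _ in history)
    inverse_mass = num / den
    return 1.0 / inverse_mass


light_est = estimate_mass(simulate_history(mass=3.0))
heavy_est = estimate_mass(simulate_history(mass=6.0))
print(f"estimated masses: {light_est:.2f} kg and {heavy_est:.2f} kg")
```

The estimator never sees the mass, only the consequences of its own commands, which is the same information the deployed student has.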

Phase 2 is just imitation learning. The student regresses onto the teacher's outputs — supervised learning with the teacher as an oracle labeler. (If you've seen DAgger, that's the standard tool here: roll out the student, query the teacher for the correct action at each visited state, train on those labels to avoid distribution shift.)
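The DAgger loop described above can be sketched in a few lines. The setup is a toy: the "teacher" is a hypothetical saturating proportional controller, the student is a single linear gain, and the dynamics are a noisy one-dimensional integrator. None of this comes from a specific paper's code; it only shows the pattern of student rollouts, teacher labels, and an aggregated dataset.

```python
import random


def teacher_action(x):
    """Hypothetical expert: a saturating proportional controller."""
    return max(-1.0, min(1.0, -1.2 * x))


def rollout_student(w, rng, steps=30):
    """Run the student a = w * x and return the states it actually visits."""
    x = rng.uniform(-3.0, 3.0)
    visited = []
    for _ in range(steps):
        visited.append(x)
        x = x + 0.1 * (w * x) + rng.gauss(0.0, 0.05)
    return visited


def fit_student(dataset):
    """Least-squares fit of a = w * x to the aggregated (state, label) pairs."""
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den


def dagger(iterations=5, rollouts_per_iter=20, seed=0):
    rng = random.Random(seed)
    w = 0.0        # the untrained student does nothing
    dataset = []   # aggregated over all iterations, never discarded
    for _ in range(iterations):
        for _ in range(rollouts_per_iter):
            for x in rollout_student(w, rng):           # student chooses the states
                dataset.append((x, teacher_action(x)))  # teacher supplies the labels
        w = fit_student(dataset)
    return w, len(dataset)


student_gain, num_labels = dagger()
print(f"student gain after DAgger: {student_gain:.3f} from {num_labels} labels")
```

The key design choice is who generates the states: because the student drives the rollouts, the labels cover the states the student will actually reach at deployment, not only the states the teacher would have visited.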

Variant A — RMA: distill in latent space

The student doesn't have to imitate in action space. RMA (Rapid Motor Adaptation) distills in latent space:

The teacher encodes privileged e into a latent vector z = μ(e) (an "environment embedding"), then acts via π(x, z).
The student trains an adaptation module that predicts ẑ from the proprioceptive history, then reuses the same base policy π(x, ẑ).
Now the student only has to match a small latent vector, not full actions — an easier regression, and the base controller is shared.

Variant B — Asymmetric Actor-Critic: skip the two stages

Two-stage training is a hassle. Asymmetric actor-critic does it in one stage by exploiting a structural asymmetry: the critic is only used during training, never at deployment. So give the critic privileged info and keep the actor deployable.

Component	Sees	Why
Actor π(x, o_1:t)	Only real-deployable info: proprioception + short history	It ships to the robot, so it can only use what the robot has.
Critic V(x, e)	Privileged info: true e, root velocity, contact forces	It's discarded after training, so privileged access is "free" and gives lower-variance value estimates → better gradients.

Real system: FALCON (L4DC'26)

A humanoid loco-manipulation policy. Actor: current proprioception + a 4-step history. Critic: additionally gets root velocity and end-effector force — quantities the robot can't reliably measure but the sim knows. One-stage training, deployable actor, privileged critic. This is asymmetric actor-critic in production.

[Interactive figure: the teacher (gold, sees true mass) adapts instantly. The student (blue) watches a few steps of proprioceptive history, infers the hidden mass, and converges to the teacher's behavior.]

⚖ Robust vs. adaptive: when to use which

Domain randomization

One policy, ignores e, conservative, dead simple, no deployment-time inference. Great when the gap is small or the task tolerates conservatism.

Teacher–student

Infers e online, recovers performance, but needs a 2-stage pipeline (or asymmetric critic) and enough observable history to identify the world.

In practice: do both. Randomize for robustness, adapt to claw back the optimality gap.

Checkpoint — you shall not pass

In your own words: the adaptive policy is π(x, e), but e is never known on the real robot. So how does the deployed student ever act adaptively? Be specific about what it reads and why that suffices.

Model answer

The student never receives e directly. Instead it reads a short history of its own proprioception — recent IMU readings, joint angles and velocities. Because the consequences of the student's own commanded actions depend on the hidden world (a heavier robot accelerates less for the same torque; a low-friction floor lets the foot slip), this history is a fingerprint of e. The student is effectively a learned system-identifier fused with a controller: it infers the latent world from the motion it just produced, then acts as the privileged teacher would for that world. Concretely (RMA): it predicts a latent ẑ from history and feeds it to the shared base policy π(x, ẑ). That's why phase 2 only needs observable inputs — the information about e is implicit in the body's behavior, not handed over explicitly.

06 · Method 3 — Closing the Loop

Real2Sim2Real: Fix the Simulator Itself

The previous two methods take the simulator as given and make the policy cope. This third family does the opposite: use real-world data to make the simulator more like reality, then train in the improved sim. The loop is real → sim → real: collect real data, update the sim, retrain, redeploy.

Definition

Real2Sim

The process of using real-world measurements to correct a simulator — either by identifying its parameters (fixing parametric gaps) or by adding learned residual terms (fixing non-parametric gaps). Then you do sim2real again on the corrected sim. Hence "real2sim2real."

Path A — System Identification (fix the numbers)

If the gap is parametric (wrong masses, frictions), the fix is to estimate the true values from real data. That's classic system identification: run the robot, record states and actions, find the e that makes f_sim(x_t, u_t, e) best predict the observed x_t+1.

System ID as an optimization
ê = arg min_e Σ_t ∥ x_{t+1}^real − f_sim(x_t^real, u_t^real, e) ∥²   // find params that make sim predictions match real transitions
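A minimal sketch of this optimization on a sliding block with one unknown parameter, the friction coefficient. The "real" data here is synthetic (the same model with a hidden friction of 0.8 plus sensor noise), and the arg min is solved by brute-force grid search. Both choices are assumptions made to keep the example self-contained; real system ID uses logged robot data and usually a proper optimizer or sampler.

```python
import random

G = 9.81   # m/s^2
DT = 0.01  # s


def f_sim(v, u, friction, mass=1.0):
    """Sliding block: next velocity under push u and Coulomb friction."""
    direction = (v > 0) - (v < 0)   # sign of the sliding velocity
    return v + DT * (u / mass - friction * G * direction)


def collect_real_transitions(true_friction=0.8, steps=200, seed=0):
    """Stand-in for real logs: the 'real robot' is f_sim plus sensor noise."""
    rng = random.Random(seed)
    v, data = 2.0, []
    for _ in range(steps):
        u = rng.uniform(5.0, 15.0)
        v_next = f_sim(v, u, true_friction) + rng.gauss(0.0, 1e-3)
        data.append((v, u, v_next))
        v = v_next
    return data


def sysid_loss(friction, data):
    """Sum of squared one-step prediction errors for a candidate e."""
    return sum((v_next - f_sim(v, u, friction)) ** 2 for v, u, v_next in data)


def identify_friction(data):
    """arg min over a grid of candidate friction values, 0.30 to 1.50."""
    grid = [i / 100 for i in range(30, 151)]
    return min(grid, key=lambda e: sysid_loss(e, data))


e_hat = identify_friction(collect_real_transitions())
print(f"identified friction: {e_hat:.2f}")
```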

SPI-Active: don't just identify — explore to identify faster

Naive system ID collects whatever data you happen to have, which may be uninformative (you can't estimate friction from a robot standing still). SPI-Active (CoRL'25) adds active exploration: it deliberately picks the actions that will be most informative about the unknown parameters, by maximizing the Fisher Information — a measure of how much a measurement tells you about a parameter. Sampling-based system ID + "go do the experiment that resolves your uncertainty fastest." The robot becomes its own scientist.
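The role of Fisher information can be illustrated on the simplest possible case. This is not SPI-Active's procedure, only a sketch under an assumed linear-Gaussian model: the observed velocity change is Δv = θ·u·Δt plus Gaussian noise, with θ = 1/mass. For that model the Fisher information about θ is Σ (u·Δt)² / σ², so it depends entirely on which actions the robot chooses to take. The candidate action sequences and their names are made up for the example.

```python
DT = 0.02           # s
NOISE_STD = 0.001   # m/s, assumed velocity-measurement noise


def fisher_information(actions):
    """Fisher information about theta = 1/mass under dv = theta * u * DT + noise."""
    return sum((u * DT) ** 2 for u in actions) / NOISE_STD ** 2


candidates = {
    "stand_still": [0.0] * 50,
    "gentle_push": [1.0] * 50,
    "hard_alternating": [5.0 if t % 2 == 0 else -5.0 for t in range(50)],
}

scores = {name: fisher_information(u) for name, u in candidates.items()}
most_informative = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:>17}: Fisher information {score:,.0f}")
print(f"most informative experiment: {most_informative}")
```

Standing still scores exactly zero, the analogue of trying to estimate friction from a motionless robot. By the Cramér-Rao bound, the variance of any unbiased estimate of θ is at least the reciprocal of this number, so picking the high-information sequence directly tightens the achievable estimate.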

Path B — Learn the residual (add missing physics)

If the gap is non-parametric (an effect the sim doesn't model at all), no amount of tuning e helps, because the term is missing. The fix: learn from real data the residual by which the sim's prediction is wrong, and add it back.

Residual-augmented dynamics
x_{t+1} = f_sim(x_t, u_t, e) + g_φ(x_t, u_t)   // known physics + a learned correction g for everything sim missed

Here g_φ is a small neural network trained so that sim-plus-residual matches real trajectories. You keep the cheap, mostly-correct physics engine and let a learned term mop up aerodynamics, cable forces, unmodeled friction — whatever f_sim forgot. The pipeline: use real data to augment the simulator, train RL in the augmented sim, deploy.
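A minimal sketch of the residual idea, with a one-parameter linear model standing in for the small neural network g_φ. The assumed setup: the simulator has no drag term at all, the "real" system (synthetic here) has linear drag, and g_φ(v) = φ·v is fit by least squares to the simulator's one-step prediction errors. The drag coefficient, noise level, and function names are illustrative assumptions.

```python
import random

DT = 0.01  # s


def f_sim(v, u, mass=1.0):
    """Idealized simulator: no drag term at all."""
    return v + DT * u / mass


def f_real(v, u, rng, drag=0.5):
    """Stand-in for the real robot: same physics plus unmodeled linear drag."""
    return f_sim(v, u) - DT * drag * v + rng.gauss(0.0, 1e-4)


# Collect "real" transitions (v_t, u_t, v_{t+1}).
rng = random.Random(0)
v, data = 0.0, []
for _ in range(500):
    u = rng.uniform(0.0, 10.0)
    v_next = f_real(v, u, rng)
    data.append((v, u, v_next))
    v = v_next

# Fit g_phi(v) = phi * v to the sim's one-step prediction errors (least squares).
residuals = [(v, v_next - f_sim(v, u)) for v, u, v_next in data]
phi = sum(v * r for v, r in residuals) / sum(v * v for v, _ in residuals)


def f_aug(v, u):
    """Residual-augmented dynamics: f_sim plus the learned correction."""
    return f_sim(v, u) + phi * v


def mse(model):
    """Mean squared one-step prediction error on the real transitions."""
    return sum((v_next - model(v, u)) ** 2 for v, u, v_next in data) / len(data)


mse_sim, mse_aug = mse(f_sim), mse(f_aug)
print(f"phi = {phi:.5f}, sim-only MSE = {mse_sim:.2e}, augmented MSE = {mse_aug:.2e}")
```

The physics engine is left untouched; only the additive term is learned. RL would then be run against `f_aug` instead of `f_sim`.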

The canonical residual: the Actuator Net

The single most impactful instance of this idea in legged robotics. The problem: real motors/transmissions have complex dynamics — they don't produce exactly the torque you command, due to gearing, friction, and saturation. Sim's idealized motor model is a major reality gap for locomotion.

Actuator Net (Hwangbo et al., Science Robotics 2019)

Collect real motor data. On the real robot, record (position error history, joint velocity history) → (actual measured torque), where position error is the commanded joint position minus the measured one.
Train a network τ̂ = h_φ(history of position errors & velocities) to predict the real torque the motor actually delivers.
Drop it into sim. Replace the idealized motor model with the learned actuator net. The simulated motor now behaves much more like the real one, which narrows one of the largest gaps (part parametric, part non-parametric) for legged robots.

"Unsupervised" actuator net — when you lack torque labels

Training the actuator net above needs torque labels (a torque sensor), which many robots don't have. The clever fix: make it label-free using RL. Train a residual torque model whose objective is to make the simulated trajectory match the observed real trajectory — you never need a torque sensor, only the (freely observed) joint positions over time. The residual is discovered by trajectory matching, not supervised regression. This is "real2sim by learning dynamics residuals" in its slickest form.

[Interactive figure: commanded torque (gold) vs. what an idealized sim motor delivers vs. the real motor (red, with lag and saturation). Adding the actuator net pulls the sim curve onto the real one.]

📐 Design It Your quadruped slips on real grass. Which tool?

A sim-trained quadruped walks fine on lab floor but slips and stumbles on wet grass outdoors. You have: a budget to collect 10 minutes of real outdoor data, and the four methods from this lecture. Diagnose the gap and pick a fix.

Symptom: slips on grass
Real data: 10 min available
Goal: zero-shot on new terrain

1. Is "slips on grass" parametric or non-parametric? Both, actually — argue it.

2. What would you randomize, identify, and/or learn as a residual?

Diagnosis: mostly parametric — grass has lower, variable friction than lab floor. But also non-parametric: grass deforms and the contact isn't a clean rigid point, which sim's hard-contact model misses.

Layered fix (this is how production systems do it):

(1) Domain-randomize friction widely — the cheapest, do-it-anyway move. Train across friction [0.3, 1.2] so low-grip grass is inside the distribution. Often this alone fixes the slip.

(2) Add teacher–student adaptation so the policy infers the current grip from its recent slip history and adjusts gait online — grass grip varies patch to patch, so online adaptation beats a fixed robust policy.

(3) If still failing, spend the 10 min of data on real2sim: identify grass friction (system ID) and learn a contact residual for the deformable surface, then retrain. Use real data last, because it's the most expensive step.

The meta-lesson: the methods compose. Robustness is the floor, adaptation recovers performance, and real2sim is the targeted fix when the first two aren't enough.

07 · Advanced Topic

Sim2Real from Human Data

Recall the quiet killer from Section 3: reward design. For locomotion, "go forward, stay up" is a fine reward. For loco-manipulation and dexterous hands — pour, assemble, open a door — hand-crafting a reward is a nightmare. So why design rewards at all? Humans already know how to do these tasks. Let's learn from human demonstrations.

No free lunch: the physics gap

You can't just copy a human's motion onto a robot. A human and a humanoid have different limb lengths, mass distributions, joint limits, and degrees of freedom. A human's hand trajectory is a statement of intent ("move the cup here"), not a sequence of robot-feasible actions. Naively replaying human motion makes the robot fall over. There is a physics gap between human intents and robot actions.

The key insight: simulation bridges the physics gap

This is the elegant move. Human data gives you the what (the intended motion). Simulation provides physics grounding: it forces the imitated motion to obey the robot's real dynamics, balance, and contacts. So the pipeline is: retarget human motion to the robot's body (kinematics), then use RL in simulation to learn a controller that tracks that motion while staying physically feasible. Sim turns "human intent" into "robot-executable, balanced action."

The two-step recipe

Retargeting + Policy Learning

Motion retargeting (kinematics-level). Map the human's joint trajectory onto the robot's skeleton — scale limb lengths, respect joint limits, preserve the shape of the motion. Output: a reference trajectory the robot's body could in principle follow. This is geometry, not physics — it ignores whether the motion is balanced.
Policy learning in sim (dynamics-level). Train an RL policy whose reward is "track the retargeted reference closely" — a whole-body tracking objective. The physics simulator penalizes anything infeasible (falling, slipping), so the learned policy produces a balanced, dynamically-valid version of the human motion. Examples: ASAP (RSS'25), BeyondMimic (Aug 2025).
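The two steps above can be sketched at the level of a single frame. Both functions are simplified stand-ins: the retargeting step only clips human joint angles to the robot's limits (real retargeting also handles limb scaling and differing skeletons), and the tracking reward uses an exponential kernel on joint-angle error, which is a common shape for such rewards but is not claimed to be the exact reward used by ASAP or BeyondMimic. The limits and `sigma` are made-up values.

```python
import math


def retarget(human_angles, joint_limits):
    """Kinematic stand-in: clip each human joint angle to the robot's limits."""
    return [max(lo, min(hi, q)) for q, (lo, hi) in zip(human_angles, joint_limits)]


def tracking_reward(robot_angles, reference_angles, sigma=0.5):
    """Reward in (0, 1]: equals 1 when the robot matches the reference exactly."""
    squared_error = sum((q - r) ** 2 for q, r in zip(robot_angles, reference_angles))
    return math.exp(-squared_error / sigma ** 2)


# Hypothetical 3-joint example: the human's first joint exceeds the robot's range.
joint_limits = [(-1.0, 1.0), (-1.0, 1.0), (0.0, 2.0)]
human_frame = [2.0, -0.3, 1.5]
reference = retarget(human_frame, joint_limits)

perfect = tracking_reward(reference, reference)
slightly_off = tracking_reward([0.9, -0.3, 1.5], reference)
far_off = tracking_reward([0.0, 0.5, 0.5], reference)
print(reference, perfect, round(slightly_off, 3), round(far_off, 3))
```

Nothing in `retarget` checks balance or contact; that burden falls on the RL policy trained in simulation against `tracking_reward`, where falling or slipping simply makes the reference impossible to follow.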

When the object matters too: OmniRetarget

Whole-body tracking handles locomotion-like motion. But loco-manipulation and scene interaction — carrying a box, leaning on a wall, doing a "wall flip" — involve the robot and the objects it touches. Retargeting the robot alone breaks the interaction (the hand misses the box).

OmniRetarget (ICRA'26) — interaction-preserving retargeting

The fix: jointly retarget the robot and the object, preserving the interaction between them via an interaction mesh — a representation of the spatial relationship (contacts, relative positions) between robot and object. Retargeting then preserves "hand is on the box" even as it scales the human motion to the robot. The result enables parkour and a wall flip with up to 890°/second angular rate — relying only on proprioception. The data-generation engine produces interaction-correct demos that sim-based RL then makes feasible.
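To make "preserving the interaction" concrete, here is a small sketch of an interaction-mesh-style energy. It follows the Laplacian-coordinate idea from the character-animation literature; whether OmniRetarget uses exactly this energy is an assumption, and the keypoints, neighbor lists, and function names are hypothetical.

```python
import numpy as np

def laplacian_coords(points, neighbors):
    """Each keypoint minus the mean of its mesh neighbors: a local shape descriptor."""
    return np.stack([points[i] - points[nbrs].mean(axis=0)
                     for i, nbrs in enumerate(neighbors)])

def interaction_energy(source_points, target_points, neighbors):
    """Sum of squared changes in Laplacian coordinates between two poses."""
    diff = (laplacian_coords(source_points, neighbors)
            - laplacian_coords(target_points, neighbors))
    return float(np.sum(diff ** 2))

# Hypothetical mesh: points 0-1 are on the hand, points 2-3 are on the box.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
neighbors = [[1, 2], [0, 3], [0, 3], [1, 2]]

moved_together = points + np.array([0.5, -0.2, 1.0])  # hand and box move as one
hand_only = points.copy()
hand_only[0] += np.array([0.0, 0.0, 0.3])              # hand drifts off the box
```

Moving the hand and box together leaves the energy at zero, while moving the hand alone raises it. A retargeter that minimizes such an energy is therefore pushed to keep the hand on the box.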

Pushing to dynamics-level: SPIDER

Kinematic retargeting can still produce dynamically-questionable references. SPIDER (Scalable Physics-Informed DExterous Retargeting) does dynamics-level retargeting: it frames retargeting as an optimal control problem and solves it with sampling-based methods, so the output is already dynamically feasible. The proof: SPIDER's retargeted trajectories can be replayed open-loop on the real robot — pick up a cup, rotate a lightbulb, unplug a charger — with no closed-loop policy at all. That's only possible because the motion was made physically valid from the start.
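As a toy illustration of "retargeting as optimal control, solved by sampling", the sketch below optimizes an acceleration sequence for a 1-D point mass so that its rollout tracks a reference. It is a plain random search, not SPIDER's actual sampler, and the dynamics, cost weights, and names are all assumptions chosen for brevity.

```python
import numpy as np

DT = 0.1  # integration step in seconds (hypothetical)

def rollout(controls, x0=0.0, v0=0.0):
    """Open-loop replay: integrate a 1-D point mass under an acceleration sequence."""
    x, v, positions = x0, v0, []
    for u in controls:
        v = v + DT * u
        x = x + DT * v
        positions.append(x)
    return np.array(positions)

def cost(controls, reference):
    """Tracking error plus a small effort penalty."""
    tracking = np.sum((rollout(controls) - reference) ** 2)
    return float(tracking + 1e-3 * np.sum(controls ** 2))

def sample_based_retarget(reference, iters=30, samples=64, sigma=1.0, seed=0):
    """Random-search trajectory optimization: perturb the best plan, keep improvements."""
    rng = np.random.default_rng(seed)
    best_u = np.zeros(len(reference))
    best_c = cost(best_u, reference)
    for _ in range(iters):
        candidates = best_u + sigma * rng.standard_normal((samples, len(reference)))
        costs = [cost(u, reference) for u in candidates]
        i = int(np.argmin(costs))
        if costs[i] < best_c:           # accept only strict improvements
            best_u, best_c = candidates[i], costs[i]
        sigma *= 0.9                    # shrink the search as the plan improves
    return best_u, best_c

reference = np.linspace(0.0, 1.0, 20)   # "move the cup here" as a position ramp
controls, final_cost = sample_based_retarget(reference)
replayed = rollout(controls)            # open-loop replay through the same dynamics
```

Because every candidate is scored by rolling it through the dynamics, the returned control sequence is feasible for that model by construction. That is the property that makes open-loop replay possible.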

🎯 The ladder of retargeting fidelity

Kinematic (shape only, may be infeasible) → + RL tracking in sim (physics-grounded via a learned controller) → dynamics-level / optimal-control (feasible by construction, open-loop replayable). Each rung pushes more of the physics-feasibility burden from the deployment-time policy into the data-generation step.

08 · Advanced Topic

RL Algorithms for Sim2Real

Every method so far trains a policy with RL. Which RL algorithm? For sim2real, the empirical answer is blunt and important:

The workhorse

On-policy policy-gradient methods like PPO are extremely effective for sim2real. Despite being "sample-inefficient" in theory, PPO shines here because samples are nearly free: massively parallel sim generates billions of transitions cheaply, so PPO's appetite for fresh on-policy data is a non-issue. Its stability (the clipped surrogate objective, which acts as an approximate trust region) matters far more than its sample count. This is why almost every locomotion and humanoid result you've seen is PPO under the hood.

If you haven't internalized PPO, that's the prerequisite for this whole section — see the companion Policy Gradients lesson. Everything below is a refinement of "run the policy, weight log-probs by advantage, ascend, don't step too far."
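For reference, the "weight log-probs by advantage, don't step too far" summary corresponds to PPO's clipped surrogate objective. The sketch below evaluates it with NumPy on precomputed log-probabilities. The function name and the 0.2 clip value are illustrative defaults, and a real implementation would compute this inside an autodiff framework so it can be ascended.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over a batch."""
    ratio = np.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))  # pessimistic of the two

# If the new policy doubles an action's probability and the advantage is +1,
# the objective is capped at 1 + clip_eps instead of growing to 2.
capped = ppo_clip_objective(np.log([2.0]), np.log([1.0]), np.array([1.0]))
```

The minimum makes the bound one-sided: moves that would increase the objective stop paying off once the ratio leaves the clip range, while moves that make things worse are still fully penalized.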

Frontier directions, and the problem each one attacks

Method	Attacks	How
SAPG (Split & Aggregate Policy Gradient)	PPO under-uses massively parallel envs	With thousands of envs, naive PPO's single big batch wastes the diversity. SAPG splits envs into chunks, computes gradients per chunk, and aggregates — better leveraging the parallelism for higher throughput and performance.
FastTD3 / FastSAC	On-policy wastes old data	Bring off-policy methods (which reuse a replay buffer) into the sim2real regime, tuned to exploit parallel envs. More sample reuse where it helps.
FPO / FPO++ (Flow Policy Optimization)	Standard PPO assumes a simple Gaussian policy	Policy-gradient training for flow-matching policies — expressive, multimodal action distributions (think diffusion-style policies) optimized with PG. Lets the policy represent richer behaviors than a unimodal Gaussian.
BFM-Zero (ICLR'26)	You must retrain for every new reward	An unsupervised RL approach using a forward–backward representation: it learns a behavioral foundation model that can optimize any user-specified reward at test time, no retraining. A "promptable" controller for humanoids.

Definition

Forward–backward (FB) representation

A way to pre-learn the long-run consequences of behaviors such that, given any reward function at test time, you can immediately read off a near-optimal policy — without running RL again. BFM-Zero uses it to make a single trained humanoid controller "promptable" with arbitrary rewards, the way an LLM is promptable with arbitrary instructions.
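A sketch of how the "read off a policy" step works in the forward–backward framework, as it is usually described: a task vector z is the reward-weighted average of backward embeddings B(s), and the value of an action is the inner product of the forward embedding F(s, a, z) with z. The arrays below stand in for trained networks, and the function names are hypothetical; this is not BFM-Zero's code.

```python
import numpy as np

def infer_task_vector(B_states, rewards):
    """z = average over sampled states of r(s) * B(s).

    B_states: (N, d) backward embeddings of N sampled states.
    rewards:  (N,) user-specified reward evaluated at those states.
    """
    return (rewards[:, None] * B_states).mean(axis=0)

def greedy_action(F_sa, z):
    """Pick argmax_a F(s, a, z)^T z, given F evaluated for each candidate action.

    F_sa: (A, d) forward embeddings for A candidate actions at the current state.
    """
    return int(np.argmax(F_sa @ z))

# Stand-in embeddings (a real system would get these from trained F and B networks).
B_states = np.array([[1.0, 0.0], [0.0, 1.0]])
rewards = np.array([2.0, 0.0])              # the "prompt": reward only in state 0
z = infer_task_vector(B_states, rewards)
F_sa = np.array([[0.0, 5.0], [3.0, 0.0]])
action = greedy_action(F_sa, z)
```

Changing the reward only changes z, which is a cheap average. No gradient step on the policy is needed, which is what makes the controller promptable.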

Why "just use the fanciest algorithm" is wrong

For most sim2real projects, PPO + good domain randomization + a solid reward beats a fancy off-policy method with sloppy randomization. The algorithm is rarely the bottleneck; the training distribution and reward are. These frontier methods earn their keep at the edges — squeezing more from parallel envs, representing multimodal skills, or amortizing across many rewards. Reach for them when PPO is genuinely your limiting factor, not by default.

09 · The Big Picture

Sim2Real 1.0 → 4.0

Step back. The control community has been doing sim2real for decades — they just didn't call it that. Seeing the lineage tells you where the field is going.

Sim2Real 1.0 — classical model-based control

Long before deep RL, engineers balanced inverted pendulums and made robots hop using reduced-order models as their "simulator": an inverted-pendulum model, a single-rigid-body model. The controller was online model predictive control (MPC/NMPC) — at every tick, solve an optimization over the simple model to pick the next action.

What's fascinating (and a little strange)

Sim2Real 1.0 has no pretraining at all. There's no learned policy, no offline training run. It relies 100% on very fast (>100 Hz) online reasoning — re-solving the control problem from scratch every few milliseconds. It's the opposite philosophy from deep RL, which front-loads enormous offline training so that deployment is a cheap forward pass. Both work. That tension — when does "learning" happen, offline or online? — organizes the whole taxonomy.
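To show what "re-solving the control problem from scratch" looks like, here is a minimal receding-horizon controller for a linearized inverted pendulum. It is a sketch under strong assumptions: the model is linear and unconstrained, so each solve reduces to a finite-horizon LQR recursion, whereas real NMPC handles nonlinear models and constraints. The pendulum parameters and cost weights are hypothetical.

```python
import numpy as np

G, LENGTH, DT = 9.81, 1.0, 0.01                     # hypothetical pendulum, 100 Hz tick
A = np.array([[1.0, DT], [G / LENGTH * DT, 1.0]])   # Euler-discretized linearized dynamics
B = np.array([[0.0], [DT]])
Q = np.eye(2)                                       # state cost
R = np.array([[0.1]])                               # effort cost

def mpc_action(x, horizon=100):
    """Solve a finite-horizon LQ problem from scratch; return only the first action."""
    P = Q.copy()                                    # terminal cost
    for _ in range(horizon):                        # backward Riccati recursion
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return float((-K @ x)[0])                       # K now holds the first-step gain

def run(ticks=300, theta0=0.2):
    """Simulate closed-loop control; state is [angle (rad), angular velocity (rad/s)]."""
    x = np.array([theta0, 0.0])
    for _ in range(ticks):
        u = mpc_action(x)                           # re-plan every tick, no stored policy
        x = A @ x + B[:, 0] * u
    return x

final_state = run()
```

Nothing is trained or cached between ticks: all of the "intelligence" is the optimization inside `mpc_action`, which is why the simple model and the solve speed matter so much in this era.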

The two axes that organize everything

Guanya Shi's framing places every method on a 2D map:

Vertical axis — when does "learning"/reasoning happen? Online (reason at deploy time, like MPC) vs. offline (train a policy first, like RL).
Horizontal axis — "sim" (model) fidelity & diversity. From crude reduced-order models → full simulators → sim corrected by real data (real2sim) → generative sims and world models.

Era	Model ("sim")	Learning	Signature
1.0	Reduced-order model	Online NMPC, no pretraining	Fast online reasoning, hand-derived models
2.0	Full simulator	Offline RL training	The "train in sim, deploy" paradigm of this lecture
3.0	Sim + real2sim correction	Offline RL (RL++)	Close the loop with real data; better algorithms
4.0	Generative sim / world model	Online + offline combined	Better model × better RL × better online reasoning, fused

Where the field is heading: Sim2Real 4.0

The frontier isn't "pick offline RL or online MPC" — it's both at once, on a model that is itself partly learned (a generative simulator or world model). Imagine a humanoid with a foundation-model controller (offline-trained, promptable like BFM-Zero) that also does fast online correction (MPC-style) against a world model continuously updated from real data (real2sim). Better model × better RL algorithm × better online reasoning, all reinforcing each other.

Figure: the sim2real map. The arrow of progress runs toward the top-right: richer learned models fused with both offline training and online reasoning.

Topics this breadth tour skipped (frontiers)

For completeness, things a deeper course would add: sim & real co-training (train on a mix of sim and real data simultaneously), simulation for policy evaluation (use sim to predict real performance before deploying), and differentiable simulation (make f_sim differentiable so you can backprop through physics for gradient-based policy optimization).

10 · Put It Together

The Full Pipeline, Interactive

Here is the entire sim2real recipe as one machine. Run a virtual humanoid through the pipeline: pick how much you randomize, whether you add adaptation, and whether you've done real2sim correction — then deploy to a "real" robot with a hidden, unknown e and watch what survives. This is every section above, composed.

Interactive controls: domain randomization, teacher–student adaptation, real2sim correction.

What the showcase is teaching

Each toggle shrinks the reality gap a different way: randomization widens the policy's competence so reality falls inside it; adaptation lets the policy re-center on the actual world from observed history; real2sim moves the training distribution itself toward reality. Stacking all three is how real humanoid systems achieve zero-shot transfer. No single trick is enough on a hard task — the gap is closed in layers.

💻 Implement It: Write the domain-randomization training loop

Fill in the per-episode environment randomization for a parallel sim2real training loop. The key line: each env resamples its own e from p(e) at reset.

Signature

def randomize_env(env, ranges):
    """Sample new physics params e ~ p(e) for one env reset."""
    # ranges: dict like {"friction": (0.3, 1.2), "mass_scale": (0.8, 1.2), ...}
    # TODO: set env.friction, env.mass, env.motor_gain, env.latency, ...

Sanity check

After 1000 resets, the empirical distribution of env.friction should be ~uniform over (0.3, 1.2), and the trained policy should succeed at the real friction even though it never saw that exact value.

Solution

import numpy as np

def randomize_env(env, ranges):
    """Sample new physics params e ~ p(e) for one env reset."""
    def U(lo, hi):  # uniform sample
        return np.random.uniform(lo, hi)

    env.friction = U(*ranges["friction"])  # parametric: grip
    env.mass = env.base_mass * U(*ranges["mass_scale"])
    env.motor_gain = U(*ranges["motor_gain"])  # actuation strength
    env.latency = U(*ranges["latency"])  # obs/action delay (s)
    env.com_offset = np.random.uniform(-0.02, 0.02, size=3)  # COM shift
    # disturbance: schedule a random push during the episode
    env.push_step = np.random.randint(0, env.horizon)
    env.push_force = U(0, ranges["max_push"])
    return env

# In the parallel rollout:
# for env in envs:  # thousands of them, on GPU
#     randomize_env(env, RANGES)  # each gets its own world e
# obs = [env.reset() for env in envs]
# # ... PPO collects on-policy data across ALL randomized worlds ...
# # one shared policy pi_theta(x) must succeed on every sampled e

Why uniform? Uniform p(e) makes no assumption about which value reality takes; it only requires that the ranges be wide enough for reality to fall inside the support. Some systems use log-uniform for scale parameters (friction, gains) so they're randomized evenly in ratio, not absolute terms.

Why randomize latency? Real control loops have variable delay; a policy that never saw delay learns to react instantly and then oscillates on hardware.
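For the log-uniform option, a minimal sampler might look like the sketch below. The function name is hypothetical, and the (0.3, 1.2) range is just the friction range from the exercise.

```python
import numpy as np

def log_uniform(lo, hi, size=None, rng=None):
    """Sample so that log(value) is uniform on [log(lo), log(hi)]; needs 0 < lo < hi."""
    if rng is None:
        rng = np.random.default_rng()
    return np.exp(rng.uniform(np.log(lo), np.log(hi), size=size))

# Friction randomized evenly in ratio: about half the samples fall below the
# geometric mean sqrt(0.3 * 1.2) = 0.6, not below the arithmetic midpoint 0.75.
frictions = log_uniform(0.3, 1.2, size=10_000, rng=np.random.default_rng(0))
```

Compared with a plain uniform draw, this puts more samples near the low end of the range, which can matter when a halving of friction is as significant as a doubling.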

11 · Consolidate

Cheat Sheet & Connections

Everything, on one page. If you can reconstruct the right column from the left, you've mastered the lecture.

Concept	The one thing to remember
The sim step	`x_{t+1} = f_sim(x_t, u_t, e)`. Every method attacks the `e`. Reality gap = `f_sim ≠ f_real`.
Why sim	Cheap, fast, safe, free labels, no wear. 20,000× data rate via massive parallelism.
Two gaps	Parametric (wrong numbers → tune via system ID). Non-parametric (missing physics → learn a residual).
Why small gaps kill	Errors compound over time; an aggressive RL policy is a brittle exploiter of its exact training world.
Domain randomization	Randomize `e ∼ p(e)`, train one robust `π(x)`. = robust control. Pays an optimality gap; tune the width.
Learning to adapt	`π(x, e)` = adaptive control, but `e` unknown in real → privileged teacher–student: teacher sees `e`, student infers it from proprioceptive history.
RMA	Distill in latent space: student predicts env embedding `ẑ`, reuses base policy `π(x, ẑ)`.
Asymmetric actor-critic	One stage. Actor sees deployable obs; critic sees privileged `e` and is discarded after training. E.g. FALCON.
Real2Sim2Real	Use real data to fix the sim: system ID (incl. active exploration, SPI-Active) for numbers; residual nets (actuator net) for missing physics.
Human data	Skip reward design: retarget human motion → RL-track in sim for physics grounding. OmniRetarget (interaction mesh), SPIDER (dynamics-level, open-loop replayable).
RL algorithm	PPO dominates — samples are free, stability matters. Frontiers: SAPG, FastTD3, FPO (flow policies), BFM-Zero (promptable, FB representation).
1.0 → 4.0	1.0 online NMPC on reduced models → 2.0 RL in full sim → 3.0 + real2sim → 4.0 generative sim + online&offline fused.

The decision tree

"My sim-trained policy fails on the real robot. What now?"

Is reality inside your training distribution? If not → widen domain randomization first. Cheapest fix, do it always.
Does the world vary at deploy time (grip, payload)? → add teacher–student adaptation so the policy infers e online.
Is a specific number wrong (mass, friction)? → system ID from real data (parametric).
Is an effect missing entirely (aero, deformable contact)? → learn a residual / actuator net (non-parametric).
Is the reward/task the problem, not dynamics? → learn from human demos via retargeting + sim tracking.

The one sentence

Sim2real is the art of making a policy survive the gap between f_sim and f_real — by making the policy robust to the gap, adaptive to it, or by shrinking the gap itself with real data. The best systems do all three, in layers.

Where to go next

→ Policy Gradients (the PPO inside every method)
→ Gleams index: more CS224R

🔗 The thread continues

This lecture trained robots to move. The next frontier fuses this with perception and language — robots that take a camera image and a text instruction and act. That's the subject of RL for Vision-Language-Action models, where sim2real, RL, and foundation models collide. Everything you learned here — the reality gap, domain randomization, the role of PPO — carries straight over.
