Sensorimotor World Models

Summary

Sensorimotor World Models (SMWM) is the inverse-dynamics-regularized JEPA (joint-embedding predictive architecture) world-model method introduced in the paper "Sensorimotor World Models: Perception for Action via Inverse Dynamics". It trains a pixel encoder and a latent forward dynamics model end-to-end on offline, reward-free trajectories of observations and actions, using an inverse dynamics head as the sole anti-collapse regularizer.

The key idea is that the representation should preserve the state variables that explain action-conditioned transitions. Discarding variation that does not help explain the agent's own actions (partial collapse) is therefore a design feature for a single agent with a fixed action repertoire, but a caveat for broader foundation-model use, where the discarded variables may still matter.
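
The asymmetry described above can be illustrated with a toy example. This is a minimal sketch, not the paper's setup: the two-variable observation, the scalar encoders `keep_position` and `keep_distractor`, and the linear inverse head fitted by least squares are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(0)

# Toy transitions (assumed setup): observation = (position, distractor).
# The action shifts the position; the distractor is resampled at random.
transitions = []
for _ in range(200):
    pos, dis = rng.uniform(-1, 1), rng.uniform(-1, 1)
    action = rng.uniform(-0.5, 0.5)
    next_obs = (pos + action, rng.uniform(-1, 1))
    transitions.append(((pos, dis), action, next_obs))

# Two hypothetical scalar encoders: each keeps one variable and drops the other.
def keep_position(obs):
    return obs[0]

def keep_distractor(obs):
    return obs[1]

def inverse_dynamics_mse(encode):
    """Error of the best linear inverse head a_hat = w * (z_next - z) + b."""
    deltas = [encode(o2) - encode(o1) for o1, _, o2 in transitions]
    actions = [a for _, a, _ in transitions]
    n = len(actions)
    mean_d, mean_a = sum(deltas) / n, sum(actions) / n
    var_d = sum((d - mean_d) ** 2 for d in deltas) / n
    cov = sum((d - mean_d) * (a - mean_a) for d, a in zip(deltas, actions)) / n
    w = cov / var_d            # closed-form 1-D least squares
    b = mean_a - w * mean_d
    return sum((w * d + b - a) ** 2 for d, a in zip(deltas, actions)) / n

for name, enc in [("keep_position", keep_position),
                  ("keep_distractor", keep_distractor)]:
    print(name, inverse_dynamics_mse(enc))
```

In this toy, dropping the distractor costs the inverse head nothing, while dropping the position leaves the action unrecoverable. The same mechanism would discard a variable that matters for some other task if the agent's actions never influence it.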

Method Contract

  • Input data: transition tuples (o_t, a_t, o_{t+1}) with image observations and continuous actions/control inputs.
  • State representation: an encoder maps each observation into a compact latent state.
  • Forward model: predicts the next latent state from the current latent state and action.
  • Inverse model: predicts the executed action from consecutive latent states.
  • Anti-collapse signal: action recovery from latent transitions forces the encoder to preserve action-relevant information.
  • Planning hook: with the encoder and dynamics model frozen, CEM/MPC (cross-entropy method, model predictive control) can compare candidate action sequences by rolling them out in latent space.
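
The contract above can be read as two loss terms computed per transition. The following is a minimal sketch under stated assumptions: the squared-error form of both terms, the unweighted sum, and the toy linear stand-ins for the encoder, forward model, and inverse model are illustrative choices, not details confirmed by the source.

```python
def sq_err(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)

def transition_losses(encoder, forward_model, inverse_model, o_t, a_t, o_next):
    """Forward-prediction and inverse-dynamics losses for one (o_t, a_t, o_{t+1})."""
    z_t = encoder(o_t)
    z_next = encoder(o_next)
    z_pred = forward_model(z_t, a_t)       # predict next latent from latent + action
    a_pred = inverse_model(z_t, z_next)    # recover action from consecutive latents
    return sq_err(z_pred, z_next), sq_err(a_pred, a_t)

# Hypothetical toy modules (assumption): a 2-D "observation" moved by a 2-D action.
def encoder(o):
    return [o[0], o[1]]

def forward_model(z, a):
    return [z[0] + a[0], z[1] + a[1]]

def inverse_model(z, z_next):
    return [z_next[0] - z[0], z_next[1] - z[1]]

o_t, a_t, o_next = [0.0, 1.0], [0.5, -0.25], [0.5, 0.75]
fwd_loss, inv_loss = transition_losses(encoder, forward_model, inverse_model,
                                       o_t, a_t, o_next)
total_loss = fwd_loss + inv_loss           # equal weighting is an assumption

# A fully collapsed encoder with a constant forward model makes the forward
# loss trivially zero, but the action can no longer be recovered.
collapsed_fwd, collapsed_inv = transition_losses(
    lambda o: [0.0, 0.0], lambda z, a: [0.0, 0.0], inverse_model,
    o_t, a_t, o_next)
print(fwd_loss, inv_loss, collapsed_fwd, collapsed_inv)
```

The collapsed case shows why a latent-space forward loss needs a companion term: a constant latent satisfies the forward loss on its own, and here the inverse term is the only part of the objective that penalizes it.
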

flowchart LR
  Data["offline reward-free trajectories"]
  Enc["encoder"]
  Fwd["latent forward dynamics"]
  Inv["inverse dynamics regularizer"]
  Plan["CEM / MPC latent planner"]

  Data --> Enc --> Fwd --> Plan
  Enc --> Inv --> Enc
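
The planner stage at the end of the diagram can be sketched with a small cross-entropy method loop. This is an illustrative sketch only: `cem_plan`, `rollout_cost`, the goal-distance cost, the hyperparameters, and the additive `toy_forward` dynamics are hypothetical stand-ins for a frozen learned model, not the official planning code.

```python
import random
import statistics

def rollout_cost(z0, actions, forward_model, z_goal):
    """Roll an action sequence through the latent dynamics; score the final state."""
    z = z0
    for a in actions:
        z = forward_model(z, a)
    return sum((x - g) ** 2 for x, g in zip(z, z_goal))

def cem_plan(z0, z_goal, forward_model, horizon=5, act_dim=2,
             pop=64, n_elite=8, iters=10, seed=0):
    """Cross-entropy method over action sequences with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [[0.0] * act_dim for _ in range(horizon)]
    std = [[1.0] * act_dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        cands = [[[rng.gauss(mean[t][k], std[t][k]) for k in range(act_dim)]
                  for t in range(horizon)] for _ in range(pop)]
        # Keep the lowest-cost candidates and refit the distribution to them.
        cands.sort(key=lambda seq: rollout_cost(z0, seq, forward_model, z_goal))
        elites = cands[:n_elite]
        for t in range(horizon):
            for k in range(act_dim):
                vals = [e[t][k] for e in elites]
                mean[t][k] = statistics.fmean(vals)
                std[t][k] = statistics.pstdev(vals) + 1e-6
    return mean

# Hypothetical frozen dynamics (assumption): actions add to the latent state.
def toy_forward(z, a):
    return [z[0] + a[0], z[1] + a[1]]

z0, z_goal = [0.0, 0.0], [1.0, -1.0]
plan = cem_plan(z0, z_goal, toy_forward)
first_action = plan[0]   # in MPC, only this action is executed before replanning
print(rollout_cost(z0, plan, toy_forward, z_goal))
```

In an MPC loop, the agent would execute `first_action`, encode the new observation, and call the planner again. A goal-image latent is one possible choice for `z_goal`, but the cost function is an assumption here.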

Official Artifacts

The official repository includes a toy dot-world subproject and a planning subproject. The planning reproduction path requires Linux/CUDA dependencies and derives from LeWorldModel.

Relevance To This Wiki

SMWM belongs on the JEPA, Representation Collapse, Latent-Space Predictive Learning, and World Models branches. Its main contribution to the wiki is not a claim that inverse dynamics is universally better than SIGReg. Rather, it adds a different regularization principle: use the action channel itself to decide which state variables a latent world model should keep.

For time-series and digital-world robot work, the transfer question is whether an analogous action/intervention-recovery objective can make multivariate models preserve controllable state, intervention effects, and useful latent geometry without erasing variables needed for safety, diagnostics, or downstream tasks.

Caveats

  • Evidence is visual/robotic control evidence, not numeric time-series evidence.
  • Action recoverability from observations is an assumption, not a guarantee.
  • Partial collapse can be beneficial for one action repertoire but harmful for a general foundation model.
  • Weak or incomplete action labels may require additional regularizers such as SIGReg, reconstruction grounding, or broader state probes.
  • Offline dataset coverage and long-horizon rollout error remain ordinary world-model risks.