Sensorimotor World Models

Source

Status And Credibility

Sensorimotor World Models is a current arXiv preprint: arXiv lists v1 as submitted on 2026-06-18 in cs.LG and cs.AI. The authors are Petr Ivashkov, Randall Balestriero, and Bernhard Schölkopf from Max Planck Institute for Intelligent Systems, Brown University, ELLIS Institute Tübingen, and ETH Zürich. Credibility is strong enough to track as an important Alex-provided source because the paper is current, the team is credible in JEPA/world-model representation learning, and the paper has an official project page, official MIT-licensed code repository, rendered PDF, and co-author X thread. No peer-reviewed venue acceptance was found during ingest, so the source should be treated as a preprint until a venue status is verified.

Core Claim

Sensorimotor World Models (SMWM) argues that a latent world model should be shaped by perception for action: the representation should preserve state variables that explain how actions change observations, not every visually faithful detail. The concrete method trains an encoder, a forward latent dynamics model, and an inverse dynamics head end-to-end from offline, reward-free pixel/action transitions. The inverse head is the only anti-collapse regularizer: it must recover the executed action from consecutive latent states, forcing the encoder to preserve controllable transition information.

Method Contract

SMWM trains on transition tuples $(o_t, a_t, o_{t+1})$, where the observations $o_t$ and $o_{t+1}$ are images and $a_t$ is a continuous action or control input. The encoder $f_\theta$ maps observations to latent states, the forward model $g_\phi$ predicts the next latent state under the action, and the inverse model $h_\psi$ predicts the intervening action from the latent transition.

flowchart LR
  Ot["observation o_t"] --> Enc["encoder f_theta"]
  Otp1["next observation o_t+1"] --> Enc
  Enc --> Zt["latent z_t"]
  Enc --> Ztp1["latent z_t+1"]
  Zt --> Fwd["forward dynamics g_phi"]
  At["action / control input a_t"] --> Fwd
  Fwd --> Zhat["predicted next latent"]
  Zt --> Inv["inverse dynamics h_psi"]
  Ztp1 --> Inv
  Inv --> Ahat["recovered action"]
  Zhat --> Lfwd["forward loss"]
  Ztp1 --> Lfwd
  Ahat --> Linv["inverse loss"]
  At --> Linv

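The objective sums the forward latent-prediction loss and a weighted inverse-dynamics loss, with $z_t = f_\theta(o_t)$ and $z_{t+1} = f_\theta(o_{t+1})$:

$$\mathcal{L}(\theta, \phi, \psi) = \mathcal{L}_{\mathrm{fwd}}\big(g_\phi(z_t, a_t),\, z_{t+1}\big) + \lambda\, \mathcal{L}_{\mathrm{inv}}\big(h_\psi(z_t, z_{t+1}),\, a_t\big)$$

Here $\lambda$ is the inverse-dynamics weight. This form is reconstructed from the flowchart above; the exact distance functions are those defined in the paper.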

The key difference from LeWorldModel is the anti-collapse prior: LeWorldModel uses SIGReg to push embeddings toward an isotropic Gaussian; SMWM uses the action-recovery task itself and therefore does not force the full embedding space to be filled.
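To make the anti-collapse role of the inverse head concrete, the following is a minimal sketch of the two-term loss on a single transition. It is not the authors' implementation: the linear encoder, forward model, and inverse head, the mean-squared-error distances, the weight name `lam`, and the toy 2-D "observations" are all assumptions chosen for illustration.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def smwm_loss(enc_W, fwd_W, inv_W, o_t, a_t, o_next, lam=1.0):
    """Forward latent loss plus lam * inverse-dynamics loss for one transition.

    All three modules are linear maps here (an assumption for illustration).
    Returns (total, forward_loss, inverse_loss).
    """
    z_t = matvec(enc_W, o_t)             # z_t = f_theta(o_t)
    z_next = matvec(enc_W, o_next)       # z_{t+1} = f_theta(o_{t+1})
    z_hat = matvec(fwd_W, z_t + a_t)     # g_phi applied to the concatenation [z_t, a_t]
    a_hat = matvec(inv_W, z_t + z_next)  # h_psi applied to [z_t, z_{t+1}]
    l_fwd = mse(z_hat, z_next)
    l_inv = mse(a_hat, a_t)
    return l_fwd + lam * l_inv, l_fwd, l_inv


# Toy transition: the "observation" is a 2-D position and o_next = o_t + a_t.
o_t, a_t, o_next = [0.2, -0.1], [0.05, 0.1], [0.25, 0.0]

# Fully collapsed model: every map outputs zeros.
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
zeros_2x4 = [[0.0] * 4, [0.0] * 4]
collapsed = smwm_loss(zeros_2x2, zeros_2x4, zeros_2x4, o_t, a_t, o_next)

# Hand-built model that keeps the controllable state:
# identity encoder, forward z + a, inverse z_next - z_t.
ideal = smwm_loss(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    [[-1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, 1.0]],
    o_t, a_t, o_next,
)
```

In this toy setting the collapsed model reaches zero forward loss but still pays the inverse loss, because a constant latent carries no information about the action; the hand-built model drives both terms to (numerically) zero. That is the sense in which action recovery penalizes collapse.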

Evidence And Results

  • Dot-world latent geometry: in a one-dot setup, the learned 64-dimensional embedding concentrates almost all variance into two principal components matching the true controllable state; without the inverse loss, the forward-only model collapses.
  • Controllable degrees of freedom: across independent, coupled, distractor, and combined dot-world variants, the significant latent dimension tracks the controllable action/state degrees of freedom while stochastic action-irrelevant distractors are filtered out.
  • Control-dependent perception: in a triangular-sprite experiment, the same visual object is represented differently depending on the available action interface: translation-only control preserves position but averages uncontrolled orientation; full pose control preserves orientation too.
  • Planning: using frozen latent states and a Cross-Entropy Method / MPC planner, SMWM matches SIGReg on TwoRoom, Reacher, and Push-T and outperforms SIGReg on OGBench-Cube in the reported setup, with 84% success versus 59% on the 3D cube task.
  • Physical state probes: both SMWM and SIGReg recover physical quantities well under nonlinear probes; SMWM has more compact and interpretable latent geometry, while SIGReg spreads state information across more dimensions.
  • Artifact surface: the official GitHub repository contains a toy dot-world implementation and a planning subproject that reproduces paper results with Linux/CUDA requirements.
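One way to put a number on "how many latent dimensions carry variance" is the participation ratio of the latent covariance spectrum, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$. The sketch below is an assumption about how such a diagnostic could be computed; the paper's own criterion for the "significant latent dimension" may differ. The data is synthetic and stands in for encoder outputs, and the function name `participation_ratio` is hypothetical.

```python
import random


def participation_ratio(latents):
    """Effective dimension (sum lambda)^2 / sum lambda^2 of the latent covariance.

    Computed as tr(C)^2 / tr(C C), so no eigendecomposition is needed.
    """
    n, d = len(latents), len(latents[0])
    mean = [sum(z[j] for z in latents) / n for j in range(d)]
    centered = [[z[j] - mean[j] for j in range(d)] for z in latents]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, tr(C C) equals the sum of squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


rng = random.Random(0)

# Synthetic "compact" latents: 8 dimensions, only the first two carry real variance.
compact = [[rng.gauss(0, 1), rng.gauss(0, 1)] + [rng.gauss(0, 1e-3) for _ in range(6)]
           for _ in range(500)]

# Synthetic isotropic latents: variance spread evenly over all 8 dimensions.
isotropic = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]

pr_compact = participation_ratio(compact)      # close to 2
pr_isotropic = participation_ratio(isotropic)  # close to 8
```

On real embeddings the same function could be applied to a batch of encoder outputs; a value near the number of controllable degrees of freedom would be consistent with the dot-world findings above, while a value near the embedding width would indicate an isotropically filled space.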

X Thread Notes

Randall Balestriero’s co-author thread frames the method as a deep dive on inverse dynamics modeling as anti-collapse regularization for JEPAs. The thread adds a useful interpretation that is not a substitute for the paper: inverse dynamics regularization is weaker than SIGReg because it only captures what is affected by the agent’s actions; with rich actions, that may be enough, while weak action interfaces may need inverse dynamics plus SIGReg to preserve additional factors for a broader foundation model. The thread also treats partial collapse as beneficial when specializing to one agent because it simplifies the planner’s optimization landscape. This is landscape commentary; the paper artifacts remain the source of truth for bibliographic fields and experimental claims.

Limitations And Gotchas

  • The method assumes actions are recoverable from consecutive observations. Distinct actions with the same visible effect can break the inverse signal.
  • The encoder maps individual frames to latent states, so variables not identifiable from one frame, such as velocity, can be missed unless short histories are added.
  • Action-aligned compression is a feature for a specialized controller but a caveat for foundation models: downstream tasks that need action-irrelevant variables may lose information.
  • Biased behavior policies can make uncontrollable distractors action-correlated, causing the encoder to preserve the wrong factors.
  • Planning evidence is limited to moderate-scale simulated visual control tasks and inherits ordinary offline-world-model limits: data coverage, compounding model error, and simulator/planner mismatch.
  • The inverse-dynamics weight needs environment-specific tuning in the reported experiments.
  • The evidence is physical/visual control evidence, not direct evidence for numeric multivariate time series, graph time series, event streams, or digital-world operator interventions.
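The first limitation, action aliasing, can be illustrated with a toy example. The 1-D clipped dynamics below are hypothetical and not taken from the paper; the sketch only shows how distinct actions can produce the same observed transition, which leaves an inverse head with no way to tell them apart.

```python
from collections import defaultdict


def step(x, a, lo=0.0, hi=1.0):
    """Hypothetical 1-D dynamics: move by a, then clip at the walls."""
    return min(hi, max(lo, x + a))


def aliased_transitions(states, actions):
    """Return {(x, x_next): actions} for transitions explained by more than one action."""
    by_transition = defaultdict(set)
    for x in states:
        for a in actions:
            by_transition[(x, step(x, a))].add(a)
    return {k: sorted(v) for k, v in by_transition.items() if len(v) > 1}


# In free space every action has a distinct effect; at the wall (x = 1.0)
# both positive actions are clipped to the same next state.
aliases = aliased_transitions(states=[0.5, 1.0], actions=[-0.2, 0.1, 0.3])
print(aliases)  # {(1.0, 1.0): [0.1, 0.3]}
```

For an aliased transition, a squared-error inverse head can at best predict the mean of the indistinguishable actions (0.2 here), so some inverse loss is irreducible and carries no useful gradient about the state.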

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Latent-state prediction | partially closes outside numeric time series | Learns compact latent states whose effective dimension and geometry track controllable state variables under pixel/action transitions. | Needs multivariate numeric time-series, irregular event-stream, high-channel, and streaming-state evidence. |
| Control and counterfactuals | partially closes outside numeric time series | Uses explicit continuous actions and latent rollouts inside an MPC/CEM candidate-action planner. | Needs typed digital interventions, counterfactual validation, calibrated uncertainty, and operational telemetry benchmarks. |
| Anti-collapse regularization | partially closes | Shows a single inverse dynamics head can prevent full collapse in JEPA-style end-to-end visual world models without EMA, frozen encoders, or SIGReg. | Needs weak-action, non-Gaussian, long-tailed, rare-regime, and time-series tests; compare inverse dynamics, SIGReg, reconstruction, and hybrid regularizers under matched compute. |
| Representation quality | warning | The method intentionally discards variables not needed to recover or predict the action-induced transition. | Foundation models may need broader state than one agent’s controllable subspace; need task-conditioned preservation probes and multi-objective readouts. |
| Benchmark hygiene | warning | Separates latent geometry, physical-state probes, planning success, horizon sweeps, and regularizer ablations. | Needs independent reproduction, broader tasks, uncertainty/simulator-exploitation audits, and closed-loop transfer evidence. |
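For readers unfamiliar with the CEM planner named in the control row, the following is a minimal sketch of Cross-Entropy Method planning over latent rollouts. It is a generic CEM under stated assumptions, not the paper's planner: the additive dynamics `toy_forward` stand in for a learned forward model, the cost is squared distance between the final latent and a goal latent, and every hyperparameter is illustrative.

```python
import random


def rollout_cost(z0, actions, goal, forward):
    """Roll the latent forward under an action sequence; cost is the final squared distance to goal."""
    z = z0
    for a in actions:
        z = forward(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))


def cem_plan(z0, goal, forward, horizon=5, act_dim=2,
             pop=128, elites=16, iters=15, seed=0):
    """Return the mean action sequence found by a diagonal-Gaussian CEM."""
    rng = random.Random(seed)
    mean = [[0.0] * act_dim for _ in range(horizon)]
    std = [[1.0] * act_dim for _ in range(horizon)]
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [[[rng.gauss(mean[t][k], std[t][k]) for k in range(act_dim)]
                    for t in range(horizon)] for _ in range(pop)]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        samples.sort(key=lambda seq: rollout_cost(z0, seq, goal, forward))
        top = samples[:elites]
        for t in range(horizon):
            for k in range(act_dim):
                vals = [seq[t][k] for seq in top]
                m = sum(vals) / elites
                mean[t][k] = m
                std[t][k] = (sum((v - m) ** 2 for v in vals) / elites) ** 0.5 + 1e-6
    return mean


def toy_forward(z, a):
    """Hypothetical latent dynamics (stand-in for a learned forward model): z' = z + a."""
    return [zi + ai for zi, ai in zip(z, a)]


z0, goal = [0.0, 0.0], [1.0, -0.5]
plan = cem_plan(z0, goal, toy_forward)
start_cost = rollout_cost(z0, [], goal, toy_forward)   # cost of doing nothing
final_cost = rollout_cost(z0, plan, goal, toy_forward)  # cost under the planned sequence
```

In an MPC loop only the first action of `plan` would be executed before re-encoding the new observation and replanning. With a frozen encoder, `z0` and `goal` would come from encoding the current and goal images.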

Open Questions

  • When should an action-conditioned world model prefer inverse dynamics regularization, SIGReg, reconstruction grounding, or a hybrid objective?
  • Can action-conditioned time-series models use inverse dynamics or intervention-recovery losses to preserve controllable state without erasing rare but action-irrelevant safety variables?
  • How rich must the action or control-input interface be before inverse dynamics alone is a sufficient anti-collapse signal?
  • Can multi-step inverse objectives recover hidden velocities, delayed effects, or intervention consequences that single-frame, single-step inverse dynamics misses?
  • How should planners detect when a compact action-sufficient latent space has discarded variables needed for a new downstream task?