Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Source

Core Claim

Open X-Embodiment consolidates many robot-learning datasets into a standardized multi-embodiment repository and shows that RT-X policies can transfer skills across robot platforms.

Sensor-Time-Series Notes

  • The dataset is a large collection of real robot trajectories rather than a passive forecasting benchmark.
  • The relevant time-series unit is a trajectory with image observations, language instructions, and control inputs.
  • The repository uses RLDS to accommodate different action spaces and sensor modalities across robots.
  • The RT-X experiments coarsely align observations and actions by selecting a canonical camera view, resizing images, and mapping controls into a 7-DoF end-effector action representation before discretization.
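The coarse alignment described in the last note can be sketched as follows. This is a minimal illustration, not the RT-X implementation: the helper names (`resize_nearest`, `discretize_action`, `align_step`), the 256-bin count, the 64x64 target size, and the per-dimension `low`/`high` bounds are all assumptions made for the example. The notes only state that a canonical view is selected, images are resized, and 7-DoF end-effector actions are discretized.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count; the notes only say actions are discretized
ACTION_DIM = 7  # 7-DoF end-effector action, as described in the notes


def resize_nearest(image, height, width):
    """Nearest-neighbour resize of an (H, W, C) image using only NumPy."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows[:, None], cols]


def discretize_action(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to integer bins in [0, num_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # each dimension now lies in [0, 1]
    # The upper bound maps to num_bins, so cap it at the last valid bin.
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def align_step(cameras, canonical_view, action, low, high, size=(64, 64)):
    """Pick one camera view, resize it, and discretize the action (hypothetical helper)."""
    image = resize_nearest(cameras[canonical_view], *size)
    return {"image": image, "action_bins": discretize_action(action, low, high)}


# Usage with made-up data: one 480x640 RGB camera and a 7-DoF action.
low, high = -np.ones(ACTION_DIM), np.ones(ACTION_DIM)
step = align_step(
    cameras={"front": np.zeros((480, 640, 3), dtype=np.uint8)},
    canonical_view="front",
    action=[-1.0, 0.0, 1.0, 2.0, -2.0, 0.5, -0.5],
    low=low,
    high=high,
)
```

Out-of-range values are clipped before binning, so every robot's controls land in the same integer vocabulary. That shared vocabulary is what makes the alignment "coarse": differences in action scale or control frequency between embodiments are not modeled here.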

Model Notes

RT-1-X and RT-2-X represent two common robotics foundation-model interfaces. RT-1-X treats recent image history plus language as inputs to a Transformer policy that emits discretized actions. RT-2-X maps robot actions into language-token-like outputs so a vision-language model can be co-fine-tuned for control.
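The RT-2-X style interface, where actions are expressed as token-like text, can be illustrated with a small round trip. This is a sketch under stated assumptions: the space-separated integer format, the 256-bin vocabulary, and the function names (`actions_to_text`, `text_to_actions`, `bins_to_continuous`) are hypothetical and chosen for clarity. The notes do not specify the actual token mapping.

```python
def actions_to_text(bins):
    """Render integer action bins as a space-separated string of tokens."""
    return " ".join(str(int(b)) for b in bins)


def text_to_actions(text, action_dim=7, num_bins=256):
    """Parse model output text back into integer bins, validating count and range."""
    bins = [int(token) for token in text.split()]
    if len(bins) != action_dim:
        raise ValueError(f"expected {action_dim} action tokens, got {len(bins)}")
    if any(not 0 <= b < num_bins for b in bins):
        raise ValueError("action token outside the assumed bin range")
    return bins


def bins_to_continuous(bins, low, high, num_bins=256):
    """Map each bin index to the centre of its interval in [low, high]."""
    return [lo + (b + 0.5) / num_bins * (hi - lo) for b, lo, hi in zip(bins, low, high)]


# Usage: encode a discretized 7-DoF action as text and decode it again.
encoded = actions_to_text([0, 128, 255, 255, 0, 192, 64])
decoded = text_to_actions(encoded)
```

Validating the token count and range on decode matters in this setting because a language model can emit malformed strings. A controller built on such an interface would need a defined fallback for that case.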

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Causal structure, counterfactuals, and control | adjacent | Provides large cross-embodiment robot trajectories with image observations, language instructions, and action outputs, which is an analogy for the digital-world robot north star. | Physical robot policy data does not model digital telemetry or future observations under actions. |
| Context interface | adjacent | Uses language instructions plus recent visual observations to condition action generation across embodiments. | No channel metadata, topology, or numeric system-context contract. |
| Benchmarks: control utility | adjacent | RT-X experiments evaluate policy success and transfer across robots. | Does not test causal simulation, counterfactual rollouts, or TSFM latent-state quality. |

Open Questions

  • Which parts of the RT-X alignment recipe are necessary for cross-embodiment transfer, and which are artifacts of the available datasets?
  • How should multi-view observations, proprioception, force, tactile, and control-frequency metadata be standardized without erasing embodiment-specific dynamics?