RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Source
- Raw Markdown: paper_rdt-1b-2024.md
- PDF: paper_rdt-1b-2024.pdf
- Preprint: arXiv 2410.07864
- Project page: rdt-robotics.github.io
- Official code: github.com/thu-ml/RoboticsDiffusionTransformer
Core Claim
RDT-1B scales diffusion-action modeling to a 1.2B-parameter bimanual manipulation policy. It uses a Diffusion Transformer to denoise continuous action chunks conditioned on language, visual observations, proprioception, and control-frequency metadata.
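As a minimal sketch of what denoising an action chunk involves, the snippet below implements the standard forward-noising step and a reconstruction loss in NumPy. It assumes a sample-prediction parameterization, in which the network reconstructs the clean chunk. The chunk shape, the `condition` dictionary, and the `identity_denoiser` stand-in are hypothetical and are not taken from the official code.

```python
import numpy as np

def noise_action_chunk(clean_chunk, alpha_bar, rng):
    """Forward diffusion: blend a clean action chunk with Gaussian noise."""
    eps = rng.standard_normal(clean_chunk.shape)
    return np.sqrt(alpha_bar) * clean_chunk + np.sqrt(1.0 - alpha_bar) * eps

def denoising_loss(denoiser, clean_chunk, condition, alpha_bar, rng):
    """Mean squared error between the clean chunk and its reconstruction."""
    noisy = noise_action_chunk(clean_chunk, alpha_bar, rng)
    predicted = denoiser(noisy, condition, alpha_bar)
    return float(np.mean((predicted - clean_chunk) ** 2))

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 future steps, 14 action dimensions.
chunk = rng.standard_normal((64, 14))

# Placeholder conditioning; a real policy would pass encoded images, text,
# proprioception, and control frequency here.
condition = {"instruction": "pour water", "control_hz": 25}

def identity_denoiser(noisy, cond, alpha_bar):
    """Stand-in for the learned network: returns the noisy input unchanged."""
    return noisy

# alpha_bar near 1 means little noise; alpha_bar near 0 means mostly noise.
loss_light_noise = denoising_loss(identity_denoiser, chunk, condition, 0.99, rng)
loss_heavy_noise = denoising_loss(identity_denoiser, chunk, condition, 0.10, rng)
```

With the untrained stand-in, the loss grows as `alpha_bar` shrinks, which is the gap a trained denoiser is meant to close by using the conditioning inputs.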
Method Notes
- RDT explicitly targets multimodal continuous bimanual action distributions, where deterministic regression can average incompatible action modes.
- The model treats proprioception, noisy action chunks, and control frequency as low-dimensional physical quantities, while images and text condition the denoising process.
- Its unified physical action space is an interface decision for cross-robot training, not a claim that all embodiments have identical dynamics.
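The mode-averaging problem mentioned in the first note can be illustrated with a toy one-dimensional example. This is an illustrative sketch, not an experiment from the source: the two modes (-1 and +1), the noise scale, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally valid demonstrations for the same observation:
# pass an obstacle on the left (-1) or on the right (+1).
modes = rng.choice([-1.0, 1.0], size=10_000)
targets = modes + 0.05 * rng.standard_normal(10_000)

# For a fixed observation, the prediction that minimizes mean squared
# error is the mean of the targets.
mse_optimal_action = targets.mean()

# Distance from the regression output to the closest demonstrated mode.
gap_to_nearest_mode = min(
    abs(mse_optimal_action - 1.0),
    abs(mse_optimal_action + 1.0),
)
```

The regression output lands near zero, roughly one unit away from either demonstrated behavior, whereas a generative policy that samples from the target distribution would return a value near one of the two modes.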
Evidence And Limitations
The source reports pretraining on a multi-robot collection of 46 datasets with more than 1M trajectories and fine-tuning on more than 6K bimanual trajectories, with real-robot improvements over ACT, OpenVLA, and Octo baselines. The scope remains bimanual manipulation; it is not a general future-observation world model or a full whole-body humanoid controller.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Multi-modal future distributions | partially closes | Uses diffusion to model multimodal continuous bimanual action chunks instead of deterministic action regression. | Future observations and state distributions are not rolled out. |
| Causal structure, counterfactuals, and control | adjacent | Produces language-conditioned bimanual control inputs from vision, proprioception, noisy action chunks, and control frequency; this is a robotics analogue of issuing control actions in a digital system. | It remains a physical robot policy, not a general action-conditioned time-series world model. |
| Context interface | adjacent | Treats language, images, proprioception, and control frequency as conditioning inputs. | No general system/channel context or intervention schema for numeric TSFMs. |
| Native numeric/action encoding | adjacent | Introduces a physically interpretable unified action space for heterogeneous robot quantities. | The action space is robot-specific and not a general numeric-token interface. |
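The unified action space in the last row can be pictured as a fixed-length vector with named slots, where each robot fills only the slots it has. The sketch below is hypothetical: the slot names, the 16-dimensional layout, and the `to_unified` helper are assumptions for illustration and do not reproduce the layout used by the official code.

```python
import numpy as np

# Hypothetical slot layout; the real layout and dimensionality may differ.
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),
    "right_gripper_open": slice(7, 8),
    "left_arm_joint_pos": slice(8, 15),
    "left_gripper_open": slice(15, 16),
}
UNIFIED_DIM = 16

def to_unified(robot_action):
    """Place robot-specific quantities into fixed slots; pad the rest with zeros.

    Returns the padded vector and a boolean mask marking which slots are filled.
    """
    vec = np.zeros(UNIFIED_DIM)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in robot_action.items():
        slot = SLOTS[name]
        values = np.asarray(values, dtype=float)
        expected = slot.stop - slot.start
        if values.shape != (expected,):
            raise ValueError(f"{name} expects {expected} values, got {values.shape}")
        vec[slot] = values
        mask[slot] = True
    return vec, mask

# A single-arm robot fills only the right-arm slots.
single_arm = {
    "right_arm_joint_pos": np.linspace(0.0, 0.6, 7),
    "right_gripper_open": [1.0],
}
vec, mask = to_unified(single_arm)
```

Because every slot keeps one physical meaning across robots, single-arm and bimanual data can share a policy head; the mask records which entries are padding so they can be excluded from a loss.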
Links Into The Wiki
- Foundation Time-Series Model Research Agenda
- RDT-1B
- Robotics Time-Series Modeling
- Robotics Text Conditioning
Open Questions
- How reusable is the physically interpretable unified action space beyond gripper-arm embodiments?
- Do diffusion action chunks scale better than action tokens as bimanual action dimensionality rises?