RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Source

Raw Markdown: paper_rdt-1b-2024.md
PDF: paper_rdt-1b-2024.pdf
Preprint: arXiv 2410.07864
Project page: rdt-robotics.github.io
Official code: github.com/thu-ml/RoboticsDiffusionTransformer

Core Claim

RDT-1B scales diffusion-action modeling to a 1.2B-parameter bimanual manipulation policy. It uses a Diffusion Transformer to denoise continuous action chunks conditioned on language, visual observations, proprioception, and control-frequency metadata.
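
To make "denoise continuous action chunks" concrete, here is a minimal sketch of DDPM-style ancestral sampling over a small action chunk. It is not the RDT implementation. The names `linear_beta_schedule`, `toy_denoiser`, and `sample_action_chunk`, the schedule values, and the chunk size are all assumptions chosen for illustration. The learned Diffusion Transformer is replaced by a toy oracle that only knows one clean chunk ("hold the current proprioceptive pose"), so that the loop runs without any trained weights.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return per-step betas and the cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(noisy_chunk, t, condition, alpha_bars):
    """Hypothetical stand-in for the learned noise-prediction network.

    It is an oracle for a data distribution that puts all mass on one clean
    chunk: every step repeats the current proprioceptive pose. A real model
    would instead predict noise from language, image, and state features.
    """
    a_bar = alpha_bars[t]
    clean = condition["proprio"]
    return [[(x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
             for x, c in zip(row, clean)]
            for row in noisy_chunk]


def sample_action_chunk(condition, chunk_len, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise the whole chunk step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = linear_beta_schedule(num_steps)
    dim = len(condition["proprio"])
    chunk = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(chunk_len)]
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(chunk, t, condition, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        chunk = [[scale * (x - coef * e) + sigma * rng.gauss(0.0, 1.0)
                  for x, e in zip(row, eps_row)]
                 for row, eps_row in zip(chunk, eps)]
    return chunk


# "control_hz" is carried only to mirror the conditioning inputs named above;
# the toy oracle ignores it.
condition = {"proprio": [0.1, -0.4, 0.25], "control_hz": 25}
chunk = sample_action_chunk(condition, chunk_len=4)
```

Because the oracle's noise estimate is exact for its single-target distribution, the final step lands on the target pose for every row of `chunk`. With a trained denoiser the same loop would instead draw one sample from a learned, possibly multimodal, distribution over chunks.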

Method Notes

RDT explicitly targets multimodal continuous bimanual action distributions, where deterministic regression can average incompatible action modes.
The model treats proprioception, noisy action chunks, and control frequency as low-dimensional physical quantities, while images and text condition the denoising process.
Its unified physical action space is an interface decision for cross-robot training, not a claim that all embodiments have identical dynamics.
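
The mode-averaging concern in the first note can be illustrated with a toy one-dimensional example. This is a sketch under invented assumptions, not an experiment from the source: the "action" is a scalar that passes an obstacle on the left (-1) or the right (+1), and `make_demonstrations` and `distance_to_nearest_mode` are hypothetical helpers. Resampling from the demonstrations stands in for a trained generative sampler.

```python
import random
import statistics


def make_demonstrations(n, seed=0):
    """Alternate between two incompatible modes, with small Gaussian noise."""
    rng = random.Random(seed)
    demos = []
    for i in range(n):
        mode = -1.0 if i % 2 == 0 else 1.0
        demos.append(mode + rng.gauss(0.0, 0.05))
    return demos


def distance_to_nearest_mode(action):
    """How far an action is from the closer of the two demonstrated modes."""
    return min(abs(action - 1.0), abs(action + 1.0))


demos = make_demonstrations(1000)

# The single prediction that minimizes mean squared error is the sample mean,
# which falls between the modes, where no demonstration ever went.
regressed_action = statistics.fmean(demos)

# A generative policy draws from the distribution instead of summarizing it.
# Picking a demonstration at random is a stand-in for a diffusion sampler.
sampled_action = random.Random(1).choice(demos)

print("regressed:", distance_to_nearest_mode(regressed_action))
print("sampled:  ", distance_to_nearest_mode(sampled_action))
```

The regressed action sits roughly one unit from either mode, while the sampled action stays close to one of them. The same failure applies per dimension to high-dimensional bimanual chunks, which is the motivation the note gives for a diffusion objective.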

Evidence And Limitations

The source reports pretraining on a large multi-robot collection and fine-tuning on more than 6K bimanual trajectories, with real-robot improvements over ACT, OpenVLA, and Octo baselines. The scope remains bimanual manipulation; it is not a general future-observation world model or a full whole-body humanoid controller.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Multi-modal future distributions	partially closes	Uses diffusion to model multimodal continuous bimanual action chunks instead of deterministic action regression.	Future observations and state distributions are not rolled out.
Causal structure, counterfactuals, and control	adjacent	Produces language-conditioned bimanual control inputs from vision, proprioception, action chunks, and control frequency, which is an analogy for digital-world robot actuation.	It remains a physical robot policy, not a general action-conditioned time-series world model.
Context interface	adjacent	Treats language, images, proprioception, and control frequency as conditioning inputs.	No general system/channel context or intervention schema for numeric TSFMs.
Native numeric/action encoding	adjacent	Introduces a physically interpretable unified action space for heterogeneous robot quantities.	The action space is robot-specific and not a general numeric-token interface.
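
The unified action space in the last row can be pictured as a fixed-length vector in which every index has one physical meaning, plus a mask marking which slots a given robot actually fills. The sketch below is an assumption-laden illustration: the slot names, the 18-slot layout, and the helpers `to_unified` and `from_unified` are invented here and are not the layout used by the paper.

```python
# Hypothetical slot layout: each index has one fixed physical meaning.
SLOT_NAMES = (
    [f"right_arm_joint_{i}_pos" for i in range(7)]
    + ["right_gripper_open"]
    + [f"left_arm_joint_{i}_pos" for i in range(7)]
    + ["left_gripper_open"]
    + ["base_linear_vel", "base_angular_vel"]
)
SLOT_INDEX = {name: i for i, name in enumerate(SLOT_NAMES)}


def to_unified(robot_action):
    """Scatter a robot's named quantities into the shared vector.

    Returns (vector, mask). mask[i] is 1 where the robot supplied a value and
    0 where the slot is padding, so a training loss can skip unused slots.
    """
    vector = [0.0] * len(SLOT_NAMES)
    mask = [0] * len(SLOT_NAMES)
    for name, value in robot_action.items():
        index = SLOT_INDEX[name]  # a KeyError flags a quantity with no slot
        vector[index] = float(value)
        mask[index] = 1
    return vector, mask


def from_unified(vector, names):
    """Gather back only the quantities a given robot actually has."""
    return {name: vector[SLOT_INDEX[name]] for name in names}


# A hypothetical single-arm robot with six joints and one gripper.
single_arm_action = {f"right_arm_joint_{i}_pos": 0.1 * i for i in range(6)}
single_arm_action["right_gripper_open"] = 1.0

vector, mask = to_unified(single_arm_action)
recovered = from_unified(vector, single_arm_action.keys())
```

The design choice this highlights is that sharing an index means sharing a physical meaning, not sharing dynamics: two robots can both write to `right_gripper_open` while responding very differently to the same command, which is why the row above marks the interface as robot-specific.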

Links Into The Wiki

Open Questions

How reusable is the physically interpretable unified action space beyond gripper-arm embodiments?
Do diffusion action chunks scale better than action tokens as bimanual action dimensionality rises?

