---
abstract: |
  Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating the learning of transferable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to $1.2$B parameters, making it the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over $6$K episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1$\sim$5 demonstrations, and effectively handles complex, dexterous tasks. We refer the reader to the [project page](https://rdt-robotics.github.io/rdt-robotics/) for the code and videos.
author:
- |
  Songming Liu[^1],  Lingxuan Wu$^{*}$,  Bangguo Li,  Hengkai Tan,  Huayu Chen,\
  **Zhengyi Wang**,  **Ke Xu**,  **Hang Su**$^{\dag}$,   **Jun Zhu**$^\dag$\
  $^{1}$Department of Computer Science & Technology, Institute for AI, BNRist Center,\
  Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University\
bibliography:
- iclr2024\_conference.bib
title: |
  RDT-1B: a Diffusion Foundation Model for\
  Bimanual Manipulation
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\hangx}[1]{{\color{blue}{[Hang: #1]}}}
```
```{=latex}
\newcommand{\junz}[1]{{\color{red}{[jz: #1]}}}
```
```{=latex}
\newcommand{\lsm}[1]{{\color{blue}{lsm: #1}}}
```
```{=latex}
\newcommand{\huayu}[1]{{\color{blue}{huayu: #1}}}
```
```{=latex}
\newcommand{\wlx}[1]{{\color{blue}{wlx: #1}}}
```
```{=latex}
\newcommand{\lsmtodo}[1]{{\color{blue}{todos: [ #1 ]}}}
```
```{=latex}
\newcommand{\wzy}[1]{{\color{blue}{wzy: #1}}}
```
```{=latex}
\newcommand{\thk}[1]{{\color{blue}{[thk: #1]}}}
```
```{=latex}
\newcommand{\bangguo}[1]{{\color{blue}{lbg: #1}}}
```
```{=latex}
\newcommand{\vth}{\bm{\theta}}
```
```{=latex}
\newcommand{\veps}{\bm{\epsilon}}
```
```{=latex}
\newcommand{\modify}[1]{{{#1}}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\maketitle
```
Introduction
============

Bimanual manipulation is essential for robots to accomplish real-world tasks [@edsinger2007two]. For practical applications, a useful manipulation policy should be able to generalize to unseen scenarios, such as unseen objects and scenes. However, current approaches either depend on task-specific primitives [@mirrazavi2017dynamical; @rakita2019shared; @grannen2023learning] or are limited to small-scale models, small datasets, and simple tasks [@krebs2021kit; @franzese2023interactive; @grannen2023stabilize; @zhao2023learning; @grotz2024peract2; @liu2024voxact], thereby exhibiting only narrow generalization and failing in complex tasks. Following the success in natural language processing [@achiam2023gpt; @touvron2023llama] and computer vision [@radford2021learning; @kirillov2023segment], one promising direction to enable generalizable behaviors is to develop a foundation model through imitation learning on large-scale datasets.

Developing bimanual manipulation foundation models confronts the dual challenges of data scarcity and architectural limitations. The prohibitive costs of dual-arm systems create severe data scarcity [@sharma2018multiple; @padalkar2023open], fundamentally conflicting with the data-hungry nature of foundation models. Inspired by recent attempts in unimanual manipulation [@brohan2023rt; @kim2024openvla], we mitigate this through cross-robot pre-training: leveraging multi-robot datasets for pre-training followed by target-robot fine-tuning, amplifying data volume by three orders of magnitude to extract transferable physical priors. However, two interconnected technical barriers emerge. First, the doubled action space induces multi-modal action distributions [@li2006optimal; @jia2024towards] (see Fig. `\ref{fig:toy example b}`{=latex} for an illustrative example) that demand *expressiveness* beyond that of current methods [@zhao2023learning; @brohan2023rt; @kim2024openvla], while simultaneously requiring *scalability* for stable large-scale training on multimodal data (text, vision, actions). Second, beyond architectural constraints, physical and action-space variations across robots introduce data heterogeneity that risks negative transfer [@pan2009survey]. Existing solutions either discard robots with structural inconsistencies or retain only cross-robot invariant features [@brohan2023rt; @team2023octo; @shah2023gnm], sacrificing valuable data diversity essential for generalization.

```{=latex}
\vspace{-3ex}
```
```{=latex}
\centering
```
![**Overview of the Robotics Diffusion Transformer with 1B parameters (RDT-1B)**, a language-conditioned visuomotor policy for bimanual manipulation, with state-of-the-art generalizability to unseen scenarios (see App. `\ref{app:exp}`{=latex} for metric calculation details).](head.png "fig:"){width="\\textwidth"} `\vspace{-3ex}`{=latex}

```{=latex}
\vspace{-3ex}
```
`\label{fig:head-demo}`{=latex}

In this paper, we introduce the *Robotics Diffusion Transformer (RDT)*, the largest bimanual manipulation foundation model with strong generalizability. RDT employs a diffusion transformer (DiT) as `\modify{its scalable backbone~\citep{peebles2023scalable}}`{=latex}, with special designs for language-conditioned bimanual manipulation. For expressiveness, RDT captures the full range of modes in bimanual actions from massive data by exploiting the capacity of diffusion models to represent complex distributions [@sohn2015learning; @ho2020denoising]. For scalability, we harness the Transformer backbone and carefully design the multi-modal encoding to eliminate the heterogeneity of various modalities. `\modify{Moreover, robotic data differs significantly from images and videos, which exhibit temporal and spatial continuity~\citep{chen2019capturing,liang2022self}.}`{=latex} To characterize its inherent nonlinear dynamics [@de2012theory], high-frequency changes [@team2023octo], and unstable numerical range, we make important modifications to the original DiT structure, including MLP decoding, improved normalization, and alternating injection of conditions (see Fig. `\ref{fig:necessity}`{=latex} for their importance). To further enable training RDT on heterogeneous data, we propose the *Physically Interpretable Unified Action Space*, a unified action format for various robots with gripper arms. This format mitigates potential conflicts between different robots while retaining the physical meanings of the original actions, which encourages the model to learn generalizable physical knowledge across diverse robotic datasets.

With the above designs, we managed to pre-train the RDT model on the largest collection of multi-robot datasets to date [@padalkar2023open; @walke2023bridgedata; @fang2023rh20t; @RoboHive] and scale it up to $1.2$B parameters, making it the largest diffusion-based pre-trained model for robotic manipulation. To further enhance its bimanual manipulation capabilities, we fine-tuned RDT on a self-collected multi-task bimanual dataset comprising over $6$K trajectories, one of the most extensive bimanual datasets. In our experiments, we comprehensively evaluate RDT against strong baselines in both bimanual manipulation and robotic foundation models. Results show that RDT achieves state-of-the-art performance, improving success rates over baselines by $56\%$ across a wide spectrum of challenging tasks. In particular, RDT has exceptional zero-shot and few-shot ($1\sim 5$ shots) generalizability to unseen objects, scenes, instructions, and even skills. RDT is also capable of accomplishing tasks requiring fine-grained operations, such as controlling a robot dog with a joystick. Finally, ablation studies show that diffusion modeling, large model size, and large data size all contribute to superior performance.

Related Work {#sec:related}
============

**Learning-based Bimanual Manipulation.** One substantial challenge in learning a bimanual manipulation policy is the high dimensionality of the action space, which exacerbates the data scarcity [@zollner2004programming; @smith2012dual; @lioutikov2016learning; @stepputtis2022system] and the multi-modal behavior [@colome2018dimensionality; @colome2020reinforcement; @figueroa2017learning; @sharma2018multiple; @xie2020deep; @franzese2023interactive]. Some works have developed more cost-effective interfaces for data collection [@zhao2023learning; @aldaco2024aloha], but they are limited to specific hardware configurations and are still insufficient to bridge the data gap for a generalizable policy. Others attempt to reduce data requirements by introducing inductive biases, such as distinguishing two arms for stabilization and functionality [@grannen2023stabilize], parameterizing movement primitives [@batinica2017compliant; @amadio2019exploiting; @chitnis2020efficient; @franzese2023interactive], or using voxel representations [@grotz2024peract2; @liu2024voxact]. These methods use strong priors or simplified modeling, which successfully reduce the action space, but at the cost of a reduced scope of application and an inability to express the multi-modality of bimanual behaviors [@pearce2023imitating].

**Foundation Models for Robotics.** Foundation models have shown immense promise in enabling generalizable behaviors by training multi-task "generalist" models [@brohan2022rt; @brohan2023rt; @team2023octo; @kim2024openvla] on large multi-task robot datasets [@padalkar2023open; @brohan2022rt; @fang2023rh20t]. Most studies adapt large vision-language models to directly predict actions [@brohan2022rt; @driess2023palm; @brohan2023rt; @padalkar2023open; @kim2024openvla]. While demonstrating generalization to new objects and tasks, they face issues with quantization errors and uncoordinated behaviors [@pearce2023imitating] when applied to bimanual manipulation. `\modify{This is largely due to their discretization of action spaces.}`{=latex} To enhance precision, diffusion models have been used for continuous control [@ho2020denoising; @chi2023diffusionpolicy; @pearce2023imitating; @team2023octo]. @team2023octo pre-train a Transformer-based diffusion policy on a subset of the Open X-Embodiment dataset [@padalkar2023open] ($25$ datasets), with up to $93$M parameters.

Problem Formulation and Challenges {#sec:method:overview}
==================================

We start by formulating the task and elaborating on the challenges. To evaluate the model on the hardware, we choose the ALOHA dual-arm robot as our target robot since it is one of the most representative dual-arm robots and is suitable for collecting human demonstration data via teleoperation [@zhao2023learning; @fu2024mobile; @aldaco2024aloha]. Fig. `\ref{fig:toy example a}`{=latex} shows a schematic diagram of the target robot, which consists of two arms with grippers and three cameras. Note that our setting and foundation model are generic to any dual-arm gripper robot.

```{=latex}
\vspace{-3ex}
```
```{=latex}
\centering
```
```{=latex}
\centering
```
![Target dual-arm robot](bimanual_robot.png){#fig:toy example a width="\\textwidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
![Illustration of multi-modality](multi-modality.png){#fig:toy example b width="\\textwidth"}

```{=latex}
\vspace{-1ex}
```
```{=latex}
\vspace{-2ex}
```
`\label{fig:toy example}`{=latex}

We consider the concrete task of language-conditioned bimanual manipulation with vision, which is fundamental in robotics and has great value in real-world scenarios such as households [@stepputtis2020language; @brohan2022rt; @zhao2023learning]. Formally, given a language instruction $\ell$, the policy is presented with an observation $\vo_t$ at time $t\in \mathbb{N}^+$, and it then produces an action $\va_t$ to control *two robot arms* to achieve the goal specified by $\ell$. The observation is represented as a triple $\vo_t:= (\mX_{t-T_{\text{img}}+1:t+1}, \vz_t, c)$, where $\mX_{t-T_{\text{img}}+1:t+1}:= (\mX_{t-T_{\text{img}}+1},\dots,\mX_{t})$ is the RGB observation history of size $T_{\text{img}}$, $\vz_t$ is the low-dimensional proprioception of the robot, and $c$ is the control frequency. The action $\va_t$ is usually a subset of the desired proprioception $\vz_{t+1}$[^2].
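
The observation triple above can be made concrete with a small data structure. The following is a minimal Python sketch, not the authors' implementation: the names `Observation`, `image_history_indices`, and `build_observation` are hypothetical, and frames are treated as opaque objects. It only illustrates that the slice $t-T_{\text{img}}+1:t+1$ excludes its right end, so the history contains exactly $T_{\text{img}}$ frames ending at time $t$.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    """Illustrative container for o_t = (X_{t-T_img+1:t+1}, z_t, c)."""
    images: List[object]             # the last T_img RGB frames, oldest first
    proprioception: Sequence[float]  # low-dimensional robot state z_t
    control_frequency: float         # control frequency c


def image_history_indices(t: int, t_img: int) -> List[int]:
    """Time indices covered by X_{t-T_img+1:t+1} (right end exclusive)."""
    start = t - t_img + 1
    if start < 0:
        # How early steps are padded is not specified in the text above,
        # so this sketch simply refuses to build an incomplete history.
        raise ValueError("not enough past frames for the requested history")
    return list(range(start, t + 1))


def build_observation(frames, states, t, t_img, control_frequency):
    """Assemble the observation at time t from per-step frames and states."""
    idx = image_history_indices(t, t_img)
    return Observation(
        images=[frames[i] for i in idx],
        proprioception=states[t],
        control_frequency=control_frequency,
    )
```

With $T_{\text{img}}=2$, the observation at time $t$ holds the frames at $t-1$ and $t$.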

A specific task in bimanual manipulation typically consists of multiple elements: a *skill* (e.g., verbs like "pick", "wipe", or "open"), an *object* (e.g., nouns like "bottle", "table", or "door"), a *scene* (i.e., the environment in which the task takes place), and a *modality* describing how the skill is performed (e.g., adverbials like "pick the bottle [with the left hand]{.underline}"). When encountering a new task, a practical policy is required to generalize to unseen[^3] elements in the task, which is particularly challenging for previous rule-based methods [@mirrazavi2017dynamical; @rakita2019shared; @grannen2023learning] as well as learning-based methods that are limited to either small models and data or simple tasks, as discussed in Sec. `\ref{sec:related}`{=latex}.

We aim to train a foundation model policy via imitation learning to achieve generalizability. However, the available data for a specific dual-arm robot is particularly scarce ($< 10$K trajectories) due to high hardware costs, far below what is commonly required to train a foundation model. To address this, we propose to employ a pre-training and fine-tuning pipeline [@radford2018improving] to take advantage of data from multiple robots, drawing inspiration from recent advances in unimanual manipulation [@team2023octo; @padalkar2023open; @kim2024openvla]. In this manner, we expand the data size by three orders of magnitude. Specifically, we first pre-train the model on a large-scale multi-robot dataset $\mathcal{D}_{\text{pre}}$ (mostly single-arm) and then fine-tune on a dataset of the target robot $\mathcal{D}_{\text{ft}}$. We denote the dataset by $\mathcal{D}_{\cdot} = \{ (\ell^{(i)}, \vo^{(i)}_{t}, \va^{(i)}_{t}) \mid 0\le t < T^{(i)}, 1 \le i \le N\}$, where $T^{(i)}$ is the length of the $i$-th trajectory and $N$ is the number of trajectories. Moreover, it is worth emphasizing that our goal is to use multi-robot data to enhance the model's generalizability in bimanual manipulation *rather than* developing a cross-embodiment model for various robots. There are two main challenges to developing such a foundation model with multi-robot data:

```{=latex}
\vspace{-3ex}
```
```{=latex}
\centering
```
![**RDT framework.** Heterogeneous action spaces of various robots are embedded into a unified action space for multi-robot training. **Inputs:** proprioception $\vz_t$, noisy action chunk $\tilde{\va}_{t:t+T_a}$, control frequency $c$, and diffusion time step $k$, acting as denoising inputs; image inputs ($T_{\text{img}}=2$ and $\mX_{\cdot}=\{ \mX_{\cdot}^1, \mX_{\cdot}^2, \mX_{\cdot}^3 \}$ denotes a set of images from exterior, right-wrist, and left-wrist cameras) and language inputs, acting as conditions. **Outputs:** denoised action chunk $\va_{t:t+T_a}$.](framework.png){#fig:framework width=".85\\textwidth"}

```{=latex}
\vspace{-2ex}
```
**Challenge 1: How to design a powerful architecture?** A generalizable foundation model necessitates a powerful architecture. This requirement encompasses two primary aspects. Firstly, the architecture must possess sufficient *expressiveness* to capture the multi-modality in the action distribution. Fig. `\ref{fig:toy example b}`{=latex} illustrates a toy example where the robot attempts to grasp a cube. We can see that there are many modes to finish this task, in contrast to unimanual manipulation, where only one robot arm is controlled. When collecting demonstrations, the human operator may randomly pick one of them, leading to multi-modality in the collected action data. Secondly, *scalability* is necessary for such an architecture. As a foundation model, it should effectively process heterogeneous inputs from various modalities (text, images, actions, etc.) while being scalable to train stably on large datasets.

**Challenge 2: How to train on heterogeneous data?** Training on multi-robot data presents a unique challenge of data heterogeneity. The physical structure and the action space can vary greatly across different robots. Previous attempts either restrict themselves to a subset of robots with similar action spaces [@yang2023polybot; @team2023octo; @kim2024openvla] or only retain a subset of inputs sharing the same structure [@padalkar2023open; @yang2024pushing], at the cost of discarding much of the available information. How to train models on such heterogeneous data remains largely under-addressed.

Robotics Diffusion Transformer {#sec:model}
==============================

We now present Robotics Diffusion Transformer (RDT), as illustrated in Fig. `\ref{fig:framework}`{=latex}. In Sec. `\ref{sec:model:diff}`{=latex}, we present the diffusion model and the corresponding architecture to address Challenge 1. In Sec. `\ref{sec:model:data}`{=latex}, we resolve Challenge 2 by proposing a physically interpretable unified action space to unify various robot action spaces and enable multi-robot pre-training. We also collect a comprehensive multi-task bimanual dataset for fine-tuning to improve the bimanual manipulation capabilities of RDT.

RDT Model {#sec:model:diff}
---------

#### Diffusion Modeling.

Due to multi-modality, given the language instruction $\ell$ and observation $\vo_t$, there may be many possible actions $\va_t$ to proceed with the task. The policy will learn the \`\`average\" of action modes if we model it as a deterministic mapping $(\ell, \vo_t) \mapsto \va_{t}$ and regress the tuples of $(\ell, \vo_t, \va_{t})$ in the training data. This may result in out-of-distribution actions, such as the arithmetic mean of multiple modes, which can be completely infeasible [@pearce2023imitating]. Instead, we choose to model the continuous conditional distribution $p(\va_{t} | \ell, \vo_t)$. As discussed in Sec. `\ref{sec:related}`{=latex}, among various approaches, diffusion models excel in both expressiveness and sampling quality, but can be slow to `\modify{sample high-dimensional data}`{=latex} (e.g., images). Luckily, in our setting this drawback is minor because $\va_{t}$ has a much lower dimension than images, so sampling incurs only minimal overhead. This makes diffusion models an ideal choice for policies, as in @chi2023diffusionpolicy.
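
The averaging effect can be illustrated with a toy sketch. It is not taken from the paper: the one-dimensional action, the two modes at $\pm 1$, and the sample size are illustrative assumptions. The MSE-optimal constant prediction for a bimodal set of demonstrations lands between the modes, far from any demonstrated action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy: demonstrators pass an obstacle on the left (-1) or the right (+1).
modes = rng.choice([-1.0, 1.0], size=1000)
actions = modes + 0.05 * rng.standard_normal(1000)

# The constant that minimizes the mean-squared error is the sample mean,
# which lies between the two modes rather than on either of them.
mse_optimal_action = actions.mean()

# Distance from the regressed action to the nearest demonstrated mode.
gap_to_nearest_mode = min(abs(mse_optimal_action - 1.0), abs(mse_optimal_action + 1.0))
```

A model of the full conditional distribution can instead return a sample near either mode.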

Nevertheless, employing diffusion models for robotic tasks faces unique challenges since the inherent properties of robotic physical quantities (i.e., the action and proprioception) are different from image/video data. Image and video data, while high-dimensional, often exhibit a degree of temporal and spatial continuity [@chen2019capturing; @liang2022self], with changes between frames typically being incremental. In contrast, robotic physical quantities are characterized by their *nonlinear dynamics* [@de2012theory] and the potential for *high-frequency changes* stemming from physical interactions, such as collisions, constraints, and material properties like damping. Moreover, the quantities also feature an *unstable numerical range*, probably due to extreme values caused by unreliable sensors. This underscores the necessity of adapting current diffusion models to effectively capture the instability and nonlinearity of robot data. Next, we will first elaborate on the diffusion formulation and then introduce our architecture design to resolve these challenges.

When making a decision with diffusion policies, we first sample a totally noisy action $\va_t^K \sim \mathcal{N}(\bm{0},\mI)$ and then perform $K\in\mathbb{N}^+$ denoising steps to denoise it to a clean action sample $\va_t^0$ from $p(\va_{t} | \ell, \vo_t)$: $$\label{eq1}
    \va_t^{k-1} = \frac{\sqrt{\bar{\alpha}^{k-1}}\beta^k}{1-\bar{\alpha}^{k}}  \va_t^{0} + \frac{\sqrt{\alpha^{k}}(1-\bar{\alpha}^{k-1})}{1-\bar{\alpha}^{k}} \va_t^{k} + \sigma^k \vz,\quad k=K,\dots,1,$$ where $\{ \alpha^k \}_{k=1}^K, \{ \sigma^k \}_{k=1}^K$ are scalar coefficients pre-defined by a noise schedule [@nichol2021improved]. Here, $\beta^k := 1-\alpha^k$, and $\bar{\alpha}^{k-1}:= \prod_{i=1}^{k-1} \alpha^{i}, \vz \sim \mathcal{N}(\bm{0},\mI)$ if $k > 1$, else $\bar{\alpha}^{k-1}=1, \vz = \bm{0}$. However, $\va_t^0$ is intractable before sampling is finished. We opt to use a learnable denoising network $f_{\vth}$ with parameters $\vth$ to estimate the clean sample from a noisy one: $\va_t^{0} \leftarrow f_{\vth}(\ell, \vo_t, \va_{t}^k, k)$. To train such a network, we will minimize the following mean-squared error (MSE) of denoising: $$\label{eq2}
    \mathcal{L}(\vth) := \mathrm{MSE}\left(\va_{t}, f_{\vth}(\ell, \vo_t, \sqrt{\bar{\alpha}^k}\va_{t} + \sqrt{1-\bar{\alpha}^k}\veps, k)\right),$$ where $k \sim \mathrm{Uniform}(\{1,\dots, K\})$, $\veps \sim \mathcal{N}(\bm{0},\mI)$, and $(\ell, \vo_t, \va_{t})$ is sampled from our training dataset. Later in this paper, we will denote noisy action inputs by $\tilde{\va}_{t}:= \sqrt{\bar{\alpha}^k}\va_{t} + \sqrt{1-\bar{\alpha}^k}\veps$, where the superscript $k$ is dropped for simplicity. Besides, in practice, we prefer to predict a sequence of actions, i.e., an action chunk, in one shot to encourage temporal consistency [@chi2023diffusionpolicy] and to alleviate error accumulation over time by reducing the number of decisions in a task [@zhao2023learning]. Specifically, we model $p(\va_{t:t+T_a} | \ell, \vo_t)$, where $\va_{t:t+T_a}:=(\va_t, \dots, \va_{t+T_a-1})$ is an action chunk and $T_a$ denotes the chunk size [@zhao2023learning]. We provide a detailed discussion in App. `\ref{app:ac}`{=latex}.
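
The sampling recursion and the denoising loss above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The linear $\beta$ schedule, the choice of $\sigma^k$ as the DDPM posterior standard deviation, and the function names are assumptions. The conditioning on $(\ell, \vo_t)$ is folded into a generic callable `f(a_k, k)` that returns an estimate of the clean action chunk.

```python
import numpy as np

def make_schedule(K, beta_min=1e-4, beta_max=2e-2):
    """Assumed linear beta schedule; index k of each array holds the step-k value."""
    beta = np.concatenate([[0.0], np.linspace(beta_min, beta_max, K)])
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)  # alpha_bar[0] = 1, alpha_bar[k] = prod_{i<=k} alpha[i]
    return beta, alpha, alpha_bar

def add_noise(a0, k, alpha_bar, eps):
    """Forward noising: sqrt(alpha_bar^k) * a + sqrt(1 - alpha_bar^k) * eps."""
    return np.sqrt(alpha_bar[k]) * a0 + np.sqrt(1.0 - alpha_bar[k]) * eps

def denoise_step(a_k, a0_hat, k, beta, alpha, alpha_bar, rng):
    """One reverse step a^k -> a^{k-1}, given the network's clean-sample estimate."""
    c0 = np.sqrt(alpha_bar[k - 1]) * beta[k] / (1.0 - alpha_bar[k])
    ck = np.sqrt(alpha[k]) * (1.0 - alpha_bar[k - 1]) / (1.0 - alpha_bar[k])
    # Assumption: sigma^k is the posterior standard deviation (it is zero at k = 1).
    sigma = np.sqrt(beta[k] * (1.0 - alpha_bar[k - 1]) / (1.0 - alpha_bar[k]))
    z = rng.standard_normal(a_k.shape) if k > 1 else np.zeros_like(a_k)
    return c0 * a0_hat + ck * a_k + sigma * z

def sample(f, shape, K, rng):
    """Start from pure noise and run K denoising steps; shape is (T_a, action_dim)."""
    beta, alpha, alpha_bar = make_schedule(K)
    a = rng.standard_normal(shape)
    for k in range(K, 0, -1):
        a = denoise_step(a, f(a, k), k, beta, alpha, alpha_bar, rng)
    return a

def denoising_loss(f, a0, K, rng):
    """Single-sample Monte Carlo estimate of the MSE denoising objective."""
    _, _, alpha_bar = make_schedule(K)
    k = int(rng.integers(1, K + 1))  # uniform over {1, ..., K}
    eps = rng.standard_normal(a0.shape)
    return float(np.mean((a0 - f(add_noise(a0, k, alpha_bar, eps), k)) ** 2))
```

At $k=1$ the coefficient on $\va_t^0$ equals one and the other two terms vanish, so the final sample coincides with the network's last clean-sample estimate.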

We now present the design of the architecture, including the encoding of multi-modal inputs and the network structure of $f_{\vth}$, while details are deferred to App. `\ref{app:arc}`{=latex}.

#### Encoding of Heterogeneous Multi-Modal Inputs.

The heterogeneity of multi-modal inputs is reflected in the structure; that is, the format and number of dimensions of each modality are significantly different. This poses challenges for multi-modal training. To address this, we encode these diverse modalities into a unified latent space. Below are the encoding methods:

-   **Low-Dimensional Inputs** are low-dimensional vectors that represent physical quantities of the robot, including the proprioception, the action chunk, and the control frequency. To encode them, we use MLPs (with Fourier features [@tancik2020fourier]), which can effectively capture the *high-frequency changes* in low-dimensional spaces.

-   **Image Inputs** are high-dimensional and contain rich spatial and semantic information. To extract compact representations, we use an image-text-aligned pre-trained vision encoder, SigLIP [@zhai2023sigmoid]. We fix its weights during training to save GPU memory.

-   **Language Inputs** are of varying length and highly abstract, posing integration challenges due to their complexity and ambiguity. To encode them, we use a pre-trained Transformer-based language model, T5-XXL [@2020t5]. We also fix its weights during training to save GPU memory.

    ```{=latex}
    \vspace{-0.5em}
    ```
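
As a sketch of the low-dimensional encoder described in the list above, the snippet below lifts each scalar input to sinusoidal Fourier features and passes the result through a two-layer MLP. The frequency set, layer sizes, and function names are illustrative assumptions rather than the configuration used in RDT.

```python
import numpy as np

def fourier_features(x, num_bands=4):
    """Map each scalar to [sin(2^j * pi * x), cos(2^j * pi * x)] for j < num_bands.

    Assumption: fixed power-of-two frequencies; other frequency choices are possible.
    """
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # (num_bands,)
    angles = x[..., None] * freqs                        # (..., D, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)              # (..., D * 2 * num_bands)

def mlp_encode(x, w1, b1, w2, b2, num_bands=4):
    """Two-layer ReLU MLP applied to the Fourier features of x."""
    h = np.maximum(fourier_features(x, num_bands) @ w1 + b1, 0.0)
    return h @ w2 + b2
```

The sinusoids give the MLP direct access to rapidly varying functions of the input, which a plain MLP on raw coordinates tends to fit slowly.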

Besides, heterogeneity also manifests in both cross-modal and intra-modal information density. First, modalities differ inherently in information capacity (e.g., visual inputs typically yield more tokens than text). Second, substantial variance exists within modalities, as seen in robotic perception where exterior cameras capture broader scenes versus the limited views from wrist cameras (Fig. `\ref{fig:framework}`{=latex}). This discrepancy risks shortcut learning: focusing on exterior views while neglecting details from wrist cameras. To promote balanced multimodal integration, we implement stochastic independent masking across modalities during encoding, preventing overreliance on specific inputs.
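
A minimal sketch of stochastic independent masking is given below. It assumes that a masked modality is replaced by zero tokens and that each modality is dropped with its own independent coin flip. The masking probability, the zero replacement, and the dictionary layout are hypothetical choices for illustration.

```python
import numpy as np

def mask_modalities(tokens, mask_prob, rng):
    """Independently mask each modality's tokens with probability mask_prob.

    tokens: dict mapping a modality name to a (num_tokens, dim) array.
    Assumption: a masked modality is replaced by zeros of the same shape.
    """
    masked = {}
    for name, tok in tokens.items():
        drop = rng.random() < mask_prob  # one independent draw per modality
        masked[name] = np.zeros_like(tok) if drop else tok
    return masked
```

Because the draws are independent, the model regularly sees training examples in which the exterior view is absent but the wrist views are present, and vice versa.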

```{=latex}
\begin{wrapfigure}{r}{0.3\linewidth}
\centering
 \vspace{-1em}
% \fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
\begin{minipage}{\linewidth}
    \captionsetup[subfigure]{justification=centering,font=small}
    \centering
    \includegraphics[width=0.9\linewidth]{loss2.pdf}
 \vspace{-0.5ex}
    \subcaption{\modify{Loss w/o QKN \& RMSN}}
    \label{fig:necessity:a}
    % \hfill
   % \par\vfill
    \vspace{1ex}
    \includegraphics[width=0.9\linewidth]{ablation.pdf}
     % \vspace{-0.5em}
    \subcaption{Task w/o MLP or ACI}
    \vspace{-0.5ex}
    \label{fig:necessity:b}
\end{minipage}
% \vspace{-2ex}
% \includegraphics[width=1.0\linewidth]{loss.pdf}%\vspace{-1ex}
\captionsetup{font=small}
\caption{{\textbf{(a)}  Unstable loss curve during training without QKNorm \& RMSNorm. \textbf{(b)} Success rates of RDT (w/o MLP Decoder or w/o ACI) in tasks of \textit{Robot Dog} (walk straight sub-task) and \textit{Pour Water-L-1/3} (correct amount sub-task). See Fig.~\ref{fig:task} for task definitions. All the models are without pre-training in this experiment due to resource constraints.}}
\label{fig:necessity}
\vspace{-0.5cm}
\end{wrapfigure}
```
#### Network Structure of $f_{\vth}$.

We choose the Transformer as the scalable backbone network [@bao2023all; @peebles2023scalable] and make the following three key modifications to the Diffusion Transformer (DiT) by considering the characteristics of our robotic problem:

-   **QKNorm & RMSNorm.** The *unstable numerical range* of the input robotic physical quantities can lead to problems such as gradient instability and numerical overflow, especially when training large foundation models. To solve this problem, we add QKNorm [@henry2020query] to avoid numerical instability when calculating attention. Besides, we note that our problem can be viewed as a time-series forecasting task, and the centering operation in the original DiT's LayerNorm could cause *token shift* and *attention shift*, thus destroying the symmetry of the time series [@huang2024unitnorm]. Therefore, we replace LayerNorm with RMSNorm [@zhang2019root], which has no centering operation. Fig. `\ref{fig:necessity:a}`{=latex} shows that large-scale pre-training tends to be very unstable or even diverge without these modifications.

-   **MLP Decoder.** To improve the approximation capability for *nonlinear* robot actions, we replace the final linear decoder with a nonlinear MLP decoder as a projection from the latent space back to the physical space. As empirically shown in Fig. `\ref{fig:necessity:b}`{=latex}, without this design, RDT cannot effectively capture nonlinear dynamics and thus loses the ability to accomplish dexterous tasks that require delicate operations.

-   **Alternating Condition Injection (ACI).** In our model, image and language inputs serve as conditions, which are high-dimensional and variable in length, in contrast to the class-label conditions in `\modify{traditional DiTs~\citep{peebles2023scalable}}`{=latex}. These informative conditions are challenging to compress into a single token, making the original adaptive layer norm approach unsuitable. Therefore, we employ cross-attention to accommodate conditions of varying lengths, avoiding the information loss of further compression. Moreover, since image tokens usually far outnumber text tokens, injecting both modalities simultaneously tends to overshadow text-related information, thus impairing the instruction-following capability (see Fig. `\ref{fig:necessity:b}`{=latex} for quantitative results). To mitigate this issue, we alternate between injecting image and text tokens in the cross-attention of successive layers rather than injecting both in every layer.

    ```{=latex}
    \vspace{-0.5em}
    ```
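
The three modifications in the list above can be sketched compactly in NumPy. This is an illustration under stated assumptions, not the RDT code: learnable gains are omitted, QKNorm is realized here by RMS-normalizing queries and keys, attention is single-head, and the even/odd layer assignment in `select_condition` is a hypothetical schedule.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without affine parameters: centers, then rescales."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-6):
    """RMSNorm without gain: rescales only, with no centering step."""
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention_qknorm(x, cond, wq, wk, wv):
    """Single-head cross-attention with normalized queries and keys.

    Assumption: q and k are RMS-normalized, which bounds the attention logits
    regardless of the scale of x and cond.
    """
    q, k, v = rms_norm(x @ wq), rms_norm(cond @ wk), cond @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def select_condition(layer_idx, image_tokens, text_tokens):
    """Alternate conditions across layers (assumed: even = image, odd = text)."""
    return image_tokens if layer_idx % 2 == 0 else text_tokens
```

Adding a constant offset to a token leaves the `layer_norm` output unchanged but alters the `rms_norm` output, which is the sense in which RMSNorm avoids the centering operation.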

Data {#sec:model:data}
----

#### Training on Heterogeneous Multi-Robot Data.

To enable training on heterogeneous multi-robot data, `\modify{we need a unified action space shared among various robots to provide a unified format for multi-robot actions.}`{=latex} The mapping from the original action space of a robot to the unified action space should be physically interpretable, `\modify{and each dimension of the space should have a clear physical meaning.}`{=latex} This can encourage the model to learn shared physical laws from different robot data, thereby improving the efficiency of learning from data of different robots [@shah2023gnm].

The design of the space consists of two steps. Firstly, for each robot, we can use a single space to accommodate both its proprioception $\vz_{t}$ and action $\va_t$. This is because $\va_t$ is usually a subset of the desired $\vz_{t+1}$ [@de2012theory; @kouvaritakis2016model], and thus the space of $\vz_{t}$ naturally contains the space of $\va_t$. Secondly, we design a unified space that encompasses all the main physical quantities of most robots with gripper arms. As illustrated in the left side of Fig. `\ref{fig:framework}`{=latex}, we embed the action space of a robot into this unified space by filling each element of the original action vector into the corresponding position of the unified action space vector according to its physical meaning, with the remaining positions being padded. The specific definition of the space is given in App. `\ref{unified_action_space}`{=latex}.
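
The embedding step can be sketched as a lookup from named physical quantities to fixed slots of a unified vector. The slot names, their ordering, the vector length, the zero padding value, and the availability mask below are all hypothetical; the actual layout is the one defined in the appendix.

```python
import numpy as np

# Hypothetical slot layout for illustration only.
UNIFIED_SLOTS = [
    "right_arm_joint_0_pos", "right_arm_joint_1_pos", "right_gripper_open",
    "left_arm_joint_0_pos", "left_arm_joint_1_pos", "left_gripper_open",
    "base_linear_vel", "base_angular_vel",
]
SLOT_INDEX = {name: i for i, name in enumerate(UNIFIED_SLOTS)}

def to_unified(named_action):
    """Place each physical quantity at its slot; pad the remaining slots.

    Returns the padded vector and a boolean mask of the slots that were filled
    (the mask is an assumption added here so that padding can be told apart
    from a genuine zero value).
    """
    vec = np.zeros(len(UNIFIED_SLOTS))
    mask = np.zeros(len(UNIFIED_SLOTS), dtype=bool)
    for name, value in named_action.items():
        vec[SLOT_INDEX[name]] = value
        mask[SLOT_INDEX[name]] = True
    return vec, mask

# A single-arm robot fills only the slots it physically has.
single_arm_vec, single_arm_mask = to_unified(
    {"right_arm_joint_0_pos": 0.3, "right_arm_joint_1_pos": -0.1, "right_gripper_open": 1.0}
)
```

Because a given slot always carries the same physical quantity, data from a single-arm robot and a dual-arm robot share the same coordinates wherever their hardware overlaps.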

With this unified space, we are able to pre-train RDT on data from almost all modern robots with gripper arms, and greatly expand the data scale towards the requirement for a foundation model. Specifically, our collection of pre-training datasets includes $46$ datasets of various robots, with a total size of $1$M+ trajectories and $21$TB. More details and preprocessing are deferred to App. `\ref{pretrain_dataset_detail}`{=latex}.

#### Collecting a Comprehensive Multi-Task Bimanual Dataset.

Although pre-trained on large-scale datasets, RDT may still struggle to zero-shot generalize to the target dual-arm robot due to the embodiment gap. To bridge the gap, we collect a multi-task bimanual dataset on the target robot for fine-tuning. Recent advances in large language models [@ziegler2019fine; @brown2020language; @touvron2023llama] have shown that high-quality fine-tuning datasets are crucial for model performance. We ensure the high quality of our dataset from three aspects: **(1)** Regarding quantity, we have collected $6$K+ trajectories, making our dataset one of the largest bimanual datasets to date; **(2)** Regarding comprehensiveness, we consider $300$+ challenging tasks, covering most manipulation task types, from pick-and-place to plugging cables, even including writing math equations; **(3)** Regarding diversity, we prepare $100$+ objects with rigid and non-rigid bodies of various sizes and textures and $15$+ different rooms with different lighting conditions. Besides, we further utilize GPT-4-Turbo [@achiam2023gpt] to rewrite human-annotated instructions to increase text diversity. For more information, we refer to Fig. `\ref{fig:ft_dataset}`{=latex} and App. `\ref{app:ft_dataset}`{=latex}.

Experiments {#sec:experiments}
===========

We aim to answer the following questions through real-robot experiments: $\mathcal{Q}$**1**: Can RDT zero-shot generalize to unseen objects and scenes? $\mathcal{Q}$**2**: How effective is RDT's zero-shot instruction-following capability for unseen modalities? $\mathcal{Q}$**3**: Can RDT facilitate few-shot learning for previously unseen skills? $\mathcal{Q}$**4**: Is RDT capable of completing tasks that require delicate operations? and $\mathcal{Q}$**5**: Are large model sizes, extensive data, and diffusion modeling helpful for RDT's performance?

Experiment Setups {#sec:exp:setup}
-----------------

```{=latex}
\centering
```
![**Task definitions and visualizations.** For $7$ challenging tasks, we describe their language instruction, randomization, and definitions of each sub-task. For *Pour Water-L-1/3* and *Pour Water-R-2/3*, we show the resulting water levels in two images.](exp.png "fig:"){#fig:task width="\\textwidth"} `\vspace{-1em}`{=latex}

```{=latex}
\vspace{-2ex}
```
```{=latex}
\vspace{-1em}
```
`\label{tbl:taskdim}`{=latex} `\renewcommand{\arraystretch}{1.2}`{=latex}

```{=latex}
\resizebox{\linewidth}{!}{
\begin{tabular}{lll}
\toprule
\multicolumn{1}{c}{\textbf{TASK NAME}} & \multicolumn{1}{c}{\textbf{DIMENSION}}             & \multicolumn{1}{c}{\textbf{EXPLANATION}}                                                   \\ \hline
Wash Cup                      & Unseen Object ($\mathcal{Q}$\textbf{1})        & To wash one seen and two unseen cups with the faucet                              \\
Pour Water                    & Unseen Scene ($\mathcal{Q}$\textbf{1})         & To pour water into the cup in three unseen rooms                                  \\
Pour Water-L-1/3              & Instruction Following ($\mathcal{Q}$\textbf{2}) & To pour water into the cup with the \textbf{left hand} until \textbf{one-third} full                \\
Pour Water-R-2/3              & Instruction Following ($\mathcal{Q}$\textbf{2}) & To pour water into the cup with the \textbf{right hand} until \textbf{two-thirds} full              \\
Handover                      & 5-Shot Learning ($\mathcal{Q}$\textbf{3})       & To move the marker to the box, where a handover is needed due to the long distance          \\
Fold Shorts                   & 1-Shot Learning ($\mathcal{Q}$\textbf{3})       & To fold the shorts in half horizontally                                           \\
Robot Dog                     & Dexterity ($\mathcal{Q}$\textbf{4})                                       & To push the joystick straight to control the robot dog to walk in a straight line \\ \bottomrule
\end{tabular}
}
```
```{=latex}
\vspace{-2ex}
```
#### Tasks.

We select $7$ challenging tasks to evaluate the generalizability and capabilities of RDT from different dimensions, including complex scenarios that the model may encounter in real-world tasks, such as various unseen elements and dexterous manipulation. An illustration of the dimension of each task is given in Table `\ref{tbl:taskdim}`{=latex} while detailed definitions and visualizations are provided in Fig. `\ref{fig:task}`{=latex}.

#### Data.

We use the pre-training and fine-tuning datasets in Sec. `\ref{sec:model:data}`{=latex}. We now list the number of demos related to each task in our fine-tuning dataset. *Wash Cup*: $133$ demos for seen cups combined and $0$ demos for unseen cups; *Pour Water*: $350$ demos for seen rooms combined and $0$ demos for unseen rooms; *Pour Water-L-1/3* & *Pour Water-R-2/3*: $18$ demos for the water level of little, $19$ demos for half, and $19$ demos for full; *Handover*: $5$ demos; *Fold Shorts*: $1$ demo; *Robot Dog*: $68$ demos.

**Model Training and Inference.** We scale the size of RDT up to $1.2$B parameters, making it currently the largest diffusion-based robotic foundation model. The model is pre-trained on $48$ H100 80GB GPUs for a month, for a total of $1$M training iterations. Fine-tuning this model on the same GPUs takes three days for $130$K steps. We defer further details to App. `\ref{app:train}`{=latex}, including the running platform, design choices, and data augmentation techniques. For real-time inference, we adopt DPM-Solver++ [@lu2022dpm], a recent sampling accelerator for diffusion models. It reduces the diffusion steps required to sample an action chunk from $100$ to $5$, achieving an action chunk inference frequency of $6$ Hz (action chunks per second) and an average action inference frequency of $381$ Hz (actions per second) on the target robot's onboard RTX 4090 24GB GPU.
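
As a rough consistency check on these figures (an illustrative calculation, not one stated by the authors), the per-action frequency is approximately the chunk frequency times the chunk size $T_a$, while the number of denoising network evaluations per chunk drops by a factor of $100/5$:

$$f_{\text{action}} \approx f_{\text{chunk}} \cdot T_a \quad\Longrightarrow\quad T_a \approx \frac{381\ \text{Hz}}{6\ \text{Hz}} \approx 63.5, \qquad \frac{100\ \text{steps}}{5\ \text{steps}} = 20.$$

Because both reported frequencies are rounded, this only indicates that a chunk contains on the order of $60$ actions; the exact chunk size is not specified in this section.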

**Baselines.** To comprehensively evaluate RDT, we consider the most advanced baselines in robotic foundation models and bimanual manipulation, including Action Chunking with Transformers (ACT) [@zhao2023learning], OpenVLA [@kim2024openvla], and Octo [@team2023octo]. ACT is a state-of-the-art method in bimanual manipulation that uses a VAE to model the action distribution. OpenVLA is the largest open-source foundation model ($7$B parameters) and models actions via discretization. Octo is a diffusion-based foundation model whose largest version has only $93$M parameters.

#### Metric and Hardware.

We employ the success rate as our main metric, which is calculated by dividing successful trials by total trials. *Wash Cup* is tested with $8$ trials for each cup (one seen cup, two unseen cups, $24$ trials in total). *Pour Water* is tested with $8$ trials for each room (three unseen rooms, $24$ trials in total). *Pour Water-L-1/3* and *Pour Water-R-2/3* are tested with $8$ trials each. *Handover*, *Fold Shorts*, and *Robot Dog* are tested with $25$ trials each. All the tests are performed on the ALOHA dual-arm robot (see App. `\ref{hardware_detail}`{=latex} for hardware configurations). Experimental details, such as the implementation and hyper-parameters, are elaborated in App. `\ref{app:exp}`{=latex}.
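
The metric is simple enough to state in code. The helper below is a hypothetical illustration, and it also shows why rates reported for $8$-trial settings are always multiples of $12.5\%$.

```python
def success_rate(successes, trials):
    """Success rate in percent: successful trials divided by total trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    return 100.0 * successes / trials

# With 8 trials per setting, the attainable rates are 0, 12.5, 25, ..., 100.
rates_with_8_trials = [success_rate(s, 8) for s in range(9)]
```

With $25$ trials, the granularity is $4\%$ instead.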

```{=latex}
\begin{wraptable}{r}{0.5\linewidth}
% \vspace{-0.3em}
\centering
\captionsetup{font=small}
\caption{\textbf{Ablation study results.} Here are the success rates ($\%$) of the original RDT and its three variants in tasks of \textit{Wash Cup} (unseen cup 2, total success rate), \textit{Pour Water} (unseen room 3, total success rate), and \textit{Pour Water-L-1/3} (correct amount sub-task). All the models except \textit{RDT (scratch)} are pre-trained before fine-tuning.}
\label{tbl:ablation}
\renewcommand{\arraystretch}{1.2}
\begin{center}
\resizebox{\linewidth}{!}{
\begin{tabular}{lccc}
\hline
\multicolumn{1}{c}{\textbf{\begin{tabular}[c]{@{}c@{}}VARIANT\\ NAME\end{tabular}}} & \textbf{\begin{tabular}[c]{@{}c@{}}UNSEEN\\ OBJECT\end{tabular}} & \textbf{\begin{tabular}[c]{@{}c@{}}UNSEEN\\ SCENE\end{tabular}} & \multicolumn{1}{c}{\textbf{\begin{tabular}[c]{@{}c@{}}INSTRUCTION\\ FOLLOWING\end{tabular}}} \\ \hline
RDT (regress)                                                                       & 12.5                                                             & 50                                                              & 12.5                                                                                         \\
RDT (small)                                                                         & 37.5                                                             & \textbf{62.5}                                                            & 25                                                                                           \\
RDT (scratch)                                                                       & 0                                                                & 25                                                              & 62.5                                                                                         \\
RDT (\textbf{ours})                                                                          & \textbf{50}                                                      & \textbf{62.5}                                                   & \textbf{100}                                                                                 \\ \hline
\end{tabular}
}
\end{center}
\end{wraptable}
```
#### Ablation Study.

Answering $\mathcal{Q}$**5**, we conduct ablation studies on the model size, pre-training, and the modeling method to understand their importance. We consider the following variants. *RDT (ours):* the original RDT. *RDT (regress):* RDT without diffusion modeling. It models the deterministic mapping $(\ell, \vo_t) \mapsto \va_{t}$. *RDT (small):* RDT with a reduced model size of only $166$M parameters. *RDT (scratch):* RDT without pre-training. It is trained from scratch during fine-tuning. In Table `\ref{tbl:ablation}`{=latex}, we evaluate these variants in terms of three dimensions of generalizability. Table `\ref{tbl:baseline}`{=latex} provides a comparison of different variants of RDT as well as baselines.

```{=latex}
\def\Thickhline{\noalign{\hrule height1pt}}
```
```{=latex}
\vspace{-1em}
```
`\label{tbl:result}`{=latex} `\renewcommand{\arraystretch}{1.2}`{=latex}

```{=latex}
\resizebox{\linewidth}{!}{
\begin{tabular}{lcccccccccccccccccc}
\Thickhline
                     & \multicolumn{18}{c}{Wash Cup: seen cup 1 $|$ unseen cup 1 $|$ unseen cup 2 (\textbf{Unseen Object})}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 \\ \hline
                     & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pick Up Cup\end{tabular}}                                                & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Turn On Faucet\end{tabular}}                      & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Get Water\end{tabular}}                                             & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pour Out Water\end{tabular}}                 & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Place Back Cup\end{tabular}}                                    & \multicolumn{3}{c}{Total}                                                                        \\
ACT                  & 50                                  & 12.5                                & \multicolumn{1}{c|}{37.5}                     & 0                         & 0                        & \multicolumn{1}{c|}{0}                      & 0                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 0                 & 0                        & \multicolumn{1}{c|}{0}                         & 37.5                                 & 0                                    & \multicolumn{1}{c|}{0}             & 0                  & 0                          & 0                                              \\
OpenVLA              & 0                                   & 0                                   & \multicolumn{1}{c|}{0}                        & 0                         & 0                        & \multicolumn{1}{c|}{0}                      & 0                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 0                 & 0                        & \multicolumn{1}{c|}{0}                         & 0                                    & 0                                    & \multicolumn{1}{c|}{0}             & 0                  & 0                          & 0                                              \\
Octo              & 0                                   & 0                                   & \multicolumn{1}{c|}{0}                        & 0                         & 0                        & \multicolumn{1}{c|}{0}                      & 0                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 0                 & 0                        & \multicolumn{1}{c|}{0}                         & 0                                    & 0                                    & \multicolumn{1}{c|}{0}             & 0                  & 0                          & 0                                              \\
RDT (scratch)        & 37.5                                & 12.5                                & \multicolumn{1}{c|}{0}                        & 0                         & 12.5                     & \multicolumn{1}{c|}{12.5}                   & 0                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 37.5              & 12.5                     & \multicolumn{1}{c|}{0}                         & 25                                   & 0                                    & \multicolumn{1}{c|}{0}             & 0                  & 0                          & 0                                              \\
RDT (\textbf{ours})           & 87.5                                & 87.5                                & \multicolumn{1}{c|}{50}                       & 62.5                      & 75                       & \multicolumn{1}{c|}{50}                     & 50                                   & 75                                   & \multicolumn{1}{c|}{50}                & 87.5              & 75                       & \multicolumn{1}{c|}{50}                        & 87.5                                 & 62.5                                 & \multicolumn{1}{c|}{50}            & \textbf{50}        & \textbf{75}                & \textbf{50}                                    \\ \hline
                     & \multicolumn{18}{c}{Pour Water-L-1/3 $|$ Pour Water-R-2/3 (\textbf{Instruction Following})}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          \\ \hline
                     & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pick Up Bottle\end{tabular}}                                             & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pour Water\end{tabular}}                          & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Place Back Bottle\end{tabular}}                                     & \multicolumn{3}{c|}{Total}                                                                    & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Correct Hand\end{tabular}}                                      & \multicolumn{3}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Correct Amount\end{tabular}}                     \\
OpenVLA              & 50                                  &                                     & \multicolumn{1}{c|}{0}                        & 0                         &                          & \multicolumn{1}{c|}{0}                      & 0                                    &                                      & \multicolumn{1}{c|}{0}                 & 0                 &                          & \multicolumn{1}{c|}{0}                         & 50                                   &                                      & \multicolumn{1}{c|}{0}             & 0                  &                            & \multicolumn{1}{c}{0}                          \\
Octo              &  0                                 &                                     & \multicolumn{1}{c|}{0}                        & 0                         &                          & \multicolumn{1}{c|}{0}                      & 0                                    &                                      & \multicolumn{1}{c|}{0}                 & 0                 &                          & \multicolumn{1}{c|}{0}                         & 0                                   &                                      & \multicolumn{1}{c|}{0}             & 0                  &                            & \multicolumn{1}{c}{0}                          \\
RDT (scratch)        & 100                                 &                                     & \multicolumn{1}{c|}{75}                       & 75                        &                          & \multicolumn{1}{c|}{25}                     & 62.5                                 &                                      & \multicolumn{1}{c|}{25}                & 62.5              &                          & \multicolumn{1}{c|}{25}                        & 100                                  &                                      & \multicolumn{1}{c|}{75}            & 62.5               &                            & \multicolumn{1}{c}{12.5}                       \\
RDT (\textbf{ours})           & 100                                 &                                     & \multicolumn{1}{c|}{87.5}                     & 100                       &                          & \multicolumn{1}{c|}{87.5}                   & 100                                  &                                      & \multicolumn{1}{c|}{87.5}              & \textbf{100}      &                          & \multicolumn{1}{c|}{\textbf{87.5}}             & \textbf{100}                         &                                      & \multicolumn{1}{c|}{\textbf{87.5}} & \textbf{100}      &                            & \multicolumn{1}{c}{\textbf{75}}                \\ \hline
\multicolumn{1}{c}{} & \multicolumn{12}{c}{Pour Water: unseen room 1 $|$ unseen room 2 $|$ unseen room 3 (\textbf{Unseen Scene})}                                                                                                                                                                                                                                                                                                                                                                    & \multicolumn{1}{c}{}                 & \multicolumn{1}{c}{}                 & \multicolumn{1}{c}{}               & \multicolumn{3}{c}{Fold Shorts (\textbf{1-Shot})}                                                         \\ \hline
                     & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pick Up Bottle\end{tabular}}                                             & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pour Water\end{tabular}}                          & \multicolumn{3}{c|}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Place Back Bottle\end{tabular}}                                     & \multicolumn{3}{c|}{Total}                                                                    & \multicolumn{3}{c|}{-}                                                                                           & \multicolumn{3}{c}{Total}                                                                        \\
ACT                  & 25                                  & 87.5                                & \multicolumn{1}{c|}{25}                       & 0                         & 50                       & \multicolumn{1}{c|}{12.5}                   & 0                                    & 37.5                                 & \multicolumn{1}{c|}{12.5}              & 0                 & 37.5                     & \multicolumn{1}{c|}{12.5}                      & \multicolumn{1}{c}{}                 & -                                    & \multicolumn{1}{c|}{}              & \multicolumn{3}{c}{0}                                                                            \\
OpenVLA              & 0                                   & 0                                   & \multicolumn{1}{c|}{0}                        & 0                         & 0                        & \multicolumn{1}{c|}{0}                      & 0                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 0                 & 0                        & \multicolumn{1}{c|}{0}                         & \multicolumn{1}{c}{}                 & -                                    & \multicolumn{1}{c|}{}              & \multicolumn{3}{c}{0}                                                                            \\
Octo              & 50                                   & 0                                   & \multicolumn{1}{c|}{12.5}                        & 12.5                         & 0                        & \multicolumn{1}{c|}{0}                      & 12.5                                    & 0                                    & \multicolumn{1}{c|}{0}                 & 12.5                 & 0                        & \multicolumn{1}{c|}{0}                         & \multicolumn{1}{c}{}                 & -                                    & \multicolumn{1}{c|}{}              & \multicolumn{3}{c}{4}                                                                            \\
RDT (scratch)        & 62.5                                & 100                                 & \multicolumn{1}{c|}{62.5}                     & 25                        & 87.5                     & \multicolumn{1}{c|}{37.5}                   & 25                                   & 75                                   & \multicolumn{1}{c|}{25}                & 25                & 75                       & \multicolumn{1}{c|}{25}                        & \multicolumn{1}{c}{}                 & -                                    & \multicolumn{1}{c|}{}              & \multicolumn{3}{c}{40}                                                                           \\
RDT (\textbf{ours})           & 62.5                                & 100                                 & \multicolumn{1}{c|}{62.5}                     & 62.5                      & 100                      & \multicolumn{1}{c|}{62.5}                   & 62.5                                 & 100                                  & \multicolumn{1}{c|}{62.5}              & \textbf{62.5}     & \textbf{100}             & \multicolumn{1}{c|}{\textbf{62.5}}             & \multicolumn{1}{c}{}                 & -                                    & \multicolumn{1}{c|}{}              & \multicolumn{3}{c}{\textbf{68}}                                                                  \\ \hline
                     & \multicolumn{10}{c}{Handover (\textbf{5-Shot})}                                                                                                                                                                                                                                                                                                                                    & \multicolumn{8}{c}{Robot Dog (\textbf{Dexterity})}                                                                                                                                                                                                                                                        \\ \hline
                     & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Pick Up\\ Pen\end{tabular}} & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Switch\\ Hand\end{tabular}} & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Drop\\ Pen\end{tabular}} & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Fall into\\ Box\end{tabular}} & \multicolumn{2}{c|}{Total}                                 & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Grab\\ Remote\end{tabular}} & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Push\\ Joystick\end{tabular}} & \multicolumn{2}{c}{Total}                               & \multicolumn{2}{c}{\renewcommand{\arraystretch}{1.1}\begin{tabular}[c]{@{}c@{}}Walk\\ Straight\end{tabular}} \\
ACT                  & \multicolumn{2}{c}{44}                                                    & \multicolumn{2}{c}{0}                                                     & \multicolumn{2}{c}{0}                                                  & \multicolumn{2}{c}{0}                                                       & \multicolumn{2}{c|}{0}                                     & \multicolumn{2}{c}{88}                                                    & \multicolumn{2}{c}{32}                                                      & \multicolumn{2}{c}{32}                                  & \multicolumn{2}{c}{32}                                                      \\
OpenVLA              & \multicolumn{2}{c}{0}                                                     & \multicolumn{2}{c}{0}                                                     & \multicolumn{2}{c}{0}                                                  & \multicolumn{2}{c}{0}                                                       & \multicolumn{2}{c|}{0}                                     & \multicolumn{2}{c}{84}                                                    & \multicolumn{2}{c}{0}                                                       & \multicolumn{2}{c}{0}                                   & \multicolumn{2}{c}{0}                                                       \\
Octo              & \multicolumn{2}{c}{12}                                                     & \multicolumn{2}{c}{0}                                                     & \multicolumn{2}{c}{0}                                                  & \multicolumn{2}{c}{0}                                                       & \multicolumn{2}{c|}{0}                                     & \multicolumn{2}{c}{100}                                                    & \multicolumn{2}{c}{4}                                                       & \multicolumn{2}{c}{4}                                   & \multicolumn{2}{c}{0}                                                       \\
RDT (scratch)        & \multicolumn{2}{c}{88}                                                    & \multicolumn{2}{c}{32}                                                    & \multicolumn{2}{c}{24}                                                 & \multicolumn{2}{c}{16}                                                      & \multicolumn{2}{c|}{16}                                    & \multicolumn{2}{c}{100}                                                   & \multicolumn{2}{c}{64}                                                      & \multicolumn{2}{c}{64}                                  & \multicolumn{2}{c}{32}                                                      \\
RDT (\textbf{ours})           & \multicolumn{2}{c}{100}                                                   & \multicolumn{2}{c}{56}                                                    & \multicolumn{2}{c}{56}                                                 & \multicolumn{2}{c}{40}                                                      & \multicolumn{2}{c|}{\textbf{40}}                           & \multicolumn{2}{c}{100}                                                   & \multicolumn{2}{c}{76}                                                      & \multicolumn{2}{c}{\textbf{76}}                         & \multicolumn{2}{c}{\textbf{48}}                                             \\ \Thickhline
\end{tabular}
}
```
```{=latex}
\vspace{-2ex}
```
Results Analysis
----------------

Table `\ref{tbl:result}`{=latex} shows that RDT consistently outperforms the baselines. We attribute this to RDT's use of diffusion with a powerful network architecture, which models the distribution of multi-modal actions accurately, whereas discretization lacks accuracy and VAEs lack expressiveness. In addition, the large number of parameters, combined with large-scale pre-training, provides substantial prior knowledge that significantly improves generalizability. A detailed analysis follows:

-   **$\mathcal{Q}$1 & $\mathcal{Q}$2:** RDT can zero-shot generalize to unseen objects, scenes, and modalities. In *Wash Cup* and *Pour Water*, RDT still achieves a high success rate in unseen scenarios, with performance close to that in seen ones. In contrast, the other baselines cannot even complete the entire task. In *Pour Water-L-1/3* and *Pour Water-R-2/3*, the third row of Fig. `\ref{fig:task}`{=latex} and its zoomed-in version in Fig. `\ref{fig:water}`{=latex} show that RDT understands precisely which hand to use and how much water to pour, and closely follows the instruction through its actions, even though it has never seen words like "one-third" or "two-thirds". It is precisely because large-scale pre-training has exposed RDT to a large number of diverse objects, scenes, and instructions that it achieves such strong zero-shot generalization.

-   **$\mathcal{Q}$3:** RDT can learn new skills using only a few shots. In *Handover* and *Fold Shorts*, RDT has learned new and complex skills of handover and folding through few-shot learning, whose action patterns are very different from known skills, while the success rate of others is almost zero. Such improvement is also due to large-scale pre-training. Few-shot learning can help RDT quickly adapt to new working environments, which is of great significance for practical applications.

-   **$\mathcal{Q}$4:** RDT can handle dexterous tasks. In *Robot Dog*, RDT accurately controls the angle when pushing the joystick, while the other methods cause the robot dog to deviate. This is because diffusion, with our powerful network architecture, can model the distribution of multi-modal and nonlinear actions so that the action precision meets the requirements of dexterous tasks. We also note that the joystick and the remote control are both black, which makes the joystick hard to distinguish visually and probably makes ACT prone to failure. In contrast, large-scale pre-training has allowed RDT to learn a better vision-language representation of the joystick concept, improving its recognition capability.

-   **$\mathcal{Q}$5:** Large model size, extensive data, and diffusion are all essential to RDT's performance. In Table `\ref{tbl:ablation}`{=latex}, removing any one of these factors causes a serious performance drop, demonstrating the necessity of our contributions. In particular, *RDT (scratch)* performs poorly on unseen objects and scenes, indicating that the knowledge from pre-training is critical for generalization.

Conclusion {#sec:conclusion}
==========

In this paper, we tackled the challenges of data scarcity and increased manipulation complexity in generalizable bimanual manipulation by developing the Robotics Diffusion Transformer (RDT), a diffusion-based foundation model for language-conditioned visuomotor imitation learning. Our model was pre-trained on an extensive multi-robot dataset and fine-tuned on a self-collected bimanual dataset. We further introduced a Physically Interpretable Unified Action Space to unify action representations across different robots, enhancing robustness and transferability. Outperforming existing methods, RDT not only demonstrates significant improvements in dexterous bimanual capability and instruction following but also achieves remarkable performance in few-shot learning and zero-shot generalization to unseen objects and scenes.

```{=latex}
\clearpage
```
Acknowledgments {#acknowledgments .unnumbered}
---------------

This work was supported by NSFC Projects (Nos. 92248303, 92370124, 62350080, 62276149, 62061136001, 92270001), BNRist (BNR2022RC01006), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. J. Zhu was also supported by the XPlorer Prize.

Ethics Statement {#ethics-statement .unnumbered}
----------------

All the data used in this research comes from open-source and well-documented datasets, and we strictly follow all applicable licensing and usage guidelines. Our fine-tuning dataset was collected by the authors of this paper along with some volunteers.

While RDT is a model trained for scalable, language-conditioned visuomotor policy learning and tested on the ALOHA dual-arm robot, we emphasize that any harmful use of our model is neither intended nor encouraged, and we encourage responsible deployment on real-world robots.

Reproducibility Statement {#reproducibility-statement .unnumbered}
-------------------------

To reproduce our pre-training and fine-tuning processes, we have provided the code in the [repository](https://github.com/thu-ml/RoboticsDiffusionTransformer). We also include instructions for downloading the dataset, how to use the training code, and a guide for deploying on a real machine in the README file. We have fully open-sourced all our code, model weights, and fine-tuning datasets. We refer to the [project page](https://rdt-robotics.github.io/rdt-robotics/) for more information.

Please refer to App. `\ref{pretrain_dataset_detail}`{=latex} for pre-training dataset details, App. `\ref{app:ft_dataset}`{=latex} for fine-tuning dataset details, App. `\ref{app:train}`{=latex} for RDT training details, App. `\ref{hardware_detail}`{=latex} for hardware details, and App. `\ref{app:exp}`{=latex} for experimental details and implementation of baselines.

```{=latex}
\bibliographystyle{iclr2024_conference}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Action Chunking Technique {#app:ac}
=========================

In practice, we find that errors in action prediction accumulate as the number of historical decisions increases, owing to the imperfection of the learned policy. This may cause the robot to drift out of the training distribution and reach hard-to-recover states [@ross2011reduction]. To alleviate this, we predict multiple actions in one shot, thereby reducing the total number of decisions in a trajectory. That is, we model $p(\va_{t:t+T_a} | \ell, \vo_t)$, where $\va_{t:t+T_a}:=(\va_t, \dots, \va_{t+T_a-1})$ is an action chunk and $T_a$ denotes the chunk size [@zhao2023learning]. To adapt Eq. `\ref{eq1}`{=latex} and Eq. `\ref{eq2}`{=latex} to this context, we simply replace $\va_t$ by $\va_{t:t+T_a}$. Besides, according to @chi2023diffusionpolicy, action chunking also improves temporal consistency: it better accounts for the coherence between preceding and subsequent actions and may avoid sudden changes in actions that could damage the robot.
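
The effect of chunking on the number of decisions can be illustrated with a minimal sketch. It assumes that each predicted chunk is executed in full before the policy is queried again, which is an assumption of this example and not a detail stated above; `rollout_with_chunking` and `make_dummy_policy` are hypothetical names, and the dummy policy simply repeats its observation.

```python
import numpy as np

def rollout_with_chunking(policy, observe, horizon, chunk_size):
    """Run `horizon` control steps, querying the policy once per chunk."""
    executed, num_decisions, t = [], 0, 0
    while t < horizon:
        chunk = policy(observe(t))          # shape: (chunk_size, action_dim)
        num_decisions += 1
        steps = min(chunk_size, horizon - t)
        executed.extend(chunk[:steps])      # execute the chunk open-loop
        t += steps
    return np.stack(executed), num_decisions

def make_dummy_policy(chunk_size, action_dim):
    """Stand-in for a learned policy: repeats the observation as the action."""
    def policy(obs):
        return np.full((chunk_size, action_dim), obs, dtype=np.float32)
    return policy

# 200 control steps with T_a = 64 need only 4 policy queries instead of 200.
actions, decisions = rollout_with_chunking(
    make_dummy_policy(64, 14), observe=lambda t: float(t),
    horizon=200, chunk_size=64,
)
```

With fewer decisions per trajectory, there are fewer points at which a prediction error can feed back into later observations.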

Architecture Details {#app:arc}
====================

#### Encoding of Multi-Modal Inputs.

Encoding details are outlined below:

-   **Low-Dimensional Inputs.** The proprioception $\vz_t$ and the noisy action chunk $\tilde{\va}_{t:t+T_a}$ are first embedded into the unified action space. This space is used to unify the representation of $\vz_t$ and $\tilde{\va}_{t:t+T_a}$ across various robots, which is elaborated in Sec. `\ref{sec:model:data}`{=latex}. Then, they are encoded into the token space by a shared MLP since they have similar physical meanings. Such continuous encoding avoids the precision loss of discretized encoding [@brohan2022rt; @brohan2023rt; @kim2024openvla]. The control frequency $c$ and the diffusion time step $k$ are each encoded into the token space by a separate MLP. Afterward, all of these tokens are concatenated along the length dimension to achieve *in-context conditioning* [@peebles2023scalable; @bao2023all], resulting in an input token sequence of length $1+T_a+1+1$. Finally, position embeddings are added to distinguish different modalities and to inject temporal information in $\tilde{\va}_{t:t+T_a}$.

-   **Image Inputs.** We encode the RGB images with a frozen SigLIP [@zhai2023sigmoid] and utilize an additional MLP to project the output to the token space. To enhance the model's ability to distinguish images based on viewpoint and time steps, we extend traditional sinusoidal positional embeddings to multi-dimensional grids, as shown on the right side of Fig. `\ref{fig:framework}`{=latex}. This modification integrates spatial-temporal information, enabling the model to capture the relationships between input images. Specifically, we adopt the implementation by @liu2022petr, employing grid dimensions of $(T_\text{img}, N_\text{cam}, N_\text{patch}, D)$. Here, $N_\text{cam}$ is the number of cameras (three in our configuration), $N_\text{patch}$ is the number of patches into which the ViT-based image encoder divides each image, and $D$ is the embedding dimension.

-   **Language Inputs.** Language instruction is encoded by a frozen T5-XXL [@2020t5], and an MLP is used to project the output to the token space. When calculating attention for language tokens, we apply the language attention mask to mask out the pad tokens appended during batching.

During training, each input from various modalities is independently masked with a probability of $10\%$.
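
As a rough illustration of the low-dimensional token sequence and the modality masking described above, the following sketch uses single random linear maps in place of the MLP encoders. The hidden size, the token ordering, the omission of position embeddings, and the names `build_tokens` and `sample_modality_mask` are all assumptions made for this example; how a masked input is represented is also left out.

```python
import numpy as np

rng = np.random.default_rng(0)
UNIFIED_DIM, HIDDEN, T_A = 128, 32, 64   # HIDDEN is an illustrative choice

def make_encoder(in_dim, out_dim):
    """A single linear map standing in for an MLP encoder."""
    w = rng.normal(scale=0.02, size=(in_dim, out_dim))
    return lambda x: x @ w

state_action_enc = make_encoder(UNIFIED_DIM, HIDDEN)  # shared by z_t and the chunk
freq_enc = make_encoder(1, HIDDEN)
step_enc = make_encoder(1, HIDDEN)

def build_tokens(z_t, noisy_chunk, freq, k):
    """Concatenate all low-dimensional tokens along the length dimension."""
    return np.concatenate([
        state_action_enc(z_t[None, :]),                # 1 proprioception token
        state_action_enc(noisy_chunk),                 # T_a action tokens
        freq_enc(np.array([[freq]], dtype=float)),     # 1 frequency token
        step_enc(np.array([[k]], dtype=float)),        # 1 diffusion-step token
    ], axis=0)

def sample_modality_mask(names, p=0.1):
    """Independently decide, per modality, whether to mask it this step."""
    return {name: bool(rng.random() < p) for name in names}

tokens = build_tokens(
    rng.normal(size=UNIFIED_DIM), rng.normal(size=(T_A, UNIFIED_DIM)),
    freq=25.0, k=10,
)
mask = sample_modality_mask(["proprioception", "image", "language"])
```

The resulting `tokens` array has `1 + T_A + 1 + 1` rows, matching the sequence length given above.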

#### Network Structure of $f_{\vth}$.

After encoding, we feed the tokens of the low-dimensional inputs into the main network, which is adapted from Diffusion Transformers (DiTs) with Cross-Attention [@peebles2023scalable] due to their high scalability. For better training stability, we add QKNorm [@henry2020query] into each attention layer and replace each LayerNorm with RMSNorm [@zhang2019root]. In each DiT block's cross-attention layer, we alternately inject language and image tokens rather than injecting both simultaneously, avoiding the issue of token imbalance between the two modalities. After $L$ DiT blocks, we normalize the output and project it back to the action space via an MLP decoder.
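
The alternating injection can be sketched as follows. This is a simplified single-head illustration, not the actual network: self-attention, feed-forward layers, learned projections, and learnable norm scales are omitted, the query/key normalization is approximated with an RMS normalization, and starting with language on even-numbered blocks is an assumption.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learnable scale."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def cross_attention(x, cond):
    """Single-head cross-attention with normalized queries and keys."""
    q, k = rms_norm(x), rms_norm(cond)
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ cond

def forward(x, lang_tokens, img_tokens, num_blocks):
    """Alternate the conditioning modality from one block to the next."""
    schedule = []
    for layer in range(num_blocks):
        if layer % 2 == 0:
            cond, name = lang_tokens, "language"
        else:
            cond, name = img_tokens, "image"
        x = x + cross_attention(rms_norm(x), cond)   # pre-norm residual update
        schedule.append(name)
    return rms_norm(x), schedule

rng = np.random.default_rng(0)
x = rng.normal(size=(67, 16))       # low-dimensional input tokens
lang = rng.normal(size=(12, 16))    # a few language tokens
img = rng.normal(size=(300, 16))    # many more image tokens
out, schedule = forward(x, lang, img, num_blocks=4)
```

Because each block attends to only one modality, the much longer image sequence never competes with the short language sequence inside the same softmax.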

Physically Interpretable Unified Action Space {#unified_action_space}
=============================================

As mentioned in Sec. `\ref{sec:model:data}`{=latex}, we embed the actions of various robots into one unified space that includes all the main physical quantities of robots. This unified action space has a dimensionality of 128. Table `\ref{tab:state_vec_mapping}`{=latex} describes each element of the vector in this unified action space. For a specific robot, each element of the raw action vector is filled into the corresponding position of the unified action vector according to its physical meanings, with the remaining positions being padded.

```{=latex}
\centering
```
::: {#tab:state_vec_mapping}
  **Index Range**    **Element Index**       **Mapped Physical Quantity**
  ----------------- ------------------- ---------------------------------------
  `[0, 10)`                0--9                Right arm joint positions
  `[10, 15)`              10--14             Right gripper joint positions
  `[15, 25)`              15--24              Right arm joint velocities
  `[25, 30)`              25--29            Right gripper joint velocities
  `[30, 33)`              30--32             Right end effector positions
  `[33, 39)`              33--38              Right end effector 6D pose
  `[39, 42)`              39--41             Right end effector velocities
  `[42, 45)`              42--44         Right end effector angular velocities
  `[45, 50)`              45--49                       Reserved
  `[50, 60)`              50--59               Left arm joint positions
  `[60, 65)`              60--64             Left gripper joint positions
  `[65, 75)`              65--74               Left arm joint velocities
  `[75, 80)`              75--79             Left gripper joint velocities
  `[80, 83)`              80--82              Left end effector positions
  `[83, 89)`              83--88               Left end effector 6D pose
  `[89, 92)`              89--91             Left end effector velocities
  `[92, 95)`              92--94         Left end effector angular velocities
  `[95, 100)`             95--99                       Reserved
  `[100, 102)`           100--101               Base linear velocities
  `[102, 103)`              102                 Base angular velocities
  `[103, 128)`           103--127                      Reserved

  : **Description of the unified action space vector.** For a single-arm robot, the arm is mapped to the "right" arm. For a robot arm with only 6 DoF, its joint positions fill the first $6$ of the $10$ corresponding positions. The same applies to the other physical quantities.
:::
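
To make the mapping concrete, the sketch below fills a raw action into the 128-dimensional vector using a few of the index ranges from Table `\ref{tab:state_vec_mapping}`{=latex}. The slot names, the zero padding value, and the returned validity mask are assumptions of this example; the text above only states that unused positions are padded.

```python
import numpy as np

UNIFIED_DIM = 128
# A subset of the index ranges from the table (half-open intervals).
SLOTS = {
    "right_arm_joint_pos": (0, 10),
    "right_gripper_joint_pos": (10, 15),
    "left_arm_joint_pos": (50, 60),
    "left_gripper_joint_pos": (60, 65),
    "base_linear_vel": (100, 102),
    "base_angular_vel": (102, 103),
}

def to_unified(raw, pad_value=0.0):
    """Place each physical quantity at the start of its slot; pad the rest."""
    vec = np.full(UNIFIED_DIM, pad_value, dtype=np.float32)
    valid = np.zeros(UNIFIED_DIM, dtype=bool)
    for name, values in raw.items():
        start, end = SLOTS[name]
        values = np.asarray(values, dtype=np.float32)
        if values.size > end - start:
            raise ValueError(f"{name} has {values.size} values, slot holds {end - start}")
        vec[start:start + values.size] = values
        valid[start:start + values.size] = True
    return vec, valid

# A single-arm, 6-DoF robot with one gripper joint maps to the "right" arm.
vec, valid = to_unified({
    "right_arm_joint_pos": [0.1] * 6,
    "right_gripper_joint_pos": [0.5],
})
```

Here the six joint positions occupy indices 0 to 5 and the gripper occupies index 10, while every other position keeps the padding value.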

Pre-Training Datasets {#pretrain_dataset_detail}
=====================

Our pre-training dataset collection includes $46$ datasets, with a total scale of $1$M+ trajectories and $21$TB, making it the largest pre-training collection of robotics datasets to date. Table `\ref{tab:dataset_percentages}`{=latex} presents the complete list of our pre-training datasets and their sampling weights. We assign an initial weight of $\sqrt{N_j}$ to each dataset with size $N_j$ and adjust it according to the diversity and quality of each dataset. Compared to linear weighting, this approach prevents excessive sampling of large datasets while ensuring smaller datasets are adequately sampled, thus enhancing the diversity of pre-training samples in each mini-batch. During pre-training, we further monitored the intermediate loss on each dataset and adjusted the weights accordingly, increasing the weights of slowly converging datasets.
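
The difference between square-root and linear weighting can be seen in a small sketch. Only the initial $\sqrt{N_j}$ weights are modeled; the manual adjustments for diversity, quality, and convergence are not. The dataset names and sizes below are illustrative placeholders, and `sampling_weights` is a hypothetical helper.

```python
import numpy as np

def sampling_weights(sizes, power=0.5):
    """Normalized weights proportional to N_j ** power (0.5 gives sqrt weighting)."""
    names = list(sizes)
    raw = np.array([sizes[n] for n in names], dtype=float) ** power
    return dict(zip(names, raw / raw.sum()))

# Illustrative dataset sizes (number of trajectories).
sizes = {"large": 130_000, "medium": 76_000, "small": 1_000}
sqrt_w = sampling_weights(sizes)               # initial weights as described above
linear_w = sampling_weights(sizes, power=1.0)  # size-proportional baseline
```

Under square-root weighting the small dataset receives a noticeably larger share of samples than under linear weighting, and the large dataset a smaller one.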

#### Main Datasets.

We list some main datasets as follows:

-   **RT-1 Dataset** [@brohan2022rt] is a large, diverse dataset of $130$K trajectories spanning multiple tasks, objects, and environments. It is collected across $13$ different embodiments, each equipped with a single exterior RGB camera. The action space includes the 6D end effector (EEF), gripper open, and base displacement with a control frequency of $3$Hz.

-   **DROID** [@khazatsky2024droid] is a large-scale multi-task dataset with $76$K trajectories and $564$ scenes. It is collected via teleoperating a Franka Panda $7$-DoF Robot Arm, with both wrist and exterior RGB-D cameras. The action space includes $7$-DoF joint positions and a gripper width, while the proprioception additionally includes the 6D EEF with a control frequency of $15$Hz.

-   **RH20T** [@fang2023rh20t] is a comprehensive dataset covering 110K trajectories and 140 tasks. It includes four different robotic embodiments and three different camera views, sampled at a frequency of 10Hz. It also includes both long and short tasks. Its state space is a mix of 6-DoF and 7-DoF joint positions, and it features a third-person perspective RGB-D camera.

-   **Mobile ALOHA Dataset** [@fu2024mobile] is a bimanual dataset containing 1K+ trajectories collected by the Mobile ALOHA robot. Its state space includes base movements and 14-dimensional joint positions of both hands, along with three or four first-person perspective cameras. Some of its data includes wide-ranging perspective changes and base movements, which were originally suitable for imitation learning.

-   **Other Datasets.** The other data come from RH20T [@fang2023rh20t], RoboSet [@RoboHive], BridgeData V2 [@walke2023bridgedata], and Open X-Embodiment [@padalkar2023open]. Most of them feature different robot morphologies and camera observations, enhancing both the heterogeneity and the variety of our pre-training datasets.

#### Data Cleaning.

Duplicate episodes and failed episodes are excluded to ensure the quality of the pre-training datasets. We remove blank images, exclude erroneously recorded velocities, and filter out overly short trajectories. Overly long trajectories are downsampled to avoid unfairness.

#### Preprocessing of Multi-Modal Observation/Action Inputs.

We describe the preprocessing details of each modality:

-   **Language Instruction $\ell$.** We perform a simple cleaning on the raw text, such as removing illegal characters and extra spaces, capitalizing the beginning of sentences, and adding a period at the end of sentences. We leave the text variable-length.

-   **RGB Images $\mX_{t-T_{\text{img}}+1:t+1}$.** We employ a fixed-length image input strategy. We fix the image input order and format for all robots, with a total of three views: a static exterior view, a right-wrist view, and a left-wrist view, deemed sufficient for the requirements of most bimanual tasks. We treat a single-arm robot's wrist camera as the right-wrist one and pad the unavailable views with the background color. When fed into the model, each image is padded into a square and resized to $384\times 384$, keeping its original aspect ratio. Besides, we choose $T_{\text{img}}=2$ since a history length of two is adequate for most situations, striking a balance between efficiency and performance [@team2023octo; @wu2024embodied]. Finally, we can write the image inputs as $\mX_{t-1:t+1}:=(\{ \mX^1_{t-1}, \mX^2_{t-1},\mX^3_{t-1}\}, \{ \mX^1_{t}, \mX^2_{t},\mX^3_{t}\})$.

-   **Proprioception $\vz_t$ and Action Chunk $\va_{t:t+T_a}$.** We roughly align the scales of various datasets by unifying the units of physical quantities (m, rad, m/s, rad/s, etc.) rather than strictly normalizing to $[-1,1]$ or $\mathcal{N}(0, 1)$ as in prior work [@chi2023diffusionpolicy; @team2023octo]. For example, "$1$ (m)" in different datasets corresponds to the same real-world length. Rescaling the physical quantities would destroy such shared properties and thus impair the model's ability to transfer across robots. We also employ the 6D representation [@8953486] for the EEF rotation to overcome the gimbal lock issue.

    We chose $T_a=64$ after consulting the ablation studies of @zhao2023learning, balancing performance against computational overhead. Besides, historical proprioceptions $\vz_{i}, i<t$ are excluded to prevent the model from learning shortcuts that rely only on the low-dimensional inputs and thus sticking to fixed motion patterns. Instead, we encourage the model to learn generalizable decision-making structures from high-dimensional image features.

-   **Control Frequency $c$.** To handle the differing control frequencies across datasets, we feed the control frequency into the model, allowing it to take this variation into account when making decisions.
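
For readers unfamiliar with the 6D rotation representation mentioned above, the following sketch shows one common formulation: the 6D vector stacks the first two columns of the rotation matrix, and the matrix is recovered by Gram-Schmidt orthogonalization followed by a cross product. The column convention and the function names are assumptions of this example and may differ from the actual preprocessing code.

```python
import numpy as np

def matrix_to_6d(rot):
    """Keep the first two columns of a 3x3 rotation matrix."""
    return np.concatenate([rot[:, 0], rot[:, 1]])

def sixd_to_matrix(d6):
    """Recover a rotation matrix via Gram-Schmidt and a cross product."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1           # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)                   # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # b1, b2, b3 become the columns

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

rot = rot_z(0.3) @ rot_x(-1.1)
recovered = sixd_to_matrix(matrix_to_6d(rot))

# A perturbed 6D vector (e.g. a network output) still maps to a valid rotation.
noisy = matrix_to_6d(rot) + 0.05 * np.random.default_rng(0).normal(size=6)
projected = sixd_to_matrix(noisy)
```

Unlike Euler angles, this representation is continuous, so nearby rotations always have nearby encodings.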

```{=latex}
\centering
```
::: {#tab:dataset_percentages}
  Pre-Training Dataset                                                    Sample Percentage (%)
  ---------------------------------------------------------------------- -----------------------
  RT-1 Dataset [@brohan2022rt]                                                    9.00
  TACO Dataset [@rosete2022tacorl]                                                1.99
  JACO Play Dataset [@dass2023jacoplay]                                           1.10
  Cable Routing Dataset [@luo2023multistage]                                      0.27
  NYU Door Opening [@pari2021surprising]                                          0.33
  Viola [@zhu2022viola]                                                           0.40
  Berkeley UR5 [@BerkeleyUR5Website]                                              1.06
  TOTO [@zhou2023train]                                                           1.06
  Kuka [@kalashnikov2018qt]                                                       1.66
  Language Table [@lynch2022interactivelanguagetalkingrobots]                     3.32
  Columbia Cairlab Pusht Real [@chi2023diffusionpolicy]                           0.40
  Stanford Kuka Multimodal Dataset [@lee2019icra]                                 1.83
  Stanford Hydra Dataset  [@belkhale2023hydra]                                    0.80
  Austin Buds Dataset [@zhu2022bottom]                                            0.23
  Maniskill Dataset [@gu2023maniskill2]                                           5.78
  Furniture Bench Dataset [@heo2023furniturebench]                                2.36
  UCSD Kitchen Dataset [@ucsd_kitchens]                                           0.40
  UCSD Pick And Place Dataset [@Feng2023Finetuning]                               1.23
  Austin Sailor Dataset [@nasiriany2022sailor]                                    0.50
  Austin Sirius Dataset [@liu2022robot]                                           0.80
  BC Z [@jang2021bc]                                                              6.91
  UTokyo PR2 Opening Fridge [@oh2023pr2utokyodatasets]                            0.30
  UTokyo PR2 Tabletop Manipulation [@oh2023pr2utokyodatasets]                     0.50
  UTokyo Xarm Pick And Place [@matsushima2023weblab]                              0.33
  UTokyo Xarm Bimanual [@matsushima2023weblab]                                    0.03
  Berkeley MVP [@Radosavovic2022]                                                 0.73
  Berkeley RPT [@Radosavovic2022]                                                 1.00
  KAIST Nonprehensile [@kimpre]                                                   0.46
  Tokyo U LSMO [@Osa22]                                                           0.23
  DLR Sara Grid Clamp [@padalkar2023guided]                                       0.03
  Robocook [@shi2023robocook]                                                     1.66
  Imperialcollege Sawyer Wrist Cam [@imperialcollege_sawyer_wrist_cam]            0.43
  Iamlab CMU Pickup Insert [@saxena2023multiresolution]                           0.83
  UTAustin Mutex [@shah2023mutex]                                                 1.29
  Fanuc Manipulation [@fanuc_manipulation2023]                                    0.66
  Play Fusion [@chen2023playfusion]                                               0.80
  DROID [@khazatsky2024droid]                                                     10.06
  FMB [@luo2024fmbfunctionalmanipulationbenchmark]                                1.39
  Dobb·E [@shafiullah2023bringing]                                                1.20
  QUT Dexterous Manipulation [@ceola2023lhmanip]                                  0.46
  Aloha Dataset [@zhao2023learning]                                               4.98
  Mobile Aloha Dataset [@fu2024mobile]                                            4.98
  RoboSet [@RoboHive]                                                             4.48
  RH20T [@fang2023rh20t]                                                          10.99
  Calvin Dataset [@mees2022calvin]                                                3.32
  BridgeData V2 [@walke2023bridgedata]                                            7.44

  : The pre-training datasets and their corresponding weights.
:::

Fine-Tuning Dataset {#app:ft_dataset}
===================

Our fine-tuning dataset is collected with a Mobile ALOHA robot [@fu2024mobile] and includes $300$+ tasks, $6$K+ trajectories, and $3$M+ frames. It is also one of the largest open-source multi-task bimanual robot datasets to date. Fig. `\ref{fig:ft_dataset}`{=latex} gives a summary of this dataset. We have borrowed $3$ tasks ($140$ episodes in total) from the open-source [Songling dataset](https://openi.pcl.ac.cn/ARIO/Songling_datasets/datasets) [@wang2024all].

-   **Multi-Modal Features.** We collect the dataset with three RGB cameras positioned at the front and on the left and right grippers. We record dual-arm 6-DoF joint positions and velocities, along with the gripper angles. We manually annotated instructions for each task. To further augment our instructions and align them with the pre-training datasets, we utilize GPT-4-Turbo [@achiam2023gpt] to generate $100$ expanded instructions and one simplified instruction for each task. This multi-modal information further enhances the richness and quality of our dataset.

-   **Diverse Objects and Scenes.** Our dataset includes diverse tasks and scenes, encompassing more than 300 tasks, including skills such as picking up, inserting, writing, pushing, and pulling. It features 100+ objects with rigid and non-rigid bodies of various sizes and textures. We collect the dataset in 15+ scenes and introduce randomness during data collection for each task, such as varying the initial positions of objects and robots. To further increase diversity, we added random lighting conditions. For instance, pouring water was performed under both normal lighting and changing color conditions. These measures further enhance the diversity of our dataset.

-   **Challenging Tasks.** Various challenging tasks are also considered, encompassing dexterous manipulations, such as unscrewing the cap from a plastic bottle, and comprehension tasks, such as spelling \`\`love" with letter blocks. Furthermore, the dataset includes tasks that integrate both dexterity and comprehension, such as solving mathematical equations on the whiteboard. Additionally, our dataset incorporates bimanual tasks, such as inserting the charging cable into the phone. These complex, high-quality tasks further enhance the model's downstream comprehension and generalization.

```{=latex}
\centering
```
![**Fine-Tuning dataset.** Our dataset includes the following key features: (1) **Diverse Objects and Scenes.** Our dataset contains objects with different properties manipulated in different scenes and conditions. (2) **Challenging Tasks.** Our dataset incorporates dexterous manipulation, language and vision comprehension, and bimanual tasks. (3) **Multi-Modal Features.** Our dataset is annotated with rich multi-modal data, including 3-View RGB cameras, joint information, and augmented instructions.](ft_dataset.png){#fig:ft_dataset width="\\textwidth"}

RDT Training Details {#app:train}
====================

#### Platform.

We use PyTorch [@paszke2019pytorch] and DeepSpeed [@rasley2020deepspeed] to facilitate parallel training and employ a producer-consumer framework with TensorFlow Dataset [@TFDS] for fast data loading. Since most of the datasets in the Open X-Embodiment [@padalkar2023open] are stored in the form of `TFRecord`, we convert all pre-training datasets into `TFRecord` for storage. In pre-training, a producer process decompresses the data from `TFRecord` and stores it in a buffer on the hard disk. At the same time, a consumer process reads data from the buffer in random order and feeds it to model training. This not only decouples the TensorFlow [@tensorflow2015-whitepaper] and PyTorch environments but also alleviates the training performance loss caused by the small size of the in-memory shuffling buffer. In the fine-tuning stage, since the dataset is relatively small, we additionally implement a data reading pipeline using the HDF5 dataset for storage.

#### Padding Action and Proprioception.

To embed a specific robot action into the $128$-dimensional unified action space, we need to pad unavailable action elements. The usual practice is to pad with a 0 value or a specific value. But \`\`0\" actually has a physical meaning. For example, a speed of \`\`0\" generally represents stillness relative to the ground. This may confuse the model: Does \`\`0\" represent stillness or a filler value? To solve this problem, we concatenate the action and proprioception with a $0$-$1$ vector indicating whether each dimension is padded before encoding them into the token space, resulting in a $256$-dimensional vector. This can supplement the missing availability information and eliminate confusion.
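
The padding scheme can be sketched as follows. This is an illustrative example rather than the actual RDT code: the slot indices assigned to the example robot are hypothetical, and the mask convention (1 for an available dimension, 0 for a padded one) is an assumption, since the text only states that a $0$-$1$ vector marks padded dimensions.

```python
UNIFIED_DIM = 128  # size of the unified action space stated in the text


def embed_with_mask(values, indices, unified_dim=UNIFIED_DIM, pad_value=0.0):
    """Place robot-specific values into the unified vector and append an availability mask.

    values  : the physical quantities this robot actually provides
    indices : their (hypothetical) slots in the unified action space
    Returns a list of length 2 * unified_dim: [padded values | mask].
    """
    if len(values) != len(indices):
        raise ValueError("values and indices must have the same length")
    padded = [pad_value] * unified_dim
    mask = [0.0] * unified_dim  # assumed convention: 1.0 = available, 0.0 = padded
    for value, index in zip(values, indices):
        padded[index] = float(value)
        mask[index] = 1.0
    return padded + mask


# Hypothetical single-arm robot with 7 available dimensions in slots 0..6.
# Note that the first value is a genuine physical zero, not padding.
vec = embed_with_mask([0.0, 0.1, -0.2, 0.3, 0.0, 0.5, 1.0], list(range(7)))
```

Here `vec[0]` and `vec[7]` both hold the value zero, but the mask entries `vec[128]` and `vec[135]` differ, which lets the model tell a real zero from a filler.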

#### Inspecting Training Process.

During training, at fixed intervals, we run diffusion sampling and compare the sampled actions with the ground truth from the training dataset. Empirically, we observe that a lower Mean Squared Error (MSE) between the two generally corresponds to better deployment performance on the robot. This observation allows us to monitor the model's training progress easily. When this MSE converges, we can generally stop training. We note that an overly low MSE may also indicate overfitting.

#### Data Augmentation.

Overfitting is a common challenge in training large neural models, particularly in the fine-tuning phase. We utilize data augmentation techniques to resolve it. We perform image augmentation, including color jittering and image corruption, and add Gaussian noise to the input proprioception with a signal-to-noise ratio (SNR) of $40$dB. We also use GPT-4-Turbo to augment and expand the language instructions (Refer to Sec. `\ref{app:ft_dataset}`{=latex} for more details on the instruction augmentation).
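
As a worked example of the proprioception noise, the sketch below derives the noise standard deviation from a target SNR using $\text{SNR}_{\text{dB}} = 10\log_{10}(P_{\text{signal}}/P_{\text{noise}})$. It is illustrative only: measuring the signal power as the mean square of a single proprioception vector is an assumption, and the function names are hypothetical.

```python
import math
import random


def noise_std_for_snr(signal, snr_db):
    """Standard deviation of zero-mean Gaussian noise that yields the given SNR in dB."""
    # Assumption: signal power is the mean square of this vector.
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return math.sqrt(noise_power)


def add_gaussian_noise(signal, snr_db=40.0, rng=None):
    """Return a noisy copy of the signal at the requested SNR."""
    rng = rng or random.Random()
    std = noise_std_for_snr(signal, snr_db)
    return [x + rng.gauss(0.0, std) for x in signal]


# A unit-power signal at 40 dB gets noise with a standard deviation of about 0.01.
proprio = [1.0, -1.0, 1.0, -1.0]
noisy = add_gaussian_noise(proprio, snr_db=40.0, rng=random.Random(0))
```

At $40$dB the noise power is $10^{-4}$ times the signal power, so the perturbation is small relative to the signal while still discouraging the model from memorizing exact proprioceptive values.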

#### Some Fine-Tuning Details.

During fine-tuning, we removed the static segment at the beginning of each episode, which might be caused by the operator not reacting immediately after the recording started. Our language instructions are sampled from three sources, each with a probability of one-third: the original manually annotated instruction, the expanded instructions, and the simplified instruction. When the expanded instructions are drawn, we uniformly sample one of the $100$ expanded instructions corresponding to the task. We did not apply Classifier-Free Guidance (CFG) because we found that it did not improve the performance of the model and instead caused unstable robot arm behavior.
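
The instruction sampling described above can be sketched as a two-stage draw. This is an illustrative example, not the actual RDT code; the function name and the placeholder instruction strings are hypothetical.

```python
import random


def sample_instruction(original, expanded, simplified, rng=None):
    """Pick one of three instruction sources with probability 1/3 each.

    If the expanded source is chosen, draw uniformly among its entries.
    """
    rng = rng or random
    source = rng.choice(("original", "expanded", "simplified"))
    if source == "original":
        return original
    if source == "expanded":
        return rng.choice(expanded)
    return simplified


# Placeholder strings standing in for the annotations of one task.
original = "human-annotated instruction"
expanded = [f"expanded instruction {i}" for i in range(100)]
simplified = "simplified instruction"

rng = random.Random(0)
draws = [sample_instruction(original, expanded, simplified, rng) for _ in range(3000)]
```

Under this scheme each individual expanded instruction is drawn with probability $1/300$, while the original and the simplified instruction are each drawn with probability $1/3$.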

Hardware Details {#hardware_detail}
================

```{=latex}
\centering
```
```{=latex}
\centering
```
![Hardware features.](aloha_figure.png){#fig:aloha width="\\textwidth"}

```{=latex}
\hfill
```
```{=latex}
\centering
```
  Parameter                          Value
  -------------------- ---------------------------------
  DoF                          7 $\times$ 2 = 14
  Size                  1080 $\times$ 700 $\times$ 1140
  Arm weight                        4.2 kg
  Arm Payload                    3000 g (peak)
                                1500 g (valid)
  Arm reach                         600 mm
  Arm repeatability                  1 mm
  Arm working radius                653 mm
  Joint motion range      J1: ±154°, J2: 0°$\sim$165°
                         J3: -175°$\sim$0°, J4: ±106°
                             J5: ±75° , J6: ±100°
  Gripper range                     0-80 mm
  Gripper max force                  10 NM

```{=latex}
\captionof{table}{Technical specifications.}
```
`\label{tab:tech_spec}`{=latex}

We provide a detailed overview of the hardware configuration of our target dual-arm robot. Our model is deployed and evaluated on the Cobot Mobile ALOHA, a robot using the Mobile ALOHA system design [@fu2024mobile] and manufactured by [agilex.ai](https://global.agilex.ai). The key features of the robot are illustrated in Fig. `\ref{fig:aloha}`{=latex}. It is equipped with two wrist cameras, a front camera, a laptop, and an onboard battery. The robot's technical specifications are listed in Table `\ref{tab:tech_spec}`{=latex}. It is important to note that we used the \`\`mobile\" ALOHA only to facilitate transportation and testing between various scenes and did not use its autonomous mobility feature during any training or inference stages. Our tasks are still static bimanual manipulation tasks.

```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\resizebox{\linewidth}{!}{
\begin{tabular}{lccc}
\toprule
\multicolumn{1}{c}{\textbf{METHOD NAME}} & \multicolumn{1}{c}{\textbf{LARGE MODEL}}             & \multicolumn{1}{c}{\textbf{LARGE MULTI-ROBOT DATA}}   &  \multicolumn{1}{c}{\textbf{MODELING}}                                               \\ \hline
ACT~\citep{zhao2023learning} & \ding{55} & \ding{55} & VAE \\
OpenVLA~\citep{kim2024openvla} & \ding{51} & \ding{51} & Discretization \\
Octo~\citep{team2023octo} & \ding{55} & \ding{51} & Diffusion \\\hline
RDT (scratch) & \ding{51} & \ding{55} & Diffusion \\
RDT (small) & \ding{55} & \ding{51} & Diffusion \\
RDT (regress) & \ding{51} & \ding{51} & Regression \\
RDT (\textbf{ours}) & \ding{51} & \ding{51} & Diffusion
\\ \bottomrule
\end{tabular}
}
```
Experiment Details {#app:exp}
==================

#### Calculation of Total Performance.

The general performance in Fig. `\ref{fig:head-demo}`{=latex} of each method is calculated in three steps. Firstly, we calculate the success rate of a method in each task. We take an average of the total success rate and any additional requirement, i.e., the average of the values in the *Total* column and all columns to its right in Table `\ref{tbl:result}`{=latex}. For example, in the *Pour Water-L-1/3*, we take the average of *Total*, *Correct Hand*, and *Correct Amount*. Secondly, we calculate the success rate of each dimension of *Unseen Object*, *Unseen Scene*, *Instruction Following*, *Few-Shot Learning*, and *Dexterity* by averaging all the tasks in this dimension (see Table `\ref{tbl:taskdim}`{=latex} for the correspondence). Lastly, we average the success rates of all the dimensions to obtain the overall result.
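
The three-step aggregation can be sketched as below. The numbers and the task name `Task B` are made up purely for illustration and are not results from our experiments; the column names follow the *Pour Water-L-1/3* example above.

```python
from statistics import mean


def task_score(columns):
    """Step 1: average the Total column and every additional-requirement column."""
    return mean(columns.values())


def overall_score(task_columns, dimension_tasks):
    """Steps 2 and 3: average tasks within each dimension, then average the dimensions."""
    per_task = {task: task_score(cols) for task, cols in task_columns.items()}
    per_dim = {
        dim: mean(per_task[task] for task in tasks)
        for dim, tasks in dimension_tasks.items()
    }
    return mean(per_dim.values()), per_dim


# Hypothetical success rates in percent, not taken from the result table.
task_columns = {
    "Pour Water-L-1/3": {"Total": 50.0, "Correct Hand": 100.0, "Correct Amount": 75.0},
    "Task B": {"Total": 25.0},
}
dimension_tasks = {
    "Instruction Following": ["Pour Water-L-1/3"],
    "Dexterity": ["Task B"],
}

overall, per_dim = overall_score(task_columns, dimension_tasks)
```

With these placeholder values, the first task scores $(50+100+75)/3=75$, the two dimensions score $75$ and $25$, and the overall result is $50$.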

#### Implementation and Hyper-Parameters of RDT.

We list the details of the multi-modal encoders in Table `\ref{tab:encoder_configs}`{=latex} and the model parameters in Table `\ref{tab:model_configs}`{=latex}. The image history size is $T_{\text{img}}=2$, the action chunk size is $T_a = 64$, the language token space dimension is $4096$, the image token space dimension is $1152$, and the token space dimension of RDT is $2048$. We use adaptors to align each modality's token dimension to $2048$. All adaptors for the multi-modal encoders use the GeLU activation [@hendrycks2016gaussian].

We use the AdamW optimizer [@adam2019no] with a constant learning rate scheduler and the hyper-parameters in Table `\ref{tab:rdt_hyper_params}`{=latex} in both the pre-training and fine-tuning stages. The model is pre-trained and fine-tuned on $48$ H100 80GB GPUs for $1$M steps and $130$K steps, respectively. Due to scheduling reasons, we did not start fine-tuning from the 1M pre-trained checkpoint but chose the 500K checkpoint. During the training stage, we use the DDPM scheduler with a glide cosine scheduler (i.e., `squaredcos_cap_v2`) and a step number of $1000$. During the sampling stage, we utilize the DPM-Solver++ [@lu2022dpm] with a glide cosine scheduler and a sampling step number of $5$. During fine-tuning, we also filter out episodes shorter than $32$ steps and down-sample those longer than $2048$ steps to $2048$.
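
To make the noise schedule concrete, the sketch below computes the per-step $\beta$ values of a capped squared-cosine schedule. It assumes that `squaredcos_cap_v2` denotes the usual formulation with $\bar{\alpha}(t)=\cos^2\!\big(\tfrac{t+0.008}{1.008}\cdot\tfrac{\pi}{2}\big)$ and a cap of $0.999$ on each $\beta$; the offset, the cap, and the function name are assumptions of this example rather than values stated in this appendix.

```python
import math


def squaredcos_cap_v2_betas(num_steps=1000, max_beta=0.999):
    """Capped squared-cosine beta schedule (assumed offset 0.008, cap 0.999)."""

    def alpha_bar(t):
        # Cumulative signal level at normalized time t in [0, 1].
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1 = i / num_steps
        t2 = (i + 1) / num_steps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped for numerical stability.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = squaredcos_cap_v2_betas(num_steps=1000)
```

The schedule adds very little noise in the early steps and grows steeply near the end, where the cap prevents $\beta$ from reaching $1$.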

```{=latex}
\centering
```
::: {#tab:encoder_configs}
  Modality             Encoder            Trainable     Adaptor
  ---------- --------------------------- ----------- --------------
  Language        T5-XXL [@2020t5]            N       2-layers MLP
  Image       SigLIP [@zhai2023sigmoid]       N       2-layers MLP
  Action                 \-                  \-       3-layers MLP

  : Encoder configurations of RDT.
:::

```{=latex}
\centering
```
::: {#tab:model_configs}
  Model     Layers   Hidden size   Heads   \#Params  
  -------- -------- ------------- ------- ---------- --
  RDT-1B      28        2048        32       1.2B    

  : Model configurations for RDT.
:::

```{=latex}
\centering
```
::: {#tab:rdt_hyper_params}
  **Hyper-Parameter**       **Value**
  --------------------- ------------------
  Batch Size               32$\times$48
  Learning Rate          $1\times10^{-4}$
  Mixed Precision             `bf16`
  Warm-Up Steps               $500$
  $\beta_1$                   $0.9$
  $\beta_2$                  $0.999$
  Weight Decay           $1\times10^{-2}$
  $\epsilon$             $1\times10^{-8}$

  : Hyper-parameters for both pre-training and fine-tuning RDT.
:::

#### Implementation and Hyper-Parameters of ACT.

We directly employed the same architecture and hyper-parameters of ACT as those in the original paper [@fu2024mobile], except for the hyper-parameters in Table `\ref{tab:act_hyper_params}`{=latex}. We trained ACT with $90\%$ of the $6$K fine-tuning episodes for $8000$ epochs (about $8$ days in total), while the remaining $10\%$ is treated as the validation set. We took the checkpoint at epoch $5413$ as the final model because it achieved the best performance on the validation set.

```{=latex}
\centering
```
::: {#tab:act_hyper_params}
  **Hyper-Parameter**              **Value**
  ---------------------------- ------------------
  Batch Size                      80$\times$4
  Learning Rate                 $9\times10^{-5}$
  Learning Rate for Backbone    $4\times10^{-5}$

  : Adapted hyper-parameters of ACT.
:::

#### Implementation and Hyper-Parameters of OpenVLA.

We adopt the official implementation (<https://github.com/openvla/openvla>) and the flagship pre-trained checkpoint at <https://huggingface.co/openvla/openvla-7b>. For each task in evaluation, we further fine-tune the officially pre-trained OpenVLA with all the task-relevant demonstrations ($\sim100$ episodes) from the fine-tuning dataset to facilitate convergence and train the model to around $95\%$ action token accuracy as suggested by @kim2024openvla (<https://github.com/openvla/openvla/issues/12#issuecomment-2203772810>). Additionally, we experimented with both full-parameter tuning and LoRA methods using the entire dataset but did not achieve sufficient action token accuracy for deployment upon convergence (it reached only approximately $60\%$; see Fig. `\ref{fig:train_openvla}`{=latex}). According to real-robot testing, such non-convergent checkpoints exhibit completely static or random behaviors in deployment.

Concretely, we adhere to the same hyper-parameters claimed in @kim2024openvla for fine-tuning via LoRA [@hu2021lora] as detailed in Table `\ref{tab:openvla_hyper_params}`{=latex}.

```{=latex}
\centering
```
![The accuracy of action token prediction fluctuates rather than converges with the number of training steps when fine-tuning OpenVLA with the full fine-tuning dataset.](openvla_train_acc.png){#fig:train_openvla width="\\linewidth"}

```{=latex}
\centering
```
::: {#tab:openvla_hyper_params}
  **Hyper-Parameter**       **Value**
  --------------------- ------------------
  Batch Size               16$\times$8
  Learning Rate          $2\times10^{-5}$
  Lora Rank                     32
  Image Augmentation          `True`

  : Hyper-parameters of fine-tuning OpenVLA for bimanual manipulations.
:::

#### Implementation and Hyper-Parameters of Octo.

We utilize the official implementation available at <https://github.com/octo-models/octo> and the most comprehensive pre-trained model, `octo-base-1.5`, hosted at <https://huggingface.co/rail-berkeley/octo-base-1.5>. We follow the officially recommended practices for fine-tuning a bimanual robot, detailed in <https://github.com/octo-models/octo/blob/main/examples/02_finetune_new_observation_action.py>, employing a full-parameter approach. Additionally, we have incorporated an extra image tokenizer to process images from the right-wrist camera, enhancing the system's manipulation capabilities. We replicate the wrist image tokenizer from the pre-trained model to initialize the right-wrist image tokenizer. Furthermore, we apply image augmentation during fine-tuning to improve performance upon real-world deployment. Similar to OpenVLA, we only fine-tune Octo with the task-relevant demonstrations for each evaluation task, because fine-tuning on the full dataset does not reach a sufficiently low test MSE for deployment upon convergence (it stays at approximately $10^{-1}$; see Fig. `\ref{fig:train_octo}`{=latex}). Concretely, we apply the default hyper-parameters with the variations listed in Table `\ref{tab:octo_hyper_params}`{=latex}:

```{=latex}
\centering
```
![The test MSE of action prediction fluctuates rather than converges with the number of training steps when fine-tuning Octo with the full fine-tuning dataset.](octo_test_mse.png){#fig:train_octo width="\\linewidth"}

```{=latex}
\centering
```
::: {#tab:octo_hyper_params}
  **Hyper-Parameter**                                        **Value**
  ------------------------------------------------ ------------------------------
  Action Head Type                                     `DiffusionActionHead`
  Batch Size                                                 8$\times$8
  Action Chunk Size                                              8
  `\multirow{ 4}{*}{Image Augmentation}`{=latex}      `RandomBrightness(0.1)`
                                                     `RandomContrast(0.9, 1.1)`
                                                    `RandomSaturation(0.9, 1.1)`
                                                         `RandomHue(0.05)`

  : Hyper-parameters of fine-tuning Octo for bimanual manipulations.
:::

More Results
============

We further provide a zoomed-in view of the water levels across $8$ trials in the instruction-following evaluation in Fig. `\ref{fig:water}`{=latex}.

```{=latex}
\centering
```
![Visualization of the resulting water levels across $8$ trials in *Pour Water-L-1/3* and *Pour Water-R-2/3*. **Left:** The water level achieved by RDT in each trial is extremely close to the ground-truth 1/3 standard. **Right:** RDT made one mistake in pouring (empty cup) and one mistake in water level, but the other trials were in roughly good agreement with 2/3. ](water.png){#fig:water width="\\linewidth"}

[^1]: Equal contribution;  $^\dag$Corresponding authors at dcszj\@tsinghua.edu.cn

[^2]: E.g., $\vz_{t}$ may include the gripper position at time $t$, and $\va_t$ can be the target gripper position at step $t+1$.

[^3]: *unseen* means that a certain element has not appeared in the training data.