---
abstract: |
  This paper introduces Diffusion Policy, a new way of generating robot behavior by representing a robot's visuomotor policy as a conditional denoising diffusion process. We benchmark Diffusion Policy across 15 different tasks from 4 different robot manipulation benchmarks and find that it consistently outperforms existing state-of-the-art robot learning methods with an average improvement of 46.9%. Diffusion Policy learns the gradient of the action-distribution score function and iteratively optimizes with respect to this gradient field during inference via a series of stochastic Langevin dynamics steps. We find that the diffusion formulation yields powerful advantages when used for robot policies, including gracefully handling multimodal action distributions, being suitable for high-dimensional action spaces, and exhibiting impressive training stability. To fully unlock the potential of diffusion models for visuomotor policy learning on physical robots, this paper presents a set of key technical contributions including the incorporation of receding horizon control, visual conditioning, and the time-series diffusion transformer. We hope this work will help motivate a new generation of policy learning techniques that are able to leverage the powerful generative modeling capabilities of diffusion models. Code, data, and training details are available at [diffusion-policy.cs.columbia.edu](diffusion-policy.cs.columbia.edu).
author:
- 'Cheng Chi$^{*}$`\affilnum{1}`{=latex}, Zhenjia Xu$^{*}$`\affilnum{1}`{=latex}, Siyuan Feng`\affilnum{2}`{=latex}, Eric Cousineau`\affilnum{2}`{=latex}, Yilun Du`\affilnum{3}`{=latex}, Benjamin Burchfiel`\affilnum{2}`{=latex}, Russ Tedrake `\affilnum{2,3}`{=latex}, Shuran Song`\affilnum{1,4}`{=latex}'
bibliography:
- references.bib
title: 'Diffusion Policy: Visuomotor Policy Learning via Action Diffusion'
---

```{=latex}
\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
```
```{=latex}
\let\labelindent\relax
```
```{=latex}
\newcommand{\cmark}{\textcolor{green}{\ding{51}}}
```
```{=latex}
\newcommand{\xmark}{\textcolor{red}{\ding{55}}}
```
```{=latex}
\newcommand{\rev}[1]{#1}
```
```{=latex}
\newcommand{\shuran}[1]{\textcolor{MyDarkGreen}{[Shuran: #1]}}
```
```{=latex}
\newcommand{\zhenjia}[1]{\textcolor{MyDarkBlue}{[Zhenjia: #1]}}
```
```{=latex}
\newcommand{\cheng}[1]{\textcolor{MyPurple}{[Cheng: #1]}}
```
```{=latex}
\newcommand{\xian}[1]{\textcolor{MyDarkOrange}{[Xian: #1]}}
```
```{=latex}
\newcommand{\xingyu}[1]{\textcolor{MyDarkgray}{[Xingyu: #1]}}
```
```{=latex}
\newcommand{\zhiao}[1]{\textcolor{MyPink}{[Zhiao: #1]}}
```
```{=latex}
\newcommand{\yilun}[1]{\textcolor{MyPink}{[Yilun: #1]}}
```
```{=latex}
\newcommand{\ben}[1]{\textcolor{MyGold}{[Ben: #1]}}
```
```{=latex}
\newcommand{\sfeng}[1]{\textcolor{MyCyan}{[sfeng: #1]}}
```
```{=latex}
\newcommand\todo[1]{\textcolor{red}{[TODO: #1]}}
```
```{=latex}
\newcommand\ijrrupdate[1]{\textcolor{blue}{#1}}
```
```{=latex}
\newcommand{\mypara}[1]{\par\vspace*{0mm} \textbf{{#1}}}
```
```{=latex}
\def\halfcheckmark{\tikz\draw[scale=0.4,fill=black](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle (0.75,0.2) -- (0.77,0.2)  -- (0.6,0.7) -- cycle;}
```
```{=latex}
\def\OURS{Diffusion Policy\xspace}
```
```{=latex}
\newcommand{\legalTM}{\textsuperscript{\texttrademark}}
```
```{=latex}
\newcommand*\circled[1]{\tikz[baseline=(char.base)]{
            \node[shape=circle,draw,inner sep=2pt] (char) {#1};}}
```
```{=latex}
\def\volumeyear{2016}
```
```{=latex}
\runninghead{Diffusion Policy}
```
```{=latex}
\affiliation{\affilnum{*} Joint First Author \\
\affilnum{1}Columbia University, US
\affilnum{2}Toyota Research Institute, US.
\affilnum{3}MIT, US.
\affilnum{4}Stanford University, US.
}
```
```{=latex}
\corrauth{Cheng Chi, Columbia University, US}
```
```{=latex}
\email{chenng.chi@columbia.edu}
```
```{=latex}
\keywords{Imitation learning, visuomotor policy, manipulation}
```
```{=latex}
\twocolumn[{
\renewcommand\twocolumn[1][]{#1}
\maketitle
    \vspace{-5mm}
\begin{center}
    \includegraphics[width=0.95\textwidth]{figure/DP_teaser.pdf}
    \captionof{figure}{\textbf{Policy Representations.} \label{fig:policy_rep} a) Explicit policy with different types of action representations.  b) Implicit policy learns an energy function conditioned on both action and observation and optimizes for actions that minimize the energy. c) Diffusion Policy refines noise into actions via a learned gradient field. This formulation provides stable training, allows the learned policy to accurately model multimodal action distributions, and accommodates high-dimensional action sequences. 
    } 
\end{center}
}]
```
Introduction
============

Policy learning from demonstration, in its simplest form, can be formulated as the supervised regression task of learning to map observations to actions. In practice, however, the unique nature of predicting robot actions, such as the existence of multimodal distributions, sequential correlation, and the requirement of high precision, makes this task distinct and challenging compared to other supervised learning problems.

Prior work attempts to address this challenge either by exploring different *action representations* (Fig `\ref{fig:policy_rep}`{=latex} a), such as mixtures of Gaussians [@robomimic] or categorical representations of quantized actions [@bet], or by switching *the policy representation* (Fig `\ref{fig:policy_rep}`{=latex} b) from explicit to implicit to better capture multimodal distributions [@ibc; @wu2020spatial].

In this work, we seek to address this challenge by introducing a new form of robot visuomotor policy that generates behavior via a "conditional denoising diffusion process [@ho2020denoising] on robot action space", **Diffusion Policy**. In this formulation, instead of directly outputting an action, the policy infers the action-score gradient, conditioned on visual observations, for $K$ denoising iterations (Fig. `\ref{fig:policy_rep}`{=latex} c). This formulation allows robot policies to inherit several key properties from diffusion models -- significantly improving performance.

-   **Expressing multimodal action distributions.** By learning the gradient of the action score function [@song2019score] and performing Stochastic Langevin Dynamics sampling on this gradient field, Diffusion Policy can express arbitrary normalizable distributions [@neal2011mcmc], including multimodal action distributions, a well-known challenge for policy learning.

-   **High-dimensional output space.** As demonstrated by their impressive image generation results, diffusion models have shown excellent scalability to high-dimensional output spaces. This property allows the policy to jointly infer a *sequence* of future actions instead of *single-step* actions, which is critical for encouraging temporal action consistency and avoiding myopic planning.

-   **Stable training.** Training energy-based policies often requires negative sampling to estimate an intractable normalization constant, which is known to cause training instability [@du2020improved; @ibc]. Diffusion Policy bypasses this requirement by learning the gradient of the energy function and thereby achieves stable training while maintaining distributional expressivity.

Our **primary contribution** is to bring the above advantages to the field of robotics and demonstrate their effectiveness on complex real-world robot manipulation tasks. To successfully employ diffusion models for visuomotor policy learning, we present the following technical contributions that enhance the performance of Diffusion Policy and unlock its full potential on physical robots:

-   **Closed-loop action sequences.** We combine the policy's capability to predict high-dimensional action sequences with *receding-horizon control* to achieve robust execution. This design allows the policy to continuously re-plan its action in a closed-loop manner while maintaining temporal action consistency -- achieving a balance between long-horizon planning and responsiveness.

-   **Visual conditioning.** We introduce a vision-conditioned diffusion policy, where the visual observations are treated as conditioning instead of a part of the joint data distribution. In this formulation, the policy extracts the visual representation once regardless of the denoising iterations, which drastically reduces the computation and enables real-time action inference.

-   **Time-series diffusion transformer.** We propose a new transformer-based diffusion network that minimizes the over-smoothing effects of typical CNN-based models and achieves state-of-the-art performance on tasks that require high-frequency action changes and velocity control.

We systematically evaluate Diffusion Policy across **15** tasks from **4** different benchmarks [@ibc; @gupta2019relay; @robomimic; @bet] under the behavior cloning formulation. The evaluation includes both simulated and real-world environments, 2DoF to 6DoF actions, single- and multi-task benchmarks, and fully- and under-actuated systems, with rigid and fluid objects, using demonstration data collected by single and multiple users.

Empirically, we find a **consistent** performance boost across all benchmarks with an average improvement of 46.9%, providing strong evidence of the effectiveness of Diffusion Policy. We also provide detailed analysis to carefully examine the characteristics of the proposed algorithm and the impacts of the key design decisions.

This work is an extended version of the conference paper [@chi2023diffusionpolicy]. We expand the content of this paper in the following ways:

-   Include a new discussion section on the connections between diffusion policy and control theory. See Sec. `\ref{sec:control}`{=latex}.

-   Include additional ablation studies in simulation on alternative network architecture design and different pretraining and finetuning paradigms, Sec. `\ref{sec:arch_ablation}`{=latex}.

-   Extend the real-world experimental results with three bimanual manipulation tasks including Egg Beater, Mat Unrolling, and Shirt Folding in Sec. `\ref{sec:eval_bimanual}`{=latex}.

The code, data, and training details for reproducing our results are publicly available at [diffusion-policy.cs.columbia.edu](diffusion-policy.cs.columbia.edu).

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=\linewidth]{figure/policy_input_output.pdf}
    \caption{\textbf{Diffusion Policy Overview} \label{fig:policy_io} a) General formulation. At time step $t$, the policy takes the latest $T_o$ steps of observation data $O_t$ as input and outputs $T_a$ steps of actions $A_t$.  b) In the CNN-based Diffusion Policy, FiLM (Feature-wise Linear Modulation) \cite{perez2018film} conditioning of the observation feature $O_t$ is applied to every convolution layer, channel-wise. Starting from $\mathbf{A}^K_t$ drawn from Gaussian noise, the output of noise-prediction network $\epsilon_\theta$ is subtracted, repeating $K$ times to get $\mathbf{A}^0_t$, the denoised action sequence. c) In the Transformer-based \cite{vaswani2017attention} Diffusion Policy, the embedding of observation $\mathbf{O}_t$ is passed into a multi-head cross-attention layer of each transformer decoder block. Each action embedding is constrained to only attend to itself and previous action embeddings (causal attention) using the attention mask illustrated.  }
\end{figure*}
```
Diffusion Policy Formulation {#sec:method}
============================

We formulate visuomotor robot policies as Denoising Diffusion Probabilistic Models (DDPMs) [@ho2020denoising]. Crucially, Diffusion policies are able to express complex multimodal action distributions and possess stable training behavior -- requiring little task-specific hyperparameter tuning. The following sections describe DDPMs in more detail and explain how they may be adapted to represent visuomotor policies.

Denoising Diffusion Probabilistic Models {#sec:ddpm}
----------------------------------------

DDPMs are a class of generative model where the output generation is modeled as a denoising process, often called Stochastic Langevin Dynamics [@welling2011bayesian].

Starting from $\mathbf{x}^K$ sampled from Gaussian noise, the DDPM performs $K$ iterations of denoising to produce a series of intermediate samples with decreasing levels of noise, $\mathbf{x}^{K}, \mathbf{x}^{K-1}, \ldots, \mathbf{x}^{0}$, until a desired noise-free output $\mathbf{x}^0$ is formed. The process follows the equation `\vspace{-1mm}`{=latex} $$\mathbf{x}^{k-1}=\alpha(\mathbf{x}^{k}-\gamma\epsilon_\theta(\mathbf{x}^k,k) + \mathcal{N} \bigl(0, \sigma^2 I \bigr)),
    \label{eq:unconditional_langevin}
\vspace{-1mm}$$ where $\epsilon_\theta$ is the noise prediction network with parameters $\theta$ that will be optimized through learning and $\mathcal{N} \bigl(0, \sigma^2 I \bigr)$ is Gaussian noise added at each iteration.

Eq `\ref{eq:unconditional_langevin}`{=latex} may also be interpreted as a single noisy gradient descent step: `\vspace{-2mm}`{=latex} $$\mathbf{x}'=\mathbf{x}-\gamma\nabla E(\mathbf{x}),
    \label{eq:gradient_descent}
\vspace{-1mm}$$ where the noise prediction network $\epsilon_\theta(\mathbf{x},k)$ effectively predicts the gradient field $\nabla E(\mathbf{x})$, and $\gamma$ is the learning rate.

The choice of $\alpha,\gamma,\sigma$ as functions of iteration step $k$, also called the noise schedule, can be interpreted as learning rate scheduling in the gradient descent process. An $\alpha$ slightly smaller than $1$ has been shown to improve stability [@ho2020denoising]. Details about the noise schedule will be discussed in Sec `\ref{sec:method-noise-schedule}`{=latex}.
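
As a minimal sketch of the update in Eq `\ref{eq:unconditional_langevin}`{=latex}, the loop below runs $K$ noisy gradient steps on a one-dimensional toy problem. The quadratic energy, the constant values of $\alpha$, $\gamma$, $\sigma$, and the names `toy_eps` and `denoise` are assumptions made for illustration; they stand in for the learned $\epsilon_\theta$ and a real noise schedule.

```python
import random

def toy_eps(x, k, mu=2.0):
    """Stand-in for eps_theta(x, k): gradient of E(x) = 0.5 * (x - mu)^2."""
    return x - mu

def denoise(K=100, alpha=1.0, gamma=0.1, sigma=0.0, seed=0):
    """Iterate x^{k-1} = alpha * (x^k - gamma * eps(x^k, k) + N(0, sigma^2))."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)              # x^K drawn from Gaussian noise
    for k in range(K, 0, -1):
        noise = rng.gauss(0.0, sigma)    # per-iteration Gaussian perturbation
        x = alpha * (x - gamma * toy_eps(x, k) + noise)
    return x
```

With `sigma=0.0` the loop reduces to plain gradient descent and the sample contracts toward the energy minimum at `mu`; with a small positive `sigma` it fluctuates around that minimum instead.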

DDPM Training {#sec:ddpm_inference}
-------------

The training process starts by randomly drawing unmodified examples, $\mathbf{x}^0$, from the dataset. For each sample, we randomly select a denoising iteration $k$ and then sample a random noise $\mathbf{\epsilon}^k$ with appropriate variance for iteration $k$. The noise prediction network is asked to predict the noise from the data sample with noise added.

```{=latex}
\vspace{-2mm}
```
$$\mathcal{L} = MSE(\mathbf{\epsilon}^k, \epsilon_\theta(\mathbf{x}^0+\mathbf{\epsilon}^k,k))
    \label{eq:unconditional_loss}$$

As shown in [@ho2020denoising], minimizing the loss function in Eq `\ref{eq:unconditional_loss}`{=latex} also minimizes the variational lower bound of the KL-divergence between the data distribution $p(\mathbf{x}^0)$ and the distribution of samples drawn from the DDPM $q(\mathbf{x}^0)$ using Eq `\ref{eq:unconditional_langevin}`{=latex}.
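
The training step behind Eq `\ref{eq:unconditional_loss}`{=latex} can be sketched as follows. This is an illustrative single-sample estimate, not the training code used in our experiments: the per-iteration noise scales in `sigmas`, the `eps_model` callable, and the oracle predictor are all hypothetical.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def training_loss(x0, eps_model, sigmas, rng):
    """One Monte Carlo sample of L = MSE(eps^k, eps_theta(x^0 + eps^k, k))."""
    k = rng.randrange(1, len(sigmas) + 1)              # random denoising iteration
    eps = [rng.gauss(0.0, sigmas[k - 1]) for _ in x0]  # noise scaled for iteration k
    noisy = [xi + ei for xi, ei in zip(x0, eps)]       # data sample with noise added
    return mse(eps, eps_model(noisy, k))

# An oracle that knows x0 recovers the added noise, so its loss is (numerically) zero.
x0 = [0.5, -1.0, 2.0]

def oracle(noisy, k):
    return [ni - xi for ni, xi in zip(noisy, x0)]

loss = training_loss(x0, oracle, sigmas=[0.1, 0.5, 1.0], rng=random.Random(0))
```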

Diffusion for Visuomotor Policy Learning
----------------------------------------

While DDPMs are typically used for image generation ($\mathbf{x}$ is an image), we use a DDPM to learn robot visuomotor policies. This requires two major modifications in the formulation: 1. changing the output $\mathbf{x}$ to represent robot actions. 2. making the denoising processes *conditioned* on input observation $\mathbf{O}_t$. The following paragraphs discuss each of the modifications, and Fig. `\ref{fig:policy_io}`{=latex} shows an overview.

**Closed-loop action-sequence prediction:** An effective action formulation should encourage temporal consistency and smoothness in long-horizon planning while allowing prompt reactions to unexpected observations. To accomplish this goal, we commit to the action-sequence prediction produced by a diffusion model for a fixed duration before replanning. Concretely, at time step $t$ the policy takes the latest $T_o$ steps of observation data $\mathbf{O}_t$ as input and predicts $T_p$ steps of actions, of which $T_a$ steps of actions are executed on the robot without re-planning. Here, we define $T_o$ as the observation horizon, $T_p$ as the action prediction horizon and $T_a$ as the action execution horizon. This encourages temporal action consistency while remaining responsive. More details about the effects of $T_a$ are discussed in Sec `\ref{sec:action_sequence}`{=latex}. Our formulation also allows receding horizon control [@mayne1988receding] to further improve action smoothness by warm-starting the next inference setup with the previous action sequence prediction.
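
The interplay of $T_o$, $T_p$ and $T_a$ can be sketched as a simple control loop. The function name, the default horizon values, and the toy policy and environment below are assumptions for illustration only; warm-starting is omitted.

```python
from collections import deque

def run_closed_loop(policy, env_step, obs0, T_o=2, T_p=16, T_a=8, max_steps=32):
    """Execute the first T_a of every T_p predicted actions, then re-plan."""
    history = deque([obs0] * T_o, maxlen=T_o)   # latest T_o observations
    executed, n_plans = [], 0
    while len(executed) < max_steps:
        plan = policy(list(history), T_p)       # predict T_p future actions
        n_plans += 1
        for action in plan[:T_a]:               # commit to T_a actions
            history.append(env_step(action))    # new observation after acting
            executed.append(action)
            if len(executed) == max_steps:
                break
    return executed, n_plans

# Toy usage: the "environment" echoes the action back as the next observation.
def toy_policy(history, T_p):
    return [history[-1] + i + 1 for i in range(T_p)]

executed, n_plans = run_closed_loop(toy_policy, lambda a: a, obs0=0)
```

In this toy run, 32 executed steps require only 4 policy calls, since each call commits to $T_a = 8$ actions.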

**Visual observation conditioning:** We use a DDPM to approximate the conditional distribution $p(\mathbf{A}_t | \mathbf{O}_t)$ instead of the joint distribution $p(\mathbf{A}_t,\mathbf{O}_t)$ used in @janner2022diffuser for planning. This formulation allows the model to predict actions conditioned on observations without the cost of inferring future states, speeding up the diffusion process and improving the accuracy of generated actions. To capture the conditional distribution $p(\mathbf{A}_t |\mathbf{O}_t)$, we modify Eq `\ref{eq:unconditional_langevin}`{=latex} to: $$\label{eq:diffusion_policy_langevin}
    \mathbf{A}^{k-1}_t = \alpha(\mathbf{A}^k_t - \gamma\epsilon_\theta(\mathbf{O}_t,\mathbf{A}^k_t,k) + \mathcal{N} \bigl(0, \sigma^2 I \bigr))$$ The training loss is modified from Eq `\ref{eq:unconditional_loss}`{=latex} to: $$\label{eq:diffusion_policy_loss}
    \mathcal{L}=MSE(\mathbf{\epsilon}^k,\epsilon_\theta(\mathbf{O}_t, \mathbf{A}^0_t + \mathbf{\epsilon}^k, k))$$

The exclusion of observation features $\mathbf{O}_t$ from the output of the denoising process significantly improves inference speed and better accommodates real-time control. It also helps to make **end-to-end** training of the vision encoder feasible. Details about the visual encoder are described in Sec. `\ref{sec:method-visual}`{=latex}.

Key Design Decisions
====================

In this section, we describe key design decisions for Diffusion Policy as well as its concrete implementation of $\epsilon_\theta$ with neural network architectures.

Network Architecture Options {#sec:method-network}
----------------------------

The first design decision is the choice of neural network architectures for $\epsilon_\theta$. In this work, we examine two common network architecture types, convolutional neural networks (CNNs) [@ronneberger2015u] and Transformers [@vaswani2017attention], and compare their performance and training characteristics. Note that the choice of noise prediction network $\epsilon_\theta$ is independent of visual encoders, which will be described in Sec. `\ref{sec:method-visual}`{=latex}.

**CNN-based Diffusion Policy** We adopt the 1D temporal CNN from @pmlr-v162-janner22a with a few modifications: First, we only model the conditional distribution $p(\mathbf{A}_t|\mathbf{O}_t)$ by conditioning the action generation process on observation features $\mathbf{O}_t$ with Feature-wise Linear Modulation (FiLM) [@perez2018film] as well as denoising iteration $k$, shown in Fig `\ref{fig:policy_io}`{=latex} (b). Second, we only predict the action trajectory instead of the concatenated observation action trajectory. Third, we removed inpainting-based goal state conditioning due to incompatibility with our framework utilizing a receding prediction horizon. However, goal conditioning is still possible with the same FiLM conditioning method used for observations.
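
To make the FiLM conditioning concrete, the sketch below applies a channel-wise scale and shift computed from a conditioning vector. It is a simplified illustration under assumed shapes: the two linear maps `w_scale` and `w_shift` and the function name are hypothetical, and biases and nonlinearities are left out.

```python
def film(features, cond, w_scale, w_shift):
    """Channel-wise FiLM: out[c][t] = scale[c] * features[c][t] + shift[c].

    features: list of C channels, each a list of T values.
    cond:     conditioning vector of length D (for example, observation
              features together with an embedding of iteration k).
    w_scale, w_shift: C x D weight matrices of two linear maps (hypothetical).
    """
    def dot(row):
        return sum(w * c for w, c in zip(row, cond))

    scale = [dot(row) for row in w_scale]
    shift = [dot(row) for row in w_shift]
    return [[s * x + b for x in channel]
            for channel, s, b in zip(features, scale, shift)]

out = film(features=[[1.0, 2.0], [3.0, 4.0]], cond=[1.0, 2.0],
           w_scale=[[1.0, 0.0], [0.0, 1.0]], w_shift=[[0.0, 0.0], [1.0, 1.0]])
```

Every time step within a channel shares the same scale and shift, which is what "channel-wise" refers to here.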

In practice, we found the CNN-based backbone to work well on most tasks out of the box without the need for much hyperparameter tuning. However, it performs poorly when the desired action sequence changes quickly and sharply through time (such as velocity command action space), likely due to the inductive bias of temporal convolutions to prefer low-frequency signals [@tancik2020fourier].

**Time-series diffusion transformer** To reduce the over-smoothing effect in CNN models [@tancik2020fourier], we introduce a novel transformer-based DDPM which adopts the transformer architecture from minGPT [@bet] for action prediction. Actions with noise $\mathbf{A}_t^k$ are passed in as input tokens for the transformer decoder blocks, with the sinusoidal embedding for diffusion iteration $k$ prepended as the first token. The observation $\mathbf{O}_t$ is transformed into an observation embedding sequence by a shared MLP, which is then passed into the transformer decoder stack as input features. The "gradient" $\epsilon_\theta(\mathbf{O}_t,\mathbf{A}_t^k,k)$ is predicted by each corresponding output token of the decoder stack.
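
The causal attention constraint mentioned in Fig `\ref{fig:policy_io}`{=latex} (c), where each action embedding attends only to itself and earlier action embeddings, can be sketched with a lower-triangular mask. The helper names are hypothetical and the example covers only the masking step, not the full decoder.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    out = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)   # for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores: each token spreads its attention evenly over itself and the past.
weights = masked_softmax([[0.0] * 3 for _ in range(3)], causal_mask(3))
```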

In our state-based experiments, most of the best-performing policies are achieved with the transformer backbone, especially when the task complexity and rate of action change are high. However, we found the transformer to be more sensitive to hyperparameters. The difficulty of transformer training [@liu2020understanding] is not unique to Diffusion Policy and could potentially be resolved in the future with improved transformer training techniques or increased data scale.

**Recommendations.** In general, we recommend starting with the CNN-based diffusion policy implementation as the first attempt at a new task. If performance is low due to task complexity or high-rate action changes, then the Time-series Diffusion Transformer formulation can be used to potentially improve performance at the cost of additional tuning.

Visual Encoder {#sec:method-visual}
--------------

The visual encoder maps the raw image sequence into a latent embedding $\mathbf{O}_t$ and is trained end-to-end with the diffusion policy. Different camera views use separate encoders, and images in each timestep are encoded independently and then concatenated to form $\mathbf{O}_t$. We used a standard ResNet-18 (without pretraining) as the encoder with the following modifications: 1) Replace the global average pooling with a spatial softmax pooling to maintain spatial information [@robomimic]. 2) Replace BatchNorm with GroupNorm [@groupnorm] for stable training. This is important when the normalization layer is used in conjunction with Exponential Moving Average [@he2020moco] (commonly used in DDPMs).
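
For intuition, spatial softmax pooling can be sketched for a single feature channel as below: the channel is turned into a probability map over pixel locations and summarized by its expected image coordinates. The normalization of coordinates to $[-1, 1]$ and the function name are assumptions; the sketch requires a map with at least two rows and two columns.

```python
import math

def spatial_softmax(feature_map):
    """Return the softmax-weighted expected (x, y) of one H x W channel.

    Coordinates are normalized to [-1, 1] (a common convention, assumed here).
    """
    h, w = len(feature_map), len(feature_map[0])
    m = max(max(row) for row in feature_map)            # for numerical stability
    weights = [[math.exp(v - m) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    xs = [2.0 * j / (w - 1) - 1.0 for j in range(w)]
    ys = [2.0 * i / (h - 1) - 1.0 for i in range(h)]
    ex = sum(weights[i][j] * xs[j] for i in range(h) for j in range(w)) / total
    ey = sum(weights[i][j] * ys[i] for i in range(h) for j in range(w)) / total
    return ex, ey

# A strong activation in the top-right corner yields coordinates near (1, -1).
peak_x, peak_y = spatial_softmax([[0.0, 0.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

Unlike global average pooling, the output depends on where the activation occurs, which is how spatial information is retained.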

Noise Schedule {#sec:method-noise-schedule}
--------------

The noise schedule, defined by $\sigma$, $\alpha$, $\gamma$ and the additive Gaussian noise $\epsilon^k$ as functions of $k$, has been actively studied [@ho2020denoising; @nichol2021improved]. The underlying noise schedule controls the extent to which Diffusion Policy captures high- and low-frequency characteristics of action signals. In our control tasks, we empirically found that the Square Cosine Schedule proposed in iDDPM [@nichol2021improved] works best.
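
A sketch of the squared cosine schedule, written in the per-step variance ($\beta_t$) parameterization of [@nichol2021improved], is given below. The offset `s = 0.008` and the clipping value `0.999` are assumed defaults taken from that work; the mapping from $\beta_t$ to the $\alpha$, $\gamma$, $\sigma$ of Eq `\ref{eq:unconditional_langevin}`{=latex} is not shown.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t), normalized so alpha_bar(0) = 1."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Per-step variances beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped."""
    return [min(1 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s), max_beta)
            for t in range(1, T + 1)]

betas = cosine_betas(100)
```

The clipping prevents the final steps, where `alpha_bar` approaches zero, from producing a variance of exactly one.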

Accelerating Inference for Real-time Control
--------------------------------------------

We use the diffusion process as the policy for robots; hence, it is critical to have a fast inference speed for closed-loop real-time control. The Denoising Diffusion Implicit Models (DDIM) approach [@song2021ddim] decouples the number of denoising iterations in training and inference, thereby allowing the algorithm to use fewer iterations for inference to speed up the process. In our real-world experiments, using DDIM with 100 training iterations and 10 inference iterations enables 0.1s inference latency on an Nvidia 3080 GPU.
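
The decoupling of training and inference iterations can be sketched with the deterministic DDIM update, written in the cumulative $\bar{\alpha}$ parameterization of [@song2021ddim]. The function names and the evenly strided choice of inference timesteps are assumptions; in practice `eps` would be the output of $\epsilon_\theta$.

```python
import math

def ddim_step(x_t, eps, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0) from noise level ab_t to ab_prev.

    ab_t and ab_prev are the cumulative products alpha_bar at the two timesteps.
    """
    x0_pred = [(x - math.sqrt(1 - ab_t) * e) / math.sqrt(ab_t)
               for x, e in zip(x_t, eps)]                 # predicted clean sample
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * e
            for x0, e in zip(x0_pred, eps)]               # re-noise to level ab_prev

def inference_timesteps(n_train=100, n_infer=10):
    """Evenly strided subset of training timesteps, ordered from noisy to clean."""
    stride = n_train // n_infer
    return list(range(n_train - 1, -1, -stride))

# With the true noise, a single step to alpha_bar = 1 recovers the clean sample.
x0_true, eps_true, ab = [1.0, -2.0], [0.3, 0.4], 0.5
x_noisy = [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0_true, eps_true)]
recovered = ddim_step(x_noisy, eps_true, ab_t=ab, ab_prev=1.0)
```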

Intriguing Properties of Diffusion Policy
=========================================

In this section, we provide some insights and intuitions about diffusion policy and its advantages over other forms of policy representations.

Model Multi-Modal Action Distributions {#sec:multimodal}
--------------------------------------

The challenge of modeling multi-modal distribution in human demonstrations has been widely discussed in behavior cloning literature [@ibc; @bet; @robomimic]. Diffusion Policy's ability to express multimodal distributions naturally and precisely is one of its key advantages.

Intuitively, multi-modality in action generation for diffusion policy arises from two sources -- an underlying stochastic sampling procedure and a stochastic initialization. In Stochastic Langevin Dynamics, an initial sample $\mathbf{A}^K_t$ is drawn from a standard Gaussian at the beginning of each sampling process, which helps specify different possible convergence basins for the final action prediction $\mathbf{A}^0_t$. This action is then further stochastically optimized, with added Gaussian perturbations across a large number of iterations, which enables individual action samples to converge and move between different multi-modal action basins. Fig. `\ref{fig:multimodal}`{=latex} shows an example of the Diffusion Policy's multimodal behavior in a planar pushing task (Push T, introduced below) without explicit demonstration for the tested scenario.

```{=latex}
\centering
```
![`\label{fig:multimodal}`{=latex} **Multimodal behavior.** At the given state, the end-effector (blue) can either go left or right to push the block. **Diffusion Policy** learns both modes and commits to only one mode within each rollout. In contrast, both **LSTM-GMM** [@robomimic] and **IBC** [@ibc] are biased toward one mode, while **BET** [@bet] fails to commit to a single mode due to its lack of temporal action consistency. Actions generated by rolling out 40 steps for the best-performing checkpoint. ](figure/multimodal_sim.png){width="0.98\\linewidth"}

```{=latex}
\vspace{-2mm}
```
Synergy with Position Control {#sec:property_pos_vs_vel}
-----------------------------

We find that Diffusion Policy with a position-control action space consistently outperforms Diffusion Policy with velocity control, as shown in Fig `\ref{fig:pos_vs_vel}`{=latex}. This surprising result stands in contrast to the majority of recent behavior cloning work that generally relies on velocity control [@robomimic; @bet; @zhang2018deep; @florence2019self; @mandlekar2020learning; @mandlekar2020iris]. We speculate that there are two primary reasons for this discrepancy: First, action multimodality is more pronounced in position-control mode than it is when using velocity control. Because Diffusion Policy better expresses action multimodality than existing approaches, we speculate that it is inherently less affected by this drawback than existing methods. Furthermore, position control suffers less than velocity control from compounding error effects and is thus more suitable for action-sequence prediction (as discussed in the following section). As a result, Diffusion Policy is both less affected by the primary drawbacks of position control and is better able to exploit position control's advantages.

```{=latex}
\centering
```
![**Velocity vs. Position Control.** `\label{fig:pos_vs_vel}`{=latex} The performance difference when switching from velocity to position control. While the performance of both BC-RNN and BET decreases, Diffusion Policy is able to leverage the advantage of position control and improve its performance. ](figure/pos_vs_vel_figure.png){width="0.85\\linewidth"}

```{=latex}
\vspace{-4mm}
```
Benefits of Action-Sequence Prediction {#sec:action_sequence}
--------------------------------------

Most policy learning methods avoid sequence prediction because of the difficulty of effectively sampling from high-dimensional output spaces. For example, IBC would struggle to sample effectively from a high-dimensional action space with a non-smooth energy landscape. Similarly, BC-RNN and BET would have difficulty specifying the number of modes that exist in the action distribution (needed for GMM or k-means steps).

In contrast, DDPM scales well with output dimensions without sacrificing the expressiveness of the model, as demonstrated in many image generation applications. Leveraging this capability, Diffusion Policy represents action in the form of a high-dimensional action sequence, which naturally addresses the following issues:

-   **Temporal action consistency**: Take Fig `\ref{fig:multimodal}`{=latex} as an example. To push the T block into the target from the bottom, the policy can go around the T block from either left or right. However, if each action in the sequence is predicted as an independent multimodal distribution (as done in BC-RNN and BET), consecutive actions could be drawn from different modes, resulting in jittery actions that alternate between the two valid trajectories.

-   **Robustness to idle actions**: Idle actions occur when a demonstration is paused and result in sequences of identical positional actions or near-zero velocity actions. They are common during teleoperation and are sometimes required for tasks like liquid pouring. However, single-step policies can easily overfit to this pausing behavior. For example, BC-RNN and IBC often get stuck in real-world experiments when the idle actions are not explicitly removed from training.

```{=latex}
\centering
```
![**Diffusion Policy Ablation Study.** Change (difference) in success rate relative to the maximum for each task is shown on the Y-axis. **Left**: trade-off between temporal consistency and responsiveness when selecting the action horizon. **Right**: Diffusion Policy with position control is robust against latency. Latency is defined as the number of steps between the last frame of observations and the first action that can be executed. ](figure/ablation_figure.png "fig:"){#fig:ablation width="\\linewidth"} `\vspace{-6mm}`{=latex}

```{=latex}
\vspace{-5mm}
```
Training Stability {#sec:ibc_stability}
------------------

In theory, IBC should possess advantages similar to those of diffusion policies. In practice, however, achieving reliable and high-performance results from IBC is challenging due to IBC's inherent training instability [@ta2022conditional]. Fig `\ref{fig:ibc_stability}`{=latex} shows training error spikes and unstable evaluation performance throughout the training process, making hyperparameter tuning critical and checkpoint selection difficult. As a result, @ibc evaluate every checkpoint and report results for the best-performing checkpoint. In a real-world setting, this workflow necessitates the evaluation of many policies on hardware to select a final policy. Here, we discuss why Diffusion Policy appears significantly more stable to train.

An implicit policy represents the action distribution using an Energy-Based Model (EBM): `\vspace{-2mm}`{=latex} $$\label{eq:ebm}
    p_\theta(\mathbf{a}|\mathbf{o})=\frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{Z(\mathbf{o},\theta)}
\vspace{-1mm}$$ where $Z(\mathbf{o},\theta)$ is an intractable normalization constant (with respect to $\mathbf{a}$).

To train the EBM for implicit policy, an InfoNCE-style loss function is used, which equates to the negative log-likelihood of Eq `\ref{eq:ebm}`{=latex}: $$\label{eq:infonce}
    \mathcal{L}_{infoNCE}=-\log(\frac{
        e^{-E_\theta(\mathbf{o},\mathbf{a})}
    }{
        e^{-E_\theta(\mathbf{o},\mathbf{a})} + 
            \textcolor{red}{\sum^{N_{neg}}_{j=1}}e^{
                -E_\theta(\mathbf{o},\textcolor{red}{\widetilde{\mathbf{a}}^j})}
    })$$ where a set of negative samples $\textcolor{red}{\{\widetilde{\mathbf{a}}^j\}^{N_{neg}}_{j=1}}$ are used to estimate the intractable normalization constant $Z(\mathbf{o},\theta)$. In practice, the inaccuracy of negative sampling is known to cause training instability for EBMs [@du2020improved; @ta2022conditional].
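
The loss in Eq `\ref{eq:infonce}`{=latex} can be sketched numerically as below, given the energy of the demonstrated action and the energies of sampled negatives. The function name is hypothetical and the energies are plain numbers rather than network outputs.

```python
import math

def info_nce(energy_pos, energies_neg):
    """-log( e^{-E+} / (e^{-E+} + sum_j e^{-E-_j}) ), computed stably."""
    logits = [-energy_pos] + [-e for e in energies_neg]
    m = max(logits)
    log_z_estimate = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z_estimate - (-energy_pos)

# Three negatives with the same energy as the positive: loss = log(4).
loss_uninformative = info_nce(0.0, [0.0, 0.0, 0.0])
# A negative with much higher energy contributes almost nothing: loss near 0.
loss_easy = info_nce(0.0, [100.0])
```

The estimate of the normalization term depends entirely on which negatives happen to be sampled, which is the source of inaccuracy discussed above.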

Diffusion Policy and DDPMs sidestep the issue of estimating $Z(\mathbf{o},\theta)$ altogether by modeling the **score function** [@song2019score] of the same action distribution in Eq `\ref{eq:ebm}`{=latex}: $$\nabla_{\mathbf{a}}\log p(\mathbf{a}|\mathbf{o})
    =-\nabla_{\mathbf{a}} E_{\theta}(\mathbf{o},\mathbf{a})-\underbrace{\nabla_{\mathbf{a}}\log Z(\mathbf{o},\theta)}_{=0}
    \approx -\epsilon_\theta(\mathbf{o},\mathbf{a})$$ where the noise-prediction network $\epsilon_\theta(\mathbf{o},\mathbf{a})$ approximates the negative of the score function $\nabla_{\mathbf{a}}\log p(\mathbf{a}|\mathbf{o})$ [@liu2022compositional], which is independent of the normalization constant $Z(\mathbf{o},\theta)$. As a result, neither the inference (Eq `\ref{eq:diffusion_policy_langevin}`{=latex}) nor training (Eq `\ref{eq:diffusion_policy_loss}`{=latex}) process of Diffusion Policy involves evaluating $Z(\mathbf{o},\theta)$, thus making Diffusion Policy training more stable.
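
As a small numerical check of why the normalization constant drops out, the sketch below differentiates $\log p(\mathbf{a}|\mathbf{o}) = -E(\mathbf{o},\mathbf{a}) - \log Z$ by finite differences for a toy scalar energy. The quadratic energy and the arbitrary values passed as `log_z` are assumptions; any constant gives the same score.

```python
def energy(a, o):
    """Toy energy E(o, a) = 0.5 * (a - o)^2 (an assumption for illustration)."""
    return 0.5 * (a - o) ** 2

def score_fd(a, o, log_z, h=1e-5):
    """Central finite difference of log p(a|o) = -E(o, a) - log Z in a."""
    def log_p(x):
        return -energy(x, o) - log_z
    return (log_p(a + h) - log_p(a - h)) / (2 * h)

# The score equals -(a - o) = 0.7 regardless of the value used for log Z.
score_a = score_fd(0.3, 1.0, log_z=0.0)
score_b = score_fd(0.3, 1.0, log_z=5.0)
```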

```{=latex}
\centering
```
![**Training Stability.** `\label{fig:ibc_stability}`{=latex} Left: IBC fails to infer training actions with increasing accuracy despite smoothly decreasing training loss for energy function. Right: IBC's evaluation success rate oscillates, making checkpoint selection difficult (evaluated using policy rollouts in simulation).](figure/ibc_stability_figure.png "fig:"){width="\\linewidth"} `\vspace{-5mm}`{=latex}

Connections to Control Theory {#sec:control}
-----------------------------

Diffusion Policy has a simple limiting behavior when the tasks are very simple; this potentially allows us to bring to bear some rigorous understanding from control theory. Consider the case where we have a linear dynamical system, in standard state-space form, that we wish to control: $$\begin{gathered}
{\bf s}_{t+1} = {\bf A}{\bf s}_t + {\bf B}{\bf a}_t + {\bf w}_t, \qquad {\bf w}_t \sim \mathcal{N}(0, \Sigma_w).
\end{gathered}$$ Now imagine we obtain demonstrations (rollouts) from a linear feedback policy: ${\bf a}_t = -{\bf K}{\bf s}_t.$ This policy could be obtained, for instance, by solving a linear optimal control problem like the Linear Quadratic Regulator. Imitating this policy does not need the modeling power of diffusion, but as a sanity check, we can see that Diffusion Policy does the right thing.

In particular, when the prediction horizon is one time step, $T_p=1$, it can be seen that the optimal denoiser which minimizes $$\mathcal{L}=MSE(\mathbf{\epsilon}^k,\epsilon_\theta(\mathbf{s}_t, -{\bf K}{\bf s}_t + \mathbf{\epsilon}^k, k))$$ is given by $$\epsilon_\theta({\bf s}, {\bf a}, k) = \frac{1}{\sigma_k}[{\bf a} + {\bf K}{\bf s}],$$ where $\sigma_k$ is the variance on denoising iteration $k$. Furthermore, at inference time, DDIM sampling will converge to the global minimum at ${\bf a} = -{\bf Ks}.$
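The closed-form denoiser can be checked numerically. The sketch below uses a scalar state and action and assumes the noisy action is formed as $a = -Ks + \sigma_k\epsilon$ with unit-variance $\epsilon$, so that $\sigma_k$ acts as the noise scale at iteration $k$. The values of `K` and `SIGMA_K` are arbitrary. Under that assumption, a linear denoiser fitted by least squares should recover the coefficients $K/\sigma_k$ and $1/\sigma_k$. The last lines remove the predicted noise in a single step, which is a simplification and not the DDIM schedule itself.

```python
import random

K, SIGMA_K = 1.5, 0.5  # hypothetical feedback gain and noise scale
rng = random.Random(0)

# Build (state, noisy action, noise) training triples.
data = []
for _ in range(2000):
    s = rng.gauss(0.0, 1.0)
    eps = rng.gauss(0.0, 1.0)
    a = -K * s + SIGMA_K * eps
    data.append((s, a, eps))

# Least-squares fit of eps ~ w_s * s + w_a * a via the 2x2 normal equations.
s_ss = sum(s * s for s, a, e in data)
s_sa = sum(s * a for s, a, e in data)
s_aa = sum(a * a for s, a, e in data)
s_se = sum(s * e for s, a, e in data)
s_ae = sum(a * e for s, a, e in data)
det = s_ss * s_aa - s_sa * s_sa
w_s = (s_se * s_aa - s_sa * s_ae) / det
w_a = (s_ss * s_ae - s_sa * s_se) / det

def denoiser(s, a):
    """Fitted linear noise predictor epsilon_theta(s, a)."""
    return w_s * s + w_a * a

# Removing the predicted noise from an arbitrary action lands on a = -K s.
s_test, a_noisy = 0.8, 3.0
a_denoised = a_noisy - SIGMA_K * denoiser(s_test, a_noisy)
```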

Trajectory prediction ($T_p>1$) follows naturally. In order to predict ${\bf a}_{t+t'}$ as a function of ${\bf s}_t$, the optimal denoiser will produce ${\bf a}_{t+t'} = -{\bf K}({\bf A}-{\bf BK})^{t'}{\bf s}_t$; all terms involving ${\bf w}_t$ are zero in expectation. This shows that in order to perfectly clone a behavior that depends on the state, the learner must implicitly learn a (task-relevant) dynamics model [@subramanian2019approximate; @zhang2020learning]. Note that if either the plant or the policy is nonlinear, then predicting future actions could become significantly more challenging and once again involve multimodal predictions.
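The multi-step expression can be sanity-checked with a scalar rollout. In the sketch below, `A`, `B`, `K`, and the noise level are hypothetical values chosen for illustration. The average action over many noisy closed-loop rollouts is compared against $-K(A-BK)^{t'}s_t$.

```python
import random

A, B, K = 0.9, 0.5, 0.6   # hypothetical scalar plant and feedback gain
NOISE_STD = 0.1           # standard deviation of the process noise w_t

def rollout_action(s0, t_prime, rng):
    """Action a_{t+t'} from one noisy closed-loop rollout starting at s0."""
    s = s0
    for _ in range(t_prime):
        a = -K * s
        s = A * s + B * a + rng.gauss(0.0, NOISE_STD)
    return -K * s

def predicted_action(s0, t_prime):
    """Closed-form mean action: -K (A - B K)^{t'} s0."""
    return -K * (A - B * K) ** t_prime * s0

rng = random.Random(0)
s0, t_prime = 1.0, 3
n_rollouts = 5000
mean_action = sum(rollout_action(s0, t_prime, rng) for _ in range(n_rollouts)) / n_rollouts
closed_form = predicted_action(s0, t_prime)
```

The closed-form predictor needs `A` and `B` as well as `K`, which is the sense in which predicting future actions requires an implicit dynamics model.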

```{=latex}
\begin{table*}[h]
    
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\includegraphics[width=0.835\linewidth]{figure/sim_task_thumbnails.pdf}
\label{tab:sim_benchmark_state}

\vspace{1mm}
{
\centering
\setlength\tabcolsep{ 3 pt}
\begin{tabular}{r|cc|cc|cc|cc|c|c}
\toprule
 & \multicolumn{2}{c|}{Lift} & \multicolumn{2}{c|}{Can} & \multicolumn{2}{c|}{Square} & \multicolumn{2}{c|}{Transport} & \multicolumn{1}{c|}{ToolHang} & \multicolumn{1}{c}{Push-T} \\
 & ph & mh & ph & mh & ph & mh & ph & mh & ph & ph \\
\midrule
LSTM-GMM & \small \textbf{1.00}/0.96 & \small \textbf{1.00}/0.93 & \small \textbf{1.00}/0.91 & \small \textbf{1.00}/0.81 & \small 0.95/0.73 & \small 0.86/0.59 & \small 0.76/0.47 & \small 0.62/0.20 & \small 0.67/0.31 & \small 0.67/0.61 \\
IBC & \small 0.79/0.41 & \small 0.15/0.02 & \small 0.00/0.00 & \small 0.01/0.01 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.90/0.84 \\
BET & \small \textbf{1.00}/0.96 & \small \textbf{1.00}/0.99 & \small \textbf{1.00}/0.89 & \small \textbf{1.00}/0.90 & \small 0.76/0.52 & \small 0.68/0.43 & \small 0.38/0.14 & \small 0.21/0.06 & \small 0.58/0.20 & \small 0.79/0.70 \\
\midrule
DiffusionPolicy-C & \small \textbf{1.00}/0.98 & \small \textbf{1.00}/0.97 & \small \textbf{1.00}/0.96 & \small \textbf{1.00}/\textbf{0.96} & \small \textbf{1.00}/\textbf{0.93} & \small \textbf{0.97}/\textbf{0.82} & \small 0.94/0.82 & \small \textbf{0.68}/\textbf{0.46} & \small 0.50/0.30 & \small 0.95/\textbf{0.91} \\
DiffusionPolicy-T & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/0.94 & \small \textbf{1.00}/0.89 & \small 0.95/0.81 & \small \textbf{1.00}/\textbf{0.84} & \small 0.62/0.35 & \small \textbf{1.00}/\textbf{0.87} & \small \textbf{0.95}/0.79 \\
\bottomrule
\end{tabular}



\caption{\textbf{Behavior Cloning Benchmark (State Policy) \label{tab:table_low_dim} } 
We present success rates with different checkpoint selection methods in the format of (max performance) / (average of last 10 checkpoints), with each averaged across 3 training seeds and 50 different environment initial conditions (150 in total). 
LSTM-GMM corresponds to BC-RNN in RoboMimic~\cite{robomimic}, which we reproduced and obtained slightly better results than the original paper. Our results show that Diffusion Policy significantly improves state-of-the-art performance across the board.
}
\vspace{2mm}


\setlength\tabcolsep{ 3 pt}
\begin{tabular}{r|cc|cc|cc|cc|c|c}
\toprule
 & \multicolumn{2}{c|}{Lift} & \multicolumn{2}{c|}{Can} & \multicolumn{2}{c|}{Square} & \multicolumn{2}{c|}{Transport} & \multicolumn{1}{c|}{ToolHang} & \multicolumn{1}{c}{Push-T} \\
 & ph & mh & ph & mh & ph & mh & ph & mh & ph & ph \\
\midrule
LSTM-GMM & \small \textbf{1.00}/0.96 & \small \textbf{1.00}/0.95 & \small \textbf{1.00}/0.88 & \small 0.98/0.90 & \small 0.82/0.59 & \small 0.64/0.38 & \small 0.88/0.62 & \small 0.44/0.24 & \small 0.68/0.49 & \small 0.69/0.54 \\
IBC & \small 0.94/0.73 & \small 0.39/0.05 & \small 0.08/0.01 & \small 0.00/0.00 & \small 0.03/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.00/0.00 & \small 0.75/0.64 \\
DiffusionPolicy-C & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/0.97 & \small \textbf{1.00}/0.96 & \small 0.98/\textbf{0.92} & \small \textbf{0.98}/\textbf{0.84} & \small \textbf{1.00}/\textbf{0.93} & \small \textbf{0.89}/\textbf{0.69} & \small \textbf{0.95}/\textbf{0.73} & \small \textbf{0.91}/\textbf{0.84} \\
DiffusionPolicy-T & \small \textbf{1.00}/\textbf{1.00} & \small \textbf{1.00}/0.99 & \small \textbf{1.00}/\textbf{0.98} & \small \textbf{1.00}/\textbf{0.98} & \small \textbf{1.00}/0.90 & \small 0.94/0.80 & \small 0.98/0.81 & \small 0.73/0.50 & \small 0.76/0.47 & \small 0.78/0.66 \\
\bottomrule
\end{tabular}

\caption{\textbf{Behavior Cloning Benchmark (Visual Policy) \label{tab:table_image}} Performance is reported in the same format as in Tab \ref{tab:table_low_dim}. LSTM-GMM numbers were reproduced to get a complete evaluation in addition to the best checkpoint performance reported. Diffusion Policy shows consistent performance improvement, especially for complex tasks like Transport and ToolHang. }
}
%\vspace{-3mm}
\end{table*}
```
Evaluation
==========

We systematically evaluate Diffusion Policy on 15 tasks from 4 benchmarks [@ibc; @gupta2019relay; @robomimic; @bet]. This evaluation suite includes both simulated and real environments, single and multiple task benchmarks, fully actuated and under-actuated systems, and rigid and fluid objects. We found Diffusion Policy to consistently outperform the prior state-of-the-art on all of the tested benchmarks, with an average success-rate improvement of 46.9%. In the following sections, we provide an overview of each task, our evaluation methodology on that task, and our key takeaways.

Simulation Environments and Datasets
------------------------------------

**Robomimic** [@robomimic] is a large-scale robotic manipulation benchmark designed to study imitation learning and offline RL. The benchmark consists of 5 tasks with a proficient human (PH) teleoperated demonstration dataset for each and mixed proficient/non-proficient human (MH) demonstration datasets for 4 of the tasks (9 variants in total). For each variant, we report results for both state- and image-based observations. Properties for each task are summarized in Tab `\ref{tab:robomimic_tasks}`{=latex}.

```{=latex}
\centering
\setlength\tabcolsep{2 pt}
\small
```
::: {#tab:robomimic_tasks}
                Task                                      \# Rob                         \# Obj   ActD   \#PH   \#MH   Steps   Img?   HiPrec
  -------------------------------- ---------------------------------------------------- -------- ------ ------ ------ ------- ------ --------
   `\multicolumn{1}{c}{}`{=latex}   `\multicolumn{8}{c}{Simulation Benchmark}`{=latex}                                               
                Lift                                        1                              1       7     200    300     400    Yes      No
                Can                                         1                              1       7     200    300     400    Yes      No
               Square                                       1                              1       7     200    300     400    Yes     Yes
             Transport                                      2                              3       14    200    300     700    Yes      No
              ToolHang                                      1                              2       7     200     0      700    Yes     Yes
               Push-T                                       1                              1       2     200     0      300    Yes     Yes
             BlockPush                                      1                              2       2      0      0      350     No      No
              Kitchen                                       1                              7       9     566     0      280     No      No
   `\multicolumn{1}{c}{}`{=latex}   `\multicolumn{8}{c}{Realworld Benchmark}`{=latex}                                                
               Push-T                                       1                              1       2     136     0      600    Yes     Yes
             6DoF Pour                                      1                            liquid    6      90     0      600    Yes      No
            Peri Spread                                     1                            liquid    6      90     0      600    Yes      No
              Mug Flip                                      1                              1       7     250     0      600    Yes      No

  : **Tasks Summary.** \# Rob: number of robots, \# Obj: number of objects, ActD: action dimension, \#PH: number of proficient-human demonstrations, \#MH: number of multi-human demonstrations, Steps: max number of rollout steps, HiPrec: whether the task has a high precision requirement. BlockPush uses 1000 episodes of scripted demonstrations.
:::

```{=latex}
\vspace{-5mm}
```
**Push-T** `\label{sec:eval_sim_pusht}`{=latex}, adapted from IBC [@ibc], requires pushing a T-shaped block (gray) to a fixed target (red) with a circular end-effector (blue). Variation is added by randomizing the initial conditions of the T block and end-effector. The task requires exploiting complex and contact-rich object dynamics to push the T block precisely, using point contacts. There are two variants: one with RGB image observations and another with nine 2D keypoints obtained from the ground-truth pose of the T block, both with proprioception for end-effector location.

**Multimodal Block Pushing**, adapted from BET [@bet], tests the policy's ability to model multimodal action distributions by pushing two blocks into two squares in any order. The demonstration data is generated by a scripted oracle with access to ground-truth state information. This oracle randomly selects an initial block to push and moves it to a randomly selected square. The remaining block is then pushed into the remaining square. This task contains **long-horizon** multimodality that cannot be modeled by a single function mapping from observation to action.

**Franka Kitchen** is a popular environment for evaluating the ability of IL and Offline-RL methods to learn multiple long-horizon tasks. Proposed in Relay Policy Learning [@gupta2019relay], the Franka Kitchen environment contains 7 objects for interaction and comes with a human demonstration dataset of 566 demonstrations, each completing 4 tasks in arbitrary order. The goal is to execute as many demonstrated tasks as possible, regardless of order, showcasing both short-horizon and long-horizon multimodality.

```{=latex}
\centering
```
![image](figure/multitask_thumbnails.png){width="0.9\\linewidth"}

```{=latex}
\vspace{2mm}
\setlength\tabcolsep{4.8 pt}
```
  ------------------- ------------------------------------------ --------------------------------------- --------------------------- --------------------------- --------------------------- ---------------------------
                       `\multicolumn{2}{c|}{BlockPush}`{=latex}   `\multicolumn{4}{c}{Kitchen}`{=latex}                                                                                      
                                          p1                                       p2                                p1                          p2                          p3                          p4
             LSTM-GMM           `\small `{=latex}0.03                     `\small `{=latex}0.01           `\small `{=latex}**1.00**     `\small `{=latex}0.90       `\small `{=latex}0.74       `\small `{=latex}0.34
                  IBC           `\small `{=latex}0.01                     `\small `{=latex}0.00             `\small `{=latex}0.99       `\small `{=latex}0.87       `\small `{=latex}0.61       `\small `{=latex}0.24
                  BET           `\small `{=latex}0.96                     `\small `{=latex}0.71             `\small `{=latex}0.99       `\small `{=latex}0.93       `\small `{=latex}0.71       `\small `{=latex}0.44
    DiffusionPolicy-C           `\small `{=latex}0.36                     `\small `{=latex}0.11           `\small `{=latex}**1.00**   `\small `{=latex}**1.00**   `\small `{=latex}**1.00**   `\small `{=latex}**0.99**
    DiffusionPolicy-T         `\small `{=latex}**0.99**                 `\small `{=latex}**0.94**         `\small `{=latex}**1.00**     `\small `{=latex}0.99       `\small `{=latex}0.99       `\small `{=latex}0.96
  ------------------- ------------------------------------------ --------------------------------------- --------------------------- --------------------------- --------------------------- ---------------------------

  : **Multi-Stage Tasks (State Observation)**. `\label{tab:multi_stage}`{=latex} For BlockPush, $px$ is the frequency of pushing $x$ blocks into the targets. For Kitchen, $px$ is the frequency of interacting with $x$ or more objects (e.g. bottom burner). Diffusion Policy performs better, especially on difficult metrics such as $p2$ for BlockPush and $p4$ for Kitchen.

```{=latex}
\vspace{-4mm}
```
Evaluation Methodology
----------------------

We present the **best-performing result for each baseline method** on each benchmark from all possible sources -- our reproduced result (LSTM-GMM) or the original number reported in the paper (BET, IBC). We report results from the average of the last 10 checkpoints (saved every 50 epochs) across **3** training seeds and **50** environment initializations [^1] (**1500** experiments in total). The metric for most tasks is success rate, except for the Push-T task, which uses target area coverage. In addition, we report the average of best-performing checkpoints for robomimic and Push-T tasks to be consistent with the evaluation methodology of their respective original papers [@robomimic; @ibc]. All state-based tasks are trained for 4500 epochs, and image-based tasks for 3000 epochs. Each method is evaluated with its best-performing action space: position control for Diffusion Policy and velocity control for baselines (the effect of action space is discussed in detail in Sec `\ref{sec:eval_pos_vs_vel}`{=latex}). The results from these simulation benchmarks are summarized in Table `\ref{tab:table_low_dim}`{=latex} and Table `\ref{tab:table_image}`{=latex}.
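The two reported numbers can be computed from raw rollout outcomes roughly as follows. This is a sketch under stated assumptions: the nested-list layout `outcomes[seed][checkpoint]`, the function name `summarize`, and taking the maximum per seed before averaging across seeds are choices made for this example, not details confirmed by the evaluation code.

```python
from statistics import mean

def summarize(outcomes, last_n=10):
    """Return (max performance, average of the last `last_n` checkpoints).

    outcomes[seed][checkpoint] is a list of per-episode scores
    (e.g. 1 for success, 0 for failure) over environment initializations.
    """
    # Score of every checkpoint, per seed, averaged over episodes.
    rates = [[mean(episodes) for episodes in seed] for seed in outcomes]
    # Best checkpoint per seed, then averaged across seeds.
    max_perf = mean(max(seed_rates) for seed_rates in rates)
    # Mean of the final checkpoints per seed, then averaged across seeds.
    last_avg = mean(mean(seed_rates[-last_n:]) for seed_rates in rates)
    return max_perf, last_avg

# Toy data: 2 seeds x 3 checkpoints x 4 episodes.
toy = [
    [[1, 0, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
    [[0, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
]
max_perf, last_avg = summarize(toy, last_n=2)

# Rollouts behind each "last 10 checkpoints" number in the text:
# 3 seeds x 50 initializations x 10 checkpoints.
n_rollouts = 3 * 50 * 10
```

Reporting both numbers separates peak performance from how reliably late checkpoints perform, which matters when checkpoint selection by rollout is impractical.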

Key Findings
------------

Diffusion Policy outperforms alternative methods on all tasks and variants, with both state and vision observations, in our simulation benchmark study (Tabs `\ref{tab:table_low_dim}`{=latex}, `\ref{tab:table_image}`{=latex} and `\ref{tab:multi_stage}`{=latex}) with an average improvement of 46.9%. The following paragraphs summarize the key takeaways.

**Diffusion Policy can express short-horizon multimodality.** We define short-horizon action multimodality as multiple ways of achieving **the same immediate goal**, which is prevalent in human demonstration data [@robomimic]. In Fig `\ref{fig:multimodal}`{=latex}, we present a case study of this type of short-horizon multimodality in the Push-T task. Diffusion Policy learns to approach the contact point from the left or the right with equal likelihood, while LSTM-GMM [@robomimic] and IBC [@ibc] exhibit bias toward one side and BET [@bet] cannot commit to one mode.

**Diffusion Policy can express long-horizon multimodality.** Long-horizon multimodality is the completion of **different sub-goals** in inconsistent order. For example, the order in which the blocks are pushed in the Block Push task and the order of interacting with the 7 possible objects in the Kitchen task are arbitrary. We find that Diffusion Policy copes well with this type of multimodality; it outperforms baselines on both tasks by a large margin: 32% improvement on Block Push's p2 metric and 213% improvement on Kitchen's p4 metric.

**Diffusion Policy can better leverage position control.** `\label{sec:eval_pos_vs_vel}`{=latex} Our ablation study (Fig. `\ref{fig:pos_vs_vel}`{=latex}) shows that position control as the Diffusion Policy action space significantly outperforms velocity control. The baseline methods we evaluate, however, work best with velocity control (and this is reflected in the literature, where most existing work reports using velocity-control action spaces [@robomimic; @bet; @zhang2018deep; @florence2019self; @mandlekar2020learning; @mandlekar2020iris]).

**The trade-off in action horizon.** As discussed in Sec `\ref{sec:action_sequence}`{=latex}, having an action horizon greater than 1 helps the policy predict consistent actions and compensate for idle portions of the demonstration, but too long a horizon reduces performance due to slow reaction time. Our experiment confirms this trade-off (Fig. `\ref{fig:ablation}`{=latex} left) and finds an action horizon of 8 steps to be optimal for most tasks that we tested.
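A minimal sketch of the execute-then-replan loop behind this trade-off is given below. The policy and environment are toy stand-ins (`toy_policy`, `toy_env`, and `GOAL` are hypothetical), and only the loop structure is meant to reflect receding horizon control: predict a sequence, execute the first `action_horizon` actions, then replan from the new observation.

```python
def receding_horizon_rollout(predict_actions, step_env, obs, n_steps, action_horizon=8):
    """Execute `action_horizon` actions from each predicted sequence, then replan."""
    executed, n_replans = [], 0
    while len(executed) < n_steps:
        plan = predict_actions(obs)      # predicted action sequence
        n_replans += 1
        for action in plan[:action_horizon]:
            if len(executed) == n_steps:
                break
            obs = step_env(obs, action)
            executed.append(action)
    return executed, n_replans

GOAL = 10  # hypothetical 1-D target position

def toy_policy(obs, prediction_horizon=16):
    """Stand-in for the learned policy: waypoints moving one unit per step."""
    return [min(GOAL, obs + i) for i in range(1, prediction_horizon + 1)]

def toy_env(obs, action):
    """Position control on an ideal plant: the commanded position is reached."""
    return action

actions, n_replans = receding_horizon_rollout(toy_policy, toy_env, obs=0, n_steps=12)
```

A larger `action_horizon` means fewer policy calls but a longer stretch during which new observations are ignored, which is the slow-reaction cost described above.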

**Robustness against latency.** Diffusion Policy employs receding horizon position control to predict a sequence of actions into the future. This design helps address the latency gap caused by image processing, policy inference, and network delay. Our ablation study with simulated latency showed Diffusion Policy is able to maintain peak performance with latency up to 4 steps (Fig `\ref{fig:ablation}`{=latex}). We also find that velocity control is more affected by latency than position control, likely due to compounding error effects.

**Diffusion Policy is stable to train.** We found that the optimal hyperparameters for Diffusion Policy are mostly consistent across tasks. In contrast, IBC [@ibc] is prone to training instability. This property is discussed in Sec `\ref{sec:ibc_stability}`{=latex}.

Ablation Study {#sec:arch_ablation}
--------------

We explore alternative vision encoder design decisions on the simulated robomimic square task. Specifically, we evaluated 3 different architectures: ResNet-18, ResNet-34 [@resnet] and ViT-B/16 [@dosovitskiy2020image]. For each architecture, we evaluated 3 different training strategies: training end-to-end from scratch, using a frozen pretrained vision encoder, and finetuning a pretrained vision encoder (with a 10x lower learning rate than the policy network). We use ImageNet-21k [@ridnik2021imagenet21k] pretraining for ResNet and CLIP [@radford2021learning] pretraining for ViT-B/16. The quantitative comparison on the square task with the proficient-human (PH) dataset is shown in Tab. `\ref{tab:ablation_vision_encorder}`{=latex}.

We found training ViT from scratch to be challenging (with only 22% success rate), likely due to the limited amount of data. We also found that training with a frozen pretrained vision encoder yields poor performance, which indicates that Diffusion Policy prefers a different visual representation from those offered by popular pretraining methods. However, we found that finetuning the pretrained vision encoder with a small learning rate (10x smaller than that of the Diffusion Policy network) gives the best performance overall. This is especially true for the CLIP-trained ViT-B/16, which reaches 98% success rate with only 50 epochs of training. Overall, the gap in best performance across the different architectures is not large, despite their significant theoretical capacity gap. We anticipate that their performance gap could be more pronounced on a more complex task.

```{=latex}
\centering
```
::: {#tab:ablation_vision_encorder}
  ----------------- -------- ------------------------------------------ ------------
     Architecture &   From    `\multicolumn{2}{c}{Pretrained}`{=latex}  
   Pretrain Dataset  Scratch                   frozen                    finetuning
    Resnet18 (in21)   0.94                      0.58                        0.92
    Resnet34 (in21)   0.92                      0.40                        0.94
    ViT-base (clip)   0.22                      0.70                        0.98
  ----------------- -------- ------------------------------------------ ------------

  : **Vision Encoder Comparison** All models are trained on the robomimic square (ph) task using CNN-based diffusion policy. Each model is trained for 500 epochs and evaluated every 50 epochs under 50 different environment initial conditions.
:::

```{=latex}
\vspace{-2mm}
```
Realworld Evaluation
====================

We evaluated Diffusion Policy's real-world performance on 4 tasks across 2 hardware setups -- with training data from different demonstrators for each setup. On the real-world Push-T task, we perform ablations examining Diffusion Policy on 2 architecture options and 3 visual encoder options; we also benchmarked against 2 baseline methods with both position-control and velocity-control action spaces. On all tasks, Diffusion Policy variants with both CNN backbones and end-to-end-trained visual encoders yielded the best performance. More details about the task setup and parameters may be found in the supplemental materials.

```{=latex}
\centering
```
![image](figure/real_task_setup.png){width="0.9\\linewidth"}

```{=latex}
\vspace{2mm}
```
```{=latex}
\setlength\tabcolsep{1.2pt}
```
```{=latex}
\small
```
  ------- ------- ------------------------------------ ----------------------------------------- ------------------------------------------------ ------ ------- -------- ------ ----------
           Human   `\multicolumn{2}{c|}{IBC}`{=latex}   `\multicolumn{2}{c|}{LSTM-GMM}`{=latex}   `\multicolumn{4}{c}{Diffusion Policy}`{=latex}                                 
           Demo                   pos                                     vel                                          pos                         vel    T-E2E   ImgNet   R3M      E2E
    IoU    0.84                   0.14                                   0.19                                          0.24                        0.25   0.53     0.24    0.66   **0.80**
   Succ%   1.00                   0.00                                   0.00                                          0.20                        0.10   0.65     0.15    0.80   **0.95**
   Dur.    20.3                   56.3                                   41.6                                          47.3                        51.7   57.5     55.8    31.7   **22.9**
  ------- ------- ------------------------------------ ----------------------------------------- ------------------------------------------------ ------ ------- -------- ------ ----------

  : **Realworld Push-T Experiment.** `\label{tab:real_pusht}`{=latex} a) Hardware setup. b) Illustration of the task. The robot needs to `\raisebox{-0.9pt}{1}`{=latex} precisely push the T-shaped block into the target region, **and** `\raisebox{-0.9pt}{2}`{=latex} move the end-effector to the end-zone. c) The ground truth end state used to calculate the IoU metrics in this table. Table: Success is defined as the end-state IoU exceeding the minimum IoU in the demonstration dataset. Average episode duration is presented in seconds. T-E2E stands for end-to-end trained Transformer-based Diffusion Policy.

```{=latex}
\vspace{-4mm}
```
```{=latex}
\begin{figure*}
\centering
\includegraphics[width=\linewidth]{figure/real_results.pdf}

% https://docs.google.com/drawings/d/1LY-oKJ32jSTlIpMnEnoyYGfd58Il9Zzf-YhFS8Axin0/edit
\caption{\textbf{Realworld Push-T Comparisons.} 
\label{fig:real_pusht_comparison}
Columns 1-4 show action trajectories based on key events. The last column shows averaged images of the end state. 
\textbf{A}: Diffusion policy (End2End) achieves more accurate and consistent end states.
\textbf{B}: Diffusion Policy (R3M) gets stuck initially but later recovers and finishes the task. 
\textbf{C}: LSTM-GMM fails to reach the end zone while adjusting the T block, blocking the eval camera view.
\textbf{D}: IBC prematurely ends the pushing stage.
}
\vspace{-2mm}
\end{figure*}
```
Realworld Push-T Task
---------------------

Real-world Push-T is significantly harder than the simulated version due to 3 modifications: 1. The real-world Push-T task is **multi-stage**. It requires the robot to `\raisebox{-0.9pt}{1}`{=latex} push the T block into the target and then `\raisebox{-0.9pt}{2}`{=latex} move its end-effector into a designated end-zone to avoid occlusion. 2. The policy needs to make fine adjustments to make sure the T is fully in the goal region before heading to the end-zone, creating additional short-term multimodality. 3. The IoU metric is measured at the **last step** instead of taking the maximum over all steps. We threshold success rate by the minimum achieved IoU metric from the human demonstration dataset. Our UR5-based experiment setup is shown in Fig. `\ref{tab:real_pusht}`{=latex}. Diffusion Policy predicts robot commands at 10 Hz, and these commands are then linearly interpolated to 125 Hz for robot execution.
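
To make the 10 Hz to 125 Hz upsampling concrete, the following is a minimal sketch of linear interpolation between consecutive predicted commands. The function name `interpolate_commands` and the uniform-timestamp assumption are illustrative only; the real controller may handle timing and latency differently.

```python
def interpolate_commands(commands, src_hz=10.0, dst_hz=125.0):
    """Linearly resample a list of command vectors from src_hz to dst_hz.

    Assumes commands are uniformly spaced in time, starting at t = 0.
    """
    if len(commands) < 2:
        return [list(c) for c in commands]
    duration = (len(commands) - 1) / src_hz      # time of the last command (s)
    n_out = int(round(duration * dst_hz)) + 1    # number of output samples
    out = []
    for k in range(n_out):
        t = min(k / dst_hz, duration)            # output sample time (s)
        s = t * src_hz                           # fractional index into commands
        i = min(int(s), len(commands) - 2)       # index of the left neighbour
        alpha = s - i                            # blend weight in [0, 1]
        a, b = commands[i], commands[i + 1]
        out.append([(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)])
    return out


# Three 1-D commands spanning 0.2 s at 10 Hz become 26 samples at 125 Hz.
dense = interpolate_commands([[0.0], [1.0], [2.0]])
```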

**Result Analysis.** Diffusion Policy performed close to human level with a 95% success rate and 0.80 vs. 0.84 average IoU, compared with the 0% and 20% success rates of the best-performing IBC and LSTM-GMM variants. Fig. `\ref{fig:real_pusht_comparison}`{=latex} qualitatively illustrates the behavior of each method starting from the same initial condition. We observed that poor performance during the transition between stages is the most common failure case for the baseline methods, due to high multimodality during those sections and an ambiguous decision boundary. LSTM-GMM got stuck near the T block in 8 out of 20 evaluations (3rd row), while IBC prematurely left the T block in 6 out of 20 evaluations (4th row). Due to task requirements, we did not follow the common practice of removing **idle actions** from training data, which also contributed to LSTM-GMM and IBC's tendency to overfit on small actions and get stuck in this task. The results are best appreciated with the videos in the supplemental materials.

**End-to-end vs. pre-trained vision encoders** We tested Diffusion Policy with pre-trained vision encoders (ImageNet [@deng2009imagenet] and R3M [@nair2022r3m]), as seen in Tab. `\ref{tab:real_pusht}`{=latex}. Diffusion Policy with R3M achieves an 80% success rate but predicts jittery actions and is more likely to get stuck compared to the end-to-end trained version. Diffusion Policy with ImageNet showed less promising results with abrupt actions and poor performance. We found that end-to-end training is still the most effective way to incorporate visual observation into Diffusion Policy, and our best-performing models were all end-to-end trained.

```{=latex}
\centering
```
![**Robustness Test for Diffusion Policy.** `\label{fig:robustness}`{=latex} **Left**: A waving hand in front of the camera for 3 seconds causes slight jitter, but the predicted actions still function as expected. **Middle**: Diffusion Policy immediately corrects shifted block position to the goal state during the pushing stage. **Right**: Policy immediately aborts heading to the end zone, returning the block to goal state upon detecting block shift. This novel behavior was never demonstrated. Please check the videos in the supplementary material. ](figure/real_robustness.png){width="\\linewidth"}

```{=latex}
\vspace{-4mm}
```
**Robustness against perturbation** Diffusion Policy's robustness against visual and physical perturbations was evaluated in a separate episode from the experiments in Tab. `\ref{tab:real_pusht}`{=latex}. As shown in Fig. `\ref{fig:robustness}`{=latex}, three types of perturbations were applied. 1) The front camera was blocked for 3 seconds by a waving hand (left column), but Diffusion Policy, despite exhibiting some jitter, remained on-course and pushed the T block into position. 2) We shifted the T block while Diffusion Policy was making fine adjustments to the T block's position. Diffusion Policy immediately re-planned to push from the opposite direction, negating the impact of the perturbation. 3) We moved the T block while the robot was en route to the end-zone after the first stage's completion. Diffusion Policy immediately changed course to adjust the T block back to its target and then continued to the end-zone. This experiment indicates that Diffusion Policy may be able to **synthesize novel behavior** in response to unseen observations.

```{=latex}
\centering
```
![ **6DoF Mug Flipping Task.** `\label{fig:mug_task}`{=latex} The robot needs to `\raisebox{-0.9pt}{1}`{=latex} pick up a randomly placed mug and place it lip down (marked orange), and `\raisebox{-0.9pt}{2}`{=latex} rotate the mug such that its handle is pointing left. ](figure/mug_task.png){width="0.9\\linewidth"}

```{=latex}
\vspace{1.5mm}
```
            Human   LSTM-GMM   Diffusion Policy
  -------- ------- ---------- ------------------
   Succ %    1.0      0.0            0.9

```{=latex}
\vspace{-4mm}
```
```{=latex}
\centering
```
![**Realworld Sauce Manipulation.** `\label{fig:real_sauce_manipulation}`{=latex} \[Left\] **6DoF pouring Task.** The robot needs to `\raisebox{-0.9pt}{1}`{=latex} dip the ladle to scoop sauce from the bowl, `\raisebox{-0.9pt}{2}`{=latex} approach the center of the pizza dough, `\raisebox{-0.9pt}{3}`{=latex} pour sauce, and `\raisebox{-0.9pt}{4}`{=latex} lift the ladle to finish the task. \[Right\] **Periodic spreading Task** The robot needs to `\raisebox{-0.9pt}{1}`{=latex} approach the center of the sauce with a grasped spoon, `\raisebox{-0.9pt}{2}`{=latex} spread the sauce to cover pizza in a spiral pattern, and `\raisebox{-0.9pt}{3}`{=latex} lift the spoon to finish the task. ](figure/real_sauce_setup.png){width="\\linewidth"}

```{=latex}
\vspace{1.5mm}
```
```{=latex}
\small
```
  ------------------ ------------------------------------- -------------------------------------- ---------- ----------
                      `\multicolumn{2}{c|}{Pour}`{=latex}   `\multicolumn{2}{c}{Spread}`{=latex}             
                                      IoU                                   Succ                   Coverage    Succ %
               Human                 0.79                                   1.00                     0.79       1.00
            LSTM-GMM                 0.06                                   0.00                     0.27       0.00
    Diffusion Policy               **0.74**                               **0.79**                 **0.77**   **1.00**
  ------------------ ------------------------------------- -------------------------------------- ---------- ----------

```{=latex}
\vspace{-4mm}
```
Mug Flipping Task
-----------------

The mug flipping task is designed to test Diffusion Policy's ability to handle complex **3D rotations** while operating close to the hardware's kinematic limits. The goal is to reorient a randomly placed mug to have `\raisebox{-0.9pt}{1}`{=latex} the lip facing down and `\raisebox{-0.9pt}{2}`{=latex} the handle pointing left, as shown in Fig. `\ref{fig:mug_task}`{=latex}. Depending on the mug's initial pose, the demonstrator might directly place the mug in the desired orientation, or may use an additional push of the handle to rotate the mug. As a result, the demonstration dataset is highly multi-modal: grasp vs push, different types of grasps (forehand vs backhand), or local grasp adjustments (rotation around the mug's principal axis). These modes are particularly challenging for baseline approaches to capture.

**Result Analysis.** Diffusion policy is able to complete this task with a 90% success rate over 20 trials. The richness of captured behaviors is best appreciated with the video. Although never demonstrated, the policy is also able to sequence multiple pushes for handle alignment or to regrasp a dropped mug when necessary. For comparison, we also trained an LSTM-GMM policy on a subset of the same data. For 20 in-distribution initial conditions, the LSTM-GMM policy never aligns properly with respect to the mug, and fails to grasp in all trials.

Sauce Pouring and Spreading
---------------------------

The sauce pouring and spreading tasks are designed to test Diffusion Policy's ability to work with **non-rigid** objects, **6DoF** action spaces, and **periodic** actions in real-world setups. Our Franka Panda setup and tasks are shown in Fig. `\ref{fig:real_sauce_manipulation}`{=latex}. The goal for the **6DoF pouring task** is to pour one full ladle of sauce onto the center of the pizza dough, with performance measured by IoU between the poured sauce mask and a nominal circle at the center of the pizza dough (illustrated by the green circle in Fig. `\ref{fig:real_sauce_manipulation}`{=latex}). The goal for the **periodic spreading task** is to spread sauce on pizza dough, with performance measured by sauce coverage. Variations across evaluation episodes come from random locations for the dough and the sauce bowl. The success rate is computed by thresholding with minimum human performance. Results are best viewed in the supplemental videos. Both tasks were trained with the same hyperparameters as Push-T, and successful policies were achieved on the first attempt.
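
The thresholding rule can be illustrated with a short sketch. The helper names `mask_iou` and `success_rate`, the nested-list mask format, and the use of `>=` rather than a strict inequality are assumptions made for illustration, not a description of our evaluation code.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two equally sized binary masks (nested lists)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


def success_rate(policy_scores, human_scores):
    """Fraction of episodes scoring at least the minimum human score."""
    threshold = min(human_scores)  # the worst human episode sets the bar
    return sum(s >= threshold for s in policy_scores) / len(policy_scores)


# Illustrative numbers only; these are not measurements from our experiments.
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
rate = success_rate([0.8, 0.5, 0.9, 0.7], human_scores=[0.75, 0.7, 0.8])
```

The same `success_rate` helper applies to either metric (IoU for pouring, coverage for spreading), since only the per-episode scalar score and the human minimum are needed.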

The sauce pouring task requires the robot to remain stationary for a period of time to fill the ladle with viscous tomato sauce. The resulting idle actions are known to be challenging for behavior cloning algorithms and therefore are often avoided or filtered out. Fine adjustments are necessary during sauce pouring to ensure coverage and to achieve the desired shape.

The demonstrated sauce-spreading strategy is inspired by the human chef technique, which requires both a long-horizon cyclic pattern to maximize coverage and short-horizon feedback for even distribution (since the tomato sauce used often drips out in lumps with unpredictable sizes). Periodic motions are known to be difficult to learn and therefore are often addressed by specialized action representations [@yang2022periodic]. Both tasks require the policy to self-terminate by lifting the ladle/spoon.

**Result Analysis.** Diffusion policy achieves close-to-human performance on both tasks, with IoU 0.74 vs 0.79 on pouring and coverage 0.77 vs 0.79 on spreading. Diffusion policy reacted gracefully to external perturbations such as moving the pizza dough by hand during pouring and spreading. Results are best appreciated with the videos in the supplemental material.

LSTM-GMM performed poorly on both the sauce pouring and spreading tasks. It failed to lift the ladle after successfully scooping sauce in 15 out of 20 of the pouring trials. When the ladle was successfully lifted, the sauce was poured off-center. LSTM-GMM failed to self-terminate in all trials. We suspect LSTM-GMM's hidden state failed to capture a sufficiently long history to distinguish between the ladle dipping and lifting phases of the task. For sauce spreading, LSTM-GMM always lifted the spoon right after the start, and failed to make contact with the sauce in all 20 experiments.

Realworld Bimanual Tasks {#sec:eval_bimanual}
========================

Beyond the single-arm setup, we further demonstrate Diffusion Policy on several challenging bimanual tasks. To enable bimanual tasks, the majority of effort was spent on extending our robot stack to support multi-arm teleoperation and control. Diffusion Policy worked out of the box for these tasks without hyperparameter tuning.

Observation and Action Spaces
-----------------------------

The proprioceptive observation space is extended to include the poses of both end-effectors and the gripper widths of both grippers. We also extend the observation space to include the actual and desired values of these quantities. The image observation space is comprised of two scene cameras and two wrist cameras, one attached to each arm. The action space is extended to include the desired poses of both end-effectors and the desired gripper widths of both grippers.

Teleoperation
-------------

For these coordinated bimanual tasks, we found using 2 SpaceMouse devices simultaneously quite challenging for the demonstrator. Thus, we implemented two new teleoperation modes: using a Meta Quest Pro VR device with two hand controllers, or haptic-enabled control using 2 [Haption Virtuose`\legalTM`{=latex} 6D HF TAO](https://www.haption.com/en/products-en/virtuose-6d-tao-en.html#fa-download-downloads) devices with bilateral position-position coupling, as described succinctly in the haptics section of @siciliano2008springer. This coupling is performed between a Haption device and a Franka Panda arm. More details on the controllers themselves may be found in Sec. `\ref{sec:franka_setup}`{=latex}. The following provides more details on each task and policy performance.

Bimanual Egg Beater
-------------------

The bimanual egg beater task is illustrated and described in Fig. `\ref{fig:real_egg_beater}`{=latex}, using an [OXO`\legalTM `{=latex}Egg Beater](https://www.oxo.com/egg-beater.html) and a [Room Essentials`\legalTM `{=latex}plastic bowl](https://www.target.com/p/114oz-plastic-serving-bowl-jet-gray-room-essentials-8482/-/A-86701588). We chose this task to illustrate the importance of haptic feedback for teleoperating bimanual manipulation even for common daily life tasks such as coordinated tool use. Without haptic feedback, an expert was unable to successfully complete a single demonstration out of 10 trials: 5 failed due to the robot pulling the crank handle off the egg beater, 3 failed due to the robot losing its grasp of the handle, and 2 failed due to the robot triggering its torque limit. In contrast, the same operator could easily perform this task 10 out of 10 times with haptic feedback. Using haptic feedback made it possible for the demonstrations to be both quicker and of higher quality than without feedback.

```{=latex}
\centering
```
![**Bimanual Egg Beater Manipulation.** `\label{fig:real_egg_beater}`{=latex} The robot needs to `\raisebox{-0.9pt}{1}`{=latex} push the bowl into position (only if too close to the left arm), `\raisebox{-0.9pt}{2}`{=latex} approach and pick up the egg beater with the right arm, `\raisebox{-0.9pt}{3}`{=latex} place the egg beater in the bowl, `\raisebox{-0.9pt}{4}`{=latex} approach and grasp the egg beater crank handle, and `\raisebox{-0.9pt}{5}`{=latex} turn the crank handle 3 or more times. ](figure/real_egg_beater_setup_compressed.png){width="\\linewidth"}

```{=latex}
\vspace{-4mm}
```
**Result Analysis.** Diffusion policy is able to complete this task with a 55% success rate over 20 trials, trained using 210 demonstrations. The primary failure modes were out-of-domain initial positioning of the egg beater, missing the egg beater crank handle, or losing grasp of it. The initial and final states for all rollouts are visualized in Fig. `\ref{fig:egg_beater_ini}`{=latex} and Fig. `\ref{fig:egg_beater_last}`{=latex}.

Bimanual Mat Unrolling
----------------------

The mat unrolling task is shown and described in Fig. `\ref{fig:real_unroll_mat}`{=latex}, using a [XXL Dog Buddy`\legalTM `{=latex}Dog Mat](https://www.amazon.com/DogBuddy-Dog-Food-Mat-Waterproof/dp/B08GGDNB71). This task was teleoperated using the VR setup, as it did not require rich haptic feedback to perform the task. We taught this skill to be omnidextrous, meaning it can unroll either to the left or right depending on the initial condition.

```{=latex}
\centering
```
![**Bimanual Mat Unrolling.** `\label{fig:real_unroll_mat}`{=latex} The robot needs to `\raisebox{-0.9pt}{1}`{=latex} pick up one side of the mat (if needed), using the left or right arm, `\raisebox{-0.9pt}{2}`{=latex} lift and unroll the mat (if needed), `\raisebox{-0.9pt}{3}`{=latex} ensure that both sides of the mat are grasped, `\raisebox{-0.9pt}{4}`{=latex} lift the mat, `\raisebox{-0.9pt}{5}`{=latex} place the mat oriented with the table, mostly centered, and `\raisebox{-0.9pt}{6}`{=latex} release the mat. ](figure/real_unroll_mat_setup_compressed.png){width="\\linewidth"}

```{=latex}
\vspace{-4mm}
```
**Result Analysis.** Diffusion policy is able to complete this task with a 75% success rate over 20 trials, trained using 162 demonstrations. The primary failure mode was a missed grasp during the initial grasp of the mat, where the policy struggled to correct itself and thus got stuck repeating the same behavior. The initial and final states for all rollouts are visualized in Fig. `\ref{fig:unroll_mat_ini}`{=latex} and Fig. `\ref{fig:unroll_mat_last}`{=latex}.

Bimanual Shirt Folding
----------------------

The shirt folding task is described and illustrated in Fig. `\ref{fig:real_fold_shirt}`{=latex}, using a short-sleeve T-shirt. This task was also teleoperated using the VR setup, as it did not require rich haptic feedback to perform. Due to kinematic and workspace constraints, this task is notably longer and can take up to nine discrete steps. The last few steps require both grippers to come very close to each other. Having our mid-level controller explicitly handle collision avoidance was especially important for both teleoperation and policy rollout.

```{=latex}
\centering
```
![**Bimanual Shirt Folding.** `\label{fig:real_fold_shirt}`{=latex} The robot needs to `\raisebox{-0.9pt}{1}`{=latex} approach and grasp the closest sleeve with both arms, `\raisebox{-0.9pt}{2}`{=latex} fold the sleeve and release, `\raisebox{-0.9pt}{3}`{=latex} drag the shirt closer (if needed), `\raisebox{-0.9pt}{4}`{=latex} approach and grasp the other sleeve with both arms, `\raisebox{-0.9pt}{5}`{=latex} fold the sleeve and release, `\raisebox{-0.9pt}{6}`{=latex} drag the shirt to an orientation for folding, `\raisebox{-0.9pt}{7}`{=latex} grasp and fold the shirt in half by its collar, `\raisebox{-0.9pt}{8}`{=latex} drag the shirt to the center, and `\raisebox{-0.9pt}{9}`{=latex} smooth out the shirt and move the arms away. ](figure/real_fold_shirt_setup.jpg){width="\\linewidth"}

```{=latex}
\vspace{-4mm}
```
**Result Analysis.** Diffusion policy is able to complete this task with a 75% success rate over 20 trials, trained using 284 demonstrations. The primary failure modes were missed grasps for initial folding (the sleeves and the collar), and the policy being unable to stop adjusting the shirt at the end. The initial and final states for all rollouts are visualized in Fig. `\ref{fig:fold_shirt_ini}`{=latex} and Fig. `\ref{fig:fold_shirt_last}`{=latex}.

Related Work
============

Creating capable robots without requiring explicit programming of behaviors is a longstanding challenge in the field [@atkeson1997robot; @argall2009survey; @ravichandar2020recent]. While conceptually simple, behavior cloning has shown surprising promise on an array of real-world robot tasks, including manipulation [@zhang2018deep; @florence2019self; @mandlekar2020learning; @mandlekar2020iris; @zeng2021transporter; @rahmatizadeh2018vision; @avigal2022speedfolding] and autonomous driving [@pomerleau1988alvinn; @bojarski2016end]. Current behavior cloning approaches can be categorized into two groups, depending on the policy's structure.

**Explicit Policy.** The simplest form of explicit policies maps from world state or observation directly to action [@pomerleau1988alvinn; @zhang2018deep; @florence2019self; @ross2011reduction; @toyer2020magical; @rahmatizadeh2018vision; @bojarski2016end]. They can be supervised with a direct regression loss and have efficient inference time with one forward pass. Unfortunately, this type of policy is not suitable for modeling multi-modal demonstrated behavior, and struggles with high-precision tasks [@ibc]. A popular approach to model multimodal action distributions while maintaining the simplicity of direct action mapping is to convert the regression task into classification by discretizing the action space [@zeng2021transporter; @wu2020spatial; @avigal2022speedfolding]. However, the number of bins needed to approximate a continuous action space grows exponentially with increasing dimensionality. Another approach is to combine Categorical and Gaussian distributions to represent continuous multimodal distributions via the use of MDNs [@bishop1994mixture; @robomimic] or clustering with offset prediction [@bet; @sharma2018multiple]. Nevertheless, these models tend to be sensitive to hyperparameter tuning, exhibit mode collapse, and are still limited in their ability to express high-precision behavior [@ibc].

**Implicit Policy.** Implicit policies [@ibc; @jarrett2020strictly] define distributions over actions by using Energy-Based Models (EBMs) [@lecun06atutorial; @du2019implicit; @dai2019exponential; @grathwohl2020stein; @du2020improved]. In this setting, each action is assigned an energy value, with action prediction corresponding to the optimization problem of finding a minimal energy action. Since different actions may be assigned low energies, implicit policies naturally represent multi-modal distributions. However, existing implicit policies [@ibc] are unstable to train due to the necessity of drawing negative samples when computing the underlying Info-NCE loss.

**Diffusion Models.** Diffusion models are probabilistic generative models that iteratively refine randomly sampled noise into draws from an underlying distribution. They can also be conceptually understood as learning the gradient field of an implicit action score and then optimizing along that gradient during inference. Diffusion models [@sohldickstein2015nonequilibrium; @ho2020denoising] have recently been applied to solve various control tasks [@janner2022diffuser; @urain2022se; @ajay2022conditional].

In particular, @janner2022diffuser and @huang2023diffusion explore how diffusion models may be used in the context of planning and infer a trajectory of actions that may be executed in a given environment. In the context of Reinforcement Learning, @wang2022diffusion use a diffusion model for policy representation and regularization with state-based observations. In contrast, in this work, we explore how diffusion models may instead be effectively applied in the context of behavioral cloning for visuomotor control policies. To construct effective visuomotor control policies, we propose to combine DDPM's ability to predict high-dimensional action sequences with closed-loop control, as well as a new transformer architecture for action diffusion and a method to integrate visual inputs into the action diffusion model.

@wang2023diffusion explore how diffusion models learned from expert demonstrations can be used to augment classical explicit policies without directly taking advantage of diffusion models as policy representation.

Concurrent to us, @pearce2023imitating, @reuss2023goal and @hansen2023idql have conducted complementary analyses of diffusion-based policies in simulated environments. While they focus more on effective sampling strategies, leveraging classifier-free guidance for goal-conditioning, and applications in Reinforcement Learning, we focus on effective action spaces; our empirical findings largely concur in the simulated regime. In addition, our extensive real-world experiments provide strong evidence for the importance of a receding-horizon prediction scheme, the careful choice between velocity and position control, and the necessity of optimization for real-time inference and other critical design decisions for a physical robot system.

Limitations and Future Work
===========================

Although we have demonstrated the effectiveness of diffusion policy in both simulation and real-world systems, there are limitations that future work can improve. First, our implementation inherits limitations from behavior cloning, such as suboptimal performance with inadequate demonstration data. Diffusion policy can be applied to other paradigms, such as reinforcement learning [@wang2023diffusion; @hansen2023idql], to take advantage of suboptimal and negative data. Second, diffusion policy has higher computational costs and inference latency compared to simpler methods like LSTM-GMM. Our action sequence prediction approach partially mitigates this issue, but may not suffice for tasks requiring high rate control. Future work can exploit the latest advancements in diffusion model acceleration methods to reduce the number of inference steps required, such as new noise schedules [@chen2023importance], inference solvers [@karras2022elucidating], and consistency models [@song2023consistency].

Conclusion
==========

In this work, we assess the feasibility of diffusion-based policies for robot behaviors. Through a comprehensive evaluation of 15 tasks in simulation and the real world, we demonstrate that diffusion-based visuomotor policies consistently and definitively outperform existing methods while also being stable and easy to train. Our results also highlight critical design factors, including receding-horizon action prediction, end-effector position control, and efficient visual conditioning, that are crucial for unlocking the full potential of diffusion-based policies. While many factors affect the ultimate quality of behavior-cloned policies --- including the quality and quantity of demonstrations, the physical capabilities of the robot, the policy architecture, and the pretraining regime used --- our experimental results strongly indicate that policy structure poses a significant performance bottleneck during behavior cloning. We hope that this work drives further exploration in the field into diffusion-based policies and highlights the importance of considering all aspects of the behavior cloning process beyond just the data used for policy training.

Acknowledgement
===============

We'd like to thank Naveen Kuppuswamy, Hongkai Dai, Aykut Önol, Terry Suh, Tao Pang, Huy Ha, Samir Gadre, Kevin Zakka and Brandon Amos for their thoughtful discussions. We thank Jarod Wilson for 3D printing support and Huy Ha for photography and lighting advice. We thank Xiang Li for discovering the bug in our evaluation code on GitHub.

```{=latex}
\begin{funding}
This work was supported by the Toyota Research Institute, NSF CMMI-2037101 and NSF IIS-2132519. We would like to thank Google for the UR5 robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
\end{funding}
```
```{=latex}
\bibliographystyle{SageH}
```
```{=latex}
\appendix
```
Diffusion Policy Implementation Details
=======================================

Normalization
-------------

Properly normalizing action data is critical to achieve best performance for Diffusion Policy. Scaling the min and max of each action dimension independently to $[-1,1]$ works well for most tasks. Since DDPMs clip prediction to $[-1,1]$ at each iteration to ensure stability, the common zero-mean unit-variance normalization will cause some region of the action space to be inaccessible. When the data variance is small (e.g., near constant value), shift the data to zero-mean without scaling to prevent numerical issues. We leave action dimensions corresponding to rotation representations (e.g. Quaternion) unchanged.
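
A minimal sketch of this normalization scheme is given below. The helper names (`fit_normalizer`, `normalize`, `unnormalize`) and the threshold `eps` used to detect near-constant dimensions are assumptions for illustration; the sketch tests the data range rather than the variance, and it expects rotation dimensions to be excluded by the caller.

```python
def fit_normalizer(actions, eps=1e-4):
    """Fit a per-dimension affine map x -> x * scale + offset.

    actions: list of action vectors (lists of floats), rotation dims excluded.
    eps: assumed threshold below which a dimension is treated as constant.
    """
    num_dims = len(actions[0])
    scale, offset = [], []
    for d in range(num_dims):
        column = [a[d] for a in actions]
        lo, hi = min(column), max(column)
        if hi - lo < eps:
            # Near-constant dimension: shift to zero mean, do not rescale.
            scale.append(1.0)
            offset.append(-sum(column) / len(column))
        else:
            # Map [lo, hi] onto [-1, 1].
            s = 2.0 / (hi - lo)
            scale.append(s)
            offset.append(-1.0 - lo * s)
    return scale, offset


def normalize(action, scale, offset):
    return [x * s + o for x, s, o in zip(action, scale, offset)]


def unnormalize(action, scale, offset):
    return [(y - o) / s for y, s, o in zip(action, scale, offset)]


# Toy data: the second dimension is constant and is only shifted to zero.
data = [[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]]
scale, offset = fit_normalizer(data)
unit = normalize([10.0, 5.0], scale, offset)
```

Because every training action then lies inside $[-1,1]$, clipping predictions to that range during denoising cannot exclude any action seen in the data.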

Rotation Representation
-----------------------

For all environments with velocity control action space, we followed the standard practice [@robomimic] to use 3D axis-angle representation for the rotation component of action. Since velocity action commands are usually close to 0, the singularity and discontinuity of the axis-angle representation don't usually cause problems. We used the 6D rotation representation proposed in @zhou2019continuity for all environments (real-world and simulation) with positional control action space.
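
For reference, the 6D representation can be mapped to a rotation matrix by Gram-Schmidt orthogonalization of two 3D vectors. The sketch below follows the column convention of @zhou2019continuity (the 6D vector holds the first two columns of the matrix); some libraries use rows instead, and the function names here are illustrative rather than taken from our code.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rot6d_to_matrix(r6):
    """Map a 6D vector [a1, a2] to a 3x3 rotation matrix (list of rows)."""
    a1, a2 = r6[:3], r6[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    # The third axis is the cross product b1 x b2.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    cols = [b1, b2, b3]
    return [[cols[j][i] for j in range(3)] for i in range(3)]


def matrix_to_rot6d(R):
    """Keep the first two columns of a rotation matrix as a 6D vector."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


# A non-orthonormal prediction is still mapped to a valid rotation.
R = rot6d_to_matrix([2.0, 0.0, 0.0, 1.0, 1.0, 0.0])
```

Since any 6D vector with linearly independent halves maps to a valid rotation, the network output needs no extra constraint, which is the continuity property that motivates this choice for position control.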

Image Augmentation
------------------

Following @robomimic, we employed random crop augmentation during training. The crop size for each task is indicated in Tab. `\ref{tab:hparam_cnn}`{=latex}. During inference, we take a static center crop with the same size.
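
The two cropping modes can be sketched as follows, with images represented as nested lists (rows of pixels) to keep the example dependency-free. The function names and the toy image size are illustrative; the actual crop sizes are those listed in the hyperparameter table.

```python
import random


def crop(image, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def random_crop(image, height, width, rng=random):
    """Training-time augmentation: crop at a uniformly random location."""
    top = rng.randint(0, len(image) - height)      # randint is inclusive
    left = rng.randint(0, len(image[0]) - width)
    return crop(image, top, left, height, width)


def center_crop(image, height, width):
    """Inference-time crop: fixed window centered in the image."""
    top = (len(image) - height) // 2
    left = (len(image[0]) - width) // 2
    return crop(image, top, left, height, width)


# Toy 4x6 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
train_view = random_crop(image, 2, 4, rng=random.Random(0))
eval_view = center_crop(image, 2, 4)
```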

Hyperparameters
---------------

Hyperparameters used for Diffusion Policy on both simulation and real-world benchmarks are shown in Tab. `\ref{tab:hparam_cnn}`{=latex} and Tab. `\ref{tab:hparam_transformer}`{=latex}. Since the Block Push task uses a Markovian scripted oracle policy to generate demonstration data, we found its optimal hyperparameters for observation and action horizon to be very different from those of other tasks with human teleop demonstrations.

We found that the optimal hyperparameters for CNN-based Diffusion Policy are consistent across tasks. In contrast, the optimal attention dropout rate and weight decay for transformer-based Diffusion Policy vary greatly across tasks. During tuning, we found that increasing the number of parameters in CNN-based Diffusion Policy always improves performance, so the optimal model size is limited by the available compute and memory capacity. On the other hand, increasing the model size of transformer-based Diffusion Policy (in particular the number of layers) sometimes hurts performance. For CNN-based Diffusion Policy, we found that using FiLM conditioning to pass in observations is better than inpainting on all tasks except Push-T. The performance reported for DiffusionPolicy-C on Push-T in Tab. `\ref{tab:table_low_dim}`{=latex} used inpainting instead of FiLM.

On simulation benchmarks, we used the iDDPM algorithm [@nichol2021improved] with the same 100 denoising diffusion iterations for both training and inference. We used DDIM [@song2021ddim] on real-world benchmarks to reduce the number of inference denoising iterations to 16, thereby reducing inference latency.

We used a batch size of 256 for all state-based experiments and 64 for all image-based experiments. For learning-rate scheduling, we used a cosine schedule with linear warmup. CNN-based Diffusion Policy is warmed up for 500 steps, while Transformer-based Diffusion Policy is warmed up for 1000 steps.
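
One common form of a cosine schedule with linear warmup is sketched below. The exact schedule shape and the total number of training steps are not specified in this appendix, so `total_steps=10000` and the decay-to-zero endpoint are assumptions for illustration only.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# CNN-based setting from the text: 500 warmup steps, learning rate 1e-4.
# total_steps is a hypothetical value chosen for this example.
schedule = [lr_at_step(s, 1e-4, 500, 10000) for s in (0, 250, 500, 5250, 10000)]
```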

```{=latex}
\begin{table*}
\centering
\setlength\tabcolsep{3 pt}
\small
\begin{tabular}{l|llllllllllll}
\toprule
           % & Ctrl & To & Ta & Tp & ImgRes    & CropRes   & \#D-params & \#V-params & Lr   & WDecay & D-Iters Train & D-Iters Eval \\
\textbf{H-Param} & \textbf{Ctrl} & \textbf{To} & \textbf{Ta} & \textbf{Tp} & \textbf{ImgRes} & \textbf{CropRes} & \textbf{\#D-Params} & \textbf{\#V-Params} & \textbf{Lr} & \textbf{WDecay} & \textbf{D-Iters Train} & \textbf{D-Iters Eval} \\

\midrule
Lift       & Pos  & 2  & 8  & 16 & 2x84x84   & 2x76x76   & 256      & 22       & 1e-4 & 1e-6   & 100           & 100          \\
Can        & Pos  & 2  & 8  & 16 & 2x84x84   & 2x76x76   & 256      & 22       & 1e-4 & 1e-6   & 100           & 100          \\
Square     & Pos  & 2  & 8  & 16 & 2x84x84   & 2x76x76   & 256      & 22       & 1e-4 & 1e-6   & 100           & 100          \\
Transport  & Pos  & 2  & 8  & 16 & 4x84x85   & 4x76x76   & 264      & 45       & 1e-4 & 1e-6   & 100           & 100          \\
ToolHang   & Pos  & 2  & 8  & 16 & 2x240x240 & 2x216x216 & 256      & 22       & 1e-4 & 1e-6   & 100           & 100          \\
Push-T     & Pos  & 2  & 8  & 16 & 1x96x96   & 1x84x84   & 256      & 22       & 1e-4 & 1e-6   & 100           & 100          \\
Block Push & Pos  & 3  & 1  & 12 & N/A       & N/A       & 256      & 0        & 1e-4 & 1e-6   & 100           & 100          \\
Kitchen    & Pos  & 2  & 8  & 16 & N/A       & N/A       & 256      & 0        & 1e-4 & 1e-6   & 100           & 100          \\
\midrule
Real Push-T     & Pos  & 2  & 6  & 16 & 2x320x240 & 2x288x216 & 67       & 22       & 1e-4 & 1e-6   & 100           & 16           \\
Real Pour       & Pos  & 2  & 8  & 16 & 2x320x240 & 2x288x216 & 67       & 22       & 1e-4 & 1e-6   & 100           & 16           \\
Real Spread     & Pos  & 2  & 8  & 16 & 2x320x240 & 2x288x216 & 67       & 22       & 1e-4 & 1e-6   & 100           & 16           \\
Real Mug Flip   & Pos  & 2  & 8  & 16 & 2x320x240 & 2x288x216 & 67       & 22       & 1e-4 & 1e-6   & 100           & 16           \\
\bottomrule
\end{tabular}
\caption{
\textbf{Hyperparameters for CNN-based Diffusion Policy}
\label{tab:hparam_cnn}
Ctrl: position or velocity control 
To: observation horizon 
Ta: action horizon 
Tp: action prediction horizon 
ImgRes: environment observation resolution (Camera views x W x H) 
CropRes: random crop resolution 
\#D-Params: diffusion network number of parameters in millions 
\#V-Params: vision encoder number of parameters in millions 
Lr: learning rate 
WDecay: weight decay
D-Iters Train: number of training diffusion iterations
D-Iters Eval: number of inference diffusion iterations (enabled by DDIM \cite{song2021ddim})
}

\vspace{4mm}
\centering
\setlength\tabcolsep{2.1 pt}
\begin{tabular}{l|lllllllllllll}
\toprule
           % & Ctrl & To & Ta & Tp & \#D-params & \#V-params & \#Layers & Emb Dim & Attn Dropout & Lr   & WDecay & D-Iters Train & D-Iters Eval \\
\textbf{H-Param} & \textbf{Ctrl} & \textbf{To} & \textbf{Ta} & \textbf{Tp} & \textbf{\#D-params} & \textbf{\#V-params} & \textbf{\#Layers} & \textbf{Emb Dim} & \textbf{Attn Drp} & \textbf{Lr} & \textbf{WDecay} & \textbf{D-Iters Train} & \textbf{D-Iters Eval} \\

\midrule
Lift       & Pos  & 2  & 8  & 10 & 9        & 22       & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
Can        & Pos  & 2  & 8  & 10 & 9        & 22       & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
Square     & Pos  & 2  & 8  & 10 & 9        & 22       & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
Transport  & Pos  & 2  & 8  & 10 & 9        & 45       & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
ToolHang   & Pos  & 2  & 8  & 10 & 9        & 22       & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
Push-T     & Pos  & 2  & 8  & 16 & 9        & 22       & 8      & 256     & 0.01         & 1e-4 & 1e-1   & 100           & 100          \\
Block Push & Vel  & 3  & 1  & 5  & 9        & 0        & 8      & 256     & 0.3          & 1e-4 & 1e-3   & 100           & 100          \\
Kitchen    & Pos  & 4  & 8  & 16 & 80       & 0        & 8      & 768     & 0.1          & 1e-4 & 1e-3   & 100           & 100          \\
\midrule
Real Push-T     & Pos  & 2  & 6  & 16 & 80      & 22       & 8      & 768     & 0.3          & 1e-4 & 1e-3   & 100           & 16           \\
\bottomrule
\end{tabular}
\caption{
\textbf{Hyperparameters for Transformer-based Diffusion Policy}
\label{tab:hparam_transformer}
Ctrl: position or velocity control 
To: observation horizon 
Ta: action horizon 
Tp: action prediction horizon 
\#D-Params: diffusion network number of parameters in millions 
\#V-Params: vision encoder number of parameters in millions 
Emb Dim: transformer token embedding dimension
Attn Drp: transformer attention dropout probability
Lr: learning rate 
WDecay: weight decay (for transformer only)
D-Iters Train: number of training diffusion iterations
D-Iters Eval: number of inference diffusion iterations (enabled by DDIM \cite{song2021ddim})
}
\vspace{-5mm}
\end{table*}
```
```{=latex}
\centering
```
![ **Observation Horizon Ablation Study.** `\label{fig:obs_horizon_ablation}`{=latex} State-based Diffusion Policy is not sensitive to observation horizon. Vision-based Diffusion Policy prefers low but $>1$ observation horizon, with $2$ being a good compromise for most tasks. ](figure/obs_horizon_figure.png){width="\\linewidth"}

```{=latex}
\vspace{-2mm}
```
```{=latex}
\vspace{-3mm}
```
```{=latex}
\centering
```
![ **Data Efficiency Ablation Study.** `\label{fig:data_efficiency}`{=latex} Diffusion Policy outperforms LSTM-GMM [@robomimic] at every training dataset size. ](figure/sample_efficiency_figure.png){width="\\linewidth"}

```{=latex}
\vspace{-2mm}
```
```{=latex}
\vspace{-5mm}
```
Data Efficiency
---------------

We found Diffusion Policy to outperform LSTM-GMM [@robomimic] at every training dataset size, as shown in Fig. `\ref{fig:data_efficiency}`{=latex}.

Additional Ablation Results
===========================

Observation Horizon
-------------------

We found state-based Diffusion Policy to be insensitive to observation horizon, as shown in Fig. `\ref{fig:obs_horizon_ablation}`{=latex}. However, vision-based Diffusion Policy, in particular the variant with a CNN backbone, sees performance decrease with increasing observation horizon. In practice, we found that an observation horizon of 2 works well for most tasks, for both state and image observations.

Performance Improvement Calculation
-----------------------------------

For each task $i$ (column) reported in Tab. `\ref{tab:table_low_dim}`{=latex}, Tab. `\ref{tab:table_image}`{=latex} and Tab. `\ref{tab:multi_stage}`{=latex} (mh results ignored), we find the maximum performance among baseline methods, $max\_baseline_i$, and the maximum performance among Diffusion Policy variants (CNN vs. Transformer), $max\_ours_i$. For each task, the performance improvement is calculated as $improvement_i = \frac{max\_ours_i-max\_baseline_i}{max\_baseline_i}$ (positive for all tasks). Finally, the average improvement over the $N$ tasks is calculated as $avg\_improvement=\frac{1}{N}\sum_{i=1}^{N} improvement_i=0.46858 \approx 46.9\%$.
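
The calculation can be written out as a short Python sketch. The task names and scores below are placeholder values invented for this example; they are not results from the paper.

```python
def average_improvement(baseline_scores, our_scores):
    """Per-task relative improvement of the best variant over the best baseline."""
    improvements = {}
    for task, baselines in baseline_scores.items():
        max_baseline = max(baselines)
        max_ours = max(our_scores[task])
        improvements[task] = (max_ours - max_baseline) / max_baseline
    avg = sum(improvements.values()) / len(improvements)
    return improvements, avg


# Placeholder numbers for illustration only.
baseline_scores = {"task_a": [0.5, 0.4], "task_b": [0.8]}
our_scores = {"task_a": [0.75, 0.6], "task_b": [1.0, 0.9]}
improvements, avg = average_improvement(baseline_scores, our_scores)
```

Because the improvement is relative to the best baseline, tasks where baselines score low contribute large percentages to the average.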

Realworld Task Details
======================

Push-T
------

### Demonstrations

136 demonstrations are collected and used for training. The initial condition is varied by randomly pushing or tossing the T block onto the table. Prior to this data collection session, the operator has performed this task for many hours and should be considered proficient at this task.

### Evaluation

We used a fixed training time of 12 hours for each method and selected the last checkpoint for each, with the exception of IBC, for which we selected the checkpoint with the minimum training-set action prediction MSE because of IBC's training stability issues. The difficulty of training and checkpoint selection for IBC is demonstrated in main text Fig. 7. Each method is evaluated for 20 episodes, all starting from the same set of initial conditions. To ensure the consistency of initial conditions, we carefully adjusted the pose of the T block and the robot according to overlaid images from the top-down camera. Each evaluation episode is terminated either by keeping the end-effector within the end-zone for more than 0.5 seconds, or by reaching the 60-second time limit. The IoU metric is computed directly in the top-down camera pixel space.
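
A minimal sketch of a pixel-space IoU, assuming the T block and the target region have already been segmented into binary masks of equal size (how the masks are obtained is outside the scope of this example, and returning 0 for an empty union is our own convention here).

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


block_mask = [[1, 1, 0], [0, 0, 0]]
target_mask = [[0, 1, 1], [0, 0, 0]]
iou = mask_iou(block_mask, target_mask)
```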

Sauce Pouring and Spreading
---------------------------

### Demonstrations

50 demonstrations are collected, and 90% are used for training for each task. For pouring, initial locations of the pizza dough and sauce bowl are varied. After each demonstration, sauce is poured back into the bowl, and the dough is wiped clean. For spreading, location of the pizza dough as well as the poured sauce shape are varied. For resetting, we manually gather sauce towards the center of the dough, and wipe the remaining dough clean. The rotational components for tele-op commands are discarded during spreading and sauce transferring to avoid accidentally scooping or spilling sauce.

### Evaluation

Both Diffusion Policy and LSTM-GMM are trained for 1000 epochs. The last checkpoint is used for evaluation.

Each method is evaluated from the same set of random initial conditions, where positions of the pizza dough and sauce bowl are varied. We use a similar protocol as in **Push-T** to set up initial conditions. We do not try to match initial shape of poured sauce for spreading. Instead, we make sure the amount of sauce is fixed during all experiments.

The evaluation episodes are terminated by moving the spoon upward (away from the dough) for 0.5 seconds, or when the operator deems the policy's behavior unsafe.

The coverage metric is computed by first projecting the RGB image from both the left and right cameras onto the table space through homography, then computing the coverage in each projected image. The maximum coverage between the left and right cameras is reported.
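
The sketch below illustrates the two ingredients of this metric: mapping a pixel through a 3x3 homography, and taking the maximum of the per-camera coverage values. The function names are hypothetical, and the definition of per-image coverage as the fraction of dough pixels covered by sauce is an assumption made for this example.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography H (row-major nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # divide by the homogeneous coordinate


def coverage_fraction(sauce_mask, dough_mask):
    """Assumed definition: fraction of dough pixels that are covered by sauce."""
    dough = sum(d for row in dough_mask for d in row)
    covered = sum(
        1 for row_s, row_d in zip(sauce_mask, dough_mask)
        for s, d in zip(row_s, row_d) if s and d
    )
    return covered / dough if dough else 0.0


def reported_coverage(coverage_left, coverage_right):
    """The reported value is the maximum over the two camera views."""
    return max(coverage_left, coverage_right)


H_example = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
table_point = apply_homography(H_example, 3.0, 4.0)

dough = [[1, 1], [1, 1]]
left = coverage_fraction([[1, 0], [0, 0]], dough)
right = coverage_fraction([[1, 1], [1, 0]], dough)
final = reported_coverage(left, right)
```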

Realworld Setup Details
=======================

UR5 Robot Station {#sec:ur5_setup}
----------------------------------

Experiments for the **Push-T** task are performed on the UR5 robot station.

The UR5 robot accepts end-effector space positional commands at 125 Hz, which are linearly interpolated from the 10 Hz commands from either human demonstration or the policy. For safety, the interpolation controller limits the end-effector velocity to be below 0.43 m/s and its position to be within the region 1 cm above the table. Position-controlled policies directly predict the desired end-effector pose, while velocity-controlled policies predict the difference between the current positional setpoint and the previous setpoint.
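
A simplified sketch of such an interpolation step is given below. It assumes a table frame with the surface at $z=0$, interprets the position limit as a floor 1 cm above the table, and enforces the speed limit by shortening each 125 Hz step; these choices, and all function names, are assumptions for illustration rather than a description of the actual controller.

```python
import math

CONTROL_HZ = 125.0   # robot command rate
POLICY_HZ = 10.0     # policy / teleop command rate
MAX_SPEED = 0.43     # m/s
MIN_HEIGHT = 0.01    # m above the table (assumed table frame with z = 0)


def limit_step(current, target, dt):
    """Move from current toward target, capped at MAX_SPEED and floored in z."""
    delta = [t - c for c, t in zip(current, target)]
    dist = math.sqrt(sum(d * d for d in delta))
    max_dist = MAX_SPEED * dt
    if dist > max_dist:
        delta = [d * max_dist / dist for d in delta]
    nxt = [c + d for c, d in zip(current, delta)]
    nxt[2] = max(nxt[2], MIN_HEIGHT)
    return nxt


def interpolate_segment(p0, p1, period=1.0 / POLICY_HZ):
    """Setpoints at CONTROL_HZ while moving linearly from p0 to p1 over one period."""
    dt = 1.0 / CONTROL_HZ
    n_ticks = int(period * CONTROL_HZ)
    current = list(p0)
    setpoints = []
    for k in range(1, n_ticks + 1):
        alpha = min(1.0, k * dt / period)
        target = [a + alpha * (b - a) for a, b in zip(p0, p1)]
        current = limit_step(current, target, dt)
        setpoints.append(current)
    return setpoints


# A slow command (0.2 m/s) is tracked exactly; a fast one (about 1 m/s) is rate limited.
slow = interpolate_segment([0.0, 0.0, 0.05], [0.02, 0.0, 0.05])
fast = interpolate_segment([0.0, 0.0, 0.05], [0.10, 0.0, 0.05])
```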

The UR5 robot station has five RealSense D415 depth cameras recording 720p RGB videos at 30 fps. Only two of the cameras are used for policy observation, and their images are down-sampled to 320x240 at 10 fps.

During demonstration, the operator teleoperates the robot via a 3Dconnexion SpaceMouse at 10 Hz.

Franka Robot Station {#sec:franka_setup}
--------------------

Experiments for **Sauce Pouring and Spreading, Bimanual Egg Beater, Bimanual Mat Unrolling, and Bimanual Shirt Folding** tasks are performed on the Franka robot station.

For the non-haptic control, a custom mid-level controller is implemented to generate desired joint positions from desired end effector poses from the learned policies. At each time step, we solve a differential kinematics problem (formulated as a Quadratic Program) to compute the desired joint velocity to track the desired end effector velocity. The resulting joint velocity is Euler integrated into joint position, which is tracked by a joint-level controller on the robot. This formulation allows us to impose constraints such as collision avoidance for the two arms and the table, safety region for end effector and joint limits. It also enables regulating redundant DoF in the null space of the end effector commands. This mid-level controller is particularly valuable for safeguarding the learned policy during hardware deployment.
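
To make the structure of this loop concrete, the sketch below runs a heavily simplified version on a hypothetical planar two-link arm. It replaces the constrained Quadratic Program with an unconstrained damped least-squares solve followed by joint-velocity clipping, so it does not model collision avoidance, safety regions, or null-space regulation. Link lengths, damping, and velocity limits are made-up values.

```python
import math

L1, L2 = 0.4, 0.3  # hypothetical link lengths in meters


def forward(q):
    """End-effector position of a planar two-link arm."""
    return (
        L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1]),
        L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1]),
    )


def jacobian(q):
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def solve_joint_velocity(J, v, damping=1e-3, dq_max=2.0):
    """Minimize |J dq - v|^2 + damping^2 |dq|^2, then clip (stand-in for QP limits)."""
    a = J[0][0] ** 2 + J[1][0] ** 2 + damping ** 2
    b = J[0][0] * J[0][1] + J[1][0] * J[1][1]
    d = J[0][1] ** 2 + J[1][1] ** 2 + damping ** 2
    g0 = J[0][0] * v[0] + J[1][0] * v[1]
    g1 = J[0][1] * v[0] + J[1][1] * v[1]
    det = a * d - b * b
    dq = [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]
    return [max(-dq_max, min(dq_max, x)) for x in dq]


def control_step(q, v_desired, dt):
    """One mid-level tick: solve for joint velocity, then Euler integrate."""
    dq = solve_joint_velocity(jacobian(q), v_desired)
    return [qi + dqi * dt for qi, dqi in zip(q, dq)]


# Track 0.05 m/s in x for one second at a 1 kHz control rate.
q = [0.5, 1.0]
start_xy = forward(q)
for _ in range(1000):
    q = control_step(q, [0.05, 0.0], dt=0.001)
end_xy = forward(q)
```

In a full QP formulation the clipping step would be replaced by explicit inequality constraints, which is what allows limits and collision avoidance to be traded off against tracking in a single solve.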

For haptic teleoperation control, another custom mid-level controller is implemented, but formulated as a pure torque controller. The controller is formulated using Operational Space Control [@khatib1987osc] as a Quadratic Program operating at 200 Hz, where position, velocity, and torque limits are added as constraints, and the primary spatial objective and secondary null-space posture objectives are posed as costs. This, coupled with a good model of the Franka Panda arm, including reflected rotor inertias, allows us to perform good tracking with pure spatial feedback, and even better tracking with feedforward spatial acceleration. Collision avoidance has not yet been enabled for this control mode.

Note that for inference, we use the non-haptic control. Future work intends to simplify this control strategy and only use a single controller for our given objectives.

The operator uses a SpaceMouse or VR controller input device(s) to control the robot's end effector(s), and the grippers are controlled by a trigger button on the respective device. Tele-op and learned policies run at 10 Hz, and the mid-level controller runs at around 1 kHz. Desired end effector pose commands are interpolated by the mid-level controller. This station has two RealSense D415 RGBD cameras streaming VGA RGB images at 30 fps, which are downsampled to 320x240 at 10 fps as input to the learned policies.

Initial and Final States of Bimanual Tasks {#sec:bimanual_ini_fial}
------------------------------------------

The following figures show the initial and final states of the bimanual tasks. Green and red boxes indicate successful and failed rollouts, respectively. Since the mat and shirt are very flat objects, we used a homographic projection to better visualize the initial and final states.

```{=latex}
\centering
```
![Initial states for Mat Unrolling](figure/ijrr24_unroll_mat_ini.jpg){#fig:unroll_mat_last width="\\linewidth"}

```{=latex}
\centering
```
![Final states for Mat Unrolling](figure/ijrr24_unroll_mat_last.jpg){#fig:unroll_mat_last width="\\linewidth"}

```{=latex}
\centering
```
![Initial states for Egg Beater](figure/ijrr24_egg_beater_ini.jpg){#fig:egg_beater_last width="\\linewidth"}

```{=latex}
\centering
```
![Final states for Egg Beater](figure/ijrr24_egg_beater_last.jpg){#fig:egg_beater_last width="\\linewidth"}

```{=latex}
\centering
```
![Initial states for Shirt Folding](figure/ijrr24_fold_shirt_ini.jpg){#fig:fold_shirt_last width="\\linewidth"}

```{=latex}
\centering
```
![Final states for Shirt Folding](figure/ijrr24_fold_shirt_last.jpg){#fig:fold_shirt_last width="\\linewidth"}

[^1]: Due to a bug in our evaluation code, only 22 environment initializations are used for robomimic tasks. This does not change our conclusion since all baseline methods are evaluated in the same way.