Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Source

Core Claim

Diffusion Policy models a distribution over future action trajectories by denoising action chunks conditioned on recent observations. It is one of the clearest sources for treating robot motor control as conditional generation over continuous control-input trajectories.
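One way to make "denoising action chunks conditioned on recent observations" concrete is to write out the sampling recursion in DDPM-style notation. The symbols below are assumptions introduced for illustration and are not defined in this note: O_t is the recent observation window, A_t^k is the action chunk at denoising step k, and ε_θ is a learned noise-prediction network.

```latex
% Assumed notation (illustrative, not taken from this note):
%   \mathbf{O}_t          recent observations at time t
%   \mathbf{A}_t^{k}      action chunk at denoising step k (k = K is pure noise, k = 0 is the output)
%   \varepsilon_\theta    learned noise-prediction network
%   \alpha, \gamma, \sigma  step-dependent noise-schedule coefficients
\[
  \mathbf{A}_t^{K} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad
  \mathbf{A}_t^{k-1} = \alpha \left( \mathbf{A}_t^{k}
      - \gamma \, \varepsilon_\theta(\mathbf{O}_t, \mathbf{A}_t^{k}, k)
      + \mathbf{z}^{k} \right), \quad
  \mathbf{z}^{k} \sim \mathcal{N}(\mathbf{0}, \sigma^{2} \mathbf{I}), \quad k = K, \dots, 1
\]
```

The observation O_t enters only as a conditioning input to ε_θ; the quantity being denoised is the action chunk, which is why the modeled distribution is over actions rather than over future observations.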

Method Notes

  • Starting from Gaussian noise, the model iteratively denoises a sequence of future actions into an executable action chunk.
  • Inference is used in a receding-horizon loop: generate a chunk, execute part of it, observe again, then regenerate.
  • The source compares CNN-based and Transformer-based diffusion policies; diffusion is the action distribution model, while attention is one possible denoising-network architecture.
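The sampling and receding-horizon steps listed above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a standard DDPM ancestral sampler with a linear beta schedule, and `toy_eps_model` is a hypothetical stand-in for a trained network (it is the exact noise predictor when every demonstrated action equals `-0.25 * obs`). The function names, the additive toy dynamics, and all numeric settings are assumptions. Because the toy target is a single point, the sketch shows the control loop but not multimodality.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(obs, noisy_actions, k, alpha_bars):
    """Hypothetical stand-in for a trained denoiser.

    Exact noise predictor when the data distribution is a point mass
    where every action in the chunk equals -0.25 * obs.
    """
    target = np.tile(-0.25 * obs, (noisy_actions.shape[0], 1))
    return (noisy_actions - np.sqrt(alpha_bars[k]) * target) / np.sqrt(1.0 - alpha_bars[k])


def sample_action_chunk(eps_model, obs, horizon, action_dim, num_steps, rng):
    """Denoise a (horizon, action_dim) chunk from pure Gaussian noise."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    actions = rng.standard_normal((horizon, action_dim))
    for k in reversed(range(num_steps)):
        eps_hat = eps_model(obs, actions, k, alpha_bars)
        mean = (actions - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        # No noise is added on the final denoising step.
        noise = rng.standard_normal(actions.shape) if k > 0 else 0.0
        actions = mean + np.sqrt(betas[k]) * noise
    return actions


def run_receding_horizon(state, num_replans, horizon=4, n_exec=2, num_steps=20, seed=0):
    """Observe, generate a chunk, execute the first n_exec actions, repeat."""
    rng = np.random.default_rng(seed)
    for _ in range(num_replans):
        obs = state.copy()  # toy assumption: the observation is the state itself
        chunk = sample_action_chunk(toy_eps_model, obs, horizon, state.shape[0], num_steps, rng)
        for action in chunk[:n_exec]:
            state = state + action  # toy additive dynamics
    return state


final_state = run_receding_horizon(np.array([1.0, -2.0]), num_replans=5)
```

Executing only part of each chunk (here 2 of 4 actions) is what lets the policy react to new observations while still committing to a temporally consistent short plan.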

Evidence And Limitations

The paper reports consistent improvement over prior behavior-cloning baselines across simulation and real-world tasks, including robustness to some visual and physical perturbations. It also notes the central tradeoff: iterative denoising improves multimodal continuous action modeling but raises inference latency relative to policies that predict actions in a single forward pass.
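The latency tradeoff can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, that generating the next chunk overlaps with executing the current one, so inference must finish within the time it takes to execute `n_exec` actions. The helper names and every number used are illustrative assumptions, not measurements from the paper.

```python
def chunk_latency_ms(num_denoise_steps, ms_per_step):
    """Total inference time for one chunk: one network call per denoising step."""
    return num_denoise_steps * ms_per_step


def replan_budget_ms(n_exec, control_hz):
    """Wall-clock time spent executing n_exec actions at the given control rate."""
    return 1000.0 * n_exec / control_hz


def fits_budget(num_denoise_steps, ms_per_step, n_exec, control_hz):
    """True if the next chunk can be generated before the current one runs out."""
    return chunk_latency_ms(num_denoise_steps, ms_per_step) <= replan_budget_ms(n_exec, control_hz)


# Illustrative numbers only: 2 ms per denoising step, 8 actions executed per chunk.
slow_loop_ok = fits_budget(100, 2.0, 8, 10.0)    # 200 ms needed vs 800 ms available
fast_loop_ok = fits_budget(100, 2.0, 8, 100.0)   # 200 ms needed vs 80 ms available
fewer_steps_ok = fits_budget(10, 2.0, 8, 100.0)  # 20 ms needed vs 80 ms available
```

Under these assumed numbers, the same sampler that is comfortable at a 10 Hz control rate misses the budget at 100 Hz unless the number of denoising steps is cut, which is the pressure behind faster samplers and hybrid heads.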

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Multi-modal future distributions | partially closes | Denoises future action chunks and explicitly targets multi-modal continuous action distributions. | The modeled distribution is over actions, not future observations or latent system states. |
| Control and counterfactuals | partially closes | Runs in a receding-horizon closed loop: observe, generate an action sequence, execute part of it, then replan. | It is imitation policy learning, not a learned world model that compares candidate intervention consequences. |
| Dynamic compute allocation | warning | Iterative denoising supports expressive action generation but adds latency relative to one-pass policies. | Needs acceleration or hybrid heads for high-rate control loops and digital operational systems. |

Open Questions

  • Which latency-reduction methods preserve closed-loop robustness for high-rate contact tasks?
  • Should time-series foundation models borrow diffusion over future observation blocks, future control chunks, or both?