DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Source

Core Claim

DROID provides a large in-the-wild robot manipulation dataset spanning many scenes, tasks, and buildings, with synchronized visual observations and language annotations for policy learning.

Sensor-Time-Series Notes

  • The dataset is embodied trajectory data: each episode is an ordered sequence rather than an independent image or static table row.
  • Each episode includes synchronized RGB camera streams, camera calibration, depth information, and natural-language instructions.
  • DROID is useful for studying how generalist policies adapt to new observation streams, scene distributions, and task language.
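
To make the "ordered sequence" framing concrete, here is a minimal sketch of how one episode could be represented. Every class and field name below (`Step`, `Episode`, `timestamp_s`, `rgb_frame_ids`, the camera names, the sample instruction) is a hypothetical illustration, not DROID's actual storage schema, which these notes do not expose.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One timestep of an episode (all field names are hypothetical)."""
    timestamp_s: float               # time since episode start, in seconds
    rgb_frame_ids: Dict[str, int]    # camera name -> RGB frame index
    depth_frame_ids: Dict[str, int]  # camera name -> depth frame index


@dataclass
class Episode:
    """An ordered trajectory plus per-episode static context."""
    instruction: str                     # natural-language instruction
    calibration: Dict[str, List[float]]  # camera name -> calibration parameters
    steps: List[Step]

    def is_well_formed(self) -> bool:
        """True if timestamps strictly increase and every step has an RGB
        and a depth frame for exactly the calibrated cameras."""
        cameras = set(self.calibration)
        ordered = all(
            a.timestamp_s < b.timestamp_s
            for a, b in zip(self.steps, self.steps[1:])
        )
        synced = all(
            set(s.rgb_frame_ids) == cameras and set(s.depth_frame_ids) == cameras
            for s in self.steps
        )
        return ordered and synced


# Illustrative data only; the instruction and camera names are made up.
frames = {"cam_a": 0, "cam_b": 0}
episode = Episode(
    instruction="put the cup in the sink",
    calibration={"cam_a": [0.0] * 6, "cam_b": [0.0] * 6},
    steps=[
        Step(0.0, dict(frames), dict(frames)),
        Step(0.1, {"cam_a": 1, "cam_b": 1}, {"cam_a": 1, "cam_b": 1}),
    ],
)
```

The order check is what separates this from a table of independent rows: shuffling `steps` leaves every record intact but makes the episode invalid.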

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Context interface | adjacent | The readable raw Markdown describes 76K robot trajectories, 350 hours, synchronized camera streams, depth, calibration, and natural-language instructions. | The local raw Markdown is an incomplete LaTeX conversion and does not expose a general time-series context schema. |
| Data diversity and long tail | adjacent | DROID spans many scenes, tasks, buildings, and months, making it useful for testing distribution breadth in embodied trajectories. | Demonstration data alone does not provide counterfactual outcomes for untried actions. |
| Causal and control modeling | insufficient evidence | The source is a trajectory dataset for policy learning, not a benchmark of candidate actions and future outcomes. | Needs explicit action-conditioned rollout targets and intervention evaluation. |
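
The two headline figures in the Evidence column (76K trajectories, 350 hours) imply a rough mean trajectory length, which matters when sizing a time-series context window. The arithmetic below treats "76K" as exactly 76,000, which is an assumption because the source figure is rounded. It yields only an average and says nothing about the spread of episode lengths.

```python
NUM_TRAJECTORIES = 76_000  # "76K" in the notes; rounded, so treat as approximate
TOTAL_HOURS = 350          # total recorded duration from the notes

# Convert hours to seconds, then divide by the number of trajectories.
mean_episode_s = TOTAL_HOURS * 3600 / NUM_TRAJECTORIES
print(f"{mean_episode_s:.1f} s per trajectory on average")  # prints 16.6 s ...
```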

Open Questions

  • How much policy transfer comes from broader scene coverage versus better temporal coverage of manipulation trajectories?
  • Which parts of DROID should be modeled as observation history, static context, action history, or exogenous variation?
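
The second question can be made concrete by writing one candidate assignment down as data. This is a sketch for discussion, not an answer. The field names (`rgb_streams`, `actions`, `scene_id`, and so on) and the role each one gets are assumptions, not something the source specifies.

```python
from enum import Enum
from typing import Dict, List


class Role(Enum):
    OBSERVATION_HISTORY = "observation_history"
    STATIC_CONTEXT = "static_context"
    ACTION_HISTORY = "action_history"
    EXOGENOUS = "exogenous_variation"


# Candidate assignment only; both the field names and the roles are assumptions.
CANDIDATE_ROLES: Dict[str, Role] = {
    "rgb_streams": Role.OBSERVATION_HISTORY,
    "depth": Role.OBSERVATION_HISTORY,
    "camera_calibration": Role.STATIC_CONTEXT,
    "language_instruction": Role.STATIC_CONTEXT,
    "actions": Role.ACTION_HISTORY,
    "scene_id": Role.EXOGENOUS,
}


def group_by_role(roles: Dict[str, Role]) -> Dict[Role, List[str]]:
    """Invert the field -> role map so each role lists its fields."""
    grouped: Dict[Role, List[str]] = {role: [] for role in Role}
    for field_name, role in roles.items():
        grouped[role].append(field_name)
    return grouped


grouped = group_by_role(CANDIDATE_ROLES)
```

Keeping the assignment in a plain mapping makes alternative answers cheap to compare. For example, moving `language_instruction` from static context to observation history is a one-line change.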