DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Source
- Raw Markdown: paper_droid-2024.md
- PDF: paper_droid-2024.pdf
- Preprint: arXiv:2403.12945
- Project page: droid-dataset.github.io
Core Claim
DROID provides a large-scale, in-the-wild robot manipulation dataset (about 76K trajectories and 350 hours of interaction) spanning many scenes, tasks, and buildings, with synchronized camera observations and natural-language annotations for policy learning.
Sensor-Time-Series Notes
- The dataset is embodied trajectory data: each episode is an ordered sequence rather than an independent image or static table row.
- Each episode includes synchronized RGB camera streams, camera calibration, depth information, and natural-language instructions.
- DROID is useful for studying how generalist policies adapt to new observation streams, scene distributions, and task language.
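The episode structure described in the notes above can be sketched as a small container type. This is a minimal illustration and not the dataset's actual schema: the class names, field names, byte-encoded frames, and the seconds-based timestamp are all assumptions made for this note.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Step:
    """One time step of an episode (hypothetical layout)."""
    timestamp: float          # assumed: seconds since episode start
    rgb: Dict[str, bytes]     # camera name -> encoded RGB frame
    depth: Dict[str, bytes]   # camera name -> encoded depth frame


@dataclass
class Episode:
    """An ordered trajectory plus its static context (hypothetical layout)."""
    instruction: str                          # natural-language instruction
    calibration: Dict[str, Sequence[float]]   # camera name -> calibration values
    steps: List[Step] = field(default_factory=list)

    def is_time_ordered(self) -> bool:
        # An episode is a sequence: timestamps must strictly increase.
        return all(a.timestamp < b.timestamp
                   for a, b in zip(self.steps, self.steps[1:]))

    def cameras_consistent(self) -> bool:
        # Synchronized streams: every step carries a frame for every
        # calibrated camera, and no frames from unknown cameras.
        names = set(self.calibration)
        return all(set(s.rgb) == names for s in self.steps)
```

Keeping the instruction and calibration on the episode, and the frames on each step, mirrors the split between static context and time-varying observations that the open questions at the end of this note ask about.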
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Context interface | adjacent | The readable raw Markdown describes 76K robot trajectories, 350 hours, synchronized camera streams, depth, calibration, and natural-language instructions. | The local raw Markdown is an incomplete LaTeX conversion and does not expose a general time-series context schema. |
| Data diversity and long tail | adjacent | DROID spans many scenes, tasks, buildings, and months, making it useful for testing distribution breadth in embodied trajectories. | Demonstration data alone does not provide counterfactual outcomes for untried actions. |
| Causal and control modeling | insufficient evidence | The source is a trajectory dataset for policy learning, not a benchmark of candidate actions and future outcomes. | Needs explicit action-conditioned rollout targets and intervention evaluation. |
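
The "action-conditioned rollout targets" named in the last row can be illustrated with a short windowing sketch. It assumes a trajectory stored as aligned observation and action sequences where `actions[t]` moves the system from `obs[t]` to `obs[t + 1]`; that alignment, the function name, and the parameters `history` and `horizon` are hypothetical choices for this note and are not taken from the source.

```python
from typing import List, Sequence, Tuple, TypeVar

O = TypeVar("O")  # observation type
A = TypeVar("A")  # action type


def rollout_examples(
    obs: Sequence[O],
    actions: Sequence[A],
    history: int,
    horizon: int,
) -> List[Tuple[List[O], List[A], List[O]]]:
    """Slice one trajectory into (past obs, executed actions, future obs) triples.

    Assumed alignment: actions[t] takes obs[t] to obs[t + 1], so a
    trajectory has exactly one more observation than actions.
    """
    if history < 1 or horizon < 1:
        raise ValueError("history and horizon must be at least 1")
    if len(obs) != len(actions) + 1:
        raise ValueError("expected len(obs) == len(actions) + 1")

    examples = []
    # t indexes the most recent observation in the history window.
    for t in range(history - 1, len(actions) - horizon + 1):
        past = list(obs[t - history + 1 : t + 1])
        executed = list(actions[t : t + horizon])
        future = list(obs[t + 1 : t + horizon + 1])
        examples.append((past, executed, future))
    return examples
```

Every triple produced this way pairs a history with the actions the demonstrator actually executed. Windowing alone therefore yields factual rollout targets only, which is why counterfactual outcomes and intervention evaluation remain listed as missing pieces.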
Links Into The Wiki
- Robotics Time-Series Modeling
- Action-Conditioned Time-Series Datasets
- Foundation Time-Series Model Research Agenda
- World Models
Open Questions
- How much policy transfer comes from broader scene coverage versus better temporal coverage of manipulation trajectories?
- Which parts of DROID should be modeled as observation history, static context, action history, or exogenous variation?