Action-Conditioned Time-Series Datasets
Scope
Terminology on this page follows the Terminology page.
This page compares non-vision-heavy datasets that can support world models with actions or interventions. Here, “time series” is broad: it includes regular sensor streams, irregular medical/event logs, control trajectories, recommender decision logs, tutoring interaction sequences, graph telemetry, and any ordered sequence where a model can condition on an action or intervention at time t to predict later observations.
The strongest candidates expose a transition-like channel: observation_t, action_t, optional reward_t, and observation_{t+1}. Weaker candidates expose logged decisions or treatments but have thin next-state observations or strong observational confounding.
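The transition-like channel described above can be sketched as a small record type plus a helper that slices one ordered episode into steps. This is a minimal illustration only: the names `Transition` and `to_transitions` are hypothetical and do not come from any dataset's loader, and it assumes an episode with T actions and T+1 observations.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Transition:
    """One step: observation_t, action_t, optional reward_t, observation_{t+1}."""
    observation: Sequence[float]
    action: Sequence[float]
    next_observation: Sequence[float]
    reward: Optional[float] = None  # None when the dataset has no reward channel


def to_transitions(
    observations: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    rewards: Optional[Sequence[float]] = None,
) -> List[Transition]:
    """Slice an ordered episode (T+1 observations, T actions) into T transitions."""
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")
    if rewards is not None and len(rewards) != len(actions):
        raise ValueError("expected one reward per action")
    return [
        Transition(
            observation=observations[t],
            action=actions[t],
            next_observation=observations[t + 1],
            reward=None if rewards is None else rewards[t],
        )
        for t in range(len(actions))
    ]


# Toy episode with made-up numbers: three observations, two actions.
episode = to_transitions(
    observations=[[0.0], [0.5], [0.9]],
    actions=[[1.0], [0.8]],
    rewards=[0.1, 0.2],
)
```

A weaker candidate in the sense used here would fill `action` and `reward` but leave `next_observation` thin or uninformative.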
For datasets used to train policies in imagined rollouts, the (s, a, r, s') contract should also record whether transition data and reward annotations come from different streams, each with its own cost, noise, and bias risks. The local source for this data-economics split is On Training in Imagination.
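One way to record that split is to attach a provenance record to each stream. The sketch below is hypothetical: `StreamProvenance`, `DatasetContract`, and every field and value in the example are illustrative assumptions, not a schema defined by this page or by any dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreamProvenance:
    """Where one stream comes from and what is known about its quality."""
    source: str                           # free-text description of the stream
    cost_per_item: Optional[float] = None  # unknown cost stays None
    noise_note: str = ""
    bias_note: str = ""


@dataclass(frozen=True)
class DatasetContract:
    """Pairs a dataset name with provenance for transitions and for rewards."""
    name: str
    transitions: StreamProvenance
    rewards: Optional[StreamProvenance] = None  # None if there is no reward stream

    def reward_stream_is_separate(self) -> bool:
        """True when rewards exist and come from a different source than transitions."""
        return self.rewards is not None and self.rewards.source != self.transitions.source


# Hypothetical example; the dataset name and all values are made up.
contract = DatasetContract(
    name="example-control-logs",
    transitions=StreamProvenance(source="environment replay buffer"),
    rewards=StreamProvenance(
        source="post-hoc human labels",
        cost_per_item=0.05,
        noise_note="label disagreement not yet measured",
        bias_note="labels cover only a subset of episodes",
    ),
)
```

Keeping the two records apart makes it possible to audit reward labels without touching the transition data, which is the point of the split.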
Latent Action Models are complementary to this dataset view. They infer action-like codes from observation transitions when the action channel is missing, but they do not turn a dataset into a typed action-conditioned dataset until those codes are aligned with real actions, control inputs, interventions, and outcomes.
This page intentionally excludes vision-heavy trajectory datasets. Those datasets also contain time series, and they can be very important for action-conditioned world models, but they require image/video encoders and belong in a separate embodied/visual world-model comparison. Excluded examples include V-D4RL, MineRL, Atari DQN Replay, Open X-Embodiment, DROID, BridgeData V2, RoboNet, CALVIN, RoboTurk, and RoAM. For that embodied/visual branch, use Robotics Time-Series Modeling and World Model for Robot Learning Survey rather than expanding this non-vision dataset table.
Most remaining datasets are still not pure univariate time series. The Modalities Needed column lists the non-temporal modalities or structured data types that a training pipeline must understand in addition to temporal order.
Selection Tiers
- Tier 1: direct world-model datasets provide explicit sequential observations and actions and are immediately usable for action-conditioned dynamics learning.
- Tier 2: longitudinal intervention datasets provide real interventions/treatments over time but require careful causal handling because actions are often confounded by state.
- Tier 3: logged action-response datasets provide actions and rewards/outcomes, but temporal state dynamics are weaker than in trajectory datasets.
- Near-miss: passive time-series datasets are useful for passive world-model pretraining or forecasting, but do not expose controllable actions.
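The tier definitions above can be read as a coarse decision rule. The sketch below is an assumption-laden first pass, not a reproduction of every verdict in the tables that follow: the three boolean traits and the function `assign_tier` are hypothetical, and real datasets often land between tiers (the tables use mixed labels such as "Tier 1/3").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetTraits:
    """Hypothetical yes/no summary of a candidate dataset."""
    has_action_channel: bool           # controllable actions or interventions are logged
    has_next_observation: bool         # rich state is observed after each action
    actions_confounded_by_state: bool  # actions were chosen in response to state


def assign_tier(traits: DatasetTraits) -> str:
    """Map traits to a tier label, checking the most disqualifying trait first."""
    if not traits.has_action_channel:
        return "Near-miss"  # passive series: no controllable action exposed
    if not traits.has_next_observation:
        return "Tier 3"     # action-response logs with weak temporal state
    if traits.actions_confounded_by_state:
        return "Tier 2"     # longitudinal interventions needing causal care
    return "Tier 1"         # explicit sequential observations and actions


# Illustrative inputs only; they are not claims about any specific dataset.
clean_trajectories = DatasetTraits(True, True, False)
passive_telemetry = DatasetTraits(False, True, False)
```

The ordering of the checks is a design choice: a missing action channel rules a dataset out of the tiers entirely, so it is tested before anything else.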
Offline RL And Numeric Control Trajectories
| Dataset | Time-Series Structure | Modalities Needed | Action Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| Minari D4RL | Episodic offline RL transitions across MuJoCo, AntMaze, Adroit, Kitchen, and related tasks | Numeric state vectors; rewards; terminals; task IDs for mixed datasets; sometimes goal/state annotations | Environment control action at each step | Tier 1; clean s,a,r,s' benchmark for latent/state dynamics | Some tasks are benchmark-specific, and the page excludes visual variants |
| RL Unplugged | Replayed transitions from multiple RL domains | Numeric states for control tasks; rewards/discounts; action labels; domain metadata | Discrete or continuous environment actions | Tier 1 for non-visual subsets; diverse offline RL source for action-conditioned dynamics | Some RL Unplugged domains are visual and should be filtered out for this non-vision page |
Healthcare And Physiology
| Dataset | Time-Series Structure | Modalities Needed | Action / Intervention Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| MIMIC-IV | Irregular hospital/ICU EHR time series | Numeric vitals/labs; categorical codes; medication/procedure tables; demographics; clinical notes if used | Medications, fluids, procedures, ventilation-related events, orders | Tier 2; strong for treatment-conditioned patient dynamics | Observational, confounded, credentialed access |
| eICU-CRD | Multi-center ICU longitudinal records | Numeric vitals/labs; categorical diagnoses/treatments; medication/infusion records; care-plan tables | Medications, infusion drugs, treatments, procedures | Tier 2; strong multi-hospital treatment-response source | Heterogeneous schema and confounding |
| HiRID | High-resolution ICU records | High-frequency numeric physiology; labs; medication/event tables; patient metadata | ICU treatments, medications, interventions, clinical events | Tier 2; good for high-frequency physiology dynamics | Access and preprocessing complexity |
| AmsterdamUMCdb | European ICU observation/event series | Numeric vitals/labs; medication/infusion tables; device/ventilation records; demographics | Medications, fluids, feeding, transfusions, procedures | Tier 2; strong ICU dynamics dataset | Observational and access-controlled |
| OhioT1DM | Continuous glucose and patient event streams | Continuous glucose monitor values; insulin logs; meal/carbohydrate records; exercise/sleep/stress event features | Insulin, meals/carbs, exercise, sleep, stress | Tier 1/2; small but clean physiology-control source | Small participant count and per-person variability |
| HeartSteps | Participant decision points and activity outcomes over weeks | Mobile-sensing/context features; step-count/activity outcomes; survey/context variables; intervention messages | Micro-randomized activity suggestions | Tier 2; cleaner causal interventions than routine care logs | Small behavioral domain |
Recommender, Bandit, And Marketing Logs
| Dataset | Time-Series Structure | Modalities Needed | Action / Intervention Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| KuaiRand | Sequential user-video interactions with random exposure | User IDs/features; item/video IDs and metadata; categorical feedback events; watch/click/like signals; timestamps | Video/item exposure and feedback signals | Tier 1/3; strong for user-response dynamics without requiring video pixels | State is user-behavioral, not physical-world dynamics |
| Open Bandit Dataset | Logged fashion recommendation decisions | User/context features; item/category IDs; logged propensities; clicks/conversions/rewards | Recommended item/action, reward, propensity | Tier 3; strong for off-policy action-response modeling | Thin next-state dynamics |
| Yahoo! Webscope R6 line | News recommendation decision logs | User/context features; article IDs/features; click rewards; randomized serving logs | Article action and click reward under randomized traffic | Tier 3; classic contextual bandit benchmark | Weak sequential state compared with world-model trajectories |
| Criteo Uplift | Marketing treatment records | User/ad context features; treatment flag; visit/conversion outcomes | Binary treatment/control with visit/conversion outcomes | Tier 3; useful for treatment-effect modeling | Mostly one-step, not rich temporal dynamics |
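
For the logs in this table that record propensities, a standard way to use the action channel is inverse propensity scoring (IPS). The sketch below is a generic textbook estimator under stated assumptions, not the API of any dataset's accompanying library: `ips_value` and its argument names are hypothetical, and the numbers are made up.

```python
from typing import Optional, Sequence


def ips_value(
    rewards: Sequence[float],
    logged_propensities: Sequence[float],
    target_probabilities: Sequence[float],
    clip: Optional[float] = None,
) -> float:
    """Inverse propensity scoring estimate of a target policy's mean reward.

    For each logged round, target_probabilities holds the probability that the
    target policy would pick the logged action, and logged_propensities holds
    the probability with which the logging policy actually picked it.
    """
    n = len(rewards)
    if n == 0 or len(logged_propensities) != n or len(target_probabilities) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for r, p_log, p_tgt in zip(rewards, logged_propensities, target_probabilities):
        if p_log <= 0.0:
            raise ValueError("logged propensities must be positive")
        weight = p_tgt / p_log
        if clip is not None:
            weight = min(weight, clip)  # optional clipping trades variance for bias
        total += weight * r
    return total / n


# Four made-up logged rounds.
value = ips_value(
    rewards=[1.0, 0.0, 1.0, 0.0],
    logged_propensities=[0.5, 0.5, 0.25, 0.25],
    target_probabilities=[1.0, 0.0, 0.5, 0.0],
)
```

This estimates a one-step reward only, which matches the "thin next-state dynamics" caveat: it says nothing about how the user state evolves afterwards.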
Education And Tutoring Logs
| Dataset | Time-Series Structure | Modalities Needed | Action / Intervention Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| EdNet | Large-scale student activity sequences | Student IDs; question/skill IDs; correctness; timestamps; lecture/purchase/platform event categories | Question solving, lecture consumption, purchases, platform events | Tier 2/3; useful for student-state dynamics | Actions mix student behavior and platform interventions |
| ASSISTments 2009-2010 | Student problem-solving sequences | Student/problem/skill IDs; correctness; hint counts; attempt metadata; timestamps | Attempts, hints, first-action type, problem assignments | Tier 2/3; useful for knowledge tracing and pedagogical dynamics | Action granularity varies by release |
| KDD Cup 2010 | Cognitive Tutor student-step logs | Student/problem/step/knowledge-component IDs; correctness; opportunity counts; hint/attempt features | Responses, opportunities, problem steps, hint/attempt-related fields | Tier 2/3; useful for educational sequence modeling | Not a clean controllable intervention benchmark |
| PSLC DataShop | Repository of many learning-science event logs | Dataset-specific student/tutor event tables; skill/problem IDs; correctness; hints; timestamps | Student actions, tutor responses, hints, instructional events | Tier 2/3; broad source for education action-time-series | Requires dataset-by-dataset curation |
Causal And Interventional Validation
| Dataset | Time-Series Structure | Modalities Needed | Action / Intervention Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| CausalWorld | Simulated robot manipulation episodes | Numeric simulator state; robot/object poses; task/intervention metadata; optional visual observations (ignored for this page) | Robot actions plus causal/environment interventions | Tier 1/validation; good for causal generalization under interventions | Benchmark/environment more than fixed real-world dataset |
| Causal Chambers | Real physical-system measurements and interventional data | Numeric sensor streams; actuator/control settings; known causal graphs; experiment metadata | Controlled interventions over physical variables | Tier 2/validation; useful for intervention fidelity tests | Not always a sequential control dataset in RL format |
Passive Time-Series Near-Miss
| Dataset | Time-Series Structure | Modalities Needed | Action Channel | World-Model Fit | Caveat |
|---|---|---|---|---|---|
| ChronoGraph | Graph-structured multivariate microservice telemetry over time | Graph topology; node metrics; edge metrics; incident/anomaly labels; service/dependency metadata | No explicit controllable action channel in the paper | Useful for passive graph/time-series world-model pretraining | Incident windows are labels/exogenous shocks, not operator interventions |
| TelecomTS | 5G observability KPI windows with anomaly labels, root-cause labels, natural-language descriptions, troubleshooting tickets, and Q&A | Numeric/categorical KPIs; telecom labels; text descriptions/tickets; Q&A | No operator-action channel; controlled jamming and synthetic anomaly injections are events/benchmark conditions | Useful for passive/multimodal observability pretraining and diagnosis evaluation | Lab/testbed data; synthetic anomalies and generated tickets need artifact checks |
Modality Takeaways
- Mostly numeric temporal control: D4RL, OhioT1DM, CausalWorld, Causal Chambers, and non-visual parts of RL Unplugged can be approached with multivariate time-series models.
- Irregular event/EHR data: MIMIC-IV, eICU-CRD, HiRID, and AmsterdamUMCdb require event-table modeling, coding systems, missingness handling, and often irregular-time encodings.
- Structured relational data: KuaiRand, Open Bandit Dataset, the Yahoo! Webscope R6 line, and Criteo Uplift require user-item/action-response structure rather than image/video understanding.
- Education event logs: EdNet, ASSISTments, KDD Cup 2010, and PSLC DataShop require student/problem/skill identifiers, correctness, hints, and timestamps.
- Graph and telecom observability: ChronoGraph requires graph topology plus temporal node/edge metrics, while TelecomTS requires scale-preserving KPI streams plus operational text. Both remain passive unless action logs are joined.
- Action-free trajectories with hidden controls: Genie shows that a latent action model can recover action-like codes from image/video transitions, but those codes need alignment before they count as typed actions or control inputs.
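For the irregular event/EHR item in the list above, one common encoding pairs each observation row with a missingness mask and the time elapsed since the previous row. The sketch below assumes a flat event log of `(time, variable, value)` tuples; the function `encode_irregular`, the variable names, and the values are hypothetical and do not reflect any specific dataset's schema.

```python
from typing import Dict, List, Sequence, Tuple

Event = Tuple[float, str, float]  # (time in hours, variable name, value)


def encode_irregular(events: Sequence[Event], variables: Sequence[str]) -> List[dict]:
    """Group events by timestamp into rows with values, a mask, and a time delta."""
    by_time: Dict[float, Dict[str, float]] = {}
    for time, variable, value in events:
        by_time.setdefault(time, {})[variable] = value

    rows: List[dict] = []
    previous_time = None
    for time in sorted(by_time):
        observed = by_time[time]
        rows.append({
            "time": time,
            # Elapsed time since the previous row; 0.0 for the first row.
            "delta": 0.0 if previous_time is None else time - previous_time,
            # None marks a variable that was not measured at this timestamp.
            "values": [observed.get(v) for v in variables],
            # 1 where a measurement exists, 0 where it is missing.
            "mask": [1 if v in observed else 0 for v in variables],
        })
        previous_time = time
    return rows


# Made-up measurements at uneven intervals.
rows = encode_irregular(
    events=[
        (0.0, "heart_rate", 80.0),
        (0.0, "lactate", 1.2),
        (1.5, "heart_rate", 95.0),
        (4.0, "lactate", 2.1),
    ],
    variables=["heart_rate", "lactate"],
)
```

Treatment events could be encoded the same way in a parallel action vector, so that a model sees which interventions preceded each observation row.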
Practical Recommendations
- For a first non-vision action-conditioned world-model baseline, start with D4RL, non-visual RL Unplugged tasks, or OhioT1DM, depending on whether the desired domain is control, benchmark RL, or physiology.
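- For real treatment/intervention modeling, use MIMIC-IV, eICU-CRD, HiRID, AmsterdamUMCdb, OhioT1DM, and HeartSteps, but model confounding explicitly wherever interventions were not randomized; HeartSteps is micro-randomized, while the ICU sources are observational.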
- For user-response and logged decision modeling, KuaiRand is the strongest sequential candidate; Open Bandit Dataset, the Yahoo! Webscope R6 line, and Criteo Uplift are better treated as contextual action-response datasets.
- ChronoGraph and TelecomTS should stay in the passive/near-miss bucket unless external deployment, remediation, autoscaling, rollback, or operator-action logs are joined to them.
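Joining an external action log to passive telemetry, as the last recommendation describes, can be sketched as an as-of join: each telemetry timestamp is paired with the most recent operator action that is not too old. Everything here is an assumption for illustration, including the function `join_latest_action`, the `max_age` window, and the action names; neither ChronoGraph nor TelecomTS ships such a log.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def join_latest_action(
    telemetry_times: Sequence[float],
    action_log: Sequence[Tuple[float, str]],
    max_age: float,
) -> List[Tuple[float, Optional[str]]]:
    """Attach to each telemetry time the latest action at or before it.

    An action older than max_age is treated as no longer relevant, so the
    telemetry point gets None instead.
    """
    ordered = sorted(action_log)               # sort by time, then by name
    action_times = [t for t, _ in ordered]
    joined: List[Tuple[float, Optional[str]]] = []
    for t in telemetry_times:
        i = bisect_right(action_times, t) - 1  # index of last action with time <= t
        if i >= 0 and t - action_times[i] <= max_age:
            joined.append((t, ordered[i][1]))
        else:
            joined.append((t, None))
    return joined


# Made-up telemetry timestamps and operator actions, in arbitrary time units.
joined = join_latest_action(
    telemetry_times=[10, 20, 30, 40],
    action_log=[(35, "scale_up"), (18, "rollback")],
    max_age=5,
)
```

Only after a join like this does a passive series expose something resembling an action channel, and the choice of `max_age` is itself a modeling assumption about how long an action's effect lasts.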
Relation To Foundation TSFM Agenda
This page supports the Foundation Time-Series Model Research Agenda at the dataset-interface layer: it identifies which corpora can test action-conditioned rollout and which are only passive pretraining or diagnosis sources.
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and counterfactuals | partially closes | Separates clean state-action trajectories, longitudinal interventions, logged action-response data, and passive near-misses. | Many real-world sources are confounded, one-step, or missing next-state dynamics. |
| Time representation and event streams | adjacent | Healthcare, education, recommender, and observability entries expose irregular events, treatments, hints, exposures, and telemetry. | Needs a shared event/action schema and benchmark protocol across domains. |
Observability and recommender logs are especially relevant to the digital-world robot north star because they show non-robotic systems with observations and possible intervention surfaces. Public datasets still rarely join telemetry with typed operator actions and outcomes.
Open Questions
- Which non-vision dataset family should anchor Alex’s first action-conditioned world-model experiment: clean RL transitions, irregular healthcare interventions, recommender logs, education logs, or graph telemetry?
- Should passive datasets like ChronoGraph be included in a separate pretraining pool for representation learning before action-conditioned finetuning?
- How should the wiki distinguish controllable actions from exogenous events, treatments, platform decisions, and observed human behavior?
- How should action-conditioned datasets record reward-source provenance, label cost, fidelity/noise assumptions, and bias audits alongside transition tuples?
- Which non-vision modality stack should be prioritized first: multivariate time series, irregular EHR event streams, recommender user-item events, education event logs, or graph-temporal observability data?