NeoRL-2
Source
- Dataset metadata snapshot: neorl2-2025
- Metadata JSON: metadata.json
- Official GitHub: https://github.com/polixir/NeoRL2
- Official Hugging Face dataset: https://huggingface.co/datasets/polixirai/NeoRL2
- arXiv preprint: https://arxiv.org/abs/2503.19267
Core Claim
NeoRL-2 is a near-real-world offline reinforcement-learning benchmark with explicit transition tuples and evaluation simulators. It targets practical offline RL difficulties that are underrepresented in simpler benchmarks: delays, exogenous factors, safety constraints, rule-based behavior policies, conservative data, and limited data.
Dataset Notes
- The paper and GitHub README describe seven tasks: Pipeline, Simglucose, RocketRecovery, RandomFrictionHopper, DMSD, Fusion, and SafetyHalfCheetah.
- The GitHub interface returns `obs`, `next_obs`, `action`, `reward`, `done`, and `index`.
- The Hugging Face parquet artifact uses `observations`, `actions`, `rewards`, `next_observations`, and `terminals`.
- Datasets are generated by online RL algorithms or PID policies, then suboptimal policies with returns from 50% to 80% of expert return are selected.
- Hugging Face reports 980,848 rows and about 130 MB total file size.
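
Because the two distributions name the same fields differently, code written against one will not read the other without a rename. The sketch below shows one way to bridge them. The field names come from the notes above, but the one-to-one pairing (for example `done` to `terminals`) is an assumption, and `to_hf_schema` and the toy batch are hypothetical, not part of either interface.

```python
# Assumed pairing between GitHub interface keys and Hugging Face parquet columns.
# The names are taken from the notes above; the pairing itself is not confirmed.
GITHUB_TO_HF = {
    "obs": "observations",
    "action": "actions",
    "reward": "rewards",
    "next_obs": "next_observations",
    "done": "terminals",
}


def to_hf_schema(batch):
    """Rename GitHub-style keys to Hugging Face-style column names.

    Keys with no listed counterpart (such as "index") are dropped.
    """
    return {GITHUB_TO_HF[k]: v for k, v in batch.items() if k in GITHUB_TO_HF}


# Toy two-step batch with made-up values, for illustration only.
github_batch = {
    "obs": [[0.0, 1.0], [1.0, 2.0]],
    "next_obs": [[1.0, 2.0], [2.0, 3.0]],
    "action": [[0.5], [-0.5]],
    "reward": [1.0, 0.0],
    "done": [False, True],
    "index": [0, 1],
}

hf_batch = to_hf_schema(github_batch)
print(sorted(hf_batch))
```

Dropping `index` keeps the result aligned with the five parquet columns; a pipeline that needs it could carry it alongside under its original name.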
Task Shapes
| Task | Observation shape | Action shape | Done flag | Max timesteps |
|---|---|---|---|---|
| Pipeline | 52 | 1 | false | 1000 |
| Simglucose | 31 | 1 | true | 480 |
| RocketRecovery | 7 | 2 | true | 500 |
| RandomFrictionHopper | 13 | 3 | true | 1000 |
| DMSD | 6 | 2 | false | 100 |
| Fusion | 15 | 6 | false | 100 |
| SafetyHalfCheetah | 18 | 6 | false | 1000 |
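
The table can be kept as a small lookup so that loaded transitions are checked against the expected widths before training. This is a minimal sketch: the numbers are copied from the table, while `TaskSpec`, `TASK_SPECS`, and `shape_errors` are hypothetical helper names, and the table's task names are assumed to be usable as keys.

```python
from typing import List, NamedTuple, Sequence


class TaskSpec(NamedTuple):
    obs_dim: int      # "Observation shape" column
    act_dim: int      # "Action shape" column
    done_flag: bool   # "Done flag" column, copied without interpretation
    max_steps: int    # "Max timesteps" column


# Values copied from the Task Shapes table.
TASK_SPECS = {
    "Pipeline": TaskSpec(52, 1, False, 1000),
    "Simglucose": TaskSpec(31, 1, True, 480),
    "RocketRecovery": TaskSpec(7, 2, True, 500),
    "RandomFrictionHopper": TaskSpec(13, 3, True, 1000),
    "DMSD": TaskSpec(6, 2, False, 100),
    "Fusion": TaskSpec(15, 6, False, 100),
    "SafetyHalfCheetah": TaskSpec(18, 6, False, 1000),
}


def shape_errors(task: str, obs: Sequence[float], action: Sequence[float]) -> List[str]:
    """Return human-readable shape mismatches for one transition (empty if none)."""
    spec = TASK_SPECS[task]
    errors = []
    if len(obs) != spec.obs_dim:
        errors.append(f"{task}: expected {spec.obs_dim} observation values, got {len(obs)}")
    if len(action) != spec.act_dim:
        errors.append(f"{task}: expected {spec.act_dim} action values, got {len(action)}")
    return errors


print(shape_errors("DMSD", [0.0] * 6, [0.0] * 2))     # well-formed transition
print(shape_errors("Fusion", [0.0] * 15, [0.0] * 2))  # action vector too short
```

Returning a list rather than raising lets a loader collect every mismatch in a file before deciding whether to stop.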
Action-Time-Series Notes
NeoRL-2 is a clean action-conditioned trajectory source:
`observation_t + action_t -> reward_t + observation_{t+1} + terminal_t`

This makes it better aligned with action-conditioned world-model training than logged decision datasets that only expose one-step outcomes. The harder part is that the benchmark intentionally includes delayed effects, external factors, conservative behavior policies, and safety constraints.
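
The transition relation above maps directly onto supervised pairs for a one-step world model. The sketch below, using the Hugging Face column names, builds those pairs and recovers episode boundaries from the terminal flags. Several things here are assumptions: that rows are stored contiguously in trajectory order, that an optional `max_steps` cut is a reasonable fallback for tasks whose done flag is listed as false, and the helper names `split_episodes` and `world_model_pairs`, which are hypothetical. The toy values are made up.

```python
def split_episodes(terminals, max_steps=None):
    """Return (start, end) index pairs, end exclusive, one per episode.

    An episode closes when its terminal flag is set or, if max_steps is
    given, once it reaches that many transitions.
    """
    episodes = []
    start = 0
    for i, done in enumerate(terminals):
        length = i - start + 1
        if done or (max_steps is not None and length >= max_steps):
            episodes.append((start, i + 1))
            start = i + 1
    if start < len(terminals):
        episodes.append((start, len(terminals)))  # trailing partial episode
    return episodes


def world_model_pairs(data):
    """Build pairs (obs_t, action_t) -> (reward_t, obs_{t+1}, terminal_t)."""
    inputs = [list(o) + list(a) for o, a in zip(data["observations"], data["actions"])]
    targets = [
        [r] + list(n) + [float(d)]
        for r, n, d in zip(data["rewards"], data["next_observations"], data["terminals"])
    ]
    return inputs, targets


# Toy data: five transitions, 2-dim observations, 1-dim actions.
toy = {
    "observations": [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3], [0.0, 0.0], [0.0, 0.1]],
    "actions": [[1.0], [1.0], [0.0], [1.0], [0.0]],
    "rewards": [0.0, 0.5, 1.0, 0.0, 0.2],
    "next_observations": [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.0, 0.1], [0.1, 0.1]],
    "terminals": [False, False, True, False, False],
}

print(split_episodes(toy["terminals"]))
inputs, targets = world_model_pairs(toy)
print(inputs[0], targets[0])
```

Keeping episode boundaries explicit matters for the delayed-effect tasks: a model that stacks several past steps as context should not stack across the point where one trajectory ends and the next begins.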
Gotchas
- The paper is an arXiv preprint from 2025; use it as a current benchmark artifact, not as peer-reviewed settled evidence.
- The tasks are simulated to reflect practical issues; they are not direct real-world business data.
- The paper reports that current baselines often fail to significantly improve over the behavior policy, and no reported baseline reaches the paper’s solved threshold.
- Hugging Face config metadata lists `Salespromotion` and `Simglucose-high` in addition to the seven paper/GitHub tasks.
- GitHub says datasets are CC BY 4.0 and code is Apache 2.0, while Hugging Face frontmatter marks the dataset repo as `apache-2.0`.
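
Because the Hugging Face configs are a superset of the paper's tasks, results meant to be comparable with the paper should be restricted to the seven documented tasks. A minimal sketch of that filter follows. It assumes the config strings match the task names used in these notes exactly, which has not been verified, and `partition_configs` is a hypothetical helper.

```python
# The seven tasks described in the paper and GitHub README.
PAPER_TASKS = {
    "Pipeline",
    "Simglucose",
    "RocketRecovery",
    "RandomFrictionHopper",
    "DMSD",
    "Fusion",
    "SafetyHalfCheetah",
}


def partition_configs(config_names):
    """Split config names into paper tasks and extras, preserving order."""
    paper = [name for name in config_names if name in PAPER_TASKS]
    extra = [name for name in config_names if name not in PAPER_TASKS]
    return paper, extra


# Illustrative list; the exact spelling of the real config names is assumed.
configs = sorted(PAPER_TASKS) + ["Salespromotion", "Simglucose-high"]
paper, extra = partition_configs(configs)
print(len(paper), extra)
```

In practice the input list could come from `datasets.get_dataset_config_names`, with the extras reported separately rather than silently mixed into paper-comparable results.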
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | partially closes | Provides explicit observation_t, action_t, reward, observation_{t+1}, and terminal signals across seven non-vision control tasks. | Simulators are benchmark approximations; no direct real-world deployment data. |
| Benchmarks: what level of modeling is tested? | partially closes | Stresses delayed effects, external factors, safety constraints, rule-based behavior policies, conservative data, and limited data. | Needs TSFM-native model comparisons and benchmark protocols beyond offline RL baselines. |
| Time representation and irregular event streams | adjacent | Pipeline and Simglucose explicitly test delay; trajectories have finite horizons and termination signals. | Mostly fixed simulator step interfaces rather than irregular event streams. |
| Context interface: channel context and general context | adjacent | Task identity and environment-specific properties define different control domains and constraints. | No unified typed context schema for cross-domain transfer. |