NeoRL-2

Source

Core Claim

NeoRL-2 is a near-real-world offline reinforcement-learning benchmark with explicit transition tuples and evaluation simulators. It targets practical offline RL difficulties that are underrepresented in simpler benchmarks: delays, exogenous factors, safety constraints, rule-based behavior policies, conservative data, and limited data.

Dataset Notes

  • The paper and GitHub README describe seven tasks: Pipeline, Simglucose, RocketRecovery, RandomFrictionHopper, DMSD, Fusion, and SafetyHalfCheetah.
  • The GitHub interface returns obs, next_obs, action, reward, done, and index.
  • The Hugging Face parquet artifact uses observations, actions, rewards, next_observations, and terminals.
  • Behavior policies come from online RL algorithms or PID controllers; suboptimal policies whose returns fall between 50% and 80% of the expert return are then selected to generate the datasets.
  • Hugging Face reports 980,848 rows and about 130 MB total file size.
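
The two naming schemes listed above can be reconciled with a small rename step. The sketch below is illustrative only: the field names come from the notes above, but the one-to-one mapping between them, the helper name to_hf_names, and the choice to drop index (which has no listed Hugging Face counterpart) are assumptions, not something the paper or the repositories state.

```python
# Assumed correspondence between GitHub-interface keys and Hugging Face columns.
# The names are taken from the notes above; the pairing itself is an assumption.
GITHUB_TO_HF = {
    "obs": "observations",
    "action": "actions",
    "reward": "rewards",
    "next_obs": "next_observations",
    "done": "terminals",
}


def to_hf_names(batch):
    """Rename GitHub-style keys to Hugging Face-style column names.

    Keys with no listed counterpart (for example "index") are dropped.
    """
    return {GITHUB_TO_HF[key]: value for key, value in batch.items() if key in GITHUB_TO_HF}


# Tiny synthetic batch with two transitions; this is not real NeoRL-2 data.
batch = {
    "obs": [[0.0, 1.0], [0.1, 1.1]],
    "next_obs": [[0.1, 1.1], [0.2, 1.2]],
    "action": [[0.5], [0.4]],
    "reward": [1.0, 0.0],
    "done": [False, True],
    "index": [0, 1],
}
hf_batch = to_hf_names(batch)
```

Keeping the mapping in one dictionary means a loader for either artifact can feed the same downstream code, and any mismatch between the two sources shows up in a single place.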

Task Shapes

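| Task | Observation shape | Action shape | Done flag | Max timesteps |
|---|---|---|---|---|
| Pipeline | 52 | 1 | false | 1000 |
| Simglucose | 31 | 1 | true | 480 |
| RocketRecovery | 7 | 2 | true | 500 |
| RandomFrictionHopper | 13 | 3 | true | 1000 |
| DMSD | 6 | 2 | false | 100 |
| Fusion | 15 | 6 | false | 100 |
| SafetyHalfCheetah | 18 | 6 | false | 1000 |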
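
The shapes above can double as a sanity check when loading data. The following is a minimal sketch, not part of the NeoRL-2 codebase: TaskSpec, TASK_SPECS, and check_transition are hypothetical names, and the numbers are simply transcribed from the table.

```python
from typing import NamedTuple


class TaskSpec(NamedTuple):
    obs_dim: int
    action_dim: int
    done_flag: bool      # the "Done flag" column, copied as-is
    max_timesteps: int


# Values transcribed from the Task Shapes table; names are hypothetical.
TASK_SPECS = {
    "Pipeline": TaskSpec(52, 1, False, 1000),
    "Simglucose": TaskSpec(31, 1, True, 480),
    "RocketRecovery": TaskSpec(7, 2, True, 500),
    "RandomFrictionHopper": TaskSpec(13, 3, True, 1000),
    "DMSD": TaskSpec(6, 2, False, 100),
    "Fusion": TaskSpec(15, 6, False, 100),
    "SafetyHalfCheetah": TaskSpec(18, 6, False, 1000),
}


def check_transition(task, obs, action):
    """Return a list of shape problems for one transition (empty if none)."""
    spec = TASK_SPECS[task]
    problems = []
    if len(obs) != spec.obs_dim:
        problems.append(f"obs has {len(obs)} dims, expected {spec.obs_dim}")
    if len(action) != spec.action_dim:
        problems.append(f"action has {len(action)} dims, expected {spec.action_dim}")
    return problems


ok = check_transition("RocketRecovery", [0.0] * 7, [0.0] * 2)
bad = check_transition("Pipeline", [0.0] * 7, [0.0])
```

Here ok is an empty list, while bad contains one message because a 7-dimensional observation does not match Pipeline's 52 dimensions.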

Action-Time-Series Notes

NeoRL-2 is a clean action-conditioned trajectory source:

observation_t + action_t -> reward_t + observation_{t+1} + terminal_t

This makes it better aligned with action-conditioned world-model training than logged decision datasets that only expose one-step outcomes. The harder part is that the benchmark intentionally includes delayed effects, external factors, conservative behavior policies, and safety constraints.
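
As a rough illustration of the mapping above, the sketch below turns transition columns into one-step world-model training pairs. It assumes the Hugging Face-style column names from the Dataset Notes and plain Python lists; make_world_model_pairs is a hypothetical helper, and the sample values are synthetic.

```python
def make_world_model_pairs(observations, actions, rewards, next_observations, terminals):
    """Build (input, target, terminal) triples for one-step world-model training.

    input  = observation_t concatenated with action_t
    target = reward_t followed by observation_{t+1}
    """
    pairs = []
    for obs, act, rew, nxt, term in zip(
        observations, actions, rewards, next_observations, terminals
    ):
        model_input = list(obs) + list(act)
        target = [float(rew)] + list(nxt)
        pairs.append((model_input, target, bool(term)))
    return pairs


# Three synthetic transitions with a 2-d observation and a 1-d action.
pairs = make_world_model_pairs(
    observations=[[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]],
    actions=[[0.5], [0.25], [0.0]],
    rewards=[1.0, 0.5, 0.0],
    next_observations=[[0.5, 1.5], [1.0, 2.0], [1.5, 2.5]],
    terminals=[False, False, True],
)
```

A one-step pairing like this does not by itself capture the delayed effects the benchmark stresses; a model that needs them would have to condition on a window of past observations and actions rather than a single step.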

Gotchas

  • The paper is an arXiv preprint from 2025; use it as a current benchmark artifact, not as peer-reviewed settled evidence.
  • The tasks are simulated to reflect practical issues; they are not direct real-world business data.
  • The paper reports that current baselines often fail to significantly improve over the behavior policy, and no reported baseline reaches the paper’s solved threshold.
  • Hugging Face config metadata lists Salespromotion and Simglucose-high in addition to the seven paper/GitHub tasks.
  • GitHub says datasets are CC BY 4.0 and code is Apache 2.0, while Hugging Face frontmatter marks the dataset repo as apache-2.0.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | partially closes | Provides explicit observation_t, action_t, reward, observation_{t+1}, and terminal signals across seven non-vision control tasks. | Simulators are benchmark approximations; no direct real-world deployment data. |
| Benchmarks: what level of modeling is tested? | partially closes | Stresses delayed effects, external factors, safety constraints, rule-based behavior policies, conservative data, and limited data. | Needs TSFM-native model comparisons and benchmark protocols beyond offline RL baselines. |
| Time representation and irregular event streams | adjacent | Pipeline and Simglucose explicitly test delay; trajectories have finite horizons and termination signals. | Mostly fixed simulator step interfaces rather than irregular event streams. |
| Context interface: channel context and general context | adjacent | Task identity and environment-specific properties define different control domains and constraints. | No unified typed context schema for cross-domain transfer. |