RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning
Source
- Raw Markdown: paper_rl-unplugged-2020.md
- PDF: paper_rl-unplugged-2020.pdf
Core Claim
RL Unplugged collects logged RL transitions from several domains, including Atari, DeepMind Control, and DeepMind Lab, for offline RL evaluation.
Action-Time-Series Notes
- The time-series unit is a stream of replayed transitions with observations, actions, rewards, and discounts.
- Action semantics vary by domain, from Atari discrete controls to continuous control actions.
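- It is valuable for world models because each logged step already pairs an action with the observation and reward around it, so the data can train action-conditioned dynamics models without further restructuring.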
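
The transition stream described in the notes above can be sketched as a small data structure. This is a minimal illustration and not the dataset's actual schema or loader: the `Transition` class, its field names, the presence of a `next_observation` field, and the `to_dynamics_pairs` helper are all assumptions made for the example. Discrete actions are assumed to be one-hot encoded so that they share a vector shape with continuous ones.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """One logged step (hypothetical layout, not the real dataset schema)."""
    observation: Sequence[float]
    action: Sequence[float]        # one-hot if discrete, real-valued if continuous
    reward: float
    discount: float                # 0.0 is used here to mark episode termination
    next_observation: Sequence[float]


def to_dynamics_pairs(
    stream: Sequence[Transition],
) -> List[Tuple[List[float], List[float]]]:
    """Turn transitions into (input, target) pairs for a dynamics model.

    Input  = observation concatenated with action.
    Target = next observation concatenated with reward.
    """
    pairs = []
    for t in stream:
        model_input = list(t.observation) + list(t.action)
        target = list(t.next_observation) + [t.reward]
        pairs.append((model_input, target))
    return pairs


# Toy two-step stream with made-up values.
stream = [
    Transition([0.0, 1.0], [1.0, 0.0], 0.5, 1.0, [0.1, 0.9]),
    Transition([0.1, 0.9], [0.0, 1.0], 0.0, 0.0, [0.2, 0.8]),
]
pairs = to_dynamics_pairs(stream)
```

Predicting the reward alongside the next observation is one possible target choice; a model could equally predict only the next observation or also the discount.
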
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | partially closes | Provides logged transition tuples with observations, actions, rewards, next observations, and sequence data for offline RL. | Simulated/game domains do not match operational time-series interventions or digital-system remediation. |
| Benchmarks: what level of modeling is tested? | partially closes | Separates online and offline policy selection and includes partial observability, action-space diversity, stochasticity, and nonstationarity. | Benchmark scores control algorithms, not general TSFM state, forecast, generation, and explanation surfaces. |
| Streaming state, long context, and constant updates | adjacent | Sequence data include future states, actions, and rewards for recurrent models needing memory. | Does not test always-on serving or retained latent-state refresh. |
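
As a rough illustration of the sequence data mentioned in the last row, the sketch below cuts one logged episode into fixed-length, overlapping windows of the kind a recurrent model could train on. The `sliding_windows` helper and its `length` and `stride` parameters are hypothetical names chosen for this example; the benchmark's own sequence format may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    episode: Sequence[T], length: int, stride: int = 1
) -> List[List[T]]:
    """Return every full window of `length` steps, advancing by `stride`.

    Trailing steps that do not fill a complete window are dropped.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(episode) - length
    return [
        list(episode[start:start + length])
        for start in range(0, last_start + 1, stride)
    ]


# Integers stand in for logged steps of a six-step episode.
episode = list(range(6))
windows = sliding_windows(episode, length=4, stride=2)
```

An episode shorter than `length` yields no windows here; padding short episodes instead is an alternative design choice.
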