Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation

Source

Core Claim

Open Bandit Dataset provides logged bandit feedback from ZOZOTOWN with actions, rewards, and propensities for off-policy evaluation.

Action-Time-Series Notes

  • It has explicit actions and propensities, but its temporal dynamics are weaker than those of full trajectory datasets.
  • It is best viewed as contextual action-response data rather than a rich world-model dataset.
  • It is useful for testing causal/off-policy pieces of an action-conditioned modeling stack.
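
As a hedged illustration of the off-policy piece mentioned above, the sketch below computes an inverse propensity weighting (IPW) estimate of a target policy's value from logged action, reward, and propensity values. The `LoggedRound` schema, the `target_prob` callable, and the toy log are all hypothetical. This is not the Open Bandit Pipeline API, and the data is made up rather than taken from Open Bandit Dataset.

```python
from typing import Callable, NamedTuple, Sequence


class LoggedRound(NamedTuple):
    """One logged bandit round (hypothetical schema, not the dataset's actual columns)."""
    context: tuple      # features observed before acting
    action: int         # action chosen by the logging policy
    reward: float       # observed reward, e.g. 1.0 for a click
    propensity: float   # probability the logging policy assigned to `action`


def ipw_estimate(
    log: Sequence[LoggedRound],
    target_prob: Callable[[tuple, int], float],
) -> float:
    """Inverse propensity weighting estimate of the target policy's expected reward."""
    if not log:
        raise ValueError("log must contain at least one round")
    total = 0.0
    for r in log:
        if r.propensity <= 0.0:
            raise ValueError("propensities must be strictly positive")
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to pick the logged action than the logging policy was.
        weight = target_prob(r.context, r.action) / r.propensity
        total += weight * r.reward
    return total / len(log)


# Toy log from a uniform-random logging policy over two actions (made-up data).
toy_log = [
    LoggedRound(context=(0.0,), action=0, reward=1.0, propensity=0.5),
    LoggedRound(context=(1.0,), action=1, reward=0.0, propensity=0.5),
    LoggedRound(context=(0.0,), action=1, reward=1.0, propensity=0.5),
    LoggedRound(context=(1.0,), action=0, reward=0.0, propensity=0.5),
]


def always_action_zero(context: tuple, action: int) -> float:
    """Hypothetical deterministic target policy that always plays action 0."""
    return 1.0 if action == 0 else 0.0


estimate = ipw_estimate(toy_log, always_action_zero)
```

Each round is weighted independently, so the estimator needs no state transitions. That fits one-step contextual data, and it is also why this kind of log exercises only the causal and off-policy part of an action-conditioned stack.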

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Causal structure, counterfactuals, and control | partially closes | The readable raw abstract describes large-scale bandit feedback data for evaluating off-policy estimators and bandit algorithms. | The converted raw Markdown is not fully expanded, and the benchmark is one-step recommendation rather than state rollout. |
| Benchmark level | warning | Logged rewards and known policies test action selection and off-policy evaluation. | Needs temporal state transitions, richer context history, and next-state targets. |
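
To make the missing pieces concrete, the sketch below contrasts a one-step logged bandit record with the kind of transition record a state-rollout benchmark would need. Both classes and every field name are hypothetical illustrations, not schemas taken from Open Bandit Dataset or any library.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple

Features = Tuple[float, ...]


@dataclass(frozen=True)
class BanditRecord:
    """One-step contextual bandit log entry (hypothetical field names)."""
    context: Features
    action: int
    reward: float
    propensity: float


@dataclass(frozen=True)
class TransitionRecord:
    """What a state-rollout benchmark would additionally need (hypothetical)."""
    history: Tuple[Features, ...]  # richer context history before this step
    context: Features
    action: int
    reward: float
    propensity: float
    next_context: Features         # next-state target for a transition model


def missing_fields(have: type, need: type) -> List[str]:
    """Names of fields present in `need` but absent from `have`, in declaration order."""
    have_names = {f.name for f in fields(have)}
    return [f.name for f in fields(need) if f.name not in have_names]


missing = missing_fields(BanditRecord, TransitionRecord)
```

Under these assumed schemas, `missing` lists the history and next-context fields, which correspond to the richer context history and next-state targets named in the table.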