Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation

Source

Open Bandit Dataset provides logged bandit feedback from ZOZOTOWN with actions, rewards, and propensities for off-policy evaluation.

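The dataset records explicit actions and logging propensities, but each record is a single recommendation round, so its temporal dynamics are weaker than those of full trajectory datasets.
It is best viewed as contextual action-response data (context, action, reward) rather than a rich world-model dataset with state transitions.
It is useful for testing the causal and off-policy evaluation pieces of an action-conditioned modeling stack.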
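
To make the role of the logged propensities concrete, here is a minimal sketch of an inverse propensity weighting (IPW) estimate of a target policy's value from logged rounds. It is an illustration under stated assumptions, not the dataset's or pipeline's own API: the `LoggedRound` record, the `ipw_estimate` function, the toy `logged` data, and the `always_first` policy are all hypothetical names invented here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class LoggedRound:
    """One logged bandit round (hypothetical record layout, not the real schema)."""
    context: Tuple[float, ...]
    action: int
    reward: float
    propensity: float  # probability that the logging policy chose `action`


def ipw_estimate(
    rounds: Sequence[LoggedRound],
    target_prob: Callable[[Tuple[float, ...], int], float],
) -> float:
    """Average of reward * target_prob(action | context) / logging propensity."""
    if not rounds:
        raise ValueError("need at least one logged round")
    total = 0.0
    for r in rounds:
        if r.propensity <= 0.0:
            raise ValueError("propensities must be strictly positive")
        weight = target_prob(r.context, r.action) / r.propensity
        total += weight * r.reward
    return total / len(rounds)


# Toy data (made up): a uniform logging policy over two actions.
logged = [
    LoggedRound(context=(0.0,), action=0, reward=1.0, propensity=0.5),
    LoggedRound(context=(1.0,), action=1, reward=0.0, propensity=0.5),
    LoggedRound(context=(0.0,), action=0, reward=1.0, propensity=0.5),
    LoggedRound(context=(1.0,), action=1, reward=1.0, propensity=0.5),
]


def always_first(context: Tuple[float, ...], action: int) -> float:
    """Deterministic target policy that always plays action 0."""
    return 1.0 if action == 0 else 0.0


def uniform(context: Tuple[float, ...], action: int) -> float:
    """Target policy identical to the toy logging policy."""
    return 0.5


value_always_first = ipw_estimate(logged, always_first)
value_uniform = ipw_estimate(logged, uniform)
```

When the target policy equals the logging policy, every weight is 1 and the estimate reduces to the mean logged reward. Note that each round is scored independently: nothing in the record links one round to a next state, which is the one-step limitation described above.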

Agenda slot	Verdict	Evidence	Missing pieces
Causal structure, counterfactuals, and control	partially closes	The readable raw abstract describes large-scale bandit feedback data for evaluating off-policy estimators and bandit algorithms.	The converted raw Markdown is not fully expanded, and the benchmark is one-step recommendation rather than state rollout.
Benchmark level	warning	Logged rewards and known policies test action selection and off-policy evaluation.	Needs temporal state transitions, richer context history, and next-state targets.