A Contextual-Bandit Approach to Personalized News Article Recommendation

Source

The Yahoo! Front Page news recommendation line of work uses randomized logged traffic to evaluate contextual-bandit policies whose actions are candidate articles.

The sequence is a temporal log of recommendation decisions: contexts, chosen actions, and click rewards.
It is valuable for action-response modeling, but each record carries only a one-step click reward and no rich next-state observation.
For world-model comparison, it therefore belongs in the weak-time-series / bandit category.
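
As a rough illustration of how such a log supports offline evaluation, the sketch below replays logged events against a candidate policy and keeps only the events where the policy picks the same article that was logged. It assumes the logged actions were chosen uniformly at random and that the arm set is fixed, which simplifies the real data. The record layout, `replay_evaluate`, and `make_synthetic_log` are hypothetical names introduced here, and the synthetic log stands in for real traffic.

```python
import random
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

# Hypothetical record layout: (context features, logged article index, click reward).
LoggedEvent = Tuple[Sequence[float], int, float]
Policy = Callable[[Sequence[float]], int]
Update = Callable[[Sequence[float], int, float], None]


def replay_evaluate(policy: Policy,
                    events: Iterable[LoggedEvent],
                    update: Optional[Update] = None) -> Tuple[float, int]:
    """Estimate the per-trial reward of `policy` from randomized logs.

    Returns (mean reward over matched events, number of matched events).
    """
    total_reward = 0.0
    matched = 0
    for context, logged_action, reward in events:
        # An event is usable only if the policy would have shown the logged article;
        # outcomes for unchosen arms are never observed.
        if policy(context) != logged_action:
            continue
        total_reward += reward
        matched += 1
        if update is not None:
            # A learning policy may only learn from the events it matched.
            update(context, logged_action, reward)
    estimate = total_reward / matched if matched else float("nan")
    return estimate, matched


def make_synthetic_log(n: int, seed: int = 0) -> List[LoggedEvent]:
    """Toy stand-in for a random bucket: two arms, one context feature."""
    rng = random.Random(seed)
    log: List[LoggedEvent] = []
    for _ in range(n):
        context = [rng.uniform(-1.0, 1.0)]
        action = rng.randrange(2)  # uniformly random logging policy (assumption)
        # Made-up reward rule: a click happens when the arm matches the sign of the feature.
        reward = 1.0 if action == int(context[0] > 0) else 0.0
        log.append((context, action, reward))
    return log
```

On the toy log, a policy that returns `int(x[0] > 0)` agrees with the made-up reward rule on every retained event, while a constant policy retains roughly half the events and earns a lower estimate. The discarded events are the price of having no counterfactual outcomes.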

Agenda slot	Verdict	Evidence	Missing pieces
Causal structure, counterfactuals, and control	partially closes	The paper formalizes sequential contextual-bandit actions, clicked rewards, and unbiased offline policy evaluation from randomized traffic.	Feedback is one-step reward only; unchosen-arm outcomes and next-state trajectories are absent.
Dynamic compute allocation	adjacent	LinUCB updates fixed-size matrices incrementally and is designed for fast online personalization under large traffic.	This is algorithmic efficiency, not adaptive neural inference compute.
Benchmark level	warning	The 36M-event Yahoo random bucket is strong for off-policy bandit evaluation.	It does not test latent-state maintenance, multivariate dynamics, or action-conditioned rollouts.
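
The incremental fixed-size updates noted in the Dynamic compute allocation row can be sketched as follows. This is a minimal per-arm linear UCB model under stated assumptions, not the paper's reference implementation: it keeps the inverse of the design matrix directly and refreshes it with a Sherman-Morrison rank-one update instead of re-inverting, which is a substitution made here for brevity. `LinUCBArm`, `choose_arm`, and the exploration weight `alpha` are illustrative names.

```python
import math
from typing import List, Sequence


class LinUCBArm:
    """Per-arm state for a disjoint linear UCB model (illustrative sketch)."""

    def __init__(self, dim: int) -> None:
        # The design matrix A starts as the identity, so its inverse does too.
        self.a_inv: List[List[float]] = [
            [1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        self.b: List[float] = [0.0] * dim

    def _a_inv_times(self, x: Sequence[float]) -> List[float]:
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x: Sequence[float], alpha: float) -> float:
        """Upper confidence bound: estimated reward plus alpha times the width."""
        a_inv_x = self._a_inv_times(x)
        # theta = A^{-1} b, and A^{-1} is symmetric, so theta . x = b . (A^{-1} x).
        mean = sum(bi * vi for bi, vi in zip(self.b, a_inv_x))
        width = math.sqrt(sum(xi * vi for xi, vi in zip(x, a_inv_x)))
        return mean + alpha * width

    def update(self, x: Sequence[float], reward: float) -> None:
        """Apply A += x x^T and b += reward * x without a full inversion."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, a_inv_x))
        dim = len(x)
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / denom.
        for i in range(dim):
            for j in range(dim):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(dim):
            self.b[i] += reward * x[i]


def choose_arm(arms: Sequence[LinUCBArm], x: Sequence[float], alpha: float) -> int:
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)
```

Each arm stores one d-by-d matrix and one d-vector regardless of how many events it has seen, so the per-event cost depends on the feature dimension and not on the log length. That is the sense in which the efficiency here is algorithmic and not adaptive inference compute.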