Time-Series Benchmark Hygiene

Summary

Time-series foundation model rankings are brittle unless task, protocol, context length, horizon, covariates, leakage controls, and adaptation mode match. Future source pages should link here instead of repeating the same benchmark caveats.

The latent-state wiki frame adds one more hygiene rule: average forecast error is not enough to establish system understanding. Benchmarks should say whether they test observation forecasting, representation quality, context use, rare-event sensitivity, state maintenance, or action-conditioned rollout.

Foundation Time-Series Model Research Agenda is the central rubric for this page: benchmark claims should identify which agenda slot they test and whether they close, partially close, or merely touch the path toward a foundation time-series model.

Required Separations

Zero-shot base-model results should be separated from few-shot adaptation, linear probing, full fine-tuning, and dataset-specific training.
Fine-tuned and ensemble entries should be separated from base released checkpoints. Toto 2.0, for example, reports base models, a fine-tuned 2.5B variant, and a family-and-friends ensemble.
Frozen feature extraction for classification should be separated from label-free zero-shot prediction. Many classification “zero-shot” results still train a Random Forest, SVM, logistic-regression head, or similar downstream classifier on target labels.
Label-scarce classification and SSL reports should separate supervised-only hypersweeps, validation-set size, model-selection budget, and unlabeled-objective gains. S4L is the historical vision analogue: before crediting auxiliary unlabeled-data objectives, compare against strong supervised-only baselines and validation-size sensitivity.
Single-model results should be separated from representation fusion, such as MantisV2 plus TiViT features.
Point forecasting, quantile forecasting, probabilistic forecasting, imputation, anomaly detection, classification, and reasoning benchmarks should not be ranked as if they measure one ability.
Observation forecasting, latent-state prediction, representation learning, and action-conditioned world modeling should not be collapsed into one claim.
Generalization reports should separate hypothesis-space capacity, selected-solution complexity, and observed held-out performance. Deep Learning is Not So Mysterious or Different is the upstream warning: fitting randomized labels or reaching zero training loss shows flexibility, not that the selected structured-data solution is complex or will generalize poorly. Time-series randomization controls should preserve or deliberately break autocorrelation, channel structure, event timing, and exogenous-variable alignment so the tested complexity measure has a clear meaning.
Latent-state accessibility diagnostics should separate coarse component or class presence from dense state variables. Aionoscope is the local case: categorical component AUROC can look strong while timing, phase, amplitude, frequency, or regime variables remain weakly recoverable under dense probes.
Fixed-recipe stress tests should not be read as fully tuned leaderboards. LeNEPA intentionally measures recipe reuse after changing pretraining signal family, not the best possible Diag-tuned JEPA recipe.
One-step prediction, decision-usable multi-step rollout, closed-loop transfer, and evidence-driven model revision should be separated. Agentic World Modeling names these as L1 Predictor, L2 Simulator, and L3 Evolver boundaries.
Time-series-agent reports should separate perception, reasoning, planning/action, memory/knowledge, temporal world modeling, and reliability layers. Awesome Agentic Time Series is useful as a taxonomy for this separation, but its listed papers still need primary-source checks before their scores or claims are compared.
Interactive agent benchmarks should separate query planning, evidence integration, hypothesis construction, query efficiency, and non-informative action rate. Agentic Automata Learning is the local symbolic benchmark warning for this split: final task success alone can hide whether an agent gathered insufficient evidence or failed to infer a stable latent model from sufficient evidence.
Learned-simulator scores should be separated from real-environment or live-stand transfer. World Models is the historical hygiene case: a controller can maximize reward inside an imperfect learned dynamics model while failing after transfer.
Action-conditioned world-model reports should separate prediction error, planning success, solver budget, per-step latency, distribution-shift regime, and factor-of-variation settings. stable-worldmodel is the current local case: its Push-T analyses show that prediction error can overlap heavily between successful and failed plans under distribution shift.
Test-time adaptation inside control loops should separate the frozen pretrained model, adapted parameters, update objective, buffer policy, number of gradient steps, reset or persistence policy, added latency, and whether the reported gain comes from prediction-error reduction or better action ranking. AdaJEPA is the local case for this split.
Power-grid control, surrogate, and security-analysis reports should separate challenge survival/cost score and single-agent versus multi-agent protocol; FACTS setpoint control versus topology reconfiguration, topology-only versus topology-plus-redispatch/storage action spaces, topology-action preprocessing, disconnected-line bus-assignment handling, topology reversion policy, and candidate-action feasibility filters; stochastic seed policy, evaluation horizon or chronics length, contingency stochasticity, and grid size; runtime-shield lookahead horizon, fallback action, action-space restriction, false-veto/false-accept rates, preventive N-1 robustness before a contingency, and corrective behavior after a contingency; learned-screening latency, residual risk at fixed simulator budget, simulator-call count, wall-clock latency, simulator cost to reach a risk target, and simulator access; load-flow solver emulation, DC-approximation baselines, topology-size generalization, physical-line assumptions, sparse/dense implementation speed, explicit-topology synthetic tests versus observational records with hidden or surrogate topology, and simulator speedup versus deployment-grade accuracy; and downstream policy/control utility and online planning budget. Grid2Op is the local case: RL2Grid, MARL2Grid-TR, L2RPN challenge papers, soft-label action ranking, Gibbs-prior risk surrogates, runtime shields, LLM-guided replay-buffer shaping, and policy-distillation controllers are related but not interchangeable evidence axes.
Graph time-series and graph-control reports should separate modeling quality from sparse-kernel/backend quality. IO-Aware GNN Layers is the local case: direct GNN baselines should report whether they use fused graph attention, cached cuSPARSE, degree-aware reductions, graph reordering, or default DGL/PyG composition, plus peak memory and wall-clock latency.
Embodied world-model evaluations should separate open-loop action-conditioned prediction; closed-loop task utility or policy evaluation; and physical-consistency, controllability, and executability diagnostics. World Model for Robot Learning Survey is the local robotics survey map for that split.
Training-in-imagination reports should separate dynamics-transition error, reward-model error, reward annotation cost, reward noise, and reward bias. On Training in Imagination is the current local source for why zero-mean reward noise and systematic reward bias have different policy-gradient consequences.
Upstream dynamic-compute results, such as EBT language/video/image scaling, should be separated from numeric time-series forecasting, generation, and control evidence unless the benchmark directly tests those interfaces.
Test-time-memory results should separate memory-only modules, full architectures, chunk size, update cost, baseline strength, and adaptation mode. Titans Revisited is the current local check for this failure mode.
Segment-level recurrent-memory results should separate synthetic recall/rewrite tasks, long-context QA, language-modeling transfer, segment size, sequential update cost, and direct numeric or action-conditioned evidence.
Sleep-time consolidation results should separate memory capacity, consolidation compute, wake-time prediction latency, training sequentiality, and whether the evidence is synthetic/language or numeric/action-conditioned.
Looped-efficient-attention reports should separate zero-shot quality, long-context retrieval, prefill throughput, decode throughput, KV-cache footprint, recurrent-state footprint, batch-size OOM frontier, and kernel availability. LT2 is the local case: linear/sparse looped mixers improve long-context language serving, but that is not direct evidence for numeric time-series state retention or action-conditioned rollout.
Sparse-attention reports should separate selector training, selected-token budget, block size, retrieval/selection recall, dense-baseline quality, prefill throughput, decode throughput, hardware generation, public kernel availability, and full serving-stack behavior. MiniMax Sparse Attention is the local case: vendor-reported 1M-context speedups and benchmark parity are important, but they should not be treated as numeric time-series state-retention or action-conditioned rollout evidence.
Architecture papers with follow-up benchmark narratives should separate paper results, official but unreleased claims, and public-code reproducibility. Dragon Hatchling’s Sudoku result is an official Pathway narrative claim that has not yet been openly reproduced from the public BDH repository.
Recursive puzzle papers should separate controlled recursive-model baselines from frontier LLM reference scores. GRAM reports large-reasoning-model Sudoku/ARC numbers as benchmark-difficulty context, not as matched training-data, prompting, tool-use, or inference-budget comparisons.
Generation benchmarks should separate pointwise fidelity metrics, retrieval or rank metrics, text-condition ablations, and downstream utility. T2S’s WAPE, MSE, and MRR@10 results show text-conditioned fidelity, not forecasting utility or world-model readiness.
TimeCraft-style generation benchmarks should separate distributional fidelity, text controllability, target-aware downstream utility, causal/interventional validity, irregular-continuous reconstruction, and online pretraining gain. TimeCraft, TimeDP, BRIDGE, TarDiff, CaTSG, OATS, and Diff-MN are not comparable on one scalar leaderboard.
For BRIDGE-style text-controlled generation, separate generated-description quality, no-text/no-prototype ablations, text-alignment scores, human ratings, and downstream forecasting augmentation. None of these alone proves operational context use or causal validity.
For irregular-continuous generation, separate arbitrary-time reconstruction or interpolation quality from real irregular sampling, missingness-not-at-random, streaming updates, and downstream decision utility. Diff-MN is the local case because much of its irregularity is simulated by random observation dropping.
Diffusion forecasting reports such as MG-TSD should separate probabilistic forecast metrics such as CRPS, NMAE, and NRMSE from synthetic-generation fidelity, text-control, downstream-utility, or world-model evidence.
Bias-mitigation reports should separate bias or shortcut metrics, fidelity/quality metrics, prompt or condition alignment, and OOD forecasting metrics. InvDiff is the local case for this split.
Financial-market generation benchmarks such as DiGA and MarS should separate control-target error, stylized-fact fidelity, market-impact or what-if validity, downstream trading-agent utility, latency, data-access/reproducibility, and simulator-assumption checks.
Healthcare TSFM reports should separate inherently irregular clinical datasets from regular datasets with simulated missingness, out-of-distribution tests from pretraining-source-family tests, and passive forecasting from treatment or intervention modeling. MIRA is the local irregular-clinical case; SensorFM is the wearable-health case, where benchmark hygiene should additionally separate private versus public data, one-minute aggregate features versus raw waveforms, frozen linear probes versus agent-searched downstream heads, and Personal Health Agent grounding from diagnosis or intervention planning.
Retrieval-augmented forecasting should report knowledge-base provenance, overlap, and retrieval candidates separately from base-model zero-shot scores. TimeRAF is the local case.
Streaming-generation benchmarks should separate aggregate quality from temporal artifact diagnostics such as repetition, false silence or background activity, drift over time, and quantization degradation. Moshi is the current local audio case.
Generative model reports should separate sample quality from memorization risk. Why Diffusion Models Don’t Memorize shows that a diffusion model can reach good samples before late training-set memorization, so generated time-series benchmarks should report checkpoint/update count, dataset size, and memorization or duplicate probes alongside fidelity metrics.
Target-optimized synthetic or curriculum data should separate surface sample quality from downstream training effect. Synthetic Data for any Differentiable Target shows in language-model SFT that benign-looking examples can be optimized to induce hidden weight or metric changes, so time-series curriculum and synthetic-data benchmarks should audit representation drift, rare-state retention, normal-behavior retention, and poisoning-like side effects. Target-aware generators such as TarDiff also need guidance-set isolation, downstream-model diversity checks, and metric-overfitting audits. OATS-style online TSFM augmentation additionally needs reference-set provenance and evaluation-dataset isolation audits, because its guidance can depend on small reference samples from evaluation-like distributions.
Query-conditioned data-attribution reports should separate the query distribution, upstream source corpus, sampled candidate pool, fraction inspected for scoring, selected unique examples, repeated sample exposures, optimizer updates, fine-tuning compute, attribution compute, target metrics, untargeted retention metrics, and query/test taxonomy overlap. Motion Attribution for Video Generation is the local case: “10% beats 100%” means that fine-tuning a pretrained model on 1,000 selected clips beats fine-tuning on all 10,000 clips in the sampled candidate pool on the targeted dynamic-degree metric. Attribution still processes all 10,000 candidates and costs about 150 A100-hours; the paper does not establish 10% candidate-data access, matched total compute, or a 10-times end-to-end saving, and not every visual-quality metric improves.
Reference-guided reasoning or RL benchmarks should report correct-reference, wrong-reference, no-reference, and judge-size controls before treating dense partial-progress scores as reliable training signals. ExpRL is the current language-reasoning case: problem-matched references help capable judges verify partial progress, while wrong references and too-weak judges degrade the reward signal.
Data filtering and curriculum reports should include no-filter, loose-filter, and dedup-only controls when the source corpus has natural diversity. A Bitter Lesson for Data Filtering shows that filtering wins can reverse with more compute, and No Filter shows that benchmark-friendly filters can hurt underrepresented distribution slices.
Scaling-law reports should separate clean pretraining fits from perturbation-aware fits, including noise/SNR model, low-bit serving, post-training pressure, fit grid, and extrapolation distance. LLMs as Noisy Channels is the current upstream language-model case.
Compute-optimal or fixed-FLOPs reports should state the cost model, parameter-count convention, data or effective-data unit, fit region, loss precision, fitting objective, and sensitivity to optimizer, schedule, tokenizer or patcher, data mix, and repeated data. Scaling Laws, Carefully is the current upstream method-hygiene source.
Training-dynamics reports should separate aggregate loss curves from capability-emergence probes, including absolute thresholds, checkpoint density, probe construction, representation-extraction layer/head, and whether relative thresholds change cross-model conclusions. Implicit Curriculum Hypothesis is the current upstream language-model case.
Representation-similarity reports should separate raw metrics, calibrated metrics, layer-search aggregates, and downstream utility. Aristotelian Representation Hypothesis is the current calibration case: raw global CKA-style convergence and max-over-layer trends can be width/depth artifacts, and temporal data need dependence-preserving nulls rather than naive row shuffles.
Hybrid-architecture reports should separate state-conditioned targets from exact-copy, repeated-normal, and structural-closure targets before crediting recurrence, attention, or a specific mixer for aggregate gains. Comparing Transformers and Hybrid Models at the Token Level is the current upstream language-model case: filtered token losses expose a state-like versus copy-like split that aggregate loss hides.
Long-context hybrid-memory reports should additionally separate router or retention policy, explicit-cache budget, maintained recurrent-state size, dual-state update cost, selected-span recall, cache-read normalization, and serving latency. Gated DeltaNet is the scalar-gated recurrent baseline behind the newer hybrid-memory comparisons; Oryx, Hybrid Associative Memories, and HOLA are current upstream language-model cases. HOLA also requires separating parameter overhead from cache-state bytes and distinguishing a 2k-trained RoPE full-attention checkpoint from a length-extrapolating baseline. Their results should not be treated as time-series state preservation until rare-regime, event-timing, exogenous-variable, and action-history probes are reported under matched budgets.
Training-system claims should separate memory footprint, wall-clock throughput, total training compute, from-scratch versus pretrained-conversion settings, and objective-specific proxy metrics. DiffusionBlocks is the current local case for this separation.
Upstream language diffusion metrics such as generative perplexity, entropy, BLEU, and ROUGE should stay separate from numeric time-series fidelity, calibration, generation utility, and control evidence. ELF is the current local case.
Masked diffusion language-model benchmarks should separate likelihood or likelihood-bound claims from task-specific scoring surrogates. iLLaDA reports confidence-based multiple-choice scoring that performs better empirically than a likelihood-style baseline, so those scores should not be treated as direct likelihood evidence or as numeric time-series calibration evidence.
KV-cache or retrieval-memory compression results should separate byte savings, quality recovery, hardware-native compute support, dequantization cost, latency, throughput, memory-pressure regime, and target-task quality. TurboQuant is the current local case for this separation.
Sharpness, flatness, and edge-of-stability claims should state optimizer, loss, batch regime, and whether the reported quantity is full-batch Hessian sharpness or batch sharpness.
Incident-response QA charts such as ARFBench should not be treated as forecasting leaderboards. They may mix frontier VLMs, LLMs, post-trained or hybrid time-series/VLM systems, domain experts, and oracle combinations.
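
The list above asks for randomization controls that deliberately preserve or break autocorrelation, and for dependence-preserving nulls instead of naive row shuffles. The following is a minimal sketch of that contrast, assuming a single univariate channel and a simple block-permutation null. The function names, the sine-wave stand-in series, and the block length of 50 are illustrative assumptions, not taken from any cited paper.

```python
import math
import random


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a sequence (0.0 for a constant sequence)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0.0:
        return 0.0
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return num / denom


def row_shuffle(x, rng):
    """Naive null: permute individual time steps, destroying temporal dependence."""
    y = list(x)
    rng.shuffle(y)
    return y


def block_shuffle(x, block_len, rng):
    """Block-permutation null: permute contiguous blocks, keeping within-block order."""
    blocks = [x[i:i + block_len] for i in range(0, len(x), block_len)]
    rng.shuffle(blocks)
    return [v for block in blocks for v in block]


# Hypothetical slow oscillation standing in for a strongly autocorrelated channel.
series = [math.sin(2 * math.pi * t / 100) for t in range(400)]
rng = random.Random(0)

original_ac = lag1_autocorr(series)
row_ac = lag1_autocorr(row_shuffle(series, rng))
block_ac = lag1_autocorr(block_shuffle(series, 50, rng))
print(f"original={original_ac:.3f} row-shuffled={row_ac:.3f} block-shuffled={block_ac:.3f}")
```

Both nulls keep the marginal value distribution, but only the block version keeps short-range dependence. A report that uses either one should state which structure the control was meant to break. Multichannel data would also need a decision about whether blocks are permuted jointly across channels, which this sketch does not cover.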

Benchmark Families

Forecasting benchmark names in this wiki include BOOM, Context is Key, GIFT-Eval, TIME, Chronos-ZS, fev-bench, Monash, LSF/LTSF, Time-Series-Library, Informer-style ETT tasks, Darts, Chronos Benchmark II, and Time-HD. Multi-task observability datasets such as TelecomTS and incident-response QA benchmarks such as ARFBench should be compared on their own task axes rather than forced into a forecasting-only rank. These differ in horizon, frequency, metric, target/covariate interface, text/context interface, channel count, channel-dependency structure, and leakage policy.

Dataset Anchors

Dataset	Use It For	Main Caveat
Context is Key	Text-conditioned probabilistic forecasting where context is essential.	Univariate and text-only; benchmark-first with no training split; numeric-only forecasters see an intentionally incomplete interface.
TelecomTS	5G observability anomaly detection, root-cause analysis, forecasting, and time-series/text Q&A.	Lab/testbed data with 18 channels; synthetic anomalies and generated tickets need artifact checks; no operator action channel.
Time-HD	HDTSF at the thousand-channel scale.	Passive forecasting only; sparse public dataset card.
BOOM	Observability forecasting at the grouped-query, high-cardinality scale.	No operator actions/interventions; Datadog pre-production source.
NeoRL-2	Non-vision action-conditioned transition benchmark with rewards and simulators under delays, exogenous factors, safety constraints, conservative policies, and limited data.	Simulated tasks; 2025 arXiv preprint; exact Hugging Face configs and license terms need pinning.
CityLearn	Building-energy control and demand-response trajectories from a simulator/schema with continuous storage and device controls.	Not a single immutable payload; package version, schema, reward/cost function, action set, and source-data provenance must be pinned.
Grid2Op	Graph-structured power-grid control episodes with topology/control inputs, contingencies, safety constraints, rewards, and simulator-backed next observations.	Challenge ecosystem, not fixed payload; pin Grid2Op version, backend, chronics, action masks/reductions, reward/cost, train/test scenarios, simulator access, planning budget, latency, single-agent versus multi-agent setup, and whether learned surrogates rank actions, predict risk, or model full transitions.
Tennessee Eastman Process Simulation Data	Industrial process anomaly detection and fault diagnosis with measured and manipulated variables.	Synthetic simulator data; manipulated variables need preprocessing into control inputs; no rewards, remediation actions, or counterfactual intervention protocol.
Aionoscope	Controlled latent-state accessibility diagnostics with exact categorical and dense labels.	Synthetic single-channel public validation snapshot; mean-pooled linear probes and native-length adapters limit ranking claims.
GIFT-Eval	Broad general-purpose TSFM evaluation and leaderboard comparisons.	Dataset-count/version summaries differ; not observability-specific.
TIME	Strict zero-shot and contamination-resistant claims.	Not primarily HDTSF or observability.
Time-Series-Library	Legacy LSF/LTSF continuity, especially ETT/Electricity/Weather comparisons.	Low-dimensional and often saturated.

Classification and representation-learning benchmarks include UCR and UEA. They test labeled shape or sequence discrimination, not direct future-value forecasting.

Static tabular benchmarks such as TALENT or small-data TabPFN-style suites are adjacent but should be kept separate from time-series evaluations because rows are not temporal histories by default.

TabPFN-3 makes this separation especially important. Its technical report includes static TabArena results, API/enterprise TabPFN-3-Plus and Thinking entries, and a specialized TabPFN-TS-3 time-series checkpoint; those are not one benchmark protocol.

Leakage And Overlap Risks

Broad pretraining corpora can include public datasets, training splits, or near-duplicates that later appear in benchmark reports. Chronos-2 explicitly discusses GIFT-Eval overlap and reports a synthetic-only ablation for stricter zero-shot evidence. TiRex flags overlap risks for some baselines and reports an appendix update intended to remove full GIFT-Eval overlap. FlowState makes data-overlap handling part of its GIFT-style zero-shot claim and separately evaluates unseen sampling rates. Toto 2.0 says public forecasting datasets were excluded from pretraining, making leakage control part of its claim. Sundial is another protocol-sensitive case: it reports excluded evaluation datasets for TimeBench, uses different sampling counts for FEV and GIFT-Eval, and separately reports zero-shot and once-fine-tuned FEV settings.
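
As an illustration of what a near-duplicate audit could look like, the sketch below estimates the fraction of benchmark windows whose z-normalized shape also appears in a pretraining series. It is a brute-force toy under stated assumptions: the function names, the window length, and the tolerance are hypothetical, and none of the papers named above is claimed to use this procedure.

```python
import math


def znorm(window):
    """Z-normalize a window so that offset and scale differences are ignored."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def sliding_windows(series, length):
    """All contiguous windows of the given length (empty if the series is too short)."""
    return [series[i:i + length] for i in range(len(series) - length + 1)]


def near_duplicate_rate(pretrain_series, benchmark_series, length, tol=1e-6):
    """Fraction of benchmark windows matching some pretraining window after z-normalization."""
    pretrain = [znorm(w) for w in sliding_windows(pretrain_series, length)]
    benchmark = [znorm(w) for w in sliding_windows(benchmark_series, length)]
    if not benchmark:
        return 0.0
    hits = sum(1 for b in benchmark if any(math.dist(b, p) <= tol for p in pretrain))
    return hits / len(benchmark)


# Hypothetical data: the first benchmark series is a rescaled, shifted copy of a pretraining segment.
pretrain = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 1.0]
leaked = [10.0, 12.0, 14.0, 16.0]
unrelated = [5.0, -2.0, 7.0, 0.0]

leaked_rate = near_duplicate_rate(pretrain, leaked, length=4)
unrelated_rate = near_duplicate_rate(pretrain, unrelated, length=4)
print(leaked_rate, unrelated_rate)
```

This check only catches affine rescaling of identical shapes. Resampled, time-warped, or noise-perturbed copies need a looser distance or a learned embedding, and a realistic corpus needs an index rather than the quadratic scan used here.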

Synthetic data does not eliminate benchmark risk by itself. Template families can resemble benchmark dynamics, and synthetic generators can encode artifacts that make downstream ranks look stronger than real transfer would be.

For target-optimized synthetic or curriculum data, surface quality is not enough. Synthetic Data for any Differentiable Target is language-model evidence rather than time-series evidence, but it shows that generated examples can be optimized for hidden downstream training effects while looking benign. Time-series curriculum and synthetic-data benchmarks should therefore audit downstream representation drift, rare-state retention, normal-behavior retention, and poisoning-like side effects, not only sample fidelity or aggregate score.

For filtered pretraining corpora, “filtered” is not automatically cleaner evidence. A Bitter Lesson for Data Filtering shows that the best data choice can depend on compute scale; No Filter shows that familiar benchmark gains can hide cultural and socioeconomic coverage loss. Time-series reports should therefore name the filter target, compare no-filter or loose-filter controls, and include tail-slice probes before claiming a curriculum or filter improves representations.

For checkpoint-time capability reports, aggregate loss is not enough. Implicit Curriculum Hypothesis shows in language models that absolute-threshold probe emergence can be stable while relative-threshold definitions change the conclusion. Time-series reports should therefore define the capability threshold, probe suite, checkpoint spacing, and representation-extraction protocol before comparing training runs.
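
A toy example can show why the threshold definition has to be fixed before training runs are compared. The sketch assumes one plausible reading of "relative threshold", namely a fraction of each model's own final probe score. The cited paper's exact definitions are not reproduced here, and the probe curves are invented for illustration.

```python
def first_crossing(curve, threshold):
    """Index of the first checkpoint whose probe score reaches the threshold, else None."""
    for index, score in enumerate(curve):
        if score >= threshold:
            return index
    return None


def emergence_absolute(curve, threshold):
    """Emergence checkpoint under a fixed score threshold shared by all models."""
    return first_crossing(curve, threshold)


def emergence_relative(curve, fraction):
    """Emergence checkpoint under a threshold set to a fraction of the model's final score."""
    return first_crossing(curve, fraction * curve[-1])


# Hypothetical probe accuracies at five evenly spaced checkpoints.
model_a = [0.20, 0.50, 0.85, 0.90, 0.95]
model_b = [0.30, 0.60, 0.65, 0.68, 0.70]

abs_a = emergence_absolute(model_a, 0.80)
abs_b = emergence_absolute(model_b, 0.80)
rel_a = emergence_relative(model_a, 0.90)
rel_b = emergence_relative(model_b, 0.90)
print("absolute:", abs_a, abs_b)
print("relative:", rel_a, rel_b)
```

Under the absolute threshold, model A acquires the capability at checkpoint 2 and model B never does. Under the relative threshold, model B appears to acquire it first. The cross-model conclusion therefore depends on the definition, and checkpoint spacing bounds how precisely either answer can be stated.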

Cross-Paper Comparison Checklist

Before comparing two reported ranks, check:

Is the task forecasting, classification, imputation, anomaly detection, reasoning, or generation?
Is the model zero-shot, few-shot, linear-probed, fine-tuned, or ensembled?
Are context length, horizon, patch size, frequency, and rollout method comparable?
Are known future exogenous variables, covariates, or grouped series available to both models?
Is textual context essential to the correct forecast, merely descriptive metadata, or unavailable?
Are metrics aligned, such as MASE, CRPS, weighted quantile loss, MSE, MAE, WAPE, MRR@10, accuracy, macro-F1, or rank?
For generation tasks, is the report measuring text-conditioned fidelity, retrieval/ranking, sample diversity, or downstream task utility?
For observability datasets, is the target forecasting, anomaly presence, anomaly interval, root cause, or language QA?
Does a pretraining corpus, adaptation split, retrieval knowledge base, or synthetic generator include benchmark train/test data, private telemetry, near-neighbor windows, or unreleased generation artifacts?
Is the model univariate, channel-independent, or native multivariate?
Is the benchmark channel count high enough to test high-dimensional multivariate behavior rather than ordinary low-channel forecasting?
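
The checklist above can be sketched as a small protocol record plus a field-by-field comparison. This is a minimal illustration rather than a schema used by any benchmark: the class name, the field names, and the example values are assumptions, and a real record would need more axes, such as frequency, patch size, and rollout method.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ProtocolRecord:
    """Hypothetical summary of the protocol behind one reported score."""
    task: str               # e.g. "forecasting", "classification", "generation"
    adaptation: str         # e.g. "zero-shot", "linear-probe", "fine-tuned", "ensemble"
    metric: str             # e.g. "MASE", "CRPS", "macro-F1"
    context_length: int
    horizon: int
    uses_covariates: bool
    uses_text_context: bool
    channel_mode: str       # "univariate", "channel-independent", or "multivariate"
    overlap_audited: bool   # whether a leakage/overlap audit was reported


def protocol_mismatches(a, b):
    """Names of the protocol fields on which two records differ, in declaration order."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def comparable(a, b):
    """Treat two ranks as comparable only if protocols match and both report an overlap audit."""
    return not protocol_mismatches(a, b) and a.overlap_audited and b.overlap_audited


# Hypothetical records for two papers reporting on the same benchmark name.
paper_a = ProtocolRecord(
    task="forecasting", adaptation="zero-shot", metric="MASE",
    context_length=512, horizon=64, uses_covariates=False,
    uses_text_context=False, channel_mode="univariate", overlap_audited=True,
)
paper_b = replace(paper_a, adaptation="fine-tuned", horizon=96)

print(protocol_mismatches(paper_a, paper_b))  # ['adaptation', 'horizon']
print(comparable(paper_a, paper_b))           # False
```

Requiring an exact match on every field is deliberately strict. A wiki table could relax it per field, but the list of mismatches should still be recorded next to any cross-paper rank.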

Evidence

The source pages already show why this page is needed. Toto 2.0 mixes base, fine-tuned, and ensemble leaderboard entries, and the Toto 2.0 TSALM talk adds ARFBench, where VLMs, hybrid Toto/VLM systems, humans, and oracle combinations are compared on incident-response QA. TiRex, FlowState, and Chronos-2 discuss overlap or stricter zero-shot settings. Moirai 2.0 reports a smaller model outperforming larger variants under one aggregate. U-Cast argues that low-channel forecasting benchmarks under-test native multivariate dependency modeling. Context is Key changes the input contract by making natural-language context essential. TelecomTS changes the target from forecasting alone to observability diagnosis and time-series/text QA. Titans Revisited shows that test-time memory introduces separate protocol axes for chunking, memory-update cost, baseline strength, and whether the report measures a full architecture or an isolated memory component. TabPFN-v2, TabPFN-3, and TabICL operate mainly on static tabular tasks, while TabPFN-TS-3, TempoPFN, MantisV2, and UniShape each change the task or adaptation mode again.

Aionoscope adds a local diagnostic example: one benchmark can deliberately separate categorical component accessibility from dense process-state accessibility, and the paper’s own framing prevents treating the public validation snapshot as a stable leaderboard. LeNEPA adds the matching method-side protocol example: its PTB-XL/Diag comparison is a fixed-recipe stress test and should not be rewritten as a claim that no possible Diag-tuned JEPA recipe could match it.

S4L adds an older cross-domain baseline warning. Its semi-supervised ImageNet gains are only interpretable because the paper reports strong supervised-only baseline sweeps, validation-size sensitivity, and model-selection details. Time-series classification and SSL benchmarks should carry the same burden before treating unlabeled-data objectives as true representation gains.

World Models adds the learned-simulator version of the same problem. Virtual-environment reward, real-environment transfer, and exploitability/uncertainty settings must be reported separately because a controller can exploit an imperfect learned dynamics model.

stable-worldmodel adds the modern JEPA/world-model implementation version of the problem. Common solvers, trajectory storage, and factor-of-variation sweeps make robustness claims easier to audit, but they also show why one aggregate score is too small: in-distribution planning, OOD planning, prediction error, latency, and solver settings are distinct protocol axes.
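
One way to check whether prediction error actually distinguishes successful from failed plans is a rank-based separation score computed per evaluation regime. The sketch below is a plain AUROC-style statistic on invented per-episode errors. The function name and the numbers are hypothetical, and this is not claimed to be the analysis that stable-worldmodel performs.

```python
def separation_auroc(errors_failed, errors_succeeded):
    """Probability that a random failed plan has higher prediction error than a
    random successful plan, counting ties as half. 0.5 means no separation."""
    if not errors_failed or not errors_succeeded:
        raise ValueError("both groups need at least one episode")
    wins = 0.0
    for failed in errors_failed:
        for succeeded in errors_succeeded:
            if failed > succeeded:
                wins += 1.0
            elif failed == succeeded:
                wins += 0.5
    return wins / (len(errors_failed) * len(errors_succeeded))


# Hypothetical per-episode prediction errors, grouped by planning outcome.
in_dist = separation_auroc(
    errors_failed=[0.30, 0.35, 0.40],
    errors_succeeded=[0.10, 0.12, 0.15],
)
shifted = separation_auroc(
    errors_failed=[0.30, 0.50, 0.35],
    errors_succeeded=[0.20, 0.40, 0.60],
)
print(f"in-distribution={in_dist:.2f} shifted={shifted:.2f}")
```

A score near 1.0 says that prediction error tracks planning outcome in that regime, and a score near 0.5 says the two groups overlap. Reporting this per regime alongside success rate, latency, and solver settings keeps those axes from being folded into one aggregate.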

AdaJEPA adds the adaptation-mode variant of the same problem. A report can improve closed-loop success by changing the model during evaluation, so the benchmark record must say whether the comparison is frozen, within-episode adapted, persistently fine-tuned, or trained with extra target-domain data.

World Model for Robot Learning Survey adds the embodied-robotics taxonomy version. A future-video score can be useful for world-model work, but it should not be treated as control evidence unless the protocol also tests action responsiveness, closed-loop decision utility, or executability.

On Training in Imagination adds the reward-economics version. If a benchmark or paper says imagined training works, the wiki should ask whether the dynamics model and reward model were learned from different data streams, whether the reward labels were cheap/noisy or expensive/clean, and whether any reward error was zero-mean noise or systematic bias.
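
A toy two-action example makes the noise-versus-bias distinction concrete. It is a simplified bandit-style illustration under stated assumptions, not the analysis from On Training in Imagination: the action names, reward values, noise level, and bias size are all invented.

```python
import random


def estimated_values(true_rewards, noise_std, bias, n_samples, rng):
    """Average noisy reward labels per action.

    Each label is true reward + per-action systematic bias + zero-mean Gaussian noise.
    """
    estimates = {}
    for action, reward in true_rewards.items():
        total = 0.0
        for _ in range(n_samples):
            total += reward + bias.get(action, 0.0) + rng.gauss(0.0, noise_std)
        estimates[action] = total / n_samples
    return estimates


def best_action(estimates):
    """Action with the highest estimated value."""
    return max(estimates, key=estimates.get)


# Hypothetical ground truth: action "a" is better than action "b".
true_rewards = {"a": 1.0, "b": 0.8}
rng = random.Random(0)

# Zero-mean label noise: large per-sample error, but it averages out.
noisy = estimated_values(true_rewards, noise_std=1.0, bias={}, n_samples=20000, rng=rng)

# Systematic bias favoring "b": no amount of averaging removes it.
biased = estimated_values(true_rewards, noise_std=1.0, bias={"b": 0.5}, n_samples=20000, rng=rng)

print("noise only:", best_action(noisy))
print("with bias:", best_action(biased))
```

With only zero-mean noise, more labels recover the true ordering of actions. With a systematic bias, the learned ordering converges to the wrong answer. This is why a reward-label audit should report the two error types separately instead of one aggregate reward-model error.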

Awesome Agentic Time Series adds the agentic-time-series benchmark map. It makes clear that forecasting benchmarks, reasoning/QA benchmarks, engineering-agent tasks, future-prediction tasks, and decision-centric benchmarks stress different parts of a system. The wiki should therefore avoid rolling them into one “agentic time-series” leaderboard unless the environment, action space, feedback, tool budget, memory policy, and evaluation target match.

The recently added Grid2Op sources are a concrete example of this hygiene rule. RL2Grid is a single-agent benchmark preprint whose ICLR 2025 OpenReview submission was withdrawn; MARL2Grid-TR is an accepted ICLR 2026 multi-agent benchmark; AI challenge for safe and low carbon power grid operation is a challenge-analysis paper; soft-label topology actions is an imitation-learning action-ranker; Gibbs-prior topology control is a preprint with a one-step overload-risk surrogate; runtime safety shielding is simulator-backed online filtering; LLM-guided safe RL is training-time transition refinement; interpretable policy distillation is controller compression/auditability evidence; GNN transmission-grid topology is representation/OOD-topology hygiene; and targeted exploration is line-switching/cascading-failure evidence with survival-time, action-diversity, and fixed-budget training metrics. Older power-system world-model/shield papers such as WMAP should stay separate from Grid2Op topology-control benchmarks because FACTS setpoint control, topology reconfiguration, and simulator-backed action ranking are different protocols. None of these sources should be collapsed into one Grid2Op SOTA rank without matching environment, action space, simulator budget, graph representation, closed-loop protocol, and safety protocol.

Relation To Foundation TSFM Agenda

This page is the benchmark-hygiene slot for the Foundation Time-Series Model Research Agenda. It should be used to decide what a reported score actually tests: observation forecasting, state prediction, context sensitivity, rare-regime preservation, native multivariate scaling, generation fidelity, reasoning, or action-conditioned rollout. The agenda verdict for benchmark claims should stay slot-level; one aggregate rank cannot close the whole foundation-model path.
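
One way to keep verdicts slot-level is to record them per slot and leave untested slots explicitly empty. The following is a minimal sketch: the slot names and the three verdict levels come from this page, while the `Verdict` enum, `slot_report`, and `untested_slots` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Verdict(Enum):
    CLOSES = "closes"
    PARTIAL = "partially closes"
    TOUCHES = "merely touches"


# Slot names as listed in the paragraph above.
SLOTS = (
    "observation forecasting",
    "state prediction",
    "context sensitivity",
    "rare-regime preservation",
    "native multivariate scaling",
    "generation fidelity",
    "reasoning",
    "action-conditioned rollout",
)


def slot_report(claims):
    """Expand a {slot: Verdict} mapping to all slots; untested slots map to None."""
    unknown = set(claims) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return {slot: claims.get(slot) for slot in SLOTS}


def untested_slots(claims):
    """List the slots a benchmark claim says nothing about."""
    return [slot for slot, verdict in slot_report(claims).items() if verdict is None]


if __name__ == "__main__":
    claims = {"observation forecasting": Verdict.PARTIAL}
    for slot, verdict in slot_report(claims).items():
        print(f"{slot}: {verdict.value if verdict else 'not tested'}")
```

The report deliberately has no aggregate field, so a strong score on one slot cannot be read as a verdict on the others.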

Open Questions

What minimum overparametrization audit should a TSFM report: train fit, held-out loss, effective rank or dimensionality, compression protocol, structured-versus-randomized controls, and rare-state retention?
Should the wiki maintain a normalized benchmark table only for results that share protocol and metrics?
How should private training corpora or private observability benchmarks be weighted relative to fully reproducible public benchmarks?
Should HDTSF benchmarks report channel-count, correlation, hierarchy, memory, and training-time axes next to forecasting error?
Should context-aided benchmarks report context ablations, corrupted-context controls, and region-of-interest metrics by default?
What is the right benchmark for action-conditioned time-series world models where interventions, not only forecasts, matter?
What is the minimum reproducible protocol bundle for imagined-rollout training: source trajectories, action schema, reward-label provenance, reward-noise/bias audit, solver config, factor-of-variation settings, distribution-shift split, failed-action receipts, and latency budget?
What minimum memorization-probe bundle should generated time-series benchmarks report: checkpoint/update count, dataset size, nearest-neighbor or subsequence matches, duplicate checks, membership inference, and downstream leakage?
What minimum hidden-effect audit should target-optimized synthetic or curriculum data report: representation drift, rare-state retention, normal-behavior retention, metric overfitting, and poisoning-like side effects?
What minimum no-filter audit should a data filter report: no-filter and loose-filter baselines, dedup-only baseline, compute scale, model size, tail-slice coverage, regime/tenant/device representation probes analogous to geo-localization, and corrupt-window rejection?
What minimum reliability protocol should an agentic time-series benchmark require across perception, reasoning, tool calls, memory updates, actions, feedback, grounding, replay, auditability, safety, and cost before a system is called deployable?
What minimum protocol should within-episode test-time adaptation report: update target, buffer, step count, reset policy, latency, adaptation-failure cases, safety shield, and frozen/retrained baselines?
What minimum scarce-label protocol should classification and SSL benchmarks report: supervised-only hypersweeps, validation-set size, model-selection budget, downstream classifier or fine-tuning mode, and leakage controls?
What minimum latent-state accessibility protocol should report categorical versus dense targets, layer-selection budget, probe capacity, public versus hidden streams, native-length confounds, and real-task transfer checks?
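
As a concrete reference point for the memorization-probe question above, a nearest-neighbor subsequence check could look like the sketch below. It is an illustration under stated assumptions, not a proposed standard: the function names, the raw Euclidean distance, and the fixed threshold are all hypothetical choices, and a real probe would likely need normalization and a faster search.

```python
import math


def sliding_windows(series, length):
    """All contiguous subsequences of `series` with the given length."""
    return [series[i:i + length] for i in range(len(series) - length + 1)]


def nearest_training_distance(window, training_series):
    """Smallest Euclidean distance from `window` to any same-length training subsequence."""
    best = math.inf
    for series in training_series:
        for candidate in sliding_windows(series, len(window)):
            best = min(best, math.dist(window, candidate))
    return best


def near_copy_rate(generated_windows, training_series, threshold):
    """Fraction of generated windows within `threshold` of some training subsequence."""
    if not generated_windows:
        raise ValueError("no generated windows to probe")
    flags = [
        nearest_training_distance(window, training_series) <= threshold
        for window in generated_windows
    ]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    training = [[0.0, 1.0, 2.0, 3.0, 4.0]]
    generated = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
    print(near_copy_rate(generated, training, threshold=0.1))
```

Reporting the rate alongside checkpoint or update count and dataset size would show whether near-copies grow with training, which a single number cannot.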
