Time-Series Benchmark Hygiene
Summary
Time-series foundation model rankings are brittle unless task, protocol, context length, horizon, covariates, leakage controls, and adaptation mode match. Future source pages should link here instead of repeating the same benchmark caveats.
The latent-state wiki frame adds one more hygiene rule: average forecast error is not enough to establish system understanding. Benchmarks should say whether they test observation forecasting, representation quality, context use, rare-event sensitivity, state maintenance, or action-conditioned rollout.
Foundation Time-Series Model Research Agenda is the central rubric for this page: benchmark claims should identify which agenda slot they test and whether they close, partially close, or merely touch the path toward a foundation time-series model.
Required Separations
- Zero-shot base-model results should be separated from few-shot adaptation, linear probing, full fine-tuning, and dataset-specific training.
- Fine-tuned and ensemble entries should be separated from base released checkpoints. Toto 2.0, for example, reports base models, a fine-tuned 2.5B variant, and a family-and-friends ensemble.
- Frozen feature extraction for classification should be separated from label-free zero-shot prediction. Many classification “zero-shot” results still train a Random Forest, SVM, logistic-regression head, or similar downstream classifier on target labels.
- Single-model results should be separated from representation fusion, such as MantisV2 plus TiViT features.
- Point forecasting, quantile forecasting, probabilistic forecasting, imputation, anomaly detection, classification, and reasoning benchmarks should not be ranked as if they measure one ability.
- Observation forecasting, latent-state prediction, representation learning, and action-conditioned world modeling should not be collapsed into one claim.
- One-step prediction, decision-usable multi-step rollout, closed-loop transfer, and evidence-driven model revision should be separated. Agentic World Modeling names these as L1 Predictor, L2 Simulator, and L3 Evolver boundaries.
- Learned-simulator scores should be separated from real-environment or live-stand transfer. World Models is the historical hygiene case: a controller can maximize reward inside an imperfect learned dynamics model while failing after transfer.
- Action-conditioned world-model reports should separate prediction error, planning success, solver budget, per-step latency, distribution-shift regime, and factor-of-variation settings. stable-worldmodel is the current local case: its Push-T analyses show that prediction error can overlap heavily between successful and failed plans under distribution shift.
- Embodied world-model evaluations should separate open-loop action-conditioned prediction, closed-loop task utility or policy evaluation, and physical consistency, controllability, and executability diagnostics. World Model for Robot Learning Survey is the local robotics survey map for that split.
- Training-in-imagination reports should separate dynamics-transition error, reward-model error, reward annotation cost, reward noise, and reward bias. On Training in Imagination is the current local source for why zero-mean reward noise and systematic reward bias have different policy-gradient consequences.
- Upstream dynamic-compute results, such as EBT language/video/image scaling, should be separated from numeric time-series forecasting, generation, and control evidence unless the benchmark directly tests those interfaces.
- Test-time-memory results should separate memory-only modules, full architectures, chunk size, update cost, baseline strength, and adaptation mode. Titans Revisited is the current local check for this failure mode.
- Segment-level recurrent-memory results should separate synthetic recall/rewrite tasks, long-context QA, language-modeling transfer, segment size, sequential update cost, and direct numeric or action-conditioned evidence.
- Sleep-time consolidation results should separate memory capacity, consolidation compute, wake-time prediction latency, training sequentiality, and whether the evidence is synthetic/language or numeric/action-conditioned.
- Architecture papers with follow-up benchmark narratives should separate paper results, official but unreleased claims, and public-code reproducibility. Dragon Hatchling’s Sudoku result is official Pathway narrative, not currently open-reproduced from the public BDH repository.
- Generation benchmarks should separate pointwise fidelity metrics, retrieval or rank metrics, text-condition ablations, and downstream utility. T2S’s WAPE, MSE, and MRR@10 results show text-conditioned fidelity, not forecasting utility or world-model readiness.
- Training-system claims should separate memory footprint, wall-clock throughput, total training compute, from-scratch versus pretrained-conversion settings, and objective-specific proxy metrics. DiffusionBlocks is the current local case for this separation.
- Upstream language diffusion metrics such as generative perplexity, entropy, BLEU, and ROUGE should stay separate from numeric time-series fidelity, calibration, generation utility, and control evidence. ELF is the current local case.
- KV-cache or retrieval-memory compression results should separate byte savings, quality recovery, hardware-native compute support, dequantization cost, latency, throughput, memory-pressure regime, and target-task quality. TurboQuant is the current local case for this separation.
- Sharpness, flatness, and edge-of-stability claims should state optimizer, loss, batch regime, and whether the reported quantity is full-batch Hessian sharpness or batch sharpness.
- Incident-response QA charts such as ARFBench should not be treated as forecasting leaderboards. They may mix frontier VLMs, LLMs, post-trained or hybrid time-series/VLM systems, domain experts, and oracle combinations.
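One way to make these separations mechanical is to tag each reported result with its task and adaptation mode and rank only within matching groups. The sketch below is a minimal illustration, not a wiki-defined schema: the `ResultEntry` fields, the adaptation-mode labels, the model names, and the scores are all hypothetical, and it assumes a lower-is-better metric.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for illustration; the wiki does not define a fixed vocabulary.
ADAPTATION_MODES = {"zero_shot", "few_shot", "linear_probe", "fine_tuned", "ensemble"}


@dataclass(frozen=True)
class ResultEntry:
    model: str
    task: str          # e.g. "point_forecasting", "classification"
    adaptation: str    # one of ADAPTATION_MODES
    metric: str        # e.g. "MASE"; lower is assumed better in this sketch
    score: float


def rank_within_protocol(entries):
    """Rank entries only against others sharing task, adaptation mode, and metric."""
    groups = defaultdict(list)
    for entry in entries:
        if entry.adaptation not in ADAPTATION_MODES:
            raise ValueError(f"unknown adaptation mode: {entry.adaptation!r}")
        groups[(entry.task, entry.adaptation, entry.metric)].append(entry)
    return {
        key: [e.model for e in sorted(group, key=lambda e: e.score)]
        for key, group in groups.items()
    }


# Made-up scores, purely to exercise the function.
entries = [
    ResultEntry("model_a_base", "point_forecasting", "zero_shot", "MASE", 0.82),
    ResultEntry("model_b_base", "point_forecasting", "zero_shot", "MASE", 0.79),
    ResultEntry("model_a_ft", "point_forecasting", "fine_tuned", "MASE", 0.71),
]
rankings = rank_within_protocol(entries)
```

Here the fine-tuned entry never appears in the zero-shot ranking, even though its score is numerically lowest. A fuller version would add further grouping keys for the other axes listed above, such as single-model versus fused representations.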
Benchmark Families
Forecasting benchmark names in this wiki include BOOM, Context is Key, GIFT-Eval, TIME, Chronos-ZS, fev-bench, Monash, LSF/LTSF, Time-Series-Library, Informer-style ETT tasks, Darts, Chronos Benchmark II, and Time-HD. These forecasting benchmarks differ in horizon, frequency, metric, target/covariate interface, text/context interface, channel count, channel-dependency structure, and leakage policy. Multi-task observability datasets such as TelecomTS and incident-response QA benchmarks such as ARFBench should be compared on their own task axes rather than forced into a forecasting-only rank.
Dataset Anchors
| Dataset | Use It For | Main Caveat |
|---|---|---|
| Context is Key | Text-conditioned probabilistic forecasting where context is essential. | Univariate and text-only; benchmark-first with no training split; numeric-only forecasters see an intentionally incomplete interface. |
| TelecomTS | 5G observability anomaly detection, root-cause analysis, forecasting, and time-series/text Q&A. | Lab/testbed data with 18 channels; synthetic anomalies and generated tickets need artifact checks; no operator action channel. |
| Time-HD | High-dimensional time-series forecasting (HDTSF) at the thousand-channel scale. | Passive forecasting only; sparse public dataset card. |
| BOOM | Observability forecasting at the grouped-query, high-cardinality scale. | No operator actions/interventions; Datadog pre-production source. |
| GIFT-Eval | Broad general-purpose TSFM evaluation and leaderboard comparisons. | Dataset-count/version summaries differ; not observability-specific. |
| TIME | Strict zero-shot and contamination-resistant claims. | Not primarily HDTSF or observability. |
| Time-Series-Library | Legacy LSF/LTSF continuity, especially ETT/Electricity/Weather comparisons. | Low-dimensional and often saturated. |
Classification and representation-learning benchmarks include UCR and UEA. They test labeled shape or sequence discrimination, not direct future-value forecasting.
Static tabular benchmarks such as TALENT or small-data TabPFN-style suites are adjacent but should be kept separate from time-series evaluations because rows are not temporal histories by default.
TabPFN-3 makes this separation especially important. Its technical report includes static TabArena results, API/enterprise TabPFN-3-Plus and Thinking entries, and a specialized TabPFN-TS-3 time-series checkpoint; those are not one benchmark protocol.
Leakage And Overlap Risks
Broad pretraining corpora can include public datasets, training splits, or near-duplicates that later appear in benchmark reports. Chronos-2 explicitly discusses GIFT-Eval overlap and reports a synthetic-only ablation for stricter zero-shot evidence. TiRex flags overlap risks for some baselines and reports an appendix update intended to remove full GIFT-Eval overlap. FlowState makes data-overlap handling part of its GIFT-style zero-shot claim and separately evaluates unseen sampling rates. Toto 2.0 says public forecasting datasets were excluded from pretraining, making leakage control part of its claim. Sundial is another protocol-sensitive case: it reports excluded evaluation datasets for TimeBench, uses different sampling counts for FEV and GIFT-Eval, and separately reports zero-shot and once-fine-tuned FEV settings.
Synthetic data does not eliminate benchmark risk by itself. Template families can resemble benchmark dynamics, and synthetic generators can encode artifacts that make downstream ranks look stronger than real transfer would be.
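A first-pass overlap audit can be as simple as comparing declared dataset names after normalizing spelling. The sketch below assumes both lists are available as plain strings; the function names and the dataset lists are hypothetical. It only catches declared, name-level overlap, so an empty result is not evidence that near-duplicates, renamed series, or benchmark-like synthetic templates are absent.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'GIFT-Eval' and 'GiftEval' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_overlap(pretraining_datasets, benchmark_datasets):
    """Return benchmark dataset names whose normalized form appears in the corpus list."""
    corpus = {normalize(n) for n in pretraining_datasets}
    return sorted(n for n in benchmark_datasets if normalize(n) in corpus)


# Hypothetical dataset lists for illustration only.
pretraining = ["ETT-h1", "Electricity", "synthetic_kernels_v1"]
benchmark = ["ETTh1", "Weather", "electricity"]
flagged = name_overlap(pretraining, benchmark)
```

In this example `flagged` contains `ETTh1` and `electricity`, which differ from the corpus entries only in case and punctuation. Content-level checks, such as hashing series windows, would be needed to go beyond names.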
Cross-Paper Comparison Checklist
Before comparing two reported ranks, check:
- Is the task forecasting, classification, imputation, anomaly detection, reasoning, or generation?
- Is the model zero-shot, few-shot, linear-probed, fine-tuned, or ensembled?
- Are context length, horizon, patch size, frequency, and rollout method comparable?
- Are known future exogenous variables, covariates, or grouped series available to both models?
- Is textual context essential to the correct forecast, merely descriptive metadata, or unavailable?
- Are metrics aligned, such as MASE, CRPS, weighted quantile loss, MSE, MAE, WAPE, MRR@10, accuracy, macro-F1, or rank?
- For generation tasks, is the report measuring text-conditioned fidelity, retrieval/ranking, sample diversity, or downstream task utility?
- For observability datasets, is the target forecasting, anomaly presence, anomaly interval, root cause, or language QA?
- Does either pretraining corpus include benchmark train or test data, private telemetry, or unreleased synthetic generators?
- Is the model univariate, channel-independent, or native multivariate?
- Is the benchmark channel count high enough to test high-dimensional multivariate behavior rather than ordinary low-channel forecasting?
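The checklist above can be sketched as a field-by-field comparison of two reported protocols. This is a minimal illustration under stated assumptions: the `Protocol` fields cover only part of the checklist, and the field names, allowed values, and example reports are hypothetical rather than taken from any paper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Protocol:
    task: str
    adaptation: str
    context_length: int
    horizon: int
    metric: str
    uses_covariates: bool
    text_context: Optional[str]   # "essential", "metadata", or None if unavailable
    channel_handling: str         # "univariate", "channel_independent", "multivariate"


def protocol_mismatches(a: Protocol, b: Protocol):
    """List the field names on which two reported protocols disagree."""
    return [f.name for f in fields(Protocol) if getattr(a, f.name) != getattr(b, f.name)]


# Hypothetical reports, purely to exercise the comparison.
report_a = Protocol("forecasting", "zero_shot", 512, 96, "MASE", False, None, "univariate")
report_b = Protocol("forecasting", "fine_tuned", 2048, 96, "MASE", False, None, "multivariate")
mismatches = protocol_mismatches(report_a, report_b)
```

A non-empty `mismatches` list means the two ranks should not be compared directly without a caveat. Here the reports share task, horizon, and metric but differ in adaptation mode, context length, and channel handling.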
Evidence
The source pages already show why this page is needed:
- Toto 2.0 mixes base, fine-tuned, and ensemble leaderboard entries.
- The Toto 2.0 TSALM talk adds ARFBench, where VLMs, hybrid Toto/VLM systems, humans, and oracle combinations are compared on incident-response QA.
- TiRex, FlowState, and Chronos-2 discuss overlap or stricter zero-shot settings.
- Moirai 2.0 reports a smaller model outperforming larger variants under one aggregate.
- U-Cast argues that low-channel forecasting benchmarks under-test native multivariate dependency modeling.
- Context is Key changes the input contract by making natural-language context essential.
- TelecomTS changes the target from forecasting alone to observability diagnosis and time-series/text QA.
- Titans Revisited shows that test-time memory introduces separate protocol axes for chunking, memory-update cost, baseline strength, and whether the report measures a full architecture or an isolated memory component.
- TabPFN-v2, TabPFN-3, and TabICL operate mainly on static tabular tasks.
- TabPFN-TS-3, TempoPFN, MantisV2, and UniShape each change the task or adaptation mode again.
World Models adds the learned-simulator version of the same problem. Virtual-environment reward, real-environment transfer, and exploitability/uncertainty settings must be reported separately because a controller can exploit an imperfect learned dynamics model.
stable-worldmodel adds the modern JEPA/world-model implementation version of the problem. Common solvers, trajectory storage, and factor-of-variation sweeps make robustness claims easier to audit, but they also show why one aggregate score is too small: in-distribution planning, OOD planning, prediction error, latency, and solver settings are distinct protocol axes.
World Model for Robot Learning Survey adds the embodied-robotics taxonomy version. A future-video score can be useful for world-model work, but it should not be treated as control evidence unless the protocol also tests action responsiveness, closed-loop decision utility, or executability.
On Training in Imagination adds the reward-economics version. If a benchmark or paper says imagined training works, the wiki should ask whether the dynamics model and reward model were learned from different data streams, whether the reward labels were cheap/noisy or expensive/clean, and whether any reward error was zero-mean noise or systematic bias.
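The difference between zero-mean reward noise and systematic reward bias can be illustrated with a toy two-armed bandit and a softmax policy. This is a sketch under stated assumptions, not the analysis from On Training in Imagination: the rewards, the noise level, and the action-dependent bias are made-up values. A bias that is constant across actions would behave like a baseline and leave the exact gradient unchanged, so the example uses a bias that favors one arm.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Exact policy gradient of expected reward for a softmax bandit policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]


def sampled_gradient(logits, rewards, noise_std, n_samples, seed=0):
    """REINFORCE estimate using rewards corrupted by zero-mean Gaussian noise."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        noisy_reward = rewards[action] + rng.gauss(0.0, noise_std)
        for k in range(len(grad)):
            indicator = 1.0 if k == action else 0.0
            grad[k] += (indicator - probs[k]) * noisy_reward
    return [g / n_samples for g in grad]


logits = [0.0, 0.0]                 # uniform starting policy
true_rewards = [1.0, 0.8]           # arm 0 is truly better
bias = [0.0, 0.5]                   # hypothetical action-dependent reward-model bias
biased_rewards = [r + b for r, b in zip(true_rewards, bias)]

grad_true = exact_gradient(logits, true_rewards)
grad_noisy = sampled_gradient(logits, true_rewards, noise_std=1.0, n_samples=20000)
grad_biased = exact_gradient(logits, biased_rewards)
```

With these values the noisy estimate stays close to the true gradient and still pushes probability toward arm 0, at the cost of extra variance. The biased reward model flips the sign and pushes the policy toward the worse arm, and no amount of additional sampling corrects that.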
Relation To Foundation TSFM Agenda
This page is the benchmark-hygiene slot for the Foundation Time-Series Model Research Agenda. It should be used to decide what a reported score actually tests: observation forecasting, state prediction, context sensitivity, rare-regime preservation, native multivariate scaling, generation fidelity, reasoning, or action-conditioned rollout. The agenda verdict for benchmark claims should stay slot-level; one aggregate rank cannot close the whole foundation-model path.
Open Questions
- Should the wiki maintain a normalized benchmark table only for results that share protocol and metrics?
- How should private training corpora or private observability benchmarks be weighted relative to fully reproducible public benchmarks?
- Should HDTSF benchmarks report channel-count, correlation, hierarchy, memory, and training-time axes next to forecasting error?
- Should context-aided benchmarks report context ablations, corrupted-context controls, and region-of-interest metrics by default?
- What is the right benchmark for action-conditioned time-series world models where interventions, not only forecasts, matter?
- What is the minimum reproducible protocol bundle for imagined-rollout training: source trajectories, action schema, reward-label provenance, reward-noise/bias audit, solver config, factor-of-variation settings, distribution-shift split, failed-action receipts, and latency budget?
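To make the last question concrete, the candidate bundle items can be written as a completeness check over a run manifest. This sketch does not settle what the minimum bundle should be: the field names simply mirror the items listed in the question, and the manifest keys, paths, and values are hypothetical placeholders.

```python
# Field names mirror the candidate bundle items; they are not an agreed schema.
REQUIRED_BUNDLE_FIELDS = (
    "source_trajectories",
    "action_schema",
    "reward_label_provenance",
    "reward_noise_bias_audit",
    "solver_config",
    "factor_of_variation_settings",
    "distribution_shift_split",
    "failed_action_receipts",
    "latency_budget",
)


def missing_bundle_fields(manifest: dict):
    """Return required fields that are absent or empty in a run manifest."""
    return [name for name in REQUIRED_BUNDLE_FIELDS if not manifest.get(name)]


# Hypothetical manifest; paths and values are placeholders.
manifest = {
    "source_trajectories": "data/trajectories.parquet",
    "action_schema": {"dim": 2, "type": "continuous"},
    "solver_config": {"name": "cem", "iterations": 10},
    "latency_budget": "50ms per step",
}
missing = missing_bundle_fields(manifest)
```

For this manifest the check reports the reward-provenance, reward-audit, factor-of-variation, distribution-shift, and failed-action items as missing.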
Related Pages
- Time-Series Foundation Models
- Foundation Time-Series Model Research Agenda
- Latent-State Time-Series Modeling
- Context-Aided Forecasting
- Time-Series Scaling And Efficiency
- High-Dimensional Time Series Forecasting
- Time-Series Classification Foundation Models
- Synthetic Data For Time Series
- Tabular Foundation Models