This Time is Different: An Observability Perspective on Time Series Foundation Models
Source
- Raw Markdown: paper_toto-2025.md
- PDF: paper_toto-2025.pdf
- Preprint: arXiv 2505.14766
- Official source: DataDog/toto
- Official checkpoint: Datadog/Toto-Open-Base-1.0
- Official BOOM dataset: https://huggingface.co/datasets/Datadog/BOOM
- Dataset pages: BOOM, GIFT-Eval, Time-Series-Library
Core Claim
Toto argues that observability metrics are a distinct and demanding multivariate time-series forecasting domain, and that a decoder-only foundation model trained on observability, public, and synthetic time series can outperform general-purpose time-series foundation models on both observability and standard forecasting benchmarks.
Key Contributions
- Introduces Toto, a 151M-parameter open-weights time-series foundation model for zero-shot probabilistic forecasting of observability metrics.
- Adds architecture choices aimed at high-cardinality, nonstationary multivariate time series: patch-based causal instance normalization, proportional factorized time-variate attention, a Student-T mixture prediction head, and a robust composite training loss.
- Builds a pretraining mixture of about 2.36T time-series points, including Datadog internal observability metrics, public datasets, and synthetic data.
- Introduces BOOM, an observability forecasting benchmark with about 350M observations across 2,807 real-world multivariate time series.
- Reports strong results on BOOM, GIFT-Eval, and LSF, positioning observability data as both a target domain and a stress test for general time-series foundation models.
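The patch-based causal instance normalization listed above can be illustrated with a small sketch. It assumes that each patch is scaled with the mean and standard deviation of all points up to the end of that patch, so that no future values leak into the scaling. The function name `causal_patch_normalize` and the `eps` parameter are hypothetical, and the paper's exact formulation may differ.

```python
import math

def causal_patch_normalize(series, patch_size, eps=1e-8):
    """Scale each patch with statistics of the prefix ending at that patch.

    Assumption (not taken from the paper): running mean and variance are
    accumulated over every point seen so far, then applied to the current patch.
    """
    if len(series) % patch_size != 0:
        raise ValueError("series length must be a multiple of patch_size")
    normalized, stats = [], []
    count, total, total_sq = 0, 0.0, 0.0
    for start in range(0, len(series), patch_size):
        patch = series[start:start + patch_size]
        # Update running sums with this patch only; later patches are never read.
        count += len(patch)
        total += sum(patch)
        total_sq += sum(x * x for x in patch)
        mean = total / count
        var = max(total_sq / count - mean * mean, 0.0)
        std = math.sqrt(var) + eps
        normalized.extend((x - mean) / std for x in patch)
        stats.append((mean, std))
    return normalized, stats
```

Because the statistics for a patch depend only on its prefix, changing later values leaves the scaling of earlier patches untouched, which is the property a decoder-only forecaster needs.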
Benchmarked Model Entry
| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Toto-Open-Base-1.0 | Main released and benchmarked Toto checkpoint | 151M-parameter decoder-only probabilistic forecaster with patch size 64, native context length 4096, proportional factorized attention, and a Student-T mixture output head. The paper evaluates it zero-shot on BOOM, GIFT-Eval, and LSF, with fine-tuning experiments on LSF. | Datadog/Toto-Open-Base-1.0 |
Method Notes
Toto is a passive dynamics model for multivariate time series. It forecasts future observations from historical observations only; the paper does not model actions, control inputs, interventions, or counterfactual policy choices as first-class input channels.
Datadog later extends this line with Toto 2.0, which keeps the observability forecasting framing but turns the release into a scaled model family, adds contiguous patch masking, and makes broader claims on BOOM, GIFT-Eval, and TIME.
The model splits each series into non-overlapping patches over time, feeds each patched variate into a decoder-only Transformer stack, and interleaves mostly time-wise attention with a smaller amount of variate-wise attention. The paper frames this proportional factorized attention as a way to preserve cross-variate structure while keeping inference practical for high-cardinality observability metrics.
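The patching and attention layout described above can be sketched as follows. The patch size of 64 and the context length of 4096 come from the checkpoint description. The helper names and the layer count and ratio passed to `attention_schedule` are illustrative assumptions, not the paper's configuration.

```python
def split_into_patches(series, patch_size=64):
    """Cut one variate's history into non-overlapping patches along time."""
    if len(series) % patch_size != 0:
        raise ValueError("series length must be a multiple of patch_size")
    return [series[i:i + patch_size] for i in range(0, len(series), patch_size)]

def attention_schedule(num_layers, time_layers_per_variate_layer):
    """Label each layer 'time' or 'variate'.

    Hypothetical layout: every run of time-wise layers is followed by a
    single variate-wise layer, so time-wise attention dominates the stack.
    """
    block = time_layers_per_variate_layer + 1
    return ["variate" if (i + 1) % block == 0 else "time"
            for i in range(num_layers)]

# A 4096-point context with patch size 64 yields 64 patches per variate.
patches = split_into_patches(list(range(4096)), patch_size=64)
# Illustrative 6-layer stack with two time-wise layers per variate-wise layer.
schedule = attention_schedule(num_layers=6, time_layers_per_variate_layer=2)
```

Time-wise layers attend across patches within one variate, while variate-wise layers attend across variates at the same patch position. Keeping the second kind rare is what bounds the cost when the number of variates is large.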
The Student-T mixture output head makes Toto a probabilistic forecaster rather than only a point forecaster. The composite robust loss combines negative log likelihood with a robust point-prediction term to stabilize training on sparse, bursty, heavy-tailed observability data.
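A minimal sketch of how a Student-T mixture likelihood and a composite loss could fit together is given below. The source only says the loss combines negative log likelihood with a robust point-prediction term, so the Huber penalty, the use of the mixture mean as the point forecast, and the `point_weight` parameter are all stand-in assumptions rather than the paper's choices.

```python
import math

def student_t_logpdf(x, df, loc, scale):
    """Log density of a location-scale Student-T distribution."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

def mixture_nll(x, weights, dfs, locs, scales):
    """Negative log likelihood of x under a Student-T mixture (log-sum-exp)."""
    terms = [math.log(w) + student_t_logpdf(x, d, m, s)
             for w, d, m, s in zip(weights, dfs, locs, scales)]
    top = max(terms)
    return -(top + math.log(sum(math.exp(t - top) for t in terms)))

def huber(residual, delta=1.0):
    """Stand-in robust penalty (assumption): quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def composite_loss(x, weights, dfs, locs, scales, point_weight=0.5):
    """NLL plus a weighted robust penalty on the mixture-mean point forecast."""
    # The mixture mean is only defined when every component has df > 1.
    point = sum(w * m for w, m in zip(weights, locs))
    return (mixture_nll(x, weights, dfs, locs, scales)
            + point_weight * huber(x - point))
```

Low degrees of freedom give heavy tails, so a burst far from the component location is penalized much less than under a near-Gaussian component. That is the behavior the summary attributes to the head on sparse, bursty metrics.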
Evidence And Results
- BOOM: the paper reports Toto ahead of Moirai, TimesFM, Chronos-Bolt, Timer, Time-MoE, VisionTS, and naive baselines by normalized MASE, normalized CRPS, and average rank.
- GIFT-Eval: Toto is evaluated with the same inference settings as BOOM, using its native 4096-point context length.
- LSF: the paper reports both zero-shot and fine-tuned Toto results on ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather.
- Ablations identify causal scaling and the Student-T mixture head as especially important, with large NLL degradation when either is removed.
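For readers unfamiliar with the metrics named in the BOOM results, the sketch below shows the standard definitions of MASE and a sample-based CRPS estimate. The benchmark's normalization against baselines and its aggregation across series are not reproduced here, and the function names are hypothetical.

```python
def mase(history, actual, forecast, season=1):
    """Mean absolute scaled error.

    Forecast MAE divided by the in-sample MAE of a seasonal naive forecast.
    """
    naive_errors = [abs(history[i] - history[i - season])
                    for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant at this seasonality; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    spread_to_obs = sum(abs(s - observed) for s in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_obs - 0.5 * spread_within
```

With a single sample, the CRPS estimate reduces to the absolute error, which is why CRPS is often read as a probabilistic generalization of MAE.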
Limitations
- The paper is forecasting-centered; Toto is not presented as an action-conditioned world model for intervention or control reasoning.
- Observability training and benchmark data come from Datadog internal systems, so the paper offers scale and operational realism but gives external readers only partial visibility into how the private corpus was constructed.
- BOOM uses Datadog observability metrics and preprocessing choices, so transfer to other monitoring stacks should be checked empirically.
- The benchmark claims depend on the chosen probabilistic metrics, context lengths, and official inference procedures for comparison models.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Native multivariate encoding and high-channel scaling | partially closes | Proportional factorized attention mixes time-wise and variate-wise interactions for high-cardinality observability metrics. | Does not use topology, service metadata, or action/deployment context. |
| Multi-modal future distributions | partially closes | Student-T mixture head and probabilistic sampling model heavy-tailed, sparse, bursty futures. | Does not test multiple decision-relevant futures under candidate interventions. |
| Benchmarks: observability level | partially closes | BOOM evaluates real observability forecasting across 2,807 multivariate series and exposes metric-specific pathologies. | No operator actions, deployments, rollbacks, or autoscaling controls are included. |
| Control and counterfactuals | insufficient evidence | Forecasting is passive despite the observability target domain. | Needs action/control input logs and counterfactual incident rollout. |
Links Into The Wiki
- Foundation Time-Series Model Research Agenda
- Toto
- High-Dimensional Time Series Forecasting
- BOOM
- GIFT-Eval
- Time-Series-Library
- Time-Series Foundation Models
- Observability Time Series
- Time-Series Benchmark Hygiene
- TimesFM
- Moirai
- Chronos-2
- Time-MoE
- Toto 2.0
Open Questions
- How much of Toto’s advantage comes from observability-specific data scale versus the architecture changes?
- Can proportional factorized attention transfer cleanly to other high-cardinality multivariate time-series domains such as finance, telemetry, or physiology?
- What benchmark would test observability forecasting under operator actions, deployments, rollbacks, or autoscaling control inputs rather than passive metric prediction?