Scaling-laws for Large Time-series Models

Source

Core Claim

Large decoder-only time-series Transformers exhibit LLM-like power-law scaling with parameter count, dataset size, and training compute. It is one of the key papers that make time-series foundation models (TSFMs) look like a justified scaling program rather than a collection of benchmark tricks.

Key Contributions

  • Trains decoder-only forecasting Transformers across roughly five orders of magnitude in model size.
  • Builds a heterogeneous univariate time-series corpus with about 8 billion data points, 30,211,687 individual series, and 38 data sources.
  • Measures scaling behavior with MSE, CRPS, and log-likelihood rather than only a single point-forecast metric.
  • Finds that architectural shape choices, such as aspect ratio and number of heads, matter relatively little compared with scale across broad ranges.
  • Uses a Student-t distribution head to handle heavy-tailed temporal observations more stably than a simple Gaussian or MSE-only head.
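
To make the heavy-tail point in the last bullet concrete, here is a minimal sketch of the per-observation negative log-likelihood under a location-scale Student-t versus a Gaussian. It is an illustration under stated assumptions, not the paper's implementation: the function names, the fixed degrees of freedom `nu=3.0`, and the residual values are all hypothetical, and in a real model the location, scale, and degrees of freedom would be predicted by the network.

```python
import math


def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under Normal(mu, sigma)."""
    z = (x - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)


def student_t_nll(x, mu, sigma, nu):
    """Negative log-likelihood of x under a location-scale Student-t.

    mu: location, sigma: scale (> 0), nu: degrees of freedom (> 0).
    """
    z = (x - mu) / sigma
    # Log of the normalising constant of the Student-t density.
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    # Log of the kernel (1 + z^2 / nu) ** (-(nu + 1) / 2).
    log_kernel = -0.5 * (nu + 1.0) * math.log1p(z * z / nu)
    return -(log_norm + log_kernel)


# Hypothetical residuals: a typical one and an extreme outlier.
for residual in (1.0, 50.0):
    g = gaussian_nll(residual, 0.0, 1.0)
    t = student_t_nll(residual, 0.0, 1.0, nu=3.0)
    print(f"residual={residual:5.1f}  gaussian_nll={g:9.2f}  student_t_nll={t:6.2f}")
```

The Gaussian penalty grows quadratically in the residual while the Student-t penalty grows only logarithmically, so a single extreme observation contributes a much smaller loss (and gradient) under the Student-t head. That is one plausible reading of "more stably" above.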

Method Notes

The model family is a passive forecasting model. It predicts future numeric observations from historical observations and does not expose actions, control inputs, interventions, or counterfactual rollout channels.

The paper focuses on univariate time series. It explicitly leaves multivariate scaling laws, exogenous variables, and richer distribution heads for future work. That boundary matters: the result supports TSFM scale-up, but not yet native high-dimensional multivariate world models.

Evidence And Results

The strongest durable evidence is the power-law fit across parameters, data, and compute on in-sequence next-step test losses. The paper also shows that data scaling only becomes clear when dataset diversity is preserved while scaling the amount of data.
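
As a rough illustration of what a power-law fit involves, the sketch below fits the assumed form loss = a * x^(-alpha) by ordinary least squares in log-log space. Everything in it is an assumption for illustration: the function name `fit_power_law`, the pure power-law form with no irreducible-loss term, and the synthetic points, which are generated from a known law and are not results from the paper. It also assumes a strictly positive loss such as MSE or CRPS, since it takes logarithms.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(x).

    Returns (a, alpha) for the assumed form loss = a * x ** (-alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, made-up points (NOT results from the paper): parameter counts
# spanning five orders of magnitude, with losses generated from a known law.
param_counts = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
true_a, true_alpha = 5.0, 0.08
synthetic_losses = [true_a * n ** (-true_alpha) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, synthetic_losses)
print(f"fitted a={a_hat:.3f}, alpha={alpha_hat:.3f}")

# Evaluating the fit outside the fitted range is an extrapolation, not evidence.
predicted_at_1b = a_hat * 1e9 ** (-alpha_hat)
print(f"extrapolated loss at 1e9 parameters: {predicted_at_1b:.3f}")
```

The same routine applies with dataset size or training compute on the x-axis. A straight line in log-log space over several orders of magnitude is what "power-law behavior" means operationally.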

The paper’s argument is not “bigger always wins on every leaderboard.” It is narrower and more important: time-series forecasting appears to have predictable neural scaling behavior under broad, heterogeneous pretraining.

Alex Notes

Limitations

  • Univariate-only study.
  • Largest models are around 100M parameters, so behavior at billion-parameter scale remains an extrapolation.
  • Forecasting is evaluated mainly through next-step or in-sequence loss; long-horizon rollout scaling is not the central experiment.
  • Does not address text context, action conditioning, causal interventions, or native high-dimensional channel structure.
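
The gap between next-step loss and long-horizon rollout can be seen in a toy setting. The sketch below is not from the paper: it uses a hypothetical noise-free AR(1) series and a deliberately misspecified coefficient `phi_hat`, and the helper names are made up. It compares a teacher-forced next-step MSE with an autoregressive rollout MSE.

```python
def next_step_mse(series, phi_hat):
    """Teacher-forced loss: each prediction conditions on the TRUE previous value."""
    errors = [
        (series[t + 1] - phi_hat * series[t]) ** 2
        for t in range(len(series) - 1)
    ]
    return sum(errors) / len(errors)


def rollout_mse(series, phi_hat, context_len):
    """Autoregressive loss: after the context, predictions feed on themselves."""
    prediction = series[context_len - 1]
    errors = []
    for t in range(context_len, len(series)):
        prediction = phi_hat * prediction  # reuse the model's own output
        errors.append((series[t] - prediction) ** 2)
    return sum(errors) / len(errors)


# Toy noise-free AR(1) series x[t+1] = phi_true * x[t]; all values are illustrative.
phi_true, phi_hat = 0.99, 0.95
series = [1.0]
for _ in range(63):
    series.append(phi_true * series[-1])

print("next-step MSE:", next_step_mse(series, phi_hat))
print("rollout MSE:  ", rollout_mse(series, phi_hat, context_len=16))
```

In this toy case the same small coefficient error yields a much larger rollout loss, because errors compound across steps. A scaling law measured on next-step loss therefore does not by itself determine how long-horizon rollout quality scales.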

Foundation TSFM Relevance

Agenda slot | Verdict | Evidence | Missing pieces
Scaling substrate | partially closes | The raw paper shows power-law behavior across parameters, data, and compute on a heterogeneous 8B-point, 38-source corpus. | Largest models are about 100M parameters; billion-scale and long-rollout scaling remain extrapolations.
Data diversity and curriculum | partially closes | Data scaling only appears clearly when source diversity is preserved while adding more data. | Does not specify a curriculum for rare events, controls, or observability workloads.
Native multivariate encoding and high-channel scaling | insufficient evidence | The study is univariate passive forecasting. | Needs multivariate and high-channel scaling-law experiments.
Causal structure, counterfactuals, and control | insufficient evidence | The study does not include exogenous variables, actions, control inputs, or interventions. | Needs action-conditioned scaling-law experiments.

Open Questions

  • Do the same scaling exponents hold for multivariate time series with channel coupling?
  • How do scaling laws change when the model supports known future exogenous variables, actions, or interventions?
  • Can compact models such as Tiny Time Mixers keep their advantage once compared under explicit data/model/compute scaling curves?