Scaling-laws for Large Time-series Models

Source

Raw Markdown: paper_scaling-laws-large-time-series-models-2024.md
PDF: paper_scaling-laws-large-time-series-models-2024.pdf
Preprint: arXiv 2405.13867

Core Claim

Large decoder-only time-series Transformers exhibit LLM-like power-law scaling with parameter count, dataset size, and training compute. This is one of the key papers making time-series foundation models look like a justified scaling program rather than only a collection of benchmark tricks.
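As a rough illustration of what "power-law scaling" means operationally, the sketch below fits a pure power law, loss ≈ a * x^(-alpha), by least squares in log-log space. The functional form, the helper name `fit_power_law`, and all numbers are assumptions for illustration. The data is synthetic and is not taken from the paper, whose exact parameterization may differ.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(x).

    Returns (a, alpha) such that loss is approximately a * x ** (-alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic illustration only: these are NOT measurements from the paper.
param_counts = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]
true_a, true_alpha = 5.0, 0.1
synthetic_losses = [true_a * n ** (-true_alpha) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, synthetic_losses)
print(f"fitted a={a_hat:.3f}, alpha={alpha_hat:.3f}")
```

On a log-log plot a power law is a straight line, so a visibly straight trend across several orders of magnitude is the qualitative signature the Core Claim refers to. The same fit can be applied with dataset size or compute on the x-axis.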

Key Contributions

Trains decoder-only forecasting Transformers across roughly five orders of magnitude in model size.
Builds a heterogeneous univariate time-series corpus with about 8 billion data points, 30,211,687 individual series, and 38 data sources.
Measures scaling behavior with MSE, CRPS, and log-likelihood rather than only a single point-forecast metric.
Finds that architectural shape choices such as aspect ratio and number of heads matter relatively little compared with scale across broad ranges.
Uses a Student-t distribution head to handle heavy-tailed temporal observations more stably than a simple Gaussian or MSE-only head.
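To make the Student-t point concrete, here is a minimal sketch of the negative log-likelihood for a location-scale Student-t, next to the Gaussian equivalent. The parameter names (`loc`, `scale`, `df`) and the comparison values are assumptions for illustration. This note does not record how the paper parameterizes its output head.

```python
import math

def student_t_nll(y, loc, scale, df):
    """Negative log-likelihood of y under a location-scale Student-t."""
    z = (y - loc) / scale
    log_norm = (
        math.lgamma((df + 1.0) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    # log1p keeps the kernel accurate when z*z/df is small.
    log_kernel = -0.5 * (df + 1.0) * math.log1p(z * z / df)
    return -(log_norm + log_kernel)

def gaussian_nll(y, loc, scale):
    """Negative log-likelihood of y under a Gaussian, for comparison."""
    z = (y - loc) / scale
    return 0.5 * math.log(2.0 * math.pi) + math.log(scale) + 0.5 * z * z

# Hypothetical outlier ten scale units from the predicted location.
outlier = 10.0
t_loss = student_t_nll(outlier, loc=0.0, scale=1.0, df=3.0)
g_loss = gaussian_nll(outlier, loc=0.0, scale=1.0)
print(f"Student-t NLL: {t_loss:.2f}, Gaussian NLL: {g_loss:.2f}")
```

The Gaussian penalty grows quadratically in the residual, while the Student-t penalty grows only logarithmically, so a single extreme observation contributes a much smaller loss and gradient. That is one plausible reading of why the heavier-tailed head trains more stably.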

Method Notes

The model family is a passive forecasting model. It predicts future numeric observations from historical observations and does not expose actions, control inputs, interventions, or counterfactual rollout channels.

The paper focuses on univariate time series. It explicitly leaves multivariate scaling laws, exogenous variables, and richer distribution heads for future work. That boundary matters: the result supports TSFM scale-up, but not yet native high-dimensional multivariate world models.

Evidence And Results

The strongest durable evidence is the power-law fit across parameters, data, and compute on in-sequence next-step test losses. The paper also shows that data scaling only becomes clear when dataset diversity is preserved while scaling the amount of data.
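One way to picture "preserving diversity while scaling data" is a stratified subsample that keeps every source represented at each dataset size, rather than truncating the corpus and dropping whole sources. The sketch below is a hypothetical scheme, not the paper's procedure. The function name, the toy corpus, and the source names are all made up for illustration.

```python
import random

def stratified_subsample(series_by_source, fraction, seed=0):
    """Keep roughly `fraction` of the series from every source (at least one)."""
    rng = random.Random(seed)
    subset = {}
    for source, series in series_by_source.items():
        k = max(1, round(len(series) * fraction))
        subset[source] = rng.sample(series, k)
    return subset

# Hypothetical toy corpus; source names and sizes are invented.
toy_corpus = {
    "energy": [f"energy_{i}" for i in range(100)],
    "traffic": [f"traffic_{i}" for i in range(40)],
    "weather": [f"weather_{i}" for i in range(10)],
}

small = stratified_subsample(toy_corpus, fraction=0.1)
for source, series in small.items():
    print(source, len(series))
```

Under this scheme the smallest subset still spans all sources, so a data-scaling curve measures the effect of more data rather than the effect of adding new domains.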

The paper’s argument is not “bigger always wins on every leaderboard.” It is narrower and more important: time-series forecasting appears to have predictable neural scaling behavior under broad, heterogeneous pretraining.

Alex Notes

Alex flagged this paper alongside Scaling Law for Time Series Forecasting as evidence that TSFMs have a right to exist: both point to a scaling-law substrate analogous to that of LLMs.

Limitations

Univariate-only study.
Largest models are around 100M parameters, so any conclusion about billion-scale TSFMs is still an extrapolation.
Forecasting is evaluated mainly through next-step or in-sequence loss; long-horizon rollout scaling is not the central experiment.
Does not address text context, action conditioning, causal interventions, or native high-dimensional channel structure.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Scaling substrate	partially closes	The paper shows power-law behavior across parameters, data, and compute on a heterogeneous 8B-point, 38-source corpus.	Largest models are about 100M parameters; billion-scale and long-rollout scaling remain extrapolations.
Data diversity and curriculum	partially closes	Data scaling only appears clearly when source diversity is preserved while adding more data.	Does not specify a curriculum for rare events, controls, or observability workloads.
Native multivariate encoding and high-channel scaling	insufficient evidence	The study is univariate passive forecasting.	Needs multivariate and high-channel scaling-law experiments.
Causal structure, counterfactuals, and control	insufficient evidence	The study does not include exogenous variables, actions, control inputs, or interventions.	Needs action-conditioned scaling-law experiments.

Links Into The Wiki

Open Questions

Do the same scaling exponents hold for multivariate time series with channel coupling?
How do scaling laws change when the model supports known future exogenous variables, actions, or interventions?
Can compact models such as Tiny Time Mixers keep their advantage once compared under explicit data/model/compute scaling curves?
