Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Source

Raw Markdown: paper_time-moe-2024.md
PDF: paper_time-moe-2024.pdf
Preprint: arXiv 2409.16040
Official code: Time-MoE/Time-MoE
Official checkpoint: Maple728/TimeMoE-50M
Official checkpoint: Maple728/TimeMoE-200M

Core Claim

Time-MoE argues that sparse mixture-of-experts scaling can make time-series forecasting foundation models larger and more capable without forcing inference cost to grow with total parameter count.

Key Contributions

Introduces a decoder-only autoregressive time-series forecasting foundation model with sparse temporal mixture-of-experts layers, causal attention, point-wise tokenization, and multi-resolution forecasting heads.
Builds Time-300B, a cleaned large-scale pretraining corpus spanning more than nine domains and about 309B time points.
Scales the model family up to a 2.4B total-parameter model with 1.1B activated parameters, while also training smaller 50M and 200M activated-parameter variants for lower-cost inference.
Reports that Time-MoE achieves lower average forecasting error than existing time-series foundation models in zero-shot settings, and lower error than strong full-shot forecasting baselines after one epoch of downstream fine-tuning.

Benchmarked Models

Model	Role In Paper	Notes	Official Artifact
Time-MoE-50M	Base benchmarked Time-MoE model	12 layers, 12 heads, 8 experts, top-2 routing, 384 hidden size, 50M activated parameters, and 113M total parameters.	Maple728/TimeMoE-50M
Time-MoE-200M	Large benchmarked Time-MoE model	12 layers, 12 heads, 8 experts, top-2 routing, 768 hidden size, 200M activated parameters, and 453M total parameters.	Maple728/TimeMoE-200M
Time-MoE-Ultra	Largest benchmarked Time-MoE model	36 layers, 16 heads, 8 experts, top-2 routing, 1024 hidden size, 1.1B activated parameters, and 2.4B total parameters.	No checkpoint listed under Source.
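As a quick arithmetic check on the table above, the following sketch computes what share of each model's total parameters is activated per token. The parameter counts are the rounded figures from the table, so the ratios are approximate; the dictionary and function names are illustrative only.

```python
# Activated and total parameter counts in millions, copied from the table above.
# The Ultra figures (1.1B / 2.4B) are rounded, so its ratio is approximate.
MODELS = {
    "Time-MoE-50M": (50, 113),
    "Time-MoE-200M": (200, 453),
    "Time-MoE-Ultra": (1100, 2400),
}

def activated_fraction(activated_m, total_m):
    """Share of total parameters that participate in one forward pass."""
    return activated_m / total_m

for name, (activated, total) in MODELS.items():
    share = activated_fraction(activated, total)
    print(f"{name}: {share:.1%} of parameters activated")
```

On these figures, each variant activates a little under half of its total parameters per token.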

Method Notes

Time-MoE is a passive forecasting model rather than an action-conditioned world model: it consumes observed time-series histories and predicts future numeric values, without explicit action, control input, or intervention channels. The paper handles multivariate time series through channel independence, turning each channel into a univariate sequence, so cross-channel dynamics are not the primary modeling target.
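Channel independence can be illustrated with a minimal sketch: a multivariate history is split into one univariate series per channel, each is forecast separately, and the results are restacked. The function names are hypothetical, and `naive_last_value` is a placeholder forecaster standing in for the model, not anything from the paper.

```python
def split_channels(multivariate):
    """Turn a list of time steps (each a list of C channel values)
    into C independent univariate series."""
    if not multivariate:
        return []
    n_channels = len(multivariate[0])
    if any(len(row) != n_channels for row in multivariate):
        raise ValueError("every time step must have the same number of channels")
    return [[row[c] for row in multivariate] for c in range(n_channels)]

def forecast_channelwise(multivariate, horizon, forecast_univariate):
    """Forecast each channel on its own, then restack to `horizon` rows of C values."""
    per_channel = [
        forecast_univariate(series, horizon)
        for series in split_channels(multivariate)
    ]
    return [list(step) for step in zip(*per_channel)]

def naive_last_value(series, horizon):
    """Placeholder forecaster (assumption): repeat the last observed value."""
    return [series[-1]] * horizon

history = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 time steps, 2 channels
print(forecast_channelwise(history, 2, naive_last_value))
```

Because each channel is forecast from its own history alone, nothing in this pipeline can use information from the other channels, which is the trade-off noted above.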

The architecture replaces dense feed-forward layers with sparse mixture-of-experts layers that combine a set of routed (non-shared) experts with one shared expert. Each time point is routed to a small top-k subset of the routed experts (k = 2 with 8 experts in all three benchmarked models), and an auxiliary balancing loss is used to reduce expert routing collapse.
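The routing described above can be sketched with toy scalar experts. This is an illustration under stated assumptions, not the paper's implementation: the gates are unnormalized softmax probabilities of the selected experts, the shared expert is weighted by a sigmoid gate, and the balance term uses a common Switch-Transformer-style form (number of experts times the sum over experts of routed-token fraction times mean router probability). All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return (chosen expert indices, their gate weights, all router probabilities)."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return chosen, [probs[i] for i in chosen], probs

def moe_forward(token, router_logits, experts, shared_expert, shared_logit, k=2):
    """Weighted sum of the top-k routed experts plus a sigmoid-gated shared expert."""
    chosen, gates, _ = top_k_route(router_logits, k)
    routed = sum(g * experts[i](token) for i, g in zip(chosen, gates))
    shared_gate = 1.0 / (1.0 + math.exp(-shared_logit))
    return routed + shared_gate * shared_expert(token)

def balance_loss(batch_router_logits, k=2):
    """Assumed auxiliary loss: N * sum_i f_i * r_i, where f_i is the fraction of
    routing slots given to expert i and r_i is its mean router probability."""
    n_tokens = len(batch_router_logits)
    n_experts = len(batch_router_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_router_logits:
        chosen, _, probs = top_k_route(logits, k)
        for i in chosen:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (k * n_tokens) for c in counts]
    r = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * ri for fi, ri in zip(f, r))

# Toy setup: four scalar "experts" and one shared expert (all hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x
print(moe_forward(1.0, [2.0, 1.0, 0.0, -1.0], experts, shared, shared_logit=0.0))

# Each token prefers a different expert (balanced) vs. all prefer expert 0 (collapsed).
balanced = [[5.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]
print(balance_loss(balanced, k=1), balance_loss(collapsed, k=1))
```

Under this assumed form, the balance term is smallest when tokens are spread evenly across experts and grows as routing concentrates on a few of them, which is why penalizing it discourages collapse.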

Multi-resolution forecasting heads predict horizons of 1, 8, 32, and 64 time steps during training. At inference, a greedy scheduling procedure combines these heads autoregressively so the model can forecast flexible horizons rather than only one fixed output length.
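A minimal sketch of how such a schedule might work, assuming the greedy rule is "at each step, use the largest head that does not overshoot the remaining horizon". The paper's exact procedure may differ in detail; `greedy_schedule`, `forecast`, and the placeholder predictor are hypothetical names.

```python
HEAD_HORIZONS = (1, 8, 32, 64)  # head sizes named in the text above

def greedy_schedule(horizon, heads=HEAD_HORIZONS):
    """List the head sizes to apply, in order, so that they sum to `horizon`."""
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    ordered = sorted(heads, reverse=True)
    remaining = horizon
    plan = []
    while remaining > 0:
        step = next((h for h in ordered if h <= remaining), None)
        if step is None:
            raise ValueError("no head is small enough to finish the horizon")
        plan.append(step)
        remaining -= step
    return plan

def forecast(history, horizon, predict_with_head):
    """Autoregressive rollout: each chunk is appended to the context
    before the next head is applied."""
    context = list(history)
    output = []
    for head in greedy_schedule(horizon):
        chunk = predict_with_head(context, head)
        output.extend(chunk)
        context.extend(chunk)
    return output

def naive_head(context, head):
    """Placeholder for a model head (assumption): repeat the last value."""
    return [context[-1]] * head

for h in (96, 192, 336, 720):
    print(h, greedy_schedule(h))
print(len(forecast([1.0, 2.0], 100, naive_head)))
```

With these head sizes, a horizon of 96 is covered by one 64-step and one 32-step prediction, while 720 needs eleven 64-step predictions followed by two 8-step ones. Because a 1-step head exists, any positive horizon can be reached exactly.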

Evidence And Results

Zero-shot forecasting is evaluated on ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Global Temp with horizons 96, 192, 336, and 720; the paper reports more than 20% average MSE reduction over the most competitive zero-shot baselines.
In-distribution forecasting fine-tunes pretrained Time-MoE models for one epoch on the same benchmark family; the paper reports 24% average MSE reduction over recent full-shot forecasting baselines.
Sparse-vs-dense scaling experiments report that Time-MoE reduces training cost by 78% and inference cost by 39% relative to dense variants with comparable activated-parameter budgets.
Ablations report worse average MSE when removing mixture-of-experts layers, multi-resolution heads, Huber loss, or the auxiliary routing-balance loss.
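For readers unfamiliar with the Huber loss mentioned in the ablations, here is a minimal sketch of the standard definition: quadratic for small errors and linear for large ones, which makes training less sensitive to outliers than plain squared error. The threshold `delta=1.0` is an illustrative assumption, not the paper's setting.

```python
def huber(error, delta=1.0):
    """Standard Huber loss for a single prediction error."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error * error              # quadratic region
    return delta * (magnitude - 0.5 * delta)    # linear region

def mean_huber(predictions, targets, delta=1.0):
    """Average Huber loss over paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(huber(p - t, delta) for p, t in zip(predictions, targets)) / len(predictions)

# A small error is penalized like half squared error; a large one grows only linearly.
print(huber(0.5), huber(3.0), 0.5 * 3.0 ** 2)
```

For an error of 3.0 with `delta=1.0`, the Huber penalty is 2.5 versus 4.5 for half squared error, so a single outlier contributes less to the gradient.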

Limitations

The paper is focused on forecasting accuracy and scaling behavior, not time-series reasoning, causal discovery, action-conditioned simulation, or intervention planning.
Channel-independent handling of multivariate time series improves universality but may miss important cross-channel structure in domains where coupled dynamics matter.
The largest model, Time-MoE-Ultra, is not among the official Hugging Face checkpoints listed under Source, so the public artifacts most directly support the 50M and 200M variants.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute allocation	partially closes	Sparse MoE routes each time point through a small subset of experts while scaling total parameters to 2.4B.	Routing is per-token forecasting compute, not adaptive planning over spans, channels, or futures.
Point-wise numeric tokenization	partially closes	Uses point-wise tokenization and multi-resolution forecasting heads for flexible horizons.	No adaptive patching, and no explicit interface for units, missingness, or context.
Native multivariate encoding	warning	Channel independence lets the model handle arbitrary variates but discards cross-channel interactions.	Needs native multivariate coupling for high-channel systems.
Benchmarks: forecasting level	partially closes	Reports zero-shot and in-distribution forecasting gains across standard benchmarks.	Does not test reasoning, interventions, or control utility.

Links Into The Wiki

Open Questions

How much of Time-MoE’s gain comes from sparse expert routing versus Time-300B data scale and cleaning?
Would explicit multivariate coupling improve results on domains where channel interactions are central?
Can sparse forecasting models like Time-MoE become useful backbones for reasoning-oriented systems such as TimeOmni-1?
