Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Source

Core Claim

Time-MoE argues that sparse mixture-of-experts scaling can make time-series forecasting foundation models larger and more capable without forcing inference cost to grow with total parameter count.

Key Contributions

  • Introduces a decoder-only autoregressive time-series forecasting foundation model with sparse temporal mixture-of-experts layers, causal attention, point-wise tokenization, and multi-resolution forecasting heads.
  • Builds Time-300B, a cleaned large-scale pretraining corpus spanning more than nine domains and about 309B time points.
  • Scales the model family up to a 2.4B total-parameter model with 1.1B activated parameters, while also training smaller 50M and 200M activated-parameter variants for lower-cost inference.
  • Reports that Time-MoE improves average forecasting errors over existing time-series foundation models in zero-shot settings and over strong full-shot forecasting baselines after one epoch of downstream fine-tuning.

Benchmarked Models

| Model | Role In Paper | Notes | Official Artifact |
|---|---|---|---|
| Time-MoE-50M | Base benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 384 hidden size, 50M activated parameters, and 113M total parameters. | Maple728/TimeMoE-50M |
| Time-MoE-200M | Large benchmarked Time-MoE model | 12 layers, 12 heads, 8 experts, top-2 routing, 768 hidden size, 200M activated parameters, and 453M total parameters. | Maple728/TimeMoE-200M |
| Time-MoE-Ultra | Largest benchmarked Time-MoE model | 36 layers, 16 heads, 8 experts, top-2 routing, 1024 hidden size, 1.1B activated parameters, and 2.4B total parameters. | Not listed in the requested official checkpoints. |
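The activated and total parameter counts above make the core claim concrete: per-token compute follows the activated count, not the total. A small arithmetic sketch, using only the figures listed for the three models (the dictionary name and the millions unit are illustrative choices, not from the paper):

```python
# Parameter counts in millions, copied from the model descriptions above.
# The name PARAMS_M is a hypothetical label for this sketch.
PARAMS_M = {
    "Time-MoE-50M": (50, 113),
    "Time-MoE-200M": (200, 453),
    "Time-MoE-Ultra": (1100, 2400),
}

def activated_fraction(activated, total):
    """Share of the total parameters that is used for any single token."""
    return activated / total

fractions = {
    name: activated_fraction(activated, total)
    for name, (activated, total) in PARAMS_M.items()
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters activated per token")
```

Across all three sizes the activated share stays between roughly 44% and 46%, so fewer than half of the stored parameters take part in any single forward step.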

Method Notes

Time-MoE is a passive forecasting model rather than an action-conditioned world model: it consumes observed time-series histories and predicts future numeric values, without explicit action, control input, or intervention channels. The paper handles multivariate time series through channel independence, turning each channel into a univariate sequence, so cross-channel dynamics are not the primary modeling target.
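Channel independence can be illustrated with a minimal sketch: split a multivariate series into univariate channels, forecast each one separately, and stack the results back together. The function names and the `repeat_last` stand-in forecaster are hypothetical and only serve the illustration; they are not the paper's code.

```python
def split_channels(rows):
    """Turn T rows of C values into C univariate series of length T."""
    return [list(channel) for channel in zip(*rows)]

def merge_channels(channels):
    """Inverse of split_channels: C series of length H into H rows of C values."""
    return [list(row) for row in zip(*channels)]

def forecast_multivariate(rows, horizon, forecast_univariate):
    """Forecast every channel on its own; no information crosses channels."""
    channels = split_channels(rows)
    forecasts = [forecast_univariate(channel, horizon) for channel in channels]
    return merge_channels(forecasts)

def repeat_last(history, horizon):
    """Hypothetical stand-in for a univariate model: repeat the last value."""
    return [history[-1]] * horizon

observed = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 time steps, 2 channels
predicted = forecast_multivariate(observed, horizon=2, forecast_univariate=repeat_last)
```

Because each call to the forecaster sees a single channel, any dependence between channels is invisible to the model by construction.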

The architecture replaces dense feed-forward layers with sparse mixture-of-experts layers that include isolated experts plus a shared expert. Each time point is routed to a small top-k subset of experts, and an auxiliary balancing loss is used to reduce expert routing collapse.
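The routing described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: the gate is assumed to be a softmax over routed experts with top-k selection, the shared expert is assumed to be scaled by a sigmoid gate, and the balancing term is written in the Switch-Transformer style (number of experts times the sum of routed-token fraction times mean router probability). All function and variable names are hypothetical, and the experts are toy scaling functions rather than feed-forward networks.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def make_expert(scale):
    """Toy expert: scales its input. A real expert would be a small FFN."""
    return lambda h: [scale * x for x in h]

def moe_layer(tokens, experts, shared_expert, gate_w, shared_gate_w, k=2):
    """Apply a sparse MoE layer to a list of token vectors.

    Returns (outputs, aux_loss), where aux_loss is the assumed
    Switch-style load-balancing term.
    """
    n = len(experts)
    counts = [0] * n        # number of tokens that selected each expert
    prob_sums = [0.0] * n   # summed router probability per expert
    outputs = []
    for h in tokens:
        probs = softmax([dot(w, h) for w in gate_w])
        top = sorted(range(n), key=lambda i: probs[i], reverse=True)[:k]
        # The shared expert sees every token, scaled by a sigmoid gate.
        shared_gate = 1.0 / (1.0 + math.exp(-dot(shared_gate_w, h)))
        out = [shared_gate * v for v in shared_expert(h)]
        # Only the k selected routed experts are evaluated for this token.
        for i in top:
            counts[i] += 1
            out = [o + probs[i] * v for o, v in zip(out, experts[i](h))]
        for i in range(n):
            prob_sums[i] += probs[i]
        outputs.append(out)
    t = len(tokens)
    aux_loss = n * sum(
        (counts[i] / (k * t)) * (prob_sums[i] / t) for i in range(n)
    )
    return outputs, aux_loss
```

With 8 routed experts and k=2, each token evaluates two routed experts plus the shared one, which is why compute tracks the activated rather than the total parameter count. The balancing term grows when both the routed-token fraction and the router probability concentrate on the same few experts, so minimizing it pushes the router toward more even expert use.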

Multi-resolution forecasting heads predict horizons of 1, 8, 32, and 64 time steps during training. At inference, a greedy scheduling procedure combines these heads autoregressively so the model can forecast flexible horizons rather than only one fixed output length.
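One plausible reading of the greedy scheduling step is sketched below: at each step, pick the largest head whose horizon still fits in the remaining length, append its output to the context, and repeat. The function names and the `repeat_last` stand-in predictor are hypothetical; only the head sizes 1, 8, 32, and 64 come from the description above.

```python
def greedy_schedule(horizon, head_sizes=(1, 8, 32, 64)):
    """Split a target horizon into head-sized chunks, largest first."""
    steps = []
    remaining = horizon
    for size in sorted(head_sizes, reverse=True):
        while remaining >= size:
            steps.append(size)
            remaining -= size
    if remaining:
        # Cannot happen when a head of size 1 is available.
        raise ValueError("head sizes cannot cover the requested horizon")
    return steps

def forecast(context, horizon, predict, head_sizes=(1, 8, 32, 64)):
    """Autoregressive rollout: each chunk is predicted from all prior values."""
    history = list(context)
    for size in greedy_schedule(horizon, head_sizes):
        history.extend(predict(history, size))
    return history[len(context):]

def repeat_last(history, size):
    """Hypothetical stand-in for a model head: repeat the last value."""
    return [history[-1]] * size

print(greedy_schedule(96))   # [64, 32]
print(len(forecast([1.0, 2.0, 3.0], 100, repeat_last)))  # 100
```

The head of size 1 guarantees that any positive horizon can be covered exactly, while the larger heads keep the number of autoregressive steps small for long horizons.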

Evidence And Results

  • Zero-shot forecasting is evaluated on ETTh1, ETTh2, ETTm1, ETTm2, Weather, and Global Temp with horizons 96, 192, 336, and 720; the paper reports more than 20% average MSE reduction over the most competitive zero-shot baselines.
  • In-distribution forecasting fine-tunes pretrained Time-MoE models for one epoch on the same benchmark family; the paper reports 24% average MSE reduction over recent full-shot forecasting baselines.
  • Sparse-vs-dense scaling experiments report that Time-MoE reduces training cost by 78% and inference cost by 39% relative to dense variants with comparable activated-parameter budgets.
  • Ablations report worse average MSE when removing mixture-of-experts layers, multi-resolution heads, Huber loss, or the auxiliary routing-balance loss.
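The Huber loss named in the ablation item above follows a standard definition: quadratic for small errors and linear for large ones. The sketch below compares it with MSE on a toy example containing one outlier. The threshold `delta=1.0` and the sample values are assumptions for illustration, not settings or data from the paper.

```python
def huber_loss(preds, targets, delta=1.0):
    """Mean Huber loss: 0.5*e^2 if |e| <= delta, else delta*(|e| - 0.5*delta)."""
    total = 0.0
    for p, y in zip(preds, targets):
        err = abs(p - y)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(preds)

def mse_loss(preds, targets):
    """Mean squared error, for comparison."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(preds)

preds = [0.0, 0.0, 0.0, 0.0]
targets = [0.0, 0.0, 0.0, 10.0]  # the last target is an outlier

huber_value = huber_loss(preds, targets)
mse_value = mse_loss(preds, targets)
```

On this toy input the single outlier contributes linearly to the Huber loss but quadratically to MSE, which illustrates why a Huber objective is less dominated by extreme values during training.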

Limitations

  • The paper is focused on forecasting accuracy and scaling behavior, not time-series reasoning, causal discovery, action-conditioned simulation, or intervention planning.
  • Channel-independent handling of multivariate time series improves universality but may miss important cross-channel structure in domains where coupled dynamics matter.
  • The largest model (Time-MoE-Ultra) is not among the requested public Hugging Face checkpoints, so the released public artifacts directly cover only the 50M and 200M variants.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute allocation | partially closes | Sparse MoE routes each time point through a small subset of experts while scaling total parameters to 2.4B. | Routing is per-token forecasting compute, not adaptive planning over spans, channels, or futures. |
| Point-wise numeric tokenization | partially closes | Uses point-wise tokenization and multi-resolution forecasting heads for flexible horizons. | No adaptive patching or explicit units, missingness, or context interface. |
| Native multivariate encoding | warning | Channel independence lets the model handle arbitrary variates but discards cross-channel interactions. | Needs native multivariate coupling for high-channel systems. |
| Benchmarks: forecasting level | partially closes | Reports zero-shot and in-distribution forecasting gains across standard benchmarks. | Does not test reasoning, interventions, or control utility. |

Open Questions

  • How much of Time-MoE’s gain comes from sparse expert routing versus Time-300B data scale and cleaning?
  • Would explicit multivariate coupling improve results on domains where channel interactions are central?
  • Can sparse forecasting models like Time-MoE become useful backbones for reasoning-oriented systems such as TimeOmni-1?