Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Source

Core Claim

Moirai-MoE argues that time-series foundation models should specialize by token-level learned routing instead of hand-assigned frequency buckets: sparse mixture-of-experts layers can let similar local time-series patterns share experts even when their sampling frequencies differ.

Key Contributions

  • Introduces Moirai-MoE, a sparse mixture-of-experts time-series foundation model built on the Moirai family.
  • Replaces Moirai’s multiple frequency-specific input and output projections with a single projection layer plus token-level expert routing inside Transformer blocks.
  • Uses a decoder-only forecasting objective so one training update can supervise multiple context lengths more efficiently than the earlier masked-encoder setup.
  • Proposes a token-cluster gating function, where expert routing is initialized from k-means centroids over pretrained Moirai token representations.
  • Evaluates on 39 datasets across in-distribution Monash forecasting and zero-shot forecasting settings.
  • Releases the implementation through Uni2TS and public small/base Moirai-MoE checkpoints.
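
The token-cluster gating idea in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Uni2TS implementation: the function names `kmeans_centroids` and `cluster_gate` are hypothetical, and using negative Euclidean distance as the affinity score, then normalizing with a softmax over the selected experts, are assumptions made for the example.

```python
import numpy as np

def kmeans_centroids(tokens, num_experts, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns (num_experts, dim) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct tokens.
    start = rng.choice(len(tokens), size=num_experts, replace=False)
    centroids = tokens[start].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every token to every centroid.
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_experts):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

def cluster_gate(tokens, centroids, top_k=2):
    """Route each token to its top_k nearest centroids (one centroid per expert)."""
    dists = np.linalg.norm(tokens[:, None, :] - centroids[None, :, :], axis=-1)
    scores = -dists  # assumption: a closer centroid means a higher affinity
    expert_ids = np.argsort(-scores, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(scores, expert_ids, axis=1)
    # Softmax over the selected experts only, so weights sum to 1 per token.
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)
    weights = np.exp(top_scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return expert_ids, weights

# Stand-in for token representations taken from a pretrained dense model.
pretrained_tokens = np.random.default_rng(0).normal(size=(256, 16))
centroids = kmeans_centroids(pretrained_tokens, num_experts=8)
expert_ids, weights = cluster_gate(pretrained_tokens, centroids, top_k=2)
```

The point of the design is that the routing geometry comes from clusters found in an already-trained model's representation space, so tokens with similar local patterns land on the same experts regardless of sampling frequency.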

Benchmarked Models

Model | Role In Paper | Notes | Official Artifact
--- | --- | --- | ---
Moirai-MoE-1.0-R-Small | Small released benchmark model | 6 layers, 384 hidden size, 512 feed-forward size, 2 activated experts out of 32, about 11M activated parameters, and about 117M total parameters. | Salesforce/moirai-moe-1.0-R-small
Moirai-MoE-1.0-R-Base | Base released benchmark model | 12 layers, 768 hidden size, 1024 feed-forward size, 2 activated experts out of 32, about 86M activated parameters, and about 935M total parameters. | Salesforce/moirai-moe-1.0-R-base

Method Notes

Moirai-MoE is a passive dynamics model for forecasting: it predicts future numeric observations from observed histories and does not expose an explicit action, control input, intervention, or treatment channel. It is best read as a strong probabilistic time-series foundation model baseline, not as an action-conditioned world model.

The model keeps Moirai’s patch-based handling of multivariate time series, flattening variates into a causal Transformer sequence, but moves specialization into sparse experts. Each MoE layer routes a token to 2 of 32 experts, so activated compute stays close to dense Moirai at the same size class while total model capacity is much larger.
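
To make the "2 of 32 experts" arithmetic concrete, here is a hedged sketch of a sparse MoE feed-forward pass. It uses a plain linear gate with softmax over the selected experts and a ReLU activation, both of which are assumptions for illustration rather than the paper's exact layer; `moe_forward` and the toy dimensions are hypothetical.

```python
import numpy as np

def moe_forward(x, gate_w, w_in, w_out, top_k=2):
    """Sparse MoE feed-forward pass.

    x: (tokens, d_model), gate_w: (d_model, E),
    w_in: (E, d_model, d_ff), w_out: (E, d_ff, d_model).
    """
    logits = x @ gate_w                                   # (tokens, E)
    expert_ids = np.argsort(-logits, axis=1)[:, :top_k]   # (tokens, top_k)
    top_logits = np.take_along_axis(logits, expert_ids, axis=1)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)         # sums to 1 per token

    out = np.zeros_like(x)
    for e in range(gate_w.shape[1]):
        # Tokens that selected expert e, and the slot (0..top_k-1) it occupies.
        token_idx, slot = np.nonzero(expert_ids == e)
        if token_idx.size == 0:
            continue  # this expert does no work for the current batch
        hidden = np.maximum(x[token_idx] @ w_in[e], 0.0)  # ReLU (assumed)
        out[token_idx] += weights[token_idx, slot][:, None] * (hidden @ w_out[e])
    return out, expert_ids

# Toy sizes; only the expert count and top_k mirror the text above.
rng = np.random.default_rng(0)
num_tokens, d_model, d_ff, num_experts = 10, 16, 32, 32
x = rng.normal(size=(num_tokens, d_model))
gate_w = rng.normal(size=(d_model, num_experts))
w_in = rng.normal(size=(num_experts, d_model, d_ff)) * 0.1
w_out = rng.normal(size=(num_experts, d_ff, d_model)) * 0.1
out, expert_ids = moe_forward(x, gate_w, w_in, w_out, top_k=2)
```

Each token touches only two expert weight matrices, which is why per-token compute tracks the activated parameter count while all 32 experts still have to be stored.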

The paper’s main design bet is that frequency is a weak proxy for temporal structure. The authors show examples where different frequencies have similar patterns, same-frequency series have different patterns, and non-stationarity changes distribution within a short context window. Moirai-MoE therefore uses a single patch size and projection path, then lets routing adapt at the token level.
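
The single-patch-size input path can be illustrated as follows. This is a sketch under stated assumptions: the helper `patchify_and_flatten` is hypothetical, the patch size of 16 in the usage lines is only an illustrative value, and the variate-major flattening order is one reasonable choice rather than a confirmed detail of the released code.

```python
import numpy as np

def patchify_and_flatten(series, patch_size):
    """Cut a multivariate series into fixed-size patches and flatten the variates.

    series: (num_variates, length)
    Returns tokens of shape (num_variates * num_patches, patch_size) plus
    per-token variate and time indices.
    """
    num_variates, length = series.shape
    num_patches = length // patch_size
    # Drop the oldest leftover steps so the most recent history is kept whole.
    trimmed = series[:, length - num_patches * patch_size:]
    patches = trimmed.reshape(num_variates, num_patches, patch_size)
    tokens = patches.reshape(num_variates * num_patches, patch_size)
    variate_id = np.repeat(np.arange(num_variates), num_patches)
    time_id = np.tile(np.arange(num_patches), num_variates)
    return tokens, variate_id, time_id

# Two variates, 35 time steps, one shared patch size for every frequency.
series = np.arange(70, dtype=float).reshape(2, 35)
tokens, variate_id, time_id = patchify_and_flatten(series, patch_size=16)
```

Because every series goes through the same patch size and projection, any frequency-specific behavior has to emerge downstream, in how tokens are routed to experts.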

Evidence And Results

  • On 29 in-distribution Monash datasets, the paper reports that Moirai-MoE beats the compared Monash baselines, TimesFM, Chronos, and dense Moirai variants on aggregate normalized MAE.
  • The small Moirai-MoE model is reported as 17% better than dense Moirai small on the Monash aggregate, while also outperforming dense Moirai base and large in that setting.
  • On 10 zero-shot datasets outside LOTSA, the paper reports that Moirai-MoE base achieves the best average CRPS and MASE (lower is better for both) among the compared foundation-model and full-shot baselines.
  • Against dense Moirai variants in zero-shot forecasting, the paper reports that Moirai-MoE small achieves 3%-14% better CRPS and 8%-16% better MASE while using about 11M activated parameters.
  • Ablations indicate that switching from masked encoder to decoder-only training helps, but the larger gain comes from replacing frequency-level projections with MoE token specialization.
  • Expert analysis suggests that shallow layers use more diverse routing for local and periodic patterns, while deeper layers converge toward more frequency-invariant representations.
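
For readers less familiar with the metrics named above, the following sketch shows one common way to compute MASE and a sample-based CRPS estimate. The function names are hypothetical, and the CRPS estimator here is the standard energy form over forecast samples; the paper's evaluation pipeline may approximate CRPS differently (for example through quantile losses), so treat this as a definition aid rather than a reproduction of the reported numbers.

```python
import numpy as np

def mase(y_true, y_pred, history, season=1):
    """Mean absolute scaled error: forecast MAE divided by the in-sample
    MAE of a seasonal-naive forecast with period `season`."""
    y_true, y_pred, history = map(np.asarray, (y_true, y_pred, history))
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one target value y:
    E|X - y| - 0.5 * E|X - X'|, with X and X' drawn from the forecast."""
    samples = np.asarray(samples, dtype=float)
    term_accuracy = np.mean(np.abs(samples - y))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_accuracy - term_spread

history = [1.0, 2.0, 3.0, 4.0, 5.0]
point_score = mase(y_true=[6.0, 7.0], y_pred=[5.0, 9.0], history=history)
prob_score = crps_from_samples(samples=[0.0, 2.0], y=1.0)
```

MASE scores a point forecast relative to a naive baseline, while CRPS scores the whole predictive distribution, which is why the note reports both for a probabilistic forecaster.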

Limitations

  • The paper does not introduce an action-conditioned interface, so it does not directly address control, intervention planning, counterfactual simulation, or causal discovery.
  • Total parameter counts are much larger than activated parameter counts, especially for the base model, so memory and serving constraints still matter even when per-token compute is sparse.
  • The strongest comparison is within forecasting; broader time-series understanding, reasoning, classification, and generation tasks require separate evidence.
  • Some expert-routing analyses are based on visualization and aggregate behavior, so they should be treated as mechanistic hypotheses rather than complete explanations.
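
The gap between activated and total parameters noted in the list above can be turned into a rough weight-memory estimate. The parameter counts are the approximate figures quoted in this note; the helper `weight_gb` is hypothetical, and the estimate covers weights only, ignoring activations, caches, and framework overhead.

```python
# Approximate parameter counts as quoted in this note.
MODELS = {
    "small": {"activated": 11e6, "total": 117e6},
    "base": {"activated": 86e6, "total": 935e6},
}

def weight_gb(num_params, bytes_per_param):
    """Weight storage in gigabytes (10**9 bytes); weights only."""
    return num_params * bytes_per_param / 1e9

report = {}
for name, counts in MODELS.items():
    report[name] = {
        # How many stored parameters exist per parameter used on a token.
        "total_to_activated": counts["total"] / counts["activated"],
        "fp32_gb": weight_gb(counts["total"], 4),
        "fp16_gb": weight_gb(counts["total"], 2),
    }

for name, row in report.items():
    print(
        f"{name}: {row['total_to_activated']:.1f}x total/activated, "
        f"{row['fp32_gb']:.2f} GB fp32, {row['fp16_gb']:.2f} GB fp16"
    )
```

Under these assumptions both sizes store roughly ten times more parameters than they activate per token, so serving cost is driven by total rather than activated size.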

Foundation TSFM Relevance

Agenda slot | Verdict | Evidence | Missing pieces
--- | --- | --- | ---
Dynamic compute allocation | partially closes | Sparse MoE layers route each token to 2 of 32 experts, increasing total capacity while keeping activated compute near dense Moirai. | Serving memory and routing stability under distribution shift remain unresolved.
Patch and token specialization | partially closes | Token-cluster gating replaces frequency-specific projections with learned pattern-level specialization. | Uses one patch/projection path; does not learn fully adaptive temporal resolution.
Native multivariate encoding | partially closes | Variates are flattened into a causal Transformer sequence so cross-variate attention can occur during forecasting. | High-channel topology, context metadata, and control inputs are not first-class.
Causal and control modeling | insufficient evidence | The model is a passive probabilistic forecaster with no explicit action, intervention, or treatment channel. | Needs control-conditioned training and evaluation.

Open Questions

  • Can Moirai-MoE’s token-level specialization help with high-dimensional telemetry or observability event streams where frequency is less informative than regime, workload, or incident phase?
  • How stable are expert assignments under distribution shift, missingness, and rare interventions?
  • Would sparse expert routing become more useful for world-model-style systems if actions, control inputs, or interventions were encoded as future-conditioning variables?