Mixture Of Experts

Summary

In this corpus, Mixture of Experts (MoE) appears as a tool for scaling and for allocating compute, not merely as a way to raise parameter count.
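The gap between total and active parameters is what separates compute allocation from raw parameter count. The arithmetic below is a minimal sketch for a single MoE feed-forward block, assuming each expert is a two-matrix MLP without biases and the router is a single linear map. The function name and the example sizes are hypothetical and are not taken from any source in the corpus.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active-per-token) weight counts for one MoE FFN block.

    Assumes each expert has an up-projection and a down-projection
    (2 * d_model * d_ff weights, biases ignored) and a linear router.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


# Hypothetical sizes, chosen only to show the ratio.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
print(f"total={total:,} active={active:,} ratio={active / total:.3f}")
```

Under these assumptions, per-token FLOPs in the block scale with the active count, while memory scales with the total count.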

What The Wiki Currently Believes

  • Beyond Language Modeling finds MoE useful for multimodal scaling and modality specialization.
  • ConceptMoE uses MoE to isolate the benefits of concept-level processing under matched FLOPs and total parameters.
  • Time-MoE brings sparse expert routing to billion-scale time-series forecasting.
  • Moirai-MoE tests MoE routing inside the Moirai forecasting family.
  • Sparse Layers are Critical to Scaling Looped Language Models adds a looped-depth variant: sparse experts can activate differently across repeated passes through shared layers, partially restoring expressivity that dense looped models lose.
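
The looped-depth idea in the last bullet can be illustrated with a toy top-k router. The sketch below is not the method of any listed paper. It assumes softmax routing with renormalized top-k weights, linear experts followed by tanh, and a residual update, and every name in it is hypothetical. It applies one shared MoE layer several times and records which experts fire on each pass. Because the hidden state changes between passes, the router can select different experts even though the weights are shared.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(logits, k):
    """Return (expert indices, renormalized weights) for the k largest logits."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]


def looped_moe(x, router, experts, k, n_loops):
    """Apply one shared MoE layer n_loops times; record the experts used per pass."""
    trace = []
    for _ in range(n_loops):
        chosen, weights = top_k_route(matvec(router, x), k)
        trace.append(chosen)
        update = [0.0] * len(x)
        for idx, w in zip(chosen, weights):
            out = matvec(experts[idx], x)
            update = [u + w * math.tanh(o) for u, o in zip(update, out)]
        x = [xi + ui for xi, ui in zip(x, update)]  # residual connection
    return x, trace


# Toy example with random weights (hypothetical sizes).
rng = random.Random(0)
d, n_experts = 4, 6
router = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]
x0 = [rng.gauss(0, 1) for _ in range(d)]
_, trace = looped_moe(x0, router, experts, k=2, n_loops=4)
print(trace)  # one list of expert indices per pass
```

A dense looped layer would apply the same transformation on every pass. Here the per-pass trace shows whether the shared layer is actually using different experts at different depths.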

Evidence

The sources treat MoE as a way to separate or reallocate computation where uniform processing is wasteful. In the time-series cases, the open question is whether routing should specialize by frequency, horizon, dataset family, covariate structure, or task.
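
One way to probe what routing specializes on is to measure how much the chosen expert reveals about a candidate grouping such as frequency or dataset family. The sketch below assumes top-1 routing and a single categorical label per token. The function name and the labels are hypothetical. It computes the empirical mutual information between group label and expert index: values near zero suggest routing ignores the grouping, and larger values suggest it tracks the grouping.

```python
import math
from collections import Counter


def routing_mutual_information(groups, experts):
    """Empirical mutual information (nats) between group label and chosen expert."""
    if len(groups) != len(experts) or not groups:
        raise ValueError("groups and experts must be non-empty and the same length")
    n = len(groups)
    joint = Counter(zip(groups, experts))
    group_counts = Counter(groups)
    expert_counts = Counter(experts)
    mi = 0.0
    for (g, e), c in joint.items():
        # p(g, e) / (p(g) * p(e)) simplifies to c * n / (count_g * count_e)
        mi += (c / n) * math.log(c * n / (group_counts[g] * expert_counts[e]))
    return mi


# Hypothetical labels: sampling frequency of the series each token came from.
freq = ["hourly", "hourly", "daily", "daily"]
print(routing_mutual_information(freq, [0, 0, 1, 1]))  # routing follows frequency
print(routing_mutual_information(freq, [0, 1, 0, 1]))  # routing ignores frequency
```

Running the same measurement against horizon, dataset family, covariate structure, or task labels would give a rough comparison of which axis the router follows most closely. The empirical estimate is biased upward on small samples.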

Relation To Foundation TSFM Agenda

MoE maps to the dynamic-compute slot in the Foundation Time-Series Model Research Agenda. Current time-series sources show sparse capacity allocation, but the agenda-relevant test is sharper: routing should allocate compute to the spans, channels, regimes, context, and candidate futures that matter for latent-state maintenance, rather than merely lowering average forecasting loss.
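
The sharper test above could be operationalized in many ways. The sketch below is one hypothetical version, not something proposed by the agenda or the cited sources. It assumes a model whose number of active experts varies per time step, that every step activates at least one expert, and that an external mask marks the steps considered important, for example around a regime change. It reports how much more compute the important steps receive than the rest.

```python
def compute_allocation_ratio(active_experts, important):
    """Mean experts activated on important steps divided by the mean on other steps.

    active_experts: per-step count of experts activated (assumed >= 1).
    important: per-step booleans from a hypothetical external annotation.
    """
    if len(active_experts) != len(important):
        raise ValueError("inputs must have the same length")
    hit = [a for a, m in zip(active_experts, important) if m]
    rest = [a for a, m in zip(active_experts, important) if not m]
    if not hit or not rest:
        raise ValueError("need at least one important and one unimportant step")
    return (sum(hit) / len(hit)) / (sum(rest) / len(rest))


# Hypothetical trace: steps 2 and 3 sit on a regime change.
ratio = compute_allocation_ratio([1, 1, 4, 4, 1, 1],
                                 [False, False, True, True, False, False])
print(ratio)
```

A ratio near 1 would mean compute is spread uniformly regardless of the mask. A ratio well above 1 would mean routing concentrates compute where the mask says it matters, which is a separate question from whether average loss improves.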

Open Questions

  • Can MoE routing align naturally with modality boundaries, concept boundaries, and task difficulty at the same time?
  • How should MoE interact with learned token compression?
  • Can expert diversity across loop iterations make recurrent depth competitive under the same expected FLOPs and memory budget?
  • What should expert specialization mean for multivariate time-series models?