Mixture Of Experts

Summary

MoE appears in this corpus as a tool for scaling and compute allocation, not only as a parameter-count trick.

What The Wiki Currently Believes

Beyond Language Modeling finds MoE useful for multimodal scaling and modality specialization.
ConceptMoE uses MoE to isolate the benefits of concept-level processing under matched FLOPs and total parameters.
Time-MoE brings sparse expert routing to billion-scale time-series forecasting.
Moirai-MoE tests MoE routing inside the Moirai forecasting family.
MIRA uses frequency-specific sparse expert routing for irregular medical time series.
Diff-MN uses MoE-NCDE dynamics for irregular continuous generation, then uses diffusion to parameterize sample-specific MoE weights.
Sparse Layers are Critical to Scaling Looped Language Models adds a looped-depth variant: sparse experts can activate differently across repeated passes through shared layers, partially restoring expressivity that dense looped models lose.
The Thinking Pixel adds a diffusion-generation variant: sparse LoRA-adapter experts are routed differently across latent recursion steps while refining continuous visual tokens.
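The looped-depth claim above can be made concrete with a toy sketch. It is not the architecture from any of the cited papers: the sizes, the random weights, and the helper names (`make_looped_moe`, `looped_forward`, `top_k_route`) are all hypothetical. It only shows the mechanism that one shared sparse layer, applied repeatedly, can select different experts on each pass because the router sees a different hidden state each time.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Pick the k largest router logits and renormalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def make_looped_moe(dim, num_experts, seed=0):
    """Random router and expert matrices (illustrative values, not trained)."""
    rng = random.Random(seed)
    router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    experts = [
        [[rng.gauss(0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]
        for _ in range(num_experts)
    ]
    return router, experts


def looped_forward(x, router, experts, num_loops, k):
    """Apply the SAME sparse layer num_loops times; record experts used per pass."""
    h = list(x)
    trace = []
    for _ in range(num_loops):
        routes = top_k_route(matvec(router, h), k)  # routing depends on current h
        update = [0.0] * len(h)
        for idx, weight in routes:
            out = matvec(experts[idx], h)
            update = [u + weight * math.tanh(o) for u, o in zip(update, out)]
        h = [a + b for a, b in zip(h, update)]  # residual update
        trace.append([idx for idx, _ in routes])
    return h, trace


if __name__ == "__main__":
    router, experts = make_looped_moe(dim=4, num_experts=4, seed=0)
    _, trace = looped_forward([0.5, -1.0, 0.25, 2.0], router, experts, num_loops=3, k=2)
    print(trace)  # expert indices chosen on each pass through the shared layer
```

Whether the selected experts actually differ across passes depends on the weights and the input. A dense looped layer, by contrast, applies the same transformation on every pass by construction.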

Evidence

The sources treat MoE as a way to separate or reallocate computation where uniform processing is wasteful. In the time-series cases, the open question is what routing should specialize on: frequency, horizon, dataset family, covariate structure, temporal dynamics, or task. MIRA and Diff-MN make the question more concrete: MIRA routes clinical representations by temporal frequency, while Diff-MN uses expert dynamics for continuous generation under irregular observations. The Thinking Pixel adds the visual-latent version, routing sparse adapters by visual token state, diffusion timestep, and conditioning information; its evidence, however, comes from image generation rather than numeric time series.
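To illustrate what "specializing by frequency" could mean in the simplest case, here is a minimal sketch that routes a window to an expert according to where its spectral energy sits. It is not MIRA's method. It assumes a regularly sampled window (unlike the irregular clinical setting), uses a plain DFT, and treats band energy shares as router scores. The function names are hypothetical.

```python
import cmath
import math


def power_spectrum(window):
    """Power at frequencies 1..n//2 cycles per window (mean removed first)."""
    n = len(window)
    mean = sum(window) / n
    centered = [v - mean for v in window]
    spectrum = []
    for f in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * f * t / n) for t, v in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def band_energies(spectrum, num_bands):
    """Sum spectral power into contiguous bands, lowest frequencies first."""
    size = math.ceil(len(spectrum) / num_bands)
    return [sum(spectrum[i * size:(i + 1) * size]) for i in range(num_bands)]


def route_by_frequency(window, num_experts, k=1):
    """Return (expert_index, weight) for the k bands holding the most energy.

    Expert i is assumed to own frequency band i. A constant window has no
    energy in any band, so it falls back to expert 0 with weight 0.0.
    """
    energies = band_energies(power_spectrum(window), num_experts)
    total = sum(energies) or 1.0
    shares = [e / total for e in energies]
    top = sorted(range(num_experts), key=lambda i: shares[i], reverse=True)[:k]
    norm = sum(shares[i] for i in top) or 1.0
    return [(i, shares[i] / norm) for i in top]


if __name__ == "__main__":
    n = 32
    slow = [math.sin(2 * math.pi * 1 * t / n) for t in range(n)]
    fast = [math.sin(2 * math.pi * 14 * t / n) for t in range(n)]
    print(route_by_frequency(slow, num_experts=4))  # lowest band dominates
    print(route_by_frequency(fast, num_experts=4))  # highest band dominates
```

A learned router would replace the fixed band assignment with trainable scores. The fixed version is only meant as a baseline for one of the candidate specialization axes listed above.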

Relation To Foundation TSFM Agenda

MoE maps to the dynamic-compute slot in the Foundation Time-Series Model Research Agenda. Current time-series sources show sparse capacity allocation, but the agenda-relevant test is sharper: routing should allocate compute to the spans, channels, regimes, context, and candidate futures that matter for latent-state maintenance, rather than only lowering average forecasting loss.
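One way such a test might be operationalized is to compare where routed compute goes with where it is supposed to matter. The sketch below is a hypothetical diagnostic and does not come from any of the cited sources. It assumes a per-step count of active experts and an externally supplied boolean mask marking the steps judged important (for example, regime changes).

```python
def compute_allocation_ratio(active_experts_per_step, important_mask):
    """Ratio of compute share to step share on the flagged steps.

    A value above 1.0 means the flagged steps receive more than their
    proportional share of expert activations; 1.0 means uniform allocation.
    """
    if len(active_experts_per_step) != len(important_mask):
        raise ValueError("inputs must have the same length")
    if not any(important_mask):
        raise ValueError("important_mask must flag at least one step")
    total_compute = sum(active_experts_per_step)
    if total_compute <= 0:
        raise ValueError("total compute must be positive")
    important_compute = sum(
        c for c, flagged in zip(active_experts_per_step, important_mask) if flagged
    )
    compute_share = important_compute / total_compute
    step_share = sum(1 for flagged in important_mask if flagged) / len(important_mask)
    return compute_share / step_share


if __name__ == "__main__":
    # Hypothetical trace: steps 2 and 3 are a flagged regime change.
    active = [1, 1, 4, 4, 1, 1]
    mask = [False, False, True, True, False, False]
    print(compute_allocation_ratio(active, mask))
```

A diagnostic like this says nothing about forecasting loss on its own. It would need to be reported alongside loss to separate "compute went where it mattered" from "average error went down".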

Open Questions

Can MoE routing align naturally with modality boundaries, concept boundaries, and task difficulty at the same time?
How should MoE interact with learned token compression?
Can expert diversity across loop iterations make recurrent depth competitive under the same expected FLOPs and memory budget?
Can sparse experts over latent recursion steps improve continuous generative models without hiding latency or memory costs?
What should expert specialization mean for multivariate time-series models?
