Sparse Layers are Critical to Scaling Looped Language Models

Source

Core Claim

This paper argues that dense looped models scale poorly, while Looped-MoE models recover expressivity through divergent expert routing across loops and enable strong early exits.
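The mechanism in this claim can be sketched in a few lines. The following is a toy illustration, not the paper's implementation: one shared MoE layer is applied repeatedly with top-1 routing, so the chosen expert can differ from loop to loop because the router sees a different hidden state each time. The names `looped_moe_forward`, `router`, `experts`, and `exit_tol` are hypothetical, and the norm-based exit rule is an assumption chosen for simplicity; the source does not specify the exit criterion here.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def rand_matrix(rng, rows, cols, scale):
    """Random Gaussian matrix, used only to make the toy runnable."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def looped_moe_forward(x, router, experts, n_loops, exit_tol=None):
    """Apply ONE shared MoE layer up to n_loops times with top-1 routing.

    Returns the final hidden state and the expert index chosen at each loop.
    exit_tol is a hypothetical early-exit threshold on the size of the update.
    """
    h = list(x)
    routes = []
    for _ in range(n_loops):
        # The router weights are shared across loops, but h changes,
        # so the selected expert may change from pass to pass.
        logits = matvec(router, h)
        k = max(range(len(logits)), key=logits.__getitem__)
        routes.append(k)
        update = [math.tanh(v) for v in matvec(experts[k], h)]
        h_new = [a + b for a, b in zip(h, update)]  # residual connection
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_new, h)))
        h = h_new
        # Assumed exit rule: stop at a loop boundary once the update is small.
        if exit_tol is not None and delta < exit_tol:
            break
    return h, routes


rng = random.Random(0)
d, n_experts = 8, 4
router = rand_matrix(rng, n_experts, d, 1.0)
experts = [rand_matrix(rng, d, d, 0.5) for _ in range(n_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

h, routes = looped_moe_forward(x, router, experts, n_loops=6)
print("expert chosen per loop:", routes)
```

A dense looped model corresponds to `n_experts = 1`, where every pass applies the same transformation. Whether routing actually diverges across loops in a trained model is an empirical question that this toy does not answer.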

Relevance To This Wiki

This paper is the main sparse-capacity answer to the looped-depth bottleneck: layers whose weights are shared across loops may need to route to different experts on each pass to regain the diversity that unique layers would otherwise provide.

Limitations

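MoE routing adds serving complexity and memory pressure even when active compute is sparse, because every expert must stay resident while only a few run per token. The scaling claims should be checked against compute-matched baselines, and the routing should be checked for stability.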
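The gap between stored and active parameters can be made concrete with simple arithmetic. This is a minimal sketch under stated assumptions: each expert is a two-matrix feed-forward block without biases, the router is a single linear map, and the sizes below are illustrative values, not figures from the paper. The function name `moe_param_counts` is hypothetical.

```python
def moe_param_counts(d_model, d_ff, n_experts, top_k):
    """Return (stored, active) parameter counts for one MoE feed-forward layer.

    Assumes each expert has an up projection and a down projection, no biases,
    and that the router is a d_model x n_experts linear map.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * n_experts
    stored = n_experts * per_expert + router  # must stay resident in memory
    active = top_k * per_expert + router      # actually used per token
    return stored, active


# Illustrative sizes only.
stored, active = moe_param_counts(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
print(f"stored={stored:,} active={active:,} ratio={stored / active:.2f}")
```

In a looped model the same stored weights are reused on every pass, so looping adds compute rather than memory, but the memory footprint of the full expert set is paid regardless of how sparse the routing is.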

Foundation TSFM Relevance

Adjacent to dynamic compute and mixture-of-experts for time-series models, especially early exits and budgeted recurrent depth.

Open Questions

  • What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
  • Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?
  • Do loop-boundary early exits translate into end-to-end autoregressive throughput once routing and batching overheads are included?
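The last question can be framed with a small calculation. The sketch below assumes a naive lockstep batch in which every sequence keeps looping until the slowest one exits; real serving stacks may regroup sequences dynamically, and routing overhead is ignored entirely. The function name `batch_loop_cost` and the exit counts are hypothetical.

```python
def batch_loop_cost(exit_loops, max_loops):
    """Compare ideal and lockstep loop cost as fractions of the full loop budget.

    exit_loops: loop index at which each sequence in the batch would exit.
    ideal: each sequence stops independently (mean exit loop).
    lockstep: the whole batch runs until the slowest sequence exits (max).
    """
    ideal = sum(exit_loops) / len(exit_loops)
    lockstep = max(exit_loops)
    return ideal / max_loops, lockstep / max_loops


# Hypothetical batch: three easy sequences and one that needs every loop.
ideal_frac, lockstep_frac = batch_loop_cost([2, 2, 3, 8], max_loops=8)
print(f"ideal={ideal_frac:.3f} lockstep={lockstep_frac:.3f}")
```

Under these assumptions a single hard sequence removes the savings for the whole batch, which is why per-token exit statistics alone do not settle the throughput question.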