MoDA
Summary
MoDA, or Mixture-of-Depths Attention, is a Transformer attention operator that lets each head jointly attend to current sequence key/value pairs and depth key/value memories from previous layers.
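The joint attention described above can be sketched for a single head and a single query position. This is a minimal illustration, not the paper's implementation. It assumes that "jointly attend" means one softmax over the concatenated sequence and depth key/value pairs. The function name `joint_depth_attention` and its arguments are hypothetical, and projections, masking, and multi-head handling are left out.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_depth_attention(query, seq_kv, depth_kv):
    """Attend over sequence and depth key/value pairs with one shared softmax.

    query    : list of floats, the query vector for one position.
    seq_kv   : list of (key, value) pairs visible in the current layer's sequence.
    depth_kv : list of (key, value) pairs kept from previous layers (assumed layout).
    Returns (output, sequence_weights, depth_weights).
    """
    pairs = list(seq_kv) + list(depth_kv)
    scale = 1.0 / math.sqrt(len(query))
    scores = [_dot(query, key) * scale for key, _ in pairs]
    weights = _softmax(scores)  # one normalization across both sources
    dim = len(pairs[0][1])
    output = [sum(w * value[i] for w, (_, value) in zip(weights, pairs))
              for i in range(dim)]
    n_seq = len(seq_kv)
    return output, weights[:n_seq], weights[n_seq:]

# Toy inputs: two sequence pairs and one depth memory from an earlier layer.
query = [1.0, 0.0]
seq_kv = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
depth_kv = [([1.0, 0.0], [2.0, 2.0])]
output, seq_w, depth_w = joint_depth_attention(query, seq_kv, depth_kv)
```

Because the weights come from a single softmax, sequence and depth entries compete for the same probability mass, so a depth memory whose key matches the query as well as a sequence key does receives the same weight.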
Role In The Wiki
MoDA is the current source-backed object for content-based inter-layer retrieval. It is relevant to intermediate-layer representations, depth scaling, dynamic compute, and the open question of whether layer aggregation should be fixed, learned, sparse, or attention-based.
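To make the four aggregation options concrete, the sketch below combines a list of per-layer state vectors in each style. The definitions are illustrative assumptions rather than formulations taken from the MoDA paper, and all function names are hypothetical.

```python
import math

def _softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def _scores(states, query):
    # Content-based relevance of each layer state to the query.
    return [sum(q * x for q, x in zip(query, state)) for state in states]

def aggregate_fixed(states):
    """Fixed: uniform average over layers, with no parameters."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]

def aggregate_learned(states, layer_weights):
    """Learned: static per-layer weights, identical for every input."""
    return [sum(w * x for w, x in zip(layer_weights, column))
            for column in zip(*states)]

def aggregate_attention(states, query):
    """Attention-based: weights depend on the content of each layer state."""
    return aggregate_learned(states, _softmax(_scores(states, query)))

def aggregate_sparse(states, query, k):
    """Sparse: keep the k highest-scoring layers, then attend over those."""
    scores = _scores(states, query)
    keep = sorted(range(len(states)), key=lambda i: scores[i], reverse=True)[:k]
    return aggregate_attention([states[i] for i in keep], query)

# Toy inputs: three layer states of dimension two.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 1.0]
```

In this framing, only the last two variants are content-based: their weights change with the query, which is the property that places MoDA on the attention-based side of the question.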
Official Artifacts
- Paper: https://arxiv.org/abs/2603.15619
- Blog: https://lh-zhu.github.io/The-Second-Half-of-Model-Architecture/
- Code: https://github.com/hustvl/MoDA
- Official X thread: https://x.com/lianghui_zhu/status/2045868757869080695
Evidence
Relation To Foundation TSFM Agenda
For the agenda mapping, use the source-level entry moda-2026. At the entity level, MoDA should remain the object card for the method, code, blog, and thread.
The core transfer hypothesis is narrow: MoDA may help future time-series or world-model systems retrieve useful intermediate state across depth, but the current evidence is language-model architecture evidence, not numeric time-series or control evidence.