MoDA

Summary

MoDA, or Mixture-of-Depths Attention, is a Transformer attention operator that lets each head jointly attend to the sequence key/value pairs of the current layer and to depth key/value memories carried over from previous layers.
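
As a rough illustration of what "jointly attend" can mean, the sketch below computes single-head attention for one query over the concatenation of sequence keys/values and depth keys/values, with one shared softmax. This is a minimal sketch under stated assumptions, not the official MoDA implementation: the function name joint_depth_attention, the shared softmax over the concatenated set, the scaled dot-product scoring, and the toy vectors are all illustrative choices rather than details confirmed by this page.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_depth_attention(query, seq_keys, seq_values, depth_keys, depth_values):
    """Hypothetical helper: one head, one query position.

    Assumption: sequence and depth entries compete in a single softmax.
    """
    # Concatenate current-layer sequence entries with depth memories.
    keys = seq_keys + depth_keys
    values = seq_values + depth_values

    # Standard scaled dot-product scores over the combined key set.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])

    # Weighted sum of the combined values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy inputs (made up): two sequence positions, one depth memory.
query = [1.0, 0.0]
seq_keys = [[1.0, 0.0], [0.0, 1.0]]
seq_values = [[1.0, 0.0], [0.0, 1.0]]
depth_keys = [[1.0, 0.0]]
depth_values = [[0.5, 0.5]]

output, weights = joint_depth_attention(
    query, seq_keys, seq_values, depth_keys, depth_values
)
print(weights)
print(output)
```

In this toy setup the depth memory shares a key with the first sequence position, so the two receive equal weight. The depth entry is treated as just another candidate in the same retrieval step.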

Role In The Wiki

MoDA is the current source-backed object for content-based inter-layer retrieval. It is relevant to intermediate-layer representations, depth scaling, dynamic compute, and the open question of whether layer aggregation should be fixed, learned, sparse, or attention-based.
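
To make the four aggregation options concrete, the sketch below combines a list of per-layer hidden vectors in each style. It is an illustrative toy, not code from the MoDA paper or repository: the function names, the uniform average used for "fixed", the top-k rule used for "sparse", and the example vectors are all assumptions chosen for clarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, layers):
    """Combine per-layer vectors with the given scalar weights."""
    dim = len(layers[0])
    return [sum(w * h[i] for w, h in zip(weights, layers)) for i in range(dim)]


def fixed_aggregate(layers):
    """Fixed: uniform average, no parameters."""
    n = len(layers)
    return weighted_sum([1.0 / n] * n, layers)


def learned_aggregate(layers, logits):
    """Learned: one trainable logit per layer, independent of the input."""
    return weighted_sum(softmax(logits), layers)


def sparse_aggregate(layers, logits, k):
    """Sparse: keep only the k layers with the largest logits."""
    top = sorted(range(len(layers)), key=lambda i: logits[i], reverse=True)[:k]
    return weighted_sum(softmax([logits[i] for i in top]), [layers[i] for i in top])


def attention_aggregate(layers, query):
    """Attention-based: weights depend on the content of the query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, h) * scale for h in layers]
    return weighted_sum(softmax(scores), layers)


# Toy per-layer hidden states for one token (made up).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fixed_out = fixed_aggregate(layers)
learned_out = learned_aggregate(layers, [0.0, 0.0, 0.0])
sparse_out = sparse_aggregate(layers, [5.0, 0.0, 0.0], k=1)
attn_out = attention_aggregate(layers, [1.0, 0.0])
```

Only the last variant changes its weights with the query, which is the property that content-based inter-layer retrieval refers to here.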

Official Artifacts

Paper: https://arxiv.org/abs/2603.15619
Blog: https://lh-zhu.github.io/The-Second-Half-of-Model-Architecture/
Code: https://github.com/hustvl/MoDA
Official X thread: https://x.com/lianghui_zhu/status/2045868757869080695

Evidence

Mixture-of-Depths Attention

Relation To Foundation TSFM Agenda

Use the source-level agenda mapping in moda-2026. At the entity level, MoDA should stay as the object card for the method, code, blog, and thread.

The core transfer hypothesis is narrow: MoDA may help future time-series or world-model systems retrieve useful intermediate state across depth, but the current evidence is language-model architecture evidence, not numeric time-series or control evidence.

Alex Open Research Wiki
