MiniMax Sparse Attention

Summary

MiniMax Sparse Attention (MSA) is a learned block-sparse softmax attention mechanism built on Grouped-Query Attention. A lightweight Index Branch selects Top- key-value blocks independently for each GQA group, and the Main Branch computes exact softmax attention over the selected blocks.

Role In The Wiki

MSA is the current long-context sparse-attention anchor for this KB. Its value is the combined algorithm-plus-kernel claim: content-dependent block selection can keep much of dense GQA quality while making 1M-context prefill and decode materially cheaper.

For time-series and world-model work, MSA is an architecture analogy, not direct TSFM evidence. It suggests a way to route compute to relevant history spans, but the unselected spans are invisible to the layer. Any transfer to telemetry, graph time series, or action-conditioned streams MUST include preservation probes for rare regimes, event timing, exogenous variables, and action history.

Official Artifacts

Evidence