Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Source

Core Claim

Mamba-2 builds a bridge between selective SSMs and attention through structured state space duality: SSM sequence transformations can be viewed as semiseparable matrix mixers, making it possible to borrow attention-style algorithms and systems ideas while retaining recurrent inference.
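The duality can be checked numerically on a toy example. The sketch below is an illustration written for this note, not the paper's code: it assumes the scalar-transition case (each step multiplies the state by one scalar a_t), and the function names `ssm_recurrent`, `semiseparable_matrix`, and `ssm_quadratic` are hypothetical. It computes the same sequence transformation twice, once as a recurrence over a hidden state and once as multiplication by a lower-triangular semiseparable matrix.

```python
def ssm_recurrent(a, B, C, x):
    """Recurrent view: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>.

    a: list of T scalars, B and C: lists of T vectors of length N,
    x: list of T scalar inputs. All names are assumptions for this sketch.
    """
    h = [0.0] * len(B[0])
    y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        y.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return y


def semiseparable_matrix(a, B, C):
    """Matrix view: M[j][i] = <C_j, B_i> * a_{i+1} * ... * a_j for j >= i, else 0."""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for j in range(T):
        decay = 1.0  # empty product when i == j
        for i in range(j, -1, -1):
            score = sum(c * b for c, b in zip(C[j], B[i]))
            M[j][i] = score * decay
            decay *= a[i]  # include a_i before moving on to column i - 1
    return M


def ssm_quadratic(a, B, C, x):
    """Attention-like view: materialize M, then compute y = M x."""
    M = semiseparable_matrix(a, B, C)
    return [sum(m * x_i for m, x_i in zip(row, x)) for row in M]


# Small fixed inputs: T = 5 time steps, state size N = 2.
a = [0.9, 0.8, 0.7, 0.6, 0.5]
B = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.4], [-0.6, 0.1], [0.7, 0.7]]
C = [[0.5, 1.0], [1.0, -0.5], [0.25, 0.75], [-1.0, 0.5], [0.3, 0.9]]
x = [1.0, -2.0, 0.5, 3.0, -1.0]

y_rec = ssm_recurrent(a, B, C, x)
y_mat = ssm_quadratic(a, B, C, x)
print(max(abs(u - v) for u, v in zip(y_rec, y_mat)))  # difference between the two views
```

The recurrent form touches each time step once and keeps only a length-N state, which is what makes constant-memory inference possible. The matrix form costs quadratic time in sequence length but exposes the transformation as a masked, attention-like product.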

Key Contributions

  • Identifies structured SSM transformations with semiseparable matrices and uses that lens to connect recurrent, scan, and attention-like quadratic forms.
  • Introduces the SSD algorithm, a block-decomposed semiseparable matrix multiplication method that is more hardware-friendly than Mamba’s selective scan because most of its work is matrix multiplication.
  • Designs the Mamba-2 block with larger state sizes and a structure that fits Transformer-style training systems, including tensor parallelism and sequence parallelism.
  • Shows that Mamba-2 Pareto-dominates Mamba and Transformer++ on perplexity and wall-clock time in the paper’s scaling experiments.
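The block decomposition behind the SSD bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's kernel: it assumes scalar transitions a_t and scalar inputs, the names `ssd_chunked` and `ssm_recurrent` are hypothetical, and the loops stand in for what an optimized implementation would batch as matrix multiplications. Each chunk handles its own diagonal block with the quadratic form, and a single state vector carries everything that happened in earlier chunks.

```python
def ssm_recurrent(a, B, C, x):
    """Reference recurrence: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    h = [0.0] * len(B[0])
    y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_k + b_k * x_t for h_k, b_k in zip(h, B_t)]
        y.append(sum(c_k * h_k for c_k, h_k in zip(C_t, h)))
    return y


def ssd_chunked(a, B, C, x, chunk=4):
    """Chunked evaluation of the same transformation (illustrative sketch only)."""
    T, n = len(a), len(B[0])
    y = [0.0] * T
    state = [0.0] * n  # hidden state at the end of the previous chunk
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        for j in range(start, end):
            # Diagonal block: attention-like quadratic form inside the chunk.
            decay = 1.0
            for i in range(j, start - 1, -1):
                score = sum(c * b for c, b in zip(C[j], B[i]))
                y[j] += score * decay * x[i]
                decay *= a[i]
            # Off-diagonal blocks: decay now equals a[start] * ... * a[j],
            # which is how much the incoming state has shrunk by step j.
            y[j] += decay * sum(c * s for c, s in zip(C[j], state))
        # Summarize this chunk's inputs into its contribution to the final state.
        chunk_state = [0.0] * n
        decay = 1.0
        for i in range(end - 1, start - 1, -1):
            for k in range(n):
                chunk_state[k] += decay * B[i][k] * x[i]
            decay *= a[i]
        # decay is now the product of a over the whole chunk: pass the state on.
        state = [decay * s + cs for s, cs in zip(state, chunk_state)]
    return y


a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85]
B = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.4], [-0.6, 0.1], [0.7, 0.7], [0.1, 0.2], [-0.4, 0.9]]
C = [[0.5, 1.0], [1.0, -0.5], [0.25, 0.75], [-1.0, 0.5], [0.3, 0.9], [0.8, -0.2], [0.6, 0.6]]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.25]

y_ref = ssm_recurrent(a, B, C, x)
y_chunked = ssd_chunked(a, B, C, x, chunk=3)
print(max(abs(u - v) for u, v in zip(y_ref, y_chunked)))  # difference from the reference
```

With `chunk=1` the sketch reduces to the plain recurrence, and with a chunk at least as long as the sequence it reduces to the full quadratic form. Intermediate chunk sizes are where the trade-off lives: the within-chunk work is dense and parallel, while only one state per chunk has to be passed along sequentially.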

Evidence And Results

The paper reports a dedicated SSD implementation that is 2-8x faster than the optimized Mamba selective scan and can use much larger recurrent state sizes with limited slowdown. It also reports that Mamba-2 with 2.7B parameters trained on 300B Pile tokens outperforms Mamba-2.8B, Pythia-2.8B, and Pythia-6.9B in the paper’s downstream comparison.

Relevance To This Wiki

Mamba-2 is the main mathematical and systems bridge between attention and recurrent state-space models. For time-series and world-model pages, it supplies a clean vocabulary for talking about compact latent-state mixers as structured matrices rather than only as RNNs, convolutions, or attention approximations.

It is also the immediate background for ParaRNN: Mamba-2 keeps efficient parallel training by staying in a linear recurrent family, while ParaRNN asks whether nonlinear recurrent cells can be made parallel enough to compete at large scale.

Limitations

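  • SSD trades some transition expressivity for a more hardware-friendly semiseparable structure: each transition matrix is restricted to a scalar times the identity, where Mamba’s selective SSM allows a diagonal one.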
  • The paper is mostly about token-sequence language modeling and retrieval-style synthetic tasks, not direct numeric time-series forecasting or action-conditioned dynamics.
  • The state-space duality framing does not automatically cover arbitrary nonlinear recurrent dynamics; ParaRNN occupies that next step.
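The expressivity trade-off in the first limitation can be written out explicitly. The notation below is chosen for this note and may differ from the paper's: h_t is the hidden state, A_t the transition matrix, and B_t, C_t the input and output projections. The point is only to show what the matrix entries look like in general and what they collapse to when A_t is a scalar times the identity.

```latex
\begin{align*}
% Linear recurrence with time-varying parameters
h_t &= A_t h_{t-1} + B_t x_t, \qquad y_t = C_t^\top h_t \\
% Unrolled: y = M x with a lower-triangular semiseparable matrix M
M_{ji} &= C_j^\top A_j A_{j-1} \cdots A_{i+1} B_i \quad (j \ge i), \qquad M_{ji} = 0 \quad (j < i) \\
% Scalar-identity restriction A_t = a_t I
M_{ji} &= (a_j a_{j-1} \cdots a_{i+1}) \, C_j^\top B_i
\;\;\Longleftrightarrow\;\;
M = L \circ (C B^\top), \qquad L_{ji} = a_j \cdots a_{i+1} \ (j \ge i)
\end{align*}
```

Here C and B stack the vectors C_t and B_t as rows, and \circ is the elementwise product. In the restricted case the transition contributes only a scalar decay mask L on top of an attention-like score matrix C B^\top, so every state coordinate decays at the same rate; the recurrence stays linear in h_t either way.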

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Streaming state, long context, and constant updates | partially closes | Gives recurrent, scan, and structured-matrix views of SSMs, plus context/sequence parallelism for long sequences. | Evidence is language/retrieval centered; no always-on time-series state-maintenance benchmark. |
| Time-series scaling and efficiency | partially closes | SSD uses semiseparable structure and matmul-friendly block algorithms, reporting faster kernels and larger state sizes than earlier Mamba scans. | Efficiency gains do not prove better latent dynamics, multivariate channel handling, or control utility. |
| Native multivariate encoding and high-channel scaling | insufficient evidence | Multihead SSM patterns provide sequence-mixer design vocabulary. | The paper does not model channel metadata, topology, or thousands of numeric series. |

Open Questions

  • Which semiseparable-matrix constraints are harmless for time-series passive dynamics, and which prevent useful state tracking?
  • Can structured state space duality guide efficient non-causal or bidirectional time-series encoders?
  • Where is the boundary between SSD-style recurrent state and attention-style memory for long-context event streams?