Associative Recurrent Memory Transformer
Source
- Raw Markdown: paper_armt-2024.md
- PDF: paper_armt-2024.pdf
- Preprint: arXiv 2407.04841v1
- Latest arXiv version: arXiv 2407.04841
- Official code: RodkinIvan/associative-recurrent-memory-transformer
Version Note
Alex provided the explicit v1 arXiv URL, so the raw artifacts under papers/armt-2024/ preserve that version. arXiv currently lists a newer v2 from 2025-02-13; if the wiki later needs the latest revision, regenerate or add a version note rather than silently replacing this source.
Credibility
ARMT is a 2024 arXiv paper associated with the ICML 2024 Next Generation of Sequence Modeling Architectures Workshop. It is not a main-conference paper, but it comes from the RMT author lineage and has an official implementation.
Core Claim
ARMT extends RMT by adding layerwise associative memory. Instead of relying only on a fixed number of recurrent memory tokens, ARMT stores key/value associations derived from memory-token states and uses them to recall information across very long segmented contexts.
The wiki-relevant claim concerns memory capacity. RMT shows that memory tokens can carry state between segments; ARMT asks whether an associative matrix can greatly increase the capacity of that carried state while preserving local Transformer self-attention.
Key Contributions
- Adds associative memory blocks to an RMT-style segment-level recurrent Transformer.
- Uses layerwise association matrices updated from recurrent memory-token states.
- Evaluates memory capacity through associative retrieval remember/rewrite tasks.
- Reports strong BABILong results, including a single-fact QA result over 50M-token contexts.
- Compares against RMT and Mamba on long-context memory tasks.
Method Notes
At each segment and layer, ARMT updates an associative memory matrix from the previous segment’s memory tokens. In simplified form, memory tokens produce keys and values, and the association matrix stores a delta-style key/value update:
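One way to write that update is sketched here, with the layer index dropped. The symbols are assumed notation for this note rather than a quotation: $m_i$ are the memory-token states, $A_s$ is the association matrix after segment $s$, $z_s$ is a normalization vector, $\phi$ is a nonlinear feature map, and $\sigma$ is the sigmoid. Check the paper for the exact form.

```latex
\begin{align}
k_i &= W_K m_i, \qquad v_i = W_V m_i, \qquad \beta_i = \sigma(W_\beta m_i) \\
\bar{v}_i &= \frac{A_{s-1}\,\phi(k_i)}{z_{s-1}^{\top}\phi(k_i)}, \qquad
\gamma_i = 1 - \frac{z_{s-1}^{\top}\phi(k_i)}{\lVert \phi(k_i) \rVert^{2}} \\
A_s &= A_{s-1} + \sum_i \beta_i \,(v_i - \bar{v}_i) \otimes \phi(k_i), \qquad
z_s = z_{s-1} + \sum_i \gamma_i \,\phi(k_i)
\end{align}
```

The term $v_i - \bar{v}_i$ is what makes the rule delta-style: the memory stores only the difference between the new value and what it already returns for that key.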
Tokens in the current segment can then query the layer’s associative memory:
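A sketch of the read step in the same spirit, again in assumed notation: $x_j$ is a hidden state of the current segment, $A$ and $z$ are the layer's association matrix and normalization vector carried into this segment, and $\phi$ is the feature map applied to keys during writing.

```latex
\begin{align}
q_j &= W_Q x_j \\
\mathrm{AssocRead}(x_j) &= \frac{A\,\phi(q_j)}{z^{\top}\phi(q_j)}
\end{align}
```

The read is a normalized linear lookup, so its cost does not grow with the number of past segments.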
This makes ARMT a hybrid of local full attention and recurrent associative memory. It is relevant to time-series work as an architectural memory primitive, not as direct evidence for numeric telemetry or action-conditioned dynamics.
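A minimal sketch of the write and read steps described above, in pure Python. It assumes an ELU+1 feature map `phi`, takes keys and values directly instead of projecting memory tokens through learned matrices, and writes one pair at a time. The class and method names are illustrative and are not taken from the official code.

```python
import math


def phi(x):
    # Assumed feature map (ELU(x) + 1, always positive); a stand-in only.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AssociativeMemory:
    """Toy delta-rule associative memory (illustrative, not the official code)."""

    def __init__(self, d_key, d_value):
        self.A = [[0.0] * d_key for _ in range(d_value)]  # d_value x d_key matrix
        self.z = [0.0] * d_key                            # normalization vector

    def read(self, query):
        f = phi(query)
        denom = dot(self.z, f)
        if abs(denom) < 1e-12:          # empty memory: nothing to recall
            return [0.0] * len(self.A)
        return [dot(row, f) / denom for row in self.A]

    def write(self, key, value, beta=1.0):
        f = phi(key)
        old = self.read(key)            # what the memory currently returns
        gamma = 1.0 - dot(self.z, f) / dot(f, f)
        for r, row in enumerate(self.A):
            delta = beta * (value[r] - old[r])   # store only the correction
            for c in range(len(row)):
                row[c] += delta * f[c]
        self.z = [zc + gamma * fc for zc, fc in zip(self.z, f)]


mem = AssociativeMemory(d_key=3, d_value=2)
key = [1.0, -0.5, 0.2]
mem.write(key, [4.0, -1.0])
first = mem.read(key)     # close to [4.0, -1.0]
mem.write(key, [7.0, 2.0])
second = mem.read(key)    # close to [7.0, 2.0]
```

Writing a second value under the same key replaces the first rather than blending with it, because the write subtracts whatever the memory currently returns for that key. With several distinct, non-orthogonal keys the recall is approximate, not exact.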
Evidence And Results
The paper’s most important result concerns long-context memory retention. It reports that ARMT, trained on 16k-token contexts, reaches 79.9% accuracy on BABILong QA1 (single-fact QA) at 50M tokens and performs strongly up to 10M tokens on more complex BABILong tasks.
The associative retrieval experiments test both remembering unique key/value pairs and rewriting values for repeated keys. The rewrite task matters because static memory is not enough for evolving systems: useful state needs overwrite behavior, not just accumulation.
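The difference between accumulation and overwrite can be illustrated with a toy comparison. The sketch below is not the paper's experiment: it assumes one-hot keys and an identity feature map so that recall is exact, and it contrasts a purely additive write with a delta-rule write. All function names are hypothetical.

```python
def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def new_memory(d_key, d_value):
    return {"A": [[0.0] * d_key for _ in range(d_value)], "z": [0.0] * d_key}


def read(mem, key):
    denom = dot(mem["z"], key)
    if denom == 0.0:                      # key never written
        return [0.0] * len(mem["A"])
    return [dot(row, key) / denom for row in mem["A"]]


def write(mem, key, value, delta_rule):
    if delta_rule:
        old = read(mem, key)              # subtract the currently stored value
        gamma = 1.0 - dot(mem["z"], key) / dot(key, key)
    else:
        old = [0.0] * len(value)          # additive: just accumulate
        gamma = 1.0
    for row, v, o in zip(mem["A"], value, old):
        for c, kc in enumerate(key):
            row[c] += (v - o) * kc
    mem["z"] = [zc + gamma * kc for zc, kc in zip(mem["z"], key)]


def run_rewrite(delta_rule):
    mem = new_memory(d_key=3, d_value=1)
    write(mem, one_hot(0, 3), [10.0], delta_rule)  # remember: key 0 -> 10
    write(mem, one_hot(1, 3), [5.0], delta_rule)   # remember: key 1 -> 5
    write(mem, one_hot(0, 3), [20.0], delta_rule)  # rewrite:  key 0 -> 20
    return read(mem, one_hot(0, 3)), read(mem, one_hot(1, 3))


additive = run_rewrite(delta_rule=False)  # ([15.0], [5.0]): old and new blended
delta = run_rewrite(delta_rule=True)      # ([20.0], [5.0]): old value replaced
```

Both variants remember the untouched key, but only the delta-rule write returns the newest value for the rewritten key; the additive write returns the average of the stale and fresh values.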
Limitations
- Segment processing is sequential, and the paper notes that ARMT lacks an efficient parallel implementation.
- The paper reports that language-modeling training remains difficult; ARMT tends to keep only the last segment in memory.
- Evidence is long-context language and synthetic associative retrieval, not multivariate time series, graph telemetry, or action-conditioned control.
- The strongest 50M-token result is narrow: single-fact QA, with broader multi-hop tasks evaluated at shorter effective lengths.
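The sequential dependency noted in the first limitation can be made concrete with a toy loop. In this sketch `process_segment` is a hypothetical stand-in for one forward pass, and the "memory" is only a running count plus the last token seen. The point is that each step needs the memory produced by the previous one, so the number of sequential steps grows with context length divided by segment length.

```python
import math


def split_into_segments(tokens, segment_len):
    """Cut a token list into consecutive fixed-length segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def process_segment(segment, memory):
    """Hypothetical stand-in for one forward pass over a segment.

    Returns the memory handed to the next segment. A real model would
    update memory tokens and association matrices here.
    """
    count, _ = memory
    return (count + len(segment), segment[-1])


def run_recurrent(tokens, segment_len):
    memory = (0, None)
    steps = 0
    for segment in split_into_segments(tokens, segment_len):
        # Segment s cannot start until segment s-1 has produced its memory.
        memory = process_segment(segment, memory)
        steps += 1
    return memory, steps


tokens = list(range(10))
memory, steps = run_recurrent(tokens, segment_len=4)
assert steps == math.ceil(len(tokens) / 4)
```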
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state, long context, and constant updates | adjacent | Adds recurrent associative memory for very long segmented contexts. | Needs always-on numeric state updates, bounded-latency serving, and missing-channel tests. |
| Native multivariate encoding and high-channel scaling | adjacent | Associative memory could store key/value summaries rather than raw attention context. | No evidence on high-channel multivariate time series or graph time series. |
| Event streams and mutable state | adjacent | Rewrite-task evaluation directly probes overwriting old values with newer values. | Needs typed event streams, actions, interventions, and state-validity checks. |
| Dynamic compute and memory efficiency | adjacent | Keeps local self-attention while moving long-range information into recurrent associative memory. | Needs matched compute comparisons against modern SSM/RWKV/Mamba and retrieval baselines. |
Links Into The Wiki
- Associative Recurrent Memory Transformer
- Recurrent Memory Transformer
- Efficient Recurrent Sequence Models
- Looped Transformers And Test-Time Memory
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
- Mamba
Open Questions
- Can associative memory track mutable service or business state without accumulating stale keys?
- What telemetry key/query design would let ARMT-like memory distinguish entities, channels, events, and interventions?
- Does recurrent associative memory beat retrieval or compact SSM state under matched latency and memory budgets?