Associative Recurrent Memory Transformer

Source

Raw Markdown: paper_armt-2024.md
PDF: paper_armt-2024.pdf
Preprint: arXiv 2407.04841v1
Latest arXiv version: arXiv 2407.04841
Official code: RodkinIvan/associative-recurrent-memory-transformer

Version Note

Alex provided the explicit v1 arXiv URL, so the raw artifacts under papers/armt-2024/ preserve that version. arXiv currently lists a newer v2 from 2025-02-13; if the wiki later needs the latest revision, regenerate or add a version note rather than silently replacing this source.

Credibility

ARMT is a 2024 arXiv paper associated with the ICML 2024 Next Generation of Sequence Modeling Architectures Workshop. It is not a main-conference paper, but it comes from the RMT author lineage and has an official implementation.

Core Claim

ARMT extends RMT by adding layerwise associative memory. Instead of relying only on a fixed number of recurrent memory tokens, ARMT stores key/value associations derived from memory-token states and uses them to recall information across very long segmented contexts.

The wiki-relevant claim is memory capacity: RMT shows that memory tokens can carry state between segments; ARMT asks whether an associative matrix can make that carried state much more capacious while preserving local Transformer self-attention.
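The segment-level recurrence behind both RMT and ARMT can be sketched as a plain loop: split a long input into fixed-size segments, process them in order, and hand a fixed-size state from one segment to the next. This is a minimal illustration, not the paper's implementation. The names `run_segments` and `toy_step` are hypothetical, and the bucket-count state is only a stand-in for memory tokens or association matrices.

```python
def split_into_segments(tokens, segment_len):
    """Cut a long token list into consecutive chunks of at most segment_len."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_segments(tokens, segment_len, init_state, step):
    """Process segments in order, carrying a fixed-size state between them."""
    state = init_state
    outputs = []
    for segment in split_into_segments(tokens, segment_len):
        out, state = step(segment, state)  # the state is the only link between segments
        outputs.append(out)
    return outputs, state


def toy_step(segment, state):
    # Hypothetical stand-in for a Transformer pass over one segment:
    # fold the segment into a state whose size never grows.
    new_state = list(state)
    for tok in segment:
        new_state[tok % len(new_state)] += 1
    return len(segment), new_state


outputs, final_state = run_segments(list(range(10)), 4, [0, 0, 0, 0], toy_step)
# outputs lists the segment lengths; final_state keeps its initial size of 4.
```

The point of the sketch is that the carried state has the same size after ten tokens as it would after ten million, which is why its capacity becomes the central question.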

Key Contributions

Adds associative memory blocks to an RMT-style segment-level recurrent Transformer.
Uses layerwise association matrices updated from recurrent memory-token states.
Evaluates memory capacity with associative retrieval tasks that require remembering and rewriting key/value pairs.
Reports strong BABILong results, including a single-fact QA result over 50M-token contexts.
Compares against RMT and Mamba on long-context memory tasks.

Method Notes

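At each segment and layer, ARMT updates an associative memory matrix from the previous segment’s memory tokens. In simplified form, memory tokens produce keys k_i and values v_i, and the association matrix A^l of layer l stores a delta-style key/value update, where \overline{v}_i is the value the memory already returns for k_i, β_i is a write-strength gate, and ϕ is a nonlinear feature map: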

A_i^l = A_{i-1}^l + β_i (v_i - \overline{v}_i) \otimes ϕ(k_i)

Tokens in the current segment can then query the layer’s associative memory, where q_j is the query for token j and z_s^l is a normalization vector:

y_j = \frac{A_s^l ϕ(q_j)}{(z_s^l)^T ϕ(q_j)}

This makes ARMT a hybrid of local full attention and recurrent associative memory. It is relevant to time-series work as an architectural memory primitive, not as direct evidence for numeric telemetry or action-conditioned dynamics.
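The update and read equations above can be sketched for a single layer in plain Python. This is a minimal illustration under stated assumptions, not the official implementation. The feature map `phi` (here `elu(x) + 1`), the rule that updates the normalization vector `z`, and the class name `AssociativeMemory` are all assumptions that this note does not specify, and `beta` is passed in directly rather than computed from a memory token.

```python
import math


def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    # The paper's own nonlinearity may differ.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


class AssociativeMemory:
    """Toy single-layer associative memory (hypothetical helper)."""

    def __init__(self, d_key, d_val):
        self.A = [[0.0] * d_key for _ in range(d_val)]  # association matrix, d_val x d_key
        self.z = [0.0] * d_key                          # normalization vector

    def _read_features(self, f):
        denom = dot(self.z, f)
        if abs(denom) < 1e-12:  # empty memory: nothing to recall
            return [0.0] * len(self.A)
        return [dot(row, f) / denom for row in self.A]

    def read(self, q):
        # y = A phi(q) / (z^T phi(q))
        return self._read_features(phi(q))

    def write(self, k, v, beta):
        fk = phi(k)
        v_bar = self._read_features(fk)  # value the memory currently returns for this key
        # Assumed normalizer rule: gamma drops to 0 once z already covers phi(k).
        gamma = 1.0 - dot(self.z, fk) / dot(fk, fk)
        for r in range(len(self.A)):
            for c in range(len(fk)):
                # A <- A + beta * (v - v_bar) (outer product) phi(k)
                self.A[r][c] += beta * (v[r] - v_bar[r]) * fk[c]
        for c in range(len(fk)):
            self.z[c] += gamma * fk[c]


mem = AssociativeMemory(d_key=2, d_val=2)
mem.write(k=[1.0, -0.5], v=[2.0, 3.0], beta=1.0)
first = mem.read([1.0, -0.5])   # close to [2.0, 3.0]
mem.write(k=[1.0, -0.5], v=[5.0, 7.0], beta=1.0)
second = mem.read([1.0, -0.5])  # close to [5.0, 7.0]: the old value was replaced
```

With `beta=1.0` a repeated key is fully overwritten. Smaller values of `beta` would blend the old and new values, which is one way to read the gate's role in the update.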

Evidence And Results

The paper’s most important result is on long-context memory retention. It reports that ARMT, trained on 16k-token contexts, performs strongly up to 50M tokens on BABILong QA1 (roughly 3,000 times the training length) and up to 10M tokens on more complex BABILong tasks.

The associative retrieval experiments test both remembering unique key/value pairs and rewriting values for repeated keys. The rewrite task matters because static memory is not enough for evolving systems: useful state needs overwrite behavior, not just accumulation.
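The difference between accumulation and overwrite can be seen in a stripped-down example. This is a sketch under simplifying assumptions that are not the paper's setup: a one-hot key, an identity feature map, and no normalization. The helpers `write_additive` and `write_delta` are hypothetical names. The same key is written twice with different values, once into a purely additive memory and once into a delta-rule memory.

```python
def outer_add(A, v, k, scale=1.0):
    """In place: A += scale * (v outer k)."""
    for r in range(len(v)):
        for c in range(len(k)):
            A[r][c] += scale * v[r] * k[c]


def read(A, k):
    """Recall the value associated with key k (matrix-vector product)."""
    return [sum(a * x for a, x in zip(row, k)) for row in A]


def write_additive(A, k, v):
    # Pure accumulation: every write is added on top of what is already stored.
    outer_add(A, v, k)


def write_delta(A, k, v, beta=1.0):
    # Delta rule: write only the difference between the new and the stored value.
    old = read(A, k)
    outer_add(A, [vi - oi for vi, oi in zip(v, old)], k, beta)


key = [1.0, 0.0]
additive = [[0.0, 0.0], [0.0, 0.0]]
delta = [[0.0, 0.0], [0.0, 0.0]]
for value in ([1.0, 0.0], [0.0, 1.0]):  # same key, then a rewritten value
    write_additive(additive, key, value)
    write_delta(delta, key, value)

additive_readout = read(additive, key)  # [1.0, 1.0]: old and new values are mixed
delta_readout = read(delta, key)        # [0.0, 1.0]: only the latest value remains
```

The additive memory returns a blend of both writes, while the delta-rule memory returns only the most recent value, which is the behavior the rewrite task probes.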

Limitations

Segment processing is sequential, and the paper notes that ARMT lacks an efficient parallel implementation.
The paper reports that language-modeling training remains difficult: ARMT tends to keep only the last segment in memory.
Evidence is long-context language and synthetic associative retrieval, not multivariate time series, graph telemetry, or action-conditioned control.
The strongest 50M-token result is narrow: single-fact QA, with broader multi-hop tasks evaluated at shorter effective lengths.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state, long context, and constant updates	adjacent	Adds recurrent associative memory for very long segmented contexts.	Needs always-on numeric state updates, bounded-latency serving, and missing-channel tests.
Native multivariate encoding and high-channel scaling	adjacent	Associative memory could store key/value summaries rather than raw attention context.	No evidence on high-channel multivariate time series or graph time series.
Event streams and mutable state	adjacent	Rewrite-task evaluation directly probes overwriting old values with newer values.	Needs typed event streams, actions, interventions, and state-validity checks.
Dynamic compute and memory efficiency	adjacent	Keeps local self-attention while moving long-range information into recurrent associative memory.	Needs matched compute comparisons against modern SSM/RWKV/Mamba and retrieval baselines.

Links Into The Wiki

Open Questions

Can associative memory track mutable service or business state without accumulating stale keys?
What telemetry key/query design would let ARMT-like memory distinguish entities, channels, events, and interventions?
Does recurrent associative memory beat retrieval or compact SSM state under matched latency and memory budgets?
