Recurrent Memory Transformer
Source
- Raw Markdown: paper_rmt-2022.md
- PDF: paper_rmt-2022.pdf
- Preprint: arXiv 2207.06881
- Official code: booydar/LM-RMT
- Newer official implementation: booydar/recurrent-memory-transformer
Credibility
RMT is a NeurIPS 2022 paper. Although it is more than a year old, it is the root source for the RMT memory-token mechanism used by later long-context and memory-augmented decision models in this ingest batch.
Core Claim
Recurrent Memory Transformer adds learnable memory tokens to ordinary Transformer inputs and outputs, then passes the updated memory tokens between sequence segments. This gives the Transformer a compact recurrent state without changing the backbone attention block.
The important distinction from Transformer-XL is what gets carried forward: Transformer-XL caches many hidden states from prior segments, while RMT carries a small block of learned memory vectors that is processed together with each new segment and can be trained with backpropagation through time (BPTT) across segments.
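To make the difference in carried state concrete, the following is a rough counting sketch rather than a measurement. It assumes a Transformer-XL cache that keeps one hidden-state vector per cached position per layer, and an RMT that hands over only its memory-token outputs. The function names and the configuration values are hypothetical and are not taken from the paper.

```python
def transformer_xl_carried_vectors(cache_len: int, num_layers: int) -> int:
    """Vectors cached for the next segment, assuming one hidden state
    per cached position per layer."""
    return cache_len * num_layers


def rmt_carried_vectors(num_memory_tokens: int) -> int:
    """Vectors handed to the next segment: only the memory-token outputs."""
    return num_memory_tokens


# Hypothetical configuration, chosen only for illustration (not from the paper).
xl = transformer_xl_carried_vectors(cache_len=150, num_layers=16)
rmt = rmt_carried_vectors(num_memory_tokens=10)
print(f"Transformer-XL carries {xl} vectors; RMT carries {rmt}")
```

The count says nothing about model quality. It only shows why a small memory block is a much tighter bottleneck than a per-layer cache.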
Key Contributions
- Introduces a read/write memory-token layout for segment-level recurrent Transformers.
- Keeps the underlying Transformer architecture intact; recurrence is implemented by changing the input/output sequence contract.
- Compares RMT against Transformer and Transformer-XL on copy, reverse, associative retrieval, quadratic equations, WikiText-103, enwik8, and Hyperpartisan classification.
- Shows that RMT can match Transformer-XL language-modeling quality with fewer carried memory vectors in several settings.
- Demonstrates that RMT memory and Transformer-XL cache can be combined as long-term and short-term memory.
Method Notes
For causal decoder settings, RMT places memory tokens both before and after the segment. Prefix memory gives the segment read access to prior state; suffix memory can attend to the current segment and becomes the memory state for the next segment.
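The segment update can be read as:

$$
\begin{aligned}
\tilde{H}^{0}_{\tau} &= [H^{mem}_{\tau} \circ H^{0}_{\tau} \circ H^{mem}_{\tau}] \\
\bar{H}^{N}_{\tau} &= \mathrm{Transformer}(\tilde{H}^{0}_{\tau}) \\
[H^{read}_{\tau} \circ H^{N}_{\tau} \circ H^{write}_{\tau}] &:= \bar{H}^{N}_{\tau} \\
H^{mem}_{\tau+1} &:= H^{write}_{\tau}
\end{aligned}
$$

Here $\tau$ indexes segments, $N$ is the number of Transformer layers, $\circ$ denotes concatenation along the sequence dimension, $H^{0}_{\tau}$ is the embedded segment, and $H^{mem}_{\tau}$ is the memory block received from the previous segment.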
This is not a full world model by itself. It is a memory interface for processing long sequences with local Transformer compute and compact state propagation.
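The read/write layout in these method notes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The backbone is replaced by a toy causal function (`causal_prefix_mean`, a hypothetical stand-in for a causal Transformer), vectors are plain Python lists, and the initial memory is a fixed value rather than a learned embedding. All names are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sequence = List[Vector]


def causal_prefix_mean(seq: Sequence) -> Sequence:
    """Toy stand-in for a causal Transformer: position i outputs the mean of inputs 0..i."""
    out: Sequence = []
    running = [0.0] * len(seq[0])
    for i, vec in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / i for r in running])
    return out


def rmt_step(memory: Sequence, segment: Sequence,
             backbone: Callable[[Sequence], Sequence]) -> Tuple[Sequence, Sequence]:
    """Process one segment with the layout [memory | segment | memory].

    Assumes at least one memory token.
    """
    m = len(memory)
    outputs = backbone(memory + segment + memory)
    segment_out = outputs[m:m + len(segment)]
    new_memory = outputs[-m:]  # suffix (write) positions become the next memory
    return segment_out, new_memory


def run_rmt(segments: List[Sequence], initial_memory: Sequence,
            backbone: Callable[[Sequence], Sequence] = causal_prefix_mean
            ) -> Tuple[Sequence, Sequence]:
    """Run segments sequentially, passing the written memory forward."""
    memory = initial_memory
    all_outputs: Sequence = []
    for segment in segments:
        segment_out, memory = rmt_step(memory, segment, backbone)
        all_outputs.extend(segment_out)
    return all_outputs, memory


# Two segments of two one-dimensional "tokens" each, with one memory token.
segments = [[[1.0], [3.0]], [[5.0], [7.0]]]
outputs, final_memory = run_rmt(segments, initial_memory=[[0.0]])
print(outputs, final_memory)
```

Because the suffix positions come last under a causal backbone, they can summarize the whole current segment plus the incoming memory. That is why they, rather than the prefix positions, are passed on.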
Evidence And Results
The paper reports that RMT solves segmented copy, reverse, and quadratic-equation tasks where non-recurrent Transformer baselines fail and where Transformer-XL degrades as segment count increases.
On WikiText-103 and enwik8, the headline result is efficiency of carried state: RMT can reach Transformer-XL-like performance with fewer memory vectors, and a combined Transformer-XL + RMT configuration improves language modeling in the reported setup.
The Hyperpartisan classification experiment illustrates the wrapper interpretation in practice: memory tokens can be added around pretrained BERT/RoBERTa/DeBERTa/T5-style models to extend the text length they can effectively process.
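One practical consequence of wrapping a pretrained model is that memory tokens consume part of its fixed input budget. The sketch below shows one way a long token sequence might be split so that each segment plus its memory tokens still fits. The helper `split_into_segments` and its `memory_slots` parameter are hypothetical, the numbers are illustrative, and special tokens such as classification or separator tokens are ignored for simplicity.

```python
from typing import List


def split_into_segments(token_ids: List[int], max_input_len: int,
                        num_memory_tokens: int, memory_slots: int = 1) -> List[List[int]]:
    """Split tokens so that each segment plus its memory tokens fits in max_input_len.

    memory_slots is 1 if memory is only prepended and 2 if it is also appended.
    """
    payload = max_input_len - memory_slots * num_memory_tokens
    if payload <= 0:
        raise ValueError("memory tokens leave no room for segment tokens")
    return [token_ids[i:i + payload] for i in range(0, len(token_ids), payload)]


# Hypothetical document of 1200 token ids and a 512-token backbone limit.
token_ids = list(range(1200))
doc_segments = split_into_segments(token_ids, max_input_len=512, num_memory_tokens=10)
segment_lengths = [len(s) for s in doc_segments]
print(segment_lengths)
```

Raising the memory-token count shrinks the per-segment payload, so the number of sequential backbone calls grows with memory size for a fixed document length.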
Limitations
- Segment processing is still sequential at inference time.
- BPTT through multiple prior segments is memory-intensive and can be unstable at larger memory sizes.
- Evidence is mostly language, synthetic algorithmic tasks, and document classification, not numeric time series, event streams, actions, or control inputs.
- The paper does not evaluate a persistent serving contract for multivariate telemetry or action-conditioned rollouts.
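The memory cost of BPTT noted in the limitations above can be illustrated with simple bookkeeping. The sketch assumes that activations must be retained for every segment that gradients flow through, and that each segment carries memory tokens on both sides, as in the causal layout. The function names and numbers are hypothetical, and the count is in token positions rather than bytes.

```python
def bptt_window(segment_index: int, unroll_depth: int) -> range:
    """Segments whose activations must be kept when backpropagating from segment_index.

    unroll_depth is the number of prior segments that gradients flow through.
    """
    first = max(0, segment_index - unroll_depth)
    return range(first, segment_index + 1)


def stored_token_positions(segment_index: int, unroll_depth: int,
                           segment_len: int, num_memory_tokens: int) -> int:
    """Token positions retained for the backward pass, per layer."""
    per_segment = segment_len + 2 * num_memory_tokens  # [memory | segment | memory]
    return len(bptt_window(segment_index, unroll_depth)) * per_segment


# Hypothetical setting: backpropagate from segment 5 through 2 prior segments.
window = list(bptt_window(segment_index=5, unroll_depth=2))
positions = stored_token_positions(5, 2, segment_len=128, num_memory_tokens=10)
print(window, positions)
```

Retained activations grow linearly with the unroll depth, which is the tradeoff behind choosing how many prior segments to train through.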
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state, long context, and constant updates | adjacent | Carries compact learned memory tokens between segments and reduces the need to attend over all prior tokens. | Needs numeric time-series, topology, event-stream, and serving-state evaluations. |
| Native multivariate encoding and high-channel scaling | insufficient evidence | Memory tokens could store compressed cross-segment information. | No high-channel numeric or graph time-series experiment. |
| Causal structure, counterfactuals, and control | insufficient evidence | Later RATE adapts the RMT interface to action trajectories. | RMT itself has no action, control input, reward, or counterfactual dynamics interface. |
| Dynamic compute and memory efficiency | adjacent | Shows a tradeoff between memory-token count, BPTT depth, and Transformer-XL cache size. | Needs matched FLOPs, latency, and serving-memory comparisons on modern baselines. |
Links Into The Wiki
- Recurrent Memory Transformer
- Efficient Recurrent Sequence Models
- Looped Transformers And Test-Time Memory
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
- Recurrent Action Transformer with Memory
- Associative Recurrent Memory Transformer
Open Questions
- When should a time-series model use compact recurrent memory tokens rather than an SSM/RWKV/Mamba-style recurrent state?
- Can memory-token BPTT be made stable enough for long numeric trajectories or action-conditioned rollouts?
- Which telemetry state belongs in explicit memory tokens versus ordinary cached hidden state?