Recurrent Memory Transformer

Source

Raw Markdown: paper_rmt-2022.md
PDF: paper_rmt-2022.pdf
Preprint: arXiv 2207.06881
Official code: booydar/LM-RMT
Newer official implementation: booydar/recurrent-memory-transformer

Credibility

RMT was published at NeurIPS 2022. Although the paper is more than a year old, it is the root source for the RMT memory-token mechanism used by later long-context and memory-augmented decision models in this ingest batch.

Core Claim

Recurrent Memory Transformer adds learnable memory tokens to ordinary Transformer inputs and outputs, then passes the updated memory tokens between sequence segments. This gives the Transformer a compact recurrent state without changing the backbone attention block.

The important distinction from Transformer-XL is what gets carried forward. Transformer-XL caches hidden states from prior segments at every layer, while RMT carries a small block of memory vectors that is processed together with each new segment and can be trained with backpropagation through time (BPTT) across segments.
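To make the difference in carried state concrete, the following back-of-envelope sketch counts the vectors handed from one segment to the next under each scheme. The function names and the configuration values (16 layers, a 150-position cache, 10 memory tokens) are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def transformer_xl_carried_vectors(cache_len: int, n_layers: int) -> int:
    """Transformer-XL caches one hidden vector per cached position per layer."""
    return cache_len * n_layers


def rmt_carried_vectors(num_mem_tokens: int) -> int:
    """RMT hands over only the write-memory outputs of the final layer."""
    return num_mem_tokens


# Hypothetical configuration for illustration only (not taken from the paper).
txl_vectors = transformer_xl_carried_vectors(cache_len=150, n_layers=16)
rmt_vectors = rmt_carried_vectors(num_mem_tokens=10)
ratio = txl_vectors // rmt_vectors
print(f"Transformer-XL: {txl_vectors} vectors, RMT: {rmt_vectors} vectors, ratio: {ratio}x")
```

The count says nothing about quality. It only shows why a handful of memory tokens is a much smaller recurrent state than a per-layer cache.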

Key Contributions

Introduces a read/write memory-token layout for segment-level recurrent Transformers.
Keeps the underlying Transformer architecture intact; recurrence is implemented by changing the input/output sequence contract.
Compares RMT against Transformer and Transformer-XL on copy, reverse, associative retrieval, quadratic equations, WikiText-103, enwik8, and Hyperpartisan classification.
Shows that RMT can match Transformer-XL language-modeling quality with fewer carried memory vectors in several settings.
Demonstrates that RMT memory and Transformer-XL cache can be combined as long-term and short-term memory.

Method Notes

For causal decoder settings, RMT places memory tokens both before and after the segment. Prefix memory gives the segment read access to prior state; suffix memory can attend to the current segment and becomes the memory state for the next segment.
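The prefix/suffix layout can be checked with a small visibility sketch. It assumes the simplest possible setup, a plain lower-triangular causal mask over the concatenated sequence [read memory; segment; write memory]. The official implementations may use a different mask, and `rmt_causal_mask` and `visible` are hypothetical helper names.

```python
from typing import Iterable, List


def rmt_causal_mask(num_mem: int, seg_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumption: an ordinary causal (lower-triangular) mask over
    [read memory | segment | write memory].
    """
    total = num_mem + seg_len + num_mem
    return [[key <= query for key in range(total)] for query in range(total)]


def visible(mask: List[List[bool]], query: int, keys: Iterable[int]) -> bool:
    """True when the query position can attend to every position in keys."""
    return all(mask[query][k] for k in keys)


num_mem, seg_len = 2, 3
mask = rmt_causal_mask(num_mem, seg_len)

read = range(0, num_mem)                                   # prefix memory
segment = range(num_mem, num_mem + seg_len)                # current segment
write = range(num_mem + seg_len, 2 * num_mem + seg_len)    # suffix memory

# Every segment token can read the prior state held in the prefix memory.
segment_reads_memory = all(visible(mask, q, read) for q in segment)
# Prefix memory cannot see the segment, so it cannot be the write slot.
prefix_sees_segment = any(mask[q][k] for q in read for k in segment)
# Suffix memory sees the whole segment and can summarize it for the next step.
suffix_sees_segment = all(visible(mask, q, segment) for q in write)
```

Under this assumption the suffix slot is the only place where memory can absorb the current segment, which is why the causal variant needs memory tokens on both sides.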

The segment update can be read as:

\tilde{H}_{\tau}^{0} = [H_{\tau}^{\mathrm{mem}}; H_{\tau}^{0}; H_{\tau}^{\mathrm{mem}}], \quad [H_{\tau}^{\mathrm{read}}; H_{\tau}^{N}; H_{\tau}^{\mathrm{write}}] = \mathrm{Transformer}(\tilde{H}_{\tau}^{0}), \quad H_{\tau+1}^{\mathrm{mem}} := H_{\tau}^{\mathrm{write}}.
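The update rule above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the backbone is a toy stand-in (`causal_mean_backbone`, where each position outputs the running mean of the inputs so far) used in place of a real causal Transformer, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Backbone = Callable[[List[Vector]], List[Vector]]


def causal_mean_backbone(tokens: List[Vector]) -> List[Vector]:
    """Toy stand-in for a causal Transformer: output i is the mean of inputs 0..i."""
    running = [0.0] * len(tokens[0])
    outputs = []
    for count, token in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, token)]
        outputs.append([r / count for r in running])
    return outputs


def rmt_step(mem: Sequence[Vector], segment: Sequence[Vector],
             backbone: Backbone) -> Tuple[List[Vector], List[Vector]]:
    """One segment update: build [mem; segment; mem], run the backbone, split."""
    m, n = len(mem), len(segment)
    inputs = list(mem) + list(segment) + list(mem)
    outputs = backbone(inputs)
    segment_out = outputs[m:m + n]   # H^N: outputs for the segment tokens
    write_mem = outputs[m + n:]      # H^write: becomes the next H^mem
    return segment_out, write_mem


def rmt_forward(segments: Sequence[Sequence[Vector]], init_mem: Sequence[Vector],
                backbone: Backbone) -> Tuple[List[List[Vector]], List[Vector]]:
    """Process segments in order, passing write memory forward as the next read memory."""
    mem = [list(v) for v in init_mem]
    all_outputs = []
    for segment in segments:
        segment_out, mem = rmt_step(mem, segment, backbone)
        all_outputs.append(segment_out)
    return all_outputs, mem


segments = [[[1.0], [3.0]], [[5.0], [7.0]]]   # two segments of two 1-d tokens
outs, final_mem = rmt_forward(segments, init_mem=[[0.0]], backbone=causal_mean_backbone)
```

In a trained model the initial memory would be a learned parameter and the backbone a real Transformer; the loop structure is the only part this sketch is meant to convey.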

This is not a full world model by itself. It is a memory interface for processing long sequences with local Transformer compute and compact state propagation.

Evidence And Results

The paper reports that RMT solves segmented copy, reverse, and quadratic-equation tasks where non-recurrent Transformer baselines fail and where Transformer-XL degrades as segment count increases.
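A small sketch shows why segmented algorithmic tasks stress cross-segment memory. It builds a simplified copy example and splits it into fixed-length segments. The exact sequence format in the paper may differ; the `<gen>` delimiter, the single-copy target, and the helper names are assumptions made for illustration.

```python
from typing import List, Sequence


def make_copy_example(source: Sequence[str], delimiter: str = "<gen>") -> List[str]:
    """Source tokens, a start-to-generate delimiter, then the tokens to reproduce."""
    return list(source) + [delimiter] + list(source)


def split_into_segments(tokens: Sequence[str], seg_len: int) -> List[List[str]]:
    """Cut the sequence into consecutive chunks of at most seg_len tokens."""
    return [list(tokens[i:i + seg_len]) for i in range(0, len(tokens), seg_len)]


sequence = make_copy_example(["a", "b", "c", "d"])
copy_segments = split_into_segments(sequence, seg_len=3)
for index, segment in enumerate(copy_segments):
    print(index, segment)
```

Here the final segment must reproduce tokens that appeared only in the first one. A model that sees one segment at a time with no carried state has no access to them, which is the failure mode the paper reports for the non-recurrent baseline.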

On WikiText-103 and enwik8, the headline result is efficiency of carried state: RMT can reach Transformer-XL-like performance with fewer memory vectors, and a combined Transformer-XL + RMT configuration improves language modeling in the reported setup.

The Hyperpartisan classification experiment shows the practical wrapper interpretation: memory tokens can be added around pretrained BERT/RoBERTa/DeBERTa/T5-style models to extend effective text length.

Limitations

Segment processing is still sequential at inference time.
BPTT through multiple prior segments is memory-intensive and can be unstable at larger memory sizes.
Evidence is mostly language, synthetic algorithmic tasks, and document classification, not numeric time series, event streams, actions, or control inputs.
The paper does not evaluate a persistent serving contract for multivariate telemetry or action-conditioned rollouts.
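The BPTT memory cost noted above can be approximated with a rough count of the activations that must stay alive while gradients flow across segments. This is a simplified estimate under stated assumptions: it counts one `d_model`-sized vector per token per layer per unrolled segment and ignores attention maps, feed-forward intermediates, and optimizer state. The function name and all configuration values are hypothetical.

```python
def bptt_activation_floats(unrolled_segments: int, seg_len: int, num_mem: int,
                           n_layers: int, d_model: int) -> int:
    """Rough count of hidden-state floats retained for backprop across segments."""
    tokens_per_step = seg_len + 2 * num_mem   # read memory + segment + write memory
    return unrolled_segments * n_layers * tokens_per_step * d_model


# Hypothetical configuration for illustration only.
config = dict(seg_len=512, num_mem=10, n_layers=12, d_model=768)
single = bptt_activation_floats(unrolled_segments=1, **config)
unrolled = bptt_activation_floats(unrolled_segments=4, **config)
print(f"1 segment: {single} floats, 4 segments: {unrolled} floats")
```

Under this estimate, retained activations grow linearly with the number of unrolled segments, while adding memory tokens only widens each step by two tokens per memory slot.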

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state, long context, and constant updates	adjacent	Carries compact learned memory tokens between segments and reduces the need to attend over all prior tokens.	Needs numeric time-series, topology, event-stream, and serving-state evaluations.
Native multivariate encoding and high-channel scaling	insufficient evidence	Memory tokens could store compressed cross-segment information.	No high-channel numeric or graph time-series experiment.
Causal structure, counterfactuals, and control	insufficient evidence	Later RATE adapts the RMT interface to action trajectories.	RMT itself has no action, control input, reward, or counterfactual dynamics interface.
Dynamic compute and memory efficiency	adjacent	Shows a tradeoff between memory-token count, BPTT depth, and Transformer-XL cache size.	Needs matched FLOPs, latency, and serving-memory comparisons on modern baselines.

Links Into The Wiki

Open Questions

When should a time-series model use compact recurrent memory tokens rather than an SSM/RWKV/Mamba-style recurrent state?
Can memory-token BPTT be made stable enough for long numeric trajectories or action-conditioned rollouts?
Which telemetry state belongs in explicit memory tokens versus ordinary cached hidden state?
