The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Source

Raw Markdown: paper_recurrent-transformer-2026.md
PDF: paper_recurrent-transformer-2026.pdf
Preprint: arXiv 2604.21215
Official code: geniucos/recurrent-transformer

Core Claim

The Recurrent Transformer lets each layer attend to key-value pairs computed from its own activations, increasing effective temporal depth while preserving standard autoregressive decoding cost.
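One way to picture this claim is the toy single-layer decode loop below. It is a minimal sketch under stated assumptions, not the paper's implementation: keys and values for a position are assumed to be projected from the layer's own output at that position, each position is assumed to attend only to strictly earlier positions, and the names `recurrent_layer_decode`, `Wq`, `Wk`, and `Wv` are hypothetical. The point it illustrates is that the cache still grows by one key-value pair per position, as in ordinary autoregressive decoding, while each cached entry already carries information from earlier attention steps.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def recurrent_layer_decode(xs, Wq, Wk, Wv):
    """Run one toy layer over a sequence, one position at a time.

    Assumption (not confirmed by the source): keys and values are projected
    from the layer's own output h_t rather than from its input x_t, and
    position t attends only to positions before t.
    """
    d = len(xs[0])
    cache_k, cache_v, outputs = [], [], []
    for x in xs:
        q = matvec(Wq, x)  # the query still comes from the layer input
        if cache_k:
            scores = [
                sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                for k in cache_k
            ]
            weights = softmax(scores)
            ctx = [
                sum(w * v[j] for w, v in zip(weights, cache_v))
                for j in range(d)
            ]
        else:
            ctx = [0.0] * d  # nothing to attend to at the first position
        h = [xi + ci for xi, ci in zip(x, ctx)]  # residual connection
        # One new cache entry per position, computed from the output h.
        cache_k.append(matvec(Wk, h))
        cache_v.append(matvec(Wv, h))
        outputs.append(h)
    return outputs, cache_k, cache_v
```

Because the entry cached at position t is built from an output that already mixed in earlier cached entries, information can pass through several attention steps inside a single layer. That is the sense of "effective temporal depth" this sketch is meant to convey.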

Relevance To This Wiki

It is an alternative to block-looping: recurrence is placed layerwise inside attention memory rather than by repeatedly applying the same full block.

Limitations

The evidence comes from upstream language-model architecture experiments. The approach still needs domain-specific testing on numeric time series and trajectory state before its results can be assumed to transfer.

Foundation TSFM Relevance

The paper is relevant to memory-latency tradeoffs because recurrence may reduce the KV-cache footprint or the layer count at a fixed parameter budget.
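To make the footprint argument concrete, the standard KV-cache size estimate can be sketched as below. The formula (two tensors per layer, one entry per KV head, head dimension, and position) is the usual back-of-the-envelope accounting for attention caches. The two configurations are hypothetical and are not taken from the paper; whether a recurrent model with fewer layers actually matches a deeper baseline is exactly what would need to be measured.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to a 16-bit dtype such as fp16 or bf16.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)


# Hypothetical configurations, chosen only to show the scaling with depth.
deep = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
shallow = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=128, seq_len=8192)

print(deep / 2**30, shallow / 2**30)  # prints "1.0 0.75" (GiB)
```

The cache scales linearly with layer count, so in this illustration dropping from 32 to 24 layers saves a quarter of the cache at the same context length.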

Links Into The Wiki

Open Questions

What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?
How do the KV-cache and HBM-traffic gains compare against ordinary attention variants under real long-context serving workloads?
