The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Source
- Raw Markdown: paper_recurrent-transformer-2026.md
- PDF: paper_recurrent-transformer-2026.pdf
- Preprint: arXiv 2604.21215
- Official code: geniucos/recurrent-transformer
Core Claim
The Recurrent Transformer lets each layer attend to key-value pairs computed from its own activations, increasing effective temporal depth while preserving standard autoregressive decoding cost.
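To make the claim concrete, here is a toy single-head decoding loop in pure Python. It is a minimal sketch of one reading of the claim, not the paper's implementation. It assumes that "its own activations" means the layer's outputs at earlier positions. The names `decode`, `attend`, and the `recurrent` flag are hypothetical, and returning the bare residual at the first position (where the cache is still empty) is an illustrative choice. With `recurrent=False` the cache is built from layer inputs, as in a standard Transformer layer.

```python
import math

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over the cached keys/values.
    if not keys:
        return [0.0] * len(q)  # empty cache: no attention contribution
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]

def decode(xs, Wq, Wk, Wv, recurrent):
    """Run one toy attention layer token by token.

    recurrent=False: K/V are computed from the layer INPUT (standard layer).
    recurrent=True:  K/V are computed from the layer's OWN OUTPUT at earlier
                     positions (assumed reading of the core claim).
    Returns the outputs and the number of cached K/V pairs.
    """
    keys, values, outputs = [], [], []
    for x in xs:
        q = matvec(Wq, x)
        if not recurrent:
            keys.append(matvec(Wk, x))
            values.append(matvec(Wv, x))
        # Residual connection around the attention read.
        h = [xi + ai for xi, ai in zip(x, attend(q, keys, values))]
        if recurrent:
            # The output at position t becomes memory for positions > t.
            keys.append(matvec(Wk, h))
            values.append(matvec(Wv, h))
        outputs.append(h)
    return outputs, len(keys)

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recurrent_out, recurrent_cache = decode(tokens, identity, identity, identity, recurrent=True)
standard_out, standard_cache = decode(tokens, identity, identity, identity, recurrent=False)
```

In both modes the cache holds one key-value pair per token, so the per-step attention cost in this toy is the same. In the recurrent mode, each output depends on earlier outputs of the same layer, so the chain of dependencies grows with position. This is the sense in which effective temporal depth increases.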
Relevance To This Wiki
It is an alternative to block-looping: the recurrence lives layerwise inside the attention memory instead of coming from repeated application of the same full block.
Limitations
The evidence is still upstream language-model architecture work; it needs domain-specific testing before conclusions are drawn for numeric time series or trajectory state.
Foundation TSFM Relevance
Relevant to memory-latency tradeoffs: recurrence may reduce the KV-cache footprint or the layer count needed at a fixed parameter budget.
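The footprint side of this tradeoff can be made concrete with the usual KV-cache size estimate for a decoder-only Transformer: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The function name `kv_cache_bytes` and every number below are hypothetical illustrations, not figures from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV-cache size in bytes for standard attention layers.

    The factor 2 counts keys and values; bytes_per_element=2 assumes
    16-bit storage (e.g. fp16 or bf16).
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Hypothetical configurations: same width and context, different layer counts.
baseline = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
shallower = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"32 layers: {baseline / 2**30:.2f} GiB")
print(f"24 layers: {shallower / 2**30:.2f} GiB")
```

Because the estimate is linear in the layer count, any architecture that reaches comparable quality with fewer cached layers reduces the cache proportionally. Whether the Recurrent Transformer achieves this on time-series workloads is exactly what would need to be tested.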
Links Into The Wiki
- Recurrent Transformer
- Looped Transformers And Test-Time Memory
- Efficient Recurrent Sequence Models
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
Open Questions
- What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
- Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?
- How do the KV-cache and HBM-traffic gains compare against ordinary attention variants under real long-context serving workloads?