The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Source

Core Claim

The Recurrent Transformer lets each layer attend to key-value pairs computed from its own activations, increasing effective temporal depth while preserving standard autoregressive decoding cost.
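To make the claim concrete, the following is a minimal single-layer sketch of one possible reading of it, not the paper's implementation. It assumes that the keys and values cached for past positions are projected from the layer's own outputs, while the standard variant projects them from the layer's inputs. The function name `decode_layer`, the `recurrent` flag, the residual form, and the choice to give the current token a temporary input-derived key-value pair are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_layer(xs, Wq, Wk, Wv, recurrent=True):
    """Run one attention layer token by token (hypothetical sketch).

    recurrent=False: cached keys/values come from the layer inputs.
    recurrent=True:  cached keys/values come from the layer's own outputs
                     (assumed reading of the source's core claim).
    """
    d = xs.shape[1]
    cache_k, cache_v, outputs = [], [], []
    for x in xs:
        q = x @ Wq
        # The current token's output is not known yet, so it attends to the
        # cached entries plus a temporary key/value from its own input.
        ks = np.stack(cache_k + [x @ Wk])
        vs = np.stack(cache_v + [x @ Wv])
        attn = softmax(ks @ q / np.sqrt(d))
        y = x + attn @ vs  # residual connection
        # Exactly one cache entry is stored per position in both modes.
        src = y if recurrent else x
        cache_k.append(src @ Wk)
        cache_v.append(src @ Wv)
        outputs.append(y)
    return np.stack(outputs), len(cache_k)

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5
xs = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out_rec, n_rec = decode_layer(xs, Wq, Wk, Wv, recurrent=True)
out_std, n_std = decode_layer(xs, Wq, Wk, Wv, recurrent=False)
```

In this sketch both modes store one key-value entry per position and perform one attention pass per decoding step, so the per-step cost is unchanged. The difference is that in the recurrent mode each cached entry already reflects the layer's earlier outputs, so information passes through the same layer repeatedly as the sequence advances.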

Relevance To This Wiki

It is an alternative to block-looping: recurrence is introduced layerwise, through each layer's attention memory, rather than by repeatedly applying the same full block.

Limitations

The evidence is still upstream, coming from language-model architecture work, so the approach needs domain-specific testing before its claims are applied to numeric time-series or trajectory state.

Foundation TSFM Relevance

It is relevant to memory-latency tradeoffs because recurrence may reduce the KV-cache footprint or the layer count at a fixed parameter budget.
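The link between layer count and cache footprint can be illustrated with the usual size formula for a standard attention KV cache: two tensors (keys and values) per layer, each of size heads x head dimension x sequence length. The sketch below is back-of-the-envelope arithmetic only. The function name `kv_cache_bytes` and every configuration number are hypothetical and are not taken from the source.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size for a standard Transformer decoder.

    The leading factor of 2 counts one key tensor and one value tensor per
    layer; bytes_per_elem=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical configurations, chosen only to show the linear scaling.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
shallower = kv_cache_bytes(num_layers=24, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"baseline:  {baseline / 2**30:.2f} GiB")
print(f"shallower: {shallower / 2**30:.2f} GiB")
```

Because the cache grows linearly with the number of layers, any architecture that reaches comparable quality with fewer layers shrinks the cache proportionally. Whether the Recurrent Transformer actually achieves this is the open empirical question.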

Open Questions

  • What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
  • Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?
  • How do the KV-cache and HBM-traffic gains compare against ordinary attention variants under real long-context serving workloads?