Efficient Recurrent Sequence Models

Summary

Efficient recurrent sequence models try to recover the serving advantages of compact latent state while avoiding the historical training bottleneck of sequential RNN unrolling. The RMT line carries explicit memory tokens across Transformer segments; the Mamba line keeps recurrence linear in hidden state and wins parallelism through scans or structured matrix algorithms; ParaRNN keeps nonlinear recurrent cells and makes the hidden trajectory parallelizable through Newton linearization plus parallel reduction; SMT/DMT tries a different route by replacing recurrent credit propagation during pretraining with predictive-memory supervision.

Linear Recurrent State Path

Mamba introduces selective SSMs: input-dependent state-space parameters let the model decide what to remember or forget, while the hidden-state recurrence remains linear enough to compute with a hardware-aware parallel scan. It is the core source for the idea that compact recurrent state can compete with attention on token sequences when selectivity and kernel design are strong enough.
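Why a linear hidden-state recurrence parallelizes can be sketched in a few lines. The example below is a minimal illustration, not Mamba's kernel: it assumes a scalar state, and the names `combine`, `sequential_states`, and `scan_states` are hypothetical. Each step `h_t = a_t * h_{t-1} + b_t` is an affine map, and composing affine maps is associative, so all prefixes can be built by recursive doubling in roughly `log2(T)` passes.

```python
from typing import List, Tuple

Affine = Tuple[float, float]  # (a, b) stands for the map h -> a * h + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_states(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Reference rollout: one step at a time, as a classic RNN would run."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def scan_states(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Inclusive prefix scan by recursive doubling (Hillis-Steele style).

    Illustrative assumption: scalar state and a plain Python loop. Within
    one pass, every position is independent and could run in parallel.
    """
    prefix: List[Affine] = list(zip(a, b))
    step = 1
    while step < len(prefix):
        prefix = [
            combine(prefix[i - step], prefix[i]) if i >= step else prefix[i]
            for i in range(len(prefix))
        ]
        step *= 2
    # prefix[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in prefix]
```

In a selective model the coefficients `a_t` and `b_t` would be computed from the input at step `t`; the scan is unaffected because the recurrence stays linear in the hidden state.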

Mamba-2 reframes SSMs as semiseparable matrix mixers. The structured state space duality view turns the recurrent update into a matrix-algorithm problem and yields SSD, which improves training efficiency and allows larger state sizes.
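The matrix-mixer view can be made concrete with a scalar-state toy. The sketch below is an assumption-laden illustration rather than the SSD algorithm: it materializes the full lower-triangular matrix, which a real implementation would avoid, and the function names are hypothetical. For the recurrence `h_t = a_t * h_{t-1} + b_t * x_t` with readout `y_t = c_t * h_t`, the same output is `y = M x` where `M[i][j] = c_i * (a_{j+1} * ... * a_i) * b_j` for `j <= i`.

```python
from typing import List


def recurrent_output(a: List[float], b: List[float], c: List[float],
                     x: List[float]) -> List[float]:
    """Run the scalar SSM step by step."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def mixer_matrix(a: List[float], b: List[float], c: List[float]) -> List[List[float]]:
    """Materialize the lower-triangular (semiseparable) sequence mixer."""
    n = len(a)
    m = [[0.0] * n for _ in range(n)]
    for j in range(n):
        decay = 1.0  # running product a[j+1] * ... * a[i]; empty when i == j
        for i in range(j, n):
            if i > j:
                decay *= a[i]
            m[i][j] = c[i] * decay * b[j]
    return m


def matrix_output(a: List[float], b: List[float], c: List[float],
                  x: List[float]) -> List[float]:
    """Compute the same outputs as one matrix-vector product."""
    m = mixer_matrix(a, b, c)
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]
```

The two functions agree up to floating-point rounding, which is the sense in which the recurrent and matrix forms are two views of one operator.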

Mamba-3 keeps the structured SSM family but adds more expressive state dynamics: exponential-trapezoidal discretization, complex-valued state transitions, and MIMO updates. Its relevance is that richer state tracking can be added while staying close to the efficient SSM inference contract.
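One way to see why complex-valued transitions add state-tracking capacity is a parity toy. The sketch below is not Mamba-3's parameterization; it assumes a single unit-modulus transition `exp(i * pi * x_t)` on a scalar complex state, and the function name is hypothetical. A real decay factor in (0, 1) can only shrink the state, while a rotation by pi flips its sign each time a 1 arrives.

```python
import cmath
from typing import Iterable


def parity_by_rotation(bits: Iterable[int]) -> int:
    """Track the parity of a bit stream with one complex scalar state.

    Illustrative assumption: the transition is exp(i * pi * bit), so a 1
    rotates the state by half a turn and a 0 leaves it unchanged.
    """
    h = complex(1.0, 0.0)
    for bit in bits:
        h = cmath.exp(1j * cmath.pi * bit) * h  # still linear in h
    # The state sits near +1 for even parity and near -1 for odd parity.
    return 0 if h.real > 0 else 1
```

The update remains a linear recurrence in `h`, so it is compatible with the scan-style computation used by the rest of the SSM family.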

Gated DeltaNet is the ICLR 2025 fast-weight / linear-attention predecessor to Gated DeltaNet-2. It combines scalar decay gating with the delta rule so a fixed-size associative state can both clear stale information and perform targeted key-value updates. Its importance here is historical and architectural: although it is more than a year old, it remains a current backbone because it is used in Oryx, appears in HAM baselines, and motivates the later erase/write decoupling.
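The combination of decay gating and the delta rule can be sketched on a tiny dense state. This is a minimal illustration under stated assumptions: plain Python lists instead of a fused kernel, a single head, and hypothetical names `read` and `gated_delta_step`. The update used is `S <- alpha * S`, then `S <- S + beta * (v - S k) k^T`, which is algebraically the same as `S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T`.

```python
from typing import List

Matrix = List[List[float]]  # shape: value_dim x key_dim


def read(state: Matrix, key: List[float]) -> List[float]:
    """Associative read: state @ key."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def gated_delta_step(state: Matrix, key: List[float], value: List[float],
                     alpha: float, beta: float) -> Matrix:
    """One gated delta-rule update (illustrative sketch, not a kernel).

    alpha in [0, 1] decays the whole state; beta in [0, 1] scales the
    targeted correction for this key.
    """
    decayed = [[alpha * s for s in row] for row in state]
    predicted = read(decayed, key)  # what the state currently returns for key
    error = [v - p for v, p in zip(value, predicted)]
    # Rank-one correction: move the stored value for `key` toward `value`.
    return [
        [s + beta * e * k for s, k in zip(row, key)]
        for row, e in zip(decayed, error)
    ]
```

With a unit-norm key and `beta = 1`, a second write to the same key replaces the old value instead of adding to it, while `alpha < 1` with `beta = 0` shrinks every stored association uniformly.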

Gated DeltaNet-2 updates the linear-attention / fast-weight branch rather than the SSM branch. It keeps a fixed-size recurrent key-value state but separates the active memory edit into channel-wise key-side erase and value-side write gates. The useful time-series analogy is selective state editing: an always-on model may need to remove stale associations without committing every new observation with the same scalar strength. The evidence is still language modeling and retrieval, so this remains architecture background until numeric streams and action-conditioned trajectories are tested.

Oryx, Hybrid Associative Memories (HAM), and HOLA add hybrid exact-memory variants of the same recurrent-state question. Oryx switches between attention and linear recurrent mixers across the sequence while sharing key/value representations. HAM routes hard-to-predict tokens into a KV scratchpad with thresholded or learned selection. HOLA keeps a fixed top-$w$ exact cache using the Gated DeltaNet (GDN) update magnitude $\beta \lVert e \rVert$ and sharpens its cache read separately from the state path. Together they make mixer mode, cache growth, and fixed exact-memory budget explicit, but remain language-model evidence until a time-series benchmark tests numeric state preservation under matched serving budgets.
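The fixed exact-memory budget can be illustrated with a score-based retention rule. The sketch below is a hypothetical reading of the top-$w$ idea, not HOLA's implementation: it assumes each position is scored once by `beta * ||e||` (delta-rule write strength times error norm) and that the cache keeps the `w` highest-scoring positions seen so far. `TopWCache`, `offer`, and `update_magnitude` are illustrative names.

```python
import heapq
import math
from typing import List, Tuple


def update_magnitude(beta: float, error: List[float]) -> float:
    """Score a token by beta * ||e||, the size of its delta-rule correction."""
    return beta * math.sqrt(sum(e * e for e in error))


class TopWCache:
    """Keep the w highest-scoring positions in a fixed-size min-heap.

    Illustrative assumption: scores are final once offered, so eviction
    only ever removes the current lowest-scoring entry.
    """

    def __init__(self, w: int) -> None:
        self.w = w
        self._heap: List[Tuple[float, int]] = []  # (score, position)

    def offer(self, position: int, score: float) -> None:
        if self.w <= 0:
            return
        entry = (score, position)
        if len(self._heap) < self.w:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)  # evict the weakest entry

    def positions(self) -> List[int]:
        """Return retained positions in sequence order."""
        return sorted(position for _, position in self._heap)
```

The cache size never exceeds `w`, so serving memory stays fixed regardless of sequence length; a thresholded scratchpad in the HAM style would instead grow with the number of tokens that pass the threshold.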

Comparing Transformers and Hybrid Models at the Token Level adds a diagnostic rather than a new mixer. In the closely matched Olmo 3 versus Olmo Hybrid comparison, recurrent/hybrid structure helps most on meaning-bearing and context-dependent predictions, while attention remains strongest on repeated $n$-grams and closing delimiters. The useful lesson for this page is that recurrent/attention hybrids should be compared by capability slices (state-conditioned readout versus visible-prefix retrieval), not only by aggregate language-model loss.

Language Models Need Sleep adds a consolidation-time variant of the same compact-state question. Instead of asking only whether SSM fast weights can store evicted context, it asks whether the model has enough compute to transform that context into a useful fast-weight state before the KV cache is cleared. The paper’s sleep phase loops an SSM-attention hybrid over the current window before eviction, then keeps wake-time prediction to a single forward pass.

Dragon Hatchling (BDH) adds a different fast-state architecture: a practical attention-based state-space sequence model with a large $n \times d$ recurrent state, sparse positive activations, and synapse-like update probes. It should be read as language/translation architecture evidence, not as numeric time-series evidence, but it is directly relevant to the question of whether a sequence model can carry a large mutable state during inference without relying only on a fixed attention window.

Nonlinear Recurrent State Path

ParaRNN is the key new source because it challenges the assumption that nonlinear recurrent hidden-state updates are inherently impractical at training time. It solves the all-time-step hidden-state trajectory as a nonlinear system; each Newton step becomes a linear recurrence over Jacobians and residuals, which is solved by parallel reduction.
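The Newton-over-the-trajectory idea can be shown on a scalar toy cell. The sketch below is not ParaRNN's solver: it assumes a single `tanh` unit with a hypothetical recurrent weight `w`, and it solves each linearized system with an ordinary loop for readability. The point is that the inner loop is a linear recurrence `new_h[t] = jac[t] * new_h[t-1] + off[t]`, which is the kind of recurrence a parallel reduction can evaluate.

```python
import math
from typing import List


def cell(h_prev: float, x_t: float, w: float) -> float:
    """Toy nonlinear cell, nonlinear in the hidden state."""
    return math.tanh(w * h_prev + x_t)


def cell_jacobian(h_prev: float, x_t: float, w: float) -> float:
    """Derivative of the cell output with respect to h_prev."""
    return w * (1.0 - math.tanh(w * h_prev + x_t) ** 2)


def sequential_rollout(xs: List[float], w: float, h0: float = 0.0) -> List[float]:
    hs, h = [], h0
    for x_t in xs:
        h = cell(h, x_t, w)
        hs.append(h)
    return hs


def newton_rollout(xs: List[float], w: float, h0: float = 0.0,
                   max_iters: int = 50, tol: float = 1e-12) -> List[float]:
    """Solve h_t = cell(h_{t-1}, x_t) for all t at once by Newton's method."""
    hs = [0.0] * len(xs)  # initial guess for the whole trajectory
    for _ in range(max_iters):
        prev = [h0] + hs[:-1]  # h_{t-1} under the current guess
        jac = [cell_jacobian(p, x, w) for p, x in zip(prev, xs)]
        off = [cell(p, x, w) - j * p for p, x, j in zip(prev, xs, jac)]
        # Linearized system. Written sequentially here; because it is
        # linear in new_h, an associative scan could compute it instead.
        new_hs, h = [], h0
        for j, o in zip(jac, off):
            h = j * h + o
            new_hs.append(h)
        delta = max((abs(n - o) for n, o in zip(new_hs, hs)), default=0.0)
        hs = new_hs
        if delta < tol:
            break
    return hs
```

For this causal system each Newton iteration makes at least one more time step exact, so the loop reaches the sequential answer in at most `len(xs)` iterations; the practical interest is in converging in far fewer.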

The important distinction is that ParaRNN’s recurrent cell is genuinely nonlinear in the hidden state, while Mamba-family models preserve a linear hidden-state recurrence with input-dependent parameters. This makes ParaRNN especially relevant for domains where nonlinear latent dynamics are the modeling prior, including numeric time series and action-conditioned world models, but the paper itself evaluates token-sequence language modeling.

Pretraining Recurrent Networks without Recurrence adds the SMT/DMT route. Instead of solving the actual nonlinear hidden trajectory in parallel, SMT trains a Transformer encoder-decoder to produce predictive memory states and then trains the RNN on one-step memory transitions. This removes BPTT during pretraining and gives an $O(1)$ credit path, but it introduces two caveats that matter for this topic: one-step memory imitation drifts under rollout unless DMT or another post-training stage corrects it, and a bounded-depth Transformer teacher may be an expressivity ceiling on tasks where nonlinear recurrence is exactly the needed computation class.

The useful comparison is therefore not “SMT replaces ParaRNN.” ParaRNN is currently the stronger large-scale nonlinear-RNN language-model evidence, while SMT/DMT is a candidate pretraining or memory-geometry initializer. A hybrid worth tracking is SMT for predictive-state initialization followed by ParaRNN/DEER-style parallel trajectory solving for end-to-end recurrent fine-tuning.

Memory-Token Transformer Path

Recurrent Memory Transformer is the root memory-token source for this branch. It wraps a Transformer segment with read/write memory tokens and passes the updated memory block to the next segment. The serving intuition is compact recurrent state; the training caveat is BPTT through segments, which becomes expensive and can be unstable as memory size and unroll depth grow.

Associative Recurrent Memory Transformer extends RMT with layerwise associative memory. It is useful here because it tests a different bottleneck: not just whether a small memory-token block can persist across segments, but whether the memory can store and rewrite many key/value associations over very long contexts. The evidence is long-context language and associative retrieval, so it remains architecture background for time series.

Recurrent Action Transformer with Memory (RATE) adapts the RMT read/write memory contract to offline RL trajectories. It is the action-trajectory bridge in this topic: observations, actions, and rewards are present, and the Memory Retention Valve (MRV) tries to prevent important sparse cues from leaking out of memory. It should be cited as policy/decision-model evidence, not as an explicit action-conditioned dynamics model.

RATE also sharpens the point that memory tokens, MRV filtering, and Transformer-XL-style cached hidden states are different resources. Cached hidden states can help some continuous-feedback or image-like settings, but in sparse T-Maze-style tasks they can interfere with retaining the rare cue.

Time-Series RWKV Bridge

RWKV-TS is the direct numeric time-series bridge in this topic. It adapts RWKV-style recurrence to passive time-series tasks with patching, time mixing, channel mixing, and a multi-head WKV operator. The WKV computation is attention-like in effect but can be written as recurrent state with linear sequence-length cost.
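The claim that WKV is attention-like in effect but recurrent in form can be checked on a scalar channel. The sketch below assumes the RWKV-4-style WKV definition with a per-channel decay `w` and current-token bonus `u`; RWKV-TS's multi-head operator may differ in detail, and the function names are hypothetical. It also omits the running-maximum trick that practical implementations use for numerical stability.

```python
import math
from typing import List


def wkv_direct(ks: List[float], vs: List[float], w: float, u: float) -> List[float]:
    """Attention-like form: each output re-weights the whole prefix."""
    out = []
    for t in range(len(ks)):
        bonus = math.exp(u + ks[t])
        num, den = bonus * vs[t], bonus
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])  # older tokens decay
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(ks: List[float], vs: List[float], w: float, u: float) -> List[float]:
    """Recurrent form: two running sums, constant state per channel."""
    num_state, den_state = 0.0, 0.0
    decay = math.exp(-w)
    out = []
    for k_t, v_t in zip(ks, vs):
        bonus = math.exp(u + k_t)
        out.append((num_state + bonus * v_t) / (den_state + bonus))
        # Fold the current token into the state for future steps.
        num_state = decay * num_state + math.exp(k_t) * v_t
        den_state = decay * den_state + math.exp(k_t)
    return out
```

The direct form costs quadratic time in sequence length, while the recurrent form costs linear time and keeps only two numbers per channel between steps.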

RWKV-TS should be treated as architecture evidence rather than as a pretrained TSFM checkpoint source: the paper trains models from scratch across forecasting, imputation, anomaly detection, classification, and few-shot settings.

Recurrent-Depth Transformer Branch

Universal Transformers and the newer looped-model line are adjacent rather than identical to compact recurrent-state models. They reuse Transformer computation across depth, so recurrence is mainly over representation refinement rather than a fixed-size hidden state that advances one token or time step at a time.

Huginn, Parcae, and The Recurrent Transformer make this branch more relevant to efficient sequence modeling because they discuss large-scale recurrent-depth pretraining, stable looped scaling, layerwise recurrent memory, and efficient decoding. Latent Thoughts, LoopFormer, and Sparse Looped LMs fill in the theory, elastic-depth interface, and sparse-capacity variant; they remain repeated-depth and dynamic-compute evidence rather than compact recurrent-state evidence.

LT2 is the cost-model update for this branch: the repeated block can use linear-attention recurrent state, sparse attention, or a hybrid with a small number of full-attention layers, so loop count need not multiply full KV-cache cost.

GRAM adds trained stochastic recursive reasoning, while PTRM adds training-free recurrent-noise width on a pretrained deterministic TRM. Together they show that repeated latent-state refinement can scale through parallel trajectories as well as depth, but PTRM’s Maze-Hard gap also shows that proposal coverage and selector quality can diverge. Both remain controlled puzzle evidence rather than numeric time-series or action-conditioned trajectory evidence.

mHC and Hyperloop Transformers add a residual-stream-capacity variant: instead of only looping a block or retrieving depth KV, a looped model can use parallel residual streams to make repeated passes less representation-constrained. Efficient Parallel Samplers for Recurrent-Depth Models adds the inference question: extra latent-state refinement needs a sampler that preserves throughput.

DiffusionBlocks adds a training-side bridge for this branch: a Huginn-style recurrent-depth model can be trained with a single-pass denoising objective instead of recurrent-depth BPTT, but that does not turn looped depth into compact recurrent state or prove numeric time-series state preservation.

MesaNet sits at the boundary between recurrent sequence models and test-time optimization. It adds the locally optimal fast-weight branch, where the model spends dynamic conjugate-gradient compute to solve an in-context regression problem, but current evidence remains language/synthetic and should be read as serving-budget evidence rather than direct time-series evidence.
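The dynamic conjugate-gradient compute can be illustrated with a small ridge-regularized solve. The sketch below is not MesaNet's layer: it shows only the linear-solve step, on a dense system built from a handful of keys, and the names `regularized_gram` and `conjugate_gradient` are hypothetical. The returned step count is the quantity that varies per query, which is why this branch is a serving-budget question.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def regularized_gram(keys: List[Vector], lam: float) -> Matrix:
    """Build sum_i k_i k_i^T + lam * I, symmetric positive definite for lam > 0."""
    d = len(keys[0])
    return [
        [sum(k[r] * k[c] for k in keys) + (lam if r == c else 0.0) for c in range(d)]
        for r in range(d)
    ]


def conjugate_gradient(matrix: Matrix, rhs: Vector, max_steps: int,
                       tol: float = 1e-10) -> Tuple[Vector, int]:
    """Solve matrix @ x = rhs; return the solution and the steps spent."""
    x = [0.0] * len(rhs)
    r = list(rhs)        # residual for the initial guess x = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    steps = 0
    while steps < max_steps and rs > tol * tol:
        ap = matvec(matrix, p)
        step = rs / dot(p, ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        steps += 1
    return x, steps
```

Capping `max_steps` trades solution accuracy for latency, so any comparison against a fixed-cost recurrent update needs the step budget stated alongside the state size.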

For this page, the distinction to preserve is compact recurrent state versus repeated depth. Compact recurrent state is primarily a serving-memory contract; looped depth is primarily a dynamic-compute contract. A time-series model may need both, but they should not be treated as the same mechanism.
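The serving-memory side of this distinction can be put into a rough accounting sketch. The functions below are illustrative assumptions, not measurements of any model named on this page: they count only cached keys and values for attention and a matrix-valued state per head for a recurrent mixer, with hypothetical shape parameters and 2 bytes per value.

```python
def kv_cache_bytes(layers: int, context_len: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Attention cache: keys and values for every past token, every layer."""
    return 2 * layers * context_len * kv_heads * head_dim * bytes_per_value


def recurrent_state_bytes(layers: int, heads: int, key_dim: int,
                          value_dim: int, bytes_per_value: int = 2) -> int:
    """Compact recurrent state: one key_dim x value_dim matrix per head."""
    return layers * heads * key_dim * value_dim * bytes_per_value


# Hypothetical shapes, chosen only to show how the two terms scale.
short_context = kv_cache_bytes(layers=2, context_len=10, kv_heads=4, head_dim=8)
long_context = kv_cache_bytes(layers=2, context_len=20, kv_heads=4, head_dim=8)
fixed_state = recurrent_state_bytes(layers=2, heads=4, key_dim=8, value_dim=8)
```

The cache term grows linearly with context length while the recurrent term does not depend on it, which is the property a serving-memory contract refers to. Loop count in a repeated-depth model is a separate axis that mainly changes compute per token.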

Matrix-valued residual streams are a third contract: they add state capacity across depth or loop boundaries, but their cost lives in residual-stream memory traffic and specialized kernels rather than sequence-length recurrence alone.

Overlap With Test-Time Memory

Looped Transformers And Test-Time Memory owns the broader memory/dynamic-compute map. This page owns the efficient recurrent-state slice. The practical split is:

cite RMT, ARMT, RATE here when the question is explicit memory-token state carried across segments or trajectories;
cite Mamba, RWKV-TS, and ParaRNN here when the question is compact hidden recurrent state and parallel training or serving;
cite Titans, ATLAS, MIRAS, and MesaNet from the looped/test-time-memory topic when the question is inference-time memory updates, retention objectives, or local optimization;
compare all of them only under explicit state size, update cost, latency, BPTT depth, and benchmark hygiene.

For learned context-compression boundary cases such as Compress & Attend Transformers and Latent Context Language Models, route detailed comparisons through Looped Transformers And Test-Time Memory. CAT keeps decoder-attended compressed chunk states, and LCLM keeps prefill-time latent context, rather than a constant-size recurrent hidden state.

Relevance For Time-Series And World Models

For this wiki, RMT, ARMT, RATE, Mamba, Mamba-2, Mamba-3, Gated DeltaNet, Gated DeltaNet-2, Oryx, HAM, HOLA, ParaRNN, and the recurrent-depth Transformer sources are architectural background rather than direct forecasting evidence. RWKV-TS is direct time-series evidence, but not broad pretrained TSFM evidence.

Together they matter because many time-series and trajectory models need a compact latent state, long context, and efficient inference:

RMT-style memory tokens expose an explicit state block;
ARMT-style associative memory raises the capacity/overwrite question;
RATE shows that action trajectories can benefit from memory retention under partial observability;
Mamba/Gated-Delta-style selective state can be a strong passive dynamics backbone;
Oryx/HAM-style hybrid routing asks which spans deserve exact attention or explicit cache;
RWKV-style WKV recurrence is a tested numeric time-series alternative;
ParaRNN suggests that nonlinear latent-state dynamics might become practical at scale if the solver remains stable and structured;
SMT/DMT suggests that predictive-state supervision could initialize recurrent memory without long BPTT paths;
recurrent-depth Transformers suggest a separate route where extra compute refines state under a budget.

When writing about numeric time-series models, cite these sources as sequence-model primitives and then separately cite the concrete forecasting or trajectory paper for empirical claims. Do not treat language-model perplexity as evidence for multivariate time-series forecasting, event-stream modeling, or action-conditioned world modeling without a bridging experiment.

Relation To Foundation TSFM Agenda

This page is adjacent to the Foundation Time-Series Model Research Agenda through the streaming-state and long-context slots. Compact recurrent state is a plausible serving substrate for always-on time-series models, but the page should be treated as architecture background unless a concrete source evaluates numeric time-series state maintenance, multivariate telemetry, or action-conditioned trajectories.

Open Questions

Which time-series or trajectory regimes actually need nonlinear hidden-state updates rather than selective linear recurrent state?
When is explicit memory-token state preferable to an SSM/RWKV/Mamba state for telemetry or action trajectories?
Can associative memory or MRV-style memory filtering preserve rare interventions and stale-state overwrites better than ordinary recurrent hidden state?
When does RWKV-style time mixing outperform SSM or xLSTM-style recurrent state under the same time-series benchmark hygiene?
Can ParaRNN-style solvers handle explicit actions, control inputs, or interventions while preserving sparse or block-diagonal Jacobian structure?
Can SMT-style predictive-memory pretraining initialize a nonlinear recurrent model that is then finished with ParaRNN/DEER-style end-to-end trajectory solving?
Do Mamba-3 complex state transitions help periodic, rotational, or conservation-like dynamics in numeric time series?
Does Gated DeltaNet-2-style decoupled erase/write state editing preserve rare channel interactions better than scalar retention gates in multivariate streams?
What are the time-series equivalents of Olmo Hybrid’s token-level filters: state-conditioned non-copy targets, exact recent-value recall, repeated normal spans, and structural-constraint closure?
What benchmark tests open-world state tracking where entities, channels, variables, scopes, or topology nodes are introduced over time, updated relationally, and later queried under matched attention/recurrent serving budgets?
When should compact recurrent-state models spend extra consolidation compute before eviction rather than carrying a larger state or looping at prediction time?
Can BDH-style sparse positive fast state preserve rare regimes and cross-channel relationships in numeric streams, or does it only give interpretable language concepts in current evidence?
Can Oryx-style mixer switching, HAM-style selective cache growth, or HOLA-style fixed top-$w$ update-magnitude retention preserve rare time-series events better than a larger fixed recurrent state at the same latency and memory budget?
When is repeated Transformer depth a better serving tradeoff than a compact recurrent state with a larger hidden dimension?
What is the right benchmark for comparing attention, SSMs, and nonlinear RNNs when serving cost, context length, channel count, and state tracking all matter?
