Efficient Recurrent Sequence Models
Summary
Efficient recurrent sequence models try to recover the serving advantages of compact latent state while avoiding the historical training bottleneck of sequential RNN unrolling. Three lines of work take different routes: the RMT line carries explicit memory tokens across Transformer segments; the Mamba line keeps the recurrence linear in the hidden state and gains parallelism through scans or structured matrix algorithms; ParaRNN keeps nonlinear recurrent cells and parallelizes the hidden trajectory through Newton linearization plus parallel reduction.
Linear Recurrent State Path
Mamba introduces selective SSMs: input-dependent state-space parameters let the model decide what to remember or forget, while the recurrence stays linear in the hidden state, so it can still be computed with a hardware-aware parallel scan. It is the core source for the idea that compact recurrent state can compete with attention on token sequences when selectivity and kernel design are strong enough.
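A minimal sketch of why a recurrence that is linear in the hidden state admits a parallel scan, assuming a scalar state and the update h_t = a_t * h_{t-1} + b_t, where a_t and b_t stand in for input-dependent parameters. The function names are hypothetical, and the scan is a plain Hillis-Steele scan simulated in Python, not Mamba's hardware-aware kernel.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (a, b) encodes the affine map h -> a * h + b


def combine(left: Pair, right: Pair) -> Pair:
    """Compose two affine maps: apply `left` first, then `right`.

    The composition is associative, which is what makes a scan possible.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_recurrence(a: Sequence[float], b: Sequence[float], h0: float = 0.0) -> List[float]:
    """Reference implementation: one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_recurrence(a: Sequence[float], b: Sequence[float], h0: float = 0.0) -> List[float]:
    """Inclusive Hillis-Steele scan: about log2(T) rounds.

    Within one round every `combine` call is independent of the others,
    so on parallel hardware a round could run as a single step.
    """
    elems: List[Pair] = list(zip(a, b))
    step = 1
    while step < len(elems):
        nxt = list(elems)
        for i in range(step, len(elems)):
            nxt[i] = combine(elems[i - step], elems[i])
        elems = nxt
        step *= 2
    # elems[t] is now the composed map from h0 to h_t.
    return [a_cum * h0 + b_cum for a_cum, b_cum in elems]


if __name__ == "__main__":
    a = [0.9, 0.5, 0.8, 0.3, 0.7]   # illustrative "forget" factors
    b = [1.0, 2.0, -1.0, 0.5, 3.0]  # illustrative input contributions
    print(sequential_recurrence(a, b))
    print(scan_recurrence(a, b))
```

The two functions should agree up to floating-point rounding. The scan does more total arithmetic than the loop but has logarithmic depth in sequence length.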
Mamba-2 reframes SSMs as semiseparable matrix mixers. The structured state space duality view turns the recurrent update into a matrix-algorithm problem and yields SSD, which improves training efficiency and allows larger state sizes.
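The matrix-mixer reading can be checked on a toy case. The sketch below assumes a scalar-state SSM with h_t = a_t * h_{t-1} + b_t * x_t and y_t = c_t * h_t, and builds the lower-triangular matrix M with M[t][s] = c_t * (a_{s+1} * ... * a_t) * b_s, so that y = M x. The names are hypothetical and this only illustrates the equivalence; it is not the SSD algorithm.

```python
from typing import List, Sequence


def ssm_recurrent(a: Sequence[float], b: Sequence[float],
                  c: Sequence[float], x: Sequence[float]) -> List[float]:
    """Recurrent view: carry one scalar state across time steps."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def semiseparable_matrix(a: Sequence[float], b: Sequence[float],
                         c: Sequence[float]) -> List[List[float]]:
    """Matrix view: M[t][s] = c_t * prod(a[s+1..t]) * b_s for s <= t, else 0."""
    n = len(a)
    M = [[0.0] * n for _ in range(n)]
    for t in range(n):
        decay = 1.0  # running product a_{s+1} * ... * a_t
        for s in range(t, -1, -1):
            M[t][s] = c[t] * decay * b[s]
            decay *= a[s]
    return M


def matvec(M: List[List[float]], x: Sequence[float]) -> List[float]:
    return [sum(m_ts * x_s for m_ts, x_s in zip(row, x)) for row in M]


if __name__ == "__main__":
    a = [0.9, 0.8, 0.7, 0.6]
    b = [1.0, 0.5, 2.0, 1.5]
    c = [1.0, -1.0, 0.5, 2.0]
    x = [1.0, 2.0, 3.0, 4.0]
    print(ssm_recurrent(a, b, c, x))
    print(matvec(semiseparable_matrix(a, b, c), x))
```

Both print statements should produce the same sequence up to rounding. The recurrent form uses constant state per step, while the matrix form exposes the structure that matrix algorithms can exploit during training.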
Mamba-3 keeps the structured SSM family but adds more expressive state dynamics: exponential-trapezoidal discretization, complex-valued state transitions, and MIMO updates. Its relevance is that richer state tracking can be added while staying close to the efficient SSM inference contract.
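As a rough illustration of what a complex-valued state transition buys, the sketch below assumes a single complex state with h_t = lam * h_{t-1} + u_t and shows that it matches a two-dimensional real state updated by a scaled rotation. The function names and parameter values are hypothetical, and this is not Mamba-3's parameterization or its discretization.

```python
import cmath
import math
from typing import Tuple


def complex_step(h: complex, lam: complex, u: complex) -> complex:
    """One update of a complex scalar state."""
    return lam * h + u


def rotation_step(h: Tuple[float, float], radius: float, theta: float,
                  u: Tuple[float, float]) -> Tuple[float, float]:
    """The same update written as a scaled 2x2 rotation on a real pair."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (
        radius * (cos_t * h[0] - sin_t * h[1]) + u[0],
        radius * (sin_t * h[0] + cos_t * h[1]) + u[1],
    )


if __name__ == "__main__":
    radius, theta = 0.95, 0.4          # illustrative decay and rotation angle
    lam = cmath.rect(radius, theta)    # radius * exp(i * theta)
    h_c, h_r = 0j, (0.0, 0.0)
    for u in [1.0, 0.0, 0.0, 0.5, 0.0]:
        h_c = complex_step(h_c, lam, complex(u, 0.0))
        h_r = rotation_step(h_r, radius, theta, (u, 0.0))
        print(h_c, h_r)
```

A real diagonal transition can only decay or grow a state, whereas the complex version can also rotate it, which is one way to picture the richer state tracking mentioned above.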
Language Models Need Sleep adds a consolidation-time variant of the same compact-state question. Instead of asking only whether SSM fast weights can store evicted context, it asks whether the model has enough compute to transform that context into a useful fast-weight state before the KV cache is cleared. The paper’s sleep phase loops an SSM-attention hybrid over the current window before eviction, then keeps wake-time prediction to a single forward pass.
Dragon Hatchling adds a different fast-state architecture: a practical attention-based state-space sequence model with a large n x d recurrent state, sparse positive activations, and synapse-like update probes. It should be read as language/translation architecture evidence, not as numeric time-series evidence, but it is directly relevant to the question of whether a sequence model can carry a large mutable state during inference without relying only on a fixed attention window.
Nonlinear Recurrent State Path
ParaRNN is the key new source because it challenges the assumption that nonlinear recurrent hidden-state updates are inherently impractical at training time. It treats the hidden-state trajectory over all time steps as one nonlinear system; each Newton step on that system becomes a linear recurrence over Jacobians and residuals, which is solved by parallel reduction.
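The Newton-plus-reduction idea can be sketched on a toy scalar cell, assuming h_t = tanh(w * h_{t-1} + x_t). Each Newton iteration linearizes the whole trajectory around the current guess, which gives a linear recurrence d_t = A_t * d_{t-1} + r_t for the correction d; that recurrence is then solved with a log-depth scan. All names, the cell, and the weight `w` are hypothetical, and the sketch ignores the structured Jacobians and kernels a real implementation would need.

```python
import math
from typing import List, Sequence

W = 0.8  # hypothetical recurrent weight


def cell(h_prev: float, x_t: float) -> float:
    return math.tanh(W * h_prev + x_t)


def cell_jacobian(h_prev: float, x_t: float) -> float:
    """d cell / d h_prev (a scalar here; a matrix for vector states)."""
    return W * (1.0 - math.tanh(W * h_prev + x_t) ** 2)


def sequential_rollout(xs: Sequence[float], h0: float = 0.0) -> List[float]:
    h, out = h0, []
    for x in xs:
        h = cell(h, x)
        out.append(h)
    return out


def solve_linear_recurrence(A: Sequence[float], r: Sequence[float]) -> List[float]:
    """Solve d_t = A_t * d_{t-1} + r_t with zero initial correction via a scan."""
    elems = list(zip(A, r))
    step = 1
    while step < len(elems):
        nxt = list(elems)
        for i in range(step, len(elems)):  # independent within a round
            a1, b1 = elems[i - step]
            a2, b2 = elems[i]
            nxt[i] = (a1 * a2, a2 * b1 + b2)
        elems = nxt
        step *= 2
    return [b for _, b in elems]


def newton_rollout(xs: Sequence[float], h0: float = 0.0,
                   iters: int = 20, tol: float = 1e-12) -> List[float]:
    h = [0.0] * len(xs)  # initial guess for the whole trajectory
    for _ in range(iters):
        prev = [h0] + h[:-1]
        A = [cell_jacobian(p, x) for p, x in zip(prev, xs)]
        r = [cell(p, x) - h_t for p, x, h_t in zip(prev, xs, h)]  # residuals
        if max(abs(v) for v in r) < tol:
            break
        delta = solve_linear_recurrence(A, r)
        h = [h_t + d for h_t, d in zip(h, delta)]
    return h


if __name__ == "__main__":
    xs = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.6]
    print(sequential_rollout(xs))
    print(newton_rollout(xs))
```

In this toy setting each Newton iteration makes at least one more leading time step exact, so the loop needs no more iterations than the sequence length; the practical interest lies in cases where far fewer iterations suffice.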
The important distinction is that ParaRNN’s recurrent cell is genuinely nonlinear in the hidden state, while Mamba-family models preserve a linear hidden-state recurrence with input-dependent parameters. This makes ParaRNN especially relevant for domains where nonlinear latent dynamics are the modeling prior, including numeric time series and action-conditioned world models, but the paper itself evaluates token-sequence language modeling.
Memory-Token Transformer Path
Recurrent Memory Transformer is the root memory-token source for this branch. It wraps a Transformer segment with read/write memory tokens and passes the updated memory block to the next segment. The serving intuition is compact recurrent state; the training caveat is BPTT through segments, which becomes expensive and can be unstable as memory size and unroll depth grow.
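The read/write contract can be pictured with a toy loop. In the sketch below, `toy_segment_step` is a hypothetical stand-in for a Transformer pass over a segment wrapped in memory tokens; tokens and memory slots are plain floats and the arithmetic is invented purely to make the data flow visible. It is not RMT's actual computation.

```python
from typing import List, Sequence, Tuple


def toy_segment_step(memory: List[float],
                     segment: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for one pass over [read memory; segment; write memory]."""
    context = sum(memory) / len(memory)           # "read": tokens see the memory block
    outputs = [tok + context for tok in segment]
    summary = sum(segment) / len(segment)
    new_memory = [0.5 * m + 0.5 * summary for m in memory]  # "write": fixed-size update
    return outputs, new_memory


def run_segments(tokens: Sequence[float], segment_len: int,
                 num_memory_tokens: int) -> Tuple[List[float], List[float]]:
    """Process a long sequence segment by segment, carrying only the memory block."""
    memory = [0.0] * num_memory_tokens
    outputs: List[float] = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        segment_out, memory = toy_segment_step(memory, segment)
        outputs.extend(segment_out)
    return outputs, memory


if __name__ == "__main__":
    tokens = [1.0] * 4 + [0.0] * 4
    outputs, memory = run_segments(tokens, segment_len=4, num_memory_tokens=2)
    print(outputs)
    print(memory)
```

The state carried between segments has a fixed size regardless of sequence length, which is the serving intuition. Training through this loop means backpropagating through every `toy_segment_step` call in the chain, which is where the BPTT cost comes from.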
Associative Recurrent Memory Transformer extends RMT with layerwise associative memory. It is useful here because it tests a different bottleneck: not just whether a small memory-token block can persist across segments, but whether the memory can store and rewrite many key/value associations over very long contexts. The evidence is long-context language and associative retrieval, so it remains architecture background for time series.
Recurrent Action Transformer with Memory adapts the RMT read/write memory contract to offline RL trajectories. It is the action-trajectory bridge in this topic: observations, actions, and rewards are present, and the Memory Retention Valve tries to prevent important sparse cues from leaking out of memory. It should be cited as policy/decision-model evidence, not as an explicit action-conditioned dynamics model.
RATE also makes clear that memory tokens, MRV filtering, and Transformer-XL-style cached hidden states are different resources. Cached hidden states can help in some continuous-feedback or image-like settings, but in sparse T-Maze-style tasks they can interfere with retaining the rare cue.
Time-Series RWKV Bridge
RWKV-TS is the direct numeric time-series bridge in this topic. It adapts RWKV-style recurrence to passive time-series tasks with patching, time mixing, channel mixing, and a multi-head WKV operator. The WKV computation is attention-like in effect but can be written as recurrent state with linear sequence-length cost.
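A simplified single-channel sketch of the two views of a WKV-style operator, assuming a scalar decay `w`, keys `k`, and values `v`. The attention-like form recomputes a decayed softmax-style average over all past positions, while the recurrent form carries only a running numerator and denominator. The sketch deliberately omits the current-token bonus term, the multi-head structure, and the numerical-stability handling that real RWKV implementations use, so it should be read as an illustration rather than the RWKV-TS operator.

```python
import math
from typing import List, Sequence


def wkv_quadratic(k: Sequence[float], v: Sequence[float], w: float) -> List[float]:
    """Attention-like view: O(T^2) work, every step revisits the whole past."""
    out = []
    for t in range(len(k)):
        weights = [math.exp(-(t - i) * w + k[i]) for i in range(t + 1)]
        weighted_sum = sum(wt * v[i] for i, wt in enumerate(weights))
        out.append(weighted_sum / sum(weights))
    return out


def wkv_recurrent(k: Sequence[float], v: Sequence[float], w: float) -> List[float]:
    """Recurrent view: O(T) work with a two-number state (num, den)."""
    decay = math.exp(-w)
    num = den = 0.0
    out = []
    for k_t, v_t in zip(k, v):
        num = decay * num + math.exp(k_t) * v_t
        den = decay * den + math.exp(k_t)
        out.append(num / den)
    return out


if __name__ == "__main__":
    k = [0.1, -0.3, 0.5, 0.0, 0.2]
    v = [1.0, 2.0, -1.0, 0.5, 3.0]
    print(wkv_quadratic(k, v, w=0.5))
    print(wkv_recurrent(k, v, w=0.5))
```

Both functions should agree up to rounding, which is the sense in which the computation is attention-like in effect but linear in sequence length when written as recurrent state.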
RWKV-TS should be treated as architecture evidence rather than as a pretrained TSFM checkpoint source: the paper trains models from scratch across forecasting, imputation, anomaly detection, classification, and few-shot settings.
Recurrent-Depth Transformer Branch
Universal Transformers and the newer looped-model line are adjacent rather than identical to compact recurrent-state models. They reuse Transformer computation across depth, so recurrence is mainly over representation refinement rather than a fixed-size hidden state that advances one token or time step at a time.
Huginn, Parcae, and The Recurrent Transformer make this branch more relevant to efficient sequence modeling because they discuss large-scale recurrent-depth pretraining, stable looped scaling, layerwise recurrent memory, and efficient decoding. Latent Thoughts, LoopFormer, and Sparse Looped LMs fill in the theory, elastic-depth interface, and sparse-capacity variant; they remain repeated-depth and dynamic-compute evidence rather than compact recurrent-state evidence. mHC and Hyperloop Transformers add a residual-stream-capacity variant: instead of only looping a block or retrieving depth KV, a looped model can use parallel residual streams to make repeated passes less representation-constrained. Efficient Parallel Samplers for Recurrent-Depth Models adds the inference question: extra latent-state refinement needs a sampler that preserves throughput.
DiffusionBlocks adds a training-side bridge for this branch: a Huginn-style recurrent-depth model can be trained with a single-pass denoising objective instead of recurrent-depth BPTT, but that does not turn looped depth into compact recurrent state or prove numeric time-series state preservation.
MesaNet sits at the boundary between recurrent sequence models and test-time optimization. It adds the locally optimal fast-weight branch, where the model spends dynamic conjugate-gradient compute to solve an in-context regression problem, but current evidence remains language/synthetic and should be read as serving-budget evidence rather than direct time-series evidence.
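To make the phrase "dynamic conjugate-gradient compute" concrete, the sketch below runs a textbook conjugate-gradient solver on the normal equations of a small ridge regression over in-context (key, value) pairs, returning the number of iterations actually used. The function names, the regularizer `lam`, and the data are hypothetical; this is a generic CG illustration under those assumptions, not MesaNet's layer.

```python
from typing import List, Sequence, Tuple


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def matvec(A: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    return [dot(row, x) for row in A]


def ridge_normal_equations(keys: Sequence[Sequence[float]], values: Sequence[float],
                           lam: float) -> Tuple[List[List[float]], List[float]]:
    """Build A = K^T K + lam * I and b = K^T v for the regression v ~ K w."""
    d = len(keys[0])
    A = [[sum(k[i] * k[j] for k in keys) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(k[i] * val for k, val in zip(keys, values)) for i in range(d)]
    return A, b


def conjugate_gradient(A: Sequence[Sequence[float]], b: Sequence[float],
                       tol: float = 1e-10, max_iters: int = 50) -> Tuple[List[float], int]:
    """Solve A x = b for symmetric positive-definite A; return (x, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for x = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    for it in range(max_iters):
        if rs ** 0.5 < tol:
            return x, it
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iters


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [1.0, 2.0, 3.0]
    A, b = ridge_normal_equations(keys, values, lam=0.1)
    w, iters = conjugate_gradient(A, b)
    print(w, iters)
```

The iteration count depends on the tolerance and on how well conditioned the context is, which is why this kind of solver turns serving cost into a per-input budget rather than a fixed one.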
For this page, the distinction to preserve is compact recurrent state versus repeated depth. Compact recurrent state is primarily a serving-memory contract; looped depth is primarily a dynamic-compute contract. A time-series model may need both, but they should not be treated as the same mechanism.
Matrix-valued residual streams are a third contract: they add state capacity across depth or loop boundaries, but their cost lives in residual-stream memory traffic and specialized kernels rather than sequence-length recurrence alone.
Overlap With Test-Time Memory
Looped Transformers And Test-Time Memory owns the broader memory/dynamic-compute map. This page owns the efficient recurrent-state slice. The practical split is:
- cite RMT, ARMT, RATE here when the question is explicit memory-token state carried across segments or trajectories;
- cite Mamba, RWKV-TS, and ParaRNN here when the question is compact hidden recurrent state and parallel training or serving;
- cite Titans, ATLAS, MIRAS, and MesaNet from the looped/test-time-memory topic when the question is inference-time memory updates, retention objectives, or local optimization;
- compare all of them only under explicit state size, update cost, latency, BPTT depth, and benchmark hygiene.
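The "explicit state size" criterion in the list above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a standard attention KV cache (keys and values per layer, per KV head, per position) and a Mamba-style recurrent state of `d_inner * d_state` values per layer. The configuration numbers are hypothetical, and the recurrent estimate ignores small extras such as convolution buffers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Attention cache: keys and values (factor 2) for every cached position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(num_layers: int, d_inner: int, d_state: int,
                          bytes_per_elem: int = 2) -> int:
    """Recurrent state: fixed size per layer, independent of sequence length."""
    return num_layers * d_inner * d_state * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    layers, kv_heads, head_dim = 24, 16, 64
    d_inner, d_state = 2048, 16
    state = recurrent_state_bytes(layers, d_inner, d_state)
    for seq_len in (1024, 8192, 65536):
        cache = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
        print(f"seq_len={seq_len}: kv_cache={cache} bytes, recurrent_state={state} bytes")
```

State size is only one axis; update cost, latency, and BPTT depth need their own accounting before two architectures are compared.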
Relevance For Time-Series And World Models
For this wiki, RMT, ARMT, RATE, Mamba, Mamba-2, Mamba-3, ParaRNN, and the recurrent-depth Transformer sources are architectural background rather than direct forecasting evidence. RWKV-TS is direct time-series evidence, but not broad pretrained TSFM evidence. Together they matter because many time-series and trajectory models need a compact latent state, long context, and efficient inference:
- RMT-style memory tokens expose an explicit state block;
- ARMT-style associative memory raises the capacity/overwrite question;
- RATE shows that action trajectories can benefit from memory retention under partial observability;
- Mamba-style selective state can be a strong passive dynamics backbone;
- RWKV-style WKV recurrence is a tested numeric time-series alternative;
- ParaRNN suggests that nonlinear latent-state dynamics might become practical at scale if the solver remains stable and structured;
- recurrent-depth Transformers suggest a separate route where extra compute refines state under a budget.
When writing about numeric time-series models, cite these sources as sequence-model primitives and then separately cite the concrete forecasting or trajectory paper for empirical claims. Do not treat language-model perplexity as evidence for multivariate time-series forecasting, event-stream modeling, or action-conditioned world modeling without a bridging experiment.
Relation To Foundation TSFM Agenda
This page is adjacent to the Foundation Time-Series Model Research Agenda through the streaming-state and long-context slots. Compact recurrent state is a plausible serving substrate for always-on time-series models, but the page should be treated as architecture background unless a concrete source evaluates numeric time-series state maintenance, multivariate telemetry, or action-conditioned trajectories.
Open Questions
- Which time-series or trajectory regimes actually need nonlinear hidden-state updates rather than selective linear recurrent state?
- When is explicit memory-token state preferable to an SSM/RWKV/Mamba state for telemetry or action trajectories?
- Can associative memory or MRV-style memory filtering preserve rare interventions and stale-state overwrites better than ordinary recurrent hidden state?
- When does RWKV-style time mixing outperform SSM or xLSTM-style recurrent state under the same time-series benchmark hygiene?
- Can ParaRNN-style solvers handle explicit actions, control inputs, or interventions while preserving sparse or block-diagonal Jacobian structure?
- Do Mamba-3 complex state transitions help periodic, rotational, or conservation-like dynamics in numeric time series?
- When should compact recurrent-state models spend extra consolidation compute before eviction rather than carrying a larger state or looping at prediction time?
- Can BDH-style sparse positive fast state preserve rare regimes and cross-channel relationships in numeric streams, or does it only give interpretable language concepts in current evidence?
- When is repeated Transformer depth a better serving tradeoff than a compact recurrent state with a larger hidden dimension?
- What is the right benchmark for comparing attention, SSMs, and nonlinear RNNs when serving cost, context length, channel count, and state tracking all matter?