Extra-Long Context For Time Series
Summary
Extra-long context for time series is not just a larger attention window. For this wiki, the target is a model or system that can use very long histories of numeric observations, event streams, context, topology, and action or control-input history while still meeting a serving budget.
The current toolbox is plural. MiniMax Sparse Attention (MSA) is one important tool: it keeps exact attention over a learned subset of the long context, so the selected tokens are read uncompressed. It should sit next to learned context compression, compact recurrent state, segment memory, adaptive patching, channel hierarchy, KV-cache compression, and serving emulation.
A mechanism only counts as useful for time-series extra-long context if it answers two questions at once:
- What state is retained? Raw patches, selected blocks, compressed latent tokens, recurrent hidden state, associative memory, channel hierarchy, or external summaries.
- What must not be lost? Rare regimes, dense numeric detail, channel-specific deviations, event timing, exogenous variables, topology, and action or intervention history.
Toolbox Map
mindmap
  root((extra-long TS context))
    Raw and patched windows
      TimesFM
      Toto
      TiRex
    Sparse reads
      MiniMax Sparse Attention
      LT2 sparse/hybrid mixers
    Learned compression
      LCLM
      CAT
      ConceptMoE/H-Net analogies
    Compressed vector state
      TurboQuant KV/retrieval memory
    Compact recurrent state
      RWKV-TS
      Mamba/Gated DeltaNet
      FlowState
      ParaRNN/SMT
      Dragon Hatchling
    Segment and test-time memory
      RMT/ARMT/RATE
      Titans/ATLAS/MIRAS
      LLM Sleep
    Time/channel hierarchy
      ReinPatch
      U-Cast
    Serving evaluation
      LLM-Emu
      GPU inference benchmarks
Mechanism Families
| Family | What It Keeps Cheap | Local Anchors | Candidate TSFM Use | Main Preservation Risk |
|---|---|---|---|---|
| Raw or patched finite windows | Use a large but explicit observation window; reduce token count through patches or multi-patch decoding. | TimesFM, Toto, Toto 2.0, TiRex | Strong baseline for forecasting and long-horizon rollouts before adding more exotic memory. | Still finite-window and mostly passive; long context is not always-on state maintenance. |
| Content-dependent sparse attention | Attend exactly to selected long-range blocks instead of all blocks. | MiniMax Sparse Attention, LT2 sparse/hybrid looped mixers | Read long histories when only a small subset matters, with lower prefill/decode attention cost. | Selection failure: unselected rare events, delayed exogenous variables, or action history are invisible to that layer. |
| Learned context compression before or during decoding | Replace older or longer raw context with learned latent tokens or compressed chunk state. | Latent Context Language Models, Compress & Attend Transformers, H-Net, ConceptMoE | Keep recent observations high resolution while making older history compact. | Compression can erase dense numeric detail or retrieval-critical small events unless expansion or reconstruction probes catch it. |
| KV-cache or retrieval-memory compression | Reduce stored vector-state bytes while preserving inner-product or retrieval geometry. | TurboQuant | Compress cached attention state, latent-state stores, or retrieval memories under memory pressure. | Storage wins may vanish once dequantization cost, hardware-native FP8 baselines, latency, throughput, and rare-state probes are counted. |
| Compact recurrent state and linear-time mixers | Update a fixed-size or bounded hidden state instead of retaining all prior tokens. | RWKV-TS, Mamba-3, Gated DeltaNet-2, FlowState, TiRex | Always-on updates for long streams where replaying history is impossible. | Fixed state can overwrite rare regimes or stale-but-action-relevant associations. |
| Nonlinear recurrent-state training | Keep nonlinear latent dynamics but make training or pretraining more parallel. | ParaRNN, Pretraining Recurrent Networks without Recurrence | Candidate route when time-series dynamics need nonlinear state transitions rather than linear SSM-style recurrence. | Current evidence is language/synthetic; solver convergence, teacher ceiling, rollout drift, and action channels remain open. |
| Segment memory and associative memory | Carry an explicit memory block or associative key/value memory across chunks. | RMT, ARMT, RATE | Retain state across long observation segments or trajectories without full attention over all prior samples. | Sequential segment processing, memory capacity, rewrite semantics, and BPTT depth become first-order costs. |
| Test-time memory and consolidation | Update a memory system at inference or spend compute before context eviction. | Titans, ATLAS, Language Models Need Sleep, Dragon Hatchling | Convert recent high-resolution windows into persistent state before raw samples roll off. | Update cost, memory objective, and consolidation schedule may dominate the serving budget. |
| Adaptive temporal or channel hierarchy | Reduce the active sequence or channel set through learned patches, latent channel queries, or hierarchical abstractions. | ReinPatch, U-Cast | Allocate resolution to high-information spans and compress redundant channel/time regions. | Learned boundaries or channel bottlenecks may erase small local deviations or cross-channel anomalies. |
| Serving-native evaluation | Run or emulate enough of the actual serving stack to test latency, memory, queueing, and cache behavior. | LLM-Emu, GPU Inference Optimization | Test whether a long-context mechanism still wins under batching, prefix caching, bursty load, and output-length variation. | Architecture-only wins can disappear when kernels, cache layout, queue state, and workload distribution are included. |
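
To make the "content-dependent sparse attention" row concrete, here is a minimal sketch of exact attention over a selected subset of blocks. It is not the MiniMax Sparse Attention implementation: the function name `block_sparse_attention`, the mean-key scoring rule, and the toy data are all assumptions made for illustration, and a real selector would be learned.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Exact softmax attention restricted to the top_k highest-scoring blocks.

    Assumption: a block is scored by the dot product between the query and
    the block's mean key. This stands in for a learned selector.
    """
    n, dim = len(keys), len(query)
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]

    # Score every block cheaply, using one vector per block.
    block_scores = []
    for blk in blocks:
        mean_key = [sum(keys[i][d] for i in blk) / len(blk) for d in range(dim)]
        block_scores.append(dot(query, mean_key))

    # Keep only the top_k blocks; everything else is invisible to this read.
    ranked = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    idx = [i for b in chosen for i in blocks[b]]

    # Exact (uncompressed) softmax attention over the selected tokens.
    logits = [dot(query, keys[i]) / math.sqrt(dim) for i in idx]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
              for d in range(out_dim)]
    return output, chosen


# Toy history: 8 tokens in 4 blocks; only block 2 aligns with the query.
keys = [[0.0, 0.0]] * 4 + [[5.0, 0.0]] * 2 + [[0.0, 0.0]] * 2
values = [[float(i)] for i in range(8)]
output, chosen = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1)
```

In this toy run only block 2 is read, so changing a value in any other block leaves the output unchanged. That is the selection-failure risk from the table in miniature: an unselected rare event contributes nothing to that layer.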
Reading Rules
- Treat extra-long context as a state-retention and serving problem, not only as a maximum context-length number.
- Separate raw access from compressed state. MSA-style sparse attention keeps selected tokens exact; LCLM/CAT-style methods replace parts of history with learned state; recurrent methods do not keep raw history directly attendable.
- Separate temporal length from channel count. A million time steps with few channels and a thousand channels over a shorter window stress different mechanisms.
- Separate passive histories from action-conditioned histories. A deployment, rollback, treatment, recommendation, autoscaling command, or remediation is an action, control input, or intervention only when it is logged with timing and outcome semantics.
- Require preservation probes for rare regimes and low-salience signals before accepting any compression or sparse-selection win.
- Require whole-serving measurements before accepting latency or memory claims: prefill, decode, update cost, KV/cache footprint, recurrent-state size, dequantization, batching, and burst behavior can trade off differently.
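
The rules about temporal length versus channel count, and about memory claims, can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes each channel is patched and tokenized separately, and uses the standard KV-cache size formula (keys plus values, per layer, per KV head). All model dimensions are hypothetical placeholders, not figures for any model named on this page.

```python
import math


def token_count(time_steps, channels, patch_len):
    """Tokens when every channel is patched independently (an assumption)."""
    return math.ceil(time_steps / patch_len) * channels


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (factor 2) stored for every token at every layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def recurrent_state_bytes(layers, state_elems_per_layer, bytes_per_elem=2):
    """A fixed-size recurrent state does not grow with history length."""
    return layers * state_elems_per_layer * bytes_per_elem


# Two workloads with similar token counts but very different shapes.
long_narrow = token_count(time_steps=1_000_000, channels=4, patch_len=32)
short_wide = token_count(time_steps=4_096, channels=1_000, patch_len=32)

# Hypothetical model: 24 layers, 8 KV heads, head_dim 64, 16-bit elements.
kv_bytes = kv_cache_bytes(long_narrow, layers=24, kv_heads=8, head_dim=64)
state_bytes = recurrent_state_bytes(layers=24, state_elems_per_layer=8 * 64 * 64)

print(long_narrow, short_wide)   # 125000 128000
print(kv_bytes, state_bytes)     # 6144000000 1572864
```

The two workloads land at nearly the same token count, yet one stresses temporal reach and the other cross-channel mixing, which is why the two axes should be reported separately. The byte counts say nothing about latency, dequantization, or batching, so they are a starting point for a serving measurement, not a substitute for one.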
How To Use This Toolbox
A practical extra-long-context TSFM design will probably combine several branches rather than choose one:
- Keep a recent high-resolution window for local numeric fidelity.
- Use content-dependent sparse reads for exact access to selected older blocks when selection confidence is high.
- Convert older windows into compressed latent state or segment memory when raw attention becomes too expensive.
- Maintain a compact recurrent state for continuous updates between larger refresh or consolidation steps.
- Use adaptive patching and channel hierarchy so the model spends tokens on meaningful changes rather than on every uniformly sampled value.
- Compress cached vectors only when the full serving path still improves after hardware-native baselines and dequantization costs.
- Evaluate the full stack with held-out long histories, rare events, exogenous variables, action histories, and realistic request/workload traces.
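
As a rough illustration of how several of these branches might be combined, the sketch below keeps a recent raw window, folds evicted samples into fixed-length patch summaries, and maintains a small recurrent state. Everything here is an assumption for illustration: the class name `HybridContext`, the choice of (mean, min, max) as the patch summary, and the exponential moving average standing in for a learned recurrent state.

```python
from collections import deque


class HybridContext:
    """Toy univariate context: raw recent window + patch summaries + EMA state."""

    def __init__(self, recent_len, patch_len, max_patches, decay):
        self.recent_len = recent_len
        self.patch_len = patch_len
        self.decay = decay
        self.recent = deque()                     # high-resolution recent samples
        self.pending = []                         # evicted samples not yet summarized
        self.patches = deque(maxlen=max_patches)  # oldest summaries drop off silently
        self.state = None                         # stand-in for a recurrent hidden state

    def update(self, x):
        # Always-on recurrent update (here just an exponential moving average).
        if self.state is None:
            self.state = float(x)
        else:
            self.state = self.decay * self.state + (1.0 - self.decay) * x

        # Keep the newest samples exact; evict the oldest one when full.
        self.recent.append(x)
        if len(self.recent) > self.recent_len:
            self.pending.append(self.recent.popleft())

        # Consolidate evicted samples into one summary per full patch.
        if len(self.pending) == self.patch_len:
            p = self.pending
            self.patches.append((sum(p) / len(p), min(p), max(p)))
            self.pending = []

    def stored_values(self):
        return len(self.recent) + len(self.pending) + 3 * len(self.patches) + 1


ctx = HybridContext(recent_len=4, patch_len=4, max_patches=2, decay=0.9)
for x in [0, 0, 100, 0, 1, 1, 1, 1, 2, 2, 2, 2]:
    ctx.update(x)

print(list(ctx.recent))    # [2, 2, 2, 2]
print(list(ctx.patches))   # [(25.0, 0, 100), (1.0, 1, 1)]
```

The spike of 100 survives in the first summary only because the maximum is stored next to the mean; a mean-only summary would report 25.0 and lose the event. The timing of the spike inside the patch is lost either way, which is the kind of loss a preservation probe should detect.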
Relation To Foundation TSFM Agenda
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state and long context | partially closes | The existing KB now has several mechanism families for long histories: finite windows, sparse selection, learned compression, compact recurrence, memory tokens, and consolidation. | Need direct benchmarks that combine long numeric histories, event streams, high channel count, and online updates. |
| Dynamic compute and serving | adjacent | MSA, LT2, LCLM, CAT, TurboQuant, LLM Sleep, and LLM-Emu make compute, memory, cache, and latency first-class. | Need TSFM serving studies with real update loops, batching, memory pressure, and workload traces. |
| Native multivariate encoding | partially closes | Toto and U-Cast show high-cardinality multivariate pressure; hierarchy and sparse selection suggest possible channel/group analogues. | Need mechanisms that preserve channel-specific deviations, topology, and cross-channel causal structure under compression. |
| Event streams and context interface | adjacent | LCLM/CAT/segment-memory mechanisms are plausible for logs, tickets, traces, and long context. | Need typed event schemas and context/action histories rather than untyped text or passive samples. |
| Control and counterfactuals | insufficient evidence | RATE is an action-trajectory memory analogue, and several sources mention actions as future work. | Need action-conditioned time-series rollouts with interventions, failed actions, exogenous variables, and decision utility. |
| Benchmark validity | warning | Current evidence mixes time-series forecasting, language long-context retrieval, GPU serving, and synthetic memory tasks. | Need benchmark hygiene that reports context length, channel count, state size, update cost, selection/compression recall, and serving latency together. |
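
The "Benchmark validity" row asks for several quantities to be reported together. One way to enforce that is a result record that flags anything left unreported. The sketch below is hypothetical: the class name `LongContextReport`, its field names and units, and the example numbers are placeholders, not measurements.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LongContextReport:
    """One benchmark result; None means 'not reported'."""
    mechanism: str
    context_length: Optional[int] = None          # time steps of history
    channel_count: Optional[int] = None
    state_bytes: Optional[int] = None             # KV cache or recurrent state size
    update_cost_ms: Optional[float] = None        # cost of one online update
    preservation_recall: Optional[float] = None   # selection/compression recall
    serving_latency_ms: Optional[float] = None


def missing_fields(report):
    """Names of fields that were left unreported, in declaration order."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Placeholder values only; nothing here is a measured result.
report = LongContextReport(
    mechanism="sparse attention",
    context_length=1_000_000,
    channel_count=4,
    state_bytes=6_144_000_000,
    preservation_recall=0.5,
)
print(missing_fields(report))   # ['update_cost_ms', 'serving_latency_ms']
```

A comparison table could then refuse any row for which `missing_fields` is non-empty, so that an architecture-only result cannot be read as a serving result.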
Open Questions
- What is the right retained-history unit for time series: samples, patches, channel-time cells, events, topology neighborhoods, action spans, or learned latent chunks?
- Which information must remain exactly readable, and which information can be compressed into latent state?
- Can MSA-style sparse block selection, LCLM/CAT-style learned compression, recurrent state, and TurboQuant-style vector compression be composed without compounding preservation failures?
- How should sparse-selection recall be measured for rare regimes, delayed exogenous variables, and low-salience action history?
- Should compression be optimized for reconstruction, next-observation prediction, anomaly sensitivity, retrieval, control value, or downstream decision utility?
- Can a model learn when to consolidate or sleep before eviction instead of using fixed window boundaries?
- Which mechanisms still win after full serving constraints are counted: wall-clock latency, memory bandwidth, KV-cache layout, recurrent-state updates, batching, and bursty workload traces?
- What minimal public benchmark would compare dense attention, sparse attention, learned compression, recurrent state, and memory tokens under the same context length, channel count, action history, and serving budget?
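
For the sparse-selection recall question above, one possible starting point (not a settled answer) is the fraction of labeled rare-event positions that fall inside the blocks a selector chose. The function name `selection_recall` and the block-index convention (position `p` belongs to block `p // block_size`) are assumptions for this sketch.

```python
def selection_recall(event_positions, selected_blocks, block_size):
    """Fraction of labeled event positions covered by the selected blocks.

    Returns None when there are no labeled events, since recall is undefined.
    """
    if not event_positions:
        return None
    selected = set(selected_blocks)
    hits = sum(1 for p in event_positions if p // block_size in selected)
    return hits / len(event_positions)


# Three labeled events; the selector picked blocks 0 and 14 (block_size 64).
recall = selection_recall([3, 130, 900], selected_blocks=[0, 14], block_size=64)
print(recall)   # 2 of 3 events covered
```

This only measures coverage at the position level. It does not capture delayed effects of exogenous variables or whether a covered event was actually used, so it would need to be paired with downstream probes.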
Related Pages
- Foundation Time-Series Model Research Agenda
- Streaming Latent-State Updates
- Time-Series Scaling And Efficiency
- Efficient Recurrent Sequence Models
- Looped Transformers And Test-Time Memory
- Latent Tokenization
- High-Dimensional Time Series Forecasting
- GPU Inference Optimization
- Time-Series Benchmark Hygiene