Extra-Long Context For Time Series

Summary

Extra-long context for time series is not just a larger attention window. For this wiki, the target is a model or system that can use very long histories of numeric observations, event streams, context, topology, and action or control-input history while still meeting a serving budget.

The current toolbox is plural. MiniMax Sparse Attention (MSA) is one important tool: it applies exact attention to a learned subset of the long context rather than compressing the tokens it selects. It belongs alongside learned context compression, compact recurrent state, segment memory, adaptive patching, channel hierarchy, KV-cache compression, and serving emulation.

A mechanism counts as useful for time-series extra-long context only if it answers two questions at once:

  1. What state is retained? Raw patches, selected blocks, compressed latent tokens, recurrent hidden state, associative memory, channel hierarchy, or external summaries.
  2. What must not be lost? Rare regimes, dense numeric detail, channel-specific deviations, event timing, exogenous variables, topology, and action or intervention history.

Toolbox Map

mindmap
  root((extra-long TS context))
    Raw and patched windows
      TimesFM
      Toto
      TiRex
    Sparse reads
      MiniMax Sparse Attention
      LT2 sparse/hybrid mixers
    Learned compression
      LCLM
      CAT
      ConceptMoE/H-Net analogies
    Compressed vector state
      TurboQuant
      KV/retrieval memory
    Compact recurrent state
      RWKV-TS
      Mamba/Gated DeltaNet
      FlowState
      ParaRNN/SMT
      Dragon Hatchling
    Segment and test-time memory
      RMT/ARMT/RATE
      Titans/ATLAS/MIRAS
      LLM Sleep
    Time/channel hierarchy
      ReinPatch
      U-Cast
    Serving evaluation
      LLM-Emu
      GPU inference benchmarks

Mechanism Families

| Family | What It Keeps Cheap | Local Anchors | Candidate TSFM Use | Main Preservation Risk |
| --- | --- | --- | --- | --- |
| Raw or patched finite windows | Use a large but explicit observation window; reduce token count through patches or multi-patch decoding. | TimesFM, Toto, Toto 2.0, TiRex | Strong baseline for forecasting and long-horizon rollouts before adding more exotic memory. | Still finite-window and mostly passive; long context is not always-on state maintenance. |
| Content-dependent sparse attention | Attend exactly to selected long-range blocks instead of all blocks. | MiniMax Sparse Attention, LT2 sparse/hybrid looped mixers | Read long histories when only a small subset matters, with lower prefill/decode attention cost. | Selection failure: unselected rare events, delayed exogenous variables, or action history are invisible to that layer. |
| Learned context compression before or during decoding | Replace older or longer raw context with learned latent tokens or compressed chunk state. | Latent Context Language Models, Compress & Attend Transformers, H-Net, ConceptMoE | Keep recent observations high resolution while making older history compact. | Compression can erase dense numeric detail or retrieval-critical small events unless expansion or reconstruction probes catch it. |
| KV-cache or retrieval-memory compression | Reduce stored vector-state bytes while preserving inner-product or retrieval geometry. | TurboQuant | Compress cached attention state, latent-state stores, or retrieval memories under memory pressure. | Storage wins may vanish after dequantization, hardware-native FP8 baselines, latency, throughput, and rare-state probes are counted. |
| Compact recurrent state and linear-time mixers | Update a fixed-size or bounded hidden state instead of retaining all prior tokens. | RWKV-TS, Mamba-3, Gated DeltaNet-2, FlowState, TiRex | Always-on updates for long streams where replaying history is impossible. | Fixed state can overwrite rare regimes or stale-but-action-relevant associations. |
| Nonlinear recurrent-state training | Keep nonlinear latent dynamics but make training or pretraining more parallel. | ParaRNN, Pretraining Recurrent Networks without Recurrence | Candidate route when time-series dynamics need nonlinear state transitions rather than linear SSM-style recurrence. | Current evidence is language/synthetic; solver convergence, teacher ceiling, rollout drift, and action channels remain open. |
| Segment memory and associative memory | Carry an explicit memory block or associative key/value memory across chunks. | RMT, ARMT, RATE | Retain state across long observation segments or trajectories without full attention over all prior samples. | Sequential segment processing, memory capacity, rewrite semantics, and BPTT depth become first-order costs. |
| Test-time memory and consolidation | Update a memory system at inference or spend compute before context eviction. | Titans, ATLAS, Language Models Need Sleep, Dragon Hatchling | Convert recent high-resolution windows into persistent state before raw samples roll off. | Update cost, memory objective, and consolidation schedule may dominate the serving budget. |
| Adaptive temporal or channel hierarchy | Reduce the active sequence or channel set through learned patches, latent channel queries, or hierarchical abstractions. | ReinPatch, U-Cast | Allocate resolution to high-information spans and compress redundant channel/time regions. | Learned boundaries or channel bottlenecks may erase small local deviations or cross-channel anomalies. |
| Serving-native evaluation | Run or emulate enough of the actual serving stack to test latency, memory, queueing, and cache behavior. | LLM-Emu, GPU Inference Optimization | Test whether a long-context mechanism still wins under batching, prefix caching, bursty load, and output-length variation. | Architecture-only wins can disappear when kernels, cache layout, queue state, and workload distribution are included. |
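
To make the content-dependent sparse attention family concrete, here is a minimal NumPy sketch of block selection followed by exact attention over the selected blocks only. It is an illustration under stated assumptions, not the MiniMax Sparse Attention or LT2 implementation: the function name, the mean-key block scorer, and the `block_size` and `top_k` parameters are all hypothetical.

```python
import numpy as np

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Exact softmax attention restricted to the top_k highest-scoring blocks.

    Hypothetical sketch: each block is scored by the dot product between the
    query and the block's mean key. This stand-in selector is an assumption,
    not the scoring rule of any specific published method.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    n, d = keys.shape
    n_blocks = n // block_size  # trailing samples that do not fill a block are dropped
    k_blocks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    v_blocks = values[: n_blocks * block_size].reshape(n_blocks, block_size, -1)

    # Cheap selection pass: one score per block.
    block_scores = k_blocks.mean(axis=1) @ q
    selected = np.sort(np.argsort(block_scores)[-top_k:])

    # Exact attention over the selected blocks; nothing is compressed.
    k_sel = k_blocks[selected].reshape(-1, d)
    v_sel = v_blocks[selected].reshape(-1, v_blocks.shape[-1])
    logits = k_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel, selected
```

When `top_k` equals the number of blocks, the result matches dense attention. When `top_k` is small, any block the selector skips contributes nothing to the output, which is the selection-failure risk named in the table.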

Reading Rules

  • Treat extra-long context as a state-retention and serving problem, not only as a maximum context-length number.
  • Separate raw access from compressed state. MSA-style sparse attention keeps selected tokens exact; LCLM/CAT-style methods replace parts of history with learned state; recurrent methods do not keep raw history directly attendable.
  • Separate temporal length from channel count. A million time steps with few channels and a thousand channels over a shorter window stress different mechanisms.
  • Separate passive histories from action-conditioned histories. A deployment, rollback, treatment, recommendation, autoscaling command, or remediation counts as an action, control input, or intervention only when it is logged with timing and outcome semantics.
  • Require preservation probes for rare regimes and low-salience signals before accepting any compression or sparse-selection win.
  • Require whole-serving measurements before accepting latency or memory claims: prefill, decode, update cost, KV/cache footprint, recurrent-state size, dequantization, batching, and burst behavior can trade off differently.
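
Two of the rules above (temporal length versus channel count, and cache footprint versus recurrent-state size) reduce to simple arithmetic. The sketch below is a rough back-of-envelope aid, not a measurement: the per-channel patching layout, the function names, and every model dimension in the usage note are assumptions chosen for illustration.

```python
def token_count(time_steps, channels, patch_len):
    """Tokens when every channel is patched independently (an assumed layout)."""
    patches_per_channel = -(-time_steps // patch_len)  # ceiling division
    return channels * patches_per_channel

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for every token, layer, and KV head."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

def recurrent_state_bytes(layers, state_dim, bytes_per_value=2):
    """One fixed-size state per layer, independent of history length."""
    return layers * state_dim * bytes_per_value
```

With a hypothetical patch length of 32, `token_count(1_000_000, 4, 32)` gives 125,000 tokens and `token_count(4096, 1000, 32)` gives 128,000 tokens: nearly the same sequence length, but the first stresses temporal reach and the second stresses cross-channel mixing. For an assumed 12-layer model with 8 KV heads of dimension 64 at 2 bytes per value, `kv_cache_bytes(125_000, 12, 8, 64)` is 3,072,000,000 bytes, while `recurrent_state_bytes(12, 32768)` is 786,432 bytes regardless of history length. Neither number says anything about what each state preserves, which is why the preservation probes are a separate requirement.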

How To Use This Toolbox

A practical extra-long-context TSFM design will probably combine several branches rather than choose one:

  1. Keep a recent high-resolution window for local numeric fidelity.
  2. Use content-dependent sparse reads for exact access to selected older blocks when selection confidence is high.
  3. Convert older windows into compressed latent state or segment memory when raw attention becomes too expensive.
  4. Maintain a compact recurrent state for continuous updates between larger refresh or consolidation steps.
  5. Use adaptive patching and channel hierarchy so the model spends tokens on meaningful changes rather than on every uniformly sampled value.
  6. Compress cached vectors only when the full serving path still improves after hardware-native baselines and dequantization costs.
  7. Evaluate the full stack with held-out long histories, rare events, exogenous variables, action histories, and realistic request/workload traces.
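
Steps 1, 3, and 4 above can be sketched as a single streaming container. This is a toy under loud assumptions: `HybridContext` is a hypothetical name, mean pooling stands in for a learned compressor, and an exponential moving average stands in for a learned recurrent state. Sparse reads, adaptive patching, and vector quantization (steps 2, 5, and 6) are left out.

```python
from collections import deque
import numpy as np

class HybridContext:
    """Toy combination of a raw recent window, chunk summaries, and a compact state."""

    def __init__(self, recent_len, chunk_len, max_summaries, decay=0.99):
        self.recent_len = recent_len
        self.chunk_len = chunk_len
        self.decay = decay
        self.recent = deque()                         # step 1: high-resolution window
        self.pending = []                             # evicted samples awaiting compression
        self.summaries = deque(maxlen=max_summaries)  # step 3: compressed older history
        self.state = None                             # step 4: always-on compact state

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Step 4: update the fixed-size state on every sample (EMA stand-in).
        if self.state is None:
            self.state = x.copy()
        else:
            self.state = self.decay * self.state + (1.0 - self.decay) * x
        # Step 1: keep the newest samples at full resolution.
        self.recent.append(x)
        if len(self.recent) > self.recent_len:
            self.pending.append(self.recent.popleft())
        # Step 3: once enough samples have rolled off, compress them into one summary.
        if len(self.pending) == self.chunk_len:
            self.summaries.append(np.mean(self.pending, axis=0))
            self.pending = []
```

Feeding the scalar stream 0, 1, ..., 9 into `HybridContext(recent_len=4, chunk_len=2, max_summaries=3)` leaves samples 6 through 9 in the raw window and the summaries 0.5, 2.5, and 4.5 for the older history. Mean pooling would flatten a short spike inside a chunk, which is exactly the kind of loss that step 7 is meant to catch.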

Relation To Foundation TSFM Agenda

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Streaming state and long context | partially closes | The existing KB now has several mechanism families for long histories: finite windows, sparse selection, learned compression, compact recurrence, memory tokens, and consolidation. | Need direct benchmarks that combine long numeric histories, event streams, high channel count, and online updates. |
| Dynamic compute and serving | adjacent | MSA, LT2, LCLM, CAT, TurboQuant, LLM Sleep, and LLM-Emu make compute, memory, cache, and latency first-class. | Need TSFM serving studies with real update loops, batching, memory pressure, and workload traces. |
| Native multivariate encoding | partially closes | Toto and U-Cast show high-cardinality multivariate pressure; hierarchy and sparse selection suggest possible channel/group analogues. | Need mechanisms that preserve channel-specific deviations, topology, and cross-channel causal structure under compression. |
| Event streams and context interface | adjacent | LCLM/CAT/segment-memory mechanisms are plausible for logs, tickets, traces, and long context. | Need typed event schemas and context/action histories rather than untyped text or passive samples. |
| Control and counterfactuals | insufficient evidence | RATE is an action-trajectory memory analogue, and several sources mention actions as future work. | Need action-conditioned time-series rollouts with interventions, failed actions, exogenous variables, and decision utility. |
| Benchmark validity | warning | Current evidence mixes time-series forecasting, language long-context retrieval, GPU serving, and synthetic memory tasks. | Need benchmark hygiene that reports context length, channel count, state size, update cost, selection/compression recall, and serving latency together. |
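
The benchmark-hygiene requirement in the last row asks for several quantities to be reported together. One way to enforce that is a single record type that cannot be constructed with a field missing. The sketch below is an assumption about what such a record could look like; the class name, field names, and units are hypothetical and not taken from any existing benchmark.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LongContextReport:
    """One benchmark row; every field is required so none can be silently omitted."""
    mechanism: str
    context_length: int        # time steps of history visible to the model
    channel_count: int
    state_bytes: int           # KV cache, memory tokens, or recurrent state
    update_ms: float           # cost of ingesting one new observation
    preservation_recall: float # selection or compression recall, in [0, 1]
    p50_latency_ms: float
    p99_latency_ms: float

    def __post_init__(self):
        if not 0.0 <= self.preservation_recall <= 1.0:
            raise ValueError("preservation_recall must lie in [0, 1]")
        if self.p99_latency_ms < self.p50_latency_ms:
            raise ValueError("p99 latency cannot be below p50 latency")
```

Calling `asdict(report)` on an instance yields a flat dictionary that can be written to a results table, so two mechanisms are always compared on the same columns.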

Open Questions

  • What is the right retained-history unit for time series: samples, patches, channel-time cells, events, topology neighborhoods, action spans, or learned latent chunks?
  • Which information must remain exactly readable, and which information can be compressed into latent state?
  • Can MSA-style sparse block selection, LCLM/CAT-style learned compression, recurrent state, and TurboQuant-style vector compression be composed without compounding preservation failures?
  • How should sparse-selection recall be measured for rare regimes, delayed exogenous variables, and low-salience action history?
  • Should compression be optimized for reconstruction, next-observation prediction, anomaly sensitivity, retrieval, control value, or downstream decision utility?
  • Can a model learn when to consolidate or sleep before eviction instead of using fixed window boundaries?
  • Which mechanisms still win after full serving constraints are counted: wall-clock latency, memory bandwidth, KV-cache layout, recurrent-state updates, batching, and bursty workload traces?
  • What minimal public benchmark would compare dense attention, sparse attention, learned compression, recurrent state, and memory tokens under the same context length, channel count, action history, and serving budget?
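
As a possible starting point for the sparse-selection recall question, recall can be computed per event kind rather than as one pooled number, so that a high score on frequent events cannot hide misses on rare ones. This is one candidate definition offered as a sketch, not an established metric: the function name, the `(position, kind)` event format, and the fixed `block_size` are assumptions.

```python
from collections import defaultdict

def selection_recall_by_kind(selected_blocks, events, block_size):
    """Fraction of labeled events, per kind, that fall inside a selected block.

    events is an iterable of (position, kind) pairs, where position is a
    time-step index and kind is a label such as "rare_regime" or "action".
    """
    selected = set(selected_blocks)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, kind in events:
        totals[kind] += 1
        if position // block_size in selected:
            hits[kind] += 1
    return {kind: hits[kind] / totals[kind] for kind in totals}
```

For example, with blocks 0 and 3 selected at a block size of 64, events at positions 5 and 100 labeled `"rare_regime"` give a recall of 0.5, an `"action"` event at position 250 gives 1.0, and an `"exogenous"` event at position 70 gives 0.0.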