Time-Series Scaling And Efficiency

Summary

The time-series foundation model cluster does not have one settled scaling path. Some papers argue for larger dense or sparse models, while others argue that compact backbones, adaptive tokenization, recurrent state, or compression can match larger systems under the right benchmark and horizon.

The 2024 scaling-law sources now make the positive case sharper. Scaling-laws for Large Time-series Models reports power-law behavior for decoder-only time-series Transformers with respect to parameter count, dataset size, and compute. Scaling Law for Time Series Forecasting adds a theory and experiment thread where look-back horizon is also a scaling variable.
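A power-law claim of this kind is usually checked by fitting a straight line in log-log space. The sketch below assumes the simplest form, loss ≈ a · N^(-b), with no irreducible-loss term; the function name and the synthetic numbers are illustrative and are not taken from either paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) by least squares in log-log space.

    Assumes a pure power law with no irreducible-loss floor.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free illustration (not data from the cited papers).
param_counts = np.array([1e6, 1e7, 1e8, 1e9])
synthetic_loss = 5.0 * param_counts ** -0.1
a, b = fit_power_law(param_counts, synthetic_loss)
```

The same fit can be repeated with dataset size, compute, or look-back horizon on the x-axis. A curve that bends in log-log space is a sign that the pure power-law assumption does not hold over that range.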

stable-worldmodel adds an adjacent systems-scaling reminder: for action-conditioned world models, data layout, streaming throughput, solver interfaces, and factor-of-variation evaluation can become first-order scaling variables alongside parameter count.

Large-Model And Sparse-Capacity Direction

Toto 2.0 explicitly frames forecasting as entering a scaling era, with open-weights checkpoints from 4M to 2.5B parameters and reported monotonic gains through the largest released size. The TSALM presentation adds the training-recipe framing: NormMuon plus AdamW for pinball-loss optimization, UMuP hyperparameter transfer across sizes, REX-style staged sweeps, observability-plus-synthetic data, contiguous patch masking, and fast long-horizon inference. The talk also states that long-horizon stability improves with model size and improves further with block decoding, making stability itself part of the scaling claim rather than only short-horizon error.
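For reference, the pinball (quantile) loss named in the recipe can be sketched as follows. This is the standard textbook definition, not Toto 2.0's training code; the function name is an assumption for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1).

    Under-prediction is weighted by q, over-prediction by (1 - q).
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# A high quantile (q = 0.9) penalizes under-prediction more than over-prediction.
loss = pinball_loss([1.0, 2.0, 3.0], [2.0, 2.0, 2.0], q=0.9)
```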

Time-MoE scales total capacity with sparse temporal mixture-of-experts layers, keeping activated parameters lower than total parameters. Moirai-MoE uses token-level expert routing inside the Moirai family and argues that learned pattern specialization is better than hand-assigned frequency buckets.
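The gap between total and activated parameters comes from top-k routing. A minimal NumPy sketch follows, assuming linear experts and a softmax over the selected gate logits; this is a generic formulation, not the exact router of Time-MoE or Moirai-MoE, and all names are hypothetical.

```python
import numpy as np

def top_k_route(tokens, gate_w, expert_ws, k=2):
    """Route each token to its k highest-scoring experts.

    tokens:    (n, d) token representations
    gate_w:    (d, E) gating weights
    expert_ws: (E, d, d) one linear map per expert (illustrative experts)
    """
    logits = tokens @ gate_w                          # (n, E)
    top_idx = np.argsort(logits, axis=1)[:, -k:]      # (n, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax over the selected experts only.
    w = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        rows, slots = np.nonzero(top_idx == e)        # tokens that picked expert e
        if rows.size:
            out[rows] += w[rows, slots][:, None] * (tokens[rows] @ expert_ws[e])
    return out, top_idx
```

Each token touches only k of the E expert matrices, so activated compute scales with k while total capacity scales with E.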

Sundial, Timer, and TimesFM continue the dense sequence-modeling direction through decoder-only Transformers, continuous flow-matching forecast heads, segment generation, patching, and large pretraining corpora.

T2S uses a related flow-matching mechanism but lands in a different part of the architecture trade space. Sundial’s TimeFlow head is a forecast head conditioned on numeric history representations. T2S uses a length-adaptive VAE plus a text-conditioned Diffusion Transformer to generate synthetic time series from captions. Both use velocity prediction from Gaussian noise toward a data sample, but T2S pays an additional autoencoding and text-conditioning cost to support arbitrary requested lengths and natural-language conditioning.
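The shared velocity-prediction idea can be sketched as below. The linear noise-to-data path and the Euler sampler are assumptions chosen for illustration; neither paper's conditioning, network, or exact path is reproduced here.

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """Build one training pair for velocity prediction.

    Assumes a linear path x_t = (1 - t) * noise + t * x_data,
    whose target velocity is the constant x_data - noise.
    """
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform(size=(x_data.shape[0], 1))
    x_t = (1.0 - t) * noise + t * x_data
    v_target = x_data - noise
    return x_t, t, v_target

def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In a forecast head, `velocity_fn` would be a network conditioned on the history representation. In a caption-conditioned generator, it would act in the autoencoder's latent space and take the text embedding as conditioning.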

Scaling-laws for Large Time-series Models is the direct empirical support for this direction: it argues that broad heterogeneous pretraining produces predictable power-law loss improvements. Scaling Law for Time Series Forecasting adds an equally important warning: look-back horizon should not be treated as a fixed benchmark knob, because a longer context raises the intrinsic size of the forecasting problem under finite data.

Compact And Specialized Direction

TSMixer is the non-pretrained all-MLP root of the mixer direction: it alternates time mixing and feature mixing rather than using self-attention. Tiny Time Mixers (TTM) then shows that mixer-style forecasters with 1M to 5M parameters can be strong zero-shot and few-shot baselines, especially when the backbone is pretrained channel-independently and the head handles target adaptation and exogenous variables.
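The alternation of time mixing and feature mixing can be sketched with two matrix products. This simplified block assumes single-layer mixers with ReLU and residual connections, and it omits the normalization, dropout, and hidden widths of the published architecture.

```python
import numpy as np

def mixer_block(x, w_time, w_feat):
    """One simplified mixer block on a (T, C) window.

    w_time: (T, T) mixes across time steps, shared by all channels.
    w_feat: (C, C) mixes across channels, shared by all time steps.
    """
    relu = lambda z: np.maximum(z, 0.0)
    x = x + relu(w_time @ x)   # time mixing: each channel is processed independently
    x = x + relu(x @ w_feat)   # feature mixing: each time step is processed independently
    return x
```

With the feature-mixing weights set to zero, the block reduces to a channel-independent model, which is the regime the TTM backbone is pretrained in.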

Reverso pushes the compact argument further with a 550K-parameter model built from long convolutions, DeltaNet-style linear recurrence, an attention decoder, flip equivariance, and FFT-guided downsampling.

Kairos argues for adaptive temporal abstraction rather than parameter count alone. Its mixture-of-size encoder, dynamic RoPE, and multi-patch decoder let small models adapt token granularity to local sequence structure.

ReinPatch adds a stronger learned-tokenization variant: instead of choosing patch sizes from periodicity, entropy, or a fixed recipe, it trains a detachable boundary policy against downstream forecasting loss and then reuses that policy as a frozen foundation patcher.

TiRex uses xLSTM recurrent state plus contiguous patch masking to preserve long-horizon forecast state without relying only on larger Transformer capacity.

RWKV-TS adds another recurrent-style branch: an RWKV backbone with time mixing, channel mixing, and multi-head WKV recurrence for time-series tasks. It is trained from scratch in the paper, so it is architecture evidence rather than a released pretrained TSFM checkpoint.

FlowState uses an SSM encoder, functional basis decoder, and parallel forecasts to make small models adapt context length, target length, and sampling rate without relying on patching or large Transformer capacity.

Moirai 2.0 is another efficiency counterpoint: the released small model is reported stronger than larger Moirai 2.0 variants on the paper’s GIFT-Eval aggregate, while also simplifying the original Moirai interface.

CHARM is compact by parameter count, about 7M, but not automatically cheap: its description-aware channel-time attention scales as O(C^2 T^2). Its efficiency lesson is that semantic channel conditioning can improve representation transfer, but the channel-pair interface itself becomes the scaling bottleneck.
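A back-of-envelope count shows why the interface, not the parameter count, dominates. The sketch assumes the O(C^2 T^2) term comes from attention over all C·T channel-time positions, and compares it with per-channel temporal attention; the function name is illustrative.

```python
def attention_pairs(channels, steps):
    """Count attention score entries for two layouts.

    joint:       one attention over all channels * steps positions.
    independent: separate temporal attention per channel.
    """
    joint = (channels * steps) ** 2
    independent = channels * steps ** 2
    return joint, independent

joint, independent = attention_pairs(channels=8, steps=64)
# joint / independent equals the channel count, so the gap widens linearly with C.
```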

TabPFN-3 is mostly a static tabular model, but it is useful as a scaling analogy because its report combines row compression, reduced KV caching, row chunking, and a specialized TabPFN-TS-3 checkpoint. The lesson to port carefully is context compression for large structured inputs, not that static table rows are temporal tokens.

Compression And Rank Structure

FlowRanks studies low-rank structure in time-series Transformers and uses that structure to compress Chronos-style models. It suggests that time-series representations may have stronger rank decay than language or vision features, which would make after-the-fact compression or rank-aware architecture design unusually valuable.
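Rank decay can be inspected with a singular value decomposition. The sketch below is a generic diagnostic plus a truncated-SVD factorization, offered as an assumption about how such an analysis might start; it is not the FlowRanks procedure.

```python
import numpy as np

def effective_rank(matrix, energy=0.99):
    """Smallest rank whose singular values capture `energy` of the squared spectrum."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

def low_rank_factors(weight, rank):
    """Return (a, b) with a @ b approximating `weight` at the given rank."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]
```

Replacing an (m, n) weight with factors of shapes (m, r) and (r, n) cuts parameters from m·n to r·(m + n), which only pays off when r is well below min(m, n).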

Compute Optimal Tokenization is upstream language-model evidence, but it sharpens how this wiki should talk about compression. If token granularity changes, token counts are not a stable scaling unit; the paper argues for bytes per parameter in text. A time-series analogue should name its own information-density unit, such as samples, channel-time cells, events, compressed bits, or another task-grounded measure, before claiming a scaling law.

TurboQuant adds a serving-memory version of the same warning. It does not change tokenization or model size; it compresses high-dimensional KV-cache and vector-search state while trying to preserve the geometry used by downstream inner-product scoring. The vLLM critique makes the serving caveat sharper: lower KV-cache storage cost does not automatically translate into lower latency or higher throughput when the inference engine must dequantize back to BF16 and when FP8 provides a hardware-native baseline. For time-series foundation models, the transfer question is whether latent-state, retrieval-memory, or trajectory-embedding compression can preserve rare regimes, channel-specific deviations, exogenous variables, and action history while improving the full serving budget rather than only average reconstruction quality.

U-Cast adds a channel-dimension scaling case: full channel attention can become impractical when a multivariate time series has thousands of channels, so the model uses hierarchical latent queries and upsampling to reduce channel compute while retaining cross-channel structure.
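The latent-query idea can be sketched as a down-projection to M latents followed by an up-projection back to C channels. This is a single-level illustration under assumed shapes and names, not U-Cast's actual hierarchy or parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def latent_channel_mix(x, latents):
    """Mix C channel embeddings through M latent queries (M << C).

    x:       (C, d) per-channel embeddings
    latents: (M, d) learned latent queries (hypothetical parameters)
    Cost is O(C * M * d) instead of O(C^2 * d) for full channel attention.
    """
    d = x.shape[1]
    down = softmax(latents @ x.T / np.sqrt(d))   # (M, C): latents attend to channels
    z = down @ x                                  # (M, d): compressed channel summary
    up = softmax(x @ z.T / np.sqrt(d))            # (C, M): channels attend to latents
    return up @ z                                 # (C, d): upsampled back to channels
```

Because every output row is a weighted average of M summaries, channel-specific deviations survive only if a residual path or additional per-channel signal is kept alongside this bottleneck.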

Learning from Leading Indicators (LIFT) adds a lighter-weight multivariate branch: instead of dense all-channel mixing, it estimates local lead-lag relationships and lets lagging channels use leading indicators. This is an efficiency idea as much as an accuracy idea because it asks for sparse, asymmetric, time-varying channel dependence rather than full cross-channel attention.
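A lead-lag relationship can be estimated with lagged correlation. The sketch below assumes a single global lag over one window and a brute-force search; the paper's own estimator and its local, time-varying treatment are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def best_lead(leader, follower, max_lag):
    """Find the lag in [0, max_lag] that maximizes corr(leader[t], follower[t + lag])."""
    leader = np.asarray(leader, dtype=float)
    follower = np.asarray(follower, dtype=float)
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        a = leader[: len(leader) - lag]   # leader values at time t
        b = follower[lag:]                # follower values at time t + lag
        corr = np.corrcoef(a, b)[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, float(corr)
    return best_lag, best_corr
```

Keeping only the few strongest leaders per lagging channel gives a sparse, asymmetric dependence structure rather than a dense C-by-C attention map.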

Hierarchical Modeling with a Fixed FLOPs Budget

H-Net, ConceptMoE, ReinPatch, and U-Cast point toward a stronger scaling hypothesis: do not choose temporal patch sizes, concept counts, channel bottlenecks, or per-layer compression ratios by hand. Train a router under a global compute budget and let the model allocate FLOPs to tokens, channels, time spans, or modality fragments that preserve downstream state.

This is different from sparse attention. Sparse attention usually reduces communication cost while keeping the same representation resolution. Budgeted hierarchy changes the number or granularity of active representations, then may decode or upsample back to the fine level for dense outputs, forecasting, anomaly localization, or action-conditioned world-model state updates.
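The budgeted-selection step can be sketched as hard top-k under a FLOPs cap. The scores here are given rather than learned, and a uniform per-token cost is assumed, so this illustrates only the allocation constraint and not a trained router from any of the cited papers.

```python
import numpy as np

def select_under_budget(scores, cost_per_token, flops_budget):
    """Keep the highest-scoring tokens that fit within a fixed FLOPs budget.

    Assumes every kept token costs the same `cost_per_token`.
    Returns kept indices in their original temporal order.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(int(flops_budget // cost_per_token), scores.size)
    keep = np.argsort(scores)[::-1][:k]
    return np.sort(keep)

kept = select_under_budget([0.1, 0.9, 0.5, 0.7], cost_per_token=2, flops_budget=5)
```

Unlike a sparse attention mask, the dropped positions are not represented at the coarse level at all, so a decoder or upsampler has to reconstruct fine-level outputs from the kept ones.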

The local research note is Hierarchical Modeling with a Fixed FLOPs Budget. Rank diagnostics from FlowRanks may help inspect redundancy, but they are not the central mechanism.

Two recent LLM-agent and LLM-training sources strengthen the compression intuition while staying outside direct TSFM evidence. Scaling Test-Time Compute for Agentic Coding shows that raw long-horizon traces can be worse than structured summaries for selection and reuse. Learning is Forgetting frames training itself as lossy compression toward objective-relevant information. For time series, the transfer claim should stay conditional: compression is useful only if the learned state preserves rare regimes, dense numeric detail, topology, actions, and delayed effects.

EBT adds a separate dynamic-compute branch. It does not compress tokens or route experts; it spends extra computation by optimizing candidate predictions under a learned energy and by selecting low-energy candidates. For time-series systems, this is a candidate mechanism for high-uncertainty windows, rare regimes, or intervention rollouts, but the current evidence is text/video/image rather than numeric time series.
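The two mechanisms, refining a candidate by descending the energy and selecting the lowest-energy candidate, can be sketched as follows. A hand-written energy stands in for the learned one, and the function names, step count, and step size are assumptions for illustration.

```python
import numpy as np

def refine_by_energy(energy_grad, candidate, steps=50, lr=0.1):
    """Spend extra compute by gradient descent on the energy w.r.t. the prediction."""
    y = np.asarray(candidate, dtype=float).copy()
    for _ in range(steps):
        y = y - lr * energy_grad(y)
    return y

def pick_lowest_energy(energy, candidates):
    """Return the candidate prediction with the lowest energy."""
    energies = [energy(c) for c in candidates]
    return candidates[int(np.argmin(energies))]
```

The `steps` argument is the dynamic-compute knob: a serving policy could raise it only for windows whose initial energy is high.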

DiffusionBlocks adds a training-memory branch rather than a token-compression branch. It partitions residual networks into independently trainable denoising blocks, so only one block needs gradients, optimizer state, and activations at a time. For TSFMs, this is upstream evidence for memory-bounded training and possible private/on-prem adaptation splits, but it is not yet numeric time-series evidence and does not solve privacy by itself.

Upstream Recurrent Architecture Background

Mamba, Mamba-2, and Mamba-3 are not time-series foundation-model papers, but they are important architecture background for compact latent-state sequence mixers. The progression moves from selective SSMs and fused scans, to semiseparable-matrix SSD algorithms, to richer discretization, complex state transitions, and MIMO updates.

ParaRNN changes the architecture question by showing that nonlinear GRU/LSTM-style recurrence can be trained in parallel at billion-parameter language-model scale when the hidden trajectory is solved with Newton iterations and parallel reduction. For time-series models, this suggests a possible path beyond the usual tradeoff between expressive nonlinear state updates and parallel training, but it remains a transfer hypothesis until tested on numeric time-series or trajectory benchmarks.

Dragon Hatchling adds a fast-state route outside the standard SSM/RNN split: BDH-GPU is framed as an attention-based state-space sequence model with a large n x d recurrent state, sparse positive activations, and GPU-friendly low-rank implementation. Its relevance to time-series scaling is architectural: mutable state might scale context without quadratic attention, but current evidence is language/translation and does not test numeric channels, irregular event streams, or actions.

Language Models Need Sleep adds a serving-budget twist for SSM-attention hybrids: compact fast-weight memory may need extra consolidation compute before old context rolls out of the attention cache. For time-series models, this is an adjacent design pattern for infinite-context always-on streams: spend compute at window boundaries to update latent state, then keep normal prediction cheap. It remains language/synthetic evidence until tested on numeric observations, event streams, and action histories.

Looped Transformers And Test-Time Memory now tracks the Universal Transformer successor branch and the explicit memory-token branch. Universal Transformers anchor shared recurrent depth and adaptive halting; Huginn, Latent Thoughts, LoopFormer, Parcae, Sparse Looped LMs, ELT, and DiffusionBlocks update that line with modern looped-depth, stability, sparse-capacity, early-exit, loop-boundary supervision, and BPTT-avoidance evidence. RMT, ARMT, and RATE add a different scaling route: carry a learned memory block or associative memory between segments instead of paying full attention over all prior tokens. Titans, ATLAS, MIRAS, and MesaNet add the test-time-memory and online-optimization side.

MoDA adds a depth-communication branch. It is not a recurrent or looped model: it spends extra attention work to retrieve from prior layer key/value memories inside a single forward pass. The time-series-relevant lesson is that depth scaling should be evaluated as both compute and communication. A deeper or looped model may underuse earlier states if the only path is residual accumulation; a depth-retrieval model may preserve useful intermediate state but pay in cache size and memory bandwidth.

mHC and Hyperloop Transformers add a residual-stream-width branch. mHC stabilizes multi-stream residual connections with constrained mixing; Hyperloop uses loop-level hyper-connections so a parameter-shared middle block can carry richer residual state across recurrent passes. For time-series efficiency, this is promising only if the memory-access and kernel requirements are counted alongside parameter savings.

ELT adds a visual-generation version of the same budget question. It trains loop-boundary exits with Intra-Loop Self Distillation, so a single model can vary loop count at inference time. For TSFMs, this is an analogy for adaptive compute over hard windows or candidate futures, but it still needs direct numeric time-series evidence and calibrated stopping rules.

WavSpA is another upstream long-sequence background source. It performs attention in wavelet coefficient space, preserving position/frequency structure with linear-time transforms. It is not forecasting evidence yet, but it is a plausible candidate for long, nonstationary numeric sequences where Fourier-only global mixing can be too blunt.
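A linear-time wavelet transform can be illustrated with one level of the Haar transform. Haar is chosen here only because it is the simplest case; the wavelet family and any learned components used by WavSpA are not assumed.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # coarse, low-frequency half
    detail = (even - odd) / np.sqrt(2.0)   # local, high-frequency half
    return approx, detail

def haar_inverse(approx, detail):
    """Exactly invert haar_step."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x
```

Each coefficient keeps a position as well as a scale, which is the property that distinguishes this from a global Fourier mixing step.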

The local JEPA-curriculum discussion adds a practical caution for recurrent Transformer depth. Looping a block can reduce memory and support early exit, and representation-convergence speed may act as an uncertainty or failure signal. But recurrent depth should be scored as dynamic compute under a constraint, not as automatic research progress: without a memory, latency, or fixed-FLOPs budget, a wider or deeper non-recurrent baseline with unique weights may be the cleaner comparison. Titans Revisited and the sparse-looped-model results are useful reminders that memory and looping claims can reverse when chunking, routing, or baseline strength changes.

Architecture Tradeoffs To Track

  • Dense decoder-only Transformers scale naturally but can be costly at long context and long horizon.
  • Sparse MoE models increase total capacity while keeping activated compute lower, but memory, routing stability, and serving complexity remain.
  • Mixer, convolution, xLSTM, RWKV-style, and linear-RNN hybrids can be much smaller, but may need carefully matched training and inference recipes.
  • Selective SSMs and structured linear recurrent models can offer compact-state inference and parallel training, but their hidden-state update remains constrained by linear or semiseparable structure.
  • ParaRNN-style nonlinear recurrent models may recover richer latent-state dynamics while retaining parallel training, but only when the Newton solver converges quickly and the hidden-state Jacobian structure remains cheap.
  • Recurrent Transformer depth can trade unique weights for loop compute and early-exit signals, and ELT-style loop-boundary supervision may make intermediate exits more useful, but it should be compared at matched memory, latency, or expected FLOPs.
  • Depth-KV retrieval can make previous layer state directly accessible, but the cache and memory-bandwidth budget must be counted before treating it as efficient depth scaling.
  • Matrix-valued residual streams can add state capacity to looped or deep models, but they move part of the cost into memory access, recomputation, kernel fusion, and communication schedules.
  • Segment-level recurrent memory can reduce effective context cost, but it moves the burden to memory capacity, overwrite behavior, sequential segment processing, and BPTT stability.
  • Sleep-time consolidation can move recurrent compute to window boundaries before cache eviction, but it makes consolidation scheduling, training stability, and matched wall-clock latency part of the efficiency claim.
  • Test-time memory can extend retained context without quadratic attention, but memory capacity, update objective, update cost, and cross-variate retention become part of the serving contract.
  • Core fast-state architectures can make mutable memory part of the forward state itself, but the state size, update bandwidth, and rare-regime retention need matched serving tests.
  • Continuous basis decoders can expose flexible sampling rates and horizons, but they make the coefficient-to-observation interface part of the modeling contract.
  • Latent diffusion or flow decoders can expose text-conditioned generation, but the VAE bottleneck and guidance schedule become part of the numerical fidelity contract.
  • Energy-based prediction can expose per-candidate compatibility and variable inference effort, but second-order training, candidate-optimization cost, and many-mode energy landscapes become part of the serving contract.
  • Block-wise denoising training can reduce training memory and make per-block updates parallel or local, but pretrained conversion, cross-block coordination, privacy leakage through gradients, and task-dependent block count become part of the contract.
  • Adaptive tokenization can reduce wasted tokens, but it complicates position encoding, batching, and multivariate alignment. ReinPatch adds the question of whether a learned policy should be optimized end to end on each dataset or pretrained once as a reusable patcher.
  • Compression-aware scaling should declare its unit. Compute Optimal Tokenization can use bytes for text, but TSFMs need an equivalent unit that respects sample rate, channel count, missingness, event density, and intervention effects.
  • Fixed-FLOPs hierarchy could turn adaptive tokenization into a budgeted compute-allocation problem, but it adds router stability, hard-routing, and preservation-probe requirements.
  • Point-wise numeric value embeddings preserve temporal resolution, but they may increase token count relative to patching and need careful treatment of exogenous variables and control inputs.
  • Channel-independent univariate modeling improves corpus unification and serving simplicity, but it can miss native multivariate dynamics.
  • Description-conditioned channel-time attention can model native multivariate semantics, but may need channel sparsification, grouping, retrieval, or hierarchy for high-channel settings.
  • Hierarchical channel compression can reduce cost in high-dimensional multivariate forecasting, but it must preserve channel-specific deviations rather than only global shared trends.
  • Direct multi-patch, contiguous-mask, or one-pass horizon prediction reduces sequential decoding cost, but may trade off long-horizon uncertainty propagation.

Evidence

The evidence is no longer “no scaling laws.” The two 2024 scaling-law papers make a strong positive case that TSFMs can scale in LLM-like ways, while also showing that temporal horizon is a domain-specific scaling variable. Toto 2.0 reports monotonic parameter scaling in its article, and the TSALM talk gives a recipe-level account of how Datadog made the sweep transferable across sizes; Time-MoE and Moirai-MoE report sparse-routing gains; TSMixer, TTM, Reverso, Kairos, ReinPatch, RWKV-TS, TiRex, FlowState, Moirai 2.0, and TabPFN-3 argue that architecture, learned tokenization, recurrent state, and inference design can beat raw parameter count in specific benchmark regimes.

Compute Optimal Tokenization adds an upstream language-model warning that compression changes the scaling unit itself, and TurboQuant adds an upstream serving warning that memory compression should preserve scoring geometry and beat hardware-native baselines under real latency and throughput measurements, not only average reconstruction. Mamba, Mamba-2, Mamba-3, ParaRNN, MoDA, WavSpA, EBT, DiffusionBlocks, ELF, and stable-worldmodel add upstream sequence-model, training-system, and data-system evidence that compact state, alternative attention, nonlinear recurrent solving, depth-state retrieval, explicit energy minimization, independent block-wise training, continuous language-embedding flows, or high-throughput trajectory storage remain active, but their language/long-sequence/text-video/robotics results should not be treated as numeric forecasting results.

U-Cast and LIFT add that the channel dimension can be the scaling bottleneck even before parameter count dominates. CHARM adds that channel descriptions can improve native multivariate representations but can make channel-pair attention the cost center. EIDOS adds that point-wise scalar tokenization and latent prediction can improve representation geometry.

For action-conditioned world models, scaling may also need to be fitted per data stream, not only per parameter or token budget: transition-model error and reward-model error can have different sample exponents and unit costs. On Training in Imagination is adjacent theory evidence rather than TSFM evidence. Cross-paper comparisons should be routed through Time-Series Benchmark Hygiene before treating any rank as settled.

Relation To Foundation TSFM Agenda

This page maps to the Foundation Time-Series Model Research Agenda’s scaling, streaming-state, adaptive tokenization, native multivariate, and dynamic-compute slots. Parameter scaling and horizon scaling partially close the case that TSFMs can improve predictably, while compact state, adaptive patching, MoE, hierarchy, and rank compression are adjacent mechanisms. The agenda-relevant test is whether efficiency mechanisms preserve rare regimes, dense numeric detail, context, channel interactions, and action effects at serving time.

Open Questions

  • Where does parameter scaling saturate for forecasting once benchmark leakage and fine-tuned or ensemble entries are separated?
  • Which compact architectures keep their advantage when native multivariate coupling and known future exogenous variables are required?
  • Can nonlinear recurrent-state training from ParaRNN or memory-token recurrence from RMT/RATE transfer from token sequences to numeric time series, trajectories, or action-conditioned world models?
  • Can BDH-style sparse positive recurrent state transfer from language concepts to numeric regimes, channel relationships, and event-driven updates?
  • Should adaptive time-series tokenization be learned through forecasting loss, entropy proxies, periodicity heuristics, or reusable pretrained patchers?
  • Which channel-compression mechanisms scale to tens of thousands of channels without erasing local deviations?
  • Can sparse expert routing specialize by regime, horizon, frequency, covariate structure, or incident phase in an interpretable way?
  • Are rank-aware designs better built into the model from the start, or applied as compression after pretraining?
  • Can a global FLOPs constraint replace hand-picked compression ratios across temporal, channel, and modality hierarchies?
  • What is the TSFM equivalent of bytes per parameter, and does it change across dense signals, sparse events, high-channel telemetry, and action logs?
  • Which latent-state compression objective best preserves operationally rare but action-relevant time-series state: MSE, inner products, downstream prediction loss, or control value?
  • Which compression mechanisms still win after hardware-native numeric formats, kernel availability, dequantization overhead, latency, throughput, and memory-pressure regimes are counted?
  • When should a TSFM spend extra optimization steps on high-energy predictions instead of using a fixed-depth forward pass for every window?
  • Can depth-KV retrieval or matrix-valued residual streams preserve useful intermediate temporal state under bounded cache, memory-bandwidth, and kernel budgets, or does recurrent/segment memory dominate for always-on streams?
  • Can block-wise denoising training support pretrained TSFM conversion, tenant-local adaptation, or privacy-bounded update protocols without losing rare regimes or cross-block coordination?
  • How should TSFM evaluations disentangle objective design, corpus scale and cleaning, and backbone or inference-engineering gains?