Time-Series Scaling And Efficiency

Summary

The time-series foundation model cluster does not have one settled scaling path. Some papers argue for larger dense or sparse models, while others argue that compact backbones, adaptive tokenization, recurrent state, or compression can match larger systems under the right benchmark and horizon.

The 2024 scaling-law sources now make the positive case sharper. Scaling-laws for Large Time-series Models reports power-law behavior for decoder-only time-series Transformers with respect to parameter count, dataset size, and compute. Scaling Law for Time Series Forecasting adds a theory and experiment thread where look-back horizon is also a scaling variable.
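As a rough illustration of what a reported power law means operationally, the sketch below fits $L(N) = a N^{-b}$ by ordinary least squares on log-log axes. The function name `fit_power_law` and the data points are hypothetical and synthetic; this is not the fitting procedure of either paper, and a real fit would also need an irreducible-loss term and a stated fit region.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size). Returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free points generated from a = 2.0, b = 0.1 (illustration only).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, losses)
```

The same routine applies to dataset size or compute as the independent variable; treating look-back horizon as a further variable would need a multivariate fit that this sketch does not attempt.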

Scaling Laws, Carefully adds the upstream method-hygiene layer: compute-optimal claims should be fit as allocation frontiers, not read from one run, and the fitted optimum is sensitive to parameter counting, fit region, data regime, optimizer, tokenization, loss precision, and whether the only changed factor is scale. For time-series work this means scaling reports should name the data unit, effective-data regime, and cost model before extrapolating.
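One way to read "fit as allocation frontiers" is the IsoFLOP style of analysis: for each compute budget, fit a curve to loss against log model size and take its minimum, then track how that minimum moves across budgets. The sketch below is a minimal version under stated assumptions: the quadratic-in-log-size form, the helper names `solve3`, `isoflop_optimum`, and `synthetic_profile`, and all numbers are illustrative, not taken from the source.

```python
import math

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x

def isoflop_optimum(param_counts, losses):
    """Fit loss ~ c0 + c1*u + c2*u^2 in centered u = log10(N); return N at the vertex.

    Needs at least three distinct model sizes at the same compute budget.
    """
    logs = [math.log10(n) for n in param_counts]
    center = sum(logs) / len(logs)
    us = [v - center for v in logs]
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum(l * u ** k for u, l in zip(us, losses)) for k in range(3)]
    normal = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    _, c1, c2 = solve3(normal, t)
    if c2 <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return 10 ** (center - c1 / (2 * c2))

def synthetic_profile(best_log10_n):
    """Synthetic IsoFLOP profile with a known optimum (illustration only)."""
    sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
    return sizes, [2.0 + 0.5 * (math.log10(n) - best_log10_n) ** 2 for n in sizes]

# One fitted optimum per hypothetical compute budget.
frontier = {
    budget: isoflop_optimum(*synthetic_profile(best))
    for budget, best in [(1e18, 7.5), (1e19, 8.0), (1e20, 8.5)]
}
```

The point of the exercise is that the optimum comes from a fitted profile over several runs per budget; changing how parameters are counted or which sizes fall inside the fit region would move `frontier`, which is the sensitivity the paragraph above warns about.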

Deep Learning is Not So Mysterious or Different adds a complementary generalization lens: parameter count is a poor proxy for the complexity of the selected function, and a larger hypothesis space can coexist with a stronger simplicity or compression bias. A TSFM scaling frontier should therefore track not only loss, parameters, data, and FLOPs, but also effective dimensionality, compressibility, perturbation stability, and preservation of rare or decision-relevant state. The source does not show that larger TSFMs are automatically better or compute-optimal; its direct evidence is upstream neural-network, Gaussian-process, polynomial, and linear-model generalization.

LLMs as Noisy Channels adds an upstream warning to that positive case: monotonic power-law scaling may be a high-SNR special case. Once data noise, post-training perturbations, quantization, or model-interaction noise dominate, scaling curves can become U-shaped, so a TSFM scaling law should eventually include signal-to-noise variables rather than only parameter count, samples, horizon, or FLOPs.

Implicit Curriculum Hypothesis adds a capability-axis warning: a smooth aggregate scaling curve does not say which skills or latent-state capabilities have emerged. TSFM scaling-law work should therefore pair loss curves with checkpoint-time emergence probes for rare regimes, channel coupling, context use, event streams, and action-conditioned rollout.

VISReg adds an objective-level version of the same accounting problem. Its regularizer is nominally $O(NDK) + O(KN \log N)$ in batch size $N$, projection dimension $D$, and random slices $K$, but the paper’s own ablation finds that stable quality couples $K$ to $D$. On one GPU, $K \approx CD$ makes the projection term effectively $O(CND^{2})$; the claimed linear path depends on distributing independent slices across GPUs. For time-series representation learning, regularizer-scaling reports should therefore include realized wall-clock, communication, batch composition, and preservation probes rather than only symbolic complexity.
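The coupling argument can be checked with leading-order operation counts. The sketch below drops all constants and ignores communication; the function names are hypothetical, and the even split of slices across devices is an assumption used only to show where the linear-in-$D$ path would have to come from.

```python
import math

def projection_ops(n, d, k):
    """Leading-order count for the O(N D K) projection term."""
    return n * d * k

def slice_ops(n, k):
    """Leading-order count for the O(K N log N) term."""
    return k * n * math.log2(n)

def coupled_ops(n, d, c):
    """Total count on one device when the slice count tracks width, K = C * D."""
    k = c * d
    return projection_ops(n, d, k) + slice_ops(n, k)

def per_gpu_ops(n, d, c, gpus):
    """Per-device count if the K slices are split evenly (communication ignored)."""
    k_local = (c * d) / gpus
    return projection_ops(n, d, k_local) + slice_ops(n, k_local)

# Doubling D with K = C * D quadruples the projection term on a single device.
small = projection_ops(1024, 256, 2 * 256)
large = projection_ops(1024, 512, 2 * 512)
ratio = large / small
```

Under these assumptions the per-device cost grows linearly in $D$ only if the device count grows with $D$ as well, which is why realized wall-clock and communication belong in the report.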

stable-worldmodel adds an adjacent systems-scaling reminder: for action-conditioned world models, data layout, streaming throughput, solver interfaces, and factor-of-variation evaluation can become first-order scaling variables alongside parameter count.

SensorFM adds a domain-specific TSFM scaling case for wearable sensors: co-scaling 2M-to-2B sensor-hours with 100K-to-100M parameters improves reconstruction and downstream health-task performance. The caveat is that this is a private wearable-health corpus with aggregate one-minute features, not a public compute-optimal TSFM frontier.

IO-Aware GNN Layers adds the graph-structured version of that systems reminder. If a time-series model consumes service topology, power-grid topology, or other graph context through direct message passing, the scaling bottleneck may be HBM traffic, edge-wise intermediate materialization, degree skew, and sparse-kernel choice rather than sequence-model parameter count. Graph time-series scaling claims should therefore report latency, peak memory, and kernel/backend choice alongside accuracy.

Large-Model And Sparse-Capacity Direction

Toto 2.0 explicitly frames forecasting as entering a scaling era, with open-weights checkpoints from 4M to 2.5B parameters and reported monotonic gains through the largest released size. The TSALM presentation adds the training-recipe framing: NormMuon plus AdamW for pinball-loss optimization, UMuP hyperparameter transfer across sizes, REX-style staged sweeps, observability-plus-synthetic data, contiguous patch masking, and fast long-horizon inference. The talk also states that long-horizon stability improves with model size and improves further with block decoding, making stability itself part of the scaling claim rather than only short-horizon error.

Time-MoE scales total capacity with sparse temporal mixture-of-experts layers, keeping activated parameters lower than total parameters. Moirai-MoE uses token-level expert routing inside the Moirai family and argues that learned pattern specialization is better than hand-assigned frequency buckets.
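The gap between total and activated parameters follows from top-k routing. The sketch below is a generic top-k gate with softmax renormalization over the selected experts plus a parameter-count helper; it is an assumed, simplified form and not the routing rule of Time-MoE or Moirai-MoE, and the names `top_k_route` and `activated_fraction` are hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k largest gate logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def activated_fraction(n_experts, k, expert_params, shared_params):
    """Share of parameters touched per token when k of n_experts experts fire."""
    total = shared_params + n_experts * expert_params
    active = shared_params + k * expert_params
    return active / total

route = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
fraction = activated_fraction(n_experts=8, k=2, expert_params=10, shared_params=20)
```

The fraction describes compute per token, not memory: all experts still have to be resident for serving, which is part of why sparse capacity and serving cost are tracked separately here.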

MIRA adds domain-specific sparse-capacity evidence: a Time-MoE-derived medical model with CT-RoPE, Neural ODE extrapolation, and frequency-specific MoE reports stronger clinical zero-shot forecasting than larger general-domain TSFMs. SensorFM adds the wearable-health counterpart: a much larger private corpus and parameter/data scaling curve for representation transfer rather than clinical forecasting. Scaling comparisons should therefore separate parameter count from corpus/domain match, feature granularity, missingness, and irregular-time support.

Diff-MN adds dense MoE-NCDE evidence for irregular generation, but its time-complexity analysis makes serving cost part of the contract. Treating MoE as an efficiency mechanism should distinguish sparse-capacity scaling from dense dynamics specialization.

Sundial, Timer, and TimesFM continue the dense sequence-modeling direction through decoder-only Transformers, continuous flow-matching forecast heads, segment generation, patching, and large pretraining corpora.

T2S uses a related flow-matching mechanism but lands in a different part of the architecture trade space. Sundial’s TimeFlow head is a forecast head conditioned on numeric history representations. T2S uses a length-adaptive VAE plus a text-conditioned Diffusion Transformer to generate synthetic time series from captions. Both use velocity prediction from Gaussian noise toward a data sample, but T2S pays an additional autoencoding and text-conditioning cost to support arbitrary requested lengths and natural-language conditioning.
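The shared mechanism can be sketched as constructing a training pair for velocity prediction. The linear interpolation path and the target $x_1 - x_0$ below are the common rectified-flow convention and are an assumption here; the source does not state the exact parameterization used by Sundial or T2S, and `flow_matching_pair` is a hypothetical name.

```python
import random

def flow_matching_pair(x1, t, rng):
    """Build one (noisy point, velocity target) pair on a straight noise-to-data path.

    x1  : data sample (list of floats)
    t   : time in [0, 1]; t = 0 is pure noise, t = 1 is the data sample
    rng : random.Random instance supplying the Gaussian noise x0
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity

rng = random.Random(0)
sample = [0.5, -1.0, 2.0]
xt, v = flow_matching_pair(sample, t=0.25, rng=rng)
# Following the target velocity for the remaining time recovers the data sample.
recovered = [p + (1.0 - 0.25) * q for p, q in zip(xt, v)]
```

In a forecast head the network predicting `v` would be conditioned on history representations; in a text-to-series generator it would be conditioned on caption embeddings and operate in the autoencoder's latent space. Those conditioning paths are where the two designs differ in cost.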

Scaling-laws for Large Time-series Models is the direct empirical support for this direction: it argues that broad heterogeneous pretraining produces predictable power-law loss improvements. Scaling Law for Time Series Forecasting adds an equally important warning: the look-back horizon should not be treated as a fixed benchmark knob, because a longer context raises the intrinsic forecasting problem size under finite data.

Compact And Specialized Direction

TSMixer is the non-pretrained all-MLP root of the mixer direction: it alternates time mixing and feature mixing rather than using self-attention. Tiny Time Mixers then shows that 1M to 5M parameter mixer-style forecasters can be strong zero-shot and few-shot baselines, especially when the backbone is pretrained channel-independently and the head handles target adaptation and exogenous variables.
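The alternation of time mixing and feature mixing can be sketched on a plain $T \times C$ array. The block below is a deliberately reduced version: single linear maps with ReLU and residual connections, no normalization, dropout, or multi-layer MLPs. The helper names are hypothetical, and the sketch should not be read as the TSMixer reference architecture.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(a):
    return [[max(0.0, v) for v in row] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mixer_block(x, w_time, w_feat):
    """x is T x C (time steps by channels); w_time is T x T; w_feat is C x C."""
    # Time mixing: combines time steps, with the same weights for every channel.
    h = add(x, relu(matmul(w_time, x)))
    # Feature mixing: combines channels, with the same weights for every time step.
    return add(h, relu(matmul(h, w_feat)))

x = [[1.0, 2.0], [3.0, 4.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
time_only = mixer_block(x, swap, zeros)
feat_only = mixer_block(x, zeros, swap)
```

Setting `w_feat` to zero leaves a channel-independent model, which is the regime the Tiny Time Mixers backbone pretraining described above relies on.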

Reverso pushes the compact argument further with a 550K-parameter small model built from long convolutions, DeltaNet-style linear recurrence, an attention decoder, flip equivariance, and FFT-guided downsampling.

Kairos argues for adaptive temporal abstraction rather than parameter count alone. Its mixture-of-size encoder, dynamic RoPE, and multi-patch decoder let small models adapt token granularity to local sequence structure.

ReinPatch adds a stronger learned-tokenization variant: instead of choosing patch sizes from periodicity, entropy, or a fixed recipe, it trains a detachable boundary policy against downstream forecasting loss and then reuses that policy as a frozen foundation patcher.

TiRex uses xLSTM recurrent state plus contiguous patch masking to preserve long-horizon forecast state without relying only on larger Transformer capacity.

RWKV-TS adds another recurrent-style branch: an RWKV backbone with time mixing, channel mixing, and multi-head WKV recurrence for time-series tasks. It is trained from scratch in the paper, so it is architecture evidence rather than a released pretrained TSFM checkpoint.

FlowState uses an SSM encoder, functional basis decoder, and parallel forecasts to make small models adapt context length, target length, and sampling rate without relying on patching or large Transformer capacity.

Moirai 2.0 is another efficiency counterpoint: the released small model is reported stronger than larger Moirai 2.0 variants on the paper’s GIFT-Eval aggregate, while also simplifying the original Moirai interface.

CHARM is compact by parameter count, about 7M, but not automatically cheap: its description-aware channel-time attention scales as $O(C^{2} T^{2})$. Its efficiency lesson is that semantic channel conditioning can improve representation transfer, but the channel-pair interface itself becomes the scaling bottleneck.

TabPFN-3 is mostly a static tabular model, but it is useful as a scaling analogy because its report combines row compression, reduced KV caching, row chunking, and a specialized TabPFN-TS-3 checkpoint. The lesson to port carefully is context compression for large structured inputs, not that static table rows are temporal tokens.

Compression And Rank Structure

FlowRanks studies low-rank structure in time-series Transformers and uses that structure to compress Chronos-style models. It suggests that time-series representations may have stronger rank decay than language or vision features, which would make after-the-fact compression or rank-aware architecture design unusually valuable.

Compute Optimal Tokenization is upstream language-model evidence, but it sharpens how this wiki should talk about compression. If token granularity changes, token counts are not a stable scaling unit; the paper argues for bytes per parameter in text. A time-series analogue should name its own information-density unit, such as samples, channel-time cells, events, compressed bits, or another task-grounded measure, before claiming a scaling law.

LLMs as Noisy Channels adds the complementary SNR constraint. Even with a fixed scaling unit, more data or larger models can become harmful when accumulated noise, quantization, or post-training perturbations dominate the useful signal. The TSFM analogue should therefore report not only scaling exponents but also which noise source is being amplified: corrupt observations, repeated normal-state data, missingness, channel interference, quantized retained state, long-horizon rollout error, or post-training drift.

Learn From Your Own Latents And Not From Tokens is upstream synthetic theory rather than TSFM evidence, but it adds a sample-complexity warning: latent targets can change the data-scaling unit itself when hidden hierarchy is recoverable. A time-series analogue would need matched-compute tests on irregular, long-tailed, multivariate data.

TurboQuant adds a serving-memory version of the same warning. It does not change tokenization or model size; it compresses high-dimensional KV-cache and vector-search state while trying to preserve the geometry used by downstream inner-product scoring. The vLLM critique makes the serving caveat sharper: lower KV-cache storage cost does not automatically translate into lower latency or higher throughput when the inference engine must dequantize back to BF16 and when FP8 provides a hardware-native baseline. For time-series foundation models, the transfer question is whether latent-state, retrieval-memory, or trajectory-embedding compression can preserve rare regimes, channel-specific deviations, exogenous variables, and action history while improving the full serving budget rather than only average reconstruction quality.

Latent Context Language Models add a learned soft-token context-compression route. Instead of paying for full decoder prefill and then compressing the KV cache, LCLM compresses long prompt spans with a smaller encoder before decoder prefill. This is upstream language evidence, but it is directly relevant to the serving question for always-on time-series systems: can long observation histories be compressed before expensive state update or decoding while preserving rare regimes and control-relevant details?

Hybrid Associative Memories and HOLA add two selective explicit-memory routes between sparse attention and learned context compression. HAM keeps predictable context in a recurrent state and routes hard-to-predict tokens into a KV scratchpad under a target cache fraction. HOLA uses a fixed top-$w$ exact cache ranked by the delta-rule update magnitude and a separate sharpened read. For TSFM scaling, this turns compression into a budgeted selection question: exact memory should be spent on observations whose downstream value exceeds their cost, while the benchmark must test whether prediction error or update magnitude is an adequate proxy for that value.
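The budgeted-selection idea can be sketched as a fixed-capacity cache that keeps only the highest-scoring observations seen so far. In the sketch below the score is a generic stand-in for prediction error or update magnitude; the class `TopWCache` is hypothetical and is not the HAM or HOLA implementation, which also involve the recurrent state and the read path.

```python
import heapq

class TopWCache:
    """Keep the w highest-scoring (score, position, payload) items seen so far."""

    def __init__(self, w):
        self.w = w
        self._heap = []  # min-heap: the weakest retained item sits at index 0

    def offer(self, score, position, payload):
        item = (score, position, payload)
        if len(self._heap) < self.w:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            # Evict the weakest retained item in favor of the new one.
            heapq.heapreplace(self._heap, item)

    def positions(self):
        return sorted(position for _, position, _ in self._heap)

cache = TopWCache(w=2)
for position, score in enumerate([0.1, 0.9, 0.3, 0.7, 0.2]):
    cache.offer(score, position, payload=f"obs-{position}")
kept = cache.positions()
```

The preservation question raised above maps onto the score: a rare-regime observation with a small score is evicted no matter how much it matters downstream, so the benchmark has to test the proxy and not only the cache size.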

MiniMax Sparse Attention adds a content-dependent sparse-attention branch rather than a compression branch: the layer keeps exact softmax attention on selected key-value blocks and hides the rest. For time-series scaling this is a useful contrast with LCLM/CAT/TurboQuant. It saves attention compute without compressing selected tokens, but it makes selection recall the preservation bottleneck for rare regimes, delayed events, exogenous variables, and action history.

Compress & Attend Transformers add a chunk-compression architecture with a test-time chunk-size knob. The useful analogy is not that text chunks are temporal patches, but that retained history resolution can be budgeted. Older context is represented by compressed chunk state while the current chunk remains high resolution. For time-series systems, that suggests a possible tradeoff among raw recent observations, compressed older state, and memory/latency, with the caveat that fixed-size chunk representations can erase retrieval-critical detail.

The Universal Weight Subspace Hypothesis is upstream non-time-series evidence that adapter or checkpoint families may share low-rank weight-update structure. Use it as an adaptation and compression diagnostic, not as evidence that rare regimes, dense numeric detail, or action effects survive compression.

Reinforcement Learning Finetunes Small Subnetworks is an upstream LLM counterpoint: an update can be sparse and still nearly full-rank. TSFM adaptation claims should therefore report sparsity and rank separately before assuming LoRA-style low-rank structure is the right compression interface.
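A small numeric check shows why the two quantities have to be reported separately. The sketch below measures density (fraction of nonzero entries) and rank for two toy weight updates; the helper names are hypothetical and the matrices are constructed examples, not data from the cited work.

```python
def matrix_rank(m, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    rows = [row[:] for row in m]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            f = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= f * rows[rank][c]
        rank += 1
    return rank

def density(m, tol=1e-9):
    """Fraction of entries whose magnitude exceeds tol."""
    flat = [v for row in m for v in row]
    return sum(abs(v) > tol for v in flat) / len(flat)

n = 8
# Sparse but full-rank: only the diagonal is updated.
diagonal_update = [[0.5 if i == j else 0.0 for j in range(n)] for i in range(n)]
# Dense but rank-1: an outer product, the shape a LoRA-style adapter assumes.
outer_update = [[1.0 for _ in range(n)] for _ in range(n)]

report = {
    "diagonal": (density(diagonal_update), matrix_rank(diagonal_update)),
    "outer": (density(outer_update), matrix_rank(outer_update)),
}
```

The diagonal update touches one entry in eight yet has full rank, so a low-rank adapter cannot represent it; the outer-product update is the reverse case.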

Exploration: Fine-Tuning With Parameter Decomposition adds a decomposed-component route: an update can be tiny only after an expensive weight-space decomposition has exposed candidate causal subcomponents. A TSFM analogue would need matched decomposition-plus-edit cost and rare-regime retention tests.

U-Cast adds a channel-dimension scaling case: full channel attention can become impractical when a multivariate time series has thousands of channels, so the model uses hierarchical latent queries and upsampling to reduce channel compute while retaining cross-channel structure.

Learning from Leading Indicators adds a lighter-weight multivariate branch: instead of dense all-channel mixing, estimate local lead-lag relationships and let lagging channels use leading indicators. This is an efficiency idea as much as an accuracy idea because it asks for sparse, asymmetric, time-varying channel dependence rather than full cross-channel attention.
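A lead-lag estimate for one channel pair can be sketched as a lagged-correlation search. The version below scores a single window with Pearson correlation and is an assumed simplification: the source describes local, time-varying relationships, which would require repeating this per window, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def best_lag(leader, follower, max_lag):
    """Lag in [0, max_lag] maximizing correlation of leader[t - lag] with follower[t].

    Both series must have the same length; only t >= max_lag is scored so that
    every candidate lag is compared on the same follower window.
    """
    n = len(follower)
    scores = {}
    for lag in range(max_lag + 1):
        scores[lag] = pearson(leader[max_lag - lag : n - lag], follower[max_lag:])
    return max(scores, key=scores.get), scores

# Synthetic pair: the follower repeats the leader three steps later.
leader = [float((4 * t) % 11 - 5) for t in range(60)]
follower = [0.0] * 3 + leader[:-3]
lag, scores = best_lag(leader, follower, max_lag=5)
```

Keeping only the best-scoring leaders per lagging channel gives a sparse, asymmetric dependence structure, in contrast to the cost of attending over all channel pairs.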

Hierarchical Modeling with a Fixed FLOPs Budget

H-Net, ConceptMoE, ReinPatch, U-Cast, Oryx, Hybrid Associative Memories, and HOLA point toward a stronger scaling hypothesis: do not choose temporal patch sizes, concept counts, channel bottlenecks, per-layer compression ratios, mixer modes, or explicit-memory budgets by hand. Train a router under a global compute budget and let the model allocate FLOPs to tokens, channels, time spans, mixer choices, or modality fragments that preserve downstream state. Scaling Laws, Carefully makes the evidence requirement stricter: the router should move an IsoFLOP frontier across several budgets and effective-data regimes, not merely beat a baseline at one selected compute point.

This is different from sparse attention. Sparse attention usually reduces communication cost while keeping the same representation resolution. Budgeted hierarchy changes the number or granularity of active representations, then may decode or upsample back to the fine level for dense outputs, forecasting, anomaly localization, or action-conditioned world-model state updates.

The local research note is Hierarchical Modeling with a Fixed FLOPs Budget. Rank diagnostics from FlowRanks may help inspect redundancy, but they are not the central mechanism.

Variable-Width Transformers adds a static hidden-width version of this hypothesis. It does not learn token, channel, or event compression, but it demonstrates that a bowtie layer-width schedule can beat a uniform Transformer under matched parameters while reducing fitted loss-matched FLOPs and average layer width. For TSFM work, that makes static variable width a baseline that learned fixed-FLOPs routers should beat, not a replacement for data-dependent temporal or channel allocation.

Two recent LLM-agent and LLM-training sources strengthen the compression intuition while staying outside direct TSFM evidence. Scaling Test-Time Compute for Agentic Coding shows that raw long-horizon traces can be worse than structured summaries for selection and reuse. Learning is Forgetting frames training itself as lossy compression toward objective-relevant information. For time series, the transfer claim should stay conditional: compression is useful only if the learned state preserves rare regimes, dense numeric detail, topology, actions, and delayed effects.

EBT adds a separate dynamic-compute branch. It does not compress tokens or route experts; it spends extra computation by optimizing candidate predictions under a learned energy and by selecting low-energy candidates. For time-series systems, this is a candidate mechanism for high-uncertainty windows, rare regimes, or intervention rollouts, but the current evidence is text/video/image rather than numeric time series.

DiffusionBlocks adds a training-memory branch rather than a token-compression branch. It partitions residual networks into independently trainable denoising blocks, so only one block needs gradients, optimizer state, and activations at a time. For TSFMs, this is upstream evidence for memory-bounded training and possible private/on-prem adaptation splits, but it is not yet numeric time-series evidence and does not solve privacy by itself.

DMax adds an upstream decoding-efficiency branch for diffusion language models. Its useful transfer is the combination of on-policy self-correction and soft intermediate states: if a model generates a horizon, scenario, or candidate rollout in parallel, tentative positions should remain revisable until confidence/convergence is high enough to commit. For TSFMs this remains only an analogy until tested on numeric trajectories, event streams, exogenous variables, and action-conditioned rollouts under matched wall-clock budgets.

iLLaDA adds a scaling-recipe signal for the same diffusion-language branch. It reports an 8B masked diffusion language model trained from scratch on 12T tokens with GQA, tied embeddings, 8192-token context, variable-length generation, and confidence-based benchmark scoring. For TSFM scaling this is upstream evidence that non-autoregressive denoising sequence models deserve continued tracking, but the paper does not establish compute-optimal superiority over autoregressive LMs, deployed serving wins, or direct transfer to dense numeric streams.

The Flexibility Trap adds a phase-specific efficiency trade-off. Exact left-to-right GRPO requires separate next-position evaluations in a bidirectional dLLM, while JustGRPO-Fast evaluates policy ratios only at the top 25% highest-entropy positions; inference can still return to parallel sampling. A TSFM analogue should report the full training-plus-serving frontier rather than crediting fewer inference steps while ignoring rollout-generation and credit-assignment cost.

The Thinking Pixel adds a visual continuous-latent version of the same budget question: extra sparse latent steps can improve alignment in multimodal diffusion, but the reported gains need wall-clock, memory, and halting comparisons before they count as efficient scaling. The TSFM analogue would route additional latent updates to hard spans, channels, regimes, or candidate futures rather than applying uniform compute to every patch.

The Illusion of Superposition adds the negative latent-compute test: if extra hidden steps are useful, removing them should hurt, and probes should show maintained uncertainty or intermediate state rather than early commitment. For TSFMs, this means matched-cost dynamic-compute claims need no-loop/no-latent ablations and state probes, not only better aggregate loss.

Latent Thought Flow adds the positive latent-trajectory counterpart to that warning. It explicitly trains a reward-proportional sampler over variable-length hidden trajectories, so a time-series analogue would be to sample compact candidate latent-state updates or candidate futures under a quality/cost reward. That remains a transfer hypothesis until tested on numeric trajectories with preservation probes, reward-noise controls, and wall-clock serving budgets.

Looped World Models adds the world-model-specific version of dynamic latent compute: spend recurrent depth inside an action-conditioned transition, stop early when the latent update is easy, and defer decoding during multi-step rollouts. For TSFM scaling this is a stronger analogy than language-only latent reasoning, but it still needs public artifacts, numeric/action-conditioned benchmarks, and realized latency accounting.
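The "stop early when the latent update is easy" idea can be sketched as a loop with a residual-based halting rule. The max-norm residual criterion, the tolerance, and the name `looped_update` are assumptions for illustration; the source does not specify the halting rule, and a learned halting head would replace the residual test.

```python
def looped_update(step, h, max_loops, tol):
    """Apply `step` repeatedly until the latent state stops moving.

    Returns the final state and the number of loops actually spent, so the
    realized compute per transition can be logged alongside accuracy.
    """
    for n in range(1, max_loops + 1):
        h_next = step(h)
        residual = max(abs(a - b) for a, b in zip(h_next, h))
        h = h_next
        if residual < tol:
            return h, n
    return h, max_loops

# Toy contraction with fixed point 2.0, standing in for a transition block.
def toy_step(h):
    return [0.5 * v + 1.0 for v in h]

state, loops_used = looped_update(toy_step, [0.0], max_loops=20, tol=0.01)
capped_state, capped_loops = looped_update(toy_step, [0.0], max_loops=3, tol=0.01)
```

Reporting the distribution of `loops_used` across rollout steps, together with a no-loop ablation, is the kind of realized-latency accounting the paragraph above asks for.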

Upstream Recurrent Architecture Background

Mamba, Mamba-2, and Mamba-3 are not time-series foundation-model papers, but they are important architecture background for compact latent-state sequence mixers. The progression moves from selective SSMs and fused scans, to semiseparable-matrix SSD algorithms, to richer discretization, complex state transitions, and MIMO updates. Gated DeltaNet and Gated DeltaNet-2 add the adjacent linear-attention branch: instead of changing SSM transition dynamics, they change the fast-weight memory edit, first through scalar gating plus delta updates and then through decoupled key-side erase and value-side write. For scaling and efficiency, that makes memory interference a first-class cost-model dimension alongside state size, context length, and kernel availability. Oryx then asks whether attention and recurrent mixers can share representations while switching mode across spans, which turns mixer selection itself into a cost-allocation variable.

Comparing Transformers and Hybrid Models at the Token Level adds a capability-aware evaluation pattern for hybrid architectures. Its filtered token losses separate state-oriented non-copy targets from copy-only retrieval targets, showing that aggregate loss can hide which component of a Transformer-RNN hybrid is doing the useful work. The time-series analogue is to report filtered validation slices for rare regimes, cross-channel state, event-conditioned transitions, repeated normal spans, exact recent-value recall, and known structural constraints under the same serving budget.

ParaRNN changes the architecture question by showing that nonlinear GRU/LSTM-style recurrence can be trained in parallel at billion-parameter language-model scale when the hidden trajectory is solved with Newton iterations and parallel reduction. Pretraining Recurrent Networks without Recurrence adds a different training-side route: use a Transformer encoder-decoder to learn predictive memory states, train the RNN to imitate one-step memory transitions, and then use DMT to reduce rollout drift. For time-series models, these suggest two separable hypotheses: nonlinear latent-state dynamics may be worth keeping, and predictive-state pretraining may be a useful initializer. Both remain transfer hypotheses until tested on numeric time-series or trajectory benchmarks.

Dragon Hatchling adds a fast-state route outside the standard SSM/RNN split: BDH-GPU is framed as an attention-based state-space sequence model with a large $n \times d$ recurrent state, sparse positive activations, and GPU-friendly low-rank implementation. Its relevance to time-series scaling is architectural: mutable state might scale context without quadratic attention, but current evidence is language/translation and does not test numeric channels, irregular event streams, or actions.

Language Models Need Sleep adds a serving-budget twist for SSM-attention hybrids: compact fast-weight memory may need extra consolidation compute before old context rolls out of the attention cache. For time-series models, this is an adjacent design pattern for infinite-context always-on streams: spend compute at window boundaries to update latent state, then keep normal prediction cheap. It remains language/synthetic evidence until tested on numeric observations, event streams, and action histories.

Looped Transformers And Test-Time Memory now tracks the Universal Transformer successor branch and the explicit memory-token branch. Universal Transformers anchor shared recurrent depth and adaptive halting; Huginn, Latent Thoughts, The Illusion of Superposition, Latent Thought Flow, LT2, LoopFormer, FPRM, Parcae, Sparse Looped LMs, ELT, The Thinking Pixel, and DiffusionBlocks update that line with modern looped-depth, linear/sparse mixer efficiency, fixed-point halting, stability, sparse-capacity, GFlowNet-trained latent trajectory sampling, early-exit, loop-boundary supervision, recursive visual-latent routing, and BPTT-avoidance evidence. FPRM is the strongest current warning that recursive hierarchy should not get credit until signal propagation and stopping-rule baselines are controlled: pre-norm plus residual scaling and a fixed-point residual can remove the need for HRM/TRM-style fast/slow loops on the reported symbolic reasoning tasks. LT2 is especially relevant to serving cost because it shows the repeated block’s attention choice controls whether loop count multiplies full KV-cache cost or stays closer to compact recurrent/sparse-state inference. GRAM adds a probabilistic recursive-reasoning branch where test-time scaling is not only more depth but also more parallel latent trajectories selected by a reward/value head. RMT, ARMT, and RATE add the segment-memory branch; Titans, ATLAS, MIRAS, and MesaNet add optimized test-time memory; Language Models Need Sleep adds recurrent consolidation before cache eviction; and TurboQuant adds serving-state quantization with a concrete warning that memory compression can lose to hardware-native FP8 after dequantization and kernel overhead are counted.

Flow Reasoning Models adds a different compute-control signal: after self-conditioned flow refinement, re-noise and re-solve a completed candidate, then restart when the return-to-candidate score is unstable. This is strong neural-function-evaluation evidence on Sudoku, but it should not be read as an 8× wall-clock result or as calibrated time-series uncertainty. Proposal coverage, verifier cost, batching, and the risk of stable wrong or averaged trajectories all belong in the serving budget.

Probabilistic Tiny Recursive Model adds parallel stochastic width without retraining the base TRM. Its useful efficiency lesson is not only that $K = 100$ rollouts can be batched, but that pass@$K$ proposal coverage and best-Q@$K$ selected accuracy must be reported separately. On Maze-Hard, noise finds many additional correct candidates while the inherited Q head leaves most of that coverage unused. A TSFM analogue therefore needs an adaptive rollout budget, selector calibration, and full wall-clock/memory accounting rather than a fixed large $K$ and one aggregate score.
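The distinction between coverage and selected accuracy is easy to make concrete. In the sketch below, each problem carries a list of per-candidate correctness flags and a list of selector scores; the function names and the toy data are hypothetical and are not results from the paper.

```python
def pass_at_k(correct):
    """Proposal coverage: 1 if any of the K candidates is correct."""
    return int(any(correct))

def best_q_at_k(correct, q_scores):
    """Selected accuracy: 1 if the candidate with the highest selector score is correct."""
    best = max(range(len(q_scores)), key=lambda i: q_scores[i])
    return int(correct[best])

def evaluate(problems):
    """Return (coverage, selected accuracy) averaged over problems."""
    n = len(problems)
    coverage = sum(pass_at_k(c) for c, _ in problems) / n
    selected = sum(best_q_at_k(c, q) for c, q in problems) / n
    return coverage, selected

# Toy data: (correctness flags, selector scores) per problem.
problems = [
    ([False, True, False], [0.9, 0.2, 0.1]),  # covered, but the selector picks a wrong one
    ([True, False], [0.8, 0.1]),              # covered and selected
    ([False, False], [0.5, 0.4]),             # not covered
]
coverage, selected = evaluate(problems)
```

The gap between the two numbers is the coverage the selector leaves unused; raising $K$ only helps the second number if the selector is calibrated enough to find the added correct candidates.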

MoDA adds a depth-communication branch. It is not a recurrent or looped model: it spends extra attention work to retrieve from prior layer key/value memories inside a single forward pass. The time-series-relevant lesson is that depth scaling should be evaluated as both compute and communication. A deeper or looped model may underuse earlier states if the only path is residual accumulation; a depth-retrieval model may preserve useful intermediate state but pay in cache size and memory bandwidth.

mHC and Hyperloop Transformers add a residual-stream-width branch. mHC stabilizes multi-stream residual connections with constrained mixing; Hyperloop uses loop-level hyper-connections so a parameter-shared middle block can carry richer residual state across recurrent passes. For time-series efficiency, this is promising only if the memory-access and kernel requirements are counted alongside parameter savings.

ELT adds a visual-generation version of the same budget question. It trains loop-boundary exits with Intra-Loop Self Distillation, so a single model can vary loop count at inference time. For TSFMs, this is an analogy for adaptive compute over hard windows or candidate futures, but it still needs direct numeric time-series evidence and calibrated stopping rules.

WavSpA is another upstream long-sequence background source. It performs attention in wavelet coefficient space, preserving position/frequency structure with linear-time transforms. It is not forecasting evidence yet, but it is a plausible candidate for long, nonstationary numeric sequences where Fourier-only global mixing can be too blunt.

The local JEPA-curriculum discussion adds a practical caution for recurrent Transformer depth. Looping a block can reduce memory and support early exit, and representation-convergence speed may act as an uncertainty or failure signal. But recurrent depth should be scored as dynamic compute under a constraint, not as automatic research progress: without a memory, latency, or fixed-FLOPs budget, a wider or deeper non-recurrent baseline with unique weights may be the cleaner comparison. Titans Revisited and the sparse-looped-model results are useful reminders that memory and looping claims can reverse when chunking, routing, or baseline strength changes.

Architecture Tradeoffs To Track

Dense decoder-only Transformers scale naturally but can be costly at long context and long horizon.
Sparse MoE models increase total capacity while keeping activated compute lower, but memory, routing stability, and serving complexity remain.
Mixer, convolution, xLSTM, RWKV-style, and linear-RNN hybrids can be much smaller, but may need carefully matched training and inference recipes.
Selective SSMs and structured linear recurrent models can offer compact-state inference and parallel training, but their hidden-state update remains constrained by linear or semiseparable structure.
Decoupled erase/write linear attention can make recurrent memory editing more precise, but it must prove dense numeric preservation under matched state size and report constant gate/kernel overhead.
Sequence-axis mixer routing and selective KV-cache growth can allocate exact attention only to some spans, but they must report routing overhead, dual-state maintenance cost, cache layout, and missed-span preservation failures.
ParaRNN-style nonlinear recurrent models may recover richer latent-state dynamics while retaining parallel training, but only when the Newton solver converges quickly and the hidden-state Jacobian structure remains cheap. SMT/DMT-style predictive-memory pretraining removes BPTT during pretraining, but must prove that one-step memory labels do not drift, that post-training remains cheap, and that the Transformer teacher does not cap tasks requiring deep nonlinear recurrence.
Recurrent Transformer depth can trade unique weights for loop compute and early-exit signals. FPRM-style fixed-point residuals, FRM-style perturb-and-resolve stability, and ELT-style loop-boundary supervision may make stopping, verification, and intermediate exits more useful, but they should be compared at matched memory, latency, expected FLOPs, and calibration.
Looped world-model transitions can spend variable recurrent depth per action step and defer decoding, but should report rollout latency, decoder-call savings, hidden-state drift, no-loop ablations, and simulator-transfer robustness before being counted as efficient world simulation.
Latent thinking should prove that hidden steps are causally used through no-latent/no-loop ablations and probes for maintained uncertainty; continuous states can still collapse to shortcuts.
Content-dependent sparse attention can make long context cheaper without compressing selected tokens, but MSA-style block selection must report selection recall and preservation probes for rare, low-salience, off-window details before being counted as TSFM state retention.
Linear/sparse looped mixers can make recurrent depth practical at longer contexts, but their advantage must include kernel availability, KV-cache or recurrent-state memory, batch-size OOM frontiers, and whether dense numeric state survives the cheaper mixer.
Depth-KV retrieval can make previous layer state directly accessible, but the cache and memory-bandwidth budget must be counted before treating it as efficient depth scaling.
Matrix-valued residual streams can add state capacity to looped or deep models, but they move part of the cost into memory access, recomputation, kernel fusion, and communication schedules.
Static variable-width Transformers can lower average layer width and KV-cache cost under matched parameters, but they make heterogeneous-shape kernels, tensor parallelism, and realized latency part of the evidence contract.
Segment-level recurrent memory can reduce effective context cost, but it moves the burden to memory capacity, overwrite behavior, sequential segment processing, and BPTT stability.
Sleep-time consolidation can move recurrent compute to window boundaries before cache eviction, but it makes consolidation scheduling, training stability, and matched wall-clock latency part of the efficiency claim.
Soft-token context compression can reduce decoder-side sequence length before prefill, but it must prove that compressed latents preserve dense numeric detail, rare events, event timing, exogenous variables, and action history.
Chunk-compressed decoding can expose a test-time memory/quality knob, but fixed chunks and fixed-size chunk state may erase local details unless boundaries, chunk size, and expansion policies are learned or carefully validated.
Test-time memory can extend retained context without quadratic attention, but memory capacity, update objective, update cost, and cross-variate retention become part of the serving contract.
Core fast-state architectures can make mutable memory part of the forward state itself, but the state size, update bandwidth, and rare-regime retention need matched serving tests.
Continuous basis decoders can expose flexible sampling rates and horizons, but they make the coefficient-to-observation interface part of the modeling contract.
Latent diffusion or flow decoders can expose text-conditioned generation, but the VAE bottleneck and guidance schedule become part of the numerical fidelity contract.
Energy-based prediction can expose per-candidate compatibility and variable inference effort, but second-order training, candidate-optimization cost, and many-mode energy landscapes become part of the serving contract.
Block-wise denoising training can reduce training memory and make per-block updates parallel or local, but pretrained conversion, cross-block coordination, privacy leakage through gradients, and task-dependent block count become part of the contract.
Adaptive tokenization can reduce wasted tokens, but it complicates position encoding, batching, and multivariate alignment. ReinPatch adds the question of whether a learned policy should be optimized end to end on each dataset or pretrained once as a reusable patcher.
Compression-aware scaling should declare its unit. Compute Optimal Tokenization can use bytes for text, but TSFMs need an equivalent unit that respects sample rate, channel count, missingness, event density, and intervention effects.
Overparametrization claims should separate raw capacity from selected-solution complexity. A larger model earns a generalization claim only when its effective dimensionality or compressibility improves without losing rare regimes, dense numeric detail, context, exogenous variables, or action effects.
SNR-aware scaling should declare its perturbation model. LLMs as Noisy Channels is only language evidence, but the TSFM analogue should test whether more observations, bigger models, lower precision, or heavier post-training amplify noise beyond useful signal.
Capability-aware scaling should declare its probe suite. Implicit Curriculum Hypothesis is only language evidence, but the TSFM analogue should test whether loss improvements correspond to emergence of local numeric fidelity, rare-regime sensitivity, context use, channel coupling, event-stream parsing, and action-conditioned rollout.
Hybrid-architecture scaling should separate state-conditioned targets from copy-like or constraint-closure targets before crediting recurrence or attention for aggregate gains.
Latent-recursion claims should report realized wall-clock latency, memory bandwidth, and stopping rules, not only nominal extra steps or sparse parameter counts.
Fixed-FLOPs hierarchy could turn adaptive tokenization into a budgeted compute-allocation problem, but it adds router stability, hard-routing, and preservation-probe requirements.
Point-wise numeric value embeddings preserve temporal resolution, but they may increase token count relative to patching and need careful treatment of exogenous variables and control inputs.
Channel-independent univariate modeling improves corpus unification and serving simplicity, but it can miss native multivariate dynamics.
Description-conditioned channel-time attention can model native multivariate semantics, but may need channel sparsification, grouping, retrieval, or hierarchy for high-channel settings.
Hierarchical channel compression can reduce cost in high-dimensional multivariate forecasting, but it must preserve channel-specific deviations rather than only global shared trends.
Direct multi-patch, contiguous-mask, or one-pass horizon prediction reduces sequential decoding cost, but may trade off long-horizon uncertainty propagation.
Aggressive parallel diffusion decoding can reduce serial generation steps, but DMax-style self-correction, soft states, and convergence criteria must be validated on numeric horizons rather than assumed from language TPF gains.
Training-time sequential commitment can improve proposal coverage without forcing sequential inference, but JustGRPO-style claims require matched training forward passes, wall-clock serving, memory, and retention measurements.
Stability-gated flow refinement can allocate extra test-time compute to rejected candidates, but FRM-style neural-function-evaluation gains must survive wall-clock serving tests and stochastic/multi-modal trajectory calibration.
Stochastic recursive width can turn a pretrained deterministic model into a parallel candidate generator, but PTRM-style pass@$K$ gains count only when a calibrated selector converts them into useful accuracy without erasing rare valid futures or exhausting the total compute budget.
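
The last tradeoff in the list separates candidate coverage from selected accuracy. A minimal sketch of that gap follows, assuming each candidate future can be judged acceptable or not (for example, within a tolerance of the realized trajectory); that binary criterion is an assumption, and real forecasting metrics are usually continuous. `pass_at_k` is the standard combinatorial pass@k estimator; `oracle_coverage` and `selector_accuracy` are hypothetical helper names.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k drawn candidates is acceptable),
    given that c of n sampled candidates were acceptable."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain an acceptable candidate
    return 1.0 - comb(n - c, k) / comb(n, k)

def oracle_coverage(windows):
    """Fraction of windows with at least one acceptable candidate."""
    return sum(any(ok for _, ok in cands) for cands in windows) / len(windows)

def selector_accuracy(windows):
    """Fraction of windows whose top-scored candidate is acceptable."""
    hits = 0
    for cands in windows:
        _, ok = max(cands, key=lambda sc: sc[0])
        hits += bool(ok)
    return hits / len(windows)

# Toy data: each window holds (selector_score, is_acceptable) pairs.
windows = [
    [(0.9, False), (0.4, True), (0.1, False), (0.2, False)],  # selector misses
    [(0.8, True), (0.3, False), (0.2, False), (0.1, False)],  # selector hits
]
coverage = oracle_coverage(windows)
accuracy = selector_accuracy(windows)
gap = coverage - accuracy  # coverage that the selector fails to convert
```

A width-$K$ claim would report both numbers alongside the total compute spent on the $K$ candidates, since coverage alone is an upper bound that only a perfect selector reaches.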

Evidence

The evidence is no longer “no scaling laws.” The two 2024 scaling-law papers make a strong positive case that TSFMs can scale in LLM-like ways, while also showing that temporal horizon is a domain-specific scaling variable. Toto 2.0 reports monotonic parameter scaling in its article, and the TSALM talk gives a recipe-level account of how Datadog made the sweep transferable across sizes; Time-MoE and Moirai-MoE report sparse-routing gains; TSMixer, TTM, Reverso, Kairos, ReinPatch, RWKV-TS, TiRex, FlowState, Moirai 2.0, and TabPFN-3 argue that architecture, learned tokenization, recurrent state, and inference design can beat raw parameter count in specific benchmark regimes. Compute Optimal Tokenization adds an upstream language-model warning that compression changes the scaling unit itself; LLMs as Noisy Channels adds an upstream warning that monotonic scaling can become U-shaped when noise, quantization, or post-training perturbations dominate signal; LCLM and CAT add upstream evidence that learned context compression can improve long-context language serving tradeoffs; and TurboQuant adds an upstream serving warning that memory compression should preserve scoring geometry and beat hardware-native baselines under real latency and throughput measurements, not only average reconstruction. 
Mamba, Mamba-2, Mamba-3, Gated DeltaNet-2, MiniMax Sparse Attention, ParaRNN, MoDA, WavSpA, EBT, DiffusionBlocks, DMax, FRM, Thinking Pixel, ELF, and stable-worldmodel add upstream sequence-model, sparse-attention, training-system, decoding-system, dynamic-latent-compute, and data-system evidence that compact state, alternative attention, selective memory editing, nonlinear recurrent solving, depth-state retrieval, explicit energy minimization, independent block-wise training, self-revising parallel decoding, continuous language-embedding flows, and high-throughput trajectory storage remain active research directions, but their language, long-sequence, text-video, and robotics results should not be treated as numeric forecasting results. U-Cast and LIFT add that the channel dimension can be the scaling bottleneck even before parameter count dominates. CHARM adds that channel descriptions can improve native multivariate representations but can make channel-pair attention the cost center. EIDOS adds that point-wise scalar tokenization and latent prediction can improve representation geometry. For action-conditioned world models, scaling may also need to be fitted per data stream, not only per parameter or token budget: transition-model error and reward-model error can have different sample exponents and unit costs. On Training in Imagination is adjacent theory evidence rather than TSFM evidence. Cross-paper comparisons should be routed through Time-Series Benchmark Hygiene before treating any rank as settled.
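
The per-data-stream point can be illustrated with a small fit. The sketch below assumes each stream's error follows a pure power law, error ≈ a · n^(−b), with no irreducible floor, and uses synthetic numbers. The exponents, prefactors, and unit costs are made up for illustration and are not results from any cited source. `fit_power_law` and `samples_needed` are hypothetical helper names.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) in log-log space.

    Returns (a, b). Assumes no irreducible-error term, which real
    scaling fits usually need to model explicitly.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def samples_needed(a, b, target_error):
    """Invert error = a * n**(-b) for n."""
    return (a / target_error) ** (1.0 / b)

sizes = [1e3, 1e4, 1e5, 1e6]
# Synthetic streams with different exponents (illustrative values only).
transition_err = [2.0 * s ** -0.5 for s in sizes]
reward_err = [1.0 * s ** -0.25 for s in sizes]

a_t, b_t = fit_power_law(sizes, transition_err)
a_r, b_r = fit_power_law(sizes, reward_err)

target = 0.05
unit_cost = {"transition": 1.0, "reward": 0.1}  # assumed cost per sample
n_t = samples_needed(a_t, b_t, target)
n_r = samples_needed(a_r, b_r, target)
cost_t = n_t * unit_cost["transition"]
cost_r = n_r * unit_cost["reward"]
```

In this toy setup the stream with the cheaper samples still costs more to reach the same target error, because its exponent is smaller. That is the reason a single pooled token budget can hide where the next unit of data spend should go.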

Relation To Foundation TSFM Agenda

This page maps to the Foundation Time-Series Model Research Agenda’s scaling, streaming-state, adaptive tokenization, native multivariate, and dynamic-compute slots. Parameter scaling and horizon scaling partially close the case that TSFMs can improve predictably, while compact state, adaptive patching, MoE, hierarchy, and rank compression are adjacent mechanisms. The agenda-relevant test is whether efficiency mechanisms preserve rare regimes, dense numeric detail, context, channel interactions, and action effects at serving time.

Open Questions

Can effective dimensionality, description length, or PAC-Bayes-style diagnostics predict which TSFM scale will generalize best beyond one validation distribution?
Where does parameter scaling saturate for forecasting once benchmark leakage and fine-tuned or ensemble entries are separated?
Which compact architectures keep their advantage when native multivariate coupling and known future exogenous variables are required?
Can nonlinear recurrent-state training from ParaRNN, SMT/DMT-style predictive-memory pretraining, or memory-token recurrence from RMT/RATE transfer from token sequences to numeric time series, trajectories, or action-conditioned world models?
Can BDH-style sparse positive recurrent state transfer from language concepts to numeric regimes, channel relationships, and event-driven updates?
Should adaptive time-series tokenization be learned through forecasting loss, entropy proxies, periodicity heuristics, or reusable pretrained patchers?
Which channel-compression mechanisms scale to tens of thousands of channels without erasing local deviations?
Can sparse expert routing specialize by regime, horizon, frequency, covariate structure, or incident phase in an interpretable way?
Can Gated DeltaNet-2-style erase/write separation improve long-context multivariate state without paying more than an equivalent larger recurrent state or full-attention window?
Can Oryx-style mode routing, HAM-style selective cache growth, or HOLA-style fixed top-$w$ retention shift a TSFM IsoFLOP frontier, or are they only stronger at one hand-picked cache/mixer budget?
Which capability-filtered validation slices should be standard for Transformer-RNN time-series hybrids: rare-regime readout, cross-channel binding, event-conditioned transitions, exact recent recall, repeated normal spans, and structural constraints?
Are rank-aware designs better built into the model from the start, or applied as compression after pretraining?
Can a global FLOPs constraint replace hand-picked compression ratios across temporal, channel, and modality hierarchies?
What is the TSFM equivalent of bytes per parameter, and does it change across dense signals, sparse events, high-channel telemetry, and action logs?
What is the TSFM equivalent of signal-to-noise ratio, and which perturbations produce U-shaped scaling curves instead of monotonic gains?
Which latent-state compression objective best preserves operationally rare but action-relevant time-series state: MSE, inner products, downstream prediction loss, or control value?
Which compression mechanisms still win after hardware-native numeric formats, kernel availability, dequantization overhead, latency, throughput, and memory-pressure regimes are counted?
When should a TSFM spend extra optimization steps on high-energy predictions instead of using a fixed-depth forward pass for every window?
What is the TSFM analogue of DMax-style tokens per forward: generated channel-time cells per update, scenario horizon per denoising step, calibrated futures per wall-clock second, or downstream utility per serving dollar?
Can perturb-and-resolve stability become a useful compute-allocation signal for TSFMs without rejecting rare valid transitions or rewarding stable averaged futures?
Can stochastic latent rollout width be allocated adaptively from marginal coverage gain, selector uncertainty, or candidate diversity rather than assigning every time-series window the same $K$?
Can depth-KV retrieval or matrix-valued residual streams preserve useful intermediate temporal state under bounded cache, memory-bandwidth, and kernel budgets, or does recurrent/segment memory dominate for always-on streams?
Can block-wise denoising training support pretrained TSFM conversion, tenant-local adaptation, or privacy-bounded update protocols without losing rare regimes or cross-block coordination?
How should TSFM evaluations disentangle objective design, corpus scale and cleaning, and backbone or inference-engineering gains?
