Time-Series Foundation Models

Summary

The time-series cluster now covers forecasting, classification, representation learning, reasoning, generation, observability, scaling, and compression. These models should not be treated as interchangeable: temporal order, covariates, exogenous variables, event streams, and multivariate structure create task boundaries that are not present in static tabular data or ordinary text modeling.

Foundation Time-Series Model Research Agenda is the central frame for this page and for the broader time-series foundation-model wiki. Use it to classify whether a source closes part of the path toward general latent-state time-series foundation models, or whether it is only a local improvement on forecasting, classification, generation, or efficiency.

What’s Wrong With The Current Time-Series Deep Learning? is the landmark position source for the narrower question of why ordinary observation forecasting is not enough. Forecasting is an important output, but time-series foundation models should also be evaluated by whether they maintain useful system state, use context, track regimes, expose plausible futures, and support the path toward action-conditioned decisions.

Awesome Agentic Time Series is a current survey/list source for the agentic edge of the field. It is useful for discovering candidate papers and for separating time-series foundation models, LLM4TS translators, temporal reasoners, closed-loop agents, memory systems, and temporal world-model claims. The page should not be used as primary evidence for individual method performance until the relevant primary paper is ingested.

Most benchmarked models in this cluster are passive dynamics models, passive forecasting models, or representation models. They predict or encode future observations from observed histories, but they do not expose actions, control inputs, interventions, or counterfactual rollout channels as first-class interfaces. That boundary matters when comparing them with action-conditioned world models.

Context is now a first-class interface question too. Context is Key shows that some forecasting tasks are under-specified by numeric history alone and should be treated as text-conditioned time series. UniTime is an early version of this language-conditioned interface, using domain instructions before time-series tokens.

The strongest new positive evidence for the whole time-series foundation model (TSFM) program is scaling-law evidence. Scaling-laws for Large Time-series Models reports LLM-like power-law behavior for decoder-only TSFMs in model size, data size, and compute. Scaling Law for Time Series Forecasting adds that look-back horizon itself must be treated as a scaling variable. Scaling Laws, Carefully adds the caution that the fitted compute-optimal frontier is sensitive to cost accounting, data regime, and fit protocol, so TSFM scale-up claims should report the data unit, effective-data assumptions, and fit sensitivity before extrapolating.
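
As a minimal sketch of what a fitted, sensitivity-checked scaling claim could look like, the following fits loss ≈ a · N^(-b) by least squares in log-log space and then refits with each point left out. The data points are synthetic, and the function names (fit_power_law, leave_one_out_exponents) are illustrative assumptions rather than anything taken from the cited papers.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def leave_one_out_exponents(sizes, losses):
    """Refit with each point dropped to see how stable the exponent is."""
    exponents = []
    for i in range(len(sizes)):
        kept_sizes = sizes[:i] + sizes[i + 1:]
        kept_losses = losses[:i] + losses[i + 1:]
        exponents.append(fit_power_law(kept_sizes, kept_losses)[1])
    return exponents

# Synthetic points generated from a known law (a=2.0, b=0.1), not real results.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * s ** -0.1 for s in sizes]
a, b = fit_power_law(sizes, losses)
loo = leave_one_out_exponents(sizes, losses)
spread = max(loo) - min(loo)
print(f"a={a:.3f} b={b:.3f} leave-one-out spread={spread:.2e}")
```

On real runs the leave-one-out spread would not be near zero, and a wide spread is a reason to report the exponent as a range rather than a point estimate.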

SensorFM adds the strongest current wearable-sensor scale-up case in this wiki: a private five-million-participant corpus, 100K-to-100M-parameter variants, and downstream health-task gains along a co-scaled data/model diagonal. Treat it as domain-specific representation-learning evidence rather than a public general TSFM benchmark, because the data, weights, and code were not verified as released.

What The Wiki Currently Believes

Forecasting Foundation Models

TimesFM, Timer, Sundial, Moirai 2.0, and Toto 2.0 treat forecasting as large sequence modeling over patches, quantiles, continuous flows, or masked forecast windows. The Toto 2.0 TSALM presentation is the local source for the spoken recipe details behind the scaling family and for Datadog’s move from passive metric forecasting toward multimodal observability world models. FlowState adds an SSM encoder and functional basis decoder path for sampling-rate-invariant continuous forecasts. Moirai, Chronos-2, Tiny Time Mixers, and Toto add stronger claims about multivariate structure, known future exogenous variables, probabilistic heads, or observability-domain deployment. TSMixer is not a pretrained TSFM, but it is the important all-MLP ancestor behind TTM’s compact mixer path. MOMENT sits between forecasting and representation learning because masked reconstruction supports forecasting, classification, anomaly detection, and imputation. UniTS pushes the unification question further by using task tokenization for forecasting, classification, imputation, and anomaly detection in one backbone. UniTime adds an early language-instruction version of cross-domain unification. Exploring Large Models for Time Series is a useful THUML overview source for the early large time-series model (LTM) landscape, especially Timer, AutoTimes, Timer-XL, OpenLTM, and the limitations of single-series interfaces.
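
To make the "sequence modeling over patches" interface concrete, here is a minimal sketch of temporal patching: a series is cut into fixed-length windows that then play the role of tokens. The function name patchify, the patch length, and the choice to drop a trailing remainder are assumptions for illustration; individual models differ in padding, overlap, and normalization.

```python
def patchify(series, patch_len, stride=None):
    """Split a 1-D series into fixed-length patches.

    With stride == patch_len the patches do not overlap. Trailing points
    that do not fill a whole patch are dropped (one simple convention).
    """
    if stride is None:
        stride = patch_len
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]

history = list(range(10))
non_overlapping = patchify(history, patch_len=4)
overlapping = patchify(history, patch_len=4, stride=2)
print(non_overlapping)
print(len(overlapping))
```

The non-overlapping call yields two patches and silently discards the last two points, which is one reason patch length and padding policy belong in any comparison of patch-based models.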

MIRA adds a domain-specific medical branch: continuous-time RoPE, frequency-specific MoE, Neural ODE extrapolation, and 454B reported medical time points for irregular clinical forecasting. It is strong evidence that domain-specific pretraining and irregular-time architecture matter, but it should not be counted as treatment-response or intervention modeling. TimeRAF adds a retrieval-augmented forecasting branch: external time-series examples become a knowledge base for zero-shot forecasting, which raises benchmark-overlap and retrieval-provenance hygiene questions.

The important split is interface, not only score. Some models are univariate-first and decompose multivariate time series into independent channels; others add covariates, group attention, decoder channel mixing, factorized variate attention, or LIFT-style leading-indicator plugins. A model that supports known future exogenous variables still is not necessarily action-conditioned: those variables condition forecasts, but they are not automatically modeled as controllable actions or interventions.
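
The channel-independent interface described above can be sketched as a wrapper that applies one univariate forecaster to every channel separately. The last-value forecaster and all names here are hypothetical stand-ins, not any model's API; the point is only that, under this interface, no channel can influence another channel's forecast.

```python
def last_value_forecast(history, horizon):
    """Hypothetical univariate stand-in: repeat the last observed value."""
    return [history[-1]] * horizon

def forecast_channel_independent(channels, horizon, model=last_value_forecast):
    """channels maps channel name -> list of observations.

    Each channel is forecast on its own history only.
    """
    return {name: model(values, horizon) for name, values in channels.items()}

baseline = forecast_channel_independent(
    {"cpu": [1.0, 2.0], "mem": [5.0, 6.0]}, horizon=3
)
# Changing 'mem' leaves the 'cpu' forecast untouched under this interface.
perturbed = forecast_channel_independent(
    {"cpu": [1.0, 2.0], "mem": [50.0, 60.0]}, horizon=3
)
print(baseline["cpu"] == perturbed["cpu"])
```

A natively multivariate or covariate-aware model would need a different signature, one where the forecast for a channel takes the other channels (or known future exogenous values) as input.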

U-Cast sharpens the native multivariate question by naming high-dimensional time series forecasting as its own regime and releasing Time-HD. Its contribution is less “another forecaster” than a warning that low-channel benchmarks can make channel-independent and channel-dependent models look more interchangeable than they are in realistic high-dimensional systems.

Textual Context And Context-Aided Forecasting

Context is Key adds a benchmark boundary that ordinary forecasting leaderboards usually miss: the correct forecast may depend on essential natural-language context. CiK’s tasks pair numeric history with text that can specify process identity, hidden historical facts, future events, value constraints, covariates, or causal relationships.

This should be treated as a separate capability from raw numerical forecasting. Numeric-only time-series foundation models are not “bad” because they perform worse on CiK; they are being tested on an intentionally incomplete interface. The relevant research question is whether a model can estimate future observations from both history and context while preserving probabilistic calibration and avoiding catastrophic context misinterpretation.
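
One way to check that probabilistic calibration is preserved is to score quantile forecasts with pinball loss and to measure how often observations fall inside a predicted interval. This is a generic sketch with made-up numbers; it is not the metric definition used by Context is Key, which should be taken from the primary paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)

def empirical_coverage(y_true, lower, upper):
    """Fraction of observations inside [lower, upper], bounds inclusive."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)

# Under-predicting a 0.9 quantile costs more than over-predicting it.
under = pinball_loss([1.0], [0.0], q=0.9)
over = pinball_loss([0.0], [1.0], q=0.9)
coverage = empirical_coverage([1.0, 2.0, 3.0, 4.0], [0.0] * 4, [2.0] * 4)
print(under, over, coverage)
```

If adding text context improves point accuracy but pushes nominal 80% intervals to cover far more or far less than 80% of outcomes, the context is being used at the expense of calibration.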

CiK also sharpens the control boundary. Some text describes future events or causal relationships, but that does not make the benchmark action-conditioned. A time-series world model still needs explicit action, control-input, or intervention semantics before it can support planning over alternatives.

Classification And Representation Learning

Time-series classification and representation models form a parallel branch covered in Time-Series Classification Foundation Models. The Mantis lineage starts with Mantis, then branches into MantisV2 for synthetic pretraining plus test-time representation strategies and UTICA for self-distillation on the Mantis backbone. LeNEPA adds a no-augmentation next-latent-token SSL recipe with temporal SIGReg, fixed-recipe PTB-XL/Diag probes, and a CauKer-to-UCR frozen-encoder check. Aionoscope is the diagnostic benchmark companion: it separates categorical component accessibility from dense process-state accessibility. CHARM is the JEPA-style channel-description branch: it trains a native multivariate embedding model, then probes frozen representations for classification, anomaly detection, and forecasting. TS2Vec contributes hierarchical contrastive timestamp-level representations; SimMTM contributes multi-neighbor masked reconstruction; UniShape, NuTime, T-Loss, and TiViT show other routes through shape-aware adapters, numerical-scale embeddings, time-based triplet learning, and frozen vision-feature transfer.

ICLR 2026 Time-Series Classification Meta-Analysis is a useful field-map caveat for this branch. It suggests that representation learning is visible in the ICLR 2026 time-series slice, but a large part of the visible representation-learning cluster comes from EEG/ECG/neuro/physiology work. SensorFM makes that physiology branch much stronger by showing large-corpus wearable embeddings that transfer across 35 health tasks, but it also reinforces the need to keep searching for representation-learning evidence in observability, telecom, industrial, robotics, energy, and other non-biomedical systems.

T-Rep adds a time-embedding variant to the representation branch. It learns timestep-level representations while learning embeddings of time itself, so trend, periodicity, distribution shifts, and missingness can be encoded rather than treated as fixed positional metadata.

This branch should not be evaluated as if it were direct forecasting. Its core question is whether an embedding captures class-discriminative temporal shape, scale, frequency, and local structure well enough for frozen or lightly trained downstream classifiers.
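
A frozen-encoder evaluation can be sketched as follows: embeddings are computed once by an encoder that is never updated, and only a very light classifier is fit on top. The hand-written frozen_encoder below is a stand-in for a pretrained model, and the nearest-centroid probe is one simple choice among many; both are assumptions for illustration, as is the toy data.

```python
import math

def frozen_encoder(series):
    """Stand-in for a pretrained encoder: fixed features, never trained here."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    # Mean absolute first difference: a crude proxy for local roughness.
    roughness = sum(abs(b - a) for a, b in zip(series, series[1:])) / (n - 1)
    return (mean, std, roughness)

def fit_centroids(embeddings, labels):
    """Light probe: one mean embedding per class."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in acc) for lab, acc in sums.items()}

def predict(centroids, emb):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], emb))

train = [
    ([0, 1, 2, 3, 4, 5], "smooth"), ([1, 2, 3, 4, 5, 6], "smooth"),
    ([0, 5, 0, 5, 0, 5], "jagged"), ([1, 6, 1, 6, 1, 6], "jagged"),
]
centroids = fit_centroids([frozen_encoder(s) for s, _ in train], [y for _, y in train])
pred_smooth = predict(centroids, frozen_encoder([2, 3, 4, 5, 6, 7]))
pred_jagged = predict(centroids, frozen_encoder([0, 4, 0, 4, 0, 4]))
print(pred_smooth, pred_jagged)
```

Because the encoder is fixed, any accuracy difference between two encoders under the same probe reflects the embeddings rather than downstream fine-tuning.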

Tabular And PFN Analogs

TabPFN-v2, TabPFN-3, and TabICL are static tabular-data models, not time-series models, but they are important analogs because they learn inference over structured context from synthetic tasks. TabPFN-3 adds a specialized TabPFN-TS-3 time-series checkpoint, while TempoPFN is the direct open time-series relative: it moves the prior-data fitted network idea into synthetic-pretrained zero-shot forecasting with a linear RNN backbone.

The portable idea is learned inference over a context, not the table interface itself. Static supervised tables do not expose temporal next-state dynamics, event streams, control inputs, or interventions by default.

Observability And Telemetry

Toto and Toto 2.0 make observability metrics a distinct time-series deployment surface. They emphasize high-cardinality, nonstationary multivariate metrics, operational benchmarks such as BOOM, and the possibility of future observability world models. The Toto 2.0 TSALM presentation adds ARFBench and Toto-1.0-QA-Experimental as a bridge from passive metric forecasting to incident-response time-series QA. ChronoGraph is a nearby graph-temporal telemetry dataset with incident labels, but without controllable operator-action channels. TelecomTS adds a multimodal telecom-observability benchmark with preserved scale, labels, anomaly/root-cause tasks, and language Q&A.

For this wiki, observability forecasting remains passive unless logs include deployments, rollbacks, autoscaling, remediation, traffic-control commands, or other operator actions as explicit action, control input, or intervention channels.
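
The rule above can be written down as a small schema check. The channel-kind vocabulary and the labels returned are hypothetical and exist only to illustrate the wiki's convention; real telemetry schemas will name these roles differently.

```python
# Hypothetical channel kinds; the action kinds mirror the examples named above.
ACTION_KINDS = {
    "deployment", "rollback", "autoscaling", "remediation", "traffic_control",
}

def classify_interface(channel_kinds):
    """channel_kinds maps channel name -> kind string.

    Returns a label plus the channels that carry explicit operator actions.
    """
    action_channels = sorted(
        name for name, kind in channel_kinds.items() if kind in ACTION_KINDS
    )
    label = "action-annotated" if action_channels else "passive"
    return label, action_channels

metrics_only = classify_interface({"cpu_util": "metric", "p99_latency": "metric"})
with_actions = classify_interface(
    {"cpu_util": "metric", "deploy_events": "deployment", "scale_out": "autoscaling"}
)
print(metrics_only)
print(with_actions)
```

Passing this check is necessary but not sufficient: a dataset that logs actions still supports only passive forecasting unless the model is trained and evaluated with those channels as conditioning inputs.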

High-Dimensional Time Series Forecasting is the broader topic that connects observability metrics to other high-channel domains such as traffic, finance, energy, weather, web traffic, and telecom-like telemetry.

Scaling, Efficiency, And Architecture Tradeoffs

Time-Series Scaling And Efficiency tracks the architecture trade space. Scaling-laws for Large Time-series Models and Scaling Law for Time Series Forecasting are the key 2024 sources showing why TSFM scale-up is a real research program. Scaling Laws, Carefully is the upstream method-hygiene companion: compute-optimal claims should be fit over allocation frontiers, not inferred from a single run, and token/sample counts need effective-data and repetition caveats. Time-MoE and Moirai-MoE scale capacity through sparse routing; Toto 2.0 frames forecasting as a parameter-scaling regime; TSMixer, Tiny Time Mixers, Reverso, Kairos, ReinPatch, RWKV-TS, TiRex, and FlowState argue that compact backbones, adaptive tokenization, recurrent state, continuous basis decoding, and inference tricks can compete with larger Transformers. CHARM adds channel-description-conditioned TCNs and channel-time attention, gaining semantic multivariate coupling at the price of O(C^2 T^2) attention cost (C channels, T time steps). Learning from Leading Indicators adds sparse local lead-lag channel dependence as a plugin path. EIDOS shifts prediction into latent space and uses point-wise SiGLU scalar tokenization, while FlowRanks studies rank structure and compression in time-series Transformers.
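
The O(C^2 T^2) figure can be made tangible with a short count of attention pairs. The sketch assumes C is the channel count, T the number of time tokens per channel, and that joint channel-time attention lets every (channel, time) token attend to every other one; whether CHARM's cost arises in exactly this way should be checked against the primary paper.

```python
def joint_attention_pairs(channels, timesteps):
    """Every (channel, time) token attends to every other: (C*T)**2 pairs."""
    tokens = channels * timesteps
    return tokens * tokens

def channel_independent_pairs(channels, timesteps):
    """Each channel attends only within itself: C * T**2 pairs."""
    return channels * timesteps * timesteps

C, T = 8, 64  # illustrative sizes, not from any benchmark
joint = joint_attention_pairs(C, T)
independent = channel_independent_pairs(C, T)
print(joint, independent, joint // independent)
```

Under these assumptions the joint variant costs a factor of C more than channel-independent attention, so the gap grows linearly with channel count and becomes the dominant term in high-dimensional settings.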

Looped Transformers And Test-Time Memory is the upstream page for elastic recurrent depth, latent reasoning, explicit test-time memory, and recursive small models. For this TSFM hub, that page is a source of candidate mechanisms, not a license to treat language or puzzle reasoning gains as forecasting evidence. Gated DeltaNet-2 is one such candidate mechanism on the compact-state side: its decoupled erase/write gates sharpen the question of how a streaming TSFM should edit latent state without overwriting rare or stale-but-relevant channel relationships.

ReinPatch should be treated as a patching-method source rather than a full foundation-model source. Its foundation-patcher experiment is still useful because it tests whether the learned tokenizer itself can transfer zero-shot across time-series forecasting datasets.

Number Tokenization tracks an adjacent interface question: whether scalar numeric observations, known future exogenous variables, control inputs, interventions, and metadata should be encoded through ordinary tokens, point-wise gated embeddings, Fourier features, bit-level number tokens, or continuous bases.
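
As one illustration of the Fourier-feature option, a scalar can be mapped to sine and cosine values at geometrically spaced frequencies so that a model sees both coarse and fine scale information. The frequency schedule (powers of two times π) and the function name are assumptions chosen for simplicity, not a scheme attributed to any source on this page.

```python
import math

def fourier_features(value, num_frequencies=4):
    """Encode a scalar as [sin, cos] pairs at frequencies pi * 2**k."""
    features = []
    for k in range(num_frequencies):
        angle = math.pi * (2.0 ** k) * value
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

zero_code = fourier_features(0.0)
near_codes = (fourier_features(0.50), fourier_features(0.51))
# Nearby scalars get nearby codes at low frequencies, diverging at high ones.
low_gap = abs(near_codes[0][0] - near_codes[1][0])
high_gap = abs(near_codes[0][6] - near_codes[1][6])
print(zero_code, low_gap < high_gap)
```

The encoding is bounded regardless of input magnitude, which is why schemes of this kind are usually paired with some form of input scaling so that the chosen frequencies cover the value range of interest.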

Reasoning, Generation, And Control Boundary

TimeOmni-1, TimeOmni-VL, and T2S are the reasoning and generation edge of the cluster. They ask for scenario understanding, causality discovery, event-aware forecasting, decision-making, time-series/image generation, and text-to-series generation. Those tasks are closer to world-model questions, but they still need careful separation from passive forecasting scores.

TimeCraft and the dedicated Time-Series Generation topic now own the broader generation taxonomy. TimeDP covers prototype-conditioned cross-domain generation, BRIDGE covers text-controlled generation, TarDiff covers target-aware EHR generation, OATS covers online TSFM augmentation, Diff-MN covers irregular continuous generation, and CaTSG covers causal/interventional/counterfactual generation. DiGA and MarS cover financial market simulation and what-if generation, but remain domain-specific rather than general action-conditioned TSFMs. These are TSFM-relevant because they stress generation, context, data quality, and causal slots, but most remain passive or condition-controlled rather than full action-conditioned world models.

Awesome Agentic Time Series expands this boundary from individual reasoning/generation models into a field map: perception agents, reasoning agents, planning/action agents, memory/knowledge agents, world-model/data agents, and reliability sources. That taxonomy is useful only if the wiki keeps the contracts distinct. A tool-using forecasting pipeline is not automatically a temporal world model; a memory-augmented agent is not automatically maintaining decision-relevant latent state; and a reasoning benchmark is not automatically a control benchmark.

T2S is especially useful as a boundary case. Like Sundial, it uses flow matching over continuous time-series representations. Unlike Sundial, it is not forecasting future observations from a numeric history; it generates synthetic time-series instances from natural-language captions through a length-adaptive VAE and text-conditioned Diffusion Transformer. This makes it a text-conditioned generation model, not a passive probabilistic forecaster.
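
For readers unfamiliar with flow matching, the following sketch shows a common linear-path formulation: a training example pairs a point on the straight line between a noise sample and a data sample with the constant velocity along that line, and sampling integrates a velocity field from noise toward data. This is a generic textbook-style form; the exact paths, conditioning, and parameterization used by T2S and Sundial are not specified here and should be taken from the primary papers.

```python
def flow_matching_pair(noise, data, t):
    """Return (x_t, target_velocity) on the straight path from noise to data."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

def euler_sample(start, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.0, 1.0]
data = [2.0, -1.0]  # toy two-point "series", not real data
x_half, target = flow_matching_pair(noise, data, t=0.5)

# With the true (constant) velocity in place of a learned network,
# integration recovers the data sample.
sampled = euler_sample(noise, lambda x, t: target)
print(x_half, target, sampled)
```

In a trained model the lambda would be replaced by a network that predicts velocity from x_t, t, and any conditioning such as a text caption or a numeric history, which is where the forecasting and generation interfaces diverge.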

ELF is the matching boundary case from the language side. It keeps language generation in continuous embedding-space flow matching until final token decoding, which makes a shared diffusion/flow substrate for text and time-series latents more plausible. The paper does not test time-series data, so the TSFM use is architectural: combine its text-side interface lesson with sources such as T2S and Sundial, then test numeric fidelity and calibration directly.

For diffusion/flow-style generation, Why Diffusion Models Don’t Memorize adds a checkpoint-selection warning: sample quality, numeric fidelity, and memorization probes may move on different training timescales.

Position: What Can LLMs Tell Us about Time Series Analysis is useful as a roadmap source for this branch. It frames time-series analysis as moving toward language-mediated reasoning, question answering, and modality switching, but it should not be treated as benchmark proof that LLMs solve numeric time-series modeling.

Time-Series Benchmark Hygiene records the evaluation caveats across datasets such as GIFT-Eval, TIME, Time-Series-Library, BOOM, and Time-HD: zero-shot, few-shot, fine-tuned, and ensemble entries should not be merged; benchmark overlap and pretraining leakage need explicit notes; and forecasting benchmarks should not be collapsed with UCR/UEA classification results.
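
The "do not merge" rule can be sketched as a leaderboard builder that ranks entries only within their evaluation protocol and carries an overlap flag alongside each score. The model names, scores, and field names below are invented placeholders, and the sketch assumes a lower-is-better error metric.

```python
from collections import defaultdict

def leaderboards_by_protocol(entries):
    """Rank entries separately per protocol (lower score is better).

    Each entry is a dict with 'model', 'protocol', 'score', and
    'pretraining_overlap' (True if benchmark data may be in pretraining).
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["protocol"]].append(entry)
    boards = {}
    for protocol, group in groups.items():
        ranked = sorted(group, key=lambda e: e["score"])
        boards[protocol] = [
            (e["model"], e["score"], "overlap?" if e["pretraining_overlap"] else "clean")
            for e in ranked
        ]
    return boards

# Placeholder entries; none of these numbers are real results.
entries = [
    {"model": "model_a", "protocol": "zero-shot", "score": 0.80, "pretraining_overlap": False},
    {"model": "model_b", "protocol": "fine-tuned", "score": 0.60, "pretraining_overlap": False},
    {"model": "model_c", "protocol": "zero-shot", "score": 0.70, "pretraining_overlap": True},
]
boards = leaderboards_by_protocol(entries)
print(boards)
```

Here model_b has the lowest score overall but never appears next to the zero-shot entries, and model_c's zero-shot lead is shown together with its overlap flag rather than as an unqualified win.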

Evidence

The expanded benchmark batch suggests that time series need their own representation assumptions. Temporal patching, point-wise numeric value embeddings, numerical scaling, textual channel descriptions, covariate handling, multivariate coupling, high-dimensional channel structure, shape-aware classification, latent dynamics, observability telemetry, dataset leakage controls, dense latent-state probes, and fixed-recipe SSL stress tests matter more explicitly than in standard language-model transfer.

The new scaling-law sources change the prior: TSFMs are no longer justified only by isolated benchmark wins. There is direct evidence for predictable scale-up in decoder-only time-series models, plus theory that explains why temporal horizon must be part of the scaling story. Scaling Laws, Carefully adds that this evidence should be reported as fitted, sensitivity-checked frontiers rather than as unqualified “bigger is better” or one-point matched-compute claims.

Relation To Foundation TSFM Agenda

This page is the broad topic hub for the Foundation Time-Series Model Research Agenda. It should route detailed slot judgments to the agenda and specialized pages rather than treating all TSFMs as one model class. Current evidence is uneven across slots: scaling, forecasting, context, representation learning, numeric tokenization, and high-dimensional forecasting have narrow partial answers, while streaming state, multi-modal future distributions, editing, explicit event streams, and action-conditioned counterfactual rollout remain open.

Streaming Latent-State Updates now owns the always-on serving contract for streaming state, retained memory, abstain/trigger behavior, and real-time update cost.

Open Questions

Which time-series tasks genuinely require reasoning rather than pattern matching?
Which models maintain useful latent state instead of only producing better forecast heads or coarse component labels?
Can one model support forecasting fidelity, shape-aware classification, causal reasoning, and natural-language interaction?
When is textual context essential to the forecast rather than helpful metadata or a post-hoc explanation?
When is text-to-series generation useful as data augmentation or simulation, and when does it need observed history to become forecasting?
Which covariate interfaces are merely exogenous-variable conditioning, and which could become action, control input, or intervention channels for world-model use?
How should synthetic pretraining be audited for leakage, unrealistic coupling, and benchmark-specific artifacts?
When does native multivariate modeling beat channel-independent univariate modeling enough to justify the extra architecture and serving cost?
When do textual channel descriptions improve transfer enough to justify curation and serving complexity?
Can a reusable learned patcher transfer across time-series domains without becoming another benchmark-specific preprocessing heuristic?
When does the channel count cross from ordinary multivariate forecasting into high-dimensional time series forecasting (HDTSF), where benchmark and architecture conclusions change?
Are larger models, sparse experts, adaptive tokenization, recurrent state, or latent prediction the most reliable route for long-horizon stability?

Alex Open Research Wiki
