Time-Series Classification Foundation Models

Summary

Time-series classification foundation models learn reusable embeddings for labeled time-series tasks rather than directly forecasting future observations. They are usually passive representation models: they encode observed time-series samples, then a downstream classifier uses the embeddings. They do not expose action, control input, intervention, or counterfactual rollout channels.

For the broader wiki agenda, this branch is important because it is one of the clearest places where time-series work optimizes representations rather than forecasts. The ICLR 2026 Time-Series Classification Meta-Analysis adds a caution: visible representation-learning work may be concentrated in EEG/ECG/neuro/physiology settings, so cross-domain coverage should be checked explicitly.

Mantis Lineage

Mantis is the base lightweight calibrated classification model. It uses a Mantis token generator over normalized values, differentials, and patch statistics. The model is then evaluated under frozen features, fine-tuning, multivariate adapters, and calibration.
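As a rough illustration of the three input views named here, the sketch below computes them for one univariate series. It is not the Mantis implementation: the helper name `patch_token_inputs`, the patch length, the zero-padded first difference, and the choice of mean and standard deviation as the patch statistics are all assumptions made for illustration.

```python
import statistics


def patch_token_inputs(series, patch_len=4, eps=1e-8):
    """Split a series into patches and return three views per patch.

    Hypothetical helper: the views (normalized values, first differences,
    patch mean/std) mirror the description in the text, not Mantis's code.
    """
    # Instance-normalize the whole series to zero mean and unit variance.
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    normalized = [(x - mu) / (sigma + eps) for x in series]

    # First differences of the normalized series (first entry padded with 0).
    diffs = [0.0] + [b - a for a, b in zip(normalized, normalized[1:])]

    tokens = []
    for start in range(0, len(series) - patch_len + 1, patch_len):
        stop = start + patch_len
        raw_patch = series[start:stop]
        tokens.append({
            "values": normalized[start:stop],
            "diffs": diffs[start:stop],
            # Statistics come from the raw patch so scale is not lost.
            "stats": (statistics.fmean(raw_patch), statistics.pstdev(raw_patch)),
        })
    return tokens


tokens = patch_token_inputs([1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0])
```

In this toy input the second patch is constant, so its differences inside the patch are zero while its raw statistics still record the level of 10. That is the kind of information a normalization-only view would discard.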

MantisV2 extends the same line with synthetic CauKer-style pretraining and test-time representation strategies. The work introduces two models: MantisPlus, the original Mantis architecture retrained on synthetic data, and MantisV2, a refined smaller encoder with stronger zero-shot frozen-feature behavior.

UTICA is the self-distillation branch of the same family. It keeps the Mantis tokenizer and backbone, but replaces contrastive pretraining with DINO/iBOT-style multi-objective self-distillation, using global/local crop alignment, masked patch prediction, and a KoLeo regularizer.
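The KoLeo regularizer is commonly defined as the negative mean log distance from each embedding in a batch to its nearest neighbor, so it penalizes embeddings that collapse onto each other. The sketch below follows that common definition. The L2 normalization step and the `eps` constant are assumptions borrowed from typical usage, not confirmed UTICA settings.

```python
import math


def l2_normalize(vec, eps=1e-8):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def koleo_loss(embeddings, eps=1e-8):
    """KoLeo regularizer: -(1/n) * sum_i log(distance to nearest neighbor)."""
    points = [l2_normalize(e) for e in embeddings]
    total = 0.0
    for i, p in enumerate(points):
        # Distance from point i to its closest other point in the batch.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest + eps)
    return -total / len(points)


# Evenly spread embeddings versus a batch with three near-duplicates.
spread = koleo_loss([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
clumped = koleo_loss([[1.0, 0.0], [0.99, 0.1], [0.98, 0.2], [0.0, -1.0]])
```

The spread batch yields the lower loss, which is the pressure toward uniform coverage of the embedding space that the regularizer is meant to supply.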

The useful lineage distinction is:

Mantis: contrastive classification pretraining plus calibration and multivariate adapters.
MantisV2: synthetic data plus layer, token, scale, and feature-fusion choices at test time.
UTICA: non-contrastive self-distillation on the Mantis-style architecture.

Other Classification And Representation Routes

UniShape is classification-specific in a different way: it uses a multiscale shape-aware adapter and prototype objectives to preserve class-discriminative local shape. Its benchmarked entries separate fine-tuned classification from frozen-feature zero-shot extraction.

NuTime centers numerical scale. It separates local normalized shape from window mean and standard deviation through numerically multi-scaled embedding, then transfers to classification, few-shot learning, clustering, and anomaly detection.
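The separation of shape from scale can be sketched as follows. This shows only the decomposition of a window into a normalized shape plus its mean and standard deviation. How NuTime embeds those two scalars is not reproduced here, and the helper name `split_shape_and_scale` is hypothetical.

```python
import statistics


def split_shape_and_scale(window, eps=1e-8):
    """Return (normalized shape, window mean, window standard deviation)."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    shape = [(x - mean) / (std + eps) for x in window]
    return shape, mean, std


# Same ramp at two very different numerical scales.
small = split_shape_and_scale([1.0, 2.0, 3.0])
large = split_shape_and_scale([100.0, 200.0, 300.0])
```

Both windows produce essentially the same shape vector, while the mean and standard deviation carry the hundredfold difference in scale. A model that sees only the shape cannot tell the two apart, which is why the scalars are kept as a separate input.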

T-Loss is an older unsupervised representation baseline. It trains causal convolutional encoders with a time-based triplet loss, showing that temporal proximity and subseries containment can produce useful embeddings before the foundation-model era.
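A minimal sketch of the time-based triplet idea, assuming list-valued series and embeddings that some encoder has already produced: the anchor is a random subseries, the positive is a subseries contained in the anchor, and negatives are subseries of randomly chosen series. The loss uses the log-sigmoid form usually associated with this objective. The function names and the sampling details are illustrative assumptions, and the encoder itself is omitted.

```python
import math
import random


def random_subseries(series, rng, min_len=1):
    length = rng.randint(min_len, len(series))
    start = rng.randint(0, len(series) - length)
    return series[start:start + length]


def sample_triplet(dataset, rng, num_negatives=2):
    """Anchor subseries, a positive contained in it, and random negatives."""
    ref = random_subseries(rng.choice(dataset), rng, min_len=2)
    pos = random_subseries(ref, rng)  # containment makes it a positive
    negs = [random_subseries(rng.choice(dataset), rng)
            for _ in range(num_negatives)]
    return ref, pos, negs


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(z_ref, z_pos, z_negs):
    """-log sigmoid(ref . pos) - sum_k log sigmoid(-ref . neg_k)."""
    loss = -log_sigmoid(dot(z_ref, z_pos))
    for z_neg in z_negs:
        loss -= log_sigmoid(-dot(z_ref, z_neg))
    return loss


rng = random.Random(0)
ref, pos, negs = sample_triplet([[1, 2, 3, 4, 5, 6], [10, 20, 30, 40]], rng)

# Toy embeddings: the loss is low when the positive aligns with the anchor.
aligned = triplet_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
misaligned = triplet_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

No labels and no augmentations appear anywhere in the sampling step: temporal containment alone defines what counts as similar.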

TiViT tests a transfer route from frozen vision encoders. It renders numeric time series as images, extracts intermediate hidden-layer vision features, and trains a classifier. Its main lesson is that intermediate representation geometry can be useful even when the backbone was not trained on time series.
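One simple way to render a numeric series as an image is to rescale it to pixel intensities and fold it into rows. The sketch below does only that. It is an assumption about the general idea, not TiViT's exact rendering recipe, and it omits the resizing and the frozen vision encoder that would follow. The helper name and the zero padding are hypothetical choices.

```python
def series_to_grayscale_image(series, row_len):
    """Fold a 1D series into a 2D grid of 0-255 pixel intensities."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    pixels = [round(255 * (x - lo) / span) for x in series]

    # Zero-pad the tail so the last row is complete.
    remainder = len(pixels) % row_len
    if remainder:
        pixels += [0] * (row_len - remainder)
    return [pixels[i:i + row_len] for i in range(0, len(pixels), row_len)]


image = series_to_grayscale_image([0.0, 1.0, 2.0, 3.0, 4.0], row_len=2)
```

Folding places samples that are `row_len` steps apart in vertically adjacent pixels, so periodic structure in the series shows up as 2D texture that an image encoder can pick up.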

MOMENT is a broader time-series foundation model, but its classification evidence belongs here because masked-reconstruction representations can support downstream SVM classification even when the model was not designed only for labels.

LeNEPA is the local no-augmentation SSL result for this branch. It is not a classification-only model, but its evidence is reported through frozen probes on PTB-XL and Aionoscope Diag, plus a CauKer-pretrained UCR-128 Random-Forest check near Mantis/MOMENT protocol anchors. Treat the UCR result as a single-seed frozen-encoder check, not a leaderboard replacement for MantisV2.

CHARM is also broader than classification, but its UEA evidence belongs here because the frozen JEPA-trained encoder is probed with an SVM. Its distinguishing feature is not a new classifier head; it is a channel-description-conditioned representation model with native multivariate attention.

SensorFM is the large closed-corpus wearable-health branch. It uses missingness-aware masked reconstruction over 34 minute-level sensor features, then tests frozen embeddings and lightweight heads across 35 health and behavioral tasks. The result is strong label-scarce representation evidence, but the benchmark is private and health-specific rather than a public UCR/UEA-style comparison.

Aionoscope belongs here as a diagnostic benchmark rather than a classifier model. It uses categorical component probes alongside dense state probes, making clear that strong component classification does not imply dense latent-state accessibility.

TS2Vec is the main contrastive representation baseline to keep in this branch. It learns timestamp-level contextual embeddings with hierarchical temporal and instance contrast, then aggregates them for subseries or whole-series downstream classifiers.
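The aggregation step can be sketched as a per-dimension max over the timestamp embeddings in the span of interest. The embeddings below are made-up numbers standing in for encoder output, and the helper name is hypothetical.

```python
def max_pool_over_time(timestamp_embeddings, start=0, stop=None):
    """Aggregate per-timestamp vectors into one vector by per-dimension max."""
    window = timestamp_embeddings[start:stop]
    return [max(dim_values) for dim_values in zip(*window)]


# Three timestamps, two embedding dimensions (illustrative values).
per_timestamp = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]

whole_series = max_pool_over_time(per_timestamp)
last_two_steps = max_pool_over_time(per_timestamp, start=1)
```

Because pooling happens after encoding, the same timestamp-level output can feed a whole-series classifier or a subseries classifier without re-running the encoder.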

T-Rep is the learned-time-embedding branch. Like TS2Vec, it cares about timestep-level representations, but it makes the time embedding itself learned and uses it in pretext tasks so temporal features such as trend, periodicity, distribution shifts, and missingness can be represented explicitly.

SimMTM is the masked-modeling contrast: it reconstructs original time series from multiple masked neighbors and transfers by fine-tuning into forecasting and classification tasks.

UniTS is a broader multi-task model rather than a classification-only model. Its classification relevance comes from using task tokens and shared weights to put classification beside forecasting, imputation, and anomaly detection in one interface.

Label Scarcity And Data Engines

Florence-2 is a useful cross-domain pattern for time-series classification: the data engine starts from imperfect labels, trains a seed model, then uses model predictions plus filtering to improve and expand the dataset. For temporal classification, the analogous loop would label real multivariate time series with class targets, event spans, regimes, or anomaly spans, then keep a human-audited split outside the loop for evaluation.
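A toy version of that loop, under heavy simplifying assumptions: each series is reduced to one scalar feature, the seed model is a nearest-centroid classifier, and the filter is a margin threshold. None of these choices comes from Florence-2. They only make the bootstrap-then-filter structure and the held-out audited split concrete.

```python
import statistics


def fit_centroids(features, labels):
    """Seed model: one centroid (mean feature value) per class."""
    return {
        c: statistics.fmean(f for f, y in zip(features, labels) if y == c)
        for c in set(labels)
    }


def predict_with_margin(centroids, feature):
    """Return the nearest class and the distance gap to the runner-up."""
    ranked = sorted(centroids, key=lambda c: abs(feature - centroids[c]))
    best, second = ranked[0], ranked[1]
    margin = abs(feature - centroids[second]) - abs(feature - centroids[best])
    return best, margin


def data_engine(seed_x, seed_y, unlabeled, rounds=2, min_margin=1.0):
    labeled_x, labeled_y = list(seed_x), list(seed_y)
    pool = list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled_x, labeled_y)
        remaining = []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:  # filter: keep confident predictions
                labeled_x.append(x)
                labeled_y.append(label)
            else:
                remaining.append(x)   # ambiguous items stay unlabeled
        pool = remaining
    return fit_centroids(labeled_x, labeled_y), pool


model, leftover = data_engine(
    seed_x=[0.0, 1.0, 9.0, 10.0], seed_y=["low", "low", "high", "high"],
    unlabeled=[0.5, 2.0, 5.2, 8.0, 9.5],
)

# The audited split is never added to the labeled set inside the loop.
audited = [(1.5, "low"), (8.5, "high")]
accuracy = sum(predict_with_margin(model, x)[0] == y for x, y in audited) / len(audited)
```

The ambiguous item near the class boundary is never pseudo-labeled, which is the role of the filter. The audited pairs exist only to score the final model.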

This matters because many time-series classification papers lean on UCR/UEA or synthetic labels. A Florence-style data engine suggests a third path: use real unlabeled temporal data, bootstrap the annotation layer iteratively, and make the label ontology itself a first-class artifact.

S4L is a historical vision caution for this branch. Auxiliary self-supervision should be judged only after strong supervised-only baselines and realistic validation protocols, because weak baselines can make unlabeled-data gains look larger than they are.

What To Compare

Classification papers should be compared on the evaluation mode, not only on average rank:

Frozen feature extraction tests whether the pretrained representation transfers with a lightweight downstream classifier.
Fine-tuning tests whether the pretrained backbone is a good initialization for a target dataset.
Zero-shot claims may still train a Random Forest, SVM, or logistic-regression head on target labels, so they are not label-free prediction.
Fusion results, such as MantisV2 plus TiViT features, should be separated from single-model entries.
Fixed-recipe SSL checks, such as LeNEPA’s PTB-XL/Diag protocol, should be separated from fully tuned classification leaderboards.
Categorical probes, such as Aionoscope component presence, should be separated from dense process-state probes.
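The frozen-feature protocol can be sketched as below. The encoder is a fixed stand-in function, and the head is a nearest-centroid classifier rather than the Random Forest, SVM, or logistic-regression heads named above. Both are assumptions chosen to keep the example dependency-free. The point is where the target labels enter.

```python
import math
import statistics


def frozen_encoder(series):
    """Stand-in for a pretrained encoder; its parameters are never updated."""
    return (statistics.fmean(series), statistics.pstdev(series))


def fit_head(embeddings, labels):
    """Lightweight head: per-class mean embedding (nearest centroid)."""
    centroids = {}
    for c in set(labels):
        members = [e for e, y in zip(embeddings, labels) if y == c]
        centroids[c] = tuple(statistics.fmean(dim) for dim in zip(*members))
    return centroids


def predict(centroids, embedding):
    return min(centroids, key=lambda c: math.dist(embedding, centroids[c]))


train = [([0, 0, 0, 0], "flat"), ([1, 1, 1, 1], "flat"),
         ([0, 5, 0, 5], "spiky"), ([1, 6, 1, 6], "spiky")]
test = [([2, 2, 2, 2], "flat"), ([0, 4, 0, 4], "spiky")]

# Target labels are used here, even though the encoder stays frozen.
head = fit_head([frozen_encoder(x) for x, _ in train], [y for _, y in train])
accuracy = sum(predict(head, frozen_encoder(x)) == y for x, y in test) / len(test)
```

Fine-tuning would differ only in that `frozen_encoder` would also be updated on `train`. Any protocol that calls `fit_head` on target labels is not label-free prediction, regardless of how the entry is named.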

UCR and UEA classification results should not be merged with forecasting benchmarks such as GIFT-Eval, BOOM, TIME, fev-bench, Monash, or LSF. They test different task surfaces.

Evidence

The classification branch repeatedly argues that shape, scale, calibration, channel semantics, feature geometry, and label coverage matter more than direct future-value prediction. Each entry tests a different part of that claim:

Mantis and UTICA test contrastive versus self-distilled objectives on a shared family.
MantisV2 and CauKer-style synthetic data test label coverage.
LeNEPA tests no-augmentation next-latent SSL under fixed-recipe reuse and a frozen-encoder UCR check.
SensorFM tests population-scale wearable masked reconstruction with health-task transfer.
Aionoscope separates categorical component recovery from dense latent-state recovery.
TS2Vec tests timestamp-level contrastive representation learning.
T-Rep tests learned time-embeddings and explicit temporal pretext tasks.
SimMTM tests masked reconstruction from multiple corrupted neighbors.
UniTS tests whether classification can share a task-token interface with forecasting and imputation.
Florence-2 contributes a cross-domain data-engine pattern for bootstrapped annotation layers.
CHARM tests JEPA-style latent prediction with native multivariate channel descriptions.
UniShape tests shape-aware class tokens.
NuTime tests numerical-scale preservation.
TiViT tests representation transfer from vision models.

Relation To Foundation TSFM Agenda

This page contributes to the Foundation Time-Series Model Research Agenda as representation-learning evidence, not as forecasting or control evidence. Its strongest slot is representation quality: classification can test whether embeddings preserve shape, scale, channel semantics, and labels. Its main warning is that passive classification embeddings do not prove streaming state maintenance, plausible future modeling, or action-conditioned control.

Open Questions

Does the Mantis lineage scale cleanly beyond the 4M to 8M parameter regime without losing the deployment advantage?
Which gains come from synthetic data diversity, objective choice, architecture, test-time layer selection, or downstream classifier choice?
Can real unlabeled time-series corpora be converted into useful classification corpora through iterative model-assisted labeling?
When does adding a self-supervised auxiliary loss improve label-scarce time-series classification after strong supervised-only tuning, and when is the apparent gain mostly an artifact of the validation or model-selection protocol?
Can a native multivariate classification model preserve cross-channel dynamics better than channel-wise encoding and concatenation?
Which SensorFM gains come from corpus scale, missingness-aware AIM pretraining, wearable-domain feature design, or downstream-head search?
Which classification representations transfer to forecasting, anomaly detection, or action-conditioned world models rather than only UCR/UEA labels?
Do textual channel descriptions improve classification transfer outside datasets with clean, human-readable sensor names?
Which non-biomedical domains can support representation-learning benchmarks with state, regime, or rare-event semantics?
