Time-Series Classification Foundation Models

Summary

Time-series classification foundation models learn reusable embeddings for labeled time-series tasks rather than forecasting future observations directly. They are usually passive representation models: an encoder maps observed samples to embeddings, and a downstream classifier consumes those embeddings. They expose no channel for actions, control inputs, interventions, or counterfactual rollouts.

For the broader wiki agenda, this branch is important because it is one of the clearest places where time-series work optimizes representations rather than forecasts. ICLR 2026 Time-Series Classification Meta-Analysis adds the caution that visible representation-learning work may be concentrated in EEG/ECG/neuro/physiology settings, so cross-domain coverage should be checked explicitly.

Mantis Lineage

Mantis is the base model of the lineage: a lightweight, calibrated classification model. Its token generator operates on normalized values, differentials, and patch statistics, and the work evaluates frozen features, fine-tuning, multivariate adapters, and calibration.
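
As a rough illustration of what those three input streams look like, the sketch below computes them with NumPy for a univariate series. The function name `patch_features`, the patch count, and the choice of per-patch mean and standard deviation as the patch statistics are assumptions made for illustration; the actual Mantis token generator is a learned module and is not reproduced here.

```python
import numpy as np

def patch_features(series, num_patches=4, eps=1e-8):
    """Hypothetical helper: per-patch normalized values, differences, and raw stats."""
    x = np.asarray(series, dtype=float)
    if x.size % num_patches != 0:
        raise ValueError("series length must be divisible by num_patches")
    # Instance-normalize the whole series (zero mean, unit variance).
    z = (x - x.mean()) / (x.std() + eps)
    # First differences of the normalized series; the first entry is 0
    # so the length is preserved.
    dz = np.diff(z, prepend=z[0])
    # Statistics are taken on the raw values, so scale information survives.
    raw = x.reshape(num_patches, -1)
    return {
        "values": z.reshape(num_patches, -1),
        "diffs": dz.reshape(num_patches, -1),
        "stats": np.stack([raw.mean(axis=1), raw.std(axis=1)], axis=1),
    }

feats = patch_features(np.arange(8.0), num_patches=4)
print(feats["values"].shape, feats["diffs"].shape, feats["stats"].shape)
```

In a real model each of these arrays would be passed through learned layers before being combined into tokens; the sketch only shows the raw ingredients.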

MantisV2 extends the same line with synthetic CauKer-style pretraining and test-time representation strategies. The work introduces two models: MantisPlus, the original Mantis architecture retrained on synthetic data, and MantisV2, a refined, smaller encoder with stronger zero-shot frozen-feature behavior.

UTICA is the self-distillation branch of the same family. It keeps the Mantis tokenizer and backbone, but replaces contrastive pretraining with DINO/iBOT-style multi-objective self-distillation, using global/local crop alignment, masked patch prediction, and a KoLeo regularizer.
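
The KoLeo term is the easiest of these objectives to write down. The following is a minimal NumPy sketch, assuming the common formulation: the negative mean log distance from each sample to its nearest neighbor within a batch, computed on L2-normalized embeddings. How UTICA weights or applies the term is not specified here, and `koleo_loss` is a hypothetical helper name.

```python
import numpy as np

def koleo_loss(embeddings, eps=1e-8):
    """Assumed KoLeo form: -mean(log(nearest-neighbor distance)) over a batch."""
    z = np.asarray(embeddings, dtype=float)
    # L2-normalize each embedding (an assumption borrowed from common practice).
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    # Pairwise Euclidean distances, shape (n, n).
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    # Ignore the distance from each point to itself.
    np.fill_diagonal(d, np.inf)
    nearest = d.min(axis=1)
    return float(-np.mean(np.log(nearest + eps)))

spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
clustered = np.array([[1.0, 0.0], [1.0, 0.01], [1.0, 0.02], [1.0, 0.03]])
print(koleo_loss(spread), koleo_loss(clustered))
```

The loss is lower for the spread-out batch than for the clustered one, which is the behavior a regularizer against embedding collapse is meant to have.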

The useful lineage distinction is:

  • Mantis: contrastive classification pretraining plus calibration and multivariate adapters.
  • MantisV2: synthetic data plus layer, token, scale, and feature-fusion choices at test time.
  • UTICA: non-contrastive self-distillation on the Mantis-style architecture.

Other Classification And Representation Routes

UniShape is classification-specific in a different way: it uses a multiscale shape-aware adapter and prototype objectives to preserve class-discriminative local shape. Its benchmarked entries separate fine-tuned classification from frozen-feature zero-shot extraction.

NuTime centers numerical scale. It separates local normalized shape from window mean and standard deviation through numerically multi-scaled embedding, then transfers to classification, few-shot learning, clustering, and anomaly detection.
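
The separation of shape from scale can be sketched in a few lines. The example below is an illustration under simple assumptions (non-overlapping windows, a hypothetical `split_windows` helper) and does not implement NuTime's numerically multi-scaled embedding of the mean and standard deviation; it only shows the decomposition that the embedding would consume.

```python
import numpy as np

def split_windows(series, window, eps=1e-8):
    """Hypothetical helper: split into windows, return (shape, mean, std) per window."""
    x = np.asarray(series, dtype=float)
    n = x.size // window
    # Drop any trailing samples that do not fill a whole window.
    w = x[: n * window].reshape(n, window)
    mean = w.mean(axis=1, keepdims=True)
    std = w.std(axis=1, keepdims=True)
    # Normalized local shape: scale and offset are removed.
    shape = (w - mean) / (std + eps)
    return shape, mean.squeeze(1), std.squeeze(1)

shape_part, mean_part, std_part = split_windows([1, 2, 3, 100, 200, 300], window=3)
print(shape_part)
print(mean_part, std_part)
```

Both windows have the same normalized shape, while their means and standard deviations differ by a factor of 100. Keeping the two scalars alongside the shape is what lets a model distinguish the windows.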

T-Loss is an older unsupervised representation baseline. It trains causal convolutional encoders with a time-based triplet loss, showing that temporal proximity and subseries containment can produce useful embeddings before the foundation-model era.
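
A minimal sketch of the time-based triplet idea follows, assuming the usual logistic form of the loss: the anchor embedding should score high against a positive taken from inside the anchor subseries and low against negatives taken from other series. The helper names and the index-sampling scheme are illustrative assumptions, and the encoder itself is omitted.

```python
import numpy as np

def log_sigmoid(v):
    # Numerically stable log(sigmoid(v)).
    return -np.logaddexp(0.0, -v)

def triplet_loss(anchor, positive, negatives):
    """-log s(a.p) - sum_k log s(-a.n_k), with s the logistic sigmoid."""
    a = np.asarray(anchor, dtype=float)
    p = np.asarray(positive, dtype=float)
    n = np.asarray(negatives, dtype=float)  # shape (K, dim)
    return float(-log_sigmoid(a @ p) - log_sigmoid(-(n @ a)).sum())

def sample_anchor_and_positive(length, rng, min_len=2):
    """Pick an anchor span, then a positive span contained in it (half-open)."""
    a_len = int(rng.integers(min_len, length + 1))
    a_start = int(rng.integers(0, length - a_len + 1))
    p_len = int(rng.integers(1, a_len + 1))
    p_start = int(rng.integers(a_start, a_start + a_len - p_len + 1))
    return (a_start, a_start + a_len), (p_start, p_start + p_len)

rng = np.random.default_rng(0)
anchor_span, positive_span = sample_anchor_and_positive(50, rng)
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = triplet_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(anchor_span, positive_span, good, bad)
```

In practice the spans would index into a series, the encoder would embed each span, and negatives would be sampled from different series in the dataset.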

TiViT tests a transfer route from frozen vision encoders. It renders numeric time series as images, extracts intermediate hidden-layer vision features, and trains a classifier. Its main lesson is that intermediate representation geometry can be useful even when the backbone was not trained on time series.
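
One simple way to turn a series into an image-like array is to min-max scale it and stack fixed-width segments as rows. The sketch below does that; the function name `series_to_image`, the segment width, and the edge padding are illustrative assumptions and should not be read as TiViT's exact rendering procedure.

```python
import numpy as np

def series_to_image(series, width):
    """Hypothetical rendering: stack fixed-width segments into an (H, W, 3) array."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    # Min-max scale to [0, 1]; a constant series maps to all zeros.
    scaled = (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
    # Pad with the last value so the length is a multiple of the width.
    pad = (-x.size) % width
    scaled = np.pad(scaled, (0, pad), mode="edge")
    gray = scaled.reshape(-1, width)
    # Repeat the grayscale plane to get the three channels vision encoders expect.
    return np.repeat(gray[:, :, None], 3, axis=2)

img = series_to_image(np.arange(10.0), width=4)
print(img.shape)
```

A frozen vision encoder would then take a resized version of this array as input, and features from one of its intermediate layers would feed the downstream classifier.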

MOMENT is a broader time-series foundation model, but its classification evidence belongs here because masked-reconstruction representations can support downstream SVM classification even though the model was not designed solely for classification.

CHARM is also broader than classification, but its UEA evidence belongs here because the frozen JEPA-trained encoder is probed with an SVM. Its distinguishing feature is not a new classifier head; it is a channel-description-conditioned representation model with native multivariate attention.

TS2Vec is the main contrastive representation baseline to keep in this branch. It learns timestamp-level contextual embeddings with hierarchical temporal and instance contrast, then aggregates them for subseries or whole-series downstream classifiers.
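
The aggregation step can be sketched as max pooling over the time axis, which is a common way to collapse timestamp-level embeddings into one vector for a subseries or a whole series. The helper name `pool_embeddings` and the toy embedding values are assumptions; the encoder that would produce the embeddings is not shown.

```python
import numpy as np

def pool_embeddings(timestamp_embeddings, start=0, stop=None):
    """Max-pool embeddings of shape (T, dim) over the time range [start, stop)."""
    e = np.asarray(timestamp_embeddings, dtype=float)
    return e[start:stop].max(axis=0)

# Toy timestamp-level embeddings: 3 timestamps, 2 dimensions.
emb = np.array([[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]])
whole_series = pool_embeddings(emb)       # pooled over all timestamps
subseries = pool_embeddings(emb, 1, 3)    # pooled over timestamps 1 and 2
print(whole_series, subseries)
```

The pooled vector is what a lightweight downstream classifier such as an SVM would receive.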

T-Rep is the learned-time-embedding branch. Like TS2Vec, it cares about timestep-level representations, but it makes the time embedding itself learned and uses it in pretext tasks so temporal features such as trend, periodicity, distribution shifts, and missingness can be represented explicitly.

SimMTM is the masked-modeling counterpart: it reconstructs the original time series from multiple masked neighbors and transfers to forecasting and classification tasks by fine-tuning.

UniTS is a broader multi-task model rather than a classification-only model. Its classification relevance comes from using task tokens and shared weights to put classification beside forecasting, imputation, and anomaly detection in one interface.

Label Scarcity And Data Engines

Florence-2, a vision foundation model, offers a useful cross-domain pattern for time-series classification through its data engine: start from imperfect labels, train a seed model, then use model predictions plus filtering to improve and expand the dataset. For temporal classification, the analogous loop would label real multivariate time series with class targets, event spans, regimes, or anomaly spans, and keep a human-audited split outside the loop for evaluation.

This matters because many time-series classification papers lean on UCR/UEA or synthetic labels. A Florence-style data engine suggests a third path: use real unlabeled temporal data, bootstrap the annotation layer iteratively, and make the label ontology itself a first-class artifact.
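
The bootstrapping loop described above can be sketched with a toy nearest-centroid model standing in for the seed classifier. Everything here is a hypothetical illustration: the function names, the margin-based filter, and the synthetic two-cluster embeddings are assumptions, and a real data engine would use a trained encoder, richer label types, and a human-audited evaluation split that never enters the loop.

```python
import numpy as np

def fit_centroids(X, y, num_classes):
    # Every class must appear at least once in the labeled set.
    return np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])

def predict_with_margin(X, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    ordered = np.sort(d, axis=1)
    # Margin = gap between the closest and second-closest centroid.
    return d.argmin(axis=1), ordered[:, 1] - ordered[:, 0]

def data_engine(X_seed, y_seed, X_pool, num_classes, rounds=3, min_margin=1.0):
    X_lab, y_lab, pool = X_seed.copy(), y_seed.copy(), X_pool.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        centroids = fit_centroids(X_lab, y_lab, num_classes)
        pred, margin = predict_with_margin(pool, centroids)
        keep = margin >= min_margin          # filter: keep confident predictions
        if not keep.any():
            break
        X_lab = np.concatenate([X_lab, pool[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        pool = pool[~keep]                   # unconfident samples stay unlabeled
    return X_lab, y_lab, pool

rng = np.random.default_rng(0)
X_seed = np.concatenate([rng.normal(0, 0.5, (4, 2)), rng.normal(5, 0.5, (4, 2))])
y_seed = np.array([0] * 4 + [1] * 4)
X_pool = np.concatenate([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
X_lab, y_lab, leftover = data_engine(X_seed, y_seed, X_pool, num_classes=2)
print(len(X_lab), len(leftover))
```

The filter threshold controls the trade-off between label coverage and label noise. The audited evaluation split should be held entirely outside `data_engine` so that pseudo-label errors cannot leak into reported accuracy.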

What To Compare

Classification papers should be compared on the evaluation mode, not only on average rank:

  • Frozen feature extraction tests whether the pretrained representation transfers with a lightweight downstream classifier.
  • Fine-tuning tests whether the pretrained backbone is a good initialization for a target dataset.
  • Zero-shot claims may still train a Random Forest, SVM, or logistic-regression head on target labels, so they are not label-free prediction.
  • Fusion results, such as MantisV2 plus TiViT features, should be separated from single-model entries.
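
The frozen-feature mode in the list above can be made concrete with a small sketch. The encoder here is a fixed random projection standing in for a pretrained backbone, and the head is a ridge classifier fit on target labels; all names and the synthetic data are assumptions for illustration. The point is that the head still consumes target-dataset labels even though the encoder weights never change.

```python
import numpy as np

def frozen_encoder(X, W):
    # Stand-in for a pretrained encoder; W is never updated.
    return np.tanh(X @ W)

def fit_ridge_head(Z, y, num_classes, alpha=1.0):
    # Ridge regression onto one-hot targets, with a bias column.
    Y = np.eye(num_classes)[y]
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    A = Zb.T @ Zb + alpha * np.eye(Zb.shape[1])
    return np.linalg.solve(A, Zb.T @ Y)

def predict(Z, head):
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return (Zb @ head).argmax(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.25, (16, 8))  # frozen weights

def make_split(n):
    # Toy task: class 1 series are shifted up by one unit.
    y = np.arange(n) % 2
    X = rng.normal(0, 0.1, (n, 16)) + y[:, None]
    return X, y

X_train, y_train = make_split(40)
X_test, y_test = make_split(40)
W_before = W.copy()
# Only the head sees target labels; the encoder is applied as-is.
head = fit_ridge_head(frozen_encoder(X_train, W), y_train, num_classes=2)
accuracy = float((predict(frozen_encoder(X_test, W), head) == y_test).mean())
print(accuracy)
```

Fine-tuning would differ by also updating `W` on the target training split, which is why the two modes should be reported as separate entries.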

UCR and UEA classification results should not be merged with forecasting benchmarks such as GIFT-Eval, BOOM, TIME, fev-bench, Monash, or LSF. They test different task surfaces.

Evidence

The classification branch repeatedly argues that shape, scale, calibration, channel semantics, feature geometry, and label coverage matter more than direct future-value prediction. Each entry tests a different part of that argument:

  • Mantis and UTICA test contrastive versus self-distilled objectives on a shared family.
  • MantisV2 and CauKer-style synthetic data test label coverage.
  • TS2Vec tests timestamp-level contrastive representation learning.
  • T-Rep tests learned time embeddings and explicit temporal pretext tasks.
  • SimMTM tests masked reconstruction from multiple corrupted neighbors.
  • UniTS tests whether classification can share a task-token interface with forecasting and imputation.
  • Florence-2 contributes a cross-domain data-engine pattern for bootstrapped annotation layers.
  • CHARM tests JEPA-style latent prediction with native multivariate channel descriptions.
  • UniShape tests shape-aware class tokens.
  • NuTime tests numerical-scale preservation.
  • TiViT tests representation transfer from vision models.

Relation To Foundation TSFM Agenda

This page contributes to the Foundation Time-Series Model Research Agenda as representation-learning evidence, not as forecasting or control evidence. Its strongest slot is representation quality: classification can test whether embeddings preserve shape, scale, channel semantics, and labels. Its main warning is that passive classification embeddings do not prove streaming state maintenance, plausible future modeling, or action-conditioned control.

Open Questions

  • Does the Mantis lineage scale cleanly beyond the 4M to 8M parameter regime without losing the deployment advantage?
  • Which gains come from synthetic data diversity, objective choice, architecture, test-time layer selection, or downstream classifier choice?
  • Can real unlabeled time-series corpora be converted into useful classification corpora through iterative model-assisted labeling?
  • Can a native multivariate classification model preserve cross-channel dynamics better than channel-wise encoding and concatenation?
  • Which classification representations transfer to forecasting, anomaly detection, or action-conditioned world models rather than only to UCR/UEA label prediction?
  • Do textual channel descriptions improve classification transfer outside datasets with clean, human-readable sensor names?
  • Which non-biomedical domains can support representation-learning benchmarks with state, regime, or rare-event semantics?