Unsupervised Scalable Representation Learning for Multivariate Time Series

Source

Core Claim

The T-Loss paper argues that a causal dilated convolutional encoder, trained with a fully unsupervised time-based triplet loss, can learn transferable fixed-size representations of variable-length univariate and multivariate time series.

Key Contributions

  • Introduces a time-based triplet loss that samples a reference subseries, one contained positive subseries, and randomly selected negative subseries without using labels.
  • Uses an encoder built from exponentially dilated causal convolutions, residual connections, global max pooling, and a final linear projection so representation size is independent of input length.
  • Evaluates learned representations with simple downstream classifiers on UCR univariate classification and UEA multivariate classification benchmarks.
  • Demonstrates that the same representation-learning setup can scale to a long household-electricity time series and support downstream regression with large inference-time savings over raw-window features.
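
The label-free sampling scheme in the first contribution can be sketched as follows. This is a minimal illustration rather than the authors' code: the names `sample_subseries` and `sample_triplet`, the uniform choice of lengths and positions, and the use of plain Python lists for univariate series are all assumptions made for the example.

```python
import random


def sample_subseries(series, length, rng):
    """Return (start, window) for a uniformly placed contiguous window."""
    start = rng.randint(0, len(series) - length)  # randint is inclusive
    return start, series[start:start + length]


def sample_triplet(dataset, index, num_negatives, rng):
    """Draw a reference, a positive contained in it, and K negatives.

    No labels are used: only position in time decides what is positive.
    """
    series = dataset[index]
    pos_len = rng.randint(1, len(series))
    ref_len = rng.randint(pos_len, len(series))
    ref_start, ref = sample_subseries(series, ref_len, rng)
    # The positive is sampled inside the reference window.
    pos_offset, pos = sample_subseries(ref, pos_len, rng)
    negatives = []
    for _ in range(num_negatives):
        # Negatives come from a randomly chosen series in the dataset.
        other = dataset[rng.randrange(len(dataset))]
        neg_len = rng.randint(1, len(other))
        negatives.append(sample_subseries(other, neg_len, rng)[1])
    return {
        "ref": ref,
        "pos": pos,
        "neg": negatives,
        "ref_start": ref_start,
        "pos_start": ref_start + pos_offset,
    }
```

For a single long series, the same idea applies with negatives drawn from other positions of that series instead of from other series.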

Benchmarked Models

Model | Role In Paper | Notes | Official Artifact
--- | --- | --- | ---
T-Loss-CricketX | Repo-hosted benchmark checkpoint for the CricketX UCR dataset | Causal CNN encoder trained with the T-Loss recipe; the paper uses CricketX to show classification accuracy improving during unsupervised training with K=10 negative samples. | models/CricketX_CausalCNN_encoder.pth

Method Notes

T-Loss is a passive time-series representation model: it learns embeddings from observed time series and does not include an action, control input, intervention, or exogenous-variable channel. The model is still relevant to world-model work because it studies how far a generic latent state for time series can transfer across downstream tasks when trained without labels.

The training objective adapts the negative-sampling intuition from word2vec to time series. A reference subseries should have a representation close to one of its own subseries and far from random subseries sampled from another time series or another part of a long series.
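
A small sketch may make the objective concrete. It assumes the word2vec-style logistic form, in which the loss for one triplet is -log σ(z_ref · z_pos) minus the sum over negatives of log σ(-z_ref · z_neg), with σ the sigmoid and z the encoder outputs. The function names and the use of plain lists as embedding vectors are illustrative assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triplet_loss(z_ref, z_pos, z_negs):
    """Loss for one reference embedding, one positive, and K negatives.

    Low when z_ref aligns with z_pos and anti-aligns with each negative.
    """
    loss = -log_sigmoid(dot(z_ref, z_pos))
    for z_neg in z_negs:
        loss -= log_sigmoid(-dot(z_ref, z_neg))
    return loss
```

Under this form, an embedding that points the same way as its positive and away from its negatives yields a smaller loss than the reverse arrangement, which is the behaviour the paragraph above describes.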

The encoder choice matters for scalability. The paper favors causal convolutions over recurrent encoders because dilated convolutions can capture long-range dependencies with parallel hardware-friendly computation, while max pooling turns variable-length sequences into fixed-size representations.
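
The two architectural ideas in this paragraph, causal dilation and global max pooling, can be illustrated with a toy single-channel encoder in pure Python. This is a simplified sketch under stated assumptions: it omits residual connections, normalization, and the final linear projection, and the names `causal_dilated_conv`, `receptive_field`, and `encode` are hypothetical.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], zero-padded on the left.

    The output at time t never reads inputs later than t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            src = t - j * dilation
            if src >= 0:
                acc += w * x[src]
        out.append(acc)
    return out


def receptive_field(kernel_size, num_layers):
    """Input steps visible to one output when layer i uses dilation 2**i."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)


def encode(x, hidden_filters, output_filters):
    """Stack dilated layers with ReLU, then max-pool each output filter over time.

    The result has len(output_filters) entries regardless of len(x).
    """
    h = list(x)
    for i, weights in enumerate(hidden_filters):
        h = [max(0.0, v) for v in causal_dilated_conv(h, weights, 2 ** i)]
    last_dilation = 2 ** len(hidden_filters)
    return [max(causal_dilated_conv(h, w, last_dilation)) for w in output_filters]
```

With kernel size 3 and ten layers, `receptive_field(3, 10)` is 2047 input steps, which shows how exponential dilation reaches long contexts with few layers. The pooling step is what makes the output size depend only on the number of output filters.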

Evidence And Results

  • On UCR univariate classification, the combined T-Loss representation outperforms the concurrent unsupervised baselines TimeNet and RWS on most datasets where comparisons are available.
  • Against supervised non-neural classifiers on the first 85 UCR datasets, the paper reports average rank 2.92 for T-Loss, behind HIVE-COTE and close to ST.
  • On CricketX, the appendix reports combined T-Loss accuracy 0.777; the learning-curve figure tracks the CricketX encoder with K=10 during training.
  • On UEA multivariate classification, T-Loss matches or outperforms dimension-dependent DTW on 69% of the datasets.
  • On the Individual Household Electric Power Consumption series, learned day- and quarter-window representations greatly reduce downstream regression wall time, with similar or slightly worse error than raw-window features.
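
For readers unfamiliar with the average-rank metric used in the UCR comparison above, the sketch below shows one common way to compute it. The accuracies are invented toy values, not results from the paper, and the tie rule (tied methods share the mean of their positions) is an assumption; the paper's exact tie handling is not stated here.

```python
def ranks_for_dataset(accuracies):
    """Rank methods on one dataset: 1 is best; ties share the mean rank."""
    ordered = sorted(accuracies.items(), key=lambda kv: -kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of methods tied with position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of positions i+1 .. j+1
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def average_ranks(results):
    """Average each method's per-dataset rank over all datasets.

    Assumes every method has a score on every dataset.
    """
    totals = {}
    for accuracies in results.values():
        for name, rank in ranks_for_dataset(accuracies).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# Toy, made-up accuracies for three hypothetical methods on two datasets.
toy_results = {
    "toy_a": {"m1": 0.9, "m2": 0.8, "m3": 0.7},
    "toy_b": {"m1": 0.6, "m2": 0.6, "m3": 0.5},
}
toy_average = average_ranks(toy_results)
```

A lower average rank means a method is more consistently near the top across datasets, which is a different statement from having the highest mean accuracy.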

Limitations

  • The paper is a representation-learning result rather than a forecasting or action-conditioned world-model result; downstream prediction still depends on task-specific SVMs or linear regressors.
  • The main classification protocol trains an encoder per dataset, so it is not a single broad foundation model in the later time-series sense.
  • The UEA multivariate benchmark was new at the time, and the paper compares against DTW-D rather than a broad set of later multivariate baselines.
  • Although hyperparameters are fixed per archive, results still depend on choices such as the number of negative samples and the SVM regularization grid.

Foundation TSFM Relevance

Agenda slot | Verdict | Evidence | Missing pieces
--- | --- | --- | ---
Representation quality: semantic state vs dense detail | adjacent | Learns fixed-size representations of variable-length series with a causal dilated CNN and unsupervised time-based triplet loss. | Representations are consumed by task-specific classifiers/regressors; no generative/editing fidelity or latent transition model.
Native multivariate encoding and high-channel scaling | adjacent | Extends the encoder to UEA multivariate classification by changing input filters and matches or outperforms DTW-D on most datasets. | Does not model channel semantics, topology, known covariates, or high-channel operational systems.
Streaming state, long context, and constant updates | adjacent | Demonstrates scalable representations on a 2M-point household-electricity series with large downstream speedups. | Offline encoder windows are not online state updates or recurrent latent memory.
Control and counterfactuals | insufficient evidence | Time-based negative sampling learns passive temporal similarity. | No action, intervention, treatment, or counterfactual supervision.

Open Questions

  • How much of T-Loss transfer comes from the triplet objective versus the causal CNN architecture?
  • Would a single encoder trained over many heterogeneous datasets retain the per-dataset performance reported here?
  • Can time-based negative sampling be adapted to action-conditioned trajectories without confusing passive temporal proximity with intervention effects?