Self-Supervised Representation Learning

Summary

The wiki’s SSL thread compares scaled visual representation learning with predictive embedding objectives that avoid raw reconstruction. The time-series branch adds a separate question: which objectives preserve temporal shape, numerical scale, local structure, and multivariate context before a downstream label is available? The newest layer-selection lesson is that the final embedding is often an objective-specific interface, not automatically the best reusable representation. The newest objective-design lesson is that anti-collapse and predictability constraints can smuggle in distribution priors or slow-feature shortcuts.

What The Wiki Currently Believes

  • A Cookbook of Self-Supervised Learning is the preferred beginner orientation source for the pre-2023 SSL landscape: method families, implementation knobs, and evaluation protocols rather than a new algorithm.
  • DINOv3 is the scaled vision-foundation-model reference point, with strong dense features and broad frozen transfer.
  • Guillotine Regularization shows that projectors can be useful during SSL training even though their final outputs are poor downstream defaults; the best layer can depend on task, distribution, and optimization.
  • The Hidden Uniform Cluster Prior in Self-Supervised Learning shows that SSL collapse-prevention mechanisms can impose a uniform cluster prior that conflicts with long-tailed data.
  • Joint Embedding Predictive Architectures Focus on Slow Features shows that JEPA-style latent prediction can learn fixed slow distractors instead of action-relevant state.
  • Learning is Forgetting adds a language-pretraining compression lens: useful representations can forget input detail, but only relative to the training objective.
  • Perception Encoder shows the same hidden-feature effect in a large contrastive vision-language encoder, then uses alignment tuning to lift intermediate features to the output.
  • LeJEPA argues for a theory-grounded JEPA objective with SIGReg.
  • When Does LeJEPA Learn a World Model? adds an identifiability theorem for LeJEPA-style objectives, connecting Gaussian/whitened anti-collapse regularization to linearly recoverable latent state under explicit assumptions.
  • Self-Teaching Autoencoder adds a decoder-grounded latent-consistency experiment: reconstruction is trained through transformed encoder agreement rather than direct image-space loss.
  • NEPA shows next-embedding prediction can make strong vision learners without pixels, tokens, contrastive loss, or task-specific heads.
  • RAEv2 turns the best-layer question into a generative-model design knob: multi-layer sums preserve more local detail for reconstruction and rollouts, while REPA supplies complementary spatial regularization.
  • VL-JEPA applies predictive embedding learning to vision-language tasks.
  • CHARM brings JEPA-style latent prediction into multivariate time-series representation learning, using channel descriptions to condition temporal featurization and inter-channel attention.
  • T-Loss is an older scalable unsupervised baseline for multivariate time-series representation learning.
  • TS2Vec is the hierarchical contrastive branch: it learns timestamp-level contextual representations through augmented context views, temporal contrast, instance contrast, and multi-scale pooling.
  • T-Rep extends the timestep-level representation thread by learning embeddings of time itself, then using those embeddings in pretext tasks to capture trend, periodicity, distribution shifts, and missing-data structure.
  • SimMTM is the masked-modeling branch that reconstructs from multiple corrupted neighbors instead of a single masked series.
  • ATST is the audio SSL branch: a teacher-student Transformer setup that separates clip-level and frame-level representation learning, with masking and data augmentation for fine-grained frame-level sound events.
  • NuTime shows that numerical scale needs an explicit representation path rather than being erased by window normalization.
  • UniTS uses masked reconstruction as part of a broader task-tokenized multi-task time-series model, so it belongs here only for its representation-learning objective, not as a classification-only source.
  • Mantis, MantisV2, and UTICA define a classification-oriented lineage: contrastive pretraining, synthetic-data pretraining with test-time layer/token choices, and non-contrastive self-distillation on the Mantis-style backbone.
  • UniShape focuses on class-discriminative shape features, while TiViT shows that frozen vision encoders can provide useful intermediate representations after rendering time series as images.
  • Pretrained Transformers as Universal Computation Engines is not an SSL method, but it is important transfer evidence: a language-pretrained Transformer can provide useful frozen sequence computation for non-language tasks.
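
The NuTime point about window normalization can be made concrete with a small sketch. This is an illustration only, not NuTime's actual architecture: the function names and the choice to carry (mean, std) as side features are assumptions made for the example. It shows that two windows with the same shape but very different magnitudes become indistinguishable after z-normalization unless the scale statistics are kept on a separate path.

```python
from statistics import fmean, pstdev

def z_normalize(window):
    """Rescale a window to zero mean and unit (population) std."""
    mu = fmean(window)
    sigma = pstdev(window)
    if sigma == 0:
        # A constant window has no shape; return zeros by convention.
        return [0.0 for _ in window]
    return [(x - mu) / sigma for x in window]

def normalize_with_scale(window):
    """Hypothetical helper: return the normalized shape plus (mean, std).

    Keeping the statistics alongside the shape is one simple way to give
    numerical scale its own representation path.
    """
    return z_normalize(window), (fmean(window), pstdev(window))

# Same shape, magnitudes two orders apart.
small = [1.0, 2.0, 3.0, 2.0]
large = [100.0, 200.0, 300.0, 200.0]

shape_small, scale_small = normalize_with_scale(small)
shape_large, scale_large = normalize_with_scale(large)

# The normalized shapes match up to floating-point rounding,
# while the (mean, std) pairs still tell the two windows apart.
print(shape_small, scale_small)
print(shape_large, scale_large)
```

An encoder that sees only the normalized shape cannot separate these two windows; one that also receives the scale pair can.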

Evidence

The corpus spans a spectrum from survey-level recipe knowledge to large-scale SSL systems and simpler predictive objectives. The Cookbook is the entry map for method families and practical knobs. DINOv3 shows the value of scale and careful training. Guillotine Regularization and Perception Encoder show that layer choice and objective alignment can dominate downstream transfer. Hidden Uniform Cluster Prior and JEPA Slow Features sharpen the objective-level pitfalls around distribution priors and slow-feature shortcuts. LeJEPA, LeJEPA Identifiability, and NEPA ask whether the objective itself can be simpler, more principled, and more state-faithful. The time-series and audio sources ask whether representation learning should preserve shape, frequency, numerical scale, channel semantics, temporal resolution, and multivariate structure before any downstream label is available.
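
The distribution-prior pitfall can be sketched numerically. The example below is a generic illustration, not the formulation used in the Hidden Uniform Cluster Prior paper: it assumes an anti-collapse term that maximizes the entropy of the average cluster assignment. Because H(p) = log K - KL(p || uniform), maximizing that entropy is the same as pulling the average assignment toward uniform. The cluster proportions are made-up values.

```python
import math

def kl_to_uniform(cluster_probs):
    """KL(p || uniform) for a discrete distribution over K clusters."""
    k = len(cluster_probs)
    # p * log(p / (1/k)) = p * log(p * k); terms with p == 0 contribute 0.
    return sum(p * math.log(p * k) for p in cluster_probs if p > 0)

# Hypothetical average cluster assignments for two datasets.
balanced = [0.25, 0.25, 0.25, 0.25]
long_tailed = [0.85, 0.10, 0.04, 0.01]

# A balanced dataset pays no penalty under the uniform prior.
print(kl_to_uniform(balanced))
# A long-tailed dataset is penalized for matching its own true proportions.
print(kl_to_uniform(long_tailed))
```

Under these assumptions, a model that assigns clusters in the true long-tailed proportions incurs a strictly positive penalty, so the regularizer pushes it to split the head class or merge tail classes.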

RAEv2 adds a practical warning for SSL consumers: even if a frozen encoder is strong, the downstream system still chooses which layers become the latent target. That choice can change reconstruction, generation, and action-conditioned rollout quality.
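
The layer-choice knob can be sketched as a small function. This is a toy illustration with hypothetical three-dimensional features. RAEv2's actual layer indices, normalization, and weighting are not specified on this page and are not reproduced here.

```python
def sum_layers(hidden_states, layer_indices):
    """Element-wise sum of the feature vectors from the selected layers."""
    selected = [hidden_states[i] for i in layer_indices]
    return [sum(values) for values in zip(*selected)]

# Hypothetical per-layer features for one token of a frozen encoder.
hidden_states = [
    [1.0, 0.0, 2.0],  # layer 0 (early, more local detail)
    [0.5, 1.5, 0.0],  # layer 1
    [0.0, 1.0, 1.0],  # layer 2 (final)
]

final_only = sum_layers(hidden_states, [-1])     # conventional default
multi_layer = sum_layers(hidden_states, [0, 2])  # early + final layers

print(final_only)
print(multi_layer)
```

The encoder weights are identical in both cases; only the downstream definition of the latent target changes.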

Self-Teaching Autoencoder adds the complementary decoder-grounding question. It is not strong evidence yet, but it makes a useful hypothesis concrete: a representation objective can train a decoder if agreement between transformed latents keeps decoded outputs tied to the input distribution, instead of rewarding pixel averages as a direct image-space loss would.

The detailed time-series classification synthesis lives in Time-Series Classification Foundation Models, so this page can stay focused on the broader SSL pattern.

Relation To Foundation TSFM Agenda

This page covers the self-supervised objective layer behind the Foundation Time-Series Model Research Agenda. It is relevant to augmentation-free learning, representation quality, anti-collapse, and rare-regime preservation. The agenda-relevant test is whether an SSL objective preserves temporal state, dense numeric detail, channel semantics, and intervention-sensitive variables, not only whether a frozen embedding transfers to an average downstream score.

Open Questions

  • Which predictive objective best preserves dense spatial structure?
  • How much of DINOv3’s performance comes from scale versus objective design?
  • Which self-supervised methods should expose intermediate layers or aligned outputs as first-class artifacts?
  • Which Cookbook-era rules still transfer cleanly to JEPA-style, multimodal, and time-series SSL?
  • Can transformed latent-consistency objectives train decoders without sacrificing downstream representation quality?
  • Which time-series SSL objectives produce reusable embeddings rather than narrow benchmark features?
  • Which anti-collapse priors are appropriate for naturally imbalanced temporal regimes, rare events, and interventions?
  • When does a self-supervised objective recover linearly usable latent state rather than only high-scoring downstream features?
  • Which temporal SSL tests expose slow-feature shortcuts before downstream fine-tuning hides them?
  • Can non-contrastive self-distillation reduce false-negative pressure in time-series datasets where different samples share similar temporal structure?
  • When are textual channel descriptions enough semantic context to improve SSL transfer rather than becoming another dataset-specific annotation bottleneck?
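
For the open question about tests that expose slow-feature shortcuts, one possible starting point is a per-dimension slowness check on frozen embeddings. The sketch below uses lag-1 autocorrelation as a proxy for slowness. That choice is an assumption for illustration, not a diagnostic taken from the JEPA slow-features source, and the two sequences are synthetic.

```python
from statistics import fmean

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of one embedding dimension over time.

    Values near +1 indicate a slowly varying dimension; values near -1
    indicate rapid alternation. A constant series returns 0.0 by convention.
    """
    mu = fmean(series)
    centered = [x - mu for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    num = sum(a * b for a, b in zip(centered, centered[1:]))
    return num / denom

# Synthetic stand-ins for two embedding dimensions over 20 timesteps.
slow_drift = [t / 20 for t in range(20)]       # slow linear ramp
fast_state = [(-1.0) ** t for t in range(20)]  # flips every step

print(lag1_autocorrelation(slow_drift))
print(lag1_autocorrelation(fast_state))
```

If most of an embedding's variance sits in dimensions with autocorrelation near +1 while the action-relevant signal changes quickly, that is a hint worth checking before fine-tuning, not proof of a shortcut.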