Self-Supervised Representation Learning

Summary

The wiki’s SSL thread compares scaled visual representation learning with predictive embedding objectives that avoid raw reconstruction. The time-series branch adds a separate question: which objectives preserve temporal shape, numerical scale, local structure, and multivariate context before a downstream label is available? The newest layer-selection lesson is that the final embedding is often an objective-specific interface, not automatically the best reusable representation. The newest objective-design lesson is that anti-collapse and predictability constraints can smuggle in distribution priors or slow-feature shortcuts.

What The Wiki Currently Believes

A Cookbook of Self-Supervised Learning is the preferred beginner orientation source for the pre-2023 SSL landscape: method families, implementation knobs, and evaluation protocols rather than a new algorithm.
S4L is older vision evidence that self-supervised auxiliary losses can become semi-supervised losses under label scarcity. Its durable lesson for this wiki is less about rotation/exemplar pretext tasks and more about strong baselines, validation protocol, and multi-loss unlabeled-data training.
Superhuman Adaptable Intelligence is a north-star terminology source, not an SSL method: it argues that fast specialization needs generic knowledge from unlabeled data and points toward SSL, latent prediction, world models, and architecture diversity.
DINOv3 is the scaled vision-foundation-model reference point, with strong dense features and broad frozen transfer.
Guillotine Regularization shows that projectors can be useful during SSL training while their final outputs are poor downstream defaults; the best layer can depend on task, distribution, and optimization.
The Hidden Uniform Cluster Prior in Self-Supervised Learning shows that SSL collapse-prevention mechanisms can impose a uniform cluster prior that conflicts with long-tailed data.
Joint Embedding Predictive Architectures Focus on Slow Features shows that JEPA-style latent prediction can learn fixed slow distractors instead of action-relevant state.
Learning is Forgetting adds a language-pretraining compression lens: useful representations can forget input detail, but only relative to the training objective.
Perception Encoder shows the same hidden-feature effect in a large contrastive vision-language encoder, then uses alignment tuning to lift intermediate features to the output.
Revisiting the Platonic Representation Hypothesis: An Aristotelian View adds a measurement-hygiene warning for SSL and multimodal representation comparisons: raw global similarity and max-over-layer trends can be artifacts of representation width and layer-search depth, while calibrated local-neighborhood agreement is the more robust surviving signal.
LeJEPA argues for a theory-grounded JEPA objective with SIGReg.
VISReg is the current regularization-based visual SSL pressure test: it splits SIGReg-family Gaussian matching into scale, Sliced-Wasserstein shape, and center terms. Treat its reported OOD, data-efficiency, collapse-recovery, and loss-correlation claims as author-reported preprint evidence with setting-specific controls.
When Does LeJEPA Learn a World Model? adds an identifiability theorem for LeJEPA-style objectives, connecting Gaussian/whitened anti-collapse regularization to linearly recoverable latent state under explicit assumptions.
LeNEPA is the local time-series SSL result that combines next-latent prediction with temporal SIGReg and no handcrafted augmentations in the tested recipe.
Learn From Your Own Latents And Not From Tokens adds a sample-efficiency theorem for own-latent self-supervision on a synthetic hidden hierarchy, while keeping the real-language and real-image transfer question open.
Self-Teaching Autoencoder adds a decoder-grounded latent-consistency experiment: reconstruction is trained through transformed encoder agreement rather than direct image-space loss.
NEPA shows next-embedding prediction can make strong vision learners without pixels, tokens, contrastive loss, or task-specific heads.
RAEv2 turns the best-layer question into a generative-model design knob: multi-layer sums preserve more local detail for reconstruction and rollouts, while REPA supplies complementary spatial regularization.
VL-JEPA applies predictive embedding learning to vision-language tasks.
LeVLJEPA is the current cross-modal non-contrastive SSL case: image/text prediction needs predictor plus stop-gradient asymmetry and per-modality SIGReg; marginal SIGReg alone is not sufficient in its ablations.
CHARM brings JEPA-style latent prediction into multivariate time-series representation learning, using channel descriptions to condition temporal featurization and inter-channel attention.
SensorFM brings missingness-aware masked reconstruction into population-scale wearable sensor pretraining, extending the LSM-2 AIM idea from incomplete day-long windows to a 5M-participant corpus and downstream health-task transfer.
Conditional Autoencoders for Electrical Consumption is an older CVAE-based energy time-series representation source: it uses expert-conditioned residual latents to expose rare calendar and weather regimes, but remains passive and small-scale.
T-Loss is an older scalable unsupervised baseline for multivariate time-series representation learning.
TS2Vec is the hierarchical contrastive branch: it learns timestamp-level contextual representations through augmented context views, temporal contrast, instance contrast, and multi-scale pooling.
T-Rep extends the timestep-level representation thread by learning embeddings of time itself, then using those embeddings in pretext tasks to capture trend, periodicity, distribution shifts, and missing-data structure.
SimMTM is the masked-modeling branch that reconstructs from multiple corrupted neighbors instead of a single masked series.
ATST is the audio SSL branch: a teacher-student Transformer setup that separates clip-level and frame-level representation learning, with masking and data augmentation for fine-grained frame-level sound events.
NuTime shows that numerical scale needs an explicit representation path rather than being erased by window normalization.
UniTS uses masked reconstruction as part of a broader task-tokenized multi-task time-series model, so it belongs here only for its representation-learning objective, not as a classification-only source.
Mantis, MantisV2, and UTICA define a classification-oriented lineage: contrastive pretraining, synthetic-data pretraining with test-time layer/token choices, and non-contrastive self-distillation on the Mantis-style backbone.
Aionoscope is a representation-debugging benchmark rather than an SSL method, but it belongs on this page because it makes dense latent-state accessibility a first-class diagnostic for frozen SSL-style encoders.
UniShape focuses on class-discriminative shape features, while TiViT shows that frozen vision encoders can provide useful intermediate representations after rendering time series as images.
Pretrained Transformers as Universal Computation Engines is not an SSL method, but it is important transfer evidence: a language-pretrained Transformer can provide useful frozen sequence computation for non-language tasks.
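
The uniform-prior entry above can be illustrated with a toy calculation. The sketch below assumes a clustering-style objective that rewards high entropy of the batch-averaged soft assignment, which is one common anti-collapse term. It is not the exact loss from any source listed here, and the names `mean_assignment_entropy`, `uniform_prior_gap`, and `one_hot_assignments` are hypothetical.

```python
import numpy as np

def mean_assignment_entropy(probs):
    """Entropy (in nats) of the batch-averaged soft cluster assignment."""
    mean_p = probs.mean(axis=0)
    mean_p = mean_p / mean_p.sum()
    nonzero = mean_p[mean_p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def uniform_prior_gap(probs):
    """KL(mean assignment || uniform) = log K - H(mean assignment).

    A regularizer that maximizes the mean-assignment entropy is
    minimizing this gap, i.e. pulling cluster usage toward uniform.
    """
    num_clusters = probs.shape[1]
    return float(np.log(num_clusters) - mean_assignment_entropy(probs))

def one_hot_assignments(class_counts):
    """Perfect (one-hot) assignments for a dataset with the given class counts."""
    labels = np.repeat(np.arange(len(class_counts)), class_counts)
    return np.eye(len(class_counts))[labels]

# Hypothetical toy data: both assignments are perfectly correct,
# only the class frequencies differ.
balanced = one_hot_assignments([25, 25, 25, 25])
long_tailed = one_hot_assignments([85, 10, 4, 1])

print("gap on balanced data:   ", uniform_prior_gap(balanced))
print("gap on long-tailed data:", uniform_prior_gap(long_tailed))
```

Under these assumptions, a perfectly correct assignment on long-tailed data still pays a penalty relative to a balanced one, which is the sense in which such a regularizer carries a distribution prior.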

Evidence

The corpus suggests a spectrum from survey-level recipe knowledge to large-scale SSL systems and simpler predictive objectives. The Cookbook is the entry map for method families and practical knobs. DINOv3 shows the value of scale and careful training. Guillotine Regularization and Perception Encoder show that layer choice and objective alignment can dominate downstream transfer. Aristotelian Representation Hypothesis shows that representation-convergence claims need calibrated nulls before raw similarity is trusted. Hidden Uniform Cluster Prior and JEPA Slow Features sharpen the objective-level gotchas around distribution priors and slow-feature shortcuts. LeJEPA, VISReg, LeJEPA Identifiability, Own Latents, NEPA, LeVLJEPA, and LeNEPA ask whether the objective itself can be simpler, more principled, more sample-efficient, cross-modal without negatives, and more state-faithful. SensorFM asks whether missingness-aware masked reconstruction keeps scaling on real wearable streams. Aionoscope then asks whether such representations expose dense latent state rather than only coarse labels. Finally, the time-series and audio sources ask whether representation learning should preserve shape, frequency, numerical scale, channel semantics, temporal resolution, and multivariate structure before any downstream label is available.
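
The point about calibrated nulls can be made concrete with a small sketch. It assumes a mutual k-nearest-neighbor overlap as the local-neighborhood score and a row-permutation null as the calibration. This is a generic illustration and not necessarily the procedure used in the cited source; the function names and the synthetic embeddings are hypothetical.

```python
import numpy as np

def knn_indices(x, k):
    """Indices of the k nearest neighbours (Euclidean), excluding self."""
    sq = (x ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def mutual_knn_overlap(a, b, k=10):
    """Mean fraction of shared k-nearest neighbours for the same samples."""
    na, nb = knn_indices(a, k), knn_indices(b, k)
    shared = [len(set(ra) & set(rb)) for ra, rb in zip(na, nb)]
    return float(np.mean(shared)) / k

def calibrated_overlap(a, b, k=10, n_perm=50, seed=0):
    """Observed overlap, permutation-null mean, and z-score against the null."""
    rng = np.random.default_rng(seed)
    observed = mutual_knn_overlap(a, b, k)
    # Shuffling rows of b breaks the sample correspondence but keeps
    # b's width and geometry, so the null reflects chance-level overlap.
    null = np.array([
        mutual_knn_overlap(a, b[rng.permutation(len(b))], k)
        for _ in range(n_perm)
    ])
    z_score = (observed - null.mean()) / (null.std() + 1e-12)
    return observed, float(null.mean()), float(z_score)

# Hypothetical embeddings: b is a distance-preserving re-embedding of z
# into a wider space; c is unrelated to z.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 8))
q, _ = np.linalg.qr(rng.normal(size=(32, 8)))  # 32x8, orthonormal columns
b = z @ q.T
c = rng.normal(size=(200, 8))

obs_same, null_same, z_same = calibrated_overlap(z, b)
obs_indep, null_indep, z_indep = calibrated_overlap(z, c)
print(obs_same, null_same, z_same)
print(obs_indep, null_indep, z_indep)
```

Reporting the null mean and z-score alongside the raw overlap is the design choice that matters here: the raw number alone does not say how much agreement chance would produce at a given width and sample size.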

Superhuman Adaptable Intelligence adds only framing evidence: if adaptation speed is the target, SSL is attractive because unlabeled structure can supply reusable knowledge before task-specific labels exist. Method claims still need to come from concrete SSL sources.

RAEv2 adds a practical warning for SSL consumers: even if a frozen encoder is strong, the downstream system still chooses which layers become the latent target. That choice can change reconstruction, generation, and action-conditioned rollout quality.
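
A layer-wise probe is one simple way to make the layer choice explicit instead of defaulting to the final output. The sketch below assumes frozen features have already been extracted per layer and uses a closed-form ridge probe on a scalar target. The layer names, the synthetic features, and the helper names are all hypothetical, and the setup is deliberately constructed so that the intermediate layer keeps a dense state that the final layer coarsens.

```python
import numpy as np

def ridge_probe_r2(train_x, train_y, test_x, test_y, l2=1e-3):
    """Fit a closed-form ridge probe on frozen features; return held-out R^2."""
    mu = train_x.mean(axis=0)
    xc = train_x - mu
    y_mu = train_y.mean()
    gram = xc.T @ xc + l2 * np.eye(xc.shape[1])
    w = np.linalg.solve(gram, xc.T @ (train_y - y_mu))
    pred = (test_x - mu) @ w + y_mu
    ss_res = ((test_y - pred) ** 2).sum()
    ss_tot = ((test_y - test_y.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

def probe_all_layers(layer_feats, target, n_train):
    """Probe every layer with the same train/test split."""
    return {
        name: ridge_probe_r2(f[:n_train], target[:n_train],
                             f[n_train:], target[n_train:])
        for name, f in layer_feats.items()
    }

# Synthetic stand-in for frozen encoder activations (an assumption):
# "hidden" linearly mixes the dense state with nuisance factors,
# "final" keeps only the sign of the state plus unrelated noise.
rng = np.random.default_rng(0)
n = 400
state = rng.normal(size=n)
nuisance = rng.normal(size=(n, 7))
hidden = np.column_stack([state, nuisance]) @ rng.normal(size=(8, 16))
final = np.column_stack([np.sign(state), rng.normal(size=(n, 3))])

scores = probe_all_layers({"hidden": hidden, "final": final}, state, n_train=300)
print(scores)
```

In this constructed case the intermediate layer recovers the dense state almost exactly while the final layer does not. Real encoders need the same comparison run per task, since the best layer can differ across targets.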

Self-Teaching Autoencoder adds the complementary decoder-grounding question. It is not strong evidence yet, but it makes a useful hypothesis concrete: a representation objective can train a decoder if transformed latent agreement makes decoded outputs stay tied to the input distribution rather than rewarding pixel averages directly.

The detailed time-series classification synthesis lives in Time-Series Classification Foundation Models so this page can stay focused on the broader SSL pattern.

Relation To Foundation TSFM Agenda

This page covers the self-supervised objective layer behind the Foundation Time-Series Model Research Agenda. It is relevant to augmentation-free learning, representation quality, anti-collapse, and rare-regime preservation. The agenda-relevant test is whether an SSL objective preserves temporal state, dense numeric detail, channel semantics, and intervention-sensitive variables, not only whether a frozen embedding transfers to an average downstream score.
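
The dense-numeric-detail criterion is easy to violate before the encoder even runs. As a minimal sketch, assuming per-window z-normalization as the preprocessing step, the example below shows two windows with very different offset and scale becoming indistinguishable after normalization unless the statistics are carried separately. The helper `normalize_window` and the toy signals are hypothetical, and this is not the architecture of any source on this page.

```python
import numpy as np

def normalize_window(x, eps=1e-8):
    """Z-normalize one window and return the statistics that were removed."""
    mean, std = x.mean(), x.std()
    return (x - mean) / (std + eps), float(mean), float(std)

t = np.linspace(0.0, 2.0 * np.pi, 64)
small = 1.0 * np.sin(t) + 5.0        # low offset, low amplitude
large = 100.0 * np.sin(t) + 500.0    # same shape, 100x the scale

shape_small, mean_small, std_small = normalize_window(small)
shape_large, mean_large, std_large = normalize_window(large)

# After normalization the encoder input is the same for both windows.
print("shapes identical:", np.allclose(shape_small, shape_large, atol=1e-6))
# The scale information survives only if (mean, std) is kept as a side input.
print("removed stats:", (mean_small, std_small), (mean_large, std_large))
```

An objective trained only on the normalized shape cannot recover the difference between these windows, so any test for numeric-scale preservation has to check whether the removed statistics reach the representation by some other path.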

Open Questions

Which predictive objective best preserves dense spatial structure?
How much of DINOv3’s performance comes from scale versus objective design?
Which self-supervised methods should expose intermediate layers or aligned outputs as first-class artifacts?
Which Cookbook-era rules still transfer cleanly to JEPA-style, multimodal, and time-series SSL?
Can transformed latent-consistency objectives train decoders without sacrificing downstream representation quality?
Which time-series SSL objectives produce reusable embeddings rather than narrow benchmark features?
When does adding a self-supervised auxiliary loss improve label-scarce time-series classification after strong supervised-only tuning, and when is the apparent gain mostly validation or model-selection protocol?
Which anti-collapse priors are appropriate for naturally imbalanced temporal regimes, rare events, and interventions?
Which calibrated similarity or neighborhood probes should accompany SSL scaling reports so width, layer-search, and local-versus-global metric artifacts do not masquerade as representation progress?
When does a self-supervised objective recover linearly usable latent state rather than only high-scoring downstream features?
When does own-latent self-supervision reduce sample complexity on real corpora, and when does it only create a self-consistent but task-misaligned latent code?
Which temporal SSL tests expose slow-feature shortcuts before downstream fine-tuning hides them?
Can non-contrastive self-distillation reduce false-negative pressure in time-series datasets where different samples share similar temporal structure?
Does LeVLJEPA-style cross-modal prediction with SIGReg transfer to time-series/text pairs without hiding dense numeric state behind caption-level alignment?
When are textual channel descriptions enough semantic context to improve SSL transfer rather than becoming another dataset-specific annotation bottleneck?
Can no-augmentation next-latent SSL with temporal SIGReg preserve dense state variables beyond the PTB-XL/Diag/UCR settings reported for LeNEPA?
Which diagnostic benchmarks should accompany SSL papers so coarse class transfer does not hide dense-state erasure?
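
For the slow-feature question above, one candidate starting diagnostic is a per-dimension slowness score on frozen latents. The sketch below assumes latents are available as a time-ordered array and uses the ratio of mean squared one-step change to twice the variance, which for a stationary series equals one minus the lag-1 autocorrelation. The function name `slowness` and the synthetic latents are hypothetical, and this does not settle which tests are sufficient.

```python
import numpy as np

def slowness(latents):
    """Per-dimension slowness score for a (time, dim) latent array.

    Near 0: the dimension barely changes between steps (slow feature).
    Near 1: the dimension is close to temporally uncorrelated.
    """
    diffs = np.diff(latents, axis=0)
    var = latents.var(axis=0) + 1e-12
    return (diffs ** 2).mean(axis=0) / (2.0 * var)

# Synthetic latents (an assumption): dimension 0 is a slow drifting
# distractor, dimension 1 is a fast-changing state (AR(1), coefficient 0.2).
rng = np.random.default_rng(0)
steps = 1000
t = np.arange(steps)
slow_distractor = np.sin(2.0 * np.pi * t / 500.0)
fast_state = np.zeros(steps)
for i in range(1, steps):
    fast_state[i] = 0.2 * fast_state[i - 1] + rng.normal()

latents = np.column_stack([slow_distractor, fast_state])
slowness_scores = slowness(latents)
print(slowness_scores)
```

The score alone cannot say whether a slow dimension is task-relevant, so it is most useful paired with a probe: a representation whose variance sits mostly in low-slowness-score dimensions that do not predict the state of interest is a candidate for the shortcut.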
