Representation Collapse

Summary

Representation collapse is the failure mode where predictive representation learning maps inputs to uninformative or nearly identical embeddings. The wiki also tracks adjacent anti-collapse failures: a representation can avoid constant collapse while still encoding the wrong factors because of slow-feature shortcuts or a mismatched distribution prior.

For time-series JEPA and NEPA-style predictive representation learning, the collapse question also includes target construction. A target embedding can be non-constant but still erase the local patch, channel, or rare-event distinctions needed for useful state prediction.

What The Wiki Currently Believes

A Cookbook of Self-Supervised Learning is the beginner map for collapse terminology in visual SSL, including constant-output collapse, dimensional collapse, projector effects, and rank/eigenspectrum diagnostics.
The Hidden Uniform Cluster Prior in Self-Supervised Learning shows that some anti-collapse mechanisms impose a uniform cluster prior, which can suppress long-tailed semantic features.
Joint Embedding Predictive Architectures Focus on Slow Features shows a non-constant failure mode where a JEPA representation can encode fixed distractor noise while ignoring action-relevant state.
LeJEPA argues that a good JEPA objective should force embeddings toward an isotropic Gaussian target distribution.
VISReg refines that branch by arguing that SIGReg’s Epps-Pulley sketching gradient can vanish precisely at collapse, while a separate variance/scale term plus Sliced-Wasserstein shape loss keeps a stronger recovery signal. The gradient result is a controlled feature-scale simulation; the author’s later batch-size/$λ$ troubleshooting advice is useful but is not a validated exhaustive collapse-cause taxonomy.
LeVLJEPA shows a cross-modal collapse case: direct symmetric image-text MSE collapses, SIGReg alone is insufficient, and the stable non-contrastive recipe needs predictor/stop-gradient asymmetry plus SIGReg.
When Does LeJEPA Learn a World Model? turns that target-distribution story into an identifiability claim under Gaussian/OU assumptions, while also warning that non-Gaussian or policy-shaped trajectories may produce distorted but non-collapsed representations.
LeNEPA is the local time-series test of the SIGReg path: temporal SIGReg stabilizes no-stop-gradient next-latent prediction in the published fixed-recipe experiments, but the paper still treats dense-state preservation and broader transfer as open.
VJEPA gives a conditional anti-collapse argument for probabilistic JEPA: target diversity and a sufficiently expressive predictor rule out constant context collapse at a global optimum, but the result does not guarantee finite-sample optimization stability or target-branch preservation.
Learning is Forgetting adds the positive counterpart: reducing input information can be healthy when it preserves target-relevant structure.
LeWorldModel uses Gaussian regularization to stabilize end-to-end pixel world-model training without EMA, pretrained encoders, or auxiliary supervision.
Sensorimotor World Models uses inverse dynamics regularization as the sole anti-collapse mechanism for an end-to-end JEPA world model, making partial collapse of action-irrelevant variation a deliberate bias rather than only a failure.
NEPA uses next-embedding prediction with causal masking and stop-gradient, showing a simpler visual predictive objective can work without pixel reconstruction or discrete tokens.
Learn From Your Own Latents And Not From Tokens adds a local-learning boundary: SLC preserves the RHM sample-complexity scaling with module stop-gradients, while some EMA-free gradient paths collapse when clustering loss can overpower prediction.
The Illusion of Superposition adds a latent-reasoning collapse variant: soft token mixtures and fine-tuned latent thoughts can become effectively discrete or shortcut to the final answer while still using non-constant hidden states.
Latent Thought Flow is the positive counterpart: entropy-weighted subtrajectory balance and reference-prior regularization try to keep latent reasoning in an effective entropy regime rather than collapsing to deterministic paths or drifting into unstructured noise.
Self-Teaching Autoencoder names a decoder-specific collapse-adjacent shortcut: encoder and decoder can invent a private language unless transformed views constrain the encoder’s equivalence classes.
EIDOS uses stop-gradient on the target branch plus observation-space grounding so latent predictions remain tied to the numeric forecasting objective.
Variable-Width Transformers adds a structural compression-valley case for language models: a static bowtie hidden-width bottleneck can improve residual-stream matrix entropy and MLP activation utilization in middle layers versus a constant-width Transformer.
Next-Embedding Prediction records the NEPA-style target-layer warning: patch-dependent or internal-layer targets degraded next-embedding prediction even when patch-independent embeddings were stable. This is unpublished evidence and not a pure-JEPA result, so it should guide ablations rather than serve as a settled claim.
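
The rank/eigenspectrum diagnostics named in the Cookbook entry above can be illustrated with a minimal sketch. It assumes embeddings are available as a NumPy array of shape (n_samples, dim). The function names `embedding_spectrum` and `effective_rank` are hypothetical helpers, not an API from any of the cited sources, and the entropy-based effective rank is only one of several possible summaries of the spectrum.

```python
import numpy as np


def embedding_spectrum(z: np.ndarray) -> np.ndarray:
    """Covariance eigenvalues of embeddings z (n_samples, dim), in descending order."""
    centered = z - z.mean(axis=0, keepdims=True)
    # Singular values s of the centered matrix give covariance eigenvalues s**2 / (n - 1).
    s = np.linalg.svd(centered, compute_uv=False)
    return (s ** 2) / max(z.shape[0] - 1, 1)


def effective_rank(z: np.ndarray, eps: float = 1e-12) -> float:
    """exp(Shannon entropy) of the normalized covariance spectrum.

    Returns 0.0 when the embeddings carry no variance (constant-output collapse).
    """
    eig = embedding_spectrum(z)
    total = eig.sum()
    if total <= eps:
        return 0.0
    p = eig / total
    p = p[p > eps]  # drop numerically empty directions
    return float(np.exp(-(p * np.log(p)).sum()))


rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 16))
# Dimensional collapse: all variance lives in a 2-D subspace of the 16-D space.
low_rank = rng.normal(size=(512, 2)) @ rng.normal(size=(2, 16))
# Constant-output collapse: every input maps to the same vector.
constant = np.ones((512, 16))

for name, emb in [("healthy", healthy), ("low_rank", low_rank), ("constant", constant)]:
    print(name, round(effective_rank(emb), 2))
```

A high effective rank only rules out constant and dimensional collapse. It says nothing about whether the retained directions encode the intended state, which is the concern raised by the slow-feature and cluster-prior entries.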

Evidence

The sources agree that collapse prevention is central, but they disagree on mechanism and even on how to frame the failure mode:
Cookbook-era visual SSL emphasizes projector, predictor, EMA, covariance, and rank diagnostics.
Hidden Uniform Cluster Prior shows that anti-collapse regularizers can encode unwanted distribution assumptions.
JEPA Slow Features shows that non-collapsed embeddings can still ignore the intended state.
JEPA-style sources emphasize distribution matching and Gaussian regularization.
VISReg argues that SIGReg-family sketching needs a separate variance/scale recovery signal when collapse has already happened.
LeVLJEPA shows that cross-modal prediction adds another failure mode, because marginal SIGReg is not enough if the regression term is symmetric.
LeJEPA Identifiability adds that the Gaussian prior can be a positive identifiability condition under the right world process and a mismatch risk when real trajectories violate it.
Sensorimotor World Models adds a different axis: use the logged action as the grounding signal, which prevents full collapse but can still erase variables outside the action repertoire.
Other temporal models use stop-gradient predictive training or explicit observation grounding.
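
The split between a variance/scale term and a Sliced-Wasserstein shape term that this section attributes to VISReg can be made concrete with a rough sketch. This is not VISReg's actual loss, which the wiki does not reproduce. The hinge form of the scale term, the per-slice standardization, the quantile midpoints, and the names `variance_hinge` and `sliced_shape_loss` are all assumptions made for illustration. The only point is that a scale term and a scale-free shape term can be measured separately.

```python
import numpy as np
from statistics import NormalDist


def variance_hinge(z, target_std=1.0, eps=1e-4):
    """Scale term (assumed form): mean hinge on per-dimension std below target_std."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.maximum(0.0, target_std - std).mean())


def sliced_shape_loss(z, n_slices=64, seed=0, eps=1e-6):
    """Shape term (assumed form): squared distance between sorted, standardized
    random 1-D projections of z and standard-normal quantiles, averaged over slices."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    directions = rng.normal(size=(d, n_slices))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions
    # Standardize each slice so this term ignores overall scale.
    proj = (proj - proj.mean(axis=0)) / (proj.std(axis=0) + eps)
    proj = np.sort(proj, axis=0)
    normal = NormalDist()
    quantiles = np.array([normal.inv_cdf((i + 0.5) / n) for i in range(n)])
    return float(((proj - quantiles[:, None]) ** 2).mean())


rng = np.random.default_rng(1)
gaussian = rng.normal(size=(512, 8))
shrunk = 0.1 * gaussian            # right shape, wrong scale
collapsed = np.zeros((512, 8))     # constant-output collapse
signs = rng.choice([-1.0, 1.0], size=(512, 1))
clustered = 3.0 * signs + 0.1 * rng.normal(size=(512, 8))  # two tight clusters

for name, emb in [("gaussian", gaussian), ("shrunk", shrunk),
                  ("collapsed", collapsed), ("clustered", clustered)]:
    print(name, round(variance_hinge(emb), 3), round(sliced_shape_loss(emb), 3))
```

In this toy setup the shrunk embeddings are flagged only by the scale term and the clustered embeddings mainly by the shape term. Whether either term keeps a useful gradient near collapse in a real model is the empirical question VISReg addresses, and this sketch does not test it.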

The local curriculum notes add a time-series-specific hypothesis from NEPA-style experiments: when target embeddings are built by a context-mixing encoder, the model may learn a shortcut target that is easier to predict but less faithful to patch-level state. LeNEPA answers part of the stabilization question with temporal SIGReg, but it does not remove the preservation question: a non-collapsed next-latent representation can still need probes for dense state, event timing, rare regimes, and action history.
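
One way to make the preservation question testable is a held-out linear probe from embeddings to a known state variable. The sketch below is a toy illustration under stated assumptions: the state is a scalar, the probe is ridge regression, the split is chronological, and `linear_probe_r2` is a hypothetical helper rather than anything defined in the LeNEPA or NEPA sources. The synthetic "shortcut" embedding is built to be high-variance while containing no state information.

```python
import numpy as np


def linear_probe_r2(z, target, ridge=1e-3, train_frac=0.8):
    """Held-out R^2 of a ridge-regression probe from embeddings z to a scalar target.

    Rows are assumed to be in time order: the first train_frac of rows are used
    for fitting and the remainder for evaluation, to avoid leaking future samples.
    """
    n = z.shape[0]
    n_train = int(train_frac * n)
    x = np.hstack([z, np.ones((n, 1))])  # append a bias column
    x_tr, x_te = x[:n_train], x[n_train:]
    y_tr, y_te = target[:n_train], target[n_train:]
    # Closed-form ridge solution (the small penalty also applies to the bias).
    gram = x_tr.T @ x_tr + ridge * np.eye(x.shape[1])
    w = np.linalg.solve(gram, x_tr.T @ y_tr)
    residual = y_te - x_te @ w
    ss_res = float((residual ** 2).sum())
    ss_tot = float(((y_te - y_te.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n = 2000
state = rng.normal(size=n)              # the variable downstream tasks need
distractor = rng.normal(size=(n, 4))    # nuisance variation

# Embedding that keeps the state (plus some nuisance dimensions).
z_state = np.hstack([state[:, None] + 0.05 * rng.normal(size=(n, 1)), distractor[:, :3]])
# Non-collapsed embedding that encodes only the distractor.
z_shortcut = distractor @ rng.normal(size=(4, 4))

print("state-preserving R^2:", round(linear_probe_r2(z_state, state), 3))
print("shortcut R^2:", round(linear_probe_r2(z_shortcut, state), 3))
```

Both embeddings would pass a variance or rank check, but only the first supports the probe. For event timing, rare regimes, or action history, the target and probe family would need to change, and a linear probe can miss state that is present but nonlinearly encoded.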

Learning-is-Forgetting sharpens the boundary between useful compression and harmful collapse. Forgetting input detail is not automatically a bug; the risk is objective mismatch, where compression removes rare, numeric, or action-relevant state that the downstream system needs. Illusion of Superposition adds that a representation can look continuous while functionally committing to one discrete interpretation or direct answer shortcut. LTF adds a candidate training mechanism for this boundary: do not merely increase entropy, but regulate latent-trajectory entropy with a reward-proportional objective and still verify causal use through ablations.
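
The point that a soft mixture can look continuous while effectively committing to one option can be illustrated with a small diagnostic. The sketch assumes access to non-negative mixture weights over candidate tokens at each latent step, which is a simplifying assumption and not the measurement procedure of Illusion of Superposition or LTF. The helper name `effective_support` is hypothetical.

```python
import numpy as np


def effective_support(weights, eps=1e-12):
    """exp(entropy) of each row of mixture weights.

    A value near 1 means the mixture is effectively one discrete choice;
    a value near the number of candidates means the mass is spread evenly.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum(axis=-1, keepdims=True)   # normalize each row to sum to 1
    entropy = -(w * np.log(w + eps)).sum(axis=-1)
    return np.exp(entropy)


spread = [0.25, 0.25, 0.25, 0.25]     # genuinely mixed over four candidates
committed = [0.97, 0.01, 0.01, 0.01]  # non-constant weights, but one dominant choice

print(effective_support([spread, committed]))
```

A low effective support is only a symptom. As the paragraph above notes, confirming that the latent mixture is actually used still requires ablations.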

Self-Teaching Autoencoder adds a decoder-loop version of the same problem. Even if embeddings avoid constant collapse, an encoder-decoder pair can agree on latent codes that are self-consistent but not faithful reconstructions. The source’s proposed guardrail is to test agreement after transformations, so the acceptable equivalence class is narrowed by multiple views.

Variable-Width Transformers adds a different lesson: sometimes collapse-like underuse of the residual space can be improved by architectural capacity allocation rather than by an explicit anti-collapse loss. That should be treated as language-model evidence for structural regularization, not as proof that a bottleneck preserves rare or action-relevant time-series state.

Relation To Foundation TSFM Agenda

Representation collapse maps to the anti-collapse slot in the Foundation Time-Series Model Research Agenda. The local verdict is a warning: avoiding constant collapse is necessary, but the agenda needs probes that also catch slow-feature shortcuts, long-tail prior mismatch, lost dense numeric detail, and missing action-relevant state.

Open Questions

Which collapse-prevention mechanism is most robust at frontier data/model scale?
When is partial collapse of action-irrelevant state healthy compression, and when does it erase state needed by future tasks?
Can a single target embedding distribution work across visual, temporal, and language modalities?
How can evaluation distinguish healthy high-variance embeddings from representations dominated by nuisance slow features or mismatched cluster priors?
Which transformations best expose private-language shortcuts in decoder-grounded latent objectives?
How should time-series JEPA and NEPA-style systems ablate patch-independent targets, contextual targets, and internal-layer targets to catch patch-dependence collapse?
Does temporal SIGReg remain a sufficient LeNEPA stabilizer when the target includes multivariate channels, irregular event streams, exogenous variables, or actions?
Does VISReg-style scale/shape decoupling improve collapse recovery for temporal embeddings without flattening rare or non-Gaussian state variables?
Which diagnostics can distinguish true collapse recovery from a batch-statistics artifact, especially when SWD quantiles, batch composition, outer $λ$, and slice count change together?
In time-series/text prediction, is LeVLJEPA-style predictor/stop-gradient asymmetry required in addition to temporal SIGReg, or can a LeNEPA-style no-stop-gradient path remain stable?
Which collapse-prevention tests distinguish “non-collapsed but nonlinear/distorted” states from linearly identifiable states that a planner can safely use?
Which probes distinguish useful uncertainty over candidate futures from latent-state collapse into a single shortcut answer?

Alex Open Research Wiki
