Learn From Your Own Latents And Not From Tokens: A Sample-Complexity Theory

Source

Raw Markdown: paper_own-latents-not-tokens-2026.md
PDF: paper_own-latents-not-tokens-2026.pdf
Preprint: arXiv:2605.27734v1
Official X announcement: Daniel Korchinski, with local oEmbed snapshot x-oembed-dankorchinski-2060344980749549607.json
Official X follow-up: Daniel Korchinski open question, with local oEmbed snapshot x-oembed-dankorchinski-2060345715914612849.json
Official X author posts: Alessandro Favero, with local oEmbed snapshot x-oembed-alesfav-2060277596089159734.json, and Matthieu Wyart, with local oEmbed snapshot x-oembed-matthieuwyart-2061317203857739846.json
Yann LeCun X comment: Yann LeCun, with local oEmbed snapshot x-oembed-ylecun-2061466116988289512.json
Community X comments provided by Alex: Gerard Sans, with local oEmbed snapshot x-oembed-gerardsans-2060784036495155632.json, and Paul Thompson, with local oEmbed snapshot x-oembed-ptenigma-2063116837832011889.json

No official code or project page was found during ingest.

Status And Credibility

This is a 2026-05-26 arXiv v1 preprint by Daniel J. Korchinski, Alessandro Favero, and Matthieu Wyart. It is credible enough to track as an important JEPA theory source for three reasons: it builds on the Random Hierarchy Model literature; it gives explicit sample-complexity arguments and validates the predicted scaling with an iterative clustering algorithm, an end-to-end stacked latent-clustering network, and data2vec experiments; and it was publicly highlighted by Yann LeCun on 2026-06-01 as theoretical support for JEPA-style prediction in representation space.

The paper is not peer reviewed and is not an empirical frontier-model result. Its strongest theorem and experiments are on a deliberately simplified probabilistic context-free grammar with fixed tree topology and unambiguous, non-recursive production rules.
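To make the data model concrete, here is a minimal sketch of a Random Hierarchy Model sampler with fixed tree topology and unambiguous, non-recursive rules. It assumes the usual RHM parameterization from the prior literature: `v` symbols per level, `m` productions per symbol, branching factor `s`, and an independent rule set at each level. The name `s`, the function names, and the small sizes are illustrative assumptions, not details taken from this paper.

```python
import itertools
import random


def sample_rules(v, m, s, rng):
    """Assign m distinct s-tuples of child symbols to each of v parent symbols.

    Sampling without replacement from all v**s tuples keeps the grammar
    unambiguous: no tuple is produced by two different parents.
    """
    all_tuples = list(itertools.product(range(v), repeat=s))
    assert v * m <= len(all_tuples), "need v * m <= v**s distinct tuples"
    chosen = rng.sample(all_tuples, v * m)
    return {sym: chosen[sym * m:(sym + 1) * m] for sym in range(v)}


def sample_string(rules_per_level, root, rng):
    """Expand a root symbol through every level of a fixed s-ary tree."""
    symbols = [root]
    for rules in rules_per_level:  # one rule set per level, applied top-down
        symbols = [child for sym in symbols for child in rng.choice(rules[sym])]
    return symbols


rng = random.Random(0)
v, m, s, depth = 4, 2, 2, 3  # hypothetical small sizes, not from the paper
rules_per_level = [sample_rules(v, m, s, rng) for _ in range(depth)]
leaves = sample_string(rules_per_level, root=rng.randrange(v), rng=rng)
print(leaves)  # a string of s**depth visible tokens
```

The `m` tuples that share a parent are the "synonyms" that the clustering mechanism is meant to recover; because rules never map a symbol back to a higher level, the grammar is non-recursive by construction.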

Core Claim

On the Random Hierarchy Model, token-level self-supervised objectives must pay an exponential-in-depth sample-complexity cost to recover the latent hierarchy, while own-latent prediction can recover the non-root latent tree at the first-level clustering scale:

$$P_{\text{tok}} \asymp v\, m^{L + 1}, \qquad P_{\text{own latent}} \asymp v\, m^{3}$$

up to logarithmic factors and under the paper’s balanced, separated, stable-clustering assumptions. The mechanism is synonym clustering: once one latent level is recovered, both context and target are lifted to that level, so the next level has the same local statistical difficulty instead of re-paying a surface-token bottleneck.
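As a rough arithmetic illustration of the two scales, the sketch below evaluates the leading-order expressions and their ratio, which is $m^{L - 2}$ because $v$ cancels. It ignores the logarithmic factors and constants mentioned above, and the values of `v`, `m`, and the depth range are hypothetical rather than taken from the paper.

```python
def token_level_samples(v: int, m: int, depth: int) -> int:
    """Leading-order token-level scale, P_tok ~ v * m**(L + 1)."""
    return v * m ** (depth + 1)


def own_latent_samples(v: int, m: int) -> int:
    """Leading-order own-latent scale, P_own ~ v * m**3 (no depth dependence)."""
    return v * m ** 3


def gap_ratio(m: int, depth: int) -> int:
    """Ratio P_tok / P_own = m**(L - 2), valid for depth >= 2."""
    return m ** (depth - 2)


v, m = 16, 4  # hypothetical grammar sizes
for depth in range(3, 8):
    print(depth, token_level_samples(v, m, depth),
          own_latent_samples(v, m), gap_ratio(m, depth))
```

The own-latent column stays constant while the token-level column grows by a factor of `m` per extra level, which is the exponential-in-depth gap the claim refers to.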

Author Narrative Context

The author X posts frame the work as an explanation for why raw-token and pixel prediction can be data-hungry: predicting own abstract latents can yield an exponential sample-efficiency gain when the data has hidden hierarchy. Yann LeCun’s comment explicitly connects the result to the JEPA intuition that prediction in representation space should be theoretically justifiable.

The paper supports a narrower version of that narrative. It proves and tests the gap on the Random Hierarchy Model and data2vec trained on that synthetic grammar. Daniel Korchinski’s own follow-up keeps the main practical question open: whether own-latent generative models beat current scaling regimes on real language or images, especially in the small-data regime.

The two community comments Alex flagged are useful framing but not paper evidence. Gerard Sans names the token-level cost as a “Token Isomorphism Tax”; Paul Thompson connects the intuition to effective representation dimension rather than ambient dimension. The related Universal Weight Subspace Hypothesis is a separate weight-space source and should not be used as evidence that this paper’s RHM sample-complexity mechanism transfers to real data.

Key Contributions

Formalizes Iterative Latent Clustering (ILC), which recovers the non-root RHM latent hierarchy by clustering cousin-context vectors at each recovered latent level.
Proves a sample-complexity result, stated informally and then made formal: under balanced and separated RHM grammars and stable clustering, ILC recovers $h^{(1)}, \dots, h^{(L - 1)}$ with $P ≳ v m^{3}$ up to log factors and constants.
Introduces a Stacked Latent-Clustering (SLC) neural architecture with predictor and clusterer modules that reaches the same $v m^{3}$ scaling under gradient training.
Shows that module-boundary stop-gradients do not destroy the observed SLC scaling, making the mechanism compatible with local-learning interpretations.
Gives a theory sketch for data2vec on the RHM: teacher targets carry learned latents, and each EMA refresh lifts the prediction target to the next latent scale.
Measures synonym clustering in data2vec encoders and reports scaling collapse when sample count is rescaled by $v m^{3}$.

Method Notes

The paper’s core distinction is whether the target remains a visible token or is lifted into the representation already learned by the model:

flowchart LR
  X["visible tokens"] --> TSSL["token-level SSL target"]
  TSSL --> Cost1["higher levels still predict through leaves: P ~ v m^(L+1)"]
  X --> L1["recover level-1 latent by clustering"]
  L1 --> L2["predict own level-1 latents"]
  L2 --> L3["cluster next level"]
  L3 --> Cost2["same local threshold at each level: P ~ v m^3"]

The data2vec argument depends on two assumptions. First, teacher targets carry a linear component of the latents already learned by the encoder. Second, gradient descent extracts detectable correlations once they rise above sampling noise. Those assumptions are plausible in the RHM setup and checked with probes, but they are not general finite-sample guarantees for arbitrary language, image, or time-series corpora.

Where The Supervision Enters

The paper uses two different supervision patterns, and they should not be collapsed into one JEPA recipe.

In the explicit SLC model, supervision is layer-wise. The architecture has $L - 1$ predictor-clusterer modules. Module $ℓ$ receives the learned level-$ℓ$ tokens, predicts same-level cousin targets with a prediction loss, clusters similar prediction contexts with a clustering loss, and emits hard or soft cluster assignments $h^{(ℓ + 1)}$ that become the tokens for module $ℓ + 1$. This is not a single top-layer representation loss. The stop-gradient ablations are important because they show that cutting gradient flow between SLC modules still preserves the $v m^{3}$ scaling: each level can be trained by its own local prediction and clustering losses.

In the data2vec experiment, supervision looks closer to ordinary teacher-student latent prediction. The student sees a masked input, the teacher sees the unmasked input, and each masked position is trained with a squared loss to match a target $Y_{i}(x)$ formed from the average of the last $K$ teacher block activations; the paper’s data2vec setting uses $K = 4$. There is no manually assigned loss for each true RHM level. The hierarchy appears implicitly because the EMA teacher target carries whatever latents the student has already made linearly decodable. After the teacher refreshes, the next phase of student prediction is effectively lifted from visible tokens to learned level-$ℓ$ latent targets.
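The target construction described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the array shapes, the function names, the `decay` parameter, and the omission of any per-block target normalization are all assumptions made for the example.

```python
import numpy as np


def teacher_target(block_outputs: np.ndarray, k: int = 4) -> np.ndarray:
    """Average the last k teacher block activations.

    block_outputs has shape (num_blocks, seq_len, dim); result is (seq_len, dim).
    """
    return block_outputs[-k:].mean(axis=0)


def masked_squared_loss(student_pred, target, mask):
    """Mean squared error over masked positions only.

    mask is a boolean array of shape (seq_len,), True where the student
    input was masked.
    """
    diff = student_pred[mask] - target[mask]
    return float((diff ** 2).mean())


def ema_update(teacher_params, student_params, decay):
    """Move each teacher parameter toward the student: t <- decay*t + (1-decay)*s."""
    return {name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
            for name in teacher_params}


# Hypothetical shapes: 6 teacher blocks, 8 positions, 16-dim activations.
rng = np.random.default_rng(0)
blocks = rng.normal(size=(6, 8, 16))
target = teacher_target(blocks, k=4)
mask = np.zeros(8, dtype=bool)
mask[[2, 5]] = True
loss = masked_squared_loss(rng.normal(size=(8, 16)), target, mask)
```

Note that nothing here refers to a true RHM level: the only place hierarchy can enter is through what the teacher's block activations already encode, which is the point the paragraph above makes.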

This is the reason for the paper’s H-JEPA claim. On RHM, a single data2vec-like own-latent objective can already perform multi-scale latent discovery, so naive explicit H-JEPA stacking is not automatically useful. The claim is scoped: newer multi-scale systems such as V-JEPA 2.1 and Bootleg use multi-loss, multi-level, or high-to-low target paths that are not simply the naive H-JEPA stack, so they still need separate matched-compute tests.

Evidence And Results

ILC reconstruction accuracy on synthetic RHM data collapses under the predicted $P / (v m^{3})$ rescaling.
SLC probe accuracy on the highest recovered latent also collapses under $P / (v m^{3})$ and remains approximately depth-independent in the reported $L \in \{3, \dots, 7\}$ sweep.
SLC with stop-gradients at module boundaries still reaches the same scaling, suggesting the statistical mechanism does not require end-to-end credit assignment across all levels.
data2vec pretrained on RHM strings shows downstream root-classification curves that collapse under $P / (v m^{3})$ in the online setting, and the paper reports the same scaling in the offline setting.
Synonym-swap probes show data2vec hidden states becoming invariant to same-parent tuples before becoming invariant to unrelated tuples, matching the proposed recursive clustering mechanism.
The paper reports about 1,000 H100 hours for the data2vec experiments, so the neural validation is nontrivial even though the data model is synthetic.
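The synonym-swap idea in the results above can be illustrated with a toy probe. The sketch assumes a relative L2 change as the invariance measure and uses a hand-built stand-in encoder with hypothetical production rules; the paper's exact probe metric and encoders are not reproduced here.

```python
import numpy as np


def swap_tuple(tokens, index, s, new_tuple):
    """Return a copy of tokens with the index-th s-tuple replaced."""
    out = list(tokens)
    out[index * s:(index + 1) * s] = new_tuple
    return out


def relative_change(encode, tokens, swapped):
    """Relative L2 change of the representation under a swap."""
    h = np.asarray(encode(tokens), dtype=float)
    h_swapped = np.asarray(encode(swapped), dtype=float)
    return float(np.linalg.norm(h - h_swapped) / (np.linalg.norm(h) + 1e-12))


# Toy stand-in for a trained encoder: it maps each 2-tuple to its parent id,
# so it is exactly invariant to synonym swaps by construction.
PARENT_OF = {(0, 1): 0, (1, 0): 0, (0, 0): 1, (1, 1): 1}  # hypothetical rules


def toy_encode(tokens, s=2):
    pairs = [tuple(tokens[i:i + s]) for i in range(0, len(tokens), s)]
    return np.array([PARENT_OF[p] + 1.0 for p in pairs])


tokens = [0, 1, 1, 1]
synonym = swap_tuple(tokens, 0, 2, (1, 0))    # same parent as (0, 1)
unrelated = swap_tuple(tokens, 0, 2, (0, 0))  # different parent
synonym_change = relative_change(toy_encode, tokens, synonym)
unrelated_change = relative_change(toy_encode, tokens, unrelated)
print(synonym_change, unrelated_change)
```

For a real encoder one would pass its hidden-state function as `encode` and track both quantities over training; the reported pattern corresponds to the synonym change shrinking before the unrelated change does.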

Limitations

The RHM is a simplified synthetic grammar with fixed topology, unambiguous rules, no recursive productions, and no explicitly context-dependent rules.
The theorem depends on balanced and separated grammar assumptions and a stable clustering module.
The data2vec analysis is a theory sketch under assumptions about teacher-target latent content and correlation learning, not a full optimization theorem.
The result does not show that JEPA, data2vec, or own-latent generative models beat token-level learning on real language, images, videos, robotics, or time series.
The “explicit H-JEPA stacking is redundant” conclusion is scoped to the RHM mechanism; the paper itself notes that newer multi-scale self-supervised systems such as V-JEPA 2.1, Bootleg, and LeWorldModel are not just naive stacked JEPAs.
The paper is about passive latent hierarchy recovery, not action-conditioned world modeling or counterfactual prediction.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Augmentation-free or dataset-aware self-supervision | adjacent | Gives a synthetic-theory reason why own-latent prediction can be more sample-efficient than raw-token prediction without hand-crafted augmentations. | Needs numeric time-series, event-stream, and multivariate benchmarks with real irregularity, noise, and exogenous variables. |
| Latent-state prediction | adjacent | Shows when latent targets can recursively reveal hidden hierarchical state in a PCFG-like data model. | No streaming state-maintenance evidence and no operational latent-state time-series benchmark. |
| Data diversity, curriculum, and long tail | warning | The result depends on recoverable synonym structure and balanced/separated grammar conditions. | Need tests where rare regimes, long-tailed symbols, and nonstationary hierarchy violate those assumptions. |
| Scaling and efficiency | adjacent | Predicts an exponential sample-complexity gap between token-level and own-latent objectives on RHM. | Need matched-compute comparisons on real corpora and serving-time costs for latent-target systems. |
| Control and counterfactuals | insufficient evidence | No action, control input, or intervention channel is modeled. | Needs action-conditioned latent prediction and candidate-intervention rollout tests. |

Links Into The Wiki

Open Questions

Can own-latent prediction beat token-level prediction under matched compute on real language, images, videos, or multivariate time series?
Which real datasets have a recoverable hierarchy close enough to RHM synonym structure for the $v m^{3}$ mechanism to matter?
How should latent prediction targets stay grounded so they do not cluster away rare but decision-relevant state?
Can the mechanism be extended from passive hierarchical state recovery to action-conditioned world models with typed actions or interventions?
When are explicit hierarchical JEPA modules redundant, and when do multi-scale teacher targets, high-to-low predictions, or action-conditioned latents add something data2vec-like targets do not?
