Learn From Your Own Latents And Not From Tokens: A Sample-Complexity Theory
Source
- Raw Markdown: paper_own-latents-not-tokens-2026.md
- PDF: paper_own-latents-not-tokens-2026.pdf
- Preprint: arXiv:2605.27734v1
- Official X announcement: Daniel Korchinski, with local oEmbed snapshot x-oembed-dankorchinski-2060344980749549607.json
- Official X follow-up: Daniel Korchinski open question, with local oEmbed snapshot x-oembed-dankorchinski-2060345715914612849.json
- Official X author posts: Alessandro Favero, with local oEmbed snapshot x-oembed-alesfav-2060277596089159734.json, and Matthieu Wyart, with local oEmbed snapshot x-oembed-matthieuwyart-2061317203857739846.json
- Yann LeCun X comment: Yann LeCun, with local oEmbed snapshot x-oembed-ylecun-2061466116988289512.json
- Community X comments provided by Alex: Gerard Sans, with local oEmbed snapshot x-oembed-gerardsans-2060784036495155632.json, and Paul Thompson, with local oEmbed snapshot x-oembed-ptenigma-2063116837832011889.json
No official code or project page was found during ingest.
Status And Credibility
This is a 2026-05-26 arXiv v1 preprint by Daniel J. Korchinski, Alessandro Favero, and Matthieu Wyart. It is credible enough to track as an important JEPA theory source for three reasons: it builds on the Random Hierarchy Model literature; it gives explicit sample-complexity arguments and validates the scaling with an iterative clustering algorithm, an end-to-end stacked latent-clustering network, and data2vec experiments; and Yann LeCun publicly highlighted it on 2026-06-01 as theoretical support for JEPA-style prediction in representation space.
The paper is not peer reviewed and is not an empirical frontier-model result. Its strongest theorem and experiments are on a deliberately simplified probabilistic context-free grammar with fixed tree topology and unambiguous, non-recursive production rules.
Core Claim
On the Random Hierarchy Model, token-level self-supervised objectives must pay an exponential-in-depth sample-complexity cost to recover the latent hierarchy, while own-latent prediction can recover the non-root latent tree at the first-level clustering scale:
$$P \sim v m^3$$
up to logarithmic factors and under the paper’s balanced, separated, stable-clustering assumptions. The mechanism is synonym clustering: once one latent level is recovered, both context and target are lifted to that level, so the next level has the same local statistical difficulty instead of re-paying a surface-token bottleneck.
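As a rough worked comparison, the two scalings quoted in the Method Notes diagram (token-level $P \sim v m^{L+1}$, own-latent $P \sim v m^3$) can be divided to see the size of the gap. This is a sketch that ignores constants and logarithmic factors. It assumes the usual RHM reading of the symbols, which this note does not define: $v$ as vocabulary size, $m$ as the number of synonymous productions per symbol, and $L$ as hierarchy depth.

```latex
\frac{P_{\text{token}}}{P_{\text{latent}}}
  \sim \frac{v\, m^{L+1}}{v\, m^{3}}
  = m^{L-2}
```

For illustrative values that are not taken from the paper, $m = 8$ and $L = 5$ give a ratio of $8^{3} = 512$. The vocabulary size cancels, so under these scalings the gap grows with depth and with the number of synonyms only.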
Author Narrative Context
The author X posts frame the work as an explanation for why raw-token and pixel prediction can be data-hungry: predicting own abstract latents can yield an exponential sample-efficiency gain when the data has hidden hierarchy. Yann LeCun’s comment explicitly connects the result to the JEPA intuition that prediction in representation space should be theoretically justifiable.
The paper supports a narrower version of that narrative. It proves the gap on the Random Hierarchy Model and tests it there, including with data2vec trained on that synthetic grammar. Daniel Korchinski’s own follow-up keeps the main practical question open: whether own-latent generative models beat current scaling regimes on real language or images, especially in the small-data regime.
The two community comments Alex flagged are useful framing but not paper evidence. Gerard Sans names the token-level cost as a “Token Isomorphism Tax”; Paul Thompson connects the intuition to effective representation dimension rather than ambient dimension. The related Universal Weight Subspace Hypothesis is a separate weight-space source and should not be used as evidence that this paper’s RHM sample-complexity mechanism transfers to real data.
Key Contributions
- Formalizes Iterative Latent Clustering (ILC), which recovers the non-root RHM latent hierarchy by clustering cousin-context vectors at each recovered latent level.
- Proves an informal-to-formal sample-complexity result: under balanced and separated RHM grammars and stable clustering, ILC recovers the hierarchy with $P \sim v m^3$ samples, up to log factors and constants.
- Introduces a Stacked Latent-Clustering (SLC) neural architecture with predictor and clusterer modules that reaches the same scaling under gradient training.
- Shows that module-boundary stop-gradients do not destroy the observed SLC scaling, making the mechanism compatible with local-learning interpretations.
- Gives a theory sketch for data2vec on the RHM: teacher targets carry learned latents, and each EMA refresh lifts the prediction target to the next latent scale.
- Measures synonym clustering in data2vec encoders and reports scaling collapse when sample count is rescaled by $v m^3$.
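The clustering step that ILC relies on can be illustrated with a toy sketch. It is not the paper's algorithm or data sampler: the two-level grammar (`RULES`, `P_RIGHT`), the L1 distance, the merge `threshold`, and every function name below are hypothetical choices made for illustration. The idea shown is only that token pairs sharing a latent parent have matching context statistics, so they can be merged and the data lifted one level.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy grammar, not the paper's RHM sampler: 3 latent symbols,
# each with m = 2 synonymous token pairs, and no pair shared between symbols.
RULES = {0: [(0, 0), (1, 2)], 1: [(0, 1), (2, 0)], 2: [(1, 1), (2, 2)]}
# P(right latent | left latent); the rows differ, so latents are separable.
P_RIGHT = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]


def sample_strings(n, rng):
    """Draw n four-token strings: two latents, each expanded by a random rule."""
    strings = []
    for _ in range(n):
        a = rng.randrange(3)
        b = rng.choices(range(3), weights=P_RIGHT[a])[0]
        strings.append(rng.choice(RULES[a]) + rng.choice(RULES[b]))
    return strings


def context_vectors(strings):
    """Empirical distribution of the right pair, conditioned on the left pair."""
    counts = defaultdict(Counter)
    for t1, t2, t3, t4 in strings:
        counts[(t1, t2)][(t3, t4)] += 1
    vectors = {}
    for left, ctr in counts.items():
        total = sum(ctr.values())
        vectors[left] = {right: c / total for right, c in ctr.items()}
    return vectors


def l1(p, q):
    """L1 distance between two sparse distributions stored as dicts."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def cluster_synonyms(strings, threshold=0.5):
    """Merge left pairs whose context vectors are closer than the threshold."""
    ctx = context_vectors(strings)
    pairs = sorted(ctx)
    label = {p: i for i, p in enumerate(pairs)}
    for i, p in enumerate(pairs):
        for q in pairs[i + 1:]:
            if l1(ctx[p], ctx[q]) < threshold:
                old, new = label[q], label[p]
                for r in pairs:
                    if label[r] == old:
                        label[r] = new
    return label


strings = sample_strings(6000, random.Random(0))
clusters = cluster_synonyms(strings)
# Lift both halves of every string to the recovered latent labels.
lifted = [(clusters[s[:2]], clusters[s[2:]]) for s in strings]
```

In a deeper grammar the same routine would be rerun on `lifted`, which is where the claimed depth-independence comes from: each pass sees symbols at the level just recovered, not raw tokens. The fixed threshold is a shortcut that works because this toy grammar is well separated; it stands in for the stable clustering module the theorem assumes.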
Method Notes
The paper’s core distinction is whether the target remains a visible token or is lifted into the representation already learned by the model:
flowchart LR
    X["visible tokens"] --> TSSL["token-level SSL target"]
    TSSL --> Cost1["higher levels still predict through leaves: P ~ v m^(L+1)"]
    X --> L1["recover level-1 latent by clustering"]
    L1 --> L2["predict own level-1 latents"]
    L2 --> L3["cluster next level"]
    L3 --> Cost2["same local threshold at each level: P ~ v m^3"]
The data2vec argument depends on two assumptions. First, teacher targets carry a linear component of the latents already learned by the encoder. Second, gradient descent extracts detectable correlations once they rise above sampling noise. Those assumptions are plausible in the RHM setup and checked with probes, but they are not general finite-sample guarantees for arbitrary language, image, or time-series corpora.
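For readers unfamiliar with the teacher side of data2vec, the sketch below shows only the generic exponential-moving-average update that keeps teacher weights trailing the student. It does not model the latent lifting itself. The function name `ema_update`, the flat-list weight representation, and the value of `tau` are assumptions made for illustration, not details from the paper.

```python
def ema_update(teacher, student, tau=0.99):
    """Return teacher weights moved a fraction (1 - tau) toward the student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]  # held fixed here; in training the student keeps moving
for _ in range(3):
    teacher = ema_update(teacher, student, tau=0.5)

# With tau = 0.5 the teacher closes 1 - 0.5**3 = 87.5% of the gap in 3 steps.
print(teacher)  # [0.875, -1.75]
```

Because the teacher lags the student, its targets reflect whatever the encoder had already learned some steps earlier, which is the property the first assumption above leans on.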
Evidence And Results
- ILC reconstruction accuracy on synthetic RHM data collapses under the predicted rescaling.
- SLC probe accuracy on the highest recovered latent also collapses when the sample count is rescaled by $v m^3$ and remains approximately depth-independent in the reported sweep.
- SLC with stop-gradients at module boundaries still reaches the same scaling, suggesting the statistical mechanism does not require end-to-end credit assignment across all levels.
- data2vec pretrained on RHM strings shows downstream root-classification curves that collapse under the $v m^3$ rescaling in the online setting, while the paper reports the same scaling in the offline setting.
- Synonym-swap probes show data2vec hidden states becoming invariant to same-parent tuples before becoming invariant to unrelated tuples, matching the proposed recursive clustering mechanism.
- The paper reports about 1,000 H100 hours for the data2vec experiments, so the neural validation is nontrivial even though the data model is synthetic.
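What a synonym-swap probe measures can be sketched with two hand-built encoders in place of trained data2vec hidden states. Everything here is hypothetical: the toy `RULES` grammar, both encoders, and the squared-distance sensitivity score are illustrative stand-ins, and the paper's actual probe may be defined differently.

```python
# Hypothetical toy grammar: each latent symbol owns two synonymous token pairs.
RULES = {0: [(0, 0), (1, 2)], 1: [(0, 1), (2, 0)], 2: [(1, 1), (2, 2)]}
PARENT = {pair: sym for sym, pairs in RULES.items() for pair in pairs}


def token_encoder(pair):
    """One-hot over the raw token at each position: no abstraction learned."""
    vec = [0.0] * 6
    vec[pair[0]] = 1.0
    vec[3 + pair[1]] = 1.0
    return vec


def latent_encoder(pair):
    """One-hot over the parent latent: synonyms map to the same vector."""
    vec = [0.0] * 3
    vec[PARENT[pair]] = 1.0
    return vec


def swap_sensitivity(encoder, same_parent):
    """Mean squared change in the encoding when one pair replaces another."""
    total, n = 0.0, 0
    for p in PARENT:
        for q in PARENT:
            if p == q or (PARENT[p] == PARENT[q]) != same_parent:
                continue
            total += sum((x - y) ** 2 for x, y in zip(encoder(p), encoder(q)))
            n += 1
    return total / n


for name, enc in [("token", token_encoder), ("latent", latent_encoder)]:
    print(name, swap_sensitivity(enc, True), swap_sensitivity(enc, False))
```

The latent encoder scores zero on same-parent swaps while staying sensitive to unrelated swaps, which is the signature the bullet above describes. The token encoder reacts to both kinds of swap.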
Limitations
- The RHM is a simplified synthetic grammar with fixed topology, unambiguous rules, no recursive productions, and no explicitly context-dependent rules.
- The theorem depends on balanced and separated grammar assumptions and a stable clustering module.
- The data2vec analysis is a theory sketch under assumptions about teacher-target latent content and correlation learning, not a full optimization theorem.
- The result does not show that JEPA, data2vec, or own-latent generative models beat token-level learning on real language, images, videos, robotics, or time series.
- The “explicit H-JEPA stacking is redundant” conclusion is scoped to the RHM mechanism; the paper itself notes that newer multi-scale self-supervised systems such as V-JEPA 2.1, Bootleg, and LeWorldModel are not just naive stacked JEPAs.
- The paper is about passive latent hierarchy recovery, not action-conditioned world modeling or counterfactual prediction.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Augmentation-free or dataset-aware self-supervision | adjacent | Gives a synthetic-theory reason why own-latent prediction can be more sample-efficient than raw-token prediction without hand-crafted augmentations. | Needs numeric time-series, event-stream, and multivariate benchmarks with real irregularity, noise, and exogenous variables. |
| Latent-state prediction | adjacent | Shows when latent targets can recursively reveal hidden hierarchical state in a PCFG-like data model. | No streaming state-maintenance evidence and no operational latent-state time-series benchmark. |
| Data diversity, curriculum, and long tail | warning | The result depends on recoverable synonym structure and balanced/separated grammar conditions. | Need tests where rare regimes, long-tailed symbols, and nonstationary hierarchy violate those assumptions. |
| Scaling and efficiency | adjacent | Predicts an exponential sample-complexity gap between token-level and own-latent objectives on RHM. | Need matched-compute comparisons on real corpora and serving-time costs for latent-target systems. |
| Control and counterfactuals | insufficient evidence | No action, control input, or intervention channel is modeled. | Needs action-conditioned latent prediction and candidate-intervention rollout tests. |
Links Into The Wiki
- JEPA
- Latent-Space Predictive Learning
- Self-Supervised Representation Learning
- Representation Collapse
- Distribution Priors In Self-Supervised Learning
- Foundation Time-Series Model Research Agenda
- World Models
- LeJEPA Identifiability
- LeWorldModel
- Universal Weight Subspace Hypothesis
Open Questions
- Can own-latent prediction beat token-level prediction under matched compute on real language, images, videos, or multivariate time series?
- Which real datasets have a recoverable hierarchy close enough to RHM synonym structure for the mechanism to matter?
- How should latent prediction targets stay grounded so they do not cluster away rare but decision-relevant state?
- Can the mechanism be extended from passive hierarchical state recovery to action-conditioned world models with typed actions or interventions?
- When are explicit hierarchical JEPA modules redundant, and when do multi-scale teacher targets, high-to-low predictions, or action-conditioned latents add something data2vec-like targets do not?