---
abstract: |
  Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their *own* latent representations of related views or masked regions, as in data2vec and JEPA -- an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised or token-level SSL require a number of samples *exponential* in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples *constant* in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.`\looseness=-1`{=latex}
author:
- |
  Daniel J. Korchinski [^1]\
  Institute of Physics\
  EPFL\
  Lausanne, VD Switzerland\
  `daniel.korchinski@epfl.ch`\
  `\And `{=latex} Alessandro Favero []{.footnote-mark note-num="1"}\
  Department of Applied Maths and Theoretical Physics\
  University of Cambridge\
  Cambridge, UK\
  `af940@cam.ac.uk`\
  `\And `{=latex} Matthieu Wyart\
  Department of Physics & Institute of Physics\
  Johns Hopkins University & EPFL\
  Baltimore, MD USA & Lausanne, VD Switzerland\
  `matthieu.wyart@epfl.ch`
bibliography:
- references.bib
title: |
  Learn from your own latents and not from tokens:\
  A sample-complexity theory
---

\maketitle

# Introduction {#sec:intro}

Generative models have reached striking empirical success. Diffusion models produce realistic images and video [@ho2020denoising; @rombach2022high; @brooks2024video], while large language models trained by next-token prediction master grammar, world knowledge and reasoning [@brown2020language; @openai2023gpt4; @deepseekai2025r1]. Both rest on a single recipe: predict masked or future fragments of the raw signal at massive scale. Yet this success is bought at a cost biological learners do not pay.

Frontier LLMs are trained on $10^{13}$--$10^{14}$ tokens [@hoffmann2022training; @touvron2023llama], more than five orders of magnitude beyond what a child encounters before reaching adult-level competence [@gilkerson2017mapping; @linzen2020accelerate; @frank2023bridging]; state-of-the-art diffusion models likewise rely on billions of images [@schuhmann2022laion]. This gap suggests that token-level pretraining is far from optimal, and several hypotheses have been advanced to explain it. One is that biological learning is multimodal and grounded [@smith2005development; @alayrac2022flamingo]. A second -- the one we pursue -- is that learning may not be most efficient at the level of raw tokens, but instead in a more abstract latent space [@rao1999predictive; @friston2010free; @lecun2022path]. Indeed, predicting semantic concepts or continuous latents has been shown to improve data efficiency [@tackLLMPretrainingContinuous2025; @liuNextConceptPrediction2026].

An alternative strategy to input reconstruction has gained traction in the last five years: the network is trained to predict its *own* latent representations of related views, masked regions or future signals. These targets are themselves the output of a learned encoder, lifting prediction into increasingly abstract spaces as learning progresses. The idea relates to computational neuroscience, where predictive coding posits that the cortex seeks to predict its own future activity [@rao1999predictive; @illingLocalPlasticityRules2021; @millidge2022predictive]. A growing family of self-supervised methods instantiates this principle, including BYOL [@grillBootstrapYourOwn2020], DINO  [@caronEmergingPropertiesSelfsupervised2021], data2vec [@baevskiData2vecGeneralFramework2022] and JEPA  [@lecun2022path].`\looseness=-1`{=latex}

Some advantages of these algorithms over input reconstruction approaches have been discussed in the literature, including the fact that they can be formulated as energy-based methods [@lecun2022path] or are more robust to the presence of irrelevant features in the data [@vanassel2025joint]. Yet understanding why and by how much these methods improve data efficiency remains a fundamental challenge in both machine learning and neuroscience. Another key question is the putative benefits of stacking such algorithms so as to obtain a multi-scale hierarchy of learned representations. This idea, proposed years ago  [@lecun2022path], was recently implemented  [@liHiTJEPAHierarchicalSelfsupervised2025; @girgisHierarchicalJEPAMeets2026] and appears to lead at best to moderate improvements, for reasons that remain currently unclear.

In this work, we address both questions by focusing on synthetic models of data with a hierarchical latent structure. Probabilistic context-free grammars (PCFGs) are a natural candidate: through recursive production rules they generate data compositionally from a tree of latent variables, and have long been influential as models of language  [@booth1973applying; @manning1999foundations] and natural images [@grenander1996elements; @mumford1994pattern]. General PCFGs are notoriously hard to treat. We therefore work with the Random Hierarchy Model (RHM) [@cagnetta2024deep], a simplified PCFG with fixed tree topology and random production rules. The RHM has produced quantitatively predictive theories, including relating neural scaling laws to the statistics of natural language [@cagnetta2024towards; @cagnettaDerivingNeuralScaling2026], and a theory of compositional generalization and memorization in diffusion models for language and images [@sclocchiPhaseTransitionDiffusion2025; @sclocchiProbingLatentHierarchical2025; @favero2025bigger]. The RHM is parameterized by the number of production rules per symbol, $m$, and the depth of the latent tree, $L$. Prior work shows that for deep architectures, supervised end-to-end learning requires an order of $m^L$ samples [@cagnetta2024deep], and token-level self-supervised objectives -- masked-language modelling and diffusion -- require of order $m^{L+1}$ [@cagnetta2024towards; @favero2025compositional; @cagnetta2025scaling]. Both costs grow *exponentially* with the depth $L$ of the latent hierarchy (and thus polynomially in the input dimension, i.e., sequence length). Against this baseline, we establish three results:`\looseness=-1`{=latex}

1.  **Efficient algorithm for representation learning** (`\Cref{sec:latent-clustering}`{=latex})**.** We show that a hierarchical clustering algorithm recovers the full non-root latent tree from order $m^3$ samples -- *independent* of the depth $L$, and hence exponentially fewer than those required by token-level SSL.`\looseness=-1`{=latex}

2.  **An end-to-end stacked architecture with the same scaling** (`\Cref{sec:neuralclustering}`{=latex})**.** We propose a new architecture where a stack of small predictor-clusterer modules predicts its own latents at each level. Trained end-to-end with gradient descent, it reaches the $m^3$ scaling on the RHM.

3.  **A theory of data2vec** (`\Cref{sec:data2vec}`{=latex})**.** We show empirically and explain theoretically with reasonable assumptions that data2vec [@baevskiData2vecGeneralFramework2022] implicitly performs hierarchical latent prediction, again leading to a sample complexity of order $m^3$. To our knowledge, this is the first sample-complexity analysis of such methods.

*These results establish that learning from one's own latents can yield huge sample-complexity gains over token-level objectives.* They also show that a single latent-prediction module such as data2vec is in effect already performing multi-scale latent discovery. Such methods may already be implicitly hierarchical, weakening the case for explicit stacking such as H-JEPA.

#### Related work.

The compositional and hierarchical structure of natural data has long been put forward as a key feature making it learnable by neural networks. Deep networks represent hierarchically compositional functions with exponentially fewer parameters than shallow ones [@poggio2017why], yielding nonparametric sample-complexity bounds polynomial in the input dimension for both supervised [@schmidt2020nonparametric] and self-supervised [@mei2024unets] tasks. These results characterize expressivity and information-theoretic statistics, but do not describe what gradient descent can actually learn from finite data. A complementary line shows that transformers can approximately implement parsing-style inference on context-free grammars [@allenzhu2024cfg; @zhao2023transformers; @garnierbrun2024transformers], characterizing the algorithm trained models implement, but not its sample complexity. The RHM [@cagnetta2024deep], building on @mossel2016deep, @malach2018provably, and @degiuli2019random, fills this gap by enabling sample-complexity predictions that match observations in deep nets. Existing RHM analyses cover supervised learning [@cagnetta2024deep] and token-level SSL [@cagnetta2024towards; @favero2025compositional], but not learning from one's own latents, which is the focus of this work.

# Preliminaries {#sec:rhm}

<figure data-latex-placement="t">
<div class="center">

</div>
</figure>

#### The Random Hierarchy Model (RHM).

The RHM [@cagnetta2024deep] is a probabilistic context-free grammar on a fixed regular tree. The tree has depth $L$, branching factor $s$, and vocabularies $\mathcal V_0,\mathcal V_1,\ldots,\mathcal V_L$, all of size $v$. Level $0$ is visible and levels $1,\ldots,L$ are latent. At level $\ell$, there are $s^{L-\ell}$ variables $h^{(\ell)}_1,\ldots,h^{(\ell)}_{s^{L-\ell}}$, with $h^{(\ell)}_u\in\mathcal V_\ell$. The visible sequence is $x_i=h^{(0)}_i$, $i=1,\ldots,s^L$.

For each level $\ell=0,\ldots,L-1$, a grammar instance is sampled by choosing $vm$ distinct tuples from $\mathcal V_\ell^s$ uniformly without replacement and partitioning them into $v$ labeled groups of size $m$, one group $\mathcal R_{\ell,a}$ for each parent symbol $a\in\mathcal V_{\ell+1}$. A rule $\nu=(\nu_1,\ldots,\nu_s)\in\mathcal R_{\ell,a}$ means that parent $a$ may generate the tuple of children $\nu$. The set of grammatical tuples at level $\ell$ is $\mathcal S_\ell:=\bigsqcup_{a\in\mathcal V_{\ell+1}}\mathcal R_{\ell,a},$ so $|\mathcal S_\ell|=vm$. We write $f:=m/v^{s-1}$ for the fraction of valid $s$-tuples, since $vm$ out of the $v^s$ possible tuples are grammatical. Because the rule map is injective, the grammar is unambiguous: each grammatical tuple has a unique parent. We denote it by $\mathrm{par}_\ell(\nu)=a
    \;\Longleftrightarrow\;
    \nu\in\mathcal R_{\ell,a}.$ Two grammatical tuples $\nu,\nu'\in\mathcal S_\ell$ are called *synonyms* if they have the same parent.

Generation proceeds top-down. First $h^{(L)}_1\sim\mathrm{Unif}(\mathcal V_L)$. Then, recursively, if $h^{(\ell+1)}_u=a$, the child tuple $T^{(\ell)}_u:=\big(h^{(\ell)}_{(u-1)s+1},\ldots,h^{(\ell)}_{us}\big)$ is sampled uniformly from $\mathcal R_{\ell,a}$.

#### Correlation-based learning in the RHM.

A central observation from @cagnetta2024deep is that learning the RHM amounts to learning invariances under the exchange of synonyms. The statistical signal that identifies these synonym classes comes from correlations between a tuple and an external observable. Concretely, let $T^{(\ell)}$ be a level-$\ell$ tuple and let $Z$ be an observable elsewhere in the tree. This arrangement is depicted for different learning objectives in `\Cref{fig:rhm_corr_schematic}`{=latex}. One can consider the connected correlation: $$C_Z(\nu,z)
    :=
    \mathbb P[T^{(\ell)}=\nu,Z=z]
    -
    \mathbb P[T^{(\ell)}=\nu]\mathbb P[Z=z].$$ If two tuples $\nu,\nu'$ are synonyms, then $C_Z(\nu,\cdot)=C_Z(\nu',\cdot),$ whenever $Z$ depends on the tuple only through its parent.[^2] The strength of this correlation signal depends on the choice of $Z$ and on its position relative to the tuple. The signal must propagate through the hidden tree from the parent of $T^{(\ell)}$ to the observable $Z$. Along this path, each unresolved production rule averages over $m$ synonymous choices, attenuating the signal. In sample-complexity terms, each additional unresolved rule costs a factor of order $m$, as illustrated in `\Cref{fig:rhm_corr_schematic}`{=latex}.

*Supervised classification [@cagnetta2024deep]:* the observable $Z$ is the class label at the root. To recover the rules mapping level-$\ell$ tuples to level-$(\ell+1)$ latents, the root-to-tuple correlation must traverse the $L-\ell-1$ levels separating the tuple parent from the root. Including the fact that each grammatical tuple is observed with probability of order $1/(vm)$, this gives sample complexities of the form $P_\ell^{\mathrm{sup}}
    \asymp
    v\,m^{L-\ell}.$ The hardest step is therefore the first one, $\ell=0$, where one must learn the level-1 rules. Deep networks trained on the supervised RHM classification task were found to saturate this scaling and reconstruct the latent hierarchy in their representations.

*Token-level self-supervision [@cagnetta2024towards; @favero2025compositional; @cagnetta2025scaling]:* $Z$ is a masked or noised token. In the first step, local correlations between visible tokens identify the level-1 synonym classes, with sample complexity $P_1^{\mathrm{tok}}
    \asymp
     v m^3.$ Once these low-level latents have been internally reconstructed, the model can use them as coarse context variables. To learn higher levels, the relevant statistics are token--latent correlations: a visible token is correlated with a latent representation of its context. However, because the prediction target is still a surface token, the signal is averaged through the descendant channel from the latent scale down to the leaves. Each additional latent level therefore costs one extra factor of $m$ in sample complexity. Thus, if $\ell \geq
1$ denotes the latent level being learned above the leaves, $$\begin{equation}
    P_\ell^{\mathrm{tok}}
    \asymp
    v\,m^{\ell+2}.
    \label{eq:sample_complexity_token_latent}
\end{equation}$$ Neural networks trained with token-level SSL objectives were empirically shown to saturate this staged scaling: as the dataset size increases, lower-level latents appear first, and higher-level latents become learnable thereafter. The overall sample-complexity is gated behind reconstructing latents $\ell = L-1$ leading to $P_{\rm tok}\asymp vm^{L+1}$.[^3]

# Recovering the RHM hierarchy by clustering {#sec:latent-clustering}

\begin{algorithm}[t]\DontPrintSemicolon
\newcommand{\ilctikzmark}[1]{{\tikz[overlay,remember picture,baseline]\node (#1) {};}}
\newcommand{\ilcwrap}[1]{\parbox[t]{0.76\linewidth}{#1}}
\newcommand*{\AddILCNote}[4]{%
    \begin{tikzpicture}[overlay,remember picture]
        \draw[decoration={brace,amplitude=1.15em},decorate,thick]
            ($(#3)!(#1.north)!($(#3)-(0,1)$) + (1.15em,0)$) --
            ($(#3)!(#2.south)!($(#3)-(0,1)$) + (1.15em,0)$)
            node[align=left,pos=0.5,anchor=west,xshift=1.15em] {#4};
    \end{tikzpicture}%
}
\newcommand*{\PlaceILCNotes}{%
    \makebox[0pt][l]{%
    \raisebox{0pt}[0pt][0pt]{%
        \AddILCNote{ilc-predictor-top}{ilc-predictor-bottom}{ilc-brace-right}{Predictor, $p$}%
        \AddILCNote{ilc-clusterer-top}{ilc-clusterer-bottom}{ilc-brace-right}{Clusterer, $C$}%
    }}%
}
\caption{Iterative Latent Clustering (ILC) --- see \Cref{fig:ilc_schematic} for a graphical representation.}
\label{alg:ilc}
\KwIn{$P$ samples $x^{(1)},\ldots,x^{(P)}$, RHM parameters $L,s,v$, and a clustering module $\mathsf{Cluster}_v$.}
\KwOut{estimated non-root hierarchy $\widehat h^{(1)},\ldots,\widehat h^{(L-1)}$.}
\BlankLine
\textbf{Initialize:} $\widehat h^{(0)}_i = x_i$.\;
\BlankLine
\For{$\ell = 0, 1, \ldots, L-2$}{
  \ilcwrap{\ilctikzmark{ilc-predictor-top}Form all level-$\ell$ tuples $\widehat T^{(\ell)}_u$.\hfill\ilctikzmark{ilc-brace-right}}\;
  \ilcwrap{Estimate the support of grammatical tuples $\widehat{\mathcal S}_\ell$ by the observed tuple set.}\;
  \ilcwrap{For every $\nu\in\widehat{\mathcal S}_\ell$, count its incidence as $N(\nu) = \sum_{p=1}^P
          \mathbf 1\{
              \widehat T^{(p)}_\ell=\nu
          \}$ and compute its empirical cousin context vector}
  \[
      \widehat\phi_\ell(\nu)
      :=
      \frac{1}{N(\nu)}
          \sum_{p=1}^P
          e_{\widehat Z^{(p)}_\ell}
          \mathbf 1\{
              \widehat T^{(p)}_\ell=\nu
          \}
      ,
  \]
  \ilcwrap{where $\widehat T^{(p)}_\ell$ is a fixed tuple at level $\ell$ in sample $p$, and $\widehat Z^{(p)}_\ell$ is a fixed level-$\ell$ element in a cousin tuple (i.e. sharing the $\ell+2$ grandparent with $\widehat T^{(p)}_\ell$).\ilctikzmark{ilc-predictor-bottom}}\;
  \ilcwrap{\ilctikzmark{ilc-clusterer-top}Cluster the context vectors:
  $\widehat{\mathcal S}_{\ell,1},\ldots,\widehat{\mathcal S}_{\ell,v}
  = \mathsf{Cluster}_v\!\left(\{\widehat\phi_\ell(\nu):\nu\in\widehat{\mathcal S}_\ell\}\right).$}\;
  \ilcwrap{Define the next-level latent label by the cluster identity:}
  \[
      \widehat h^{(\ell+1)}_u=a
      \qquad
      \text{if}
      \qquad
      \widehat T^{(\ell)}_u
      \in
      \widehat{\mathcal S}_{\ell,a}.\ilctikzmark{ilc-clusterer-bottom}
  \]\PlaceILCNotes 
}
\end{algorithm}

As discussed in the previous section, the limitation of token-level objectives is not that they fail to learn latent variables. Rather, they learn them while still receiving supervision through visible tokens. The learn-from-your-own-latent setting studied below removes this residual token bottleneck: once level $\ell$ has been recovered, both the conditioning object and the prediction target are lifted to level $\ell$. The next stage is then again a local synonym-clustering problem, but now with the same statistical strength at every scale.

In other words, every level becomes as easy as the first token-level step. Given `\Cref{eq:sample_complexity_token_latent}`{=latex}, this suggests that the whole non-root hierarchy should be recoverable from $P\asymp vm^3$ samples. This is illustrated in `\Cref{fig:rhm_corr_schematic}`{=latex}--**right**.

#### Iterative latent-clustering algorithm (ILC).

The preceding section framed learning the RHM in terms of learning tuple-target correlations, and identifying *synonyms* by tuples that share identical correlations. Here, we cast this as a vector-clustering problem. Let $T$ denote a level-$\ell$ tuple, let $Z$ be a level-$\ell$ target in a cousin tuple (cf. the arrangement in `\Cref{fig:rhm_corr_schematic}`{=latex} -- cousins are simply nodes sharing the same grandparent at level $\ell+2$), and let $e_Z\in\mathbb R^v$ be its one-hot encoding. For each grammatical tuple $\nu\in\mathcal S_\ell$, define the *conditional context vector* $$\phi_\ell(\nu)
    :=
    \mathbb E[e_Z\mid T=\nu]
    \in\Delta^{v-1}.$$

\begin{wrapfigure}{br}{0.54\textwidth}
    
     
    \resizebox{\linewidth}{!}{\begingroup
\providecommand{\clusteringSchematicPanelSep}{}
\providecommand{\clusteringSchematicXSep}{}
\providecommand{\clusteringSchematicLevelOneY}{}
\providecommand{\clusteringSchematicLevelTwoY}{}
\providecommand{\clusteringSchematicRootY}{}
\renewcommand{\clusteringSchematicPanelSep}{8mm}
\renewcommand{\clusteringSchematicXSep}{0.80}
\renewcommand{\clusteringSchematicLevelOneY}{-2.3}
\providecommand{\clusteringSchematicPDeltaY}{1.3}
\providecommand{\clusteringSchematicCDeltaY}{0.7}
\renewcommand{\clusteringSchematicLevelTwoY}{-0.25}
\renewcommand{\clusteringSchematicRootY}{0.6}
\definecolor{ptcOchre}{RGB}{184,119,0}
\definecolor{ptcTeal}{RGB}{0,128,128}
\definecolor{ptcViolet}{RGB}{118,80,160}
\definecolor{ptcBlue}{RGB}{42,111,180}
\definecolor{ptcRed}{RGB}{190,72,72}
\definecolor{rhmTeal}{RGB}{0,128,128}
\definecolor{rhmViolet}{RGB}{118,80,160}
\providecommand{\clusteringSchematicXSep}{0.80}
\providecommand{\clusteringSchematicLevelOneY}{-1.20}
\providecommand{\clusteringSchematicLevelTwoY}{1.75}
\providecommand{\clusteringSchematicRootY}{2.12}
\begin{tikzpicture}[
    >=Stealth,
    node dot/.style={circle, draw, thick, fill=white, inner sep=0pt, minimum size=5.6mm},
    leaf dot/.style={circle, draw, thick, fill=white, inner sep=0pt, minimum size=4.6mm},
    module/.style={regular polygon, regular polygon sides=3, shape border rotate=0, draw, thick, fill=white, inner sep=0pt, minimum size=7.2mm, font=\footnotesize},
    target/.style={draw, thick, fill=white, minimum width=13mm, minimum height=6.4mm, inner sep=0pt},
    latent/.style={circle, draw, thick, fill=white, inner sep=0pt, minimum size=3.2mm},
    tree edge/.style={line width=0.8pt},
    signal/.style={->, line width=1.1pt}
]
    \path[use as bounding box] (-2,-4.55) rectangle (3.85,0.0);
    \IfFileExists{figs/clustering_schematic_ptc_module_panel.tikz}{\providecommand{\clusteringSchematicXSep}{0.80}
\providecommand{\clusteringSchematicLevelOneY}{-1.20}
\providecommand{\clusteringSchematicLevelTwoY}{1.75}
\node[leaf dot, font=\scriptsize] (x1) at ({-3.5*\clusteringSchematicXSep},-4.15) {$x_1$};
\node[leaf dot, font=\scriptsize] (x2) at ({-2.5*\clusteringSchematicXSep},-4.15) {$x_2$};
\node[leaf dot, font=\scriptsize] (x3) at ({-1.5*\clusteringSchematicXSep},-4.15) {$x_3$};
\node[leaf dot, font=\scriptsize] (x4) at ({-0.5*\clusteringSchematicXSep},-4.15) {$x_4$};
\node[leaf dot, font=\scriptsize] (x5) at ({0.5*\clusteringSchematicXSep},-4.15) {$x_5$};
\node[leaf dot, font=\scriptsize] (x6) at ({1.5*\clusteringSchematicXSep},-4.15) {$x_6$};
\node[leaf dot, font=\scriptsize] (x7) at ({2.5*\clusteringSchematicXSep},-4.15) {$x_7$};
\node[leaf dot, font=\scriptsize] (x8) at ({3.5*\clusteringSchematicXSep},-4.15) {$x_8$};

\node[module] (pA1) at ({-3.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (pA2) at ({-1.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (pB1) at ({1.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (pB2) at ({3.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (cA1) at ({-3.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicCDeltaY}) {$C$};
\node[module] (cA2) at ({-1.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicCDeltaY}) {$C$};
\node[module] (cB1) at ({1.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicCDeltaY}) {$C$};
\node[module] (cB2) at ({3.0*\clusteringSchematicXSep},{\clusteringSchematicLevelOneY-\clusteringSchematicCDeltaY}) {$C$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z1l1) at ({-3.0*\clusteringSchematicXSep},\clusteringSchematicLevelOneY) {$\widehat{h}_1^{(1)}$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z2l1) at ({-1.0*\clusteringSchematicXSep},\clusteringSchematicLevelOneY) {$\widehat{h}_2^{(1)}$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z3l1) at ({1.0*\clusteringSchematicXSep},\clusteringSchematicLevelOneY) {$\widehat{h}_3^{(1)}$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z4l1) at ({3.0*\clusteringSchematicXSep},\clusteringSchematicLevelOneY) {$\widehat{h}_4^{(1)}$};

\draw[tree edge] (x1.north) -- (pA1.south west);
\draw[tree edge] (x2.north) -- (pA1.south east);
\draw[tree edge] (x3.north) -- (pA2.south west);
\draw[tree edge] (x4.north) -- (pA2.south east);
\draw[tree edge] (x5.north) -- (pB1.south west);
\draw[tree edge] (x6.north) -- (pB1.south east);
\draw[tree edge] (x7.north) -- (pB2.south west);
\draw[tree edge] (x8.north) -- (pB2.south east);
\draw[tree edge] (pA1.north) -- (cA1.south);
\draw[tree edge] (pA2.north) -- (cA2.south);
\draw[tree edge] (pB1.north) -- (cB1.south);
\draw[tree edge] (pB2.north) -- (cB2.south);
\draw[tree edge] (cA1.north) -- (z1l1.south);
\draw[tree edge] (cA2.north) -- (z2l1.south);
\draw[tree edge] (cB1.north) -- (z3l1.south);
\draw[tree edge] (cB2.north) -- (z4l1.south);

\node[draw=ptcTeal, text=ptcTeal, font=\scriptsize\bfseries, line width=1.1pt, minimum size=5.2mm, inner sep=0pt] (x4Target) at (x4) {};
\draw[signal, draw=ptcTeal] (pA1.east) .. controls +(0.55,0.14) and +(-0.30,0.42) .. (x4Target.west);

\node[module] (pT1) at ({-2.0*\clusteringSchematicXSep},{\clusteringSchematicLevelTwoY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (pT2) at ({2.0*\clusteringSchematicXSep},{\clusteringSchematicLevelTwoY-\clusteringSchematicPDeltaY}) {$p$};
\node[module] (cT1) at ({-2.0*\clusteringSchematicXSep},{\clusteringSchematicLevelTwoY-\clusteringSchematicCDeltaY}) {$C$};
\node[module] (cT2) at ({2.0*\clusteringSchematicXSep},{\clusteringSchematicLevelTwoY-\clusteringSchematicCDeltaY}) {$C$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z1l1l2) at ({-2.0*\clusteringSchematicXSep},\clusteringSchematicLevelTwoY) {$\widehat{h}_1^{(2)}$};
\node[latent, minimum size=5.0mm, font=\scriptsize] (z2l1l2) at ({2.0*\clusteringSchematicXSep},\clusteringSchematicLevelTwoY) {$\widehat{h}_2^{(2)}$};

\draw[tree edge] (z1l1.north) -- (pT1.south west);
\draw[tree edge] (z2l1.north) -- (pT1.south);
\draw[tree edge] (z3l1.north) -- (pT2.south);
\draw[tree edge] (z4l1.north) -- (pT2.south east);
\draw[tree edge] (pT1.north) -- (cT1.south);
\draw[tree edge] (pT2.north) -- (cT2.south);
\draw[tree edge] (cT1.north) -- (z1l1l2.south);
\draw[tree edge] (cT2.north) -- (z2l1l2.south);
\node[draw=ptcViolet, text=ptcViolet, font=\scriptsize\bfseries, line width=1.1pt, minimum size=8.0mm, inner sep=0pt] (z4l1Target) at (z4l1) {};
\draw[signal, draw=ptcViolet] (pT1.east) .. controls +(0.55,0.12) and +(-0.42,0.42) .. (z4l1Target.west);
}{\input{clustering_schematic_ptc_module_panel.tikz}}
\end{tikzpicture}
\endgroup
}
    \caption{\textbf{Graphical representation of \Cref{alg:ilc} for $L=3$}. Predictor $p$ implements steps 3-5, and clusterer $C$ constructs next-level latents. Highlighted prediction targets are as in \Cref{fig:rhm_corr_schematic}--\textbf{right}. \label{fig:ilc_schematic}}
\end{wrapfigure}

As before, **the essential observation is that synonyms have the same context vector**. If $\nu\in\mathcal S_{\ell,a}$, we denote the common context vector of parent $a$ by $\Phi_{\ell,a}
    :=
    \phi_\ell(\nu)$. The goal is to cluster the $vm$ tuple context vectors into these $v$ parent centers. This assignment of tuples into clusters explicitly constructs the next-level latents.

`\Cref{alg:ilc}`{=latex} (Iterative Latent Clustering, or ILC) operationalizes this procedure. At each step, the algorithm assumes that level $\ell$ has been decoded, clusters level-$\ell$ tuples by their empirical context vectors, and thereby constructs level $\ell+1$. This clustering procedure is illustrated graphically in `\Cref{fig:ilc_schematic}`{=latex}: the predictor $p$ estimates context vectors (steps 3-5), while the clusterer $C$ maps these vectors to next-level latent labels (steps 6-7). We prove its sample complexity in `\Cref{thm:iterative-latent-clustering}`{=latex} and validate it numerically in `\Cref{fig:kmeans_and_slc}`{=latex}.

#### Theoretical sample complexity.

An RHM grammar is *balanced* if, for every level $\ell$, every grammatical tuple occurs with probability of order $1/(vm)$. It is *separated* if every pair of parent context vector $\Phi_{\ell,a}, \Phi_{\ell,b}$ are separated by distance $\gtrsim 1/m$. We refer the reader to Appendix `\ref{app:clustering-proof}`{=latex} for the formal statements of these assumptions and their justification in the limit of $v\to\infty$ at fixed $f \in(0,1)$. For a balanced and separated RHM grammar with $f<1$, and assuming the existence of a *stable* clustering module (see Appendix `\ref{app:clustering-proof}`{=latex}), we have:

\begin{theorem}[Recovery of the non-root hierarchy; informal]\label{thm:iterative-latent-clustering}
Fix an RHM grammar that is balanced and separated, and a clustering algorithm stable under perturbations. Then there exists a constant \(C>0\), depending only on the constants in those assumptions and on \(s\), such that the following holds. If
\begin{equation}
\label{eq:full-recovery-sample-complexity}
    P
    \geq
    C
    \left[
        vm\log\frac{Lvm}{\delta}
        +
        \frac{v m^3}{1-f}
        \log\frac{Lvm}{\delta}
    \right],
\end{equation}
then the iterative latent-clustering algorithm recovers
\(
    h^{(1)},h^{(2)},\ldots,h^{(L-1)}
\)
and all production-rule classes below the root with probability at least \(1-\delta\), up to independent permutations of the latent labels at each level.

In particular, for fixed \(f\) bounded away from \(1\) and up to logarithmic factors,
\(
    P_{\mathrm{ILC}}
    \asymp
    v m^3.
\)
\end{theorem}

In Appendix `\ref{app:clustering-proof}`{=latex}, we give the formal theorem and its proof, which is a level-by-level induction based on concentration of the empirical context vectors. Notice that $v m^3$ is the scale needed to learn the first-level rules from visible tokens (see `\Cref{sec:rhm}`{=latex}). However, once one level has been recovered, the decoded latent sequence is again an RHM with the same local parameters. The same sample scale therefore suffices at every subsequent level, allowing the algorithm to recover the non-root hierarchy without any exponential dependence on $L$, as illustrated in `\Cref{fig:rhm_corr_schematic}`{=latex}--**right**.

#### Numerical validation.

We generate samples from the fixed-tree RHM and apply the level-by-level procedure described above. At each level, we estimate the empirical context vectors $\widehat \phi_\ell(\nu)$ for observed grammatical tuples and cluster them using $k$-means. The recovered labels are matched to the ground-truth latent labels by the optimal permutation, and we report the resulting reconstruction accuracy. Figure `\ref{fig:kmeans_and_slc}`{=latex}--**left** shows the reconstruction accuracy for $L=5$, $s=3$, $v=8$, and varying $m$. In the unrescaled plot, the transition to perfect reconstruction shifts to larger sample sizes as $m$ increases. After rescaling the number of samples by $v m^3,$ the curves collapse and reconstruction becomes accurate once $P/ v m^3$ is of order one. This confirms the predicted $m^3$ scaling of the local clustering threshold. The algorithm recovers all non-root levels at this same scale, consistent with the theory that the first-level recovery threshold is sufficient to climb the hierarchy recursively.`\looseness=-1`{=latex}

![ **Recursive clustering recovers the RHM hierarchy at the predicted scale.** **Left**: Clustering reconstruction accuracy of the iterative clustering algorithm using $k$-means ($L=5,s=3,v=8$). **Right**: Class reconstruction accuracy of stacked SLC architecture ($L=4,s=3,v=10$) trained end-to-end, evaluated by linear probe on the highest level latents $\widehat{h}^{(L-1)}$. In each panel, the main plot is a *scaling collapse*, effected by rescaling number of samples by predicted sample complexity $P\sim vm^3$. Insets show raw data (top left) and inferred sample complexity (bottom right) $P^\ast$ for which accuracy reaches 50%.`\looseness=-1`{=latex} ](figs/clustering_and_ptc_m_sweeps.png){#fig:kmeans_and_slc width="\\linewidth"}

# Gradient-based Iterative Latent Clustering {#sec:neuralclustering}

In this section, we ask whether the same scaling of the ILC algorithm can be achieved by a differentiable architecture trained end-to-end by gradient descent.

#### The Stacked Latent-Clustering (SLC) network.

We mirror the recursion of `\Cref{alg:ilc}`{=latex} with a stack of $L-1$ identical modules (`\Cref{fig:ilc_schematic}`{=latex}). Each module is composed of two subnetworks that play the roles of the two operations in the algorithm.

- $p$: A *predictor* $\mathrm{Pred}^{(\ell)}$ takes an $s$-tuple of level-$\ell$ tokens and outputs, via cross-entropy training, a categorical distribution over its cousin tokens; this is the neural analog of the empirical cousin context vector.

- $C$: A *clusterer* $\mathrm{Clust}^{(\ell)}$ then maps each prediction vector to a soft assignment in a discrete codebook through a contrastive objective: prediction vectors with high similarity are pulled to the same code while dissimilar ones are pushed apart -- implementing $\mathsf{Cluster}_v$ in the neural setting.

The cluster assignments $\widehat{h}^{(\ell+1)}$ produced by level $\ell$ become the input tokens of the level-$(\ell+1)$ module. To keep the difficulty of the prediction task constant across depth, cluster outputs are tokenized with a softmax. Prediction targets are computed by a teacher network whose weights track an exponential moving average of the student, $W_\mathrm{teacher} \leftarrow (1-\alpha_\mathrm{ema})\,W_\mathrm{teacher}
    + \alpha_\mathrm{ema}\,W_\mathrm{student},$ a common trick to prevent the representation collapse common to self-distilled SSL [@balestriero2023cookbook]. Full architectural and training details are given in Appendix `\ref{sec:app:slc_arch_details}`{=latex}.

#### Results.

`\Cref{fig:kmeans_and_slc}`{=latex}--**right** shows the accuracy of a linear probe trained to recover the top-level latent from the frozen final-layer representations of an SLC network, for $L=4$, $s=3$, $v=10$, and varying $m$. Rescaling the sample axis by $vm^3$ collapses the curves, confirming that the network saturates the clustering threshold of `\Cref{thm:iterative-latent-clustering}`{=latex}. `\Cref{fig:slc_L_scaling}`{=latex} shows that the sample complexity for $L\in \{3, \ldots, 7 \}$ does not vary. Interestingly, inserting a stop-gradient at every module boundary (or even between $p$ and $C$), so that each level is trained *only* by its own prediction and clustering losses, leaves the $vm^3$ scaling unchanged (`\Cref{fig:slc_stop_grads}`{=latex}). SLC therefore admits a strictly local learning rule, consistent with biologically motivated learning schemes.

SLC shows that an *explicit* hierarchical architecture, with one predictor-clusterer module per latent level, can saturate the $vm^3$ statistical bound of `\Cref{sec:latent-clustering}`{=latex}. In the next section, we demonstrate that the SSL representation learning objective of data2vec leads to *implicit* hierarchical supervision, and correspondingly also saturates this learning bound.

# Analysis of data2vec {#sec:data2vec}

<figure id="fig:d2vec_online_class">
<p> </p>
<figcaption><strong>Data2vec on RHM.</strong> <strong>Left:</strong> data2vec learns in a teacher-student setting; the student is supervised on reconstruction of pooled teacher representations at masked token positions. <strong>Right:</strong> Root-classification accuracy of probes on pooled final-layer features, as a function of the number of online pretraining samples <span class="math inline"><em>P</em></span>. RHM parameters: <span class="math inline"><em>v</em> = 16</span>, <span class="math inline"><em>s</em> = 2</span>, <span class="math inline"><em>L</em> = 4</span>. Inset: raw curves vs. <span class="math inline"><em>P</em></span> and inferred sample complexity <span class="math inline"><em>P</em><sup>*</sup></span> for which accuracy reaches 50%. Main: rescaling by <span class="math inline"><em>v</em><em>m</em><sup>3</sup></span>, curves collapse. Lines are averages over three independent realizations, dots are individual runs. </figcaption>
</figure>

In this section, we argue that the iterative clustering mechanism is *implicit* in data2vec [@baevskiData2vecGeneralFramework2022], a popular SSL method that predicts its own latent representations. We choose data2vec over the closely related JEPA family because it comes with a published recipe for discrete-token inputs. In particular, data2vec uses a student-teacher setup. The student network is fed a partially masked input $S$; the teacher sees the unmasked input $x$. At every masked position $i$, the student is trained, with a squared loss, to predict the target $Y_i(x)$ given by the average of the last $K$ block activations of a teacher network applied to the unmasked input $x$, as illustrated in `\Cref{fig:d2vec_online_class}`{=latex}--**left**. As with SLC, the teacher has the same architecture as the student, and its weights track an exponential moving average of the student's weights, updated after each gradient step.

#### Empirical sample complexity.

We pretrain data2vec on the RHM. We follow the data2vec text recipe: a fraction $p_{\rm mask}=0.15$ of positions are masked. Full details are presented in Appendix `\ref{app:data2vec}`{=latex}. We compare two data-presentation regimes: *online*, where each step draws a fresh batch from the grammar, and *offline*, where a fixed training set of $P$ unique strings is reused across epochs. After pretraining, we freeze the encoder and train a one-hidden-layer MLP probe to predict the root label from mean-pooled final-layer features.

`\Cref{fig:d2vec_online_class}`{=latex}--**right** shows the downstream root-classification accuracy of the probes for $L=4$, $s=2$, $v=16$ in the online setting.[^4] The learning curves collapse onto a single curve after rescaling the sample axis by $vm^3$. Root classification on the RHM requires resolving all $L-1$ non-root latent levels and then identifying the partition of top-level tuples into root classes. As discussed in `\Cref{sec:rhm}`{=latex}, token-level SSL is bottlenecked by learning the level-$(L-1)$ rules from tokens, which costs $\sim vm^{L+1}=vm^5$ samples; an additional $\mathcal O(vm)$ labeled examples then suffice to learn the root partition.[^5] The observed scaling is hence incompatible with this token-level prediction baseline.

#### Theoretical analysis.

We now explain the $vm^3$ scaling by tying data2vec back to the iterative latent clustering mechanism of `\Cref{sec:latent-clustering}`{=latex}. Although the target $Y_i(x)$ is a continuous vector, we posit it acts as a soft encoding of the RHM latents that the encoder has already learned: at any point in training, $Y_i(x)$ contains a linear component of every latent learned in the encoder (as we verify in `\Cref{fig:app:d2vec_levels}`{=latex}). We idealize the slow EMA as a discretized sequence of phases: within each phase, the teacher is held fixed, and between phases it is refreshed to absorb the latents most recently acquired by the student. The argument then proceeds by induction over the levels. Phase $0$ reduces data2vec training to the token-level masked-prediction problem of `\Cref{sec:rhm}`{=latex} and recovers the level-$1$ latents at $P\gtrsim vm^3$. Each subsequent phase *lifts* the prediction problem onto level-$\ell$ tuples, where it coincides with the cousin-tuple clustering problem of `\Cref{sec:latent-clustering}`{=latex} -- with the same sample complexity. The full non-root hierarchy is therefore recovered at this scale. We now formalize this argument through two assumptions.

*(A1) Targets carry the learned latents.* Let $z_i^{(\ell)}:=h^{(\ell)}_{\lceil i/s^\ell\rceil}$ be the level-$\ell$ ancestor of position $i$, and let $e_z\in\mathbb R^v$ denote the one-hot encoding of a latent symbol $z$. If the encoder linearly represents the latents $h^{(1)},\ldots,h^{(\ell)}$, then the teacher target admits a decomposition $$Y_i(x)
    \;=\;
    F_i(S)
    \;+\;
    \sum_{a=0}^{\ell}
        B_a\,e_{z_i^{(a)}}
    \;+\;
    {\rm residual},$$ where $F_i(S)$ collects components determined by the visible input and the matrices $B_a$ are non-zero linear projections. The residual term collects everything else, which is typically nonlinear and not used in our analysis. The residual architecture of the transformer makes this natural: a feature decodable from one block propagates through the identity path to every later block, and hence appears with a non-zero linear coefficient in the layers entering the top-$K$ teacher average. At initialization, $\ell=0$ and the only ancestor is $x_i$ itself, so $Y_i(x)$ contains a linear component of the masked one-hot inputs.`\looseness=-1`{=latex}

*(A2) Correlation learning.* Whenever a correlation between the prediction target and a feature of the visible input becomes detectable above sampling noise, gradient descent extracts that feature. This is the same correlation-learning hypothesis used in previous RHM analyses [@cagnetta2024towards; @favero2025compositional], where it is supported empirically and by one-step gradient calculations.

#### Phase 0.

The population minimizer of the data2vec loss is the conditional expectation $\mathbb E[Y_i(x)\mid S]$. By linearity of conditional expectation, in phase $0$, predicting $Y_i(x)$ from $S$ reduces, up to the visible component and the non-linear residual, to predicting the masked one-hot $e_{x_i}$ from its visible context. This is exactly the token-level masked-prediction problem analyzed in `\Cref{sec:rhm}`{=latex}: local token--token correlations become detectable above $P\gtrsim vm^3$, and by (A2) the network extracts them. The resulting representation collapses synonymous level-$0$ tuples onto a common image, and the level-$1$ ancestor becomes linearly decodable from the encoder.

#### Phase $\ell\geq 1$.

Suppose phases $0,\ldots,\ell-1$ have produced linearly decodable representations of $h^{(1)},\ldots,h^{(\ell)}$ inside the encoder. After the next teacher update, these latents enter the target, and (A1) gives the new component $\mathbb E[e_{z_i^{(\ell)}}\mid S]$ in the optimal predictor. Because levels $1,\ldots,\ell$ are decodable, the visible context $S$ can be parsed into level-$\ell$ symbols away from the masked position. In particular, it contains the level-$\ell$ cousin tuple $T_\ell$, the closest level-$\ell$ tuple sitting outside the production rule of position $i$. Conditioning on $T_\ell$ produces the context vector $\phi_\ell(\nu)=\mathbb{E}[e_{z_i^{(\ell)}}\mid T_\ell=\nu]$ studied in `\Cref{sec:latent-clustering}`{=latex}. The prediction problem at scale $\ell$ has *lifted* from leaves to level-$\ell$ tuples, and is now identical to the clustering problem solved by the ILC algorithm of that section, requiring $P\gtrsim vm^3$. When the network extracts this signal, the level-$(\ell+1)$ ancestor $z_i^{(\ell+1)}$ becomes linearly decodable, and the next teacher update carries it into the target.

The induction continues up to $\ell=L-2$. Every phase shares the same sample complexity. Since the same training set is reused across phases, the offline sample complexity equals the per-phase threshold, $P_{\rm d2v} \;\asymp\; vm^3$, and depth enters only through the number of phases, not through the per-phase requirement.

![**Data2vec clusters synonyms.** Synonym clustering score $\mathcal C_\ell$ of the encoder at $\ell\in\{1,2,3\}$, for $L=4$, $s=2$, $v=16$. A positive score means synonyms are mapped closer together than non-synonyms. *Insets:* raw curves vs. $P$. *Main:* same curves rescaled by $vm^3$ collapse.](figs/d2vec_clustering_scores.png){#fig:d2vec-clustering width="85%"}

#### The encoder internally clusters synonyms.

Our theoretical argument predicts that the data2vec encoder collapses synonymous level-$\ell$ tuples onto a common representation -- the very geometric signature of clustering. We test this prediction directly by measuring the *synonym clustering score* $\mathcal C_\ell$ of the encoder at level $\ell$ by comparing the change in its hidden representations under two interventions on the input: a *synonym swap*, which resamples every level-$\ell$ production rule and thus replaces each level-$(\ell-1)$ tuple by a synonym while leaving higher-level latents unchanged, and a *generic swap* to an unrelated input. The score is normalized so that $\mathcal C_\ell=0$ when synonym swaps move the representation as much as generic swaps (no clustering), and $\mathcal C_\ell=1$ when synonyms are mapped to a common image (perfect clustering). `\Cref{fig:d2vec-clustering}`{=latex} reports $\mathcal C_\ell$ at each non-root level after the third block of the encoder as a function of pretraining samples $P$. The scores at all three levels rise from zero and collapse onto a common curve once the sample axis is rescaled by $vm^3$. This is direct evidence that data2vec's encoder implements the same recursive clustering mechanism as the ILC algorithm of `\Cref{sec:latent-clustering}`{=latex}, with the same per-level $vm^3$ sample complexity predicted by the theory.`\looseness=-1`{=latex}

# Conclusion

We have proved that learning from one's own latents recovers the full non-root latent tree of the Random Hierarchy Model from a number of samples scaling as $m^3$, exponentially fewer than the $m^{L+1}$ required by token-level self-supervised objectives. We confirmed this prediction with a hierarchical clustering algorithm, an end-to-end neural network of predictor-clusterer modules, and the first sample-complexity analysis of data2vec. The main lesson is that latents at the same level of the hierarchy are far more correlated with each other than they are with raw tokens, so predicting from one's own latents amplifies a signal that token-level prediction dilutes. This places on a firm quantitative footing the common intuition that token-level prediction is suboptimal.

#### Practical implications.

Recent work supports that empirical neural scaling laws in language are governed by power-law decays of token correlations with context length [@cagnetta2024towards; @cagnettaDerivingNeuralScaling2026; @favero2025compositional]. Can generative models trained with own-latent self-supervision systematically beat existing scaling laws, alone or in combination with token-level losses? A useful first step would be a controlled comparison between data2vec and a same-architecture next-token-prediction baseline as the training-set size $P$ is varied: at small $P$ the two should diverge sharply, while at very large $P$ they should converge to the same latent representations. Such a comparison would empirically test our central prediction and, if confirmed, motivate latent-supervised generative models as a route to break current scaling regimes.

#### Biological plausibility.

As we show in Appendix `\ref{app:sec:slc_local_rules}`{=latex}, the predictor-clusterer modules of `\Cref{sec:neuralclustering}`{=latex} can also reach the $m^3$ scaling of sample complexity using only quasilocal learning rules -- stop-gradients between modules in place of end-to-end backpropagation. In this local learning setting our method is conceptually similar to contrastive predictive coding [@oordRepresentationLearningContrastive2019a] and CLAPP [@illingLocalPlasticityRules2021], which both propose biologically plausible self-supervised objectives that predict in latent rather than raw-sensory space. This raises the question of whether the remarkable sample efficiency of the human brain stems from a similar mechanism. Very recently, CLAPP was trained on the RHM in a setting where information on the top root label was given, recovering the supervised learning sample complexity $m^L$ -- see `\cref{sec:app:further_related_works}`{=latex} for details. It would be interesting to test whether such bio-plausible architectures can reach a $m^3$ sample complexity on the RHM simply by learning from their own latent.

#### Limitations.

The RHM is a deliberately simple synthetic language: its context-free grammar has a fixed tree topology and unambiguous, non-recursive production rules. Extending the analysis to (i) variable tree topologies, (ii) recursive production rules that can call themselves, and (iii) explicitly context-dependent rules will bring the theory closer to natural language and images. Such extensions are likely important to characterize precisely the relative strengths and weaknesses of various architectures. Although our analysis suggests that stacking JEPAs as in H-JEPA is largely redundant, very recent self-supervised architectures claim explicit hierarchical supervision but are neither stacked JEPAs nor stacked data2vecs, and instead resemble data2vec-style training procedures with multi-scale teachers [@loweSelfdistillationHiddenLayers2026a; @maesLeWorldModelStableEndtoend2026]. These architectures would be interesting to study theoretically. More broadly, developing a suite of synthetic data models will provide a laboratory in which new architectures can be designed, tested, and theoretically understood before being scaled up.

\begin{ack}
We thank Francesco Cagnetta, Francesco D'Amico, Ariane Delrocq, Antonio Sclocchi, and Wu Zihan for discussions and feedback on the manuscript, and the members of the Simons collaboration for discussions. D. J. Korchinski acknowledges financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC PDF - 587940 - 2024). A. Favero acknowledges support from the Infosys-Cambridge AI Centre. This work was supported by the Simons Foundation through the Simons Collaboration on the Physics of Learning and Neural Computation (Award ID: SFI-MPS-POL00012574-05), PI  Wyart.
\end{ack}
\newpage 
\bibliographystyle{unsrtnat}
\newpage 
\appendix

# Further related work`\label{sec:app:further_related_works}`{=latex}

#### Approximation- and information-theoretic bounds.

Deep networks can represent hierarchically compositional functions with exponentially fewer parameters than shallow ones [@poggio2017why], which translates into nonparametric sample-complexity bounds polynomial in the input dimension for both supervised [@schmidt2020nonparametric] and self-supervised [@mei2024unets] tasks. These results characterize expressivity and information-theoretic statistics, but do not describe what gradient descent can actually learn from a finite training set.

#### Supervised learning.

@mossel2016deep introduced phylogeny-inspired generative models on regular trees, and @malach2018provably showed that they can be learned by an iterative clustering algorithm. @degiuli2019random introduced languages with random production rules. The RHM [@cagnetta2024deep] combines both assumptions, leading to models with controllable correlations where sample complexity can be predicted and compared to observations in deep nets. @ren2026provable proved rigorously that result for multi-step gradient descent. @parley2026deep obtains sample complexity for non-regular trees. @defilippis2026optimal derive optimal scaling laws for hierarchical multi-index models, a related class of generative models with continuous latents.

#### Self-supervised learning.

A complementary line of work has shown that transformers can approximately implement parsing-style (inside-algorithm) inference on context-free grammars [@allenzhu2024cfg; @zhao2023transformers; @garnierbrun2024transformers]. These results interpret the algorithm that trained LLMs implement, but do not characterize the learning mechanism nor its sample complexity. Such analyses do exist for the RHM, but only for next-token prediction [@cagnetta2024deep] and for diffusion [@favero2025compositional; @favero2025bigger], not for learning from one's own latents as considered here.

#### Clustering-based approaches to self-distillation.

Our SLC method is similar in spirit to clustering or self-learned pseudo-labeling-based methods for image-representation learning, such as DeepCluster [@caronDeepClusteringUnsupervised2018] which in turn evolved into SwAV [@caronUnsupervisedLearningVisual2020], the DINO family of models [@caronEmergingPropertiesSelfsupervised2021; @oquabDINOv2LearningRobust2024; @simeoniDINOv32026], iBOT [@zhouIBOTImageBERT2022], and CAPI [@darcetClusterPredictLatents2025]. Our method adds *explicit* hierarchical clustering -- it would be interesting to test to what degree *implicit* hierarchical supervision is present for these clustering-based approaches.

#### Explicitly hierarchical self-distillation SSL.

Schematic realizations of the H-JEPA [@lecun2022path], V-JEPA 2.1 [@mur-labadiaVJEPA21Unlocking2026], Bootleg [@loweSelfdistillationHiddenLayers2026a], and our SLC architectures are depicted in `\cref{fig:hjepa_bootleg_schematic}`{=latex}. Bootleg and V-JEPA 2.1 are extremely recent, report state of the art performance, and employ multi-loss optimization with one prediction per sampled representation level. Bootleg is schematically very similar to data2vec (cf. `\cref{fig:d2vec_online_class}`{=latex}--**left**), except that in bootleg gradients are computed after averaging losses over representations, whereas in data2vec the averaging is done first over representations. V-JEPA 2.1 meanwhile collates representations from throughout the student network, similar to H-JEPA, but also allows low-level target predictions to be influenced by higher level latents. This high$\rightarrow$low prediction path is missing in H-JEPA, but present in SLC, V-JEPA 2.1, and Bootleg. It is important to clarify whether this is related to the fact that, with few exceptions [@liHiTJEPAHierarchicalSelfsupervised2025; @girgisHierarchicalJEPAMeets2026], \`\`standard" H-JEPA appears to have found limited application in the literature.

#### Predictive coding and the RHM.

@delrocqSelfsupervisedLocalLearning2026 recently showed that a backprop-free predictive coding model, CLAPP, can learn the RHM's hierarchical structure with sample complexity comparable to that of backprop. In that work, the model was trained on the RHM in the setting where all production rules exist, such that all sentences belong to the language (corresponding to $f=1$ in our notations). In that case, self-supervised methods cannot learn the RHM rules, as all inputs are equiprobable and correlations within the input are absent. However, the variation of CLAPP considered in that work uses a contrastive update rule that is gated by an extra learning signal: updates are given a $\pm 1$ sign indicating whether pairs belong to the same top-level class. The sample complexity obtained was that of supervised learning $\sim m^L$. Baseline CLAPP [@illingLocalPlasticityRules2021] does not rely on this non-local context supervision -- it would be interesting to understand whether it learns the RHM rules in sample complexity $m^3$ when $f<1$.

\clearpage

<figure id="fig:hjepa_bootleg_schematic" data-latex-placement="p">

<figcaption><strong>Top left:</strong> Structure of losses and targets for naïve H-JEPAs. <strong>Top right:</strong> Structure of losses and targets in V-JEPA 2.1. <strong>Bottom left:</strong> Structure of losses and targets for Bootleg. <strong>Bottom right:</strong> A schematic implementation of Stacked Latent Clustering, in which high-level representations are grounded in predicted lower-level representations.</figcaption>
</figure>

\clearpage

# Proof of Iterative Latent Clustering recovery {#app:clustering-proof}

This appendix proves the recovery guarantee for Iterative Latent Clustering. Throughout the proof, probabilities are taken over fresh samples from a fixed grammar instance, unless an expectation over grammars is explicitly denoted by $\mathbb E_{\mathcal G}$.

## Formal assumptions

We use the following two assumptions. The first is a typical-event assumption on the random grammar. The second is an algorithmic stability assumption on the clustering module.

\begin{assumption}[Balanced and separated grammar]\label{ass:app-balanced-separated}
There exist constants \(c_{\rm bal},C_{\rm bal},c_{\rm sep}>0\), independent of \(v,m,L\), such that for every level \(\ell=0,\ldots,L-2\), every grammatical tuple \(\nu\in\mathcal S_\ell\) satisfies
\[
    \frac{c_{\rm bal}}{vm}\leq \mathbb P[T=\nu]\leq \frac{C_{\rm bal}}{vm},
\]
and the parent context vectors satisfy
\[
    \min_{a\neq b}\|\Phi_{\ell,a}-\Phi_{\ell,b}\|_2\geq c_{\rm sep}\frac{\sqrt{1-f}}{m}.
\]
\end{assumption}
\begin{assumption}[Stable clustering module]\label{ass:app-stable-clustering}
Let \(\{z_\nu:\nu\in\mathcal S_\ell\}\subset\mathbb R^v\) be points with true centers \(\{\Phi_{\ell,a}:a\in\mathcal V_{\ell+1}\}\). Suppose that
\[
    \|z_\nu-\Phi_{\ell,\mathrm{par}_\ell(\nu)}\|_2\leq \varepsilon
\]
for all \(\nu\in\mathcal S_\ell\), and that
\[
    \|\Phi_{\ell,a}-\Phi_{\ell,b}\|_2\geq \Delta
\]
for all \(a\neq b\). If \(\varepsilon\leq \Delta/8\), then \(\mathsf{Cluster}_v(\{z_\nu\}_{\nu\in\mathcal S_\ell})\) returns the true partition \(\mathcal S_\ell=\bigsqcup_{a\in\mathcal V_{\ell+1}}\mathcal S_{\ell,a}\), up to a permutation of labels.
\end{assumption}

Assumption `\ref{ass:app-balanced-separated}`{=latex} is a high-probability random-grammar event. Balancedness follows from asymptotic uniformity of single-symbol marginals in the limit $v\to\infty$ at fixed $f\in(0,1)$, see Lemma C.1 of [@parley2026deep]. Indeed, if $\nu\in\mathcal S_{\ell,a}$, then $\nu$ is generated with probability $1/m$ conditional on its parent $a$. Therefore $$\mathbb P[T^{(\ell)}=\nu]
    =
    \frac1m\mathbb P[h^{(\ell+1)}=a]
    \sim
    \frac1{vm}.$$ The separation scale is the latent-latent correlation scale. Indeed, writing $$\psi_\ell(\nu):=\phi_\ell(\nu)-\mathbb E[e_Y],$$ the connected covariance column is $$C_\ell(:,\nu)=\mathbb P[T=\nu]\psi_\ell(\nu).$$ For $\ell=0$, $T$ is a grammatical visible $s$-tuple and $Y$ is a visible token in a sibling branch, with lowest common ancestor two levels above the leaves. This is precisely the token--tuple covariance computed in @cagnetta2024towards. In our notation, their calculation gives, for a fixed visible value $y$ and a grammatical tuple $\nu$, $$\mathbb E_{\mathcal G}\!\left[C_0(y,\nu)^2\,\middle|\,\nu\in\mathcal S_0\right]\asymp \frac{1-f}{v^3m^4}.$$ Summing over the $v$ possible values of $Y$ yields the squared column norm $$\mathbb E_{\mathcal G}\!\left[\|C_0(:,\nu)\|_2^2\,\middle|\,\nu\in\mathcal S_0\right]\asymp v\cdot \frac{1-f}{v^3m^4}=\frac{1-f}{v^2m^4}.$$

The same estimate holds at every latent level by self-similarity of the RHM. Indeed, fix $\ell>0$ and consider the sequence of level-$\ell$ variables $$H^{(\ell)}:=\big(h^{(\ell)}_1,\ldots,h^{(\ell)}_{s^{L-\ell}}\big).$$ Marginally, $H^{(\ell)}$ is generated by an RHM of depth $L-\ell$ with the same local parameters $(s,v,m)$, using the production rules at levels $\ell,\ell+1,\ldots,L-1$. In this truncated RHM, the level-$\ell$ variables play the role of visible tokens. Our statistic $C_\ell(:,\nu)$ is therefore exactly the level-$0$ token--tuple covariance of this truncated model: $T$ is a terminal $s$-tuple, $Y$ is a terminal child in a sibling branch, and their lowest common ancestor is again two levels above the terminal scale. Since the random rule ensemble has the same law at every level, the same covariance calculation gives $$\mathbb E_{\mathcal G}\!\left[\|C_\ell(:,\nu)\|_2^2\,\middle|\,\nu\in\mathcal S_\ell\right]\asymp \frac{1-f}{v^2m^4},$$ uniformly for $\ell=0,\ldots,L-2$. Since $$C_\ell(:,\nu)=\mathbb P[T=\nu]\psi_\ell(\nu)$$ and, on the balancedness event, $\mathbb P[T=\nu]\asymp 1/(vm)$, this implies $$\mathbb E_{\mathcal G}\!\left[\|\psi_\ell(\nu)\|_2^2\,\middle|\,\nu\in\mathcal S_\ell\right]\asymp \frac{(1-f)/(v^2m^4)}{1/(v^2m^2)}=\frac{1-f}{m^2}.$$

Thus $\sqrt{1-f}/m$ is the natural scale of the centered context-vector signal.

To turn this into a separation statement, we need to assume that different parent context vectors have generic relative positions, as expected for independent random production rules. In particular, under a random-direction heuristic, overlaps are smaller than squared norms by a factor $v^{-1/2}$, so pairwise distances have the same scale as the individual norms.

## Theorem and proof

\begin{theorem}[Recovery of the non-root hierarchy]\label{thm:iterative-latent-clustering-formal}
Assume the balancedness and separation conditions above, and assume the clustering is stable under perturbations as described. Then there exists a constant \(C>0\), depending only on the constants in those assumptions and on \(s\), such that the following holds. If
\begin{equation}
    P
    \geq
    C
    \left[
        vm\log\frac{Lvm}{\delta}
        +
        \frac{v m^3}{1-f}
        \log\frac{Lvm}{\delta}
    \right],
\end{equation}
then the iterative latent-clustering algorithm recovers
\(
    h^{(1)},h^{(2)},\ldots,h^{(L-1)}
\)
and all production-rule classes below the root with probability at least \(1-\delta\), up to independent permutations of the latent labels at each level.

In particular, up to logarithmic factors,
\(
    P_{\mathrm{ILC}}
    \asymp
    (1-f)^{-1} \, v m^3.
\)
\end{theorem}

#### Synonyms have identical context vectors.

We first record the property that makes clustering possible.

\begin{lemma}[Synonyms have identical context vectors]\label{lem:app-synonyms}
If \(\nu,\nu'\in\mathcal S_\ell\) satisfy \(\mathrm{par}_\ell(\nu)=\mathrm{par}_\ell(\nu')\), then \(\phi_\ell(\nu)=\phi_\ell(\nu')\).
\end{lemma}

::: proof
*Proof.* Let $A$ be the level-$(\ell+1)$ parent of the branch containing $T$, let $B$ be the level-$(\ell+1)$ parent of the sibling branch containing $Y$, and let $G$ be their shared level-$(\ell+2)$ parent. The local graphical structure is $G\to(A,B)$, $A\to T$, and $B\to Y$. By unambiguity, observing $T=\nu$ determines $A=\mathrm{par}_\ell(\nu)$. By the context-free Markov structure, conditioned on $A$, the rest of the tree is independent of which production rule was used to generate $T$. Hence, if $\nu\in\mathcal S_{\ell,a}$, $$\mathbb P[Y=y\mid T=\nu]=\mathbb P[Y=y\mid A=a].$$ The right-hand side depends on $\nu$ only through its parent $a$. Therefore all tuples in the same synonym class have the same context vector. ◻
:::

#### Concentration of empirical context vectors.

Fix a level $\ell$ and a grammatical tuple $\nu\in\mathcal S_\ell$. Let $N_\nu$ be the number of samples in which the chosen cousin tuple equals $\nu$, and let $Y_1,\ldots,Y_{N_\nu}$ be the corresponding target values. Conditional on $T=\nu$, these are independent draws from the conditional law of $Y$, and the empirical context vector $$\widehat\phi_\ell(\nu)=\frac1{N_\nu}\sum_{q=1}^{N_\nu} e_{Y_q}$$ is the empirical mean of $N_\nu$ independent one-hot vectors with mean $\phi_\ell(\nu)=\mathbb E[e_Y\mid T=\nu]$.

\begin{lemma}[Context-vector concentration]\label{lem:app-context-concentration}
There is a constant \(C_0>0\) such that, for any \(\eta\in(0,e^{-1}]\),
\[
    \mathbb P\!\left[\|\widehat\phi_\ell(\nu)-\phi_\ell(\nu)\|_2>C_0\sqrt{\frac{\log(1/\eta)}{N_\nu}}\,\middle|\,N_\nu\right]\leq \eta.
\]
\end{lemma}

::: proof
*Proof.* Condition on $N_\nu=n\geq1$. Define $$U_q:=e_{Y_q}-\phi_\ell(\nu),
    \qquad
    \bar U:=\frac1n\sum_{q=1}^n U_q.$$ Then $\mathbb E[U_q]=0$ and $$\widehat\phi_\ell(\nu)-\phi_\ell(\nu)=\bar U.$$ Since the $U_q$'s are independent and mean zero, the cross terms vanish, and $$\begin{aligned}
    \mathbb E\|\bar U\|_2^2
    &=
    \mathbb E\left\|
        \frac1n\sum_{q=1}^n U_q
    \right\|_2^2 \\
    &=
    \frac1{n^2}\sum_{q=1}^n \mathbb E\|U_q\|_2^2 \\
    &\leq \frac1n,
\end{aligned}$$ where we used $$\mathbb E\|e_Y-\phi_\ell(\nu)\|_2^2
    =
    1-\|\phi_\ell(\nu)\|_2^2
    \leq 1.$$ Thus, $$\mathbb E\|\bar U\|_2\leq n^{-1/2}.$$

Now consider the function $$F(Y_1,\ldots,Y_n):=\|\bar U\|_2.$$ Changing one observation changes $\bar U$ by at most $\sqrt2/n$, and hence changes $F$ by at most $\sqrt2/n$. By McDiarmid's bounded-differences inequality, $$\mathbb P\!\left[
        F-\mathbb EF>t
    \right]
    \leq
    \exp(-n t^2).$$ Taking $$t=\sqrt{\frac{\log(1/\eta)}{n}},$$ we get, with probability at least $1-\eta$, $$F
    \leq
    \frac1{\sqrt n}
    +
    \sqrt{\frac{\log(1/\eta)}{n}}.$$ Since $\eta\leq e^{-1}$, we have $\log(1/\eta)\geq1$, so the first term is absorbed into the second. Thus, for a constant $C_0$, $$\|\widehat\phi_\ell(\nu)-\phi_\ell(\nu)\|_2
    =
    \|\bar U\|_2
    \leq
    C_0\sqrt{\frac{\log(1/\eta)}{n}}$$ with probability at least $1-\eta$. ◻
:::

#### Proof of the theorem.

At each level $\ell$, let $T_\ell$ be the level-$\ell$ tuple to be clustered and $Z_\ell$ the level-$\ell$ target in the sibling branch. For sample $p$, denote the corresponding true variables by $T_\ell^{(p)}$ and $Z_\ell^{(p)}$

We now prove `\Cref{thm:iterative-latent-clustering-formal}`{=latex}.

::: proof
*Proof of `\Cref{thm:iterative-latent-clustering-formal}`{=latex}.* We first define a good event using the true latent variables. For every level $\ell=0,\ldots,L-2$ and every grammatical tuple $\nu\in\mathcal S_\ell$, let $$N_{\ell,\nu}:=\sum_{p=1}^P \mathbf 1\{T_\ell^{(p)}=\nu\}$$ be the number of samples in which the true level-$\ell$ cousin tuple equals $\nu$, and let $$\widetilde\phi_\ell(\nu):=
    \frac{1}{N_{\ell,\nu}}\sum_{p=1}^P e_{Z_\ell^{(p)}}\mathbf 1\{T_\ell^{(p)}=\nu\}$$ be the corresponding oracle empirical context vector.

By balancedness, $$\mathbb P[T_\ell=\nu]\geq \frac{c_{\rm bal}}{vm}.$$ Thus $N_{\ell,\nu}$ is binomial with mean at least $c_{\rm bal}P/(vm)$. A Chernoff bound and a union bound over all $(L-1)vm$ pairs $(\ell,\nu)$ imply that, if $$P\gtrsim vm\log\frac{Lvm}{\delta},$$ then, with probability at least $1-\delta/2$, $$N_{\ell,\nu}\gtrsim \frac{P}{vm}$$ simultaneously for all $\ell$ and $\nu$.

On this count event, Lemma `\ref{lem:app-context-concentration}`{=latex}, applied with $\eta=\delta/(2Lvm)$, and another union bound give $$\max_{\ell,\nu}
    \|\widetilde\phi_\ell(\nu)-\phi_\ell(\nu)\|_2
    \lesssim
    \sqrt{\frac{vm}{P}\log\frac{Lvm}{\delta}}$$ with probability at least $1-\delta/2$. Therefore, if $$P\gtrsim \frac{vm^3}{1-f}\log\frac{Lvm}{\delta},$$ then, with probability at least $1-\delta$, $$\|\widetilde\phi_\ell(\nu)-\phi_\ell(\nu)\|_2
    \leq
    \frac18 c_{\rm sep}\frac{\sqrt{1-f}}{m}$$ for all $\ell$ and all $\nu\in\mathcal S_\ell$. Call this simultaneous event $\mathcal E$.

We now show that on $\mathcal E$, the algorithm succeeds deterministically. At level $0$, the variables are visible tokens, so $\widehat h^{(0)}=h^{(0)}$. Suppose inductively that level $\ell$ has been recovered exactly, up to a permutation of labels. Then the recovered level-$\ell$ tuples are the true level-$\ell$ tuples, up to relabeling. Consequently, the empirical context vectors computed by the algorithm are exactly the oracle vectors $\widetilde\phi_\ell(\nu)$, up to permutations of tuple labels and output coordinates. These permutations preserve distances.

Thus, on $\mathcal E$, every empirical context vector lies within $$\frac18 c_{\rm sep}\frac{\sqrt{1-f}}{m}$$ of its true parent center. By separation, $$\min_{a\neq b}\|\Phi_{\ell,a}-\Phi_{\ell,b}\|_2
    \geq
    c_{\rm sep}\frac{\sqrt{1-f}}{m}.$$ The stable clustering assumption therefore implies that the clustering step recovers the true synonym partition $$\mathcal S_\ell
    =
    \bigsqcup_{a\in\mathcal V_{\ell+1}}\mathcal S_{\ell,a}$$ up to a permutation of labels. Assigning each tuple to its cluster recovers level $\ell+1$, again up to a permutation of labels.

Iterating this deterministic induction from $\ell=0$ to $L-2$ recovers $h^{(1)},\ldots,h^{(L-1)}$ and all production-rule classes below the root. Since $\mathcal E$ holds with probability at least $1-\delta$, the theorem follows.

The required sample size is $$P\gtrsim vm\log\frac{Lvm}{\delta}
    +
    \frac{vm^3}{1-f}\log\frac{Lvm}{\delta}.$$ Thus, up to logarithmic factors, $P \gtrsim vm^3/(1-f)$ suffices for recovery, and we write $$P_{\rm ILC}\asymp \frac{vm^3}{1-f}$$ for this threshold. ◻
:::

# Stacked latent-clustering details and additional results`\label{sec:app:slc}`{=latex}

This appendix gives implementation details and additional diagnostics for the SLC network introduced in `\Cref{sec:neuralclustering}`{=latex}. We use the RHM notation of `\Cref{sec:rhm}`{=latex}: true latent variables are denoted by $h^{(\ell)}_i$, while learned SLC tokens are denoted by $\widehat h^{(\ell)}_i$.

## SLC architecture details `\label{sec:app:slc_arch_details}`{=latex}

We encode RHM input strings of size $s^L$ into one-hot feature vectors of dimension $\mathbb{R}^{s^L \times d_h}$, where $d_h \gg v$ is the dimension into which we tokenize learned latents. This initializes the level-$0$ tokens as $\widehat h^{(0)}_i=x_i$. We use a single dimension $d_h$ for the cluster codebook and for the token vocabulary passed between SLC modules: each clusterer assigns labels in $\{1,\ldots,d_h\}$, these labels become the tokens for the next module, and predictors therefore output categorical distributions over $d_h$ token identities.

A SLC model for an RHM of depth $L$ is comprised of $L-1$ SLC modules. Module $\ell$ reduces the learned level-$\ell$ sequence to an encoding of the level-$(\ell+1)$ latents. For a module at level $\ell$, the tensor shapes are $$\begin{array}{rcl}
\widehat h^{(\ell)}
    &\in& \mathbb R^{s^{L-\ell}\times d_h},\\[1mm]
x_u^{(\ell)}
    :=
    \big(\widehat h^{(\ell)}_{(u-1)s+1},\ldots,
    \widehat h^{(\ell)}_{us}\big)
    &\in& \mathbb R^{s\times d_h},\\[1mm]
\widehat\phi^{(\ell)}_u
    =
    \mathrm{Pred}^{(\ell)}(x_u^{(\ell)})
    &\in& \mathbb R^{s\times(s-1)\times s\times d_h},\\[1mm]
\mathrm{vec}\,\widehat\phi^{(\ell)}_u
    &\in& \mathbb R^{s^2(s-1)d_h},\\[1mm]
q_u^{(\ell+1)}
    =
    \mathrm{Clust}^{(\ell)}(\mathrm{vec}\,\widehat\phi^{(\ell)}_u)
    &\in& \Delta^{d_h-1},\\[1mm]
\widehat h^{(\ell+1)}
    &\in& \mathbb R^{s^{L-\ell-1}\times d_h}.
\end{array}$$ Here $u=1,\ldots,s^{L-\ell-1}$ indexes level-$\ell$ patches, and $d_{h_2}$ is the hidden width of the CNN layers below. The dimensions of $\widehat\phi^{(\ell)}_u$ enumerate, respectively, the possible position of the input patch within its grandparent, the target cousin patch, the target token within that patch, and the $d_h$ token identity. This task is easiest for $d_h = v$, as clustering is architecturally enforced, and is hardest for $d_h\ge mv$ (when synonyms can be uniquely encoded without clustering). Unless otherwise noted, we will always use $d_h=mv$.

A SLC module is composed of a predictor and a clusterer. This arrangement is depicted in  `\Cref{fig:slc_schematic}`{=latex}.

<figure id="fig:slc_schematic">

<figcaption><strong>Schematic of a single SLC module, with the layer-wise prediction and clustering losses.</strong></figcaption>
</figure>

<figure>

<figcaption><strong>Multiple predictions improve signal</strong>. We train the predictor and clusterer modules with weight sharing, and train cross-entropy prediction at all possible positions and cousin targets. </figcaption>
</figure>

#### Predictor.

The predictor operates on an $s$-tuple patch and predicts all $s(s-1)$ tokens with the same grandparent (cf. `\Cref{fig:slc_targetting}`{=latex} for a visualization). Using all cousin targets increases the available signal and improves prefactors, although it does not change the asymptotic sample-complexity scaling. Because the same patch can occupy any of the $s$ positions under its grandparent and we use weight sharing across positions, the predictor outputs $s^2(s-1)$ distributions; for each observed patch, only the $s(s-1)$ distributions matching its actual position enter the loss. The predictor is implemented in terms of convolutional layers, as $$\begin{equation}
    \widehat\phi^{(\ell)}(x) = \mathrm{Pred}^{(\ell)}(x) = (\SM\circ \CNN_3 \circ A\circ \CNN_2 \circ A \circ \CNN_1)(x)
\end{equation}$$ where $A = \relu \circ \mathrm{BN}$ denotes the activation function, $\CNN_1 : \mathbb{R}^{s\times d_h} \rightarrow \mathbb{R}^{d_{h_2}}$ is a stride $s$ 1d convolutional layer, while $\CNN_2  : \mathbb{R}^{ d_{h_2}} \rightarrow \mathbb{R}^{d_{h_2}}$ and $\CNN_3 : \mathbb{R}^{d_{h_2}}\rightarrow \mathbb{R}^{s\times (s-1)\times s \times d_h}$ are both stride 1. The softmax operation $\SM$ acts on the last dimension so that $\sum_{a=1}^{d_h} \widehat\phi^{(\ell)}(x)_{r\rho ta} =1$. This last dimension is therefore the marginal probability distribution over neighbouring tokens, conditioned on the input patch position $r$, target cousin-patch index $\rho$, and target token position $t$.

We train the predictor with a cross-entropy prediction loss. Fix a grandparent index $g\in\{1,\ldots,s^{L-\ell-2}\}$ and an input patch position $r\in\{1,\ldots,s\}$ within that grandparent, so that $u=(g-1)s+r$ and the input patch is $x_u^{(\ell)}$. For a target cousin patch position $r'\ne r$, let $\rho_r(r')\in\{1,\ldots,s-1\}$ denote the index of $r'$ among the patch positions excluding $r$. The level-$\ell$ token at target position $t$ inside target patch $r'$ has absolute index $$i(g,r',t):=((g-1)s+r'-1)s+t .$$ For this input patch, the prediction loss is $$\begin{equation}
    \mathcal{L}_\mathrm{pred}(g,r)
    =
    -\sum_{r'\ne r}\sum_{t=1}^s \sum_{a=1}^{d_h}
    \widehat h^{(\ell)}_{i(g,r',t),a}
    \log\!\left(
        \widehat\phi^{(\ell)}(x_u^{(\ell)})_{r,\rho_r(r'),t,a}
    \right).
\end{equation}$$ The predictor loss for module $\ell$ is the mean of this quantity over all grandparent indices $g$ and input patch positions $r$.

#### Clusterer.

The clusterer in turn assigns soft cluster labels to each patch's prediction vector -- this amounts to assigning a learned level-$(\ell+1)$ latent label. $$\begin{equation}
    q^{(\ell+1)} = \mathrm{Clust}^{(\ell)}(\widehat\phi^{(\ell)}) = (\SM \circ \CNN_5 \circ A \circ \CNN_4 )(\widehat\phi^{(\ell)})
\end{equation}$$ where $\widehat\phi^{(\ell)}$ denotes the flattened output of $\mathrm{Pred}^{(\ell)}$, $\CNN_4 : \mathbb{R}^{s^2(s-1)d_h} \rightarrow \mathbb{R}^{d_{h_2}}$ and $\CNN_5 : \mathbb{R}^{d_{h_2}} \rightarrow \mathbb{R}^{d_h}$ are stride-1 CNNs. The soft code $q^{(\ell+1)}$ is the learned token $\widehat h^{(\ell+1)}$ passed to the next module. *A priori* we do not specify that $d_h = v$, so the clusterer could choose to instead assign each synonym of a patch into its own cluster. Indeed, for finite data, differing but synonymous patches will imply different contexts due to finite sampling effects. We address this by using a contrastive clustering loss that penalizes assignments of sufficiently similar predictions into different clusters.

Let $q_i = \mathrm{Clust}^{(\ell)}(\widehat\phi^{(\ell)}_i)$ denote the soft cluster assignment for prediction vector $\widehat\phi^{(\ell)}_i$, where the index $i$ is taken to be over patch positions and batches, and we drop the $^{(\ell)}$ superscript for brevity. The predictions are then centred, so that $\widehat\phi^{(\ell)}_i \rightarrow \widehat\phi^{(\ell)}_i - \langle \widehat\phi^{(\ell)}_j  \rangle_j$. The similarity $S_{ij}$ between two predictions is the scaled cosine similarity $$\begin{equation}
    S_{ij} = \frac{1}{2}\left(1+\frac{\widehat\phi^{(\ell)}_i\cdot\widehat\phi^{(\ell)}_j}{||\widehat\phi^{(\ell)}_i||_2||\widehat\phi^{(\ell)}_j||_2}\right)
\end{equation}$$ of these centred predictions. The clustering loss is then $$\begin{equation}
    \mathcal{L}_\mathrm{clust} = \langle \mathcal{L}_\mathrm{sim}(q_i,q_j,S_{ij}) + \lambda_{\mathrm{sep}}\mathcal{L}_\mathrm{sep}(q_i,q_j,S_{ij})\rangle_{ij} + \lambda_\mathrm{sparsity} \langle \mathcal{L}_\mathrm{sparsity}(q_i)\rangle_i
\end{equation}$$ where the $\langle \cdot\rangle_{ij}$ indicates an average over patch positions and batches. The similarity loss component $$\begin{equation}
    \mathcal{L}_\mathrm{sim}(q_i,q_j,S_{ij}) = (q_i-q_j)\cdot(q_i-q_j)\relu(S_{ij} - S_{m})
\end{equation}$$ serves to penalize predictions that are sufficiently similar (more than the margin $S_m$) but assigned to different clusters (i.e. with non-zero $\Delta q$) -- this loss ensures that clusters consist of similar predictions. Alone, of course, the model could simply cluster ALL predictions into a single cluster. Therefore, we include a separation loss $$\begin{equation}
    \mathcal{L}_{\mathrm{sep}}(q_i,q_j,S_{ij}) = q_i\cdot q_j(1-S_{ij})
\end{equation}$$ which penalizes dissimilar (i.e. large $1-S_{ij}$) predictions that are assigned into overlapping clusters (i.e. $q_i\cdot q_j > 0$). Finally, we introduce $$\begin{equation}
    \mathcal{L}_\mathrm{sparsity}(q_i) = -q_i\cdot q_i
\end{equation}$$ which tends to maximize the $L_2$ norm of the cluster assignment and therefore the sparsity (since $||q_i||_1 = 1$ due to the softmax). Preliminary experiments demonstrated that this sparsity reward improves training stability and downstream interpretability.

A general caveat of employing contrastive losses is that memory costs can explode. Employed naïvely, the first SLC module would compute $s^{2(L-1)}$ comparisons, each of which involves a potentially costly dot-product of a $s^2(s-1)d_h$ vector. In practice for each batch, we select a random subset of $N_\mathrm{compare}$ patch positions and batches on which to compute the clustering loss for memory constraints.

#### Training protocol and stabilizers.

To keep the prediction problem at depth $\ell$ comparable to the prediction problem at depth $0$, we tokenize clusterer outputs by means of a softmax operation, $\SM(X,T) = \exp(X/T)/\sum_i\exp(X_i/T)$. For $T=0$, this is hard label assignment. Preliminary experiments found that $T<1$ can improve convergence, but here we report results for $T=1$.

The clustering codebook dimension $d_h$ is another important stabilizer. If the space of possible labels is large, each input can be assigned a unique label; this is a failure to cluster, because synonyms of a latent are not assigned the same symbol in the next layer. The input vocabulary for layer $\ell+1$ then increases by a factor $m$, making the prediction task for layer $\ell+1$ intractable. This problem is exacerbated by noisy or finite sampling, where synonymous inputs do not produce identical predictions. This can be resolved by an architectural bottleneck, i.e. setting $d_h = v$, as is done for the ILC. For most natural data, this dimension is unknown *a priori*; to simulate these conditions we set the label dimension to $d_h>mv$ (the hardest case) and instead use the contrastive loss to set the scale of attraction between similar predictions.

To mitigate representation collapse, we use a teacher-student framework, in which latent targets are obtained from a teacher network whose weights are simply the exponentially lagged weights of the student. After each step of gradient descent, the teacher weights $W^{(T)}$ are updated towards the student weights $W^{(S)}$ with $W^{(T)} \leftarrow (1-\alpha_\mathrm{ema})W^{(T)} + \alpha_\mathrm{ema} W^{(S)}$, where $1/\alpha_\mathrm{ema}$ sets the exponential moving average timescale. The prediction loss therefore uses teacher representations to obtain targets $x_{:}$, which are evaluated against student predictions. The clustering loss meanwhile, computes the similarity $S_{ij}$ using the teacher's predictions and subsequently uses the student's clustering assignments $q_i$, $q_j$. As an alternative to the teacher-student framework, we highlight in `\Cref{app:sec:slc_local_rules}`{=latex} that collapse can also be overcome by introducing stop-gradients between SLC modules, effectively making learning entirely local, reminiscent to predictive coding. Finally, we perform Jacobian descent, computing the gradient for each loss independently, and aggregate the gradients by means of the UPGrad algorithm. This selects an update in the dual-cone of the gradients to avoid conflicting updates [@quintonJacobianDescentMultiobjective2025], which we found in preliminary experiments reduced the need for careful hyperparameter selection.

Unless otherwise specified, SLC models are trained using AdamW and Jacobian descent (via TorchJD and UPGrad [@quintonJacobianDescentMultiobjective2025]) and following hyperparameters: $\alpha_\mathrm{ema} = 0.015$, $\mathrm{lr}=3\times10^{-3}$, $\mathrm{wd} = 10^{-3}$, $d_{h_2} = 150$, batch size 32, clustering dimension $d_h=128$, $S_m = 0.8$, $\lambda_\mathrm{sep} = 0.5$, $N_\mathrm{compare}=300$, and $\lambda_\mathrm{spars} = 10^{-2}$. These were optimized by hyperparameter sweeps conducted at $L=5$ for $v=16$, $m=8$, $s=3$ using Optuna's Tree-Parzen-Estimator sampler and Hyperband pruner [@akibaOptunaNextgenerationHyperparameter2019].

We use a validation set of 2000 samples to evaluate the model throughout training. When training classifiers on the final SLC representation, we use early stopping to select the best pretraining checkpoint and classification checkpoint based on this validation set performance, but report test set performance on a held-out test set of 2000 samples. In `\Cref{fig:slc_L_scaling}`{=latex} the test sets are smaller, with 1000 samples.

## Training with local rules `\label{app:sec:slc_local_rules}`{=latex}

In `\Cref{fig:slc_stop_grads}`{=latex}, we ablate the global training machinery while also testing increasingly biologically plausible training conditions. We first prevent gradient flow between SLC layers (denoted \`\`Layer SG" for layerwise stop-gradient), then disable the exponential moving average (EMA) teacher network, and finally disable gradient flow between the prediction and clustering submodules. These changes remove the long-range backpropagating error signal and the biologically implausible time-delayed duplicate teacher network. When disabling EMA but allowing gradients to flow from the clustering loss through to the predictor, we observe representation collapse. We credit this to the clustering loss overpowering the prediction loss: trivial predictions are easy to cluster perfectly at the expense of prediction accuracy. Preventing gradient flow between prediction and clustering networks prevents this failure mode, showing that SLC can still learn when updates are constrained to local signals.

![**Local learning suffices to learn the RHM.** We compare baseline training (blue) to increasingly local learning rules. \`\`Layer SG" in legend indicates stopgradients between SLC layers, \`\`$P\rightarrow C$ SG" indicates stopgradient between the predictor and clusterer submodules. EMA in legend indicates exponential-moving-average teacher is used as a target. RHM parameters are: $L=4, v=10, m=10, s=3$. Solid lines are average over three independent realizations. ](figs/ptc_m_sweep_L4_stopgrad_comparison.png){#fig:slc_stop_grads width="50%"}

## Training dynamics

In `\Cref{fig:slc_model_dynamics}`{=latex} we report the layer-wise accuracy and prediction loss of the different SLC modules. We evaluate the accuracy of module $\ell$ by taking the learned hard labels $\widehat h^{(\ell+1)}_i$ (with the $T=0$ hard label assignment) and comparing them with the ground-truth RHM latents $\latent{\ell+1}{i}$. We identify an optimal mapping between the assigned labellings and the true latents with the Kuhn-Munkres \`\`Hungarian" algorithm, and report the accuracy attained with such a mapping.

![**Training proceeds layer-wise.** As each layer's clustering improves, the subsequent layer's prediction loss decreases, enabling that subsequent layer to then cluster.](figs/training_curves_L7_p9797_rep0.png){#fig:slc_model_dynamics width="50%"}

## Independence of sample complexity with L

In `\Cref{fig:kmeans_and_slc_mL_plus_1}`{=latex} we check that the sample complexity scaling for ILC and SLC better supports $m^3$ rather than $m^{L+1}$.

![**Attempted curve collapse with token based SSL scaling.** Data are as in `\Cref{fig:kmeans_and_slc}`{=latex}, but with the $m^{L+1}$ scaling that holds for token level SSL. The collapse of accuracy vs. training sample is significantly degraded. ](figs/clustering_and_ptc_m_sweeps_m_Lplus1_scaling.png){#fig:kmeans_and_slc_mL_plus_1 width="95%"}

Furthermore, in `\Cref{fig:slc_L_scaling}`{=latex} we validate the sample complexity $P \sim m^3$ for systems with varying depth, up to $L=7$.

![**The SLC model exhibits a depth-independent sample complexity.** Accuracy of linear probes trained to classify the root latent $\latent{L}{1}$ from the final SLC representation. RHM parameters are $s=3$, $m=8$, and $v=16$, with depth $L$ indicated by the legend. The pre-trained models were trained for 30 epochs, and the classifiers for the same duration and on the same dataset. Solid lines are an average over three independent RHM instantiations.](figs/ptc_L_scaling.png){#fig:slc_L_scaling width="50%"}

# Further results on data2vec {#app:data2vec}

`\Cref{fig:app:d2vec_levels}`{=latex}--**left** shows the synonym clustering score for encoder layers to latents at different depths. All layers remain invariant to $\ell=L=4$ because there is not signal for which the network can learn the root latent. For latents at level $\ell < L$, we find clustering throughout the middle layers of the network. This indicates that clustering occurs throughout the early layers of data2vec. Furthermore, the fact that for latent $\ell$ encoder layer $\ell-1$ is the first to exhibit clustering of the synonyms indicates that latents are constructed layer-by-layer. Deeper layers exhibit stronger clustering of low-level latents, up to layer 2. This inversion is reminiscent of an encoder-decoder architecture, and mimics what is observed in U-Net architectures [@favero2025compositional].

`\Cref{fig:app:d2vec_levels}`{=latex}--**right** tests the presence of linear traces of latents in the teacher target for different $m$ and $\ell$. This data collapses with $vm^3$, confirming that the transition to all latents being linearly encoded in the teacher occurs at the predicted $vm^3$ scaling.

In `\Cref{fig:d2vec_offline}`{=latex} we test the scaling collapse for data2vec instances trained in the offline setting, thereby decoupling the number of gradient descent steps from the number of original observed samples. We find $vm^3$ provides a good collapse.

![ **Left:** post-training synonym clustering score per encoder layer and RHM level $\ell$ ($m=6$). Values above $0$ indicate that synonymous level-$(\ell-1)$ tuples are clustered closer than non-synonymous ones. **Right:** linear-probe accuracy for the latent $h^{(\ell)}$ from the teacher target, vs. rescaled sample size $P/(vm^3)$, for $\ell=2,3,4$. Curves collapse with $vm^3$ for every $\ell$.](figs/sensitivity_and_latent_probe_combined.png){#fig:app:d2vec_levels width="105%"}

![**Offline data2vec pretraining.** Linear and MLP probe test accuracy on the classification task as a function of the pretraining training-set size $P$. Each point is averaged over three seeds. Main axes: curves rescaled by $vm^3$. Insets: same data, raw $P$.](figs/offline_probe_collapse_vm3.png){#fig:d2vec_offline width=".95\\linewidth"}

\renewcommand{\arraystretch}{1.05}

  **Group**              **Hyperparameter**                  **Value**
  ---------------------- ----------------------------------- --------------------------------------------------------------------------
  Encoder architecture   hidden dim $d_{\text{model}}$       $2048$
                         attention heads $n_h$               $32$
                         head dim                            $64$
                         encoder layers $n_l$                $8$
                         FFN dim $d_{\text{ff}}$             $8192 = 4\,d_{\text{model}}$
                         non-linearity                       GELU
                         dropout                             $0.1$
                         positional embeddings               learned, $T \times d_{\text{model}}$
                         token embeddings                    learned, $V_{\text{tok}}\!=\!v\!+\!2$ ids
                         LayerNorm placement                 pre-LN
  data2vec objective     mask probability                    $0.15$
                         mask span length                    $1$
                         top-$K$ teacher layers averaged     $K = 4$
                         per-layer LayerNorm of targets      yes
                         global LayerNorm of targets         no
                         regression head depth               $1$ linear layer
                         loss                                smooth L1, $\beta = 4$
                         teacher EMA decay $\mu$             $0.99$ (constant; no annealing)
  Optimization           optimizer                           AdamW
                         $(\beta_1, \beta_2)$                $(0.9, 0.98)$
                         weight decay                        $0.01$ (excl. bias / LayerNorm)
                         learning rate                       $\eta = 10^{-4}$ (constant)
                         LR schedule                         none (no warmup, no decay)
                         gradient clipping                   global L2 norm $\leq 1.0$
                         batch size $B$                      $512$
                         training steps $T_{\text{steps}}$   $262\,144$
  Probes                 pooling                             mean over positions
                         linear probe                        `Linear`$(d_{\text{model}}, v)$
                         MLP probe                           $d_{\text{model}}\!\to\!2 d_{\text{model}}\!\to\!v$, GELU, dropout $0.1$
                         probe optimizer                     Adam, lr $10^{-3}$
                         probe steps per eval                $2000$ (each: linear, MLP)

  : Data2vec hyperparameters. {#tab:hyperparams_data2vec_online_m6}

# Compute budget `\label{app:sec:compute}`{=latex}

SLC experiments were conducted on individual H100 nodes. The SLC experiments in `\Cref{fig:kmeans_and_slc}`{=latex} took approximately 10 H100 hours. The $L$ scaling experiment reported in `\Cref{fig:slc_L_scaling}`{=latex} cost approximately 100 H100 hours, as the amount of computations scales linearly with input data size which consists of $s^L$ tokens (and therefore exponentially in depth). `\Cref{fig:slc_stop_grads}`{=latex} cost approximately 3 H100 hours. We do not have an estimate for the Optuna hyperparameter sweeps or preliminary experiments.

The data2vec experiments cost approximately 1,000 H100 hours.

\newpage 

[^1]: Equal contribution

[^2]: Indeed, conditioned on the parent of the tuple, the rest of the tree is independent of which production rule was used to generate the tuple, and the uniform choice of production rules gives $\mathbb P[T^{(\ell)}=\nu]=\mathbb P[T^{(\ell)}=\nu']$ within each synonym class.

[^3]: Notice that since the root is sampled uniformly and each root symbol has $m$ equiprobable rules, the distribution does not reveal how the $vm$ valid top-level tuples are partitioned into $v$ groups of size $m$, one group per root symbol. Thus unsupervised learning can recover $h^{(1)},h^{(2)},\ldots,h^{(L-1)}$ and the support of valid top-level tuples, but not the root labels themselves.

[^4]: The same plots are reported in Appendix `\ref{app:data2vec}`{=latex} for the offline setting.

[^5]: For reference, fully supervised learning of the root from leaves would itself require $\sim vm^L=vm^4$ samples, so the $vm^3$ scaling we observe is in fact better than what supervised learning achieves -- and data2vec uses no labels during pretraining.
