---
abstract: |
  Latent reasoning via continuous chain-of-thought (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable *superposition*: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning with latent CoTs. We investigate this question across three regimes: a *training-free* regime that constructs latent thoughts as convex combinations of token embeddings, a *fine-tuned* regime where a base model is adapted to produce latent thoughts, and a *from-scratch* regime where a model is trained entirely with latent thoughts to solve a given task. Using `\logitlens `{=latex}and entity-level probing to analyze internal representations, we find that only models trained *from scratch* exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: (i) pretraining on natural language data biases models to commit to a token in the last layers, and (ii) model capacity strongly affects which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.`\looseness-1`{=latex}
author:
- |
  Michael Rizvi-Martel$^{1,2}$[^1]`\quad `{=latex}Guillaume Rabusseau$^{1,2}$`\quad `{=latex}Marius Mosbach$^{1,3}$ `\mystrut`{=latex}\
  $^1$Mila -- Quebec AI Institute `\quad `{=latex}$^2$Université de Montréal `\quad `{=latex}$^3$McGill University\
bibliography:
- colm2026_conference.bib
title: The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models
---

```{=latex}
\newcommand{\circnum}[1]{%
  \tikz[baseline=(char.base)]{
    \node[shape=circle, draw, inner sep=1.5pt] (char) {\small #1};
  }%
}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand*\iftodonotes{\if@todonotes@disabled\expandafter\@secondoftwo\else\expandafter\@firstoftwo\fi}
```
```{=latex}
\newcommand{\noindentaftertodo}{\iftodonotes{\noindent}{}}
```
```{=latex}
\newcommand{\fixme}[2][]{\todo[color=yellow,size=\scriptsize,fancyline,caption={},#1]{#2}}
```
```{=latex}
\newcommand{\Fixme}[2][]{\fixme[inline,#1]{#2}\noindentaftertodo}
```
```{=latex}
\newcommand{\note}[4][]{\todo[author=#2,color=#3,size=\scriptsize,fancyline,caption={},#1]{#4}}
```
```{=latex}
\newcommand{\response}[1]{\hrule\textbf{#1:}}
```
```{=latex}
\newcommand{\marius}[2][]{\note[#1]{Marius}{orange!40}{#2}}
```
```{=latex}
\newcommand{\Marius}[2][]{\marius[inline,#1]{#2}\noindentaftertodo}
```
```{=latex}
\newcommand{\mm}[1]{\textcolor{orange}{MM: #1}}
```
```{=latex}
\newcommand{\mariussuggests}[2]{\textcolor{red!20}{\sout{#1}}\textcolor{orange}{mm: #2}}
```
```{=latex}
\newcommand{\addcite}{{\color{red}(add citation) }}
```
```{=latex}
\newcommand{\addcites}{{\color{red}(add citations) }}
```
```{=latex}
\newcommand{\writerest}{{\color{red} ... (complete this) }}
```
```{=latex}
\newcommand{\texttodo}[1]{{\color{red}{todo: #1}}}
```
```{=latex}
\newcommand{\xx}{{\color{red}\textbf{X}}}
```
```{=latex}
\newcommand{\logitlens}{Logit Lens\xspace}
```
```{=latex}
\newcommand{\prob}{{\color{probcolor}p}}
```
```{=latex}
\newcommand{\emb}{{\color{embcolor}e}}
```
```{=latex}
\newcommand{\ctemb}{{\color{embcolor}\tilde{e}}}
```
```{=latex}
\newcommand{\hidden}{{\color{embcolor}h}}
```
```{=latex}
\newcommand{\Eemb}{{\color{matcolor}E}}
```
```{=latex}
\newcommand{\Eunemb}{{\color{unembcolor}U}}
```
```{=latex}
\newcommand{\seq}[1]{\mathbf{#1}}
```
```{=latex}
\def\mystrut{\rule{0pt}{1.1\normalbaselineskip}}
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\newcommand{\MR}[1]{\textcolor{green!50!black}{MR: #1}}
```
`\etocdepthtag.toc{default}`{=latex}

```{=latex}
\ifcolmsubmission
```
```{=latex}
\linenumbers
```
```{=latex}
\fi
```
```{=latex}
\maketitle
```
# Introduction {#sec:introduction}

Chain-of-thought (CoT) reasoning has become the standard approach for tackling complex problems with large language models (LLMs), enabling them to break down problems by reasoning "step-by-step" [@wei2022chain]. Various works have tried to further improve LLM reasoning through methods like self-consistency [@wang2022self], tree-of-thoughts [@yao2023tree], and stream-of-search [@gandhi2024stream], and post-training LLMs on CoT data is now a crucial part of the LLM training pipeline [@o3_system_card_2025; @guo2025deepseek]. Exploring an alternative approach to reasoning, recent work proposed having LLMs reason *directly in latent space* [@hao2024training; @butt2025soft], which has been shown theoretically to be more expressive than discrete CoT [@zhu2025reasoning]. A compelling hypothesis for latent CoT's advantage is *superposition*: models could maintain multiple candidate solutions simultaneously, exploring several reasoning paths before committing to an answer [@zhu2025reasoning]. This would represent a fundamental advantage over discrete CoT, which must commit to a single token at each step. However, there is little empirical evidence that LLMs leverage this capability, motivating the following research question:

::: center
*Does superposition actually occur in latent CoT models?*
:::

We investigate this question across three complementary settings. First, we analyze Soft Thinking [@zhang2025soft], a training-free latent CoT method that creates superposition by computing convex combinations of token embeddings. Second, we examine Coconut [@hao2024training], a method that fine-tunes models to reason with continuous latent thoughts. Lastly, we analyze a from-scratch variant of Coconut, where a small GPT-2-style model is trained entirely with latent thoughts to solve a given task.

Empirically, we make the following contributions. First, using `\logitlens`{=latex} [@nostalgebraist2020logitlens] to probe internal representations, we find that **off-the-shelf LLMs collapse superposed inputs to a single interpretation** within the first few layers: compared with a discrete CoT baseline, entropy profiles are nearly identical. Moreover, replacing a soft token with a standard one changes the representations very little, as measured by KL divergence and cosine similarity. Second, through entity-level probing, we show that **the Coconut model learns to extract answers directly from the question representations**, achieving comparable accuracy *without any latent tokens*. Our belief evolution analysis explains this: we find that models do not leverage step-by-step reasoning during latent computation. Third, when applying the same Coconut analysis to a model trained from scratch, we find that it **indeed shows signs of leveraging superposition**: the model encodes uncertainty between the correct next step and other possible next steps within its latent thoughts.`\looseness-1`{=latex} Together, these results suggest that *training from scratch* is the regime most conducive to developing superposition in models.

Based on these results, we conduct an additional set of experiments aimed at better understanding this discrepancy. We find that the failure of superposition in training-free and fine-tuning approaches is chiefly due to two factors:

1.  **Models trained on next-token prediction learn to commit to a token in the last layers.** We find that across all considered training-free and fine-tuned models, the entropy of the logit distributions drops sharply in the last layers. This drop is much less pronounced in from-scratch models.

2.  **Capacity matters.** Even when trained from scratch, models that are too large are more prone to learning shortcuts.

We believe our work highlights important caveats of current latent thinking methodologies and offers principled guidelines to design the next generation of latent reasoning models.

# Related Work {#sec:related-work}

#### Latent Reasoning.

Many works investigate the use of continuous tokens and latent representations in LLMs. @hao2024training show that fine-tuning LLMs to output a reasoning trace of continuous tokens provides considerable gains on logical reasoning tasks that require search during planning. @zhu2025reasoning popularized the notion of "superposition" by showing theoretically that superposition in the latent state allows Transformer models to solve graph reachability tasks more efficiently. Follow-up work by @butt2025soft proposes a method to train continuous CoTs via reinforcement learning, which achieves performance comparable to discrete CoT on standard math reasoning benchmarks. We base our analysis on the "Soft Thinking" approach by @zhang2025soft, a training-free method that generates latent CoTs from convex combinations of embedding vectors. They report that their method offers a slight improvement on math benchmarks compared to discrete CoT baselines. Also close to our work, @deng2025latent claim to have devised a training scheme which enables superposition in LLMs.

Alternative continuous thinking schemes have also been explored. Many works investigate the use of "filler" or "thinking" tokens: blank tokens which can be used to store intermediate computations [@pfau2024let; @goyal2023think; @herel2024thinking]. Moreover, there is growing interest in "looped layers", a method which trains Transformers with recurrent attention layers [@yang2023looped; @mcleish2025teaching]. These methods also show promise in increasing reasoning abilities through additional latent computation.

#### Interpretability of Reasoning Models.

Understanding the internal representations of models has been a central research topic in NLP, even prior to LLMs [@adi2016fine; @linzen2016assessing; @gulordava-etal-2018-colorless; @belinkov-2022-probing *inter alia*]. Recently, several works investigated how CoT reasoning changes internal computations [@yang2024large; @dutta2024think; @cywinski2025towards]. To the best of our knowledge, no other works have attempted to understand the inner workings of latent CoT models from an interpretability perspective.`\looseness-1`{=latex}

# Background {#sec:method}

![Two latent CoT approaches. **Left:** Coconut feeds the last hidden state directly back as the next input embedding, forming a recurrent loop in continuous space. **Right:** Soft Thinking computes a probability distribution over the vocabulary and forms the next input as a weighted sum of token embeddings. Both methods replace discrete reasoning tokens with continuous representations, but differ in how these representations are constructed.](figures/latent_cot_overview.png){#fig:latent-cot width="0.95\\linewidth"}

Let $\mathcal{M}$ be a transformer language model with $L$ layers and vocabulary $\mathcal{V}$. We denote the *embedding matrix* by $\Eemb \in \mathbb{R}^{|\mathcal{V}| \times d}$, which maps discrete tokens to $d$-dimensional vectors, and the *unembedding matrix* by $\Eunemb \in \mathbb{R}^{|\mathcal{V}| \times d}$, which projects hidden states back to the vocabulary space. For a token $k \in \mathcal{V}$, we write $\emb_k = \Eemb[k]$ for its embedding. Given an input sequence $\seq{x} = (x_1, \ldots, x_n)$, at each position $i$, $\mathcal{M}$ computes a hidden representation $\hidden_i^{(\ell)} \in \mathbb{R}^d$ at each layer $\ell \in \{1, \ldots, L\}$. We start by defining the notion of superposition which is crucial to our analysis:

::: definition
**Definition 1** (Superposition). A model reasons in *superposition* at position $i$ if its hidden state $\hidden_i^{(\ell)}$ encodes, in some basis of computation, a distribution over multiple candidate continuations.
:::

We consider two ways of obtaining superposition throughout this paper: *forced* superposition and *learned* superposition. We define both below:

::: definition
**Definition 2** (Forced superposition). Forced superposition is superposition introduced at the input by explicitly constructing embeddings as convex combinations of token embeddings: $\ctemb = \sum_k \alpha_k \, \emb_k$ with $\alpha \in \Delta^{|\mathcal{V}|-1}$.
:::

::: definition
**Definition 3** (Learned superposition). Learned superposition is superposition that emerges from training a model to use latent thoughts on a task which rewards parallel exploration of reasoning paths.
:::

#### CoT Generation.

We consider a setting where, given an input query, a model generates a CoT followed by a final answer. A sequence consists of *input tokens* $\seq{x} = (x_1, \ldots, x_n)$, *reasoning tokens* $\seq{r} = (r_1, \ldots, r_T)$, and *answer tokens* $\seq{y} = (y_1, \ldots, y_m)$. Generation proceeds autoregressively: at step $t$, the model computes $\prob(r_t \mid \seq{x}, r_1, \ldots, r_{t-1})$ and selects a token according to some decoding strategy (we focus on greedy decoding). In *discrete* CoT, each $r_t \in \mathcal{V}$ is a vocabulary token with embedding $\emb_{r_t}$ fed as input to the next step. The key difference with *latent* CoT is which point in embedding space is used: discrete CoT is constrained to the vocabulary manifold $\{\emb_k : k \in \mathcal{V}\}$, while latent CoT can use any point in embedding space.`\looseness-1`{=latex}

#### Soft Thinking.

Soft Thinking [@zhang2025soft] is a form of *forced* superposition which uses information from the logit distribution to craft the superposition. At each step $t$, instead of selecting a discrete token, the method computes a distribution $\prob_t \in \Delta^{|\mathcal{V}|-1}$ over the vocabulary and forms the embedding $$\ctemb_t = \sum_{k \in \mathcal{V}} \prob_{t,k} \, \emb_k,$$ which lies in the convex hull of vocabulary embeddings. According to @zhang2025soft, this method "naturally preserves a 'superposition' which retains the entire information in each step".
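
As a minimal illustration of the soft embedding defined above, the following Python sketch computes a probability-weighted sum of token embeddings and contrasts it with the greedy discrete choice. The toy embedding matrix, the logits, and the helper names (`softmax`, `soft_embedding`) are assumptions made for this example; the sketch implements only the formula above and omits any other details of the Soft Thinking method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(logits, embedding_matrix):
    """Probability-weighted sum of token embeddings: sum_k p_{t,k} * e_k."""
    probs = softmax(logits)
    dim = len(embedding_matrix[0])
    return [sum(p * row[j] for p, row in zip(probs, embedding_matrix))
            for j in range(dim)]

# Toy vocabulary of 3 tokens with 2-dimensional embeddings (illustrative values).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 2.0, -50.0]  # two equally likely tokens, one negligible token

soft = soft_embedding(logits, E)  # a point inside the convex hull of the rows of E
hard = E[max(range(len(logits)), key=lambda k: logits[k])]  # greedy discrete token
```

Here `soft` lies between the embeddings of the two likely tokens, whereas `hard` is exactly one row of `E`.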

#### Coconut.

Coconut [@hao2024training] takes a different approach: instead of constructing soft embeddings from vocabulary distributions, it feeds the model's own last hidden representation back as the next input embedding, enabling recurrent "reasoning in continuous latent space." The model is trained via a staged curriculum that progressively replaces discrete CoT tokens with continuous latent thoughts. On ProsQA, a synthetic graph-traversal QA task, the authors report that latent tokens encode a breadth-first search (BFS) over the graph, citing logit-lens probing that reveals intermediate entities at latent positions. In `\Cref{sec:coconut}`{=latex}, we revisit these claims and conduct an interpretability analysis on trained Coconut models.
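
The feedback loop described above can be sketched in a few lines of Python. This is only a schematic of the recurrence, not the Coconut implementation: `latent_rollout` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for a transformer forward pass that returns the last hidden state at the final position.

```python
def latent_rollout(step_fn, input_embeddings, num_latent_steps):
    """Append latent thoughts by feeding the last hidden state back as input."""
    seq = list(input_embeddings)
    for _ in range(num_latent_steps):
        last_hidden = step_fn(seq)  # hidden state at the final position
        seq.append(last_hidden)     # reused as the next input; no token is sampled
    return seq

def toy_step(seq):
    """Stand-in for a model forward pass: the mean of all inputs so far."""
    dim = len(seq[0])
    return [sum(v[j] for v in seq) / len(seq) for j in range(dim)]

question = [[1.0, 0.0], [0.0, 1.0]]  # two illustrative input embeddings
rollout = latent_rollout(toy_step, question, num_latent_steps=2)
```

The key point is that the appended vectors never pass through the vocabulary: each latent thought is a continuous vector produced by the model itself.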

#### `\logitlens`{=latex}.

To understand how soft thinking tokens are processed, we employ `\logitlens`{=latex} [@nostalgebraist2020logitlens], a technique for interpreting intermediate LLM computations. Normally, only the final layer representation $\hidden_i^{(L)}$ is projected to vocabulary space via $\prob^{(L)}_i = \mathrm{softmax}(\Eunemb \, \hidden_i^{(L)})$. `\logitlens `{=latex}applies this projection to intermediate representations, yielding $\prob^{(\ell)}_i = \mathrm{softmax}(\Eunemb \, \hidden_i^{(\ell)})$ at any layer $\ell$, revealing how predictions evolve across layers. For soft thinking, since soft tokens are linear combinations of embeddings, using logit lens is well motivated. For Coconut, since weights are shared across models, applying logit lens is justified; it effectively measures cosine similarity between latent thoughts and vocabulary tokens.
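As a minimal sketch of the projection described above, the following code applies an unembedding matrix to a hidden state and computes the Shannon entropy of the resulting distribution. The matrix `U`, the hidden states, and the function names are illustrative assumptions. Practical implementations often apply the model's final normalization layer before unembedding; that step is omitted here for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def logit_lens(hidden, unembedding):
    """Project a hidden state through the unembedding matrix (one row per vocab token)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
    return softmax(logits)

def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy unembedding for a 3-token vocabulary and two hypothetical hidden states.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
early_hidden = [0.1, 0.1]  # weakly committed representation
late_hidden = [8.0, 0.0]   # strongly aligned with token 0

h_early = shannon_entropy(logit_lens(early_hidden, U))
h_late = shannon_entropy(logit_lens(late_hidden, U))
print(h_early, h_late)
```

An entropy close to $\ln |\mathcal{V}|$ indicates an uncommitted layer, whereas an entropy near zero indicates that the layer has effectively settled on a single token.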

# Do Off-the-Shelf Models Reason in Superposition? {#sec:superposition-experiments}

If LLMs are indeed capable of leveraging forced superposition, their internal representations when processing soft thinking tokens should differ meaningfully from discrete tokens, maintaining uncertainty and showing higher entropy at intermediate layers. We test this hypothesis with two experiments: (1) a *side-by-side comparison* examining entropy profiles across layers when using latent CoT vs. discrete CoT, and (2) an *embedding-level intervention* measuring how changing a single token from soft thinking to discrete affects representations.

#### Experimental setup.

We use QwQ-32B (reasoning model) and Qwen2-1.5B (base; results in `\Cref{app:st-details}`{=latex}) [@bai2023qwen]. We perform our analysis on MATH500 [@lightman2023let], AIME2024 [@amc2025aime], and a 500-example subset of the GSM8K test set [@cobbe2021trainingverifierssolvemath]. We apply `\logitlens `{=latex}at 5 evenly spaced layers and focus on presenting QwQ-32B results on MATH500 in the main text. Results on Qwen2-1.5B and additional results concerning AIME2024 and GSM8K are reported in `\Cref{app:st-results}`{=latex}.

## Comparing entropy profiles in latent vs. discrete CoT {#subsec:entropy_profile}

<figure id="fig:logit_lens_analysis">
<figure id="fig:entropy_comparison">
<img src="figures/entropy_comparison_math500.png" />
<figcaption>Entropy comparison (MATH500, N=500)</figcaption>
</figure>
<figure id="fig:kl_layer">
<img src="figures/kl_by_layer_math500.png" />
<figcaption>KL divergence (MATH500, N=500)</figcaption>
</figure>
<figcaption> <strong>Superposition collapses early in the forward pass of QwQ-32B.</strong> <strong>(a)</strong> Shannon entropy shows identical patterns for Soft Thinking (orange) and discrete CoT (blue), both converging to near-zero entropy at the same rate. <strong>(b)</strong> KL divergence drops to <span class="math inline"> ∼ 10<sup>−4</sup></span> in middle layers, showing soft thinking tokens become functionally identical to discrete tokens within the first few layers. The uncertainty in soft thinking token embeddings does not propagate through the network. Grey line represents the KL with a uniform distribution over the vocabulary as baseline. </figcaption>
</figure>

We start by comparing the internal computations of a model using soft thinking to a discrete CoT baseline via `\logitlens`{=latex}. We prompt both setups with each problem, have them generate their respective CoTs, and every 50 decoding steps apply `\logitlens `{=latex}to compute the Shannon entropy of the distribution over $\mathcal{V}$ at selected layers. If the entropy profiles differed significantly, this would be evidence that the soft tokens meaningfully alter the internal computations.

`\Cref{fig:entropy_comparison}`{=latex} shows entropy averaged over all CoT steps and problems. The entropy across layers is nearly *identical* for both approaches, with the same pattern: high entropy in early-to-middle layers collapsing to near-zero at the final layer. This is inconsistent with superposition. If the model truly maintained multiple solutions in parallel, Soft Thinking should exhibit higher entropy throughout; the indistinguishable profiles instead suggest that soft thinking tokens are processed like discrete CoT tokens, collapsing to a single interpretation early in the forward pass.

However, this compares independently generated CoTs that may differ in content. To control for this, we next investigate changing only a *single* token from soft to discrete.

## Intervening with discrete tokens during soft thinking {#subsec:token_intervention}

We perform an intervention experiment to test more directly whether internal representations differ when processing soft vs. discrete tokens. Every 50 steps of a Soft Thinking generation, we run two independent forward passes: one using the usual soft thinking token $\ctemb_t$, and one replacing it with the discrete argmax embedding $\emb_{\mathrm{argmax}\,\prob_t}$. For both, we apply `\logitlens `{=latex}at selected layers and compute the KL divergence between the resulting distributions and the cosine similarity between the resulting hidden representations. Note that only the intervened token changes; previous tokens in the KV cache remain soft thinking tokens.

As can be seen in `\Cref{fig:kl_layer}`{=latex}, the KL divergence remains small (relative to the baseline) across thinking steps and layers, reaching values of at most $10^{-1}$. The KL is largest at the first and last layers. We hypothesize that this is due to embedding differences in the first layers and to minor logit differences in the final layer. `\Cref{tab:soft-thinking-summary}`{=latex} corroborates this: the average cosine similarity between argmax and soft tokens is consistently high. Finally, `\Cref{fig:top_tokens_visualization}`{=latex} (in the Appendix) shows that the top predicted tokens are typically very similar. Moreover, token predictions with high entropy do not seem to encode \"hesitation\" between key entities, but rather show the model hesitating between prepositions or punctuation. Taken together, these results suggest that even at the token level, soft thinking methods produce computations that do not significantly differ from the discrete baseline.

::: {#tab:soft-thinking-summary}
  **Metric**                       **Qwen2-1.5B**        **QwQ-32B**
  ------------------------------ ------------------- -------------------
  Cosine similarity               $0.998 \pm 0.013$   $0.996 \pm 0.025$
  Mixing weight entropy (nats)     $0.10 \pm 0.26$     $0.18 \pm 0.34$

  : Summary of Soft Thinking token intervention metrics. Cosine similarity is averaged across all layers, steps, and problems. Mixing entropy is the Shannon entropy of the convex combination weights $\prob_t$ used to form each soft token.
:::

# Do Trained Models Reason in Superposition? {#sec:coconut}

The previous section showed that off-the-shelf LLMs do not leverage superposition when given superposed inputs. A natural follow-up question is whether models *trained* for latent reasoning behave differently. In order to answer this question, we perform experiments using Coconut, widely regarded as one of the canonical frameworks for latent CoT training. We test both fine-tuned and from-scratch Coconut variants [@hao2024training].

## Fine-Tuned Models {#sec:coconut-finetuned}

We start by studying how fine-tuning pretrained language models to use latent thinking impacts their ability to reason in superposition. Here, we investigate if the model is able to *learn* to encode such a superposition into its latent thoughts.

#### Experimental setup.

Following the methodology of [@hao2024training], we evaluate GPT-2 (124M) on ProsQA, a synthetic graph-traversal QA task requiring multi-hop logical inference over defined relationships (e.g., *\"Every dax is a wug. Every dax is a zug. Every wug is a blicket. Rex is a dax. Is Rex a blicket or a gorple? Blicket.\"*). The CoT baseline is trained with standard CoT supervised fine-tuning. The Coconut model is trained with the staged curriculum of @hao2024training, which progressively replaces CoT steps with continuous latent tokens.

#### Latent tokens are unnecessary to performance. {#sec:coconut-nolatent}

We start by evaluating the trained Coconut model by feeding *only the question* (no latent tokens, no multi-pass recurrence) and greedily decoding the answer. As can be seen in `\Cref{tab:prosqa-accuracy}`{=latex}, the model achieves **96.6% accuracy** under this regime; this suggests that latent thinking accounts for only about 2.4 percentage points of the Coconut model's 99.0% accuracy. We hypothesize that this increase in performance is due to an *echo chamber* phenomenon. For cases where $P(\text{target})$ is lower at step 0, it is possible that the \"latent thinking\" procedure increases the probability. Next, we use probing to investigate the structure of the latent thoughts to better understand what motivates this behavior.`\looseness-1`{=latex}

#### Probing reveals no step-by-step reasoning. {#sec:coconut-evolution}

We probe by projecting hidden states at each reasoning position through the LM head. To better understand how belief evolves, we track the normalized entity distribution across reasoning steps. At each step, graph entities are categorized into four groups: *correct next* (the right answer for the given step), *wrong neighbors* (nodes that are adjacent but do not lead to the correct path), *target* (the final answer), and *other* (entities that are part of the graph but are neither the correct answer nor reachable from the current node). For instance, in the example above, the *correct next* entity would be \"wug\", the *wrong neighbor* would be \"zug\", and the *target* would be \"blicket\". For Coconut, `\Cref{fig:belief-evolution-finetuned}`{=latex} (left) shows the target entity dominating the distribution throughout the entire reasoning process. For CoT (right), the correct-next entity dominates early steps and the target takes over only at the final step, which is the expected signature of step-by-step multi-hop reasoning. This behavior suggests that the Coconut model learned a *shortcut solution*: it first obtains the correct answer in a single forward pass, then copies it over through the latent tokens.
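The entity-level normalization can be sketched as follows, assuming that probability mass is first restricted to graph entities, renormalized, and then summed per category. The function name `entity_belief`, the category labels, and all probability values are illustrative assumptions; the numbers are made up and are not results from our experiments.

```python
def entity_belief(token_probs, categories):
    """Renormalize probability mass over graph entities and sum it per category.

    token_probs: dict mapping token -> probability from the LM-head projection.
    categories:  dict mapping entity -> one of the four group names.
    """
    total = sum(token_probs[entity] for entity in categories)
    belief = {"correct_next": 0.0, "wrong_neighbor": 0.0, "target": 0.0, "other": 0.0}
    for entity, group in categories.items():
        belief[group] += token_probs[entity] / total
    return belief

# Hypothetical categorization for the first step of the running example.
categories = {
    "wug": "correct_next",
    "zug": "wrong_neighbor",
    "blicket": "target",
    "gorple": "other",
}
# Made-up LM-head probabilities; "the" is a non-entity token and is ignored.
probs = {"wug": 0.30, "zug": 0.10, "blicket": 0.05, "gorple": 0.05, "the": 0.50}

belief = entity_belief(probs, categories)
print(belief)
```

Under this normalization, a step-by-step reasoner would place most of its mass on `correct_next` at intermediate steps, while a shortcut solution would place it on `target` from the first step.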

Notably, this synthetic task was explicitly designed by @hao2024training to require parallel exploration of multiple reasoning paths, making the failure to learn superposition particularly striking. Despite this favorable setting, the model converges to a shortcut solution, suggesting that superposition is not a naturally preferred strategy under standard fine-tuning. This casts doubt on whether scaling model size or applying similar training procedures on more complex benchmarks would, by itself, induce superposition-based reasoning. We observe similar behavior on ProntoQA [@saparov2022language]; see Appendix D for details.

![Step-aware entity belief for fine-tuned GPT-2 on 5-step ProsQA examples (normalized among all entities). **Coconut (left)**: target entity dominates from step 0 onward with no progression. **CoT (right)**: correct-next entity dominates early steps, target takes over at the final step, the expected pattern for genuine multi-hop reasoning.](figures/stepwise_finetuned_5step.png){#fig:belief-evolution-finetuned width="0.9\\linewidth"}

```{=latex}
\subcaption{Fine-tuned GPT-2, 12 layers.}
```
  **Condition**                 **Accuracy**
  ---------------------------- --------------
  CoT (discrete, 5-hop)            85.3%
  Coconut (6 latent tokens)        99.0%
  Coconut (no latent tokens)       96.6%

  : ProsQA accuracy under different reasoning conditions and model depths.

```{=latex}
\hfill
```
```{=latex}
\subcaption{Trained from scratch, 8 heads, 768-dim.}
```
  **Layers**    **Accuracy (%), w/ / w/o latent tokens**
  ------------ ------------------------------------------
  2                           94.5 / 13.8
  4                           96.2 / 16.0
  8                           80.7 / 62.8
  12                          61.6 / 63.0

  : ProsQA accuracy under different reasoning conditions and model depths.

`\label{tab:prosqa-accuracy}`{=latex}

## From-Scratch Models {#sec:fromscratch}

![Step-aware entity belief for from-scratch 2-layer models on 4-step ProsQA examples. **From Scratch (left)**: correct-next and wrong-neighbor entities dominate intermediate steps, with belief evolving across the reasoning trace. **CoT $\to$ Coconut (right)**: similar pattern with slightly more concentrated belief on correct-next entities.](figures/stepwise_fromscratch_4step.png){#fig:belief-evolution-fromscratch width="0.9\\linewidth"}

Next, we investigate the use of superposition in models trained from scratch. We train a GPT-2-style Transformer with 8 heads and a 768-dimensional hidden size on a simplified variant of ProsQA with a symbolic tokenizer (40 tokens) [@zhu2025reasoning]. We choose to evaluate a 2-layer model as it is consistent with the theoretical construction and results reported by @zhu2025reasoning. We compare three training regimes: i) from scratch with latent thinking (Coconut); ii) from scratch using cross entropy on the gold CoT, *then* training with Coconut (CoT + Coconut); iii) pure discrete CoT training (CoT). Note that the variant of Coconut used here is different from that of @hao2024training: here, at every step in the problem, models are trained to use $i$ latent thoughts to predict the value of the $i$th node. Crucially, this methodology never trains the model on the entire gold CoT. As in Section `\ref{sec:coconut-finetuned}`{=latex}, we employ probing to understand how belief evolves through the model's latent CoT.

#### Superposition occurs in from-scratch models.

Figure `\ref{fig:belief-evolution-fromscratch}`{=latex} shows belief evolution across thinking steps for from-scratch Coconut variants. For both the Coconut and CoT+Coconut training methods, the models show evidence of leveraging superposition. The state is dominated by the correct-next entity probability but still leaves significant probability to the other potential neighbors. This remains true even when the model is first trained to perform the task with a discrete CoT, suggesting that next-token pretraining is not the sole factor leading to superposition collapse.

#### Latent tokens are necessary to performance.

As can be seen in Table `\ref{tab:prosqa-accuracy}`{=latex}, models trained from scratch do indeed leverage their latent thoughts, contrary to the fine-tuned case. Removing access to latent steps produces significant performance drops (94.5 $\to$ 13.8 for the 2-layer model), thus corroborating the entity belief results.

# Exploring the Limitations of Latent Thinking {#sec:why}

The previous two sections provide evidence that superposition only occurs in a very limited set of scenarios: only from-scratch models manage to leverage superposition in their reasoning process. In this section, we propose an explanation for this discrepancy, based on two complementary phenomena, and provide empirical evidence for it.

#### Models trained on next token prediction commit to a token in the last layers.

`\Cref{fig:attractor}`{=latex} shows the entropy across layers of a forced superposition where all tokens in the combination have uniform weight for Qwen2-1.5B. Uniform weights are chosen to avoid the confound of soft tokens computed from peaky logit distributions. Moreover, we compare the pretrained model to a model with randomly reinitialized weights to separate the effect of the learning process from that of the architecture. Across different numbers of tokens in the mixture, the trend is clear: entropy drops rapidly when reaching the final layers. This contrasts sharply with the random-weights baseline, where the entropy remains high throughout. Given that this phenomenon only occurs in the pretrained model, this suggests that pretraining is the driving factor behind the final-layer entropy collapse. We also note that fine-tuning does not seem to be enough to fix this issue; the last-layer entropy collapse persists when fine-tuning GPT-2 using the Coconut methodology (see Appendix `\ref{app:coconut-logitlens}`{=latex}).

![Entropy of synthetic uniform superpositions across layers (Qwen2-1.5B). $k$ tokens are sampled uniformly at random and combined with equal weights ($1/k$). **Left**: pretrained weights collapse entropy at the final layers regardless of $k$. **Right**: randomly initialized weights preserve entropy. 50 samples per condition; shaded regions show $\pm 1$ std.](figures/attractor_comparison_1x2.png){#fig:attractor width="0.80\\linewidth"}

#### Capacity matters even when training models from scratch.

Figure `\ref{fig:depth_stepwise_entity}`{=latex} shows belief evolution during reasoning steps for different model depths (2, 4, 8, and 12 layers) on the ProsQA task. This ablation over depth reveals that, even when training latent reasoning models from scratch, models with too much capacity show signs of learning shortcuts, with the 8L and 12L models showing $P(\text{target})$ as the leading entity. Table `\ref{tab:prosqa-accuracy}`{=latex} corroborates this: as layer count increases, so does accuracy without latents. Together, these results show that superposition is brittle: unless model capacity is sufficiently constrained, superposition does not emerge as the de facto solution.

![Step-aware entity belief across model depths (4-step ProsQA examples, from-scratch Coconut). Shallow models (2L, 4L) show belief evolving across latent steps, with the correct-next entity dominating intermediate positions. Deeper models (8L, 12L) show no such progression; the \"other\" category dominates throughout and \"target\" appears as of reasoning step 2, suggesting the model is not leveraging a superposition-based solution.](figures/depth_spectrum_stepwise_4step.png){#fig:depth_stepwise_entity width="\\linewidth"}

# Discussion and Conclusion {#sec:conclusion}

In conclusion, our experiments provide consistent evidence against reasoning in superposition across two complementary settings. For Soft Thinking, off-the-shelf LLMs process superposed inputs nearly identically to discrete tokens: entropy profiles match, KL divergences approach zero, and cosine similarities exceed 0.99. For Coconut, a fine-tuned model achieves 96.6% accuracy without any latent tokens, and entity-level probing reveals no step-by-step reasoning during latent computation.

#### Why does superposition collapse in pretrained models?

As we argue above, this is not merely an inductive bias but a consequence of the pretraining objective: autoregressive training optimizes for discrete next-token prediction, creating representations that separate token identities. When presented with a superposed input, the model projects it onto the nearest discrete interpretation, precisely what the training objective rewards. Moreover, the Soft Thinking distribution $\prob_t$ is itself very peaky across steps (see `\Cref{fig:top_tokens_visualization}`{=latex} and the entropy heatmaps in `\Cref{app:st-details}`{=latex}), meaning the input is already near-discrete and there is little superposition to exploit in the first place.

#### Is token-level superposition desirable?

Beyond finding that models do not reason in superposition, we question whether *token-level* superposition is a desirable property in the first place. Many soft thinking tokens combine semantically unrelated items (see `\Cref{fig:top_tokens_visualization}`{=latex}): formatting characters, punctuation, or tokens with similar logits but unrelated meanings. A superposition of \"(\" and \"{\" represents syntactic uncertainty, not exploration of alternative reasoning paths. Meaningful parallel exploration likely requires superposition at a higher level of abstraction (over entire reasoning strategies, not individual tokens), which we consider an interesting avenue for future work.

#### Latent reasoning as flexibility.

Despite our negative findings for current methods, latent reasoning remains a promising direction. The advantage of continuous embeddings may not be superposition but *flexibility*: the ability to express intermediate computations that do not correspond to natural language tokens, avoiding the discretization bottleneck. Future work should investigate whether latent reasoning provides benefits through other mechanisms, such as smoother optimization landscapes or more expressive intermediate representations.

#### Limitations.

It would be valuable to run similar experiments on other latent reasoning approaches, such as the RL-trained continuous CoTs of @butt2025soft or the recurrent layer frameworks proposed by @giannou2023looped [@mcleish2025teaching]. Moreover, it could be interesting for future work to look into layer-wise behavior or extract circuits to derive a fine-grained understanding of the mechanisms learned during latent CoT reasoning. Finally, we hope in future work to apply the findings from this paper to design novel latent thinking methods.

# Acknowledgments {#acknowledgments .unnumbered}

M. Rizvi-Martel's research is supported by NSERC (CGS-D Scholarship). G. Rabusseau's research is supported by NSERC and the CIFAR AI Chair program. We also acknowledge NVIDIA for providing computational resources.

```{=latex}
\bibliographystyle{colm2026_conference}
```
`\etocdepthtag`{=latex}.tocappendix `\appendix`{=latex} `\crefalias{section}{appendix}`{=latex} `\crefalias{subsection}{appendix}`{=latex} `\crefalias{subsubsection}{appendix}`{=latex}

```{=latex}
\pagebreak
```
# Appendix Contents {#appendix-contents .unnumbered}

```{=latex}
\etocsettagdepth{default}{none}
```
```{=latex}
\etocsettagdepth{appendix}{subsection}
```
```{=latex}
\etocsettocstyle{}{}
```
```{=latex}
\tableofcontents
```

------------------------------------------------------------------------

# Disclosure of LLM usage {#app:llm-disclosure}

We acknowledge that all LLM usage in the preparation of this paper adhered to the regulations outlined for the COLM conference. We used `Claude Opus 4.6` only to assist in the implementation, data visualization, and for shortening of text originally written by the authors.

# Soft Thinking: Experimental Details {#app:st-details}

## Models {#app:models}

We use three models spanning two model families:

-   **QwQ-32B**: A 32.5B-parameter reasoning model based on the Qwen2.5 architecture with 64 transformer layers and a hidden dimension of 5120. We use this as our primary model since it was trained for chain-of-thought reasoning.

-   **Qwen2-1.5B**: A 1.5B-parameter base language model with 28 transformer layers and a hidden dimension of 1536. We use this smaller model to test whether our findings generalize across model scales.

-   **DeepSeek-R1-Distill-Llama-70B**: A 70B-parameter reasoning model distilled from DeepSeek-R1 [@guo2025deepseek] into the Llama architecture, with 80 transformer layers and a hidden dimension of 8192. We include this model to verify that our findings extend beyond the Qwen family to a different architecture and scale.

The Qwen models use a shared tokenizer with a vocabulary of 151,643 tokens. DeepSeek-R1-Distill-Llama-70B uses the Llama tokenizer with a vocabulary of 128,256 tokens. All experiments use the models in `bfloat16` precision.

## Soft Thinking Configuration {#app:soft-thinking-config}

We use the following decoding hyperparameters for all `\logitlens `{=latex}experiments, consistent across models:

::: {#tab:soft-thinking-hyperparams}
  **Parameter**                      **Symbol**       **Value**
  ------------------------------ ------------------- -----------
  Temperature                          $\tau$            0.6
  Top-$k$ (sampling)                     $k$             30
  Soft top-$k$ (embedding mix)    $k_\mathrm{soft}$      15
  Weighting scheme                                     Softmax
  Max new tokens                  $T_\mathrm{max}$      2048

  : Soft Thinking decoding hyperparameters.
:::

At each reasoning step $t$, the model computes logits over the vocabulary and selects the top-$k_\mathrm{soft}$ tokens. Their logits are normalized via softmax to obtain the mixing weights $\prob_t \in \Delta^{k_\mathrm{soft}-1}$, and the soft thinking embedding is formed as $\ctemb_t = \sum_{i=1}^{k_\mathrm{soft}} \prob_{t,i} \, \emb_{v_i}$ where $v_1, \ldots, v_{k_\mathrm{soft}}$ are the top-$k_\mathrm{soft}$ tokens by logit value.

For the benchmark evaluation runs (`\Cref{tab:math500,tab:aime2024}`{=latex}), we additionally vary the cold stop threshold $c \in \{0.0, 0.01, 0.1, 0.2\}$, which terminates soft thinking when the top token probability exceeds $1 - c$.
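The top-$k_\mathrm{soft}$ mixing and the cold stop criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the logits and embeddings are toy values, temperature scaling is not applied, and the cold stop check is evaluated on the mixing weights (whether the check uses the mixing weights or the full-vocabulary probability is an assumption here).

```python
import math

def topk_soft_mix(logits, embeddings, k_soft):
    """Return (token ids, mixing weights, soft embedding) for the top-k_soft logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top_ids = order[:k_soft]
    # Softmax restricted to the selected logits (stabilized by subtracting the max).
    m = max(logits[i] for i in top_ids)
    exps = [math.exp(logits[i] - m) for i in top_ids]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(embeddings[0])
    soft = [sum(w * embeddings[i][d] for w, i in zip(weights, top_ids))
            for d in range(dim)]
    return top_ids, weights, soft

def cold_stop(weights, c):
    """True when the top mixing weight exceeds 1 - c."""
    return max(weights) > 1.0 - c

# Toy 4-token vocabulary with 2-d embeddings (illustrative values only).
logits = [4.0, 2.0, 1.0, -3.0]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

top_ids, weights, soft = topk_soft_mix(logits, E, k_soft=2)
print(top_ids, weights, soft)
print(cold_stop(weights, 0.2), cold_stop(weights, 0.1), cold_stop(weights, 0.0))
```

With $c = 0.0$ the threshold is 1, which a probability can never strictly exceed, so under this reading the cold stop is effectively disabled.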

## `\logitlens `{=latex}Setup {#app:logit-lens-setup}

We apply `\logitlens `{=latex}at 5 evenly spaced probe layers: $\{0, \lfloor L/4 \rfloor, \lfloor L/2 \rfloor, \lfloor 3L/4 \rfloor, L-1\}$, where $L$ is the number of transformer layers. This corresponds to layers $\{0, 7, 14, 21, 27\}$ for Qwen2-1.5B, $\{0, 16, 32, 48, 63\}$ for QwQ-32B, and $\{0, 20, 40, 60, 79\}$ for DeepSeek-R1-Distill-Llama-70B.
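The probe-layer formula can be checked with a short sketch. The function name `probe_layers` is an illustrative assumption; the layer counts are those reported in `\Cref{app:models}`{=latex}.

```python
def probe_layers(num_layers):
    """Indices {0, floor(L/4), floor(L/2), floor(3L/4), L-1} for an L-layer model."""
    L = num_layers
    return [0, L // 4, L // 2, (3 * L) // 4, L - 1]

for name, depth in [("Qwen2-1.5B", 28), ("QwQ-32B", 64),
                    ("DeepSeek-R1-Distill-Llama-70B", 80)]:
    print(name, probe_layers(depth))
```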

#### Entropy profile comparison (`\Cref{subsec:entropy_profile}`{=latex}).

For each problem, we run two independent generations: one using Soft Thinking and one using standard discrete decoding (greedy argmax). Every 50 decoding steps, we apply `\logitlens `{=latex}at each probe layer and record the Shannon entropy of the resulting distribution over $\mathcal{V}$. We also store the top-10 predicted tokens at each checkpoint for qualitative comparison.

#### Token-level intervention (`\Cref{subsec:token_intervention}`{=latex}).

During Soft Thinking generation, we intervene every 50 steps by performing two fresh forward passes over the full sequence: one using the soft thinking embedding and one replacing it with the argmax embedding. Crucially, only the current token's embedding differs; the KV cache from previous (soft) tokens is shared. At each probe layer, we compute:

-   KL divergence: $\KL(\prob_\mathrm{soft}^{(\ell)} \| \prob_\mathrm{argmax}^{(\ell)})$ between the `\logitlens `{=latex}distributions.

-   Cosine similarity: $\cos(\hidden_\mathrm{soft}^{(\ell)}, \hidden_\mathrm{argmax}^{(\ell)})$ between hidden representations.

-   Entropy difference: $H(\prob_\mathrm{soft}^{(\ell)}) - H(\prob_\mathrm{argmax}^{(\ell)})$.

-   Top-$k$ token overlap: $|S_\mathrm{soft}^{(\ell)} \cap S_\mathrm{argmax}^{(\ell)}| / k$ for $k = 10$.
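The four metrics above can be sketched in a few lines. The function names, the smoothing constant `eps`, and the toy distributions are assumptions made for illustration; they are not the values or the code used in our experiments.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps is an assumed smoothing constant to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def cosine_similarity(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def topk_overlap(p, q, k):
    """Fraction of shared token ids among the top-k entries of p and q."""
    def top(dist):
        return set(sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)[:k])
    return len(top(p) & top(q)) / k

# Toy logit-lens distributions and hidden states for the two passes (made-up values).
p_soft = [0.70, 0.20, 0.10]
p_argmax = [0.75, 0.15, 0.10]
h_soft = [1.0, 0.2]
h_argmax = [1.0, 0.1]

kl = kl_divergence(p_soft, p_argmax)
cos = cosine_similarity(h_soft, h_argmax)
entropy_diff = entropy(p_soft) - entropy(p_argmax)
overlap = topk_overlap(p_soft, p_argmax, k=2)
print(kl, cos, entropy_diff, overlap)
```

Note that the KL divergence is asymmetric; the sketch follows the direction given above, with the soft-pass distribution as the first argument.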

## Evaluation Problems {#app:eval-problems}

We select 5 problems of varying difficulty from three standard math reasoning benchmarks. The same problems are used across all experiments and both models to enable direct comparison.

```{=latex}
\begin{tcolorbox}[colback=blue!3, colframe=blue!40, title={\textbf{GSM8K, Problem 42 (Easy)}}, fonttitle=\small]\small
Grandma Jones baked 5 apple pies for the fireman's luncheon. She cut each pie into 8 pieces and set the five pies out on the buffet table for the guests to serve themselves. At the end of the evening, after the guests had taken and eaten their pieces of pie, there were 14 pieces of pie remaining. How many pieces were taken by the guests?

\medskip
\textit{Answer:} 26
\end{tcolorbox}
```
```{=latex}
\begin{tcolorbox}[colback=blue!3, colframe=blue!40, title={\textbf{MATH500, Problem 10 (Medium)}}, fonttitle=\small]\small
What is the least positive integer multiple of 30 that can be written with only the digits 0 and 2?

\medskip
\textit{Answer:} 2220
\end{tcolorbox}
```
```{=latex}
\begin{tcolorbox}[colback=blue!3, colframe=blue!40, title={\textbf{MATH500, Problem 55 (Medium)}}, fonttitle=\small]\small
Suppose that $f$ is a polynomial such that $(x-1)\cdot f(x)=3x^4+x^3 - 25x^2 +38x -17.$ What is the degree of $f$?

\medskip
\textit{Answer:} 3
\end{tcolorbox}
```
```{=latex}
\begin{tcolorbox}[colback=orange!3, colframe=orange!50, title={\textbf{AIME 2024, Problem 3 (Hard)}}, fonttitle=\small]\small
Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations:
\[
\log_2\!\left(\frac{x}{yz}\right) = \frac{1}{2}, \quad
\log_2\!\left(\frac{y}{xz}\right) = \frac{1}{3}, \quad
\log_2\!\left(\frac{z}{xy}\right) = \frac{1}{4}.
\]
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.

\medskip
\textit{Answer:} 33
\end{tcolorbox}
```
```{=latex}
\begin{tcolorbox}[colback=orange!3, colframe=orange!50, title={\textbf{AIME 2024, Problem 5 (Hard)}}, fonttitle=\small]\small
Alice chooses a set $A$ of positive integers. Then Bob lists all finite nonempty sets $B$ of positive integers with the property that the maximum element of $B$ belongs to $A$. Bob's list has 2024 sets. Find the sum of the elements of~$A$.

\medskip
\textit{Answer:} 55
\end{tcolorbox}
```
## Compute {#app:compute}

All experiments were run on a SLURM cluster using NVIDIA L40S GPUs (48 GB each). `\Cref{tab:compute}`{=latex} summarizes the resource allocation for each experiment.

::: {#tab:compute}
  **Experiment**                       **Model**        **GPUs**       **RAM**   **Wall Time**
  ------------------------------------ ------------ ----------------- --------- ---------------
  Comparative (`\logitlens`{=latex})   Qwen2-1.5B    $1 \times$ L40S    32 GB       $< 1$ h
  Comparative (`\logitlens`{=latex})   QwQ-32B       $4 \times$ L40S   128 GB       $< 2$ h
  Convergence                          Qwen2-1.5B    $1 \times$ L40S    32 GB       $< 1$ h
  Convergence                          QwQ-32B       $4 \times$ L40S   128 GB       $< 3$ h
  Benchmark (MATH500)                  QwQ-32B       $4 \times$ L40S   128 GB       varies
  Benchmark (AIME2024)                 QwQ-32B       $4 \times$ L40S   128 GB       varies
  Coconut (ProsQA)                     GPT-2         $4 \times$ L40S   128 GB     $\sim 3$ h

  : Compute resources per experiment.
:::

For QwQ-32B, we distribute the model across 4 GPUs using HuggingFace Accelerate's `device_map="balanced"` strategy with a per-GPU memory cap of 75% to avoid out-of-memory errors from uneven shard allocation.

# Soft Thinking: Additional Results {#app:st-results}

In this section, we present additional results for the soft thinking experiments. We present token-level comparisons, full-dataset logit lens visualizations (for models and metrics not shown in the main text), and additional benchmark results showing model accuracy for different hyperparameter choices.

## Token-level comparisons {#app:logit-lens-1.5b}

In this section, we include visualizations of the top-3 tokens for both the soft-thinking and argmax decoding methods. We visualize the top-3 tokens across 5 chosen problem instances (see `\Cref{app:eval-problems}`{=latex} for more details) at the steps with the highest and lowest KL divergence. Results show that superposition is often being performed on tokens which do not have a meaningful relationship to the problem at hand.

![Top 3 tokens at the output layer for time steps with largest (top) and smallest (bottom) KL divergence between soft and argmax representations in **QwQ-32B**. Each column represents a problem instance.](figures/token_comparison_extreme_kl_combined_32B.png){#fig:top_tokens_visualization width="\\linewidth"}

![Top 3 predicted tokens at the output layer for time steps with largest (top) and smallest (bottom) KL divergence between soft and argmax representations in **Qwen2-1.5B**. Each column represents a problem instance.](figures/token_comparison_extreme_kl_combined.png){#fig:top_tokens_1.5b width="\\linewidth"}

## Full-Dataset `\logitlens `{=latex}Results {#app:fulldataset-logitlens}

In this section, we present visualizations of cosine similarity, KL divergence, and entropy difference not shown in the main paper. This section also includes figures for Qwen2-1.5B and DeepSeek-R1-Distill-Llama-70B, which are not included in the main text.

![Dataset-averaged intervention metrics for **QwQ-32B on MATH500** (N=500). Left to right: KL divergence, cosine similarity, absolute entropy difference. Each cell is the mean over all problems at that (layer, relative position) bin.](figures/avg_heatmap_32B_math500.png){#fig:avg-heatmap-32b-math500 width="\\linewidth"}

![Dataset-averaged intervention metrics for **QwQ-32B on AIME 2024** (N=30).](figures/avg_heatmap_32B_aime2024.png){#fig:avg-heatmap-32b-aime width="\\linewidth"}

![Dataset-averaged intervention metrics for **Qwen2-1.5B on MATH500** (N=500).](figures/avg_heatmap_1.5B_math500.png){#fig:avg-heatmap-1.5b-math500 width="\\linewidth"}

![Dataset-averaged intervention metrics for **Qwen2-1.5B on GSM8K** (N=500).](figures/avg_heatmap_1.5B_gsm8k.png){#fig:avg-heatmap-1.5b-gsm8k width="\\linewidth"}

![Dataset-averaged intervention metrics for **Qwen2-1.5B on AIME 2024** (N=30).](figures/avg_heatmap_1.5B_aime2024.png){#fig:avg-heatmap-1.5b-aime width="\\linewidth"}

![Dataset-averaged intervention metrics for **DeepSeek-R1-Distill-Llama-70B on MATH500** (N=500).](figures/DS70B/avg_heatmap_DS70B_math500.png){#fig:avg-heatmap-ds70b-math500 width="\\linewidth"}

![Dataset-averaged intervention metrics for **DeepSeek-R1-Distill-Llama-70B on AIME 2024** (N=30).](figures/DS70B/avg_heatmap_DS70B_aime2024.png){#fig:avg-heatmap-ds70b-aime width="\\linewidth"}

<figure id="fig:fulldataset-32b-aime">
<figure>
<img src="figures/entropy_comparison_aime2024.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/kl_by_layer_aime2024.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>QwQ-32B on AIME2024 (N=30).</strong> Entropy and KL divergence by layer.</figcaption>
</figure>

<figure id="fig:fulldataset-1.5b-math500">
<figure>
<img src="figures/entropy_comparison1_5B_math500.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/kl_by_layer1_5B_math500.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>Qwen2-1.5B on MATH500 (N=500).</strong> Entropy and KL divergence by layer.</figcaption>
</figure>

<figure id="fig:fulldataset-1.5b-gsm8k">
<figure>
<img src="figures/entropy_comparison1_5B_gsm8k.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/kl_by_layer1_5B_gsm8k.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>Qwen2-1.5B on GSM8K (N=500).</strong> Entropy and KL divergence by layer.</figcaption>
</figure>

<figure id="fig:fulldataset-1.5b-aime">
<figure>
<img src="figures/entropy_comparison1_5B_aime2024.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/kl_by_layer1_5B_aime2024.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>Qwen2-1.5B on AIME2024 (N=30).</strong> Entropy and KL divergence by layer.</figcaption>
</figure>

<figure id="fig:fulldataset-ds70b-math500">
<figure>
<img src="figures/DS70B/entropy_comparison_DS70B_math500.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/DS70B/kl_by_layer_DS70B_math500.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>DeepSeek-R1-Distill-Llama-70B on MATH500 (N=500).</strong> Entropy and KL divergence by layer. Logit lens is applied at 5 evenly spaced layers <span class="math inline">{0, 20, 40, 60, 79}</span> of the 80-layer model.</figcaption>
</figure>

<figure id="fig:fulldataset-ds70b-aime">
<figure>
<img src="figures/DS70B/entropy_comparison_DS70B_aime2024.png" />
<figcaption>Entropy comparison</figcaption>
</figure>
<figure>
<img src="figures/DS70B/kl_by_layer_DS70B_aime2024.png" />
<figcaption>KL divergence by layer</figcaption>
</figure>
<figcaption><strong>DeepSeek-R1-Distill-Llama-70B on AIME2024 (N=30).</strong> Entropy and KL divergence by layer. Same 5-layer probe as for MATH500; KL remains near zero from layer 20 onward.</figcaption>
</figure>

![`\logitlens `{=latex}entropy at reasoning positions for fine-tuned GPT-2 (left, 12 layers) and a from-scratch 2-layer model (right). In the fine-tuned model, Coconut latent positions maintain uniformly low entropy across all layers, while CoT positions show high early-layer entropy that resolves by layers 8--9. In the from-scratch model, Coconut latent positions retain higher entropy than their CoT counterparts, consistent with the from-scratch model maintaining richer latent representations. Shaded regions denote $\pm 1$ standard deviation across examples.](figures/latent_collapse_comparison.png){#fig:latent-collapse-comparison width="\\linewidth"}

## Benchmark Results {#app:benchmark-results}

In this section, we present a reproduction of the accuracy results on MATH500 and AIME2024 for QwQ-32B. Our numbers are similar but not identical to those reported by [@zhang2025soft]. We hypothesize that the discrepancy is due to numerical precision issues.

::: {#tab:math500}
  Run   Decoding    max_topk   Cold Stop    Accuracy (%)    Avg Tokens
  ----- ---------- ---------- ------------ -------------- ------------
  1     Discrete       --      0.0 (none)      97.08             4,326
  2     Discrete       --         0.1          97.08             4,326
  3     Discrete       --         0.2        **97.19**           4,307
  4     Softmax        10      0.0 (none)      96.47             4,222
  5     Softmax        10         0.1          96.84             4,056
  6     Softmax        15         0.01         96.84             4,044
  7     Softmax        15         0.1          96.80             4,003
  8     Uniform        3          0.1          93.97             5,388

  : MATH500 results with QwQ-32B (N=500). Discrete decoding uses standard greedy/sampling; Softmax and Uniform refer to the weighting scheme used to construct soft thinking embeddings.
:::
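To make the two weighting schemes in the table concrete, the following sketch builds a soft thinking embedding as a convex combination of the top-`max_topk` token embeddings. It is an illustration under stated assumptions, not the benchmark implementation: "Softmax" is taken to mean weights proportional to the model's token probabilities renormalized over the top-k set, "Uniform" is taken to mean equal weights over that set, and the function name `soft_embedding` is hypothetical.

```python
def soft_embedding(probs, embeddings, max_topk, weighting="softmax"):
    """Convex combination of the top-k token embeddings.

    probs: next-token probabilities, one per vocabulary entry.
    embeddings: one embedding vector (list of floats) per vocabulary entry.
    """
    # Indices of the max_topk most probable tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:max_topk]
    if weighting == "softmax":
        # Assumption: renormalize the model's probabilities over the top-k set.
        total = sum(probs[i] for i in top)
        weights = [probs[i] / total for i in top]
    elif weighting == "uniform":
        # Assumption: every top-k token receives the same weight.
        weights = [1.0 / len(top)] * len(top)
    else:
        raise ValueError(f"unknown weighting scheme: {weighting}")
    dim = len(embeddings[0])
    # Weights are non-negative and sum to 1, so the result is a convex combination.
    return [sum(w * embeddings[i][d] for w, i in zip(weights, top)) for d in range(dim)]
```

With `max_topk=1`, both schemes reduce to the embedding of the argmax token, which corresponds to discrete greedy decoding.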

::: {#tab:aime2024}
  Run   Decoding    max_topk   Cold Stop    Accuracy (%)    Avg Tokens
  ----- ---------- ---------- ------------ -------------- ------------
  1     Discrete       --      0.0 (none)    **77.29**          13,445
  2     Discrete       --         0.1        **77.29**          13,445
  3     Discrete       --         0.2        **77.29**          13,445
  4     Softmax        15         0.01         75.83            12,445
  5     Softmax        15         0.1          76.25            11,818

  : AIME2024 results with QwQ-32B (N=30).
:::

# Coconut: Experimental Details {#app:coconut-details}

## ProsQA Task {#app:prosqa-task}

ProsQA [@hao2024training] is a synthetic graph-traversal QA task designed to evaluate multi-hop reasoning. Each example consists of a randomly generated directed graph over named entities, a starting node, and a target node reachable via a sequence of directed edges. The question provides the graph structure (as a list of edges) and asks for the entity reachable from a given starting node after a specified number of hops. The ground-truth chain-of-thought consists of the sequence of intermediate entities visited during the traversal.

We use the ProsQA dataset provided with the original Coconut codebase, which contains 17,886 training examples and 300 validation examples. Graph depths range from 3 to 6 hops, with the distribution concentrated at 4 and 5 hops.

## Training Setup {#app:coconut-training}

We train GPT-2 (124M parameters, 12 layers, hidden dimension 768) on ProsQA using the Coconut training procedure of @hao2024training. `\Cref{tab:coconut-hyperparams}`{=latex} summarizes the key hyperparameters.

::: {#tab:coconut-hyperparams}
  **Parameter**                                    **CoT baseline**   **Coconut**
  ----------------------------------------------- ------------------ --------------
  Base model                                         GPT-2 (124M)     GPT-2 (124M)
  Learning rate                                       $10^{-4}$        $10^{-4}$
  Optimizer                                             AdamW            AdamW
  Weight decay                                           0.01             0.01
  Batch size (per device)                                 16               16
  Gradient accumulation steps                             2                2
  Total epochs                                            50               50
  Epochs per stage                                        --               5
  Latent tokens per step ($c_\mathrm{thought}$)           --               1
  Max latent stage                                        --               6
  Precision                                              FP32             FP32

  : Coconut training hyperparameters on ProsQA.
:::

Training follows a staged curriculum. At stage $k$ (epochs $5k$ through $5(k+1)-1$), the first $k$ chain-of-thought steps are replaced by $k$ continuous latent tokens; the remaining steps are kept as discrete text. The model is trained to predict only the remaining CoT steps and the final answer (labels for question and latent positions are masked with $-100$). By stage 6 (epochs 30--34), all reasoning steps have been replaced by latent tokens, and the model must produce the answer using only continuous latent computation. The optimizer is reset at the start of each epoch (`reset_optimizer=True`).
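The curriculum described above can be sketched as follows. This is a simplified illustration, not the Coconut codebase: tokens are kept as strings for readability, the helper names `stage_for_epoch` and `build_example` are hypothetical, and masking the `<|start-latent|>` and `<|end-latent|>` markers along with the question and latent positions is an assumption.

```python
IGNORE_INDEX = -100  # label value excluded from the loss

def stage_for_epoch(epoch, epochs_per_stage=5, max_latent_stage=6):
    # Stage k covers epochs 5k through 5(k+1)-1 and is capped at the maximum stage.
    return min(epoch // epochs_per_stage, max_latent_stage)

def build_example(question, steps, answer, stage, c_thought=1):
    """Replace the first `stage` CoT steps with latent tokens and mask their labels.

    question, answer: lists of tokens; steps: list of token lists, one per CoT step.
    """
    # Assumption: examples with fewer steps than the stage replace all their steps.
    k = min(stage, len(steps))
    latent_block = ["<|start-latent|>"] + ["<|latent|>"] * (k * c_thought) + ["<|end-latent|>"]
    prefix = list(question) + latent_block
    # Remaining discrete CoT steps followed by the final answer.
    remaining = [tok for step in steps[k:] for tok in step] + list(answer)
    tokens = prefix + remaining
    # Only the remaining CoT steps and the answer contribute to the loss.
    labels = [IGNORE_INDEX] * len(prefix) + remaining
    return tokens, labels
```

For example, `stage_for_epoch(30)` returns 6, and for a 3-step example at that stage every reasoning step is replaced by a latent token, so only the answer tokens carry labels.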

Three special tokens are added to the vocabulary: `<|start-latent|>`, `<|latent|>`, and `<|end-latent|>`, all initialized from the embedding of the `<<` token. During the forward pass, each `<|latent|>` token's embedding is replaced by the last hidden state from the preceding forward pass, implementing the continuous thought recurrence. A custom collator left-pads batches to align latent token positions across examples, enabling KV cache reuse in the multi-pass forward.
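The continuous thought recurrence can be illustrated with a toy stand-in for the transformer. In this sketch, `causal_toy_model` is a hypothetical placeholder (each hidden state is the running mean of the embeddings so far), and the prefix is recomputed on every pass for clarity; it does not model the KV cache reuse mentioned above.

```python
def causal_toy_model(embeds):
    # Stand-in for a causal transformer: hidden state i is the mean of embeddings 0..i.
    hidden, running = [], [0.0] * len(embeds[0])
    for i, e in enumerate(embeds):
        running = [r + x for r, x in zip(running, e)]
        hidden.append([r / (i + 1) for r in running])
    return hidden

def forward_with_latents(embeds, latent_positions, model=causal_toy_model):
    """Fill each latent position with the last hidden state of the preceding pass."""
    embeds = [list(e) for e in embeds]  # copy so the caller's input is not modified
    for pos in sorted(latent_positions):
        if pos == 0:
            raise ValueError("a latent token needs at least one preceding position")
        # Run the model on everything before the latent position...
        hidden = model(embeds[:pos])
        # ...and feed its last hidden state back in as the latent token's embedding.
        embeds[pos] = hidden[-1]
    # Final pass over the full sequence with all latent embeddings filled in.
    return model(embeds), embeds
```

Because each latent embedding depends on the hidden state produced by the previous pass, a sequence with n latent tokens requires n + 1 forward passes in this formulation.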

We use `torchrun` with 4 GPUs; FSDP wraps the model but does not shard GPT-2's layers (effectively acting as DDP at this model scale). The best CoT checkpoint is at epoch 49 (85.3% validation accuracy); the best Coconut checkpoint is at epoch 50 (99% validation accuracy).

# Coconut: Additional Results {#app:coconut-results}

## Entity Distributions {#app:coconut-entity}

In this section, we present entity distribution probing results for step counts and scenarios that are not shown in the main paper.

![Normalized entity probability mass at each reasoning step for 4-step ProsQA examples (fine-tuned Coconut). Each stacked area shows the fraction of probability assigned to four entity categories: correct next hop (blue), wrong graph neighbors (orange), target/final answer (red), and other entities (gray). In the CoT model, probability mass shifts progressively toward the correct next entity at each step, consistent with step-by-step chain traversal. In the Coconut model, the target entity dominates from the first reasoning step onward, indicating that the model commits to the final answer without tracking intermediate hops.](figures/stepwise_all_entity_norm_4step.png){#fig:stepwise-entity-norm-4step width="\\linewidth"}
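The four-way categorization used in these plots can be sketched as below. This is an illustrative reconstruction, not the probing code: the function name `categorize_entity_mass` is hypothetical, and the precedence used when categories overlap (correct next hop first, then target, then wrong neighbors) is an assumption.

```python
def categorize_entity_mass(entity_probs, correct_next, wrong_neighbors, target):
    """Aggregate probability over entities into four normalized categories.

    entity_probs: dict mapping entity name to (possibly unnormalized) probability.
    """
    mass = {"correct_next": 0.0, "wrong_neighbors": 0.0, "target": 0.0, "other": 0.0}
    for entity, p in entity_probs.items():
        # Assumed precedence: an entity is counted in the first category it matches.
        if entity == correct_next:
            mass["correct_next"] += p
        elif entity == target:
            mass["target"] += p
        elif entity in wrong_neighbors:
            mass["wrong_neighbors"] += p
        else:
            mass["other"] += p
    total = sum(mass.values())
    # Normalize so the four categories sum to 1 (left as zeros if there is no mass).
    return {k: v / total for k, v in mass.items()} if total > 0 else mass
```

Applying this at every reasoning position and stacking the resulting fractions yields one column of the stacked-area plot per step.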

![Step-aware entity belief across model depths for 3-step ProsQA examples (from-scratch Coconut). Each panel corresponds to a model trained from random initialization at a different depth (2L, 4L, 8L, 12L). Shallow models (2L, 4L) exhibit belief evolution across latent steps, with probability mass shifting toward the correct next entity at intermediate positions. Deeper models (8L, 12L) show no such progression; the target entity appears early without intermediate tracking. This complements `\Cref{fig:depth_stepwise_entity}`{=latex}, which shows the analogous pattern for 4-step examples.](figures/depth_spectrum_stepwise_3step.png){#fig:depth-spectrum-3step width="\\linewidth"}

<figure id="fig:stepwise-hard-examples">
<figure>
<img src="figures/stepwise_hard_examples_coconut.png" />
<figcaption>Coconut (latent positions)</figcaption>
</figure>
<figure>
<img src="figures/stepwise_hard_examples_cot.png" />
<figcaption>CoT (reasoning positions)</figcaption>
</figure>
<figcaption>Step-aware entity belief for the 3.4% of ProsQA test examples (17/500) that the Coconut model answers correctly with latent tokens but incorrectly without them. In the Coconut panel, the “other” category dominates (<span class="math inline"> ∼ 70%</span>), with the target receiving only <span class="math inline"> ∼ 25 − 35%</span> of the probability mass; the model does not confidently commit to the final answer from the question embedding alone on these examples, unlike the general population. In contrast, the CoT model still performs step-by-step chain traversal on the same examples. This suggests that the latent tokens may be load-bearing for a small subset of examples where the question embedding does not suffice.</figcaption>
</figure>

<figure id="fig:prontoqa-stepwise-entity">
<figure>
<img src="figures/prontoqa_stepwise_coconut.png" />
<figcaption>Coconut (latent positions)</figcaption>
</figure>
<figure>
<img src="figures/prontoqa_stepwise_cot.png" />
<figcaption>CoT (reasoning positions)</figcaption>
</figure>
<figcaption>Normalized entity probability mass at each reasoning step on ProntoQA (fine-tuned Coconut). In the Coconut model, the “other” category dominates throughout (<span class="math inline"> ∼ 80%</span>), with correct next, wrong neighbors, and target each receiving minimal probability mass. In contrast, the CoT model tracks the correct next entity at each intermediate step before transitioning to the target at the final step.</figcaption>
</figure>

![Normalized probability of the correct label ("True" or "False") across latent positions for the Coconut model on ProntoQA. At the pre-thinking position the model assigns ${\sim}93\%$ of the probability mass to the correct label; this drops to ${\sim}60\%$ during latent reasoning (T1--T5) before recovering at the post-thinking position. The dip during latent positions is consistent with the model redistributing probability mass without performing productive intermediate computation. Probing for these labels is more consistent with the ProntoQA task, as the model must return a binary "True" or "False" response, as opposed to returning the entity name as is the case in ProsQA.](figures/true_false_belief_all_examples.png){#fig:true-false-belief width="0.5\\linewidth"}

## Other Coconut Results {#app:coconut-logitlens}

In this section, we present additional experiments that are not included in the main paper. Specifically, we examine the entropy across layers for Coconut and CoT models on the ProsQA task, plot Coconut gradient norms throughout training, and show entity belief plots for the ProntoQA [@saparov2022language] task.

![Mean $L_2$ gradient norms per parameter group during Coconut fine-tuning on ProsQA (GPT-2). Each panel shows one parameter group: word token embeddings (wte + lm_head), attention, MLP, LayerNorm, and positional embeddings (wpe). Alternating shaded bands denote the five-epoch training stages of the Coconut curriculum. Note that gradient magnitudes remain non-trivial across all groups and stages, indicating that the model parameters are being actively updated throughout training; however, as shown in `\Cref{sec:coconut}`{=latex}, this training does not yield latent representations that participate in multi-step reasoning.](figures/grad_norms_by_group.png){#fig:grad-norms-by-group width="\\linewidth"}

![`\logitlens `{=latex}entropy at reasoning positions for fine-tuned GPT-2 (left) and a from-scratch 2-layer model (right). In the fine-tuned model, Coconut latent positions maintain near-zero entropy across all layers, consistent with early commitment to a single token. In the from-scratch model, this pattern reverses: Coconut latent positions retain higher entropy than CoT, suggesting richer latent representations. Shaded regions denote $\pm 1$ standard deviation.](figures/latent_collapse_comparison.png){#fig:placeholder width="\\linewidth"}

[^1]: Corresponding author. Contact: [`\footnotesize `{=latex}`michael.rizvi-martel@mila.quebec`](mailto:michael.rizvi-martel@mila.quebec)