---
abstract: |
  Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning like text following tasks remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is devised to dynamically select specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation tasks and additional studies on the GenEval and DPG benchmark demonstrate the superiority of the proposed method in enhancing model image generation performance.
author:
- |
  `\normalsize `{=latex}Yuwei Sun^1^, Yuxuan Yao^2^, Hui Li^2^, Siyu Zhu^1,2^\
  `\normalsize`{=latex}^1^Shanghai Academy of AI for Science, ^2^Fudan University\
bibliography:
- references.bib
title: "The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents"
---

```{=latex}
\maketitle
```
# Introduction

Diffusion models have achieved remarkable success in generating high-fidelity, diverse outputs. However, while these models excel at data synthesis, their capacity for structured, multi-step reasoning for enhanced text following capability remains underexplored. This gap mirrors a long-standing challenge in the language domain, where recent advances in large language models (LLMs) such as Chain-of-Thought (CoT) and Test-time Latent Reasoning solve complex problems by either generating explicit intermediate reasoning steps or recursive internal state update [@wei2022chain; @wang2022rationale; @snell2024scaling]. Nevertheless, progress in recursive latent-step reasoning has been largely confined to language, with limited exploration in multimodal scenarios like text to image generation tasks.

One obstacle is that, different from language tokens, visual information is less discrete and cannot be efficiently reasoned over or simply computationally expensive and data-hungry within their continuous latent space. To tackle this problem, we opt to inspirations from the modular and adaptive nature of human multimodal cognition. We build on recent work in sparse mixture-of-experts (MoE) architectures [@madan2021fast; @zhu2025llada; @zhao2024dynamic] and propose a recursive modular framework tailored to diffusion models such as Diffusion Transformers (DiTs) and Stable Diffusion 3 (SD3) for better capturing complex patterns within their latent space. We start with simpler class-conditioned DiTs and extend the architecture to the more challenging SD3 model. We aim to demonstrate that recursive latent computation could also benefit vision tasks, enabling efficient generation quality enhancement.

Our approach is centered on a recursive component integrated into transformer attention layers. This component iteratively refines visual tokens over multiple latent steps using a set of shared neural modules. These modules are sparsely activated for each latent step and dynamically selected by a gating network, conditioned on the current vision tokens, the diffusion timestep, and the conditioning class label or the input text embeddings. Crucially, we incorporate two additional mechanisms for ensuring learnable expert routing and more efficient recursive training: (1) a Gumbel-Softmax strategy enabling winner-takes-all gradient allocation [@lei2024compete] for training the sparse selection component, encouraging the emergence of distinct neural modules, and (2) a tailored mixture-of-adapters mechanism with recursive adapter updates for accumulating the recursive module's output across multiple latent steps with lightweight parameters.

We evaluate model performance on both class-conditioned generation tasks with ImageNet [@deng2009imagenet], measured with FID, sFID, IS, precision, and recall, and text-to-image generation tasks based on the GenEval benchmark [@ghosh2023geneval]. The comprehensive empirical evaluation, with token trajectory visualization showing specialization of neural modules, demonstrates the benefits of the proposed recursive sparse reasoning architecture. Using the sparse activation of lightweight shared neural modules significantly alleviates computational cost while achieving superior model performance compared to a wide range of conventional methods.

Overall, our main contributions are three-fold:

1\) We propose a recursive sparse reasoning framework, specifically tailored for vision diffusion models such as DiTs and SD3. This framework aims to enhance conditional alignment in image generation tasks via iterative modular processing of input vision tokens, with a condition-guided routing strategy (Section `\ref{sec:module}`{=latex}).

2\) We target the joint attention component in layers specifically responsible for structural layout, where a set of neural modules are sparsely activated over several latent steps based on the current vision tokens, the diffusion timestep, and conditioning information such as complex text prompt embeddings (Section `\ref{sec:recursion}`{=latex}).

3\) Fine-tuning strategies tailored for coherent, recursive token refinement are proposed. These strategies enable efficient recursive model adaptation and progressive enhancement of text-visual alignment in the GenEval and DPG benchmarks (Section `\ref{sec:t2i}`{=latex}). Additionally, this framework shows potential for adaptation to visual navigation tasks such as the FrozenLake (Section `\ref{sec:agent}`{=latex}).

# Related work

This section provides a comprehensive summary of the most relevant recent work on diffusion transformers, sparse modular computation, and latent reasoning methods.

#### Diffusion Transformers

Diffusion models [@ho2020denoising; @song2020denoising] have become the foundation for modern text-to-image generation, enabling high-fidelity, diverse sample synthesis [@rombach2022high]. A key architectural shift was pioneered by the Diffusion Transformers (DiTs) [@peebles2023DIT], which demonstrated the scalability and effectiveness of Transformer backbones for diffusion, inspiring a series of follow-up works focused on improved conditioning mechanisms [@chen2023pixartalpha; @chen2024pixartsigma; @zhou2025goldennoisediffusionmodels]. The recent development of joint multimodal architectures within the diffusion process with models like Stable Diffusion 3 (SD3) [@esser2024sd3] exemplifies this as Multimodal Diffusion Transformers (MMDiTs). However, despite these advances, existing architectures process conditioning information in a largely monolithic manner, leaving room for more structured, modular processing within the denoising process.

#### Modularity and sparse transformers

Different from the monolithic design of conventional MMDiTs, cross-modal cognition in humans typically relies on sparse activations of modular neural networks, where diverse specialized modules compete via a communication bottleneck, a principle formalized as the Global Workspace Theory (GWT) [@goyal2021coordination; @blum2021theoretical; @butlin2023consciousness]. Inducing sparsity in Transformer-based architectures via modulation has led to the emergence of distinct functions in neural modules through winner-takes-all gradient allocation [@madan2021fast; @lei2024compete; @sun2025associative], enhancing both performance and parameter efficiency. For example, LLaDA-MoE [@zhu2025llada] integrates a mixture-of-experts architecture into the training objective of masked diffusion models. Dynamic Diffusion Transformer [@zhao2024dynamic; @{zhao2025dydit++}] showed emergent resource allocation across neural modules, with some modules employed more frequently in the earlier layers while less frequently in the later layers. Inspired by these findings and aiming to enable better capacity in recursive computation, we explore recursive latent reasoning with sparsely activated modules to enhance alignment in diffusion models through multi-step refinement.

#### Latent reasoning

Depth has been shown valuable in reasoning tasks that require multi-step compositional thinking, beyond the advantages of efficient parameter count via shared parameters. Latent reasoning involves iterative computation within the latent feature space, which has been investigated in language models. For instance, Pondering Language Models [@banino2021pondernet] perform iterative ponder cycles inside every token prediction, where weighted hidden states are fed back via a residual path for refinement. Tiny Networks [@jolicoeur2025less] and Hierarchical Reasoning Model (HRM) [@wang2025hierarchical] employ a hierarchical reasoning framework to recursively refine latent states for tackling various text-based puzzle tasks. Test-time Latent Reasoning [@geiping2025scaling] using a random number of latent steps demonstrates advantages in model depth reduction and enhanced model performance.

Furthermore, the latent reasoning approach has also recently been extended to multimodal models beyond language tasks. For instance, Chain of Continuous Vision-Language Thought (CoCoVa) [@ma2025cocova] iteratively refines a chain of latent states through gated cross-modal fusion. Additionally, Multimodal Chain of Continuous Thought (MCOUT) [@pham2025multimodal] enables reasoning in a joint latent space, with reasoning states iteratively refined and aligned with visual and textual features. Despite advances in multimodal models, the use of latent reasoning for enhanced vision--text alignment in conditioned image generation tasks of diffusion models remains largely unexplored.

# Method

![Architecture of the proposed recursive sparse reasoning mechanism integrated with SD3. A recursive modular component is introduced into the joint attention of vision and text branches. Visual token probabilities are computed across samples, and tokens are reassembled into original spatial order after passing through selected neural modules. At each latent step, outputs of the neural modules correlate with the text branch to progressively refine the representation using text tokens. This results in better-aligned visual generation with respect to the input prompt.](figures/scheme.png){#fig:scheme width="\\linewidth"}

This section discusses the key technical underpinnings of the Recursive Sparse Reasoning mechanism, focusing on the modular components using a mixture-of-adapters and recursive sparse joint attention tailored to diffusion models (Figure `\ref{fig:scheme}`{=latex}). Introducing such architecture in Multimodal Diffusion Transformer (MMDiT) blocks, specifically those responsible for structural layout, can help resolve prompt ambiguities and enhance semantic alignment in image generation results. Notably, we induce the recursive modular processing via a set of neural modules sparsely activated over several latent steps, using the current vision tokens, the diffusion timestep, and condition information. To be more concrete, we focus on MMDiT model SD3 in this section, where the model progressively refines the alignment between textual and visual features via joint attention. In addition, we testify on Diffusion Transformers (DiTs) and leave the detailed configurations to the experiment section.

## Multimodal Diffusion Transformers

Multimodal Diffusion Transformers (MMDiTs) process the vision and text tokens through two individual flows and employ joint attentions to unify the connections among the different modalities. Within each block, self-attention is first utilized intra-modally to refine the representations within each modality independently. A joint attention mechanism then models cross-modal dependencies, typically by projecting the concatenated sequence of vision and text tokens through a unified attention operation. This allows every token, regardless of modality, to attend to all others, effectively enabling direct interactions between visual and textual elements.

In particular, let $x$ be the vision branch input, $c$ be the text branch input, and $y$ be embeddings derived from the diffusion timestep and sinusoidal encoding of text prompts. Each block consists of adaptive normalization, joint attention, and MLP layers. We target the joint attention component in MMDiT blocks, where the normalized tokens are concatenated as $h = [\tilde{x};\tilde{c}]$ and used as the input to a multi-head attention mechanism Attn($\cdot$) as follows: $$\begin{aligned}
Q = h W_Q, \quad K = h W_K, \quad V = h W_V,\\
\mathrm{Attn}(h) = \mathrm{Softmax}\Big(\frac{Q K^\top}{\sqrt{D}}\Big) V,
\end{aligned}$$ where $W \in \mathbb{R}^{D \times D}$ for $W \in \{W_Q, W_K, W_V\}$ and $D$ is the dimension of the weight matrices. After the attention, the outputs are split back into the image and text streams, i.e., $[a_x;a_c] = \mathrm{Attn}(h)$. Then, the attention outputs are added to the original tokens from each modality as residuals, with learnable gating parameters $\gamma_x$ and $\gamma_c$ as follows: $$x' = x + \gamma_x \cdot A_x, \quad c' = c + \gamma_c \cdot A_c.$$

Finally, the residual outputs are passed through LayerNorm (LN) and an MLP with residual gating. Consequently, a complete single MMDiT block can be summarized as follows: $$h = [\tilde{x}, \tilde{c}], \quad
h \leftarrow h + \gamma \cdot \mathrm{Attn}(\alpha \cdot \mathrm{LN}(h) + \beta), \quad
h \leftarrow h + \zeta \cdot \mathrm{MLP}(\delta \cdot \mathrm{LN}(h) + \epsilon),$$ where $\alpha, \beta, \gamma, \delta, \epsilon, \zeta$ are modulation parameters separate for each modality $x$ and $c$.

## Neural modules with mixture-of-adapters {#sec:module}

We introduce multiple joint attention adapters that dynamically process visual tokens and then remap them back to their original spatial order. These visual tokens are recursively updated across latent steps, with each step passing them through distinct neural modules to progressively refine the representation. Note that only the visual tokens are routed through these modules, while the complete text tokens are retained exclusively for joint attention computations. This design creates a competitive bottleneck, forcing the neural modules to develop specialized processing functions as they compete to select and integrate the most relevant vision tokens under the guidance of text embeddings for enhancing cross-model alignment at each step.

To induce a recursive component in diffusion models, a naive choice is a monolithic model that takes its own output back to the input via layer repetition. Nevertheless, this design could be inefficient for recursive training, since the shared component goes through multiple iterations and a monolithic model could significantly increase the computational cost. Thus, we opt for a more efficient mixture-of-adapter architecture with a set of lightweight neural modules sparsely activated for each iteration. In particular, the recursive component $\theta_{\text{recursive}}$ consists of a total of $M$ neural modules $\{\theta_{\text{adapter}}^m\}_{m=1}^M$ and a gating network $\theta_\text{gate}$. To determine which module to activate, the gating network produces selecting probabilities for $M$ modules. To increase the capacity of neural modules for learning distinct features, we compute the probability for each visual token $v$ independently across samples. Then, after each latent step, we reassemble the tokens into their original spatial order within each image. This allows various modules to access tokens across samples, improving learning efficiency, while the reassembly process preserves the original spatial arrangement after each latent recursion step. Furthermore, the routing probability is computed based on (1) the current latent step's vision token $x^{t_\text{latent}}$, (2) the diffusion timestep, and (3) the conditioning text prompt embeddings. Note that, in MMDiTs, the text embeddings and the diffusion timestep is usually combined and represented as $y$. Finally, for each step, only one module is selected and activated, given by: $$\theta_{\text{adapter}}^{m^*} : \arg\max_m\, P_m^{t_\text{latent}}(\theta_\text{gate}(x^{t_\text{latent}}, y)),$$ where $P_m^{t_\text{latent}}$ denotes the selection probability for module $m$. In practice, we employ the **Gumbel-Softmax** to enable gradients back-propagated through the recursive component, for the hard selection of a single module. Thus, we compute logits $\hat{z}^{t_\text{latent}}_m$ for the module selection as follows: $$\begin{aligned}
z_m^{t_\text{latent}} = P_m^{t_\text{latent}} + \epsilon^{t_\text{latent}}&,\, \hat{z}^{t_\text{latent}}_m = \text{softmax}\left( z^{t_\text{latent}}_m / \tau \right),\\
\theta_{\text{adapter}}^{t_\text{latent}} :& \arg\max_m\, \hat{z}^{t_\text{latent}}_m,
\end{aligned}$$ where $\epsilon^{t_\text{latent}} \sim \text{Gumbel}(0,1)$ adds noise for exploration during training and the temperature $\tau$ controls the sparsity of selection.

## Recursive sparse joint attention {#sec:recursion}

In an MMDiT block, we target the joint attention component where the two modalities exchange information, as the alignment between vision and text modalities is built at this stage. We employ an adaptation scheme that recursively implements Low-Rank Adaptation (LoRA) [@hu2021lora] across $R$ latent steps to construct lightweight neural modules for efficient adaptation. In particular, for each joint attention component, trainable low-rank updates are applied to the query, key, and value projection matrices $W^Q_{\text{vision}}, W^K_{\text{vision}}, W^V_{\text{vision}}$ of the vision branch. For a weight matrix $W \in \{W^Q_{\text{vision}}, W^K_{\text{vision}}, W^V_{\text{vision}}\} \subset \mathbb{R}^{D \times D}$, the update $\Delta W$ is computed as $\Delta W = BA$, where $A \in \mathbb{R}^{r \times D}$, $B \in \mathbb{R}^{D \times r}$, and the rank $r \ll D$. Thus, for an input vision token $x^{t_{\text{latent}}}$, the output becomes $\hat{W} x^{t_{\text{latent}}} + \Delta W x^{t_{\text{latent}}} = \hat{W} x^{t_{\text{latent}}} + BA x^{t_{\text{latent}}}$, where $\hat{\cdot}$ denotes frozen parameters.

In initial experiments, we found that processing the input through the frozen base model repeatedly at each step corrupted the generation performance, as repeated exposure to the base model's fixed representations introduced compounding artifacts. To make this effective in image generation tasks, carefully designed fine-tuning is necessary. To this end, we use the low-rank adapter outputs for recursive computation throughout the intermediate latent steps, and the frozen base model output is integrated only at the final step. This design prevents the recursive computation results from diverging significantly from the original distribution space, enabling progressive refinement of the adaptation without degradation.

At each latent step $t_\text{latent} \in \{1, \dots, T_\text{latent}\}$, a specific set of LoRA parameters $\theta_{\text{adapter}}^{t_\text{latent}}: \{B^{t_\text{latent}}, A^{t_\text{latent}}\}$ is selected based on the predicted routing probability $P_m^{t_\text{latent}}$. Notably, after each latent step, we concatenate the vision update and text tokens from the text branch for the joint attention, then the computed result is separated into two branches. We then use the result of the vision branch as the input for the next latent step while remaining the same text tokens input.

Then, we recursively compute adaptation results at various latent steps $t_\text{latent}$: $$\begin{aligned}
\tilde{a}_{x^{0}} &= \tilde{x}, \\
 [a_{x^{t_\text{latent}}};a_{c^{t_\text{latent}}}] &= \text{Attn}(B^{t_\text{latent}} A^{t_\text{latent}}\, (\alpha_x\cdot \text{LN}(\tilde{a}_{x^{t_\text{latent}-1}})+\beta_x);\tilde{c}),\\
\tilde{a}_{x^{t_\text{latent}}} &= a_{x^{t_\text{latent}}} + \tilde{x},\, t_\text{latent} = 1, \dots, T_\text{latent}-1.
\end{aligned}$$ Finally, the recursive adaptation after $T_\text{latent}$ steps incorporates the output of the frozen base model as follows, $\text{Attn}(B^{t_\text{latent}} A^{t_\text{latent}}\, (\alpha_x\cdot \text{LN}(\tilde{a}_{x^{T_\text{latent}-1}})+\beta_x) +\hat{W}\tilde{x};\tilde{c}) = [a_{x^{T_\text{latent}}};a_{c^{T_\text{latent}}}]$, where $a_{x^{T_\text{latent}}}$ and $a_{c^{T_\text{latent}}}$ are the outputs of the recursive component for the vision and text branches. The complete algorithm for the proposed method is demonstrated in Appendix `\ref{appe:algo}`{=latex}.

# Experiments

In this section, we present the settings and extensive experiment results for class-conditioned generation, text-to-image generation, and visual navigation tasks. We evaluate the proposed method on class-conditional tasks using Diffusion Transformers (DiTs) with latent recursion trajectories analysis. We then extend our evaluation to more complex text-following tasks with Stable Diffusion 3 (SD3) models. Performance is assessed on the benchmarks of GenEval and DPG. Additionally, visual navigation in the FrozenLake environment is investigated, using the generated frames from various latent steps, to measure its extendability to action planning.

## Settings

#### Datasets

We consider image generation tasks for both class-conditioned and text-to-image generation. For class-conditioned generation, we employ the ImageNet [@deng2009imagenet] dataset at a resolution of 1024 × 1024. For the text-to-image (T2I) generation task, we finetune models with 1000 samples from the MSCOCO dataset [@lin2015microsoftcococommonobjects]. We employ the GenEval benchmark [@ghosh2023geneval] for evaluation, where various attributes are measured, including single object, two objects, counting, colors, color attributes, and position, with the DPG benchmark [@hu2024ella] as additional results. For the visual navigation task, we opt for the Frozen Lake environment [@wu2024vsp], where an agent learns to predict future actions to reach a goal without falling into holes in the environment.

#### Model variants and hyperparameters

We employ SD3-medium with a total of 24 layers. Besides SD3, we adapt the method for DiTs by applying the recursive component to the self-attention layer and using the class labels and diffusion timestep for predicting the module selection probabilities. In particular, we employ a DiT-XL model with a total of 24 layers. For the Gumbel Softmax, we set the temperature parameter $\tau$ to 5.0. Additionally, for more than three experts, we employ the expert balance loss [@sun2025associative] to encourage diversity in expert usage during recursion computation. For more detailed settings, please refer to Appendix `\ref{appe:hype}`{=latex}.

```{=latex}
\begin{table*}[t]\small
\setlength{\tabcolsep}{8pt}

\caption{Performance comparison for generation on the ImageNet dataset at $256 \times 256$ resolution.}
\label{tab:classcondition}
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{lcccccc}
\toprule
Model & Size(M)$\downarrow$ & FID$\downarrow$ & sFID$\downarrow$ & IS$\uparrow$ & Precision$\uparrow$ & Recall$\uparrow$ \\
\midrule
Ours   &  675   & \textbf{2.27}  & 4.69& \textbf{275.64} & 0.82 & \textbf{0.62} \\
DiT-XL/2 \cite{peebles2023DIT} &  675  & 2.34   & 4.73   & 275.56 & 0.83 & 0.57\\
\hline
Other methods \\
\hline
ADM \cite{dhariwal2021diffusion}             & 608  & 4.59 & 5.25 & 186.87 & 0.82 & 0.52 \\
LDM-4 \cite{rombach2022high}         & 400 & 3.95 & --   & 178.22 & 0.81 & 0.55 \\
U-ViT-L/2 \cite{bao2023all}       & 287 & 3.40 & 6.63 & 219.94 & 0.83 & 0.52 \\
U-ViT-H/2 \cite{bao2023all}       & 501 & 2.29 & 5.68 & 263.88 & 0.82 & 0.57 \\
DiffuSSM-XL \cite{yan2024diffusion}     & 673   & 2.28 & \textbf{4.49} & 259.13 & \textbf{0.86} & 0.56 \\
\bottomrule
\end{tabular}
\end{table*}
```
![Image generation results of DiT-XL and the proposed method based on the recursive sparse reasoning mechanism. Using the recursive method provides richer object textures and finer background details in the generated images.](figures/dit_result.png){#fig:dit width="0.9\\linewidth"}

![Latent trajectories of vision tokens during five-step recursion (12th layer). Visualized via PCA compressing all tokens from a single image. Trajectory changes across diffusion timesteps: at early steps (high noise), trajectories are largely coherent, indicating uniform processing; at later steps (low noise), they exhibit divergent, patch-specific pathways, reflecting the neural modules' context-adaptive specialization.](figures/traj.png){#fig:trajectories width="1\\linewidth"}

![Adaptive neural module activation conditioned on the current vision tokens, the diffusion timestep, and the class label. Ablating the conditioning to only the current vision tokens results in modules that struggle to diversify across latent steps. Therefore, the optimal choice incorporates both the current states and the additional conditional information.](figures/routing.png){#fig:routing_stats width="0.85\\linewidth"}

## Class-conditioned image generation {#sec:class}

For evaluation in class-conditioned generation tasks, following [@teng2024dim], we sample 50,000 images to measure the Fréchet Inception Distance (FID) [@heusel2017gans], Inception Score (IS) [@salimans2016improved], sFID [@nash2021generating], precision, and recall. We compare model performance against the SD3-medium model as well as conventional generation models in Table `\ref{tab:classcondition}`{=latex}. The results demonstrate that the proposed recursive sparse reasoning framework achieves competitive performance with the baseline DiT-XL/2 model. DiffuSSM-XL achieved better sFID and precision scores. We demonstrate the generated samples in Figure `\ref{fig:dit}`{=latex}.

#### Expert Allocation and Latent Trajectory Analysis

To understand the evolvement of latent vision token states during different recursion steps, we performed a more detailed analysis of the routing behavior during inference. As shown in Figure `\ref{fig:trajectories}`{=latex}, the latent trajectories appear unified, with similar behavior early in the diffusion process but diverge in later diffusion time steps, indicating increased neural module specialization. This reveals how architectural specialization emerges dynamically throughout the generative process. Furthermore, the dynamic routing results illustrated in Figure `\ref{fig:routing_stats}`{=latex} demonstrate how neural module activation shifts across recursive latent time steps, reflecting a progressive refinement of visual tokens. Additionally, using the diffusion timestep and class label, and the current vision tokens enables effective and diverse module activations.

![Qualitative comparison of image generations from our recursive approach versus the SD3-medium baseline at 1024 × 1024 resolution. Our method achieves higher text alignment and fidelity, whereas SD3 fails to accurately reflect fine-grained prompt details.](figures/geneval_result.png){#fig:sd3-compare width="1\\linewidth"}

```{=latex}
\small
```
```{=latex}
\renewcommand{\arraystretch}{1.3}
```
```{=latex}
\setlength{\tabcolsep}{2.5pt}
```
::: {#tab:dpg}
  **Model**                           **Overall**   **Single object**   **Two object**   **Counting**   **Colors**   **Color attr**   **Position**
  ---------------------------------- ------------- ------------------- ---------------- -------------- ------------ ---------------- --------------
  SD3-medium [@esser2024sd3]             67.93         **100.00**           84.60           60.31         86.17          55.00           21.50
  **Target on the middle layer**
  SD3-medium + SFT                       68.33         **100.00**           83.59           61.56         84.57          53.50           26.75
  Ours (w/o recursion)                   69.88            99.38             84.85           59.06         86.97          55.00         **34.00**
  Ours (w/o modulation)                  70.36            99.38             88.13           59.38       **88.30**        56.00           31.00
  Ours $(M=2,\,T_\text{latent}=2)$       70.66         **100.00**           86.35           60.31         85.04        **61.75**         30.50
  **Target on multiple layers**
  SD3-medium + SFT                       69.55            99.38             87.12           59.69         85.64          55.25           30.25
  Ours $(M=2,\,T_\text{latent}=2)$     **71.18**       **100.00**         **89.65**       **64.38**       84.31          61.00           21.75

  : Evaluation results on the DPG benchmark.
:::

```{=latex}
\small
```
```{=latex}
\renewcommand{\arraystretch}{1.3}
```
```{=latex}
\setlength{\tabcolsep}{6pt}
```
::: {#tab:dpg}
  **Model**                           **Overall**   **Entity**   **Attribute**   **Relation**   **Global**   **Other**
  ---------------------------------- ------------- ------------ --------------- -------------- ------------ -----------
  SD3-medium [@esser2024sd3]             85.65        90.24          87.39          92.15         85.71        80.40
  SD3-medium + SFT                       85.72        89.90          88.14        **92.80**       85.51        80.72
  Ours $(M=2,\,T_\text{latent}=2)$       85.31        89.97          87.63          91.61         87.23        84.40
  Ours $(M=5,\,T_\text{latent}=5)$     **85.88**    **90.39**      **88.34**        91.53       **88.15**    **82.80**

  : Evaluation results on the DPG benchmark.
:::

## Text-to-image generation tasks {#sec:t2i}

Computational depth induced by the recursion component is expected to be valuable, especially for correlating complex information such as text prompts, which requires the understanding of compositional relationships such as position and color. To evaluate the recursion model's effectiveness in aligning vision tokens with more complex text prompts, e.g., in text-to-image tasks, we implement the proposed recursive sparse mechanism in a multimodal diffusion transformer. Our method can be built by replacing a single layer or multiple layers and fine-tuning on pretrained SD3 weights.

The GenEval benchmark [@ghosh2023geneval] is utilized to evaluate the model's text-following ability for the text-to-image generation task. Furthermore, to validate the design choices of recursive modulation, we examine the impact by varying the number of neural modules and latent steps. When there is only a single neural module and it takes a single latent step, this framework becomes a supervised fine-tuning (SFT) method based on the low-rank adaptation (LoRA).

Table `\ref{tab:geneval}`{=latex} demonstrates the ablation of the recursion and modulation components, respectively, with varying combinations of these two elements. The results show that both recursion and modulation contribute to the improvement of model performance. Targeting multiple layers of SD3 further enhances its benefits and achieves the best overall scores on the benchmark. Additionally, we present the generated images at a resolution of 1024 × 1024 with specific text prompts in Figure `\ref{fig:sd3-compare}`{=latex}, showing that the recursive approach maintains high sample fidelity and enhanced text alignment across diverse prompt inputs. By the contrast, SD3's generation results sometimes cannot reflect details described in the prompts correctly.

Furthermore, we evaluate on the more complex DPG benchmark with the trained model targeting the multiple layers. The settings are the same as in the GenEval evaluation. Table `\ref{tab:dpg}`{=latex} presents the comparison results. Our method demonstrates enhanced semantic alignment of generated images, achieving superior performance across the various categories and the best overall score.

## Visual navigation tasks {#sec:agent}

To further demonstrate the generality of our recursive sparse reasoning framework, we extend its application to a simplified visual navigation task in the Frozen Lake environment [@wu2024vsp]. The environment simulates a grid-based frozen lake, where the agent starts from the designated position and navigate its way to the destination. To generate a sequence of navigation frames, we decode each latent state in the recursion process by training a Transformer-based decoder network. Moreover, the agent's specific actions are used to select neural modules and train the gating network, whereas during inference, the agent selects neural modules over multiple latent steps with the trained gating network. The aim is to generate a sequence of intermediate frames that enable the agent to reach the destination given only the start frame, with planning performed entirely in the visual latent space.

We demonstrate the visual planning results in Figure `\ref{fig:agent}`{=latex}, which are a sequence of images decoded from the latent space of the recursive sparse reasoning component. The generated images show consistent action trajectories, demonstrating that the proposed method can learn action consequence correlations, distinguishing between the static environment and the dynamic agent. Note that the training is performed completely in the continual vision modality without using discrete information such as the agent's positions. Nevertheless, we also found some failure cases, where the agent accidentally falls into holes, especially when the arrangement of these obstacles is crowded (Figure `\ref{fig:agentf}`{=latex}). Some actions such as moving in the diagonal direction are not provided in the training data, yet the agent takes the actions occasionally.

![Decoded latent states for each recursion step in the simple agent visual navigation task. We demonstrate the generated frames of the agent given only the first frame as input. During training, the agent learns to reach the final goal through the recursive sparse reasoning mechanism (though the agent might perform some random walk before reaching the goal without further optimization).](figures/agent_result.png){#fig:agent width="0.95\\linewidth"}

![A failure case of the prediction model, where falling into a hole is predicted instead of reaching the final goal.](figures/agent_failure.png){#fig:agentf width="0.95\\linewidth"}

# Conclusions

In this work, we proposed a recursive sparse reasoning framework for enhanced image generation in diffusion models while maintaining lightweight computational cost using a set of sparsely activated neural modules. Empirical experiments on both diffusion transformers and Stable Diffusion models demonstrate that inducing recursive modulation in the latent space benefits class-conditioned generation and text-following capability. In addition to vision generation tasks, we further evaluate on a visual navigation task, demonstrating its emergent ability to learn distinct action effects from pure visual input. Despite these advances, the proposed recursive framework could potentially amplify misleading visual content. We therefore recommend deploying it with regular fairness audits as safety guardrails. We hope this work will contribute to the study of latent reasoning in multimodal models for better aligned cross-modal generation.

#### Limitations

Despite the model providing enhanced performance with lightweight adaptation, the proposed framework may not guarantee optimal latent computation depth during longer recursions. For example, a halting mechanism to decide when to stop latent reasoning could enable more efficient computation. Moreover, although our method is demonstrated on vision diffusion models, its applicability to other modalities, e.g., audio generation, and its extendability to broader model architectures remain for future study, as the gating policy will likely require adaptation to specific model configurations.

```{=latex}
\bibliographystyle{unsrt}
```
```{=latex}
\appendix
```
# Experimental settings and hyperparameters {#appe:hype}

Table `\ref{tab:hyperparameters}`{=latex} presents the hyperparameters used in this study. Unless otherwise noted, we employed the author-recommended settings and hyperparameters for the re-implementation of baseline models. All experiments are performed on four H800 GPUs with 80G memory each.

```{=latex}
\small
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
  **Parameter**                      **Value**
  ---------------------------------- -----------------------------------------------
  Optimizer                          AdamW
  Learning rate                      $5 \times 10^{-4}$
  Betas                              (0.9, 0.999)
  Weight decay                       0.0
  Batch size                         128
  Input image size                   256 (DiTs)/ 1024 (SD3-medium)
  LoRA rank                          128
  Number of experts                  1/2/5
  Max latent steps                   1/2/5
  Epochs                             20 (DiTs)/ 50 (SD3-medium)
  Gumbel softmax temperature         5.0
  Differentiable top-k temperature   0.1
  Looped layers                      12 (middle layer)/ 4, 8, 12 (multiple layers)

  : Hyperparameter settings.

`\label{tab:hyperparameters}`{=latex}

# Algorithm {#appe:algo}

We present the complete pseudocode (Algorithm `\ref{algo:scheme}`{=latex}) for our proposed Recursive Sparse Reasoning mechanism integrated with Multimodal Diffusion Transformers, specifically SD3-medium architecture. The algorithm details the recursive computation process where visual tokens are iteratively refined through sparsely activated neural modules (LoRA adapters) while maintaining joint attention with text tokens.

```{=latex}
\begin{algorithm}[h]\small
\linespread{1.25}\selectfont
\caption{Recursive Sparse Reasoning in Multimodal Diffusion Latents}
\label{algo:scheme}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Vision tokens $x$, text tokens $c$, conditioning $y$ (diffusion timestep + text embeddings)
\STATE \textbf{Parameters:} Number of experts $M$, maximum latent steps $T_{\text{latent}}$, temperature $\tau$, LoRA rank $r$
\STATE \textbf{Output:} Refined vision tokens $x'$, text tokens $c'$

\STATE \textit{// Modulate inputs with conditioning}
\STATE $\tilde{x} \gets \alpha_x \cdot \text{LayerNorm}(x) + \beta_x$
\STATE $\tilde{c} \gets \alpha_c \cdot \text{LayerNorm}(c) + \beta_c$
\STATE $x_{\text{residual}} \gets \tilde{x}$
\STATE $\text{context\_qkv} \gets \text{PreAttention}(\tilde{c})$

\STATE \textit{// Recursive sparse computation over latent steps}
\FOR{$t_{\text{latent}} = 1$ \TO $T_{\text{latent}}$}
    \STATE $\text{USE\_BASE} \gets (t_{\text{latent}} == T_{\text{latent}})$
    \STATE $\text{step\_embed} \gets \text{TimestepEmbedder}(t_{\text{latent}})$
    \STATE $\text{gate\_input} \gets \tilde{x} + \text{unsqueeze}(y) + \text{step\_embed}$

    \STATE \textit{// Gating network computes routing probabilities}
    \STATE $\text{logits} \gets \theta_{\text{gate}}(\text{gate\_input})$
    \STATE $z_m \gets \text{logits}_m + \text{Gumbel}(0,1)$ \quad \textit{// Add Gumbel noise}
    \STATE $\hat{z}_m \gets \text{softmax}(z_m / \tau)$ \quad \textit{// Gumbel-Softmax}
    \STATE $m^* \gets \arg\max_m \hat{z}_m$ \quad \textit{// Select expert}

    \STATE \textit{// Apply selected LoRA adapter}
    \STATE $\tilde{x}_{\text{scaled}} \gets \theta_{\text{adapter}}^{m^*}(\tilde{x})$

    \STATE \textit{// Joint attention with text branch}
    \STATE $q, k, v \gets \text{SplitQKV}(\tilde{x}_{\text{scaled}})$
    \STATE $q, k, v \gets \text{Concat}([\text{context\_qkv}[0], q], [\text{context\_qkv}[1], k], [\text{context\_qkv}[2], v])$
    \STATE $\text{attn} \gets \text{MultiHeadAttention}(q, k, v)$
    \STATE $\text{context\_attn}, \tilde{x} \gets \text{Split}(\text{attn})$

    \STATE \textit{// Residual connection (except final step)}
    \IF{$t_{\text{latent}} < T_{\text{latent}}$}
        \STATE $\tilde{x} \gets \tilde{x} + x_{\text{residual}}$
    \ENDIF
\ENDFOR

\STATE \textit{// Post-processing}
\STATE $c \gets \text{PostAttention}(\text{context\_attn}, c, \gamma_c)$
\STATE $x \gets \text{PostAttention}(\tilde{x}, x, \gamma_x)$

\STATE \RETURN $x, c$
\end{algorithmic}
\end{algorithm}
```
