---
abstract: |
  Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
author:
- Zhaofeng Wu
- Oliver Sieberling
- Shawn Tan
- Rameswar Panda
- Yury Polyanskiy
- Yoon Kim
affiliations:
- MIT
- MIT-IBM Watson AI Lab
contact: zfw@csail.mit.edu
bibliography:
- references.bib
title: Variable-Width Transformers
---

# Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.

![We propose ><former, where different layers have different widths. We specifically employ a $\times$-shaped architecture, where inactive dimensions are copied upward in the residual stream. We find that this improves performance and saves on training FLOPs, KV cache memory, and I/O cost. ](figures/overview_bowtie.png){#fig:overview width="0.5\\linewidth"}

# Introduction

Scaling has been a critical driver of progress in modern AI. One major axis of scaling is model size. In the context of transformers, model size is a function of the dimension of the transformer block (a model's "width")[^1] and the number of transformer blocks (a model's "depth"). Consequently, much prior work has investigated how to optimally scale model size through scaling transformer width and depth. While early research suggested that a model's shape (i.e., the ratio of width to depth) mattered less than the total parameter count for performance [@kaplan2020scalinglawsneurallanguage], subsequent work found that model shape can lead to nontrivial differences [@levine2020limits; @tay2021scale; @petty2024impact] and should be taken into account when fitting scaling laws [@mcleish2025gemstones].

Yet, even as the optimal global shape of transformers is debated, these studies generally preserve a less-examined assumption: a model's width is constant across depth. That is, once a hidden dimension size is chosen, every transformer block receives approximately the same computation/parameter budget. This uniform-width design is convenient, but not obviously optimal. Different layers may play different roles during computation, and a fixed total parameter or FLOP budget need not be allocated evenly across depth. This motivates a general question: under a fixed depth and parameter budget, should all layers have the same width, or should capacity be distributed nonuniformly?

We study this question empirically by training decoder-only transformer language models (LMs) with nonuniform allocation of parameters and compute across depth. Concretely, for a given parameter and depth constraint, we vary the model shape across several settings: growing ($\vee$-shaped), narrowing ($\wedge$-shaped), growing then narrowing ($\Diamond$-shaped), and narrowing then growing ($\times$-shaped). Across these settings, we find that $\times$-shaped models (wide in early and late layers but narrower in the middle) outperform parameter-matched constant-width transformers. We call them ><formers. This differs from prior work on layerwise allocation of only FFN intermediate dimensions, which found benefits from allocating more computation to middle layers [@ikeda2025layerwise]; our results instead suggest that reallocating the full block width leads to a different optimal profile.

A key implementation detail is how variable-width layers interact with the residual stream. Naïvely changing the residual dimension between layers introduces projection bottlenecks and changes the skip path. We instead keep a fixed global residual dimension and allow each block to read from and write to a layer-specific slice of the residual stream. Coordinates not used by a given block bypass that block and are carried forward unchanged to later layers by copying. We find that this fixed-residual construction is important for realizing the gains from nonuniform width profiles.

Nonuniform width allocation also has efficiency benefits: it requires fewer training and inference FLOPs than constant-width transformers, while also reducing the KV cache memory and I/O cost for moving activations. Because a layer's parameter count scales quadratically with its width, while attention FLOPs and KV cache size scale linearly, matching the parameter count of a uniform baseline results in a reduction in average layer width (and hence the KV cache). Across models with 200M--2B parameters, ><formers achieve approximately a 3% relative improvement in perplexity over parameter-matched constant-width baselines while reducing KV-cache size by about 10% and FLOPs by about 3%. These benefits also extend to mixture-of-experts transformers. We further analyze how the optimal bottleneck width and bottleneck location depend on the model budget, providing empirical guidance for scaling nonuniform-width transformers. Finally, we also perform analyses to understand the benefit of ><former, showing that it employs a different representation strategy than the constant-width baseline and mitigates mid-layer representation collapse.

# Variable-Width Transformers {#sec:method}

A standard transformer contains a series of $L$ layers. In each layer $\ell\in[1,L]$, a transformer block $\mathcal{B}^\ell: \R^d \to \R^d$ transforms the input from the previous layer $\vx^{\ell-1}$ by $\vx^\ell = \mathcal{B}^\ell(\vx^{\ell-1}) + \vx^{\ell-1}$. $d$ is the model dimension. We define $\vx^0$ as the input embeddings.

In this work, we question why $d$ must be held constant. Much past work has shown that different layers of a transformer LM perform distinct functions, which naturally may require different amounts of capacity [@tenney-etal-2019-bert; @meng2022locating; @10.1016/j.csl.2022.101429]. This motivates each layer $\ell$ having a different dimension $d_\ell$.

One practical challenge, however, is that this requires resizing between layers: $\vx^\ell = \mathcal{B}^\ell(f^{\ell}(\vx^{\ell-1})) + f^\ell(\vx^{\ell-1})$ where $f^\ell: \R^{d_{\ell-1}}\to\R^{d_{\ell}}$ resizes the hidden state. We consider a parameter-free approach. When shrinking dimensions, i.e., $d_\ell < d_{\ell - 1}$, we simply truncate the extra dimensions, i.e., $f^\ell(\vx) = \vx[:d_\ell]$. When expanding dimensions, i.e., $d_\ell > d_{\ell - 1}$, we restore each previously truncated dimension from the most recent layer that actively processed it. Formally, for each coordinate index $i \in \{1, \dots, d_\ell\}$, the $i$-th element of the resized hidden state $f^\ell(\vx^{\ell-1}) \in \mathbb{R}^{d_\ell}$ is constructed as: $$[f^\ell(\vx^{\ell-1})]_i = [\vx^{\ell'}]_i \quad \text{where} \quad \ell' = \max \{ \tilde{\ell} < \ell \mid d_{\tilde{\ell}} \ge i \}$$ If no such prior layer exists (i.e., if the required dimension exceeds the maximum width of all preceding layers), the coordinate is padded with $0$. See the Ablations section for experiments showing that this method outperforms alternatives such as training a projection layer or always padding with 0s.
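The resizing rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: hidden states are lists of floats, index 0 of `widths` stands for the input embedding, and `toy_block` is a hypothetical stand-in for $\mathcal{B}^\ell$. The sketch also checks that resizing layer by layer gives the same outputs as keeping one wide residual buffer in which each layer updates only its own slice.

```python
from typing import List

Vector = List[float]


def resize(history: List[Vector], widths: List[int], layer: int) -> Vector:
    """Build f^l(x^{l-1}), the width-widths[layer] input to `layer`.

    history[k] holds x^k and widths[k] holds d_k for k < layer
    (k = 0 is the embedding, an assumption of this sketch).
    """
    out = []
    for i in range(widths[layer]):            # 0-indexed coordinate
        value = 0.0                           # zero padding if never seen before
        for k in range(layer - 1, -1, -1):    # most recent layer first
            if widths[k] > i:                 # layer k actively processed coordinate i
                value = history[k][i]
                break
        out.append(value)
    return out


def toy_block(x: Vector, layer: int) -> Vector:
    """Hypothetical stand-in for B^l: any map from R^{d_l} to R^{d_l}."""
    mean = sum(x) / len(x)
    return [0.1 * layer * (mean - v) + 0.01 for v in x]


def forward_resized(x0: Vector, widths: List[int]) -> List[Vector]:
    """x^l = B^l(f^l(x^{l-1})) + f^l(x^{l-1}) with explicit resizing."""
    history = [x0]
    for layer in range(1, len(widths)):
        h = resize(history, widths, layer)
        history.append([b + v for b, v in zip(toy_block(h, layer), h)])
    return history


def forward_wide_residual(x0: Vector, widths: List[int]) -> List[Vector]:
    """Same computation with one residual buffer as wide as the widest layer."""
    stream = x0 + [0.0] * (max(widths) - len(x0))
    outputs = [x0]
    for layer in range(1, len(widths)):
        d = widths[layer]
        h = stream[:d]                        # the layer reads only its slice
        stream[:d] = [b + v for b, v in zip(toy_block(h, layer), h)]
        outputs.append(stream[:d])            # untouched coordinates stay in `stream`
    return outputs


widths = [4, 4, 2, 3, 5]                      # d_0 (embedding) followed by d_1..d_4
x0 = [1.0, 2.0, 3.0, 4.0]
assert forward_resized(x0, widths) == forward_wide_residual(x0, widths)
```

Truncation needs no special case here: when a layer shrinks, the most recent layer is always wide enough, so the lookup returns the leading coordinates of $\vx^{\ell-1}$.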

Using this expansion method, we can mathematically conceptualize our variable-width model as a uniform-width model except (1) each layer only reads from/writes to a subset of residual stream dimensions, and (2) it has a larger residual stream width (equal to the width of the widest layer).

We investigate different shapes with two additional parameters, $\ell^*$, the layer index of a bottleneck layer, and $d_{\ell^*}$, its dimension. We parameterize the rest of the layer widths geometrically:[^2] $d_\ell = \alpha^- d_{\ell-1}$ with change rate $\alpha^{-}$ for $\ell\le\ell^*$ (the early layers) and $d_\ell = \alpha^+ d_{\ell-1}$ with change rate $\alpha^{+}$ for $\ell>\ell^*$ (the late layers). With different settings of $\alpha^+$, $\alpha^-$, and $\ell^*$, we recover different kinds of shapes: when $\alpha^-<1$ and $\alpha^+>1$, we obtain a $\times$-shaped model; when $\alpha^->1$ and $\alpha^+<1$, we obtain a $\Diamond$-shaped model; when $\ell^*$ is 1 or $L$, we obtain a $\vee$ or $\wedge$-shaped model. We keep the size of the input and output embeddings unchanged: the QKV projections of the first layer and the MLP down-projection of the last layer adjust for these size mismatches.[^3]

A compelling property of our variable-width architecture is that, when matched in parameter count to a constant-width baseline, it requires strictly fewer overall FLOPs and has a strictly lower average layer width (and thus, lower KV cache size and lower I/O cost for moving activations). Concretely, the parameter count of a transformer layer is dominated by the linear projection matrices (QKV, output, and MLP), which scale quadratically with the hidden dimension: $P_\ell \approx K d_\ell^2$, where $K$ is a constant depending on the number of projection matrices and the MLP expansion factor. Therefore, if we match the parameter count of a variable-width model to a baseline of constant width $d$, we effectively equate the sum of the squared dimensions: $$K \sum_{\ell=1}^L d_\ell^2 = K L d^2 \implies \frac{1}{L} \sum_{\ell=1}^L d_\ell^2 = d^2.$$ Because the square of the mean is upper-bounded by the mean of the squares, with equality only when all widths are equal, and because variable-width ensures the widths $d_\ell$ are not constant, the average layer size is strictly smaller: $$\left( \frac{1}{L} \sum_{\ell=1}^L d_\ell \right)^2 < \frac{1}{L} \sum_{\ell=1}^L d_\ell^2 = d^2 \implies \frac{1}{L} \sum_{\ell=1}^L d_\ell < d.$$ As for FLOPs, the number of FLOPs per token in a linear projection is proportional to the number of weights, so when parameter-matched, the total dense FLOPs remain identical to the baseline. The FLOPs of attention dot-products, in contrast, scale linearly with the hidden dimension: $\text{FLOPs}_\ell \propto N^2 d_\ell$, where $N$ is the sequence length. The total attention compute $\sum_{\ell=1}^L N^2 d_\ell=N^2 \sum_{\ell=1}^L d_\ell$ is therefore strictly lower than the baseline $N^2 L d$.[^4]
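As a toy illustration of this argument (the numbers are chosen for convenience and are not taken from the experiments), consider $L=2$ layers and a baseline width $d=5$. Widths $d_1=7$ and $d_2=1$ match the baseline's sum of squares, yet give a smaller mean width and less attention compute:

$$d_1^2 + d_2^2 = 49 + 1 = 50 = 2 \cdot 5^2, \qquad \frac{d_1 + d_2}{2} = 4 < 5, \qquad N^2 (d_1 + d_2) = 8N^2 < 10N^2 = N^2 L d.$$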

In summary, there are 4 parameters for a variable-width transformer: $\ell^*$, $d_{\ell^*}$, $\alpha^+$, and $\alpha^-$. We set two constraints: $d_1=d_L$ (for $\Diamond$ and $\times$ shapes) and that the parameter count matches a constant-width baseline.[^5] So for our experiments, we consider $\ell^*$ and $d_{\ell^*}$ as two hyperparameters, and automatically solve for all layer widths.[^6] See the parameter-matching appendix for our derivation.
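One way to picture the "solve for all layer widths" step is a one-dimensional search over the shared end width $d_1=d_L$. The sketch below is an assumption-laden illustration, not the derivation from the appendix: it treats widths as continuous (no rounding to head sizes), uses $P_\ell \propto d_\ell^2$ exactly, ignores the embedding adjustments in the first and last layers, and assumes $1 < \ell^* < L$ with $d_{\ell^*} < d$. The function names are hypothetical.

```python
from typing import List


def widths_for(d_first: float, d_star: float, l_star: int, num_layers: int) -> List[float]:
    """Geometric widths with d_1 = d_L = d_first and d_{l*} = d_star."""
    alpha_minus = (d_star / d_first) ** (1.0 / (l_star - 1))           # rate for l <= l*
    alpha_plus = (d_first / d_star) ** (1.0 / (num_layers - l_star))   # rate for l > l*
    widths = []
    for layer in range(1, num_layers + 1):
        rate = alpha_minus if layer <= l_star else alpha_plus
        widths.append(d_star * rate ** (layer - l_star))
    return widths


def solve_widths(d: float, num_layers: int, l_star: int, d_star: float) -> List[float]:
    """Bisect on d_first until sum_l d_l^2 equals L * d^2 (parameter matching)."""
    target = num_layers * d * d
    lo, hi = d_star, d * num_layers ** 0.5    # sum of squares is below/above target here
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        total = sum(w * w for w in widths_for(mid, d_star, l_star, num_layers))
        if total < target:
            lo = mid
        else:
            hi = mid
    return widths_for(0.5 * (lo + hi), d_star, l_star, num_layers)


# Shape of the 500M configuration with a bottleneck at 0.75 L and width 0.3 d.
widths = solve_widths(d=960.0, num_layers=24, l_star=18, d_star=288.0)
mean_width = sum(widths) / len(widths)        # smaller than 960 by the inequality above
```

Bisection suffices because every width is nondecreasing in `d_first`, so the sum of squares is monotone. The resulting mean width need not coincide with the reported average layer sizes, since the actual models may round widths and account for terms this sketch omits.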

# ><former

::: {#tab:training-hparams}
  Parameters              Layers ($L$)       Hidden ($d$)             Batch Size                Tokens           Experts (Tot/Act)   MLP Interm. Size
  ---------------------- -------------- ----------------------- ----------------------- ----------------------- ------------------- ------------------
  *Dense Models*
  200M                         16                640                     512                     10B                    --                  --
  500M                         24                960                     1024                    25B                    --                  --
  1B                           32                1280                    2048                    50B                    --                  --
  2B                           40                1600                    4096                    100B                   --                  --
  *Mixture of Experts*
  3B (1B active)               40                1600                    4096                    100B                 22 / 3               512

  : Pre-training hyperparameters.
:::

In this section, we discuss training variable-width transformers. After introducing our training setup, we first establish that a $\times$-shaped model works best, and then identify a parameterization of the bottleneck layer index $\ell^*$ and dimension $d_{\ell^*}$ that works well across model sizes. Using this recipe, we pre-train ><formers and constant-width baselines across sizes and find that ><formers consistently achieve better loss and downstream task performance with a smaller pre-training FLOPs footprint and KV cache size.

## Training Setup {#sec:setup}

We pre-train four model sizes (200M, 500M, 1B, and 2B) with different numbers of layers and hidden sizes (Table 1). For each, we pre-train constant-width transformers and variable-width models. We also consider a Mixture-of-Experts (MoE) model with 3B total/1B active parameters. To parameter-match the variable-width MoE model with the baseline, we match the number of total parameters. This leaves the variable-width model with 3% fewer active parameters, but we show in the results section that it still outperforms the constant-width baseline.

For pre-training data, we train on DCLM [@li2024datacomplm]. For each model size, we train models to 2.5$\times$ Chinchilla-optimal [@10.5555/3600270.3602446], i.e., the number of trained tokens is equal to 50 times the parameter count, e.g., 100B tokens for the 2B model. We train on length-4096 sequences, with model-size-dependent batch sizes (Table 1). Inputs are tokenized with OpenAI's `cl100k_base`.[^7]

All models are trained with maximal update parametrization ($\mu$P; [@yang2024tensor]). We use $\mu$P-aware initialization and optimizer parameter groups, with the same AdamW hyperparameters across scales: learning rate $10^{-2}$, $\beta=(0.9,0.95)$, weight decay $0.1$, and $\epsilon=10^{-10}$. We use a power learning-rate decay schedule. The learning rate is linearly warmed up for approximately the first 8% of training steps and then decayed for the remaining steps. All models are trained in bfloat16 precision. Following common practice, we omit bias terms from all linear projections, including attention, MLP, and output projections [@chowdhery2022palmscalinglanguagemodeling; @groeneveld-etal-2024-olmo]. We use the SwiGLU activation [@shazeer2020gluvariantsimprovetransformer].

We measure the training loss of each model,[^8] averaged over the final 1,000 steps (with a 10-step increment) for smoothing. We also report pre-training FLOPs in PFLOP/s-days, the number of days required assuming 1 PFLOP per second [@kimiteam2026attentionresiduals]. Finally, we compare the average layer size, which is proportional to the KV cache size during inference.
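The unit conversion and token budget are simple arithmetic, sketched below. The $6ND$ estimate of training FLOPs is a common rule of thumb that we bring in only for illustration; it is not necessarily how the reported FLOPs were counted, and it ignores attention dot-products. All function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
PFLOP = 1e15


def tokens_for(params: float, tokens_per_param: float = 50.0) -> float:
    """Token budget at 50 tokens per parameter (2.5x the ~20 of Chinchilla-optimal)."""
    return params * tokens_per_param


def pflops_days(total_flops: float) -> float:
    """Days of compute at a sustained 1 PFLOP per second."""
    return total_flops / (PFLOP * SECONDS_PER_DAY)


def approx_training_flops(params: float, tokens: float) -> float:
    """6 * N * D rule of thumb for dense matmul FLOPs (assumption, see lead-in)."""
    return 6.0 * params * tokens


tokens_2b = tokens_for(2e9)                                        # 100B tokens
estimate = pflops_days(approx_training_flops(2e9, tokens_2b))      # roughly 13.9
```

The rough estimate of about 13.9 PFLOP/s-days for a nominal 2B-parameter model is in the same range as, but lower than, the measured values for the 2B models, which is plausible for an estimate that leaves out attention FLOPs and treats the nominal size as the exact parameter count.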

## The $\times$ Shape Works Best

![Comparing variable-width transformers with different shapes, each sweeping over multiple hyperparameter choices. **The $\times$-shaped model performs the best.**](figures/shapes.png){#fig:shapes width="0.6\\linewidth"}

We first explore different shapes on the 500M-parameter scale. Beyond a regular constant-width transformer, we experiment with a $\Diamond$ shape, a $\times$ shape, a $\vee$ shape, and a $\wedge$ shape. Because the optimal hyperparameters ($\ell^*$ and $d_{\ell^*}$; see the method section) may differ per shape, we experiment with 3 choices for each hyperparameter per shape. This amounts to 9 runs each for the $\times$ shape and the $\Diamond$ shape, and 3 runs each for the $\vee$ shape and the $\wedge$ shape, for which $\ell^*$ is irrelevant. In the figure, we plot their loss and the pre-training FLOPs. We see that the $\times$ shape consistently performs the best.[^9]

## Finding a Specific Width Schedule

![The effect of the bottleneck layer index and dimension on language modeling loss, parameterized as a ratio to the total number of layers and the base dimension. $\ell^*=r_\ell L$ and $d_{\ell^*}=r_d d$. We also show the baseline performance, indicated using $r_d=1$, and the resulting average layer size. **This parameterization yields a relatively consistent model performance pattern across model sizes.**](figures/hparams.png){#fig:hparams width="\\linewidth"}

As mentioned in the method section, we need to choose a bottleneck layer index $\ell^*$ and the bottleneck dimension $d_{\ell^*}$. Ideally, we want to find a recipe that works well across model sizes so that we do not have to search for them individually at each model size. Therefore, we parameterize these two hyperparameters as ratios to the total number of layers $L$ and the hidden size $d$: $\ell^*=r_\ell L$ and $d_{\ell^*}=r_d d$. We sweep over different values of $r_\ell$ and $r_d$ at small model sizes: 200M, 500M, and 1B. The figure shows the results. No single $(r_\ell, r_d)$ pair is consistently the best, but this ratio-based parameterization leads to *roughly* similar trends across model sizes. Based on this sweep, we use $\ell^*=0.75L$ and $d_{\ell^*}=0.3d$ by default going forward.

## ><former Outperforms Constant-Width Transformer {#sec:results}

  Size        Model                   Loss        PFLOP/s-days                           Avg layer size
  ----------- ----------------------- ----------- -------------------------------------- --------------------------------------
  200M        Transformer             3.452       0.18                                   640
              ><former                **3.430**   **0.17 ($-$3.2%)**                     **576 ($-$10.0%)**
  500M        Transformer             3.138       1.11                                   960
              ><former                **3.099**   **1.07 ($-$3.7%)**                     **855 ($-$11.0%)**
  1B          Transformer             2.926       4.52                                   1280
              ><former                **2.890**   **4.41 ($-$2.6%)**                     **1145 ($-$10.5%)**
  2B          Transformer             2.751       16.92                                  1600
              ><former                **2.726**   **16.49 ($-$2.5%)**                    **1426 ($-$10.9%)**
  3B/1B MoE   Transformer             2.726       10.13                                  1600
              ><former                **2.710**   **9.66 ($-$4.6%)**                     **1426 ($-$10.9%)**

  : The performance, pre-training FLOPs, and average layer size of ><formers vs. constant-width transformers. **><former consistently achieves lower loss with lower pre-training FLOPs and average layer size.**


![Language modeling loss vs. pre-training FLOPs (left) and average layer size (right). **><former produces lower loss at smaller FLOP and average layer size costs.**](figures/scaling_law.png){#fig:main-results width="\\linewidth"}

```{=latex}
\begin{table*}[t]

\resizebox{\textwidth}{!}{
\begin{tabular}{lcccccccccccccc}
\toprule
& \multicolumn{12}{c}{Accuracy ($\uparrow$)}
& \multicolumn{2}{c}{Perplexity ($\downarrow$)} \\
\cmidrule(lr){2-13}\cmidrule(lr){14-15}
Model
& ARC-C
& ARC-E
& BoolQ
& COPA
& HellaSwag
& LAMBADA
& OBQA
& PIQA
& RACE
& SciQ
& WinoGrande
& Avg.
& LAMBADA
& WikiText \\
\midrule
2B constant-width
& 33.0 & 59.5 & 59.4 & 76.0 & 55.9
& 55.4 & 33.8 & 73.3 & 34.3 & 79.5 & 57.0
& 56.1
& 8.18 & 16.96 \\
2B ><former
& 34.4 & \textbf{63.3} & 60.9 & 73.0 & \textbf{57.9}
& 56.1 & 33.6 & 74.4 & 33.4 & 82.0 & \textbf{60.2}
& \textbf{57.2}
& \textbf{7.43} & \textbf{16.32} \\

\midrule


MoE baseline
& 33.7 & 62.2 & \textbf{63.0} & 77.0 & 57.3 & 56.1
& 37.4 & 74.8 & 34.6 & \textbf{83.9} & 55.6 & 57.8
& 7.78 & 16.36 \\
MoE ><former
& 33.2 & 61.0 & 59.5 & 80.0 & \textbf{58.7} & 56.2
& 37.8 & 75.2 & 34.3 & 80.2 & \textbf{60.1} & 57.8
& \textbf{7.45} & \textbf{15.98} \\
\bottomrule
\end{tabular}
}
\caption{Model performance on standard LM evaluation datasets. For accuracy metrics, bold indicates the higher value when significant at $p < 0.05$ under a one-sided test; for perplexity, bold indicates the lower value. \textbf{><formers consistently outperform constant-width transformers on perplexity-based tasks, and the 2B ><former wins on most natural language understanding tasks.}}
\label{tab:dense-moe-eval}
\end{table*}
```
Table 2 shows that, at all model sizes we tested, ><former outperforms the constant-width transformer while requiring fewer FLOPs and a smaller average layer size (i.e., a smaller KV cache).

In the scaling figure (left), we fit a scaling law curve on loss vs. pre-training FLOPs to ><formers and constant-width transformers [@kaplan2020scalinglawsneurallanguage], finding a tight fit. Similarly, in the same figure (right), we also find a tight power-law fit on loss vs. the average layer size. From these scaling law curves, we compute that ><former can achieve the 2B constant-width transformer's loss (2.751) with 77.8% of the FLOPs and 85.1% of the average layer width. Furthermore, both scaling law curves show that ><former has not only a smaller intercept but also a slightly steeper scaling exponent, suggesting that the gaps might widen at larger sizes.
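A loss-matched comparison of this kind can be sketched with an ordinary least-squares fit in log-log space. The functional form is an assumption: the sketch fits a pure power law $\text{loss} = a \cdot C^{b}$ with no irreducible-loss term to the four dense data points per model family from the results table, so its output need not reproduce the 77.8% figure above.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log y = log a + b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    return math.exp(my - b * mx), b


def compute_for_loss(a: float, b: float, loss: float) -> float:
    """Invert loss = a * C**b to get the compute C that reaches `loss`."""
    return (loss / a) ** (1.0 / b)


# (PFLOP/s-days, loss) for the dense models in the results table.
baseline = ([0.18, 1.11, 4.52, 16.92], [3.452, 3.138, 2.926, 2.751])
variable_width = ([0.17, 1.07, 4.41, 16.49], [3.430, 3.099, 2.890, 2.726])

a_base, b_base = fit_power_law(*baseline)
a_var, b_var = fit_power_law(*variable_width)

# Fraction of the 2B baseline's compute at which the fitted variable-width
# curve reaches the 2B baseline's loss.
ratio = compute_for_loss(a_var, b_var, 2.751) / 16.92
```

The same two functions apply unchanged to the loss vs. average-layer-size fit by swapping the x-values.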

We also test these models on standard LM downstream evaluation benchmarks using the `lm-evaluation-harness` [@eval-harness] in the zero-shot setting. This suite covers natural language understanding (NLU) tasks such as common-sense reasoning and reading comprehension, as well as perplexity-based tasks. For multiple-choice tasks, we report normalized accuracy when available, since it corrects for answer-length effects by normalizing choice likelihoods. When it is not provided, we report standard accuracy. We report the dataset statistics and metrics in the dataset statistics section. We evaluate the 2B models and the MoE models and show the results in Table 3. ><formers consistently outperform constant-width transformers on perplexity-based tasks. The 2B ><former also leads on most NLU tasks. The MoE ><former is mixed on NLU accuracy but improves both perplexity metrics; at these model sizes, we treat perplexity as the more informative metric of LM quality. We again note that ><former achieves this with fewer FLOPs and less memory, and also with fewer active parameters for the MoE model (see the training setup section).

![The utilization frequency of MLP activation dimensions in the 2B ><former vs. the 2B constant-width transformer, visualized separately for each layer. The shaded panel corresponds to the bottleneck layer. **><former more evenly utilizes MLP activation dimensions.**](figures/activation_utilization.png){#fig:activation-utilization width="\\linewidth"}

# Analysis

Transformers are known to use depth inefficiently [@gromov2025the], frequently developing "compression valleys" where their middle layers collapse in representational capacity and compress computations [@skean2025layer; @queipo-de-llano2026attention]. By inspecting both MLP intermediate activations and the residual stream after each layer, we find that ><former employs a different representation strategy: it mitigates the collapse in middle layers and uses its capacity more effectively than a constant-width transformer.

## ><former Improves MLP Activation Utilization {#sec:analysis-activation}

Intuitively, ><former enforces an information bottleneck that may encourage the model to more effectively use its representation capacity. We operationalize this by inspecting the utilization of intermediate MLP activations. Prior interpretability work has viewed a transformer MLP layer as containing key-value memories: the up-projection layer encodes keys for distinct concepts, the intermediate activation represents the MLP input's affinities with the concepts, and the down-projection encodes values for those concepts, taking a linear combination of them weighted by the affinity scores (the activation) [@geva-etal-2021-transformer; @geva-etal-2022-transformer]. We measure the MLP activation density of the 2B ><former vs. the constant-width transformer on the WikiText-2 validation split with 252,986 tokens [@merity2017pointer]. Because SwiGLU is continuous, we consider a dimension as active iff its activation magnitude is larger than a certain threshold. In the figure, we show that ><former enforces denser activations within MLPs across thresholds.

Dense activations are not necessarily desirable, so we also inspect the marginal utilization of each MLP activation dimension: how often a dimension is activated across tokens (thresholded at 0.1). In the mechanistic interpretability literature, low marginal utilization and "dead" dimensions are strong indicators of under-utilized capacity [@Brickenetal2023; @gao2025scaling]. The figure shows this quantity across layers. While neither model has a perfectly even distribution, ><former consistently achieves substantially better load-balancing between activation dimensions. In the additional-results section, we also show a similar trend if we additionally take activation magnitude into account, demonstrating that ><former more evenly uses the activation dimensions in the middle layers of the network.
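Both quantities reduce to thresholded counts over a tokens-by-dimensions activation matrix. The sketch below is a minimal illustration on made-up numbers, assuming the MLP intermediate activations of one layer have already been collected into a list of rows (one row per token); the function names are hypothetical.

```python
from typing import List


def activation_density(acts: List[List[float]], threshold: float) -> float:
    """Fraction of (token, dimension) entries with |activation| > threshold."""
    active = sum(1 for row in acts for a in row if abs(a) > threshold)
    return active / (len(acts) * len(acts[0]))


def marginal_utilization(acts: List[List[float]], threshold: float = 0.1) -> List[float]:
    """For each dimension, the fraction of tokens on which it is active."""
    n_tokens = len(acts)
    return [
        sum(1 for row in acts if abs(row[j]) > threshold) / n_tokens
        for j in range(len(acts[0]))
    ]


# Toy activations: 4 tokens, 3 MLP dimensions (illustrative values only).
acts = [
    [0.5, 0.00, -0.2],
    [0.3, 0.05, 0.0],
    [0.9, 0.00, 0.0],
    [0.0, 0.00, 0.4],
]
density = activation_density(acts, threshold=0.1)     # 5 of 12 entries are active
utilization = marginal_utilization(acts)              # dimension 1 is never active
```

In this toy matrix the middle dimension is "dead" at the 0.1 threshold, which is the kind of imbalance the marginal-utilization view is meant to expose.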

<figure id="fig:matrix-entropy">
<img src="figures/activation_sparsity.png" />
<img src="figures/matrix_entropy.png" />
<figcaption>The normalized matrix entropy of layer outputs in the 2B &gt;&lt;former vs. the 2B constant-width transformer. <strong>&gt;&lt;former has a higher matrix entropy in middle-to-final layers, which corresponds to more even usage of the residual dimensions in those layers.</strong></figcaption>
</figure>

## ><former Mitigates Middle-layer Representation Collapse {#sec:analysis-residual}

We now turn from studying MLP activations to the residual stream after each layer. Recent analyses of deep, constant-width LMs reveal the emergence of "compression valleys," where the LM's middle layers collapse in representational capacity, characterized by a severe drop in representational entropy [@skean2025layer; @queipo-de-llano2026attention]. Following @queipo-de-llano2026attention, we track the normalized matrix entropy of the residual stream across all layers: $$\frac{1}{\log r} \left( -\sum_{j=1}^{r} p_j \log p_j\right), \quad p_j = \sigma_j^2 / \|\mathbf{X}\|_F^2$$ where $\sigma_j$ are the sorted singular values of the input-feature representation matrix $\mathbf{X}$ with rank $r$, again computed using the WikiText-2 validation split. Closely related to the effective dimension metric [@hill; @7098875], a higher matrix entropy indicates a more "even" use of the representation space.
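As a toy illustration of this metric (values chosen for convenience, not measured from any model), take a representation matrix of rank $r=2$. Equal singular values give $p = (\tfrac{1}{2}, \tfrac{1}{2})$ and a normalized entropy of exactly 1, whereas singular values $(3, 1)$ give

$$p = \left(\tfrac{9}{10}, \tfrac{1}{10}\right), \qquad \frac{-(0.9 \log 0.9 + 0.1 \log 0.1)}{\log 2} \approx 0.469,$$

so concentrating most of the squared norm in one direction roughly halves the score.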

We consider the per-layer hidden states in this analysis. For ><former, recall our interpretation from the method section, which views it as having a wide residual stream where each layer reads from/writes to only a subset of dimensions. Accordingly, we consider this wide residual stream as its effective hidden states.

In the figure, we see that the baseline model exhibits a severe compression valley: in middle layers, its normalized entropy drops to near-zero, indicating that the token representations have collapsed into a highly degenerate, low-rank subspace despite the large width. This is consistent with prior findings [@skean2025layer; @queipo-de-llano2026attention]. In contrast, ><former restructures this dynamic. While it actively lowers its entropy in the early layers to compress the representation (anticipating the width reduction), it avoids the middle-layer collapse. Throughout the bottleneck and final layers, ><former maintains a higher normalized entropy, potentially suggesting that physically constraining the parameter space encourages the network to maintain a high-entropy manifold.

## Predictive Dynamics via the Logit Lens {#sec:analysis-logit-lens}

![Logit lens analysis of the 2B ><former versus the constant-width baseline. **Left:** ><former assigns higher target-token probability through much of the network. **Middle:** ><former's decoded token distribution changes more gradually across middle layers. **Right:** ><former has lower entropy in early layers but declines less rapidly than the baseline in the final layers.](figures/logit_lens.png){#fig:logit-lens width="\\linewidth"}

To understand how these geometric differences in the residual stream affect model predictions, we project intermediate hidden states into vocabulary space using the logit lens [@logit-lens]. Specifically, at each layer, we decode hidden states by applying the final RMS normalization followed by the unembedding matrix. As in the residual-stream analysis above, we treat ><former's effective wide residual stream as its hidden state, restricting to the residual dimensions visible to the unembedding.

For each layer, we measure the target-token log probability, the entropy of the decoded token distribution, and the layer-to-layer KL divergence between adjacent logit-lens distributions. We symmetrize this KL by averaging the two directions, using it as a proxy for how rapidly the decoded distribution changes with depth.
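These three quantities follow directly from the per-layer decoded logits. The sketch below is a minimal illustration for a single token position on made-up logits; it assumes the logits have already been obtained by applying the final normalization and unembedding to each layer's hidden state, and all function names are hypothetical.

```python
import math
from typing import Dict, List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def entropy(logp: List[float]) -> float:
    return -sum(math.exp(lp) * lp for lp in logp)


def kl(logp: List[float], logq: List[float]) -> float:
    """KL(p || q) from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def symmetric_kl(logp: List[float], logq: List[float]) -> float:
    """Average of the two KL directions."""
    return 0.5 * (kl(logp, logq) + kl(logq, logp))


def logit_lens_metrics(layer_logits: List[List[float]], target: int) -> Dict[str, List[float]]:
    """Per-layer target log-probability and entropy, plus adjacent-layer symmetric KL."""
    logps = [log_softmax(z) for z in layer_logits]
    return {
        "target_logprob": [lp[target] for lp in logps],
        "entropy": [entropy(lp) for lp in logps],
        "adjacent_sym_kl": [symmetric_kl(p, q) for p, q in zip(logps, logps[1:])],
    }


# Toy example: 3 layers, vocabulary of 3 tokens, target token 0 (illustrative values).
metrics = logit_lens_metrics([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]], target=0)
```

With $n$ layers there are $n-1$ adjacent-layer KL values; in practice these would be averaged over token positions before plotting against depth.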

The figure shows that ><former assigns higher probability to the target token, with lower decoded-distribution entropy, through much of the early-to-middle network. At the same time, its decoded token distribution changes more gradually across layers, as reflected by lower layer-to-layer KL. In the final layers, the distribution changes rapidly again as probability mass concentrates on the target token.

```{=latex}
\begin{wraptable}[11]{R}{4.75cm} %


    \begin{tabular}{lr} %
        \toprule
        \textbf{Expansion Method} & \textbf{Loss} \\
        \midrule
        Constant-width    & 3.138 \\
        \midrule
        Carry-forward     & 3.099 \\
        Zero Padding        & 3.124 \\
        Projection & 3.150 \\
        \bottomrule
    \end{tabular}
    \caption{Performance comparison of different methods to expand extra dimensions at 500M. \textbf{Simply carrying forward features from lower layers performs the best.}}
    \label{tab:moveup-ablation}

\end{wraptable}
```
## Ablations {#sec:moveup-ablation}

We analyze alternative methods of expanding dimensions. Beyond our default method, which carries forward features by copying coordinates through the residual stream, we also consider (1) padding with 0s and (2) training a projection layer to predict the extra dimensions from the previous layer representation.[^10] For each, we also sweep over multiple hyperparameter configurations and report the best loss. We ablate at the 500M scale, and the expansion-method table shows that copying features performs the best.

# Limitations

A major caveat is that our approach adds significant complexity for efficient training. Concretely, for efficient training one would need to develop and optimize kernels for many different shapes, each of which has different latency, memory footprint, and compute profiles. The fixed-residual construction also potentially adds overhead, since the slicing, copying, and zero-padding around a global residual stream wider than the baseline $d$ introduce extra kernel launches, though much of this could be mitigated via kernel fusion. Heterogeneous per-layer widths are further in tension with standard tensor/pipeline parallelism techniques.

We stress, however, that these are implementation rather than algorithmic limitations: variable-width transformers are still matmul-rich, and the gap we describe reflects the fact that current infrastructure has been heavily optimized for the uniform-width regime rather than any intrinsic property of the architecture. We expect that purpose-built kernels would close much of the gap between theoretical and realized efficiency.

More broadly, while we are not calling for immediate adoption of ><formers, we hope that future architecture research can capitalize on this previously unnoticed degree of freedom in design.

# Related Work

#### Nonuniform allocation of width in transformers.

Several transformer variants allocate parameters nonuniformly across depth. DeLighT uses block-wise scaling, making earlier blocks shallower/narrower and later blocks deeper/wider [@mehta2020delight]. OpenELM adopts layerwise scaling in decoder-only language models by varying attention and feed-forward dimensions across layers [@mehta2024openelm]. Recent layerwise-scaling variants explore framed, reverse, and crown allocation profiles [@baroian2025crown]. @ikeda2025layerwise study the layerwise importance of feed-forward networks by reallocating MLP capacity and find benefits from concentrating MLPs in middle layers. Our work differs from these approaches in varying the full block hidden dimension, rather than only the attention-head count, MLP multiplier, or lightweight block internals. This requires addressing how variable-width blocks interact with the residual stream; our fixed-residual construction lets inactive coordinates bypass narrower blocks.

#### Bottleneck across sequence length.

There has also been work that performs compression across sequence length. Funnel-Transformer gradually shortens the sequence of hidden states and later recovers token-level representations for prediction [@dai2020funnel]. Hourglass Transformers downsample and upsample activations to build an explicit hierarchical language model [@nawrot2022hierarchical]. Perceiver models use cross-attention to distill high-dimensional inputs into a compact latent bottleneck before applying transformer-style processing [@jaegle2021perceiver]. These methods primarily bottleneck the number of tokens or latent slots. Our architecture instead preserves the token sequence length and introduces a bottleneck in hidden width across depth.

#### Bottleneck designs outside Transformers.

Bottleneck architectures have a long history outside of transformers. The U-Net and stacked hourglass networks use encoder--decoder structures that repeatedly reduce and recover spatial resolution, often with skip connections that preserve high-resolution information [@ronneberger2015u; @newell2016stacked]. Other architectures introduce bottlenecks along the channel dimension: ResNets use bottleneck residual blocks to reduce the cost of deep convolutional networks [@he2016deep], while MobileNetV2 uses inverted residual blocks with linear bottlenecks for efficient vision models [@sandler2018mobilenetv2]. However, transformer applications have mostly worked with non-bottleneck architectures in the channel dimension.

#### Hyper-Connections.

By expanding residual-stream capacity, ><former is conceptually related to Hyper-Connections (HC) [@zhu2025hyperconnections; @xie2026mhcmanifoldconstrainedhyperconnections; @deepseekai2026deepseekv4]. However, the mechanisms are different: HC uses learned mixing between multiple residual streams, whereas ><former uses deterministic slicing and carry-forward within a single global residual stream. In narrower layers, inactive coordinates bypass the block and are reintroduced when the width expands. Thus, ><former provides a complementary way to vary residual capacity without the learned residual-mixing matrices that @xie2026mhcmanifoldconstrainedhyperconnections identify as a source of large-scale HC instability.

# Conclusion

In this work, we challenge the standard assumption of uniform capacity allocation across transformer depth by introducing the ><former, a variable-width architecture. Across evaluations from 200M to 2B parameters (dense) and at 3B parameters (MoE), parameter-matched ><formers outperform uniform baselines on language modeling loss while reducing both FLOPs and KV cache memory, analytically and empirically. Furthermore, our analyses reveal that this bottleneck design may act as a structural regularizer, forcing the network to utilize its representation space more evenly. These findings demonstrate that nonuniform width allocation is an efficient and promising strategy for scaling future language models.

# Acknowledgments {#acknowledgments .unnumbered}

This study was supported in part by the MIT-IBM Watson AI Lab and the National Science Foundation under CAREER Award No. 2441872 and NSF grant No. CCF-21-12665.

# Parameter-Matched Width Calculation {#app:param-match}

Here we derive how we instantiate a geometric width schedule from a bottleneck layer $\ell^*$ and bottleneck dimension $d_{\ell^*}$, while matching the parameter count of a constant-width baseline. The derivation is in continuous width space; integer rounding is applied only after the parameter-matched widths have been determined.

Let $L$ be the number of transformer layers, $d$ the hidden dimension of the constant-width baseline, and $v$ the vocabulary size. We assume the input and output embeddings maintain the baseline width $d$. The residual stream has a layer-dependent width $d_\ell$, and resizing between adjacent layers is parameter-free. We impose symmetric endpoint widths, $d_1=d_L=\bar{d}$.

The layer widths follow a geometric progression on each side of the bottleneck: $$d_\ell =
    \begin{cases}
        \alpha^- d_{\ell-1}, & 1 < \ell \leq \ell^*,\\
        \alpha^+ d_{\ell-1}, & \ell^* < \ell \leq L,
    \end{cases}$$ where $\alpha^- \leq 1$ and $\alpha^+ \geq 1$. Symmetric endpoints imply $(\alpha^-)^{\ell^*-1}(\alpha^+)^{L-\ell^*}=1$, which constrains $\alpha^+$ for any candidate $\alpha^-\in(0,1]$: $$\alpha^+ = (\alpha^-)^{-\frac{\ell^*-1}{L-\ell^*}}.$$ Thus, the shape is strictly determined by $\alpha^-$. We define the dimensionless factors $c_\ell(\alpha^-)$ such that $d_\ell = \bar{d}\,c_\ell(\alpha^-)$: $$c_\ell(\alpha^-) =
    \begin{cases}
        (\alpha^-)^{\ell-1}, & 1 \leq \ell \leq \ell^*,\\
        (\alpha^-)^{\ell^*-1}(\alpha^+)^{\ell-\ell^*}, & \ell^* < \ell \leq L.
    \end{cases}$$
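The shape factors $c_\ell(\alpha^-)$ can be computed directly from the definitions above. The following is a minimal sketch, assuming $1 < \ell^* < L$ so that the exponent of $\alpha^+$ is well defined; the function name `width_factors` is hypothetical.

```python
def width_factors(alpha_minus, num_layers, bottleneck_layer):
    """Return [c_1, ..., c_L] such that d_l = d_bar * c_l.

    Layers are 1-indexed as in the text; assumes 1 < bottleneck_layer < num_layers.
    """
    # Symmetric endpoints fix alpha_plus once alpha_minus is chosen.
    exponent = -(bottleneck_layer - 1) / (num_layers - bottleneck_layer)
    alpha_plus = alpha_minus ** exponent
    factors = []
    for layer in range(1, num_layers + 1):
        if layer <= bottleneck_layer:
            factors.append(alpha_minus ** (layer - 1))
        else:
            narrowest = alpha_minus ** (bottleneck_layer - 1)
            factors.append(narrowest * alpha_plus ** (layer - bottleneck_layer))
    return factors


# Toy schedule: 5 layers with the bottleneck at layer 3 and alpha_minus = 0.5,
# which gives alpha_plus = 2.
toy_factors = width_factors(0.5, num_layers=5, bottleneck_layer=3)
```

For this toy setting the factors halve down to the bottleneck and double back up, returning to 1 at the last layer as required by $d_1=d_L$.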

To match the parameter count of the constant-width baseline, we must account for the dominant layer parameters and endpoint corrections. For a dense transformer block with SwiGLU, the per-layer parameter count scales with $K d_\ell^2$, where $K = 4 + N_m E$ ($E=4$ is the MLP expansion factor, and $N_m=3$ for SwiGLU is the number of MLP projections). Ignoring layer norm and bias terms, the baseline parameter count is $P_{\mathrm{base}} = 2vd + LKd^2$.

Because the embeddings are fixed at width $d$, if our schedule requires $\bar{d} > d$ (as is the case for ><former), we pad the initial embeddings with 0s and truncate the final unembeddings. This results in unused parameters in the first attention layer and the final MLP layer. Specifically, the first attention QKV map and the last MLP output map contain $3\bar{d}(\bar{d}-d)$ and $E\bar{d}(\bar{d}-d)$ unused parameters, respectively. The total endpoint correction is therefore: $$W_{\mathrm{end}}(\bar{d}) = \mathbf{1}\{\bar{d}>d\}\,(3+E)\bar{d}(\bar{d}-d).$$

Equating the valid parameters of our variable-width model to the baseline gives: $$K\bar{d}^2 S_2(\alpha^-) - W_{\mathrm{end}}(\bar{d}) = LKd^2,$$ where $S_2(\alpha^-) = \sum_{\ell=1}^{L} c_\ell(\alpha^-)^2$.

Substituting in $W_{\mathrm{end}}(\bar{d})$, we can simplify this to: $$\Big[ K S_2(\alpha^-) - \mathbf{1}\{\bar{d}>d\}\,(3+E) \Big] \bar{d}^2 + \Big[ \mathbf{1}\{\bar{d}>d\}\,d(3+E) \Big] \bar{d} - LKd^2 = 0.$$ Because the coefficients depend on the piecewise indicator $\mathbf{1}\{\bar{d}>d\}$, we solve this by assuming a state for the indicator (either $0$ or $1$), applying the standard quadratic formula to find the positive root, and selecting the root that is self-consistent with our assumption. This expresses the valid endpoint width as a function of $\alpha^-$, which we denote as $\bar{d}_{\alpha^-}$. The bottleneck width for the given $\alpha^-$ is then: $$b(\alpha^-) = \bar{d}_{\alpha^-}(\alpha^-)^{\ell^*-1}.$$
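As a worked expansion of this case analysis (derived only from the equations above, under the assumption $N_m = 3$ so that $K = 4 + 3E > 3 + E$), the two indicator states give closed forms. If $\mathbf{1}\{\bar{d}>d\}=0$, the linear term vanishes and $$\bar{d} = d\sqrt{\frac{L}{S_2(\alpha^-)}}.$$ If $\mathbf{1}\{\bar{d}>d\}=1$, writing $a = K S_2(\alpha^-) - (3+E)$ and $b = d(3+E)$, the positive root is $$\bar{d} = \frac{-b + \sqrt{b^2 + 4aLKd^2}}{2a},$$ where $a > 0$ because $S_2(\alpha^-) \geq c_1^2 = 1$. Since $c_\ell(\alpha^-) \leq 1$ for every $\ell$, we have $S_2(\alpha^-) \leq L$ with equality only at $\alpha^- = 1$, so the first branch is self-consistent only for the uniform schedule, where it returns $\bar{d} = d$. For $\alpha^- < 1$, evaluating the left-hand side of the second-branch quadratic at $\bar{d} = d$ gives $K d^2 (S_2(\alpha^-) - L) < 0$, so its positive root exceeds $d$ and the second branch is the self-consistent one.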

We use a 1D numerical solver over $\alpha^- \in (0,1]$ to solve for $b(\alpha^-)=d_{\ell^*}$, where $d_{\ell^*}$ is the desired bottleneck dimension. Finally, the continuous widths are rounded to the nearest multiple of the attention head dimension $Q$ to ensure compatibility: $$\widehat{d}_\ell = \mathrm{round}\!\left(\frac{d_\ell}{Q}\right)Q.$$
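The full procedure in this appendix can be sketched end to end as below. This is an illustrative reimplementation under stated assumptions, not the code used for our experiments: it uses plain bisection as the 1D solver, assumes $1 < \ell^* < L$ and a target bottleneck smaller than $d$, and all function and parameter names (`width_factors`, `endpoint_width`, `solve_widths`, `multiple`) are hypothetical. Note that Python's `round` resolves exact ties to the nearest even integer, which may differ from other rounding conventions.

```python
import math


def width_factors(alpha_minus, num_layers, bottleneck_layer):
    """Return [c_1, ..., c_L] with d_l = d_bar * c_l (1-indexed layers)."""
    exponent = -(bottleneck_layer - 1) / (num_layers - bottleneck_layer)
    alpha_plus = alpha_minus ** exponent
    factors = []
    for layer in range(1, num_layers + 1):
        if layer <= bottleneck_layer:
            factors.append(alpha_minus ** (layer - 1))
        else:
            narrowest = alpha_minus ** (bottleneck_layer - 1)
            factors.append(narrowest * alpha_plus ** (layer - bottleneck_layer))
    return factors


def endpoint_width(alpha_minus, num_layers, bottleneck_layer, d, E=4, n_mlp=3):
    """Parameter-matched endpoint width d_bar for a given alpha_minus."""
    K = 4 + n_mlp * E
    s2 = sum(c * c for c in width_factors(alpha_minus, num_layers, bottleneck_layer))
    # Branch with indicator = 0: valid only if the result does not exceed d.
    d_bar = d * math.sqrt(num_layers / s2)
    if d_bar <= d:
        return d_bar
    # Branch with indicator = 1: positive root of the quadratic in d_bar.
    a = K * s2 - (3 + E)
    b = d * (3 + E)
    c = -num_layers * K * d * d
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


def solve_widths(num_layers, bottleneck_layer, d, target_bottleneck, multiple, iters=200):
    """Bisect over alpha_minus so that the bottleneck width hits the target."""

    def gap(alpha):
        d_bar = endpoint_width(alpha, num_layers, bottleneck_layer, d)
        return d_bar * alpha ** (bottleneck_layer - 1) - target_bottleneck

    lo, hi = 1e-3, 1.0  # bracket is an assumption of this sketch
    if gap(lo) > 0 or gap(hi) < 0:
        raise ValueError("target bottleneck is not bracketed on (0, 1]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    d_bar = endpoint_width(alpha, num_layers, bottleneck_layer, d)
    widths = [d_bar * c for c in width_factors(alpha, num_layers, bottleneck_layer)]
    rounded = [round(w / multiple) * multiple for w in widths]
    return alpha, widths, rounded


# Hypothetical configuration, chosen only for illustration.
alpha, widths, rounded = solve_widths(
    num_layers=12, bottleneck_layer=6, d=1024, target_bottleneck=512, multiple=32
)
```

Rounding is applied last, so the rounded widths match the baseline parameter count only approximately, whereas the continuous widths satisfy the matching equation up to solver tolerance.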

# Dataset Statistics {#sec:dataset-stats}

In this section, we report the statistics of the datasets we used in our evaluation in Table 3.

  **Task**         **Domain**                        **Split / Instances** **Metric**
  ---------------- ------------------------------- ----------------------- -------------------
  OpenBookQA       science QA                                          500 `acc_norm`
  PIQA             physical commonsense                              1,838 `acc_norm`
  SciQ             science QA                                        1,000 `acc_norm`
  ARC-Easy         grade-school science QA                           2,376 `acc_norm`
  ARC-Challenge    difficult science QA                              1,172 `acc_norm`
  BoolQ            yes/no reading comprehension                      3,270 `acc`
  COPA             causal commonsense                                  100 `acc`
  HellaSwag        commonsense completion                           10,042 `acc_norm`
  WinoGrande       pronoun/coreference reasoning                     1,267 `acc`
  RACE             reading comprehension                             1,045 `acc`
  WikiText         language modeling                          62 documents perplexity
  LAMBADA OpenAI   long-context word prediction                      5,153 `acc`, perplexity

  : Task evaluation configurations and metrics.


# Additional Results {#sec:additional-results}

```{=latex}
\begin{wrapfigure}[22]{R}{6.2cm} %

    \includegraphics[width=\linewidth]{figures/activation_PR.pdf}
    \caption{The Participation Ratio (PR; \S\ref{sec:analysis-activation}) of MLP activations in the 2B ><former vs. the 2B constant-width transformer. We show both the raw PR and the normalized PR by the layer width. \textbf{><former has a higher PR in the middle layers, corresponding to more even usage of the activation dimensions in those layers.}}
    \label{fig:activation-PR}
\end{wrapfigure}
```
The activation analysis in the main text shows that ><former achieves better activation density, but it does not account for activation magnitude. Large language models frequently develop severe outlier dimensions, which can render the remaining active dimensions computationally insignificant. To evaluate this, we compute the energy Participation Ratio (PR) over the MLP activations [@LITWINKUMAR20171153; @2jt7-c8cq]. Let $a_{t,i}$ denote the activation of dimension $i$ for token $t$ and $e_i = \sum_t a_{t,i}^2$ denote the total energy of dimension $i$ across all tokens $t$. The effective number of utilized dimensions is given by $N_{\text{eff}} = (\sum_i e_i)^2 / \sum_i e_i^2$. Intuitively, $N_{\text{eff}}$ acts as a continuous measure of how evenly energy is spread across dimensions: if a single outlier dimension holds all the energy, $N_{\text{eff}}$ collapses to 1, regardless of the actual width $d_\ell$. Conversely, if energy is distributed uniformly across all coordinates, $N_{\text{eff}}$ reaches its theoretical maximum of $d_\ell$. Thus, the width-normalized ratio $N_{\text{eff}}/d_\ell$ measures the fraction of effectively utilized dimensions. We report both the absolute and normalized PR in the PR figure.
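The PR computation can be sketched as follows. This is a minimal illustration of the formula above on nested Python lists (rows are tokens, columns are activation dimensions), not our analysis code; the function name `participation_ratio` is hypothetical.

```python
def participation_ratio(activations):
    """Energy participation ratio N_eff = (sum_i e_i)^2 / sum_i e_i^2.

    activations[t][i] is the activation of dimension i for token t.
    Returns (N_eff, N_eff / width).
    """
    width = len(activations[0])
    # e_i: total energy of dimension i across all tokens.
    energy = [sum(row[i] ** 2 for row in activations) for i in range(width)]
    total = sum(energy)
    squared = sum(e * e for e in energy)
    if squared == 0.0:
        raise ValueError("all activations are zero; PR is undefined")
    n_eff = total * total / squared
    return n_eff, n_eff / width


# One outlier dimension carries all the energy: N_eff collapses to 1.
outlier_pr = participation_ratio([[3.0, 0.0, 0.0], [1.0, 0.0, 0.0]])

# Energy is spread evenly over three dimensions: N_eff equals the width.
even_pr = participation_ratio([[1.0, 1.0, 1.0], [1.0, -1.0, 1.0]])
```

The two toy inputs reproduce the limiting cases described above: the first yields $N_{\text{eff}} = 1$ and the second yields $N_{\text{eff}} = d_\ell = 3$.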

The results reveal a distinction between ><former and the constant-width transformer. For the baseline, its width-normalized energy utilization collapses to near zero ($<5\%$) by around layer 10. In contrast, ><former more evenly distributes energy utilization in the middle layers by maintaining an absolute PR of roughly 1,000 effective dimensions, yielding a richer representation manifold. By restricting parameter availability, the bottleneck in ><former may act as a structural regularizer that encourages the network to pack a denser representation into the available capacity.

[^1]: The intermediate MLP dimension is typically a fixed multiple of the model dimension (though see [@ikeda2025layerwise]).

[^2]: A geometric layer width schedule outperforms an arithmetic one in preliminary experiments.

[^3]: The residual connection in the final layer MLP is truncated accordingly.

[^4]: In practice, adjusting for the sizes of the input and output embeddings (see above; formalized in the parameter-matching appendix) introduces a minor parameter correction. Nevertheless, this $O(1)$ boundary effect is dwarfed by the $O(L)$ bulk layer parameters, so these theoretical results hold under meaningful width schedules.

[^5]: As shown in the proof, we cannot match the parameter count and the FLOP count at the same time.

[^6]: After solving for the layer widths, we round each width post hoc to the nearest multiple of 32, which is the number of attention heads we use (16) times 2 (for RoPE [@10.1016/j.neucom.2023.127063]).

[^7]: <https://github.com/openai/tiktoken>

[^8]: Because we do not repeat any data, this is in expectation the same as held-out loss.

[^9]: Anecdotally, our initial intuition was to pursue a $\Diamond$-shaped model, increasing the computation in middle layers which are often associated with semantic computations [@tenney-etal-2019-bert]. We nevertheless proceed with $\times$-shaped models due to these empirical results.

[^10]: We also tried training a projection to predict the entire new layer representation, not just the *extra* dimensions, but this was empirically unstable and diverged during training.
