Variable-Width Transformers
Source
- Raw Markdown: paper_variable-width-transformers-2026.md
- PDF: paper_variable-width-transformers-2026.pdf
- Preprint: arXiv 2606.18246
- Official code: ZhaofengWu/variable-width-transformers
- Official X thread: Zhaofeng Wu announcement
- Gonzo ML discussion: Telegram post 5611
- Review: ArXivIQ note
Local social and artifact snapshots are stored as raw provenance under papers/variable-width-transformers-2026/telegram-post-gonzo_ML-5611.md, papers/variable-width-transformers-2026/x-post-zhaofeng_wu-2067694654607507646.md, and papers/variable-width-transformers-2026/github-readme-variable-width-transformers.md.
Status And Credibility
This is a current arXiv preprint: v1 was submitted on 2026-06-16 in cs.CL. The authors list MIT and MIT-IBM Watson AI Lab affiliations, and the official repository releases model code, dense and MoE training configs, and the variable-width implementation. It is credible evidence for language-model architecture and scaling, with two caveats: the preprint is fresh and not yet peer-reviewed, and no model checkpoints are released.
A verified author X announcement was found from Zhaofeng Wu on 2026-06-18. Public X discussion, including Grigory Sapunov / ArXivIQ commentary, is treated as community/review context rather than official source evidence.
Core Claim
Variable-Width Transformers challenge the standard assumption that every Transformer layer should use the same hidden width. The paper proposes ><former: a decoder-only Transformer whose early and late layers are wide while the middle layers are narrow, creating a bowtie-shaped capacity profile across depth.
The key implementation trick is a fixed global residual stream. A layer reads and writes only the coordinates assigned to its current width. Coordinates that are inactive in a narrow layer bypass that layer and are carried forward until a later, wider layer uses them again. This parameter-free carry-forward path avoids learned projection layers between widths. In the paper’s 500M ablation, carry-forward reaches loss 3.099, compared with 3.124 for zero padding, 3.150 for learned projection, and 3.138 for the constant-width baseline.
flowchart LR
  E[wide early layers] --> B[narrow middle bottleneck]
  B --> L[wide late layers]
  E -. inactive residual coordinates carried forward .-> L
  B -->|active slice only| Mix[attention + MLP compute]
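The carry-forward idea can be illustrated with a minimal NumPy sketch. This is not the official implementation: `toy_sublayer` is a hypothetical stand-in for attention plus MLP, and the choice of a prefix slice `x[:, :w]` as the active coordinates is an assumption made for illustration, since this note does not record how the paper assigns coordinates to a layer.

```python
import numpy as np

def toy_sublayer(h, weight):
    """Hypothetical stand-in for attention + MLP on the active slice."""
    return np.tanh(h @ weight)

def variable_width_forward(x, widths, weights):
    """Run layers of differing width over one fixed-width residual stream.

    x:       (tokens, d_global) residual stream
    widths:  active width of each layer, each <= d_global
    weights: one (w, w) matrix per layer, matching that layer's width
    """
    x = x.copy()
    for w, weight in zip(widths, weights):
        active = x[:, :w]  # assumption: the active coordinates are a prefix slice
        # Residual update written back to the active slice only.
        x[:, :w] = active + toy_sublayer(active, weight)
        # x[:, w:] is never touched, so inactive coordinates are carried forward.
    return x

rng = np.random.default_rng(0)
d_global = 8
widths = [8, 4, 8]  # wide -> narrow -> wide
weights = [rng.normal(scale=0.1, size=(w, w)) for w in widths]
x0 = rng.normal(size=(3, d_global))

x1 = variable_width_forward(x0, widths[:1], weights[:1])    # after the wide layer
x2 = variable_width_forward(x1, widths[1:2], weights[1:2])  # after the narrow layer
x3 = variable_width_forward(x2, widths[2:], weights[2:])    # after the final wide layer
```

In this sketch, coordinates 4 through 7 are identical in `x1` and `x2`, which is the parameter-free bypass described above: the narrow layer neither reads nor overwrites them, and the final wide layer sees them again without any projection.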
Evidence And Results
- The authors pretrain dense decoder-only LMs at 200M, 500M, 1B, and 2B parameters, plus a 3B-total/1B-active MoE model, on DCLM with length-4096 sequences.
- At the tested schedule, ><former improves final training loss at every reported scale while reducing measured pretraining PFLOP/s-days by about 2.5% to 4.6% and average layer width by about 10% to 11% versus parameter-matched constant-width baselines.
- Fitted loss-vs-compute scaling curves estimate that ><former can match the 2B constant-width Transformer’s loss using 77.8% of its FLOPs and 85.1% of its average layer width, which is the paper’s headline roughly 22% FLOPs and roughly 15% KV-cache-width reduction claim.
- On zero-shot language-model evaluations, the 2B ><former improves average NLU accuracy and perplexity metrics; the MoE ><former has mixed NLU accuracy but improves both reported perplexity metrics.
- Representation analyses report better MLP activation utilization and higher middle-to-late residual-stream matrix entropy, which the authors interpret as mitigating the constant-width baseline’s middle-layer “compression valley.”
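The headline percentages follow directly from the fitted fractions quoted above. The short calculation below only restates that arithmetic; the variable names are illustrative, and it keeps the loss-matched (fitted) numbers separate from the measured 2.5% to 4.6% savings at the tested schedule.

```python
# Fitted, loss-matched fractions relative to the 2B constant-width baseline
# (values as quoted in this note).
flops_fraction = 0.778
width_fraction = 0.851

flops_reduction = 1.0 - flops_fraction
width_reduction = 1.0 - width_fraction

print(f"loss-matched FLOPs reduction: {flops_reduction:.1%}")  # 22.2%
print(f"average-width reduction:      {width_reduction:.1%}")  # 14.9%

# Measured savings at the tested schedule are a different quantity:
# they compare parameter-matched models, not loss-matched ones.
measured_flops_reduction_range = (0.025, 0.046)
```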
Why It Matters
This source is directly relevant to the wiki’s fixed-budget hierarchy thread because it shows that nonuniform capacity across depth is not only a theoretical design freedom: a simple static bowtie schedule can beat uniform-width Transformers under matched parameters and lower average layer width. The important update is not that it solves learned adaptive compression, but that it gives a strong static baseline for any learned fixed-FLOPs hierarchy.
For the Hierarchical Modeling with a Fixed FLOPs Budget idea, ><former closes part of the design space: layer width can be made a budgeted resource, and a return path for inactive coordinates can preserve information through a bottleneck without learned projections. It does not close the core research target: the compression schedule is still a manually swept static profile, not a data-dependent router trained under a global expected-FLOPs constraint.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute allocation | partially closes | Shows a fixed-budget, parameter-matched nonuniform layer-width schedule can improve language-model loss while reducing fitted FLOPs and average layer width. | Needs a learned global budget controller, input-dependent routing, hard serving budgets, and numeric time-series or action-conditioned world-model benchmarks. |
| Hierarchical modeling and compression | adjacent | Builds a width bottleneck across depth with a carry-forward residual path, making hidden-width hierarchy a concrete architecture knob. | Does not choose temporal patches, channel groups, event windows, or modality fragments; no fine-resolution time-series return-path evaluation. |
| Representation quality | warning | The paper’s compression-valley analysis suggests a structural bottleneck can mitigate mid-layer collapse, but it is optimized for LM loss. | Need preservation probes for rare regimes, dense numeric detail, channel-specific deviations, exogenous variables, and action history. |
| Serving cost | warning | Lower average layer width implies a smaller KV cache and less I/O, and fitted curves imply roughly 22% fewer FLOPs at matched loss. | The paper says current GPU/TPU infrastructure is optimized for uniform shapes; realized latency needs custom kernels, fusion, and parallelism tests. |
Limitations
- The evidence is language modeling, not numeric time series, event streams, high-channel multivariate forecasting, or action-conditioned world models.
- The schedule is static: the paper sweeps bottleneck location and width, then uses a ratio recipe such as 0.75L bottleneck location and 0.3d bottleneck width.
- The system is not a learned per-input compute allocator and does not optimize a global expected-FLOPs Lagrangian during pretraining.
- The measured table-level FLOPs reduction at the tested schedule is smaller than the headline fitted loss-matched reduction; both numbers are useful but should not be conflated.
- Real wall-clock speedups are not demonstrated. Heterogeneous layer widths complicate kernels, memory layouts, tensor/pipeline parallelism, and batching.
- The official repository releases code and configs, but no trained model weights.
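To make the static ratio recipe concrete, the sketch below builds a per-layer width list from a bottleneck location ratio and a bottleneck width ratio. The linear interpolation between the wide ends and the bottleneck is an assumption made for illustration, as is the function name `bowtie_widths`; this note does not record the exact shape of the paper's schedule, and real configs would likely round widths to hardware-friendly multiples.

```python
def bowtie_widths(num_layers, d_max, loc_ratio=0.75, width_ratio=0.3):
    """Hypothetical static bowtie schedule.

    The narrowest layer sits at roughly loc_ratio of the depth and has width
    width_ratio * d_max. Widths shrink linearly toward it and grow linearly
    after it (assumption: linear interpolation, not taken from the paper).
    """
    last = num_layers - 1
    bottleneck = round(loc_ratio * last)
    d_min = round(width_ratio * d_max)
    widths = []
    for i in range(num_layers):
        if i <= bottleneck:
            # t goes 0 -> 1 from the first layer to the bottleneck
            t = i / bottleneck if bottleneck > 0 else 1.0
        else:
            # t goes 1 -> 0 from the bottleneck to the last layer
            t = (last - i) / (last - bottleneck)
        widths.append(round(d_max + t * (d_min - d_max)))
    return widths

widths = bowtie_widths(num_layers=9, d_max=1000)
average_width = sum(widths) / len(widths)
```

Because the schedule is fixed before training, every input sees the same widths at every layer. That is the sense in which it is a static profile rather than a learned, data-dependent allocator.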
Links Into The Wiki
- Variable-Width Transformers
- Hierarchical Modeling with a Fixed FLOPs Budget
- Time-Series Scaling And Efficiency
- Representation Collapse
- Foundation Time-Series Model Research Agenda
- Compress & Attend Transformers
- Compute Optimal Tokenization
- Hyperloop Transformers
Open Questions
- Can a learned router beat a static ><former width schedule under the same expected training FLOPs, hard serving latency, and kernel implementation?
- Does the carry-forward residual path preserve rare but decision-relevant state when the inputs are multivariate time series or trajectories rather than text tokens?
- Should width, token count, channel-group count, and expert activation compete in one global budget, or should each axis get a separate controller?
- Can variable-width architectures be compiled into stable high-throughput kernels without losing the theoretical KV-cache and FLOPs gains to layout overhead?