Variable-Width Transformers

Source

Local social and artifact snapshots are stored as raw provenance under papers/variable-width-transformers-2026/telegram-post-gonzo_ML-5611.md, papers/variable-width-transformers-2026/x-post-zhaofeng_wu-2067694654607507646.md, and papers/variable-width-transformers-2026/github-readme-variable-width-transformers.md.

Status And Credibility

This is a current arXiv preprint: v1 was submitted on 2026-06-16 in cs.CL. The authors list MIT and MIT-IBM Watson AI Lab affiliations, and the official repository releases model code, dense and MoE training configs, and the variable-width implementation. It is credible, fresh preprint evidence on language-model architecture and scaling, but it has not been peer-reviewed and the authors have not released model checkpoints.

A verified announcement on X from author Zhaofeng Wu, dated 2026-06-18, was found. Public X discussion, including Grigory Sapunov / ArXivIQ commentary, is treated as community/review context rather than official source evidence.

Core Claim

Variable-Width Transformers challenge the standard assumption that every Transformer layer should use the same hidden width. The paper proposes ><former: a decoder-only Transformer whose early and late layers are wide while the middle layers are narrow, creating a ><-shaped (bowtie) capacity profile across depth.

The key implementation trick is a fixed global residual stream. A layer reads and writes only the coordinates assigned to its current width. Coordinates that are inactive in a narrow layer bypass that layer and are carried forward until a later wider layer uses them again. This parameter-free carry-forward path avoids learned projection layers between widths; the paper’s 500M ablation reports carry-forward loss 3.099, zero padding 3.124, learned projection 3.150, and constant-width baseline 3.138.

flowchart LR
    E[wide early layers] --> B[narrow middle bottleneck]
    B --> L[wide late layers]
    E -. inactive residual coordinates carried forward .-> L
    B -->|active slice only| Mix[attention + MLP compute]
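
The carry-forward path described above can be illustrated with a minimal pure-Python sketch. It is not the repository's implementation. The names `apply_layer`, `toy_block`, and `run_stack` are hypothetical, the toy block stands in for attention plus MLP, and the choice that a layer's active coordinates are the leading `width` entries of the residual stream is an assumption made only for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def apply_layer(residual: Sequence[float], width: int,
                block: Callable[[Vector], Vector]) -> Vector:
    """Apply one layer to a fixed-size residual stream.

    Assumption (not from the paper): the active coordinates of a layer
    are the first `width` entries of the stream.
    """
    if not 0 < width <= len(residual):
        raise ValueError("width must be between 1 and len(residual)")
    active = list(residual[:width])   # the slice this layer reads
    update = block(active)            # stand-in for attention + MLP
    if len(update) != width:
        raise ValueError("block must return one value per active coordinate")
    # Residual write on the active slice only.
    written = [x + u for x, u in zip(active, update)]
    # Inactive coordinates bypass the layer unchanged (carry-forward).
    return written + list(residual[width:])


def toy_block(active: Vector) -> Vector:
    """Hypothetical block: returns the mean of the active slice per coordinate."""
    mean = sum(active) / len(active)
    return [mean] * len(active)


def run_stack(residual: Sequence[float], widths: Sequence[int]) -> Vector:
    """Run a stack of layers whose widths may differ from layer to layer."""
    state = list(residual)
    for width in widths:
        state = apply_layer(state, width, toy_block)
    return state


# Wide -> narrow -> wide: the last two coordinates skip the narrow layer
# and are read again by the final wide layer.
final_state = run_stack([1.0, 2.0, 3.0, 4.0], widths=[4, 2, 4])
```

No projection matrix appears anywhere in the sketch: changing width is only a change of slice bounds, which is what makes the path parameter-free.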

Evidence And Results

  • The authors pretrain dense decoder-only LMs at 200M, 500M, 1B, and 2B parameters, plus a 3B-total/1B-active MoE model, on DCLM with length-4096 sequences.
  • At the tested schedule, ><former improves final training loss at every reported scale while reducing measured pretraining PFLOP/s-days by about 2.5% to 4.6% and average layer width by about 10% to 11% versus parameter-matched constant-width baselines.
  • Fitted loss-vs-compute scaling curves estimate that ><former can match the 2B constant-width Transformer’s loss using 77.8% of its FLOPs and 85.1% of its average layer width, which is the paper’s headline roughly 22% FLOPs and roughly 15% KV-cache-width reduction claim.
  • On zero-shot language-model evaluations, the 2B ><former improves average NLU accuracy and perplexity metrics; the MoE ><former has mixed NLU accuracy but improves both reported perplexity metrics.
  • Representation analyses report better MLP activation utilization and higher middle-to-late residual-stream matrix entropy, which the authors interpret as mitigating the constant-width baseline’s middle-layer “compression valley.”
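
The headline percentages follow from the fitted shares by simple subtraction. The short sketch below makes that conversion explicit; the helper name `reduction_percent` is hypothetical, and the inputs are the two fitted shares quoted in the list above.

```python
def reduction_percent(share_of_baseline: float) -> float:
    """Convert 'uses this share of the baseline' into a percent reduction."""
    return round((1.0 - share_of_baseline) * 100.0, 1)


# Fitted loss-matched shares versus the 2B constant-width baseline.
flops_reduction = reduction_percent(0.778)   # share of baseline FLOPs
width_reduction = reduction_percent(0.851)   # share of average layer width

print(f"FLOPs reduction: {flops_reduction}%")
print(f"Average-width reduction: {width_reduction}%")
```

These are fitted, loss-matched figures. They are a different quantity from the 2.5% to 4.6% PFLOP/s-days reduction measured at the tested schedule.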

Why It Matters

This source is directly relevant to the wiki’s fixed-budget hierarchy thread because it shows that nonuniform capacity across depth is not only a theoretical design freedom: a simple static bowtie schedule can beat uniform-width Transformers under matched parameters and lower average layer width. The important update is not that it solves learned adaptive compression, but that it gives a strong static baseline for any learned fixed-FLOPs hierarchy.

For the Hierarchical Modeling with a Fixed FLOPs Budget idea, ><former closes part of the design space: layer width can be made a budgeted resource, and a return path for inactive coordinates can preserve information through a bottleneck without learned projections. It does not close the core research target: the compression schedule is still a manually swept static profile, not a data-dependent router trained under a global expected-FLOPs constraint.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute allocation | partially closes | Shows a fixed-budget, parameter-matched nonuniform layer-width schedule can improve language-model loss while reducing fitted FLOPs and average layer width. | Needs a learned global budget controller, input-dependent routing, hard serving budgets, and numeric time-series or action-conditioned world-model benchmarks. |
| Hierarchical modeling and compression | adjacent | Builds a width bottleneck across depth with a carry-forward residual path, making hidden-width hierarchy a concrete architecture knob. | Does not choose temporal patches, channel groups, event windows, or modality fragments; no fine-resolution time-series return-path evaluation. |
| Representation quality | warning | The paper’s compression-valley analysis suggests a structural bottleneck can mitigate mid-layer collapse, but it is optimized for LM loss. | Need preservation probes for rare regimes, dense numeric detail, channel-specific deviations, exogenous variables, and action history. |
| Serving cost | warning | Average layer width implies lower KV cache and I/O, and fitted curves imply 22% lower loss-matched FLOPs. | The paper says current GPU/TPU infrastructure is optimized for uniform shapes; realized latency needs custom kernels, fusion, and parallelism tests. |

Limitations

  • The evidence is language modeling, not numeric time series, event streams, high-channel multivariate forecasting, or action-conditioned world models.
  • The schedule is static: the paper sweeps bottleneck location and width, then uses a ratio recipe such as 0.75L bottleneck location and 0.3d bottleneck width.
  • The system is not a learned per-input compute allocator and does not optimize a global expected-FLOPs Lagrangian during pretraining.
  • The measured table-level FLOPs reduction at the tested schedule is smaller than the headline fitted loss-matched reduction; both numbers are useful but should not be conflated.
  • Real wall-clock speedups are not demonstrated. Heterogeneous layer widths complicate kernels, memory layouts, tensor/pipeline parallelism, and batching.
  • The official repository releases code and configs, but no trained model weights.
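
To make the static ratio recipe concrete, the sketch below turns a bottleneck location ratio and a bottleneck width ratio into a per-layer width profile. It is an illustration only: the names `bowtie_shape` and `average_width` are hypothetical, widths are expressed relative to the widest layer, and the piecewise-linear taper is an assumption, since this note does not record how the paper interpolates between the wide ends and the bottleneck. The resulting averages should therefore not be compared with the paper's reported reductions, which are measured against parameter-matched baselines.

```python
from typing import List


def bowtie_shape(num_layers: int, loc_ratio: float = 0.75,
                 width_ratio: float = 0.3) -> List[float]:
    """Relative layer widths: 1.0 at both ends, `width_ratio` at the bottleneck.

    Assumption (not from the paper): widths taper linearly on both sides
    of the bottleneck layer.
    """
    if num_layers < 3:
        raise ValueError("need at least 3 layers for a bottleneck")
    last = num_layers - 1
    # Bottleneck index from the location ratio, kept strictly inside the stack.
    bottleneck = min(max(round(loc_ratio * last), 1), last - 1)
    shape = []
    for i in range(num_layers):
        if i <= bottleneck:
            # Narrow from 1.0 down to width_ratio.
            frac = i / bottleneck
            shape.append(1.0 - (1.0 - width_ratio) * frac)
        else:
            # Widen from width_ratio back up to 1.0.
            frac = (i - bottleneck) / (last - bottleneck)
            shape.append(width_ratio + (1.0 - width_ratio) * frac)
    return shape


def average_width(shape: List[float]) -> float:
    """Mean relative width across layers."""
    return sum(shape) / len(shape)


profile = bowtie_shape(num_layers=9)
print([round(w, 3) for w in profile])
print(round(average_width(profile), 3))
```

Because the profile is fixed before training, every input sees the same widths, which is the sense in which the schedule is static rather than a learned per-input allocator.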

Open Questions

  • Can a learned router beat a static ><former width schedule under the same expected training FLOPs, hard serving latency, and kernel implementation?
  • Does the carry-forward residual path preserve rare but decision-relevant state when the inputs are multivariate time series or trajectories rather than text tokens?
  • Should width, token count, channel-group count, and expert activation compete in one global budget, or should each axis get a separate controller?
  • Can variable-width architectures be compiled into stable high-throughput kernels without losing the theoretical KV-cache and FLOPs gains to layout overhead?