><former / Variable-Width Transformers

Summary

><former is a variable-width, decoder-only Transformer architecture introduced in the preprint Variable-Width Transformers. It keeps the early and late layers wide, narrows the middle layers, and lets inactive residual-stream coordinates bypass the narrow layers through a parameter-free carry-forward mechanism.

The architecture matters because it treats hidden width as a budgeted architectural dimension. Rather than scaling only total parameters, depth, token count, or attention sparsity, it reallocates full block width across depth while matching the parameter count of a uniform-width baseline.
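The parameter-matching idea can be illustrated with a little arithmetic. The sketch below is not the paper's recipe: it assumes a hypothetical piecewise-linear width profile (wide at both ends, narrowest at a chosen depth fraction) and a rough cost model of about 12·d² dense parameters per block (4·d² for attention projections plus 8·d² for a 4x MLP). The function names, the profile shape, and the cost model are all assumptions made for illustration. Because the assumed per-block cost grows quadratically with width, matching the total parameter count of a uniform baseline forces the mean layer width below the uniform width.

```python
import numpy as np

def relative_widths(num_layers, bottleneck_pos=0.75, bottleneck_ratio=0.3):
    """Hypothetical bowtie profile: 1.0 at both ends, bottleneck_ratio at bottleneck_pos.

    The piecewise-linear shape is an assumption; only the general
    wide-narrow-wide pattern comes from the text above.
    """
    depth = np.linspace(0.0, 1.0, num_layers)
    left = 1.0 - (1.0 - bottleneck_ratio) * depth / bottleneck_pos
    right = bottleneck_ratio + (1.0 - bottleneck_ratio) * (
        depth - bottleneck_pos
    ) / (1.0 - bottleneck_pos)
    return np.where(depth <= bottleneck_pos, left, right)

def block_params(width):
    """Assumed cost model: ~4*d^2 (attention) + ~8*d^2 (4x MLP) per block."""
    return 12.0 * np.asarray(width, dtype=float) ** 2

def parameter_matched_widths(num_layers, uniform_width, **profile_kwargs):
    """Scale the relative profile so total block parameters match a uniform stack."""
    rel = relative_widths(num_layers, **profile_kwargs)
    target = num_layers * block_params(uniform_width)
    # Parameters scale with width squared, so the scale factor is a square root.
    max_width = np.sqrt(target / block_params(rel).sum())
    return rel * max_width

num_layers, uniform_width = 24, 1024
widths = parameter_matched_widths(num_layers, uniform_width)
uniform_total = num_layers * block_params(uniform_width)
variable_total = block_params(widths).sum()

print("max width :", round(float(widths.max()), 1))
print("min width :", round(float(widths.min()), 1))
print("mean width:", round(float(widths.mean()), 1), "vs uniform", uniform_width)
print("params match:", bool(np.isclose(uniform_total, variable_total)))
```

The widths here are left as real numbers to keep the arithmetic visible; an actual configuration would presumably round them to values compatible with the attention head size.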

Architecture Contract

Shape: $\times$-shaped / bowtie layer-width schedule, with a middle bottleneck.
Default paper recipe: bottleneck location around 0.75L and bottleneck width around 0.3d after small-scale sweeps.
Residual interface: fixed maximum-width residual stream; each layer reads/writes only its active slice.
Expansion method: carry forward inactive coordinates from the most recent layer that processed them; zero padding and learned projection alternatives perform worse in the reported ablation.
Budget story: parameter-matched models have lower average layer width, which lowers attention FLOPs and KV-cache size even when dense projection parameters are matched.
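The residual interface and the carry-forward expansion described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: it assumes the active slice of a layer is the leading `width` coordinates of the residual stream, it uses a toy `layer_fn` in place of a real attention/MLP block, and the names `run_variable_width_stack` and `layer_fn` are hypothetical.

```python
import numpy as np

def run_variable_width_stack(x, widths, layer_fn):
    """Apply residual layers of varying width to a fixed-width residual stream.

    x        : array of shape (tokens, d_max), the full-width residual stream.
    widths   : active width of each layer, each in (0, d_max].
    layer_fn : callable (layer_index, active_slice) -> update of the same shape.

    Assumption: a layer's active slice is the first `width` coordinates.
    Coordinates outside the slice are left untouched, so a later, wider layer
    sees the values written by the most recent layer that processed them.
    """
    x = np.array(x, dtype=float)  # work on a copy
    d_max = x.shape[-1]
    snapshots = []
    for i, w in enumerate(widths):
        if not 0 < w <= d_max:
            raise ValueError(f"layer {i}: width {w} outside (0, {d_max}]")
        # Read and write only the active slice; no projection, no parameters.
        x[:, :w] += layer_fn(i, x[:, :w])
        snapshots.append(x.copy())
    return x, snapshots

# Toy layer: adds 1 to every active coordinate, so each coordinate's final
# value counts how many layers processed it.
toy_layer = lambda i, h: np.ones_like(h)

stream = np.zeros((2, 4))
final, snapshots = run_variable_width_stack(stream, widths=[4, 2, 4], layer_fn=toy_layer)

print(snapshots[1][0])  # state after the narrow layer
print(final[0])         # state after the final wide layer
```

After the narrow middle layer, the two inactive coordinates still hold the value written by the first layer rather than being reset. A zero-padding expansion would instead hand the next wide layer zeros in those positions, which is one of the alternatives the reported ablation found to perform worse.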

Official Artifacts

Official code: ZhaofengWu/variable-width-transformers
Preprint: arXiv 2606.18246
Local README snapshot: papers/variable-width-transformers-2026/github-readme-variable-width-transformers.md

The repository exposes training code, WidthVaryingConfig, dense configs from 200M to 2B, and a 3B-total/1B-active MoE config. No model weights were released during the ingest pass.

Relevance To This Wiki

For time-series and world-model work, ><former is a strong static baseline for Hierarchical Modeling with a Fixed FLOPs Budget. It suggests that a model can profit from nonuniform capacity allocation before introducing a fully learned router. Any learned global FLOPs controller should therefore compare against a static variable-width schedule, not only against a uniform Transformer.

The transfer is still limited: ><former changes hidden width across depth, not time patching, channel grouping, event-window selection, or action-conditioned rollout compute. Its carry-forward residual path is nonetheless an instructive return-path primitive because it preserves some information through a bottleneck without learned projections.

Caveats

Evidence is from decoder-only language models, not numeric time-series models.
The compute schedule is static and manually selected after sweeps.
Real wall-clock speedups need heterogeneous-shape kernels, memory-layout support, and parallelism tests.
The representation-collapse results are promising but do not show preservation of rare events, dense numeric detail, exogenous variables, or action history.
