><former / Variable-Width Transformers
Summary
><former is a variable-width decoder-only Transformer architecture introduced in the paper Variable-Width Transformers. It keeps early and late layers wide, narrows the middle layers, and lets inactive residual-stream coordinates bypass narrow layers through a parameter-free carry-forward mechanism.
The architecture matters because it turns hidden width into a budgeted design dimension. Instead of scaling only total parameters, depth, token count, or attention sparsity, it reallocates full block width across depth while matching the parameter count of a uniform baseline.
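The parameter-matching idea can be sketched numerically. The snippet below is a minimal illustration, not the paper's procedure: it assumes each layer's dense parameter count scales with the square of its width (ignoring embeddings, norms, and biases). The helper `match_parameters`, the 8-layer profile, and `D_UNIFORM = 1024` are all hypothetical. It rescales a relative width profile so that the sum of squared widths equals that of a uniform model, then compares mean widths.

```python
import math

D_UNIFORM = 1024  # width of the hypothetical uniform baseline


def match_parameters(relative_widths, d_uniform):
    """Scale a relative width profile so that sum(d_l^2) == L * d_uniform^2.

    Assumption: per-layer dense parameters are proportional to d_l^2.
    """
    num_layers = len(relative_widths)
    mean_sq = sum(r * r for r in relative_widths) / num_layers
    scale = d_uniform / math.sqrt(mean_sq)
    return [scale * r for r in relative_widths]


# Hypothetical profile: wide ends, narrow middle (values are illustrative only).
profile = [1.0, 0.8, 0.6, 0.4, 0.3, 0.3, 0.65, 1.0]
widths = match_parameters(profile, D_UNIFORM)

# Ratio of (assumed) dense parameters to the uniform baseline: 1.0 by construction.
param_ratio = sum(d * d for d in widths) / (len(widths) * D_UNIFORM ** 2)

# Ratio of mean layer width to the uniform width: below 1.0 for any
# non-uniform profile, because a mean never exceeds the root mean square.
mean_width_ratio = sum(widths) / (len(widths) * D_UNIFORM)

print(f"param ratio:      {param_ratio:.3f}")
print(f"mean width ratio: {mean_width_ratio:.3f}")
print(f"widest layer:     {max(widths):.0f}")
```

Under these assumptions, the matched profile has a lower mean width than the uniform baseline while its widest layers exceed the baseline width. Quantities that scale linearly with layer width, such as a KV cache whose key/value dimension tracks the layer width, shrink with the mean width.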
Architecture Contract
- Shape: ><-shaped (bowtie) layer-width schedule, with a middle bottleneck.
- Default paper recipe: bottleneck location around `0.75L` and bottleneck width around `0.3d` after small-scale sweeps.
- Residual interface: fixed maximum-width residual stream; each layer reads/writes only its active slice.
- Expansion method: carry forward inactive coordinates from the most recent layer that processed them; zero padding and learned projection alternatives perform worse in the reported ablation.
- Budget story: parameter-matched models have lower average layer width, which lowers attention FLOPs and KV-cache size even when dense projection parameters are matched.
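The contract above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the repository's implementation: the piecewise-linear interpolation between the ends and the bottleneck, the choice of a prefix `x[:, :d]` as the active slice, the reading of `0.3d` as a fraction of the maximum width, and the names `bowtie_widths`, `make_toy_block`, and `forward` are all hypothetical. Each toy block stands in for a full attention + MLP block.

```python
import numpy as np


def bowtie_widths(num_layers, d_max, loc=0.75, ratio=0.3):
    """Widths that are d_max at both ends and ratio * d_max at relative depth loc.

    Assumption: linear interpolation between the three anchor points.
    """
    depth = np.arange(num_layers) / (num_layers - 1)  # relative depth in [0, 1]
    rel = np.interp(depth, [0.0, loc, 1.0], [1.0, ratio, 1.0])
    return np.maximum(1, np.round(rel * d_max)).astype(int)


def make_toy_block(d, rng):
    """Stand-in for a width-d Transformer block: a random d x d map plus tanh."""
    weight = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
    return lambda h: np.tanh(h @ weight)


def forward(x, widths, blocks):
    """Run blocks over a fixed d_max-wide residual stream.

    Layer l reads and writes only x[:, :d_l]. Coordinates at index >= d_l are
    not touched, so they are carried forward unchanged with no parameters.
    Returns the residual stream before layer 0 and after every layer.
    """
    states = [x.copy()]
    for d, block in zip(widths, blocks):
        x = x.copy()
        x[:, :d] += block(x[:, :d])  # residual update on the active slice only
        states.append(x)
    return states


rng = np.random.default_rng(0)
num_layers, d_max, tokens = 9, 64, 4
widths = bowtie_widths(num_layers, d_max)
blocks = [make_toy_block(d, rng) for d in widths]
states = forward(rng.normal(size=(tokens, d_max)), widths, blocks)

b = int(widths.argmin())  # bottleneck layer index
d_b = int(widths[b])      # bottleneck width

# Inactive coordinates pass through the bottleneck layer bit-for-bit.
bypass_unchanged = np.array_equal(states[b][:, d_b:], states[b + 1][:, d_b:])
# Active coordinates are updated by the block.
active_changed = not np.allclose(states[b][:, :d_b], states[b + 1][:, :d_b])

print("widths:", widths.tolist())
print("bypass unchanged:", bypass_unchanged, "| active changed:", active_changed)
```

Because the stream keeps its maximum width throughout, a later, wider layer that re-activates a coordinate sees the value left by the most recent layer that processed it. A zero-padding variant would instead overwrite `x[:, d:]` with zeros whenever the stream widens again, discarding that information.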
Official Artifacts
- Official code: ZhaofengWu/variable-width-transformers
- Preprint: arXiv 2606.18246
- Local README snapshot: `papers/variable-width-transformers-2026/github-readme-variable-width-transformers.md`
The repository exposes training code, `WidthVaryingConfig`, dense configs from 200M to 2B, and a 3B-total/1B-active MoE config. No model weights were released during the ingest pass.
Relevance To This Wiki
For time-series and world-model work, ><former is a strong static baseline for Hierarchical Modeling with a Fixed FLOPs Budget. It suggests that a model can profit from nonuniform capacity allocation before introducing a fully learned router. Any learned global FLOPs controller should therefore compare against a static variable-width schedule, not only against a uniform Transformer.
The transfer is still limited: ><former changes hidden width across depth, not time patching, channel grouping, event-window selection, or action-conditioned rollout compute. Its carry-forward residual path is nonetheless an instructive return-path primitive because it preserves some information through a bottleneck without learned projections.
Caveats
- Evidence is from decoder-only language models, not numeric time-series models.
- The compute schedule is static and manually selected after sweeps.
- Real wall-clock speedups need heterogeneous-shape kernels, memory-layout support, and parallelism tests.
- The representation-collapse results are promising but do not show preservation of rare events, dense numeric detail, exogenous variables, or action history.