><former / Variable-Width Transformers

Summary

><former is a variable-width decoder-only Transformer architecture introduced in the paper Variable-Width Transformers. It keeps the early and late layers wide, narrows the middle layers, and lets inactive residual-stream coordinates bypass the narrow layers through a parameter-free carry-forward mechanism.

The architecture matters because it turns hidden width into a budgeted design dimension. Instead of scaling only total parameters, depth, token count, or attention sparsity, it reallocates full block width across depth while matching the parameter count of a uniform baseline.

Architecture Contract

  • Shape: ><-shaped (bowtie) layer-width schedule, with a middle bottleneck.
  • Default paper recipe: bottleneck location around 0.75L (about three quarters of the way through the depth) and bottleneck width around 0.3d, both selected after small-scale sweeps.
  • Residual interface: fixed maximum-width residual stream; each layer reads/writes only its active slice.
  • Expansion method: carry forward inactive coordinates from the most recent layer that processed them; zero padding and learned projection alternatives perform worse in the reported ablation.
  • Budget story: parameter-matched models have lower average layer width, which lowers attention FLOPs and KV-cache size even when dense projection parameters are matched.
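The contract above can be made concrete with a small sketch. It is not taken from the official repository, and every name in it (bowtie_widths, forward_carry, matched_uniform_width, relative_kv_cache) is hypothetical. It assumes that the schedule ramps linearly down to the bottleneck and back up, that d is the maximum width, that each layer's active slice is the leading prefix of the residual stream, that dense parameters per layer scale with the square of the layer width, and that KV-cache size scales linearly with layer width. The paper may make different choices on any of these points.

```python
from math import sqrt
from typing import Callable, List, Sequence

# A block maps (layer index, active slice) to an update of the same length.
Block = Callable[[int, List[float]], List[float]]


def bowtie_widths(num_layers: int, d_max: int,
                  bottleneck_pos: float = 0.75,
                  bottleneck_frac: float = 0.3) -> List[int]:
    """Illustrative wide -> narrow -> wide schedule (linear ramps are an assumption)."""
    if num_layers < 3:
        raise ValueError("need at least 3 layers to place a bottleneck")
    # Index of the narrowest layer, kept strictly inside the stack.
    b = min(max(round(bottleneck_pos * (num_layers - 1)), 1), num_layers - 2)
    d_min = max(1, round(bottleneck_frac * d_max))
    widths = []
    for i in range(num_layers):
        # Fraction of the way from the bottleneck width back up to full width.
        if i <= b:
            frac = (b - i) / b
        else:
            frac = (i - b) / (num_layers - 1 - b)
        widths.append(round(d_min + (d_max - d_min) * frac))
    return widths


def forward_carry(x: Sequence[float], widths: Sequence[int],
                  block: Block) -> List[float]:
    """Run blocks over a fixed maximum-width residual stream.

    Each layer reads and writes only stream[:w]. Coordinates at index >= w
    are left untouched, so they keep the value written by the most recent
    layer that processed them (parameter-free carry-forward).
    """
    if len(x) < max(widths):
        raise ValueError("residual stream is narrower than the widest layer")
    stream = list(x)
    for layer, w in enumerate(widths):
        update = block(layer, stream[:w])
        for j in range(w):
            stream[j] += update[j]  # residual update on the active slice only
    return stream


def matched_uniform_width(widths: Sequence[int]) -> float:
    """Width of a uniform stack with the same sum of w**2 (assumed parameter proxy)."""
    return sqrt(sum(w * w for w in widths) / len(widths))


def relative_kv_cache(widths: Sequence[int]) -> float:
    """Mean width divided by the parameter-matched uniform width.

    Under the assumption that KV-cache size is linear in layer width, this is
    the cache size relative to the matched uniform baseline. It is at most 1,
    with equality only when all widths are equal.
    """
    return (sum(widths) / len(widths)) / matched_uniform_width(widths)
```

Calling bowtie_widths(8, 10) gives a schedule that starts and ends at width 10 and bottoms out at width 3, and passing that schedule to forward_carry with any block shows that the top coordinates are modified only by the wide layers. The relative_kv_cache helper illustrates the budget story: when parameters are matched under the quadratic proxy, a non-uniform schedule always has a lower mean width than the uniform baseline.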

Official Artifacts

The repository exposes training code, WidthVaryingConfig, dense configs from 200M to 2B parameters, and a 3B-total/1B-active MoE config. No model weights had been released at the time of the ingest pass.

Relevance To This Wiki

For time-series and world-model work, ><former is a strong static baseline for Hierarchical Modeling with a Fixed FLOPs Budget. It suggests that a model can profit from nonuniform capacity allocation before introducing a fully learned router. Any learned global FLOPs controller should therefore compare against a static variable-width schedule, not only against a uniform Transformer.

The transfer is still limited: ><former changes hidden width across depth, not time patching, channel grouping, event-window selection, or action-conditioned rollout compute. Its carry-forward residual path is nonetheless an instructive return-path primitive because it preserves some information through a bottleneck without learned projections.

Caveats

  • Evidence is from decoder-only language models, not numeric time-series models.
  • The compute schedule is static and manually selected after sweeps.
  • Real wall-clock speedups need heterogeneous-shape kernels, memory-layout support, and parallelism tests.
  • The representation-collapse results are promising but do not show preservation of rare events, dense numeric detail, exogenous variables, or action history.