Byte-Level Language Models

Summary

Byte-level language modeling removes fixed subword vocabularies, but the corpus shows several distinct, not obviously compatible, ways to make that practical.
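As a minimal illustration of what "removing the vocabulary" means at the input level, the sketch below maps text straight to UTF-8 byte values, so the model's input alphabet is at most 256 symbols regardless of language. The sample string and variable names are illustrative assumptions, not taken from any of the cited papers.

```python
# Sketch only: byte-level "tokenization" is just UTF-8 encoding.
# The sample text is an arbitrary example chosen to include non-ASCII characters.
text = "naïve café"

# Every character becomes one or more integers in the range 0..255.
byte_ids = list(text.encode("utf-8"))

# Non-ASCII characters expand to multiple bytes, so the sequence is longer
# than the character count; this is the length cost byte-level models must absorb.
num_chars = len(text)
num_bytes = len(byte_ids)

# The mapping is lossless: decoding the bytes restores the original text.
round_trip = bytes(byte_ids).decode("utf-8")
```

No vocabulary file or merge table is involved, which is the property the approaches below all start from; they differ in how they recover the sequence-length efficiency that subword tokens would otherwise provide.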

What The Wiki Currently Believes

  • Bolmo byteifies existing subword LMs through distillation, recovering much of their behavior with less than 1% of a typical pretraining token budget.
  • Compute Optimal Tokenization adds the scaling-law view: byte-level and latent-token models should be compared by bytes per parameter, compression rate, and inference FLOPs per byte, not by raw token count.
  • H-Net learns dynamic hierarchical byte chunking end to end.
  • Synergy learns abstraction routing over bytes and reports emergent token-like concepts.

Evidence

Bolmo emphasizes transfer from strong subword models; H-Net and Synergy emphasize end-to-end learned segmentation or abstraction. Compute Optimal Tokenization adds that the compression choice itself changes the scaling-law unit, so byte-level models should not be judged only by whether they remove vocabularies. The relevant budget also includes how many bytes are covered per token and how much inference compute is spent per byte.
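The comparison units named above can be made concrete with a small sketch. It assumes a dense model in which every parameter is used once per token, so forward-pass compute is approximated as 2 FLOPs per parameter per token; that approximation, the class name TokenizationBudget, and the example numbers are all assumptions for illustration, not figures from Compute Optimal Tokenization, and hierarchical models such as H-Net would need a per-stage accounting instead.

```python
from dataclasses import dataclass


@dataclass
class TokenizationBudget:
    """Hypothetical helper for comparing models on a per-byte basis."""

    train_bytes: int   # raw bytes of training text
    train_tokens: int  # number of model positions those bytes were packed into
    n_params: int      # parameter count (assumed dense, all used per token)

    @property
    def bytes_per_token(self) -> float:
        # Compression rate: how many raw bytes one model position covers.
        return self.train_bytes / self.train_tokens

    @property
    def bytes_per_param(self) -> float:
        # Data-to-model ratio measured in bytes rather than tokens.
        return self.train_bytes / self.n_params

    @property
    def inference_flops_per_byte(self) -> float:
        # Assumption: ~2 * n_params forward FLOPs per token (attention ignored).
        return 2 * self.n_params / self.bytes_per_token


# Illustrative numbers only: same data and model size, different compression.
subword = TokenizationBudget(train_bytes=4_000_000, train_tokens=1_000_000, n_params=1_000_000)
byte_level = TokenizationBudget(train_bytes=4_000_000, train_tokens=4_000_000, n_params=1_000_000)
```

Under these assumptions both configurations see the same bytes per parameter, but the byte-level one spends four times the inference FLOPs per byte, which is why raw token counts alone are a misleading basis for comparison.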

Relation To Foundation TSFM Agenda

Byte-level language modeling is adjacent to the Foundation Time-Series Model Research Agenda through dynamic tokenization rather than through time-series evidence. H-Net, Synergy, and Compute Optimal Tokenization are useful analogs for learned chunking, abstraction, and compression-aware scaling, but they do not by themselves resolve the open problems of numeric tokenization, physical-time representation, multivariate encoding, or control slots for time-series foundation models.

Open Questions

  • Does byteification mainly solve practical migration, while learned chunking solves long-term architecture?
  • Which approach gives the best multilingual, code, and DNA scaling?
  • What is the time-series analogue of bytes per parameter: samples, channel-time cells, events, compressed bits, or another information-density unit?