Bolmo: Byteifying The Next Generation Of Language Models

Source

Core Claim

Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs, using a purpose-built architecture and an exact distillation objective.

Key Contributions

  • Introduces a fully open byte-level LM family at 1B and 7B scales.
  • Treats byteification as tokenizer transfer from a source subword LM.
  • Claims conversion can use less than 1% of a typical pretraining token budget.
  • Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.
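
As a small illustration of why byte-level units help with character understanding, the sketch below shows what a byte-level model sees as input. It relies only on standard UTF-8 encoding. The helper names `to_byte_ids` and `count_byte` are hypothetical and are not taken from the paper.

```python
def to_byte_ids(text):
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))


def count_byte(text, ch):
    """Count occurrences of a single ASCII character by scanning byte IDs."""
    if len(ch.encode("utf-8")) != 1:
        raise ValueError("this sketch only handles single-byte (ASCII) characters")
    return to_byte_ids(text).count(ord(ch))


ids = to_byte_ids("strawberry")
# Every character is its own unit, so letter counts are directly visible.
r_count = count_byte("strawberry", "r")
# Non-ASCII characters expand to several bytes.
e_acute = to_byte_ids("é")
```

A subword tokenizer may merge several of these characters into one opaque ID, whereas the byte view always exposes each one, at the cost of longer sequences.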

Method Notes

Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
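
Distilling from a subword model requires putting byte-level and subword-level predictions on a common footing. The sketch below shows one generic way to do that, assuming the byte-level student exposes a log-probability per byte: by the chain rule, summing byte log-probabilities over the bytes of a subword token gives the student's log-probability for that token, which can then be compared with the teacher's. This illustrates the general idea only and is not the paper's stated objective. The function names and the gap measure are hypothetical.

```python
import math


def token_logprobs_from_bytes(byte_logprobs, token_byte_lengths):
    """Sum per-byte log-probs over each token's byte span (chain rule)."""
    if sum(token_byte_lengths) != len(byte_logprobs):
        raise ValueError("token byte lengths must cover the byte sequence exactly")
    token_logprobs, start = [], 0
    for length in token_byte_lengths:
        token_logprobs.append(sum(byte_logprobs[start:start + length]))
        start += length
    return token_logprobs


def mean_abs_gap(student, teacher):
    """Hypothetical diagnostic: mean absolute gap between token log-probs."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must score the same tokens")
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)


# Four bytes grouped into two tokens of 1 and 3 bytes (illustrative values).
byte_lp = [math.log(0.5)] * 4
student_lp = token_logprobs_from_bytes(byte_lp, [1, 3])
teacher_lp = [math.log(0.5), math.log(0.125)]
gap = mean_abs_gap(student_lp, teacher_lp)
```

In this toy case the student matches the teacher exactly, so the gap is zero up to floating-point error.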

Evidence And Results

The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.

Limitations

Bolmo’s strength depends on having a strong source subword LM to start from and on the byteification recipe itself. It does not by itself settle whether future models should train from raw bytes end to end.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Patch size, dynamic tokenization, and point-wise numeric embeddings | adjacent | Bolmo byteifies subword LMs with a latent tokenizer, non-causal boundary prediction, pooling, and variable compression ratios. | Evidence is language-only; no numeric time-series tokenization or adaptive temporal-resolution test. |
| Dynamic compute allocation | adjacent | Variable compression changes how much latent processing is spent on different spans. | No time-series serving or training evidence that compute is allocated to hard windows, channels, or candidate futures. |
| Point-wise numeric embeddings | warning | Byte-level units preserve character detail and reduce vocabulary bottlenecks, which is an analogy for atomic numeric/event units. | Does not show that byte-style units preserve magnitude, sampling time, or channel semantics for time series. |
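
The boundary prediction, pooling, and variable compression mentioned in the table can be sketched generically as follows. This is a minimal illustration that assumes boundaries are already given as per-byte flags and that pooling is a plain mean. Bolmo's actual boundary predictor and pooling operator may differ, and the names `pool_bytes` and `compression_ratio` are hypothetical.

```python
def pool_bytes(byte_vectors, ends_patch):
    """Mean-pool byte vectors into patches.

    ends_patch[i] is True when byte i is the last byte of a patch.
    Trailing bytes without a closing flag form a final patch.
    """
    if len(byte_vectors) != len(ends_patch):
        raise ValueError("need exactly one boundary flag per byte vector")
    patches, current = [], []
    for vec, is_end in zip(byte_vectors, ends_patch):
        current.append(vec)
        if is_end:
            patches.append([sum(col) / len(current) for col in zip(*current)])
            current = []
    if current:
        patches.append([sum(col) / len(current) for col in zip(*current)])
    return patches


def compression_ratio(n_bytes, n_patches):
    """Average number of bytes represented by each latent patch."""
    return n_bytes / n_patches


# Three 2-d byte vectors; the first two bytes share a patch (toy values).
vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
flags = [False, True, True]
patches = pool_bytes(vectors, flags)
ratio = compression_ratio(len(vectors), len(patches))
```

Because the flags can place boundaries anywhere, the number of bytes per patch varies from span to span. That variation is what ties compression to compute: spans that yield fewer patches receive fewer latent processing steps.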

Open Questions

  • Can byteification combine with dynamic learned chunking?
  • Which capabilities remain bottlenecked by imperfect boundary prediction?