Bolmo: Byteifying The Next Generation Of Language Models
Source
- Raw Markdown: paper_bolmo-2025.md
- PDF: paper_bolmo-2025.pdf
Core Claim
Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs, using a purpose-built architecture and an exact distillation objective.
Key Contributions
- Introduces a fully open byte-level LM family at 1B and 7B scales.
- Treats byteification as tokenizer transfer from a source subword LM.
- Claims conversion can use less than 1% of a typical pretraining token budget.
- Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.
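As a rough illustration of why byte-level input helps character understanding, the sketch below shows how UTF-8 bytes expose every character to the model directly. The helper names `byte_ids` and `count_char` are hypothetical and are not taken from the paper or its code.

```python
from typing import List


def byte_ids(text: str) -> List[int]:
    """Return the UTF-8 byte values (0-255) a byte-level LM would see as input units."""
    return list(text.encode("utf-8"))


def count_char(text: str, ch: str) -> int:
    """Count occurrences of a character by searching for its UTF-8 byte sequence.

    UTF-8 is self-synchronizing, so a match on the encoded bytes of `ch`
    always corresponds to a real occurrence of `ch` in `text`.
    """
    return text.encode("utf-8").count(ch.encode("utf-8"))


# ASCII characters map to one byte each; non-ASCII characters expand to several.
ascii_units = byte_ids("Bolmo")   # five bytes, one per letter
accented_units = byte_ids("é")    # two bytes for a single character
r_count = count_char("strawberry", "r")
```

A subword tokenizer may merge several of these characters into one opaque ID, whereas the byte view keeps each one individually addressable.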
Method Notes
Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
Evidence And Results
The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.
Limitations
Bolmo’s quality depends on the availability of a strong source subword LM and on the byteification recipe itself. The paper does not, by itself, settle whether future models should be trained end to end on raw bytes.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Patch size, dynamic tokenization, and point-wise numeric embeddings | adjacent | Bolmo byteifies subword LMs with a latent tokenizer, non-causal boundary prediction, pooling, and variable compression ratios. | Evidence is language-only; no numeric time-series tokenization or adaptive temporal-resolution test. |
| Dynamic compute allocation | adjacent | Variable compression changes how much latent processing is spent on different spans. | No time-series serving or training evidence that compute is allocated to hard windows, channels, or candidate futures. |
| Point-wise numeric embeddings | warning | Byte-level units preserve character detail and reduce vocabulary bottlenecks, which is an analogy for atomic numeric/event units. | Does not show that byte-style units preserve magnitude, sampling time, or channel semantics for time series. |
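The pooling and variable-compression ideas mentioned in the table can be sketched in a few lines. This is a minimal toy under stated assumptions, not Bolmo's implementation: the boundary mask is hand-picked rather than predicted, the embeddings are one-dimensional, mean pooling stands in for whatever pooling the paper uses, and the names `pool_by_boundaries` and `compression_ratio` are hypothetical.

```python
from typing import List


def pool_by_boundaries(
    vectors: List[List[float]], boundaries: List[bool]
) -> List[List[float]]:
    """Mean-pool consecutive byte vectors into variable-length patches.

    boundaries[i] is True when byte i is the last byte of its patch.
    Trailing bytes with no closing boundary form a final patch.
    """
    if len(vectors) != len(boundaries):
        raise ValueError("vectors and boundaries must have the same length")

    patches: List[List[float]] = []
    current: List[List[float]] = []
    for vec, is_end in zip(vectors, boundaries):
        current.append(vec)
        if is_end:
            patches.append(_mean(current))
            current = []
    if current:
        patches.append(_mean(current))
    return patches


def _mean(group: List[List[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(group[0])
    return [sum(v[d] for v in group) / len(group) for d in range(dim)]


def compression_ratio(n_bytes: int, n_patches: int) -> float:
    """Average number of bytes represented by each latent patch."""
    return n_bytes / n_patches


# Toy input: "byte-level" as 1-dimensional "embeddings" (the raw byte values).
raw = "byte-level".encode("utf-8")
toy_vectors = [[float(b)] for b in raw]
# Hand-picked patch ends after "byte", "-", and "level" (assumption, not model output).
toy_boundaries = [i in (3, 4, 9) for i in range(len(raw))]

toy_patches = pool_by_boundaries(toy_vectors, toy_boundaries)
toy_ratio = compression_ratio(len(raw), len(toy_patches))  # 10 bytes over 3 patches
```

Because patch lengths differ, the ratio varies with the input. That variation is the property the table relates to adaptive resolution, although the time-series analogue remains untested.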
Links Into The Wiki
- Bolmo
- Byte-Level Language Models
- Tokenizer Transfer
- Latent Tokenization
- Foundation Time-Series Model Research Agenda
Open Questions
- Can byteification combine with dynamic learned chunking?
- Which capabilities remain bottlenecked by imperfect boundary prediction?