Bolmo: Byteifying The Next Generation Of Language Models

Source

Raw Markdown: paper_bolmo-2025.md
PDF: paper_bolmo-2025.pdf

Core Claim

Bolmo shows that competitive byte-level language models can be obtained by byteifying existing subword LMs, using an architecture designed for byteification and an exact distillation objective against the source subword model.

Key Contributions

Introduces a fully open byte-level LM family at 1B and 7B scales.
Treats byteification as tokenizer transfer from a source subword LM.
Claims conversion can use less than 1% of a typical pretraining token budget.
Shows gains in character understanding and some coding settings while approaching source-LM performance elsewhere.

Method Notes

Bolmo is the main source for Tokenizer Transfer. It complements H-Net and Synergy, which emphasize end-to-end learned chunking rather than distillation from subword models.
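
One way to picture byteification as tokenizer transfer is that the source subword tokenizer's segmentation supplies boundary labels over the byte sequence. The sketch below is a minimal illustration under that assumption, not the paper's implementation; the function name `byte_boundaries` and the example segmentation are hypothetical.

```python
from typing import List


def byte_boundaries(pieces: List[str]) -> List[int]:
    """Return one label per UTF-8 byte: 1 if the byte ends a subword piece, else 0.

    `pieces` is a hypothetical subword segmentation of some text, given as
    plain strings whose concatenation is the original text.
    """
    labels: List[int] = []
    for piece in pieces:
        n_bytes = len(piece.encode("utf-8"))
        if n_bytes == 0:
            continue  # an empty piece contributes no bytes and no boundary
        # All bytes but the last are inside the piece; the last one closes it.
        labels.extend([0] * (n_bytes - 1) + [1])
    return labels


# Hypothetical segmentation, chosen only for illustration.
pieces = ["By", "te", "ify", "ing"]
text_bytes = "".join(pieces).encode("utf-8")
labels = byte_boundaries(pieces)
print(list(zip(text_bytes, labels)))
```

Every byte receives exactly one label, so the label sequence has the same length as the byte sequence, and multi-byte characters (for example "é", two bytes in UTF-8) only close a piece on their final byte.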

Evidence And Results

The paper compares Bolmo against byte-level baselines and source subword models, with attention to character understanding, coding, general tasks, inference speed, and post-training transfer.

Limitations

Bolmo’s results depend on having a strong source subword LM and an effective byteification recipe. The paper does not by itself settle whether future models should train from raw bytes end to end.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Patch size, dynamic tokenization, and point-wise numeric embeddings	adjacent	Bolmo byteifies subword LMs with a latent tokenizer, non-causal boundary prediction, pooling, and variable compression ratios.	Evidence is language-only; no numeric time-series tokenization or adaptive temporal-resolution test.
Dynamic compute allocation	adjacent	Variable compression changes how much latent processing is spent on different spans.	No time-series serving or training evidence that compute is allocated to hard windows, channels, or candidate futures.
Point-wise numeric embeddings	warning	Byte-level units preserve character detail and reduce vocabulary bottlenecks, which is an analogy for atomic numeric/event units.	Does not show that byte-style units preserve magnitude, sampling time, or channel semantics for time series.
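
To make the pooling and compression-ratio terms in the table concrete, the following sketch mean-pools per-byte vectors into one vector per patch and reports the average number of bytes per patch. Mean pooling, the function names, and the toy vectors are assumptions for illustration; the paper's pooling operator may differ.

```python
from typing import List


def _mean(vecs: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def pool_patches(byte_vecs: List[List[float]], labels: List[int]) -> List[List[float]]:
    """Mean-pool byte vectors into patches; a label of 1 closes the current patch."""
    if len(byte_vecs) != len(labels):
        raise ValueError("need exactly one boundary label per byte vector")
    patches: List[List[float]] = []
    current: List[List[float]] = []
    for vec, is_end in zip(byte_vecs, labels):
        current.append(vec)
        if is_end:
            patches.append(_mean(current))
            current = []
    if current:
        # Trailing bytes with no closing boundary form a final patch.
        patches.append(_mean(current))
    return patches


def compression_ratio(labels: List[int]) -> float:
    """Average bytes per patch implied by the boundary labels."""
    if not labels:
        raise ValueError("empty sequence has no compression ratio")
    n_patches = sum(1 for x in labels if x) + (0 if labels[-1] else 1)
    return len(labels) / n_patches


# Toy 2-dimensional byte vectors and boundary labels (hypothetical values).
byte_vecs = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
labels = [0, 1, 1]
patches = pool_patches(byte_vecs, labels)
ratio = compression_ratio(labels)
print(patches, ratio)  # [[2.0, 1.0], [5.0, 5.0]] 1.5
```

Because the patch count follows the boundary labels, the ratio varies with the input: spans with fewer boundaries are compressed more and consume fewer latent positions.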

Links Into The Wiki

Open Questions

Can byteification combine with dynamic learned chunking?
Which capabilities remain bottlenecked by imperfect boundary prediction?
