Compute Optimal Tokenization

Summary

Compute Optimal Tokenization is a scaling-law study from Meta FAIR and the University of Washington that treats the tokenizer's compression rate as a first-class model-design variable. Its practical takeaway is that data-to-model scaling rules should be stated in bytes per parameter, not tokens per parameter, whenever the tokenization changes.

Interface

Scaling unit: bytes of training data per parameter.
Tokenization variable: compression rate $T$, measured as average bytes per token.
Main law: for English text, compute-optimal training is close to $\rho^{\star} \approx 60$ bytes per parameter across several compute budgets and compression rates.
Compression result: there is an optimal compression rate, and it slowly decreases as training compute increases.
Released artifacts: arXiv paper, rendered project page, Meta AI publication page, and a facebookresearch repository with results and fitting code.
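
As a rough illustration of how the interface quantities fit together, the sketch below converts the approximate 60 bytes-per-parameter rule into a byte budget and then into a token budget for a given compression rate $T$. The function names `byte_budget` and `token_budget`, the default `rho=60.0`, and the example model size are assumptions chosen for illustration; they are not taken from the released code.

```python
def byte_budget(n_params: float, rho: float = 60.0) -> float:
    """Training bytes for n_params parameters at rho bytes per parameter.

    rho=60.0 is the approximate English-text value quoted above; treat it
    as an illustrative default, not a fitted constant.
    """
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return rho * n_params


def token_budget(n_params: float, bytes_per_token: float, rho: float = 60.0) -> float:
    """Training tokens implied by the byte budget at compression rate T.

    bytes_per_token is T, the average number of bytes per token.
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return byte_budget(n_params, rho) / bytes_per_token


if __name__ == "__main__":
    n = 1e9  # hypothetical 1B-parameter model
    for t in (1.0, 4.0, 8.0):  # byte-level up to a heavily compressed vocabulary
        tokens = token_budget(n, t)
        print(f"T={t:.1f} bytes/token -> {tokens:.3e} tokens "
              f"({tokens / n:.1f} tokens per parameter)")
```

The byte budget stays fixed while the implied tokens-per-parameter ratio changes with $T$, which is why a rule stated in tokens does not carry over between tokenizers.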

Role In The Wiki

This entity is the local object card for tokenization-aware scaling laws. It should be used when a page needs the quantitative claim that token counts are not portable across tokenizers, or when comparing byte-level, latent-token, subword, and superword schemes under a compute budget.
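
To make the portability point concrete, here is a minimal sketch that measures average bytes per token for the same text under two toy tokenizers. The byte-level and whitespace tokenizers are stand-ins chosen for simplicity, not the schemes evaluated in the study, and `compression_rate` is a hypothetical helper name.

```python
from typing import Callable, Sequence


def compression_rate(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Average UTF-8 bytes per token for `text` under `tokenize`."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(tokenize(text))
    if n_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return n_bytes / n_tokens


def byte_tokenize(text: str) -> list:
    """Toy byte-level tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def whitespace_tokenize(text: str) -> list:
    """Toy word-level tokenizer: split on whitespace (illustration only)."""
    return text.split()


if __name__ == "__main__":
    sample = "token counts are not portable across tokenizers"
    for name, tok in (("byte", byte_tokenize), ("whitespace", whitespace_tokenize)):
        print(f"{name}: {len(tok(sample))} tokens, "
              f"{compression_rate(sample, tok):.2f} bytes/token")
```

The same byte count maps to very different token counts, so any comparison across schemes needs the bytes-per-token rate alongside the token count.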

For time-series and world-model work, this is upstream language-model evidence rather than direct TSFM evidence. Its main transfer value is the design question: what is the right information-density unit for dense numeric streams, event streams, graph time series, and action trajectories?

Evidence
