Compute Optimal Tokenization

Summary

Compute Optimal Tokenization is a scaling-law study from Meta FAIR and the University of Washington that treats token compression rate as a first-class model-design variable. Its practical takeaway is that, whenever the tokenizer changes, data/model scaling rules should be stated in bytes per parameter rather than tokens per parameter.

Interface

  • Scaling unit: bytes of training data per parameter.
  • Tokenization variable: compression rate, measured as average bytes per token.
  • Main law: for English text, the compute-optimal ratio of training bytes to parameters stays close to a single value across several compute budgets and compression rates.
  • Compression result: there is an optimal compression rate, and it slowly decreases as training compute increases.
  • Released artifacts: arXiv paper, rendered project page, Meta AI publication page, and a facebookresearch repository with results and fitting code.
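
The two units in the list above are linked by simple arithmetic: bytes per parameter equals tokens per parameter times bytes per token. The sketch below illustrates that conversion only. The function names and every numeric value are illustrative assumptions and are not taken from the paper or its code repository.

```python
def bytes_per_parameter(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a token-based data/model ratio into a byte-based one."""
    return tokens_per_param * bytes_per_token


def tokens_for_byte_budget(total_bytes: float, bytes_per_token: float) -> float:
    """Number of tokens a tokenizer emits for a fixed amount of raw text."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return total_bytes / bytes_per_token


# Hypothetical inputs, chosen only to show the arithmetic.
ratio = bytes_per_parameter(tokens_per_param=20.0, bytes_per_token=4.0)
print(ratio)  # 80.0 bytes per parameter

# The same 8 GB of text is a different token count under each tokenizer.
print(tokens_for_byte_budget(8e9, bytes_per_token=4.0))  # 2e9 tokens
print(tokens_for_byte_budget(8e9, bytes_per_token=1.0))  # 8e9 tokens (byte-level)
```

Stating the budget in bytes keeps the first number fixed when the tokenizer is swapped, while the token counts in the last two lines move with the compression rate.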

Role In The Wiki

This entity is the local object card for tokenization-aware scaling laws. It should be used when a page needs the quantitative claim that token counts are not portable across tokenizers, or when comparing byte-level, latent-token, subword, and superword schemes under a compute budget.
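
As a rough illustration of why token counts are not portable, the sketch below measures compression rate (average UTF-8 bytes per token) for two toy tokenizers on the same string. Both tokenizers are stand-ins assumed for this example: the whitespace splitter is only a crude proxy for a subword scheme, and neither comes from the paper.

```python
from typing import List, Sequence


def compression_rate(text: str, tokens: Sequence) -> float:
    """Average number of UTF-8 bytes of `text` covered by each token."""
    if len(tokens) == 0:
        raise ValueError("token sequence is empty")
    return len(text.encode("utf-8")) / len(tokens)


def byte_tokenize(text: str) -> List[bytes]:
    """Byte-level scheme: one token per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]


def whitespace_tokenize(text: str) -> List[str]:
    """Toy coarse scheme: one token per whitespace-separated word."""
    return text.split()


sample = "token counts are not portable"
for name, tokenize in [("byte", byte_tokenize), ("whitespace", whitespace_tokenize)]:
    tokens = tokenize(sample)
    print(name, len(tokens), compression_rate(sample, tokens))
```

The byte count of the sample is identical in both rows; only the token count and the bytes-per-token figure change, which is the reason the card prefers byte-based units for cross-tokenizer comparisons.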

For time-series and world-model work, this is upstream language-model evidence rather than direct TSFM evidence. Its main transfer value is the design question: what is the right information-density unit for dense numeric streams, event streams, graph time series, and action trajectories?

Evidence