---
author:
- Tomasz Limisiewicz
- Artidoro Pagnoni
- Srini Iyer
- Mike Lewis
- Sachin Mehta
- Alisa Liu
- Margaret Li
- Gargi Ghosh
- Luke Zettlemoyer
bibliography:
- custom.bib
date: 'May 4, 2026'
title: Compute Optimal Tokenization
---

```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\newcommand{\compr}{compression rate}
```
```{=latex}
\newcommand{\bpp}{bytes per parameter}
```
```{=latex}
\newcommand{\pari}{parity}
```
```{=latex}
\newcommand{\tomasz}[1]{{\color{forestgreen} [Tomasz]: #1}}
```
```{=latex}
\newcommand{\luke}[1]{{\color{maroon} [Luke]: #1}}
```
```{=latex}
\newcommand{\arti}[1]{{\color{brickred} [arti]: #1}}
```
```{=latex}
\newcommand{\commenterC}[1]{{\color{yellow} [commenterC]: #1}}
```
```{=latex}
\newcommand{\commenterD}[1]{{\color{darkblue} [commenterD]: #1}}
```
```{=latex}
\maketitle
```
```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_1e20_results_compression_contour_global_parameters.pdf}
        \caption{Optimal \bpp{} ratio across compression rates. Fixed training budget $10^{20}$ FLOPs.}
        \label{fig:main-bpp}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_loss_profile.pdf}         % \includegraphics[width=\linewidth]{figures/blt_entropy_optimal_configs_global_parameters_loss.pdf}
        \caption{Optimal \compr{} based on a scaling law fit. Training budget marked by color.}
        \label{fig:main-cr}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/tokenization_compression_vs_optimal.pdf}
        \caption{Optimal \compr{} differs from compression of subword tokenizers}
        \label{fig:main-tokenizers}
    \end{subfigure}
    \caption{Key findings of this work. (\ref{fig:main-bpp}) In compute-optimal scaling, bytes (not tokens) of data increase proportionally to parameter count. (\ref{fig:main-cr}) For each training budget there is an optimal \compr{}, and its value decreases with scale.
    (\ref{fig:main-tokenizers}) The optimal \compr{} varies across languages and differs from the compression of popular BPE tokenizers.}
    \label{fig:main_figure_idea}
\end{figure*}
```
```{=latex}
\clearpage
```
`\tableofcontents`{=latex} 

Introduction {#sec:introduction}
============

Scaling laws have informed the efficient design of language models, prescribing the optimal balance between model size and training data [@kaplan2020scaling; @hoffmann2022training]. Standard approaches estimate the optimal amount of data in tokens for a given compute budget (and model size). However, expressing data volume in tokens overlooks a critical aspect: the information density that each token represents. Consequently, scaling findings inherently depend on specific tokenizers and their key property: the `\compr{}`{=latex}.

To fill the research gap, we introduce laws that are aware of the `\compr{}`{=latex} $T$, defined as the average number of bytes per token in a given dataset. For that purpose, we need to vary the `\compr{}`{=latex} without changing the vocabulary size (and thus the number of parameters). Therefore, in our experiments we rely on Byte Latent Transformer (BLT, [@pagnoni-etal-2025-byte]), a recent architecture that segments byte-level input in a latent space. BLT's latent tokenization is a robust tool for this purpose, as it allows us to precisely adjust the `\compr{}`{=latex} by setting an average segment size.[^1] Additionally, compression plays a significant role in subword tokenization. We can order popular subword methods by their `\compr{}`{=latex}: from pure byte or character-level segmentation ($T\approx1$) [@xue-etal-2022-byt5; @wang2024mambabytetokenfreeselectivestate], through widely used BPE ($T\approx4.57$) [@sennrich-etal-2015-neural], to SuperBPE ($T\approx6.16$) [@liu2025superbpespacetravellanguage], which achieves high compression by allowing multi-word tokens.[^2]

In the context of scaling, `\compr{}`{=latex} impacts model efficiency in both training and inference. Increasing compression allows the same data to be represented with fewer tokens, directly reducing the computational cost of processing. The unlocked savings in FLOPs (unit of computation) can be used to increase training data, model size, or both, without increasing the total computation budget.

To the best of our knowledge, this is the first thorough study of the effect of `\compr{}`{=latex} on the compute efficiency of language models. We pose the following research questions:

#### **\[R1\]**: How does compression rate impact the compute-optimal ratio between parameters and data?

This question concerns the unit of data we should use in model scaling. We investigate whether the compute-optimal ratio is best expressed in tokens or bytes (which are the underlying unit of text encoding). For example, given the Chinchilla rule of thumb of training on $\approx 20$ tokens per parameter [@hoffmann2022training], does this ratio hold as we increase compression, or should the ratio of bytes to parameters remain constant given a dataset of English texts?

#### **\[R2\]**: Is there an optimal compression rate for specific datasets?

We investigate whether there exists a `\compr{}`{=latex} that yields the lowest loss for a fixed compute budget, assuming the optimal data to parameter ratio. Furthermore, we examine whether this optimal `\compr{}`{=latex} shifts with the compute budget or dataset domain.

#### **\[R3\]**: Is the impact of compression rate on scaling trends similar for latent and subword tokenized models?

Does the answer to the previous questions depend on the tokenization method? We conduct experiments on subword-tokenized models to validate if the scaling trends match those observed for BLT.

#### **\[R4\]**: Is optimal compression rate language specific?

We extend our experiments to languages other than English to test whether optimal data to parameter ratio and compression rate change depending on language. We hypothesize that both will grow proportionally to *parity*, defined as the ratio of byte length of parallel sentences expressed in two languages [@petrov2023token_unfairness; @ahia-etal-2023-languages].

The structure of the paper is as follows. In Section `\ref{sec:methodology}`{=latex}, we describe our experimental setting, including details of the datasets, models, and methods for deriving power laws. In Section `\ref{sec:compression_scaling}`{=latex}, we present experiments scaling BLT across a wide range of `\compr{}`{=latex}s to answer **\[R1\]** and **\[R2\]**. In Section `\ref{sec:subword_scaling}`{=latex}, we examine subword-tokenized models to compare with the findings from the previous section and address **\[R3\]**. Finally, in Section `\ref{sec:beyond_english}`{=latex}, we extend our scaling experiments to languages other than English to answer **\[R4\]**.

Methodology {#sec:methodology}
===========

In this section, we provide details on the language models used and experimental setup. We also describe the evaluation and the procedure for fitting power laws to estimate the optimal data-to-parameter ratio and loss.

Model Architectures
-------------------

```{=latex}
\begin{wrapfigure}{r}{0.5\textwidth}
  
  
  \includegraphics[width=\linewidth]{figures/blt_entropy_1e20_results_experimental_grid_global_parameters.pdf}
  \caption{The grid of experiments for the budget of $C=10^{20}$ FLOPs. For each \compr{} $T$ (x-axis) and model size $N$ (y-axis) we can read amount of training data $B$ (color) and the corresponding \bpp{} ratio $\rho$ (values in squares).}
  \label{fig:experimental_grid}
  
  
\end{wrapfigure}
```
For all experiments, we train Transformer models [@vaswani2017attention] of varying parameter sizes, adhering to Llama 3 architectural choices [@llama3herd2024]. We follow a standard scaling recipe: increasing models' width and depth in a 1:1 ratio, meaning the number of heads equals the number of layers. The latent dimension size is set to 128 times the number of heads, and the feed-forward network uses 4$\times$ upscaling.

#### Latent Tokenized Models

These models feature a hierarchical architecture comprising three modules: (1) an encoder that aggregates byte-level representations into latent tokens; (2) a global module operating on these latent tokens (the Transformer model described above); and (3) a decoder that maps latent representations back to the byte level for next-byte prediction. We adopt the Byte Latent Transformer (BLT) architecture [@pagnoni-etal-2025-byte]. BLT utilizes entropy spikes to segment byte sequences into latent tokens, allowing us to control the `\compr{}`{=latex} by adjusting the entropy threshold. Cross-attention mechanisms implement the mapping between latent and byte embeddings. A key deviation from the original BLT implementation is the omission of hash embeddings for byte n-grams. We omit these because n-grams can span more bytes than the latent tokens themselves, potentially interfering with the target `\compr{}`{=latex}. We also introduce a modified scaling recipe for the local modules (encoder and decoder), observing that prioritizing width over depth gives better performance. The exact scaling recipe is presented in Appendix `\ref{sec:model-scaling-details}`{=latex}.

#### Subword Tokenized Models

We employ standard isotropic models following the Llama 3 architecture [@llama3herd2024].[^3] By analogy to hierarchical models, subword embedding and de-embedding layers correspond to local modules. Unlike latent models, the `\compr{}`{=latex} ($T$) of subword models is not directly controllable but is determined by the tokenization method, the tokenizer's training corpus, and the vocabulary size ($V$). To obtain a wide range of `\compr{}`{=latex}s, we train language models using different subword tokenization algorithms: character tokenization ($T=1.01$, $V=148,000$); the BPE tokenizer of Llama 3 ($T=4.57$, $V=126,000$); and the SuperBPE tokenizer, which allows merging multiple words into one token [@liu2025superbpespacetravellanguage] ($T=6.16$, $V=200,000$).[^4] We also analyze versions of the Llama 3 tokenizer with $75\%$ and $90\%$ of the vocabulary masked, obtaining compression rates of $T=4.16$ and $T=3.71$, respectively. Even though the vocabulary is masked in these models, we still use the original $V=126,000$ for FLOPs computation. The `\compr{}`{=latex}s $T$ are estimated on the DCLM dataset used in training.

The exact specifications for all models used in our study are presented in Appendix `\ref{sec:model-scaling-details}`{=latex}.

Training and Evaluation
-----------------------

We train models under compute budgets ($C$) expressed in FLOPs, ranging from $5\times10^{18}$ to $2\times10^{21}$ FLOPs. If not stated otherwise, we use exact computation of training FLOPs instead of an approximation. In total, we train 988 latent tokenized and 320 subword tokenized models, with sizes from 50M to 6.7B parameters, on training data ranging from 4B to 1.1T bytes.

For each budget $C$, we vary the parameter size ($N$) and compression rate ($T$). Given $C$, the parameter count $N$ and `\compr{}`{=latex} $T$ uniquely determine the amount of training data in bytes ($B$). Consequently, for each compute budget, we obtain a grid of models corresponding to the Cartesian product of $T$ and $N$. For each configuration, we compute the `\bpp{}`{=latex} ratio ($\rho$), as shown in Figure `\ref{fig:experimental_grid}`{=latex}. For BLT, we test six `\compr{}`{=latex} values $T \in \{1,2,4,6,8,12\}$, while for subword models, the `\compr{}`{=latex} is determined by the tokenizer: $T \in \{1.01,3.71,4.16,4.57,6.16\}$. For all training runs, we fix the batch size at 2 million bytes and the learning rate at $4\times10^{-4}$. We use the AdamW optimizer [@loshchilov2019decoupled] with a warmup-stable-decay learning rate schedule.
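
To illustrate how $N$ and $T$ fix the data budget, the sketch below uses the approximation $C \approx 6N\frac{B}{T}$ (Approximation `\ref{eq:compute_approximation}`{=latex}). This is an assumption made for illustration only, since the experiments rely on exact FLOP counts; the function names and the example model sizes are hypothetical.

```python
def training_bytes(compute_flops: float, n_params: float, compression: float) -> float:
    """Bytes of training data B implied by C ~= 6 * N * (B / T).

    Illustrative approximation only; it ignores local modules and attention.
    """
    return compute_flops * compression / (6.0 * n_params)


def bytes_per_parameter(compute_flops: float, n_params: float, compression: float) -> float:
    """Bytes-per-parameter ratio rho = B / N for one cell of the grid."""
    return training_bytes(compute_flops, n_params, compression) / n_params


if __name__ == "__main__":
    C = 1e20  # compute budget in FLOPs
    for T in (1, 2, 4, 6, 8, 12):  # BLT compression rates used in the grid
        for N in (5e8, 1e9, 2e9):  # hypothetical model sizes
            B = training_bytes(C, N, T)
            print(f"T={T:>2}  N={N:.1e}  B={B:.2e}  rho={B / N:.1f}")
```

Under this approximation, doubling $T$ at fixed $C$ and $N$ doubles the number of training bytes that fit in the budget.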

Unless stated otherwise, we train on DCLM [@li2024datacomp], a dataset of plain English texts selected to limit data mixing across domains and languages. Data mixing could cause non-uniform granularity of information and thus confound our analysis. We evaluate models on the C4 validation split [@raffel2020exploring].

To compare loss across models with different tokenization methods, we evaluate models using bits-per-byte (BPB), which is the total cross-entropy loss in bits divided by the number of bytes in the evaluation texts. In each training and evaluation example, we fix the context to the same length of 8192 bytes (e.g., with `\compr{}`{=latex} $T=4$ we evaluate on 2048 tokens per example, and with `\compr{}`{=latex} $T=8$ on 1024 tokens).
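
The conversion can be sketched as follows. The sketch assumes that the cross-entropy is summed over an evaluation text and expressed in nats, as is common in deep learning frameworks; the text above does not specify the unit, and the function names are hypothetical.

```python
import math


def bits_per_byte(total_loss_nats: float, num_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) into bits-per-byte."""
    return total_loss_nats / (math.log(2) * num_bytes)


def tokens_per_example(context_bytes: int, compression: float) -> int:
    """Approximate number of tokens that cover a fixed byte context."""
    return round(context_bytes / compression)


if __name__ == "__main__":
    for T in (4, 8):
        print(T, tokens_per_example(8192, T))
```

Because the denominator counts bytes rather than tokens, the metric is comparable across models whose tokens cover different amounts of text.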

Fitting Power Laws
------------------

We fit the parameters of the power laws presented in the next section using the BFGS optimizer [@liu1989bfgs; @zhu1997bfgsb], minimizing a sum-of-squares loss. To ensure reliability, we start the optimization from multiple random initializations and compute confidence intervals using a numerical approximation of the Hessian. Further details on the fitting procedure can be found in Appendix `\ref{sec:power-law-details}`{=latex}.

Scaling Laws and Data Compression {#sec:compression_scaling}
=================================

```{=latex}
\begin{figure*}[!tb]
    
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_1e20_results_global_parameters_loss_curves.pdf}
        \caption{2D IsoFLOP}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_1e20_results_compression_contour_global_parameters.pdf}
        \caption{3D IsoFLOP (heatmap)}
    \end{subfigure}
    \caption{
    Evaluation scores of latent tokenized models on the C4 test set with a fixed FLOPs budget ($C=10^{20}$), plotted against the \bpp{} ratio.
    A 2-dimensional IsoFLOP (parabola) was fitted for each compression rate, while a 3-dimensional IsoFLOP was fitted jointly for all compression rates (on the x-axis).
    The minima of both fits show that the minimal loss is obtained at an almost constant \bpp{} ratio $\rho\approx60$.
    For IsoFLOPs as a function of data and parameters, and for other compute budgets, refer to Appendix~\ref{sec:isoflop-across-budgets}.
    % Each dot represents the validation loss of a model with a fixed FLOPs budget ({\boldmath $C=10^{20}$}); all models use latent tokenization to achieve the set compression. The upper row presents loss values as a function of a tested variable (B, N, or B/N) corresponding to each compression rate. A parabola (i.e., 1-dimensional polynomial) is fitted for each compression line to interpolate values of the loss. The lower row shows a heatmap of loss as a function of the tested variable (y-axis) and compression rate (x-axis). A paraboloid (i.e., 2-dimensional polynomial) is fitted jointly for all compression rates. With both fits (1- and 2-dimensional) we can find clear minima of the function, denoted by red triangles. Analogical figures for other compute budgets are presented in Appendix~\ref{sec:isoflop-across-budgets}.
    }
    \label{fig:stage_1_blt_1e20}
\end{figure*}
```
In this section, we present scaling results for BLT models that reveal the role of data `\compr{}`{=latex}. We fit scaling laws in two stages, as such an approach yields more faithful approximations [@li2025misfitting]. In the first stage, we estimate the optimal training data size in bytes $B^{\star}$ and model size $N^{\star}$ as power law functions of compute budget $C$ and compression rate $T$, addressing research question **R1**. In the second stage, we model the dynamics of the optimal loss $L^\star$ obtained at the optimal configuration ($B^{\star}$, $N^{\star}$). We examine the effect of `\compr{}`{=latex} $T$ on $L^\star$ to answer research question **R2**.

Scaling Law I: Optimal Data and Parameters
------------------------------------------

![Amount of data](figures/blt_entropy_optimal_configs_global_parameters_data.png){#fig:optimal_data_blt width="\\linewidth"}

```{=latex}
\hfill
```
![Model size](figures/blt_entropy_optimal_configs_global_parameters_params.png){#fig:optimal_params_blt width="\\linewidth"}

For each compute budget $C$ and compression rate $T$, we identify the optimal training data size by fitting a second-degree polynomial (i.e. IsoFLOP) to the relationship between log-data $\log(B)$ and validation loss $L$. The optimal data size $B^{\star}$ corresponds to the minimum of this parabola. We determine the corresponding optimal parameter count $N^{\star}$ via log-linear interpolation.
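
A minimal sketch of this IsoFLOP step is shown below, using only the Python standard library. It is an illustration under stated assumptions rather than the fitting code used for the reported results: the helper names are hypothetical, the log-data values are centered before fitting for numerical stability, and the interpolation of $N^{\star}$ is not covered.

```python
import math


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c).

    Needs at least three distinct x values.
    """
    # Normal equations (A^T A) w = A^T y for design columns [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [m[i] + [v[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # Back substitution.
    w = [0.0] * 3
    for i in reversed(range(3)):
        w[i] = (aug[i][3] - sum(aug[i][j] * w[j] for j in range(i + 1, 3))) / aug[i][i]
    return tuple(w)


def isoflop_optimal_bytes(bytes_list, losses):
    """Return B*, the minimum of the parabola fitted to (log B, loss)."""
    logs = [math.log(b) for b in bytes_list]
    center = sum(logs) / len(logs)  # centering keeps the system well conditioned
    a, b, _ = fit_parabola([x - center for x in logs], losses)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    return math.exp(center - b / (2.0 * a))
```

For one compute budget and one compression rate, `isoflop_optimal_bytes` takes the training-data sizes of all runs and their validation losses and returns the estimated $B^{\star}$.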

In the first power law we estimate the optimal training data size $B^{\star}$ as a power law function of compute budget $C$ and compression rate $T$:

$$\label{eq:optimal_data}
B^{\star}(C,T) \cong  B_0C^\alpha T^\beta$$

This equation involves three parameters: $B_0$ (initial optimal data), $\alpha$ (scaling with compute), and $\beta$ (scaling with compression). In this fit, for simplicity and better generalization across tokenizers, we consider only the parameters of the latent module (i.e., excluding encoder/decoder parameters for BLT and embedding parameters for subword models). Importantly, given our fixed scaling recipe, the number of the model's "latent" parameters determines the "total" parameter count. We can approximate the latent module's compute $C$ as:

$$\label{eq:compute_approximation}
C \approx 6N\frac{B}{T}$$

where $\frac{B}{T}$ is the amount of data expressed in tokens, typically denoted as $D$ in other scaling law works. Substituting Equation `\ref{eq:optimal_data}`{=latex} into Approximation `\ref{eq:compute_approximation}`{=latex} and solving for $N$ yields a power law for the optimal global parameter count:

$$\label{eq:optimal_param}
N^{\star}(C,T) \cong \frac{1}{6B_0}C^{1-\alpha}T^{1-\beta} = N_0C^{1-\alpha}T^{1-\beta}$$

We also define the optimal Byte-per-Parameter ratio, $\rho^\star = B^\star / N^{\star}$. Based on the derived power laws, this ratio has the following form:

$$\label{eq:optimal_bpp}
\rho^\star(C,T) \cong \frac{B_0}{N_0}C^{2\alpha-1}T^{2\beta -1}$$

Before observing the actual fit, we can describe the meaning of specific hypothetical values of $\alpha$ and $\beta$.

-   $\alpha\approx 0.5$ would mean that $\rho^\star$ remains constant across compute budgets $C$, i.e., data and parameters should be scaled in a 1:1 proportion. A similar equivalence was observed by @hoffmann2022training.

-   Analogously, $\beta\approx 0.5$ would indicate that the compute unlocked by higher compression should be split equally between increasing parameters and increasing training data. Hence, the optimal `\bpp{}`{=latex} ratio $\rho^\star$ would remain constant across compression rates $T$.

-   $\beta\approx 1$ would indicate that we can omit the notion of compression from scaling laws and replace $B$ (amount of data in bytes) with $D = \frac{B}{T}$ (amount of data in tokens). In that case, the optimal token count $D^\star$ and parameter count $N^\star$ would not depend on $T$, and the scaling law could be simplified to consider the amount of data in tokens and neglect the impact of compression (as done in previous scaling studies).
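
The three hypothetical cases can be checked mechanically. The sketch below tabulates the exponents of $C$ and $T$ implied by Equations `\ref{eq:optimal_data}`{=latex} to `\ref{eq:optimal_bpp}`{=latex}, together with those of $D^\star = B^\star / T$; the function names are hypothetical and the inputs are illustrative values, not fitted ones.

```python
def scaling_exponents(alpha: float, beta: float) -> dict:
    """(exponent of C, exponent of T) for B*, N*, rho* = B*/N* and D* = B*/T."""
    return {
        "B*": (alpha, beta),
        "N*": (1 - alpha, 1 - beta),
        "rho*": (2 * alpha - 1, 2 * beta - 1),
        "D*": (alpha, beta - 1),
    }


def growth_when_T_doubles(alpha: float, beta: float) -> dict:
    """Multiplicative change of each optimum when T doubles at fixed C."""
    return {
        name: 2.0 ** t_exp
        for name, (_, t_exp) in scaling_exponents(alpha, beta).items()
    }


if __name__ == "__main__":
    for beta in (0.5, 1.0):
        print(beta, growth_when_T_doubles(0.5, beta))
```

With $\beta = 0.5$, doubling $T$ multiplies both $B^\star$ and $N^\star$ by $\sqrt{2}$ and leaves $\rho^\star$ unchanged; with $\beta = 1$, it doubles $B^\star$ while $N^\star$ and $D^\star$ stay fixed.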

Scaling Law I: Results
----------------------

The IsoFLOPs analysis shows that for a set compute budget $C$, a second degree fit faithfully describes the relationship between logarithm of data size $\log(B)$ and validation loss $L$ (see Figure `\ref{fig:1d_isoflops_blt}`{=latex} in Appendix). Therefore, we can easily identify the optimal data size $B^\star$ by finding the minimum of the parabola (or paraboloid in the three-dimensional case).

Moreover, the results empirically confirm that the optimal data and parameter count gradually increase with increasing `\compr{}`{=latex} $T$, thanks to a decrease in compute cost per byte. Figure `\ref{fig:stage_1_blt_1e20}`{=latex} indicates that across compression rates the optimal byte-per-parameter ratio $\rho^\star$ is close to constant. This implies that modifying tokenization (and thus `\compr{}`{=latex}) changes the compute optimal relation between tokens and parameters, whereas the relationship between bytes and parameters remains constant. Therefore, the latter is a more robust way to express the optimal data-to-model-size ratio, and we recommend considering it when designing language models with different tokenizers or vocabularies.

Plotting the values of $B^\star$ and $N^\star$ across $C$ and $T$ in Figure `\ref{fig:optimal_data_params_blt}`{=latex}, we observe a log-log linear relationship, which supports the power law form in Equation `\ref{eq:optimal_data}`{=latex}. The fit yields the following parameter values: $B_0=17.5$, $N_0=9.5\times10^{-3}$, $\alpha=0.465$, $\beta=0.471$. Crucially, both $\alpha$ and $\beta$ are close to 0.5, indicating that the optimal byte-per-parameter ratio is close to constant across compute budgets and compression rates. This allows us to answer the first research question **R1**:

```{=latex}
\begin{keyfinding}[title={Finding 1}]The optimal ratio between bytes of data and model parameters ($\rho^\star$) remains close to constant across variable compute budget and compression rates.
Therefore, when generalizing a scaling recipe to a model with a different tokenizer, we advise matching the ratio of training bytes (not tokens) to model parameters.
\end{keyfinding}
```
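
To make the magnitude of this effect concrete, the fitted Scaling Law I can be evaluated numerically. The sketch below plugs in the rounded coefficients reported above for BLT models; the function names are hypothetical, and because the coefficients are rounded the outputs are indicative only.

```python
# Rounded Scaling Law I coefficients reported above for BLT models.
B0, ALPHA, BETA = 17.5, 0.465, 0.471
N0 = 1.0 / (6.0 * B0)  # implied by C ~= 6 * N * B / T; approximately 9.5e-3


def optimal_bytes(compute: float, compression: float) -> float:
    """B*(C, T) = B0 * C^alpha * T^beta."""
    return B0 * compute**ALPHA * compression**BETA


def optimal_params(compute: float, compression: float) -> float:
    """N*(C, T) = N0 * C^(1 - alpha) * T^(1 - beta)."""
    return N0 * compute ** (1.0 - ALPHA) * compression ** (1.0 - BETA)


def optimal_bpp(compute: float, compression: float) -> float:
    """rho*(C, T) = B*(C, T) / N*(C, T)."""
    return optimal_bytes(compute, compression) / optimal_params(compute, compression)


if __name__ == "__main__":
    for T in (1, 2, 4, 6, 8, 12):
        rho = optimal_bpp(1e20, T)
        print(f"T={T:>2}  B*={optimal_bytes(1e20, T):.2e}  "
              f"N*={optimal_params(1e20, T):.2e}  "
              f"bytes/param={rho:.1f}  tokens/param={rho / T:.1f}")
```

With these rounded coefficients and $C=10^{20}$, the predicted $\rho^\star$ moves only from roughly 73 at $T=1$ to roughly 63 at $T=12$, whereas the corresponding tokens-per-parameter ratio $\rho^\star / T$ changes more than tenfold over the same range.
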
Scaling Law II: Optimal Loss Dynamics {#sec:scaling_law_2}
-------------------------------------

![Power law fit for loss prediction based on compute budget and compression rate for BLT models. The slices of the fitted manifold for each compute budget (lines) are compared with the optimal loss values (triangles).](figures/blt_entropy_optimal_configs_global_parameters_loss.png){#fig:cr_vs_loss_blt width="\\linewidth"}

```{=latex}
\hfill
```
![Power law fit for loss prediction based on compute budget and compression rate for BLT models. The slices of the fitted manifold for each compute budget (lines) are compared with the optimal loss values (triangles).](figures/blt_entropy_loss_profile.png){#fig:cr_vs_loss_blt width="\\linewidth"}

In the next stage, we model the optimal loss $L^\star$, defined as the loss obtained with the optimal data $B^\star$ and parameter count $N^\star$ for a given compute budget and compression rate:

$$\label{eq:optimal_loss_def}
L^\star(C,T) \overset{\mathrm{def}}{=} L(B^\star(C,T), N^\star(C,T))$$

We posit that the optimal loss can be approximated by a power law of the form:

$$\label{eq:optimal_loss}
    L^\star(C,T) \cong  L_0 \times C^\gamma + f(C,T)$$

This stage involves fitting three components: $L_0$ (initial loss), $\gamma < 0$ (scaling with compute), and $f(C,T)$ (a function representing compression-specific residuals, including irreducible loss). We do not make a priori assumptions about the form of $f(C,T)$; instead, we fit it empirically based on the results obtained for each compression rate separately.

Scaling Law II: Results
-----------------------

We plot the optimal loss $L^\star$ as a function of compute budget $C$ and compression rate $T$ in Figure `\ref{fig:optimal_loss_blt}`{=latex}. As expected, the loss decreases with increasing compute budget, but the relation between `\compr{}`{=latex} $T$ and $L^\star$ is non-monotonic. Specifically, the loss attains a minimum at $T^{\star}\approx4$ and rises for both higher and lower `\compr{}`{=latex}s. We also observe a slow decrease of the optimal `\compr{}`{=latex} as the compute budget increases.

The power law fit gives us the following values of parameters: $L_0=3342$, $\gamma=-0.206$. We further examine the distribution of compression-specific offsets $f(C,T)$ in Figures `\ref{fig:optimal_loss_blt}`{=latex} and `\ref{fig:cr_vs_loss_blt}`{=latex}. Based on the polynomial profile for $f(\cdot)$ we can estimate with high confidence its form as:[^5]

$$\label{eq:optimal_compression_rate}
    f(C,T) = F \times \log^2\left(\frac{C^\delta T}{T_0} \right) + E$$

The best fit was obtained with $F=0.032$, $\delta=0.035$, $T_0=18.2$, and $E=0.70$. Both visual and power law evidence support the claim that the optimal compression rate $T^\star=\frac{T_0}{C^\delta}$ slowly decreases with training budget, e.g. $T^\star= 3.69$ for $C=10^{20}$ and $T^\star= 3.33$ for $C=2\times10^{21}$. This allows us to answer the second research question **R2**:

```{=latex}
\begin{keyfinding}[title={Finding 2}]
% Its value slowly decreases with more compute.
At each training compute budget, there is an optimal \compr{} $T^\star$.
Diverging from its value in either direction increases loss.
We observe decreasing optimal \compr{} for higher training budgets.
\end{keyfinding}
```
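
The fitted residual term can be evaluated directly, as in the sketch below. It uses the rounded coefficients reported above; the function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm in Equation `\ref{eq:optimal_compression_rate}`{=latex} is not stated (the location of the minimum does not depend on it).

```python
import math

# Rounded Scaling Law II coefficients reported above for BLT models.
F, DELTA, T0, E = 0.032, 0.035, 18.2, 0.70


def optimal_compression(compute: float) -> float:
    """T*(C) = T0 / C^delta, the minimizer of f(C, T) over T."""
    return T0 / compute**DELTA


def compression_offset(compute: float, compression: float) -> float:
    """f(C, T) = F * log^2(C^delta * T / T0) + E (natural log assumed)."""
    return F * math.log(compute**DELTA * compression / T0) ** 2 + E


if __name__ == "__main__":
    for C in (1e19, 1e20, 1e21, 2e21):
        t_star = optimal_compression(C)
        print(f"C={C:.0e}  T*={t_star:.2f}  f(C, T*)={compression_offset(C, t_star):.3f}")
```

With the rounded coefficients this gives $T^\star$ near 3.6 for $C=10^{20}$ and near 3.3 for $C=2\times10^{21}$. The small gap to the values quoted above (3.69 and 3.33) is presumably due to rounding of the printed coefficients.
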
Optimal Tokenization during Inference {#sec:inference-flops}
-------------------------------------

![BPB on C4 test-set](figures/endtask_c4.png){#fig:endtask_c4 width="\\linewidth"}

```{=latex}
\hfill
```
![0-shot Accuracy on HellaSwag](figures/endtask_hellaswag.png){#fig:endtask_hellswag width="\\linewidth"}

To further study the role of optimal tokenization in inference, we compare the performance of models trained under the $C=2\times10^{21}$ budget with different compression rates against their inference cost. Specifically, we consider results on language modeling and 0-shot accuracy on the HellaSwag generative benchmark [@zellers2019hellaswag]. In Figure `\ref{fig:endtask_hellswag}`{=latex}, we observe that a higher compression rate decreases the inference compute cost for models of the same size (e.g., a 3.3B parameter model with $T=8$ is cheaper to run than a model of the same size with $T=4$). However, we also observe that a `\compr{}`{=latex} closer to the optimal value improves results in the inference-compute-matched setting. For instance, the 3.3B model with `\compr{}`{=latex} $T=4$ has a similar inference cost of $2.1\times10^{9}$ FLOPs/byte to the 6.7B model with `\compr{}`{=latex} $T=8$, while the former achieves higher end-task accuracy (74.1% vs. 68.2%). We present further results for the AI2 Reasoning Challenge [@clark2018arc] in Appendix `\ref{sec:arc-results}`{=latex}.
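
The inference-compute matching in this example can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the common approximation of $2N$ forward-pass FLOPs per token for the global module and divides by $T$ bytes per token. It ignores the local modules and attention, so it is expected to underestimate the $2.1\times10^{9}$ FLOPs/byte quoted above; the function name is hypothetical.

```python
def global_inference_flops_per_byte(n_params: float, compression: float) -> float:
    """Approximate forward FLOPs per byte: 2 * N per token, T bytes per token."""
    return 2.0 * n_params / compression


if __name__ == "__main__":
    small = global_inference_flops_per_byte(3.3e9, 4)  # 3.3B model, T=4
    large = global_inference_flops_per_byte(6.7e9, 8)  # 6.7B model, T=8
    print(f"3.3B, T=4: {small:.3e} FLOPs/byte")
    print(f"6.7B, T=8: {large:.3e} FLOPs/byte")
```

Under this approximation the two configurations differ by less than 2% in cost per byte, which is consistent with treating them as an inference-compute-matched pair.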

Compute Optimal Subword Tokenization {#sec:subword_scaling}
====================================

```{=latex}
\rowcolors{2}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:power_law_fits}
  **Parameter**    **Latent**   **Subword**       **95% CI**
  --------------- ------------ ------------- ---------------------
  $\alpha$           0.465         0.501       \[0.471, 0.532\]
  $\beta$            0.471         0.446       \[0.387, 0.506\]
  $B_0$               17.5          2.8          \[0.7, 11.0\]
  $N_0$              0.0095        0.059       \[0.015, 0.229\]
  $\gamma$           -0.206       -0.181      \[-0.226, -0.1352\]
  $L_0$               3342         1087         \[ 171, 6896\]

  : Fitted power law parameters for the families of latent and subword tokenized models. The 95% confidence intervals were computed with numeric Hessian for the subword tokenized models.
:::

In this section, we validate the observations from the previous section for subword tokenized models. We train models with different subword tokenization algorithms (character-level tokenization, BPE, BPE with vocabulary masking, and SuperBPE) to vary the compression rate $T$. We then repeat the analysis of optimal data and parameter configurations and compare the fits of Scaling Laws I and II between latent and subword tokenized models to answer research question **R3**.

Results
-------

```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/isotropic_1e20_results_global_parameters_loss_curves.pdf}
        \caption{2D IsoFLOP}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/isotropic_1e20_results_compression_contour_global_parameters.pdf}
        \caption{3D IsoFLOP (heatmap)}
    \end{subfigure}
    % \caption{Each dot represents the validation loss of a model with a fixed FLOPs budget ({$C=10^{20}$}). Configurations use different subword tokenization algorithms resulting in varying compression rates: 1.01 for character-level tokenization, 3.71, 4.16, and 4.57 for BPE, 6.16 for SuperBPE. The interpolation and minimum estimation methods are the same as described in Figure~\ref{fig:stage_1_blt_1e20}.}
    % % For IsoFLOP figures for data, parameters, andother compute budgets, refer to Appendix~\ref{sec:isoflop-across-budgets}.
    \caption{
    Evaluation scores of subword tokenized models on the C4 test set with a fixed FLOPs budget ($C=10^{20}$), plotted against the \bpp{} ratio.
    Different subword tokenization algorithms obtain varying compression rates: 1.01 for character-level tokenization; 3.71, 4.16, and 4.57 for BPE; and 6.16 for SuperBPE.
    A 2-dimensional IsoFLOP (parabola) was fitted for each compression rate, while a 3-dimensional IsoFLOP was fitted jointly for all compression rates (on the x-axis).
    Similar to latent tokenized models, the minima of both fits show that the minimal loss is obtained at an almost constant \bpp{} ratio.
    For IsoFLOPs as a function of data and parameters, and for other compute budgets, refer to Appendix~\ref{sec:isoflop-across-budgets}.}
    \label{fig:stage_1_iso_1e20}
\end{figure*}
```
![Power law fit for loss prediction based on compute budget and compression rate for isotropic models. The slices of the fitted manifold for each compute budget (lines) are compared with the optimal loss values (triangles).](figures/isotropic_optimal_configs_global_parameters_loss.png){#fig:cr_vs_loss_iso width="\\linewidth"}

```{=latex}
\hfill
```
![Power law fit for loss prediction based on compute budget and compression rate for isotropic models. The slices of the fitted manifold for each compute budget (lines) are compared with the optimal loss values (triangles).](figures/isotropic_loss_profile.png){#fig:cr_vs_loss_iso width="\\linewidth"}

Similar to the latent tokenization case, the IsoFLOPs curves allow us to identify the optimal data amount in bytes $B^\star$ for a fixed compute budget $C$. Figure `\ref{fig:stage_1_iso_1e20}`{=latex} shows the optimal values for a specific compute budget. We observe that the optimal byte-per-parameter ratio $\rho^\star$ is similar across tokenizers.

The Scaling Law I fit shows results close to the latent tokenization case: $B_0=2.8$, $N_0=59\times10^{-3}$, $\alpha=0.501$, $\beta=0.446$. Also, loss dynamics presented in Figure `\ref{fig:optimal_loss_iso}`{=latex} and the Scaling Law II fit show similar results as the latent tokenization case: $L_0=1087$, $\gamma=-0.181$, $F=0.0575$, $\delta=0.129$, $T_0=1577$, and $E=0.680$. The fit values are compared with latent tokenized models, as shown in Table `\ref{tab:power_law_fits}`{=latex}.

As in the previous section, we observe a compute-optimal `\compr{}`{=latex} that decreases for higher compute budgets (Figure `\ref{fig:cr_vs_loss_iso}`{=latex}). Surprisingly, under a high compute budget, the models with $90\%$ and $75\%$ of the vocabulary masked (yet still included in the FLOPs computation) outperform the models with the original BPE tokenizer, as shown in Table `\ref{tab:isotropic_budget_bpb_comparison}`{=latex}. This suggests that lower compression is beneficial for training larger-scale models.

These observations allow us to answer the question **R3**:

```{=latex}
\begin{keyfinding}[title={Finding 3}]The scaling trends discovered for models with latent tokenization (BLT) also hold for models with subword tokenization (BPE, SuperBPE).
\end{keyfinding}
```
```{=latex}
\rowcolors{4}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:isotropic_budget_bpb_comparison}
  **Compute (FLOPs)**   **Character**   **BPE, V. mask=90%**   **BPE, V. mask=75%**   **BPE, original**   **SuperBPE**
  -------------------- --------------- ---------------------- ---------------------- ------------------- --------------
  $1 \times 10^{19}$       1.2678              1.0819                 1.0709             **1.0635**          1.0682
  $2 \times 10^{19}$       1.1812              1.0381                 1.0281             **1.0214**          1.0273
  $5 \times 10^{19}$       1.0989              0.9887                 0.9819             **0.9769**          0.9840
  $1 \times 10^{20}$       1.0519              0.9554                 0.9502             **0.9461**          0.9532
  $2 \times 10^{20}$       1.0126              0.9254                 0.9220             **0.9186**          0.9272
  $5 \times 10^{20}$       0.9556              0.8942                 0.8916             **0.8891**          0.8976
  $1 \times 10^{21}$       0.9253              0.8665               **0.8658**             0.8659            0.8763
  $2 \times 10^{21}$       0.9027            **0.8466**               0.8469               0.8479            0.8582
  **Compression:**          1.01                3.71                   4.16                 4.57              6.16

  : Lowest BPB obtained by subword-tokenized models at each compute budget. Bold marks the best result in each row; the last row reports the `\compr{}`{=latex} of each tokenizer.
:::
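
BPB makes the losses in this table comparable across tokenizers because it normalizes by bytes rather than by tokens. A minimal sketch of the conversion, assuming the reported loss is a mean per-token cross-entropy in nats and the compression rate is measured in bytes per token (the function and argument names are illustrative, not taken from our codebase):

```python
import math


def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) into bits per byte."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    # nats -> bits, then spread the bits over the bytes covered by one token.
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token


# Hypothetical numbers, not measured in the experiments:
# a loss of ln(16) nats per token equals 4 bits per token.
example_bpb = bits_per_byte(math.log(16), bytes_per_token=4.0)
print(round(example_bpb, 3))
```

Under this definition, a tokenizer with higher compression can reach the same BPB while having a higher per-token loss, which is why per-token losses alone are not comparable across columns.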

Compute Optimal Tokenization Beyond English {#sec:beyond_english}
===========================================

```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_fra_Latn_1e20_results_compression_contour_global_parameters_fra_Latn.pdf}
        \caption{French (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_vie_Latn_1e20_results_compression_contour_global_parameters_vie_Latn.pdf}
        \caption{Vietnamese (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_arb_Arab_1e20_results_compression_contour_global_parameters_arb_Arab.pdf}
        \caption{Arabic (Arabic)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_rus_Cyrl_1e20_results_compression_contour_global_parameters_rus_Cyrl.pdf}
        \caption{Russian (Cyrillic)}
    \end{subfigure}
        \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_bytes2_1e20_results_compression_contour_global_parameters.pdf}
        \caption{English $\times2$ (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_hin_Deva_1e20_results_compression_contour_global_parameters_hin_Deva.pdf}
        \caption{Hindi (Devanagari)}
    \end{subfigure}
    \caption{3D IsoFLOP (heatmap) fits at $C=10^{20}$ as a function of \bpp{} and \compr{} for five languages and the inflated English $\times2$ data. All models use latent tokenization to achieve the set compression. IsoFLOPs are fitted jointly for all compression rates. 
    %The analogous 2D IsoFLOP are presented in Appendix~\ref{sec:xling_1d_isoflop}.
    }
    \label{fig:2d_xlingual_1e20}
\end{figure*}
```
To test how the language choice affects the compute-optimal compression rate and `\bpp{}`{=latex} ratio, we extend our experiments to five languages with diverse writing scripts: French (Latin), Vietnamese (Latin), Russian (Cyrillic), Arabic (Arabic), Hindi (Devanagari). We also create an artificially inflated version of English data by adding a dummy byte between pairs of original UTF-8 bytes. Such English $\times2$ data represent the same information at half the density.
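
The inflation step can be sketched as follows. This is a minimal illustration: the value of the dummy byte (`0x00`) and its placement after every original byte are assumptions made for the example, chosen so that the output is exactly twice as long as the input.

```python
def inflate_x2(data: bytes, dummy: int = 0x00) -> bytes:
    """Insert one dummy byte after every original byte, doubling the length."""
    out = bytearray()
    for b in data:
        out.append(b)
        out.append(dummy)  # carries no information (assumed filler value)
    return bytes(out)


def deflate_x2(data: bytes) -> bytes:
    """Recover the original bytes by keeping every second byte."""
    return data[::2]


original = "naïve text".encode("utf-8")
inflated = inflate_x2(original)
print(len(original), len(inflated))
```

Because the transformation is trivially invertible, the inflated stream carries the same information as the original, only spread over twice as many bytes.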

For this purpose, we train latent-tokenized models (BLT) on monolingual data from FineWeb-2 [@penedo2025fineweb], training a separate set of models for each language.[^6] We evaluate each model on the corresponding test split from the same source. For English $\times2$, we use an inflated version of the C4 test set used in the previous experiments.

Our training setup is analogous to the one described in Section `\ref{sec:compression_scaling}`{=latex} for English. For each language $l$, we fix the training budget to $C=10^{20}$ FLOPs. The IsoFLOPs analysis (similar to that presented in Figure `\ref{fig:stage_1_blt_1e20}`{=latex}) allows us to identify the compute optimal `\bpp{}`{=latex} ratio $\rho_l^\star$ and `\compr{}`{=latex} $T_l^\star$, for each language $l$ at fixed compute budget.

Further, we compare these values to cross-lingual `\pari{}`{=latex}, defined as the ratio between the number of bytes required to express the same information in different languages [@petrov2023token_unfairness]. We estimate `\pari{}`{=latex} by dividing the byte length of sentences in each language by the byte length of their English translations. We use translations from the FLORES-200 multi-parallel corpus [@goyal2021flores; @nllb2022], whose test split contains roughly 1000 English sentences and their translations into a wide range of languages.
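
A minimal sketch of this estimate on a toy parallel sample is given below. Whether the ratio is taken over corpus totals (as here) or averaged per sentence is an assumption of the sketch, and the two sentence pairs are illustrative rather than drawn from FLORES-200.

```python
def byte_parity(target_sentences: list[str], english_sentences: list[str]) -> float:
    """Ratio of UTF-8 byte lengths between aligned target and English sentences."""
    if len(target_sentences) != len(english_sentences):
        raise ValueError("sentence lists must be aligned one-to-one")
    target_bytes = sum(len(s.encode("utf-8")) for s in target_sentences)
    english_bytes = sum(len(s.encode("utf-8")) for s in english_sentences)
    if english_bytes == 0:
        raise ValueError("English side is empty")
    return target_bytes / english_bytes


# Toy example: Cyrillic letters take two bytes each in UTF-8.
english = ["Hello", "Good morning"]
russian = ["Привет", "Доброе утро"]
print(byte_parity(russian, english))
```

By construction, English has parity 1.0 with itself, and the inflated English $\times2$ data has parity 2.0.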

These experiments address our last research question **R4**.

Results
-------

![Language-specific BPB. BPB scales inversely with `\pari{}`{=latex}.](figures/multilingual_bpb.png){#fig:xlingual_bpbs_1e20 width="\\linewidth"}

```{=latex}
\hfill
```
![Language-specific optimal `\bpp{}`{=latex} ratio ($\rho^\star_l$). Lower information density (high `\pari{}`{=latex}) correlates with a preference for more training data over a larger model (high $\rho^\star_l$).](figures/multilingual_byte_per_parameter.png){#fig:xlingual_bpp_1e20 width="\\linewidth"}

```{=latex}
\rowcolors{3}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:xlingual_optimals_1e20}
  ------------------------------------------------------- ------------ -------------------- ----------------- ----------- ----------- -------
                       **Language**                        **Parity**     $\rho^\star_l$     $\rho^\star_l$   $T^\star_l$ $T^\star_l$ **BPB**
                                                                            **Value**           **Ratio**      **Value**   **Ratio**
                          English                             1.0              62.1                1.0           3.71         1.0      0.960
                          French                              1.2              57.8               0.93           4.16        1.12      0.795
                        Vietnamese                            1.4              61.2               0.99           5.11        1.38      0.602
                          Arabic                              1.6              75.8               1.22           4.58        1.23      0.564
                          Russian                             2.0              96.3               1.55           5.67        1.52      0.461
                       English (x2)                           2.0              91.1               1.47           6.97        1.87      0.492
                           Hindi                              2.6              95.5               1.54           8.09        2.18      0.337
  ------------------------------------------------------- ------------ -------------------- ----------------- ----------- ----------- -------

  : Compute-optimal byte-per-parameter ratio ($\rho^\star_l$) and `\compr{}`{=latex} ($T^\star_l$) compared to cross-lingual `\pari{}`{=latex}. Results are for monolingual models with a $C=10^{20}$ FLOPs budget. `\pari{}`{=latex} and the Ratio columns are expressed relative to the English baseline.
:::

Figures `\ref{fig:1d_xlingual_1e20}`{=latex} and `\ref{fig:2d_xlingual_1e20}`{=latex} present the results of the IsoFLOPs analysis across all analyzed languages. Similarly to English, we observe that the minimal loss is achieved by models with close to constant `\bpp{}`{=latex} ($\rho^\star_l$). From the polynomial fit, we estimate the compute-optimal `\bpp{}`{=latex} ratio and `\compr{}`{=latex} by analytically finding the coordinates of the global minimum (i.e. lowest loss).
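
The analytic minimization can be sketched as follows. The sketch assumes a second-degree polynomial in $(\log\rho, \log T)$; the degree, the log-space coordinates, and all names are assumptions of the example, and the data are synthetic rather than experimental.

```python
import numpy as np


def fit_quadratic_surface(x, y, loss):
    """Least-squares fit of loss ~ c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2."""
    x, y, loss = (np.asarray(a, dtype=float) for a in (x, y, loss))
    design = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coeffs, *_ = np.linalg.lstsq(design, loss, rcond=None)
    return coeffs


def quadratic_minimum(coeffs):
    """Solve grad = 0 for the fitted surface; requires a positive-definite Hessian."""
    _, c1, c2, c3, c4, c5 = coeffs
    hessian = np.array([[2 * c3, c4], [c4, 2 * c5]])
    if np.any(np.linalg.eigvalsh(hessian) <= 0):
        raise ValueError("fitted surface has no unique minimum")
    return np.linalg.solve(hessian, -np.array([c1, c2]))


# Synthetic loss surface with a known minimum at rho=60, T=4 (illustrative only).
rho_grid, t_grid = np.meshgrid(np.linspace(20, 200, 8), np.linspace(2, 10, 8))
x = np.log(rho_grid.ravel())
y = np.log(t_grid.ravel())
loss = 1.0 + 0.3 * (x - np.log(60.0)) ** 2 + 0.2 * (y - np.log(4.0)) ** 2

x_star, y_star = quadratic_minimum(fit_quadratic_surface(x, y, loss))
rho_star, t_star = float(np.exp(x_star)), float(np.exp(y_star))
print(rho_star, t_star)
```

The Hessian check guards against fits whose stationary point is a saddle or maximum, in which case no interior optimum should be reported.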

![Language-specific optimal `\compr{}`{=latex} ($T^\star_l$). Higher `\pari{}`{=latex} translates to a higher optimal `\compr{}`{=latex}; for Latin-script languages, the relationship is close to 1:1.](figures/multilingual_compression_rate.png){#fig:xlingual_crs_1e20 width="\\linewidth"}

```{=latex}
\hfill
```
![Compute-optimal compression differs from the data compression obtained by popular language models. We observe that popular pre-trained subword tokenizers tend to over-compress high-resource languages (e.g., English, Arabic), while significantly under-compressing the less-resourced ones (e.g., Vietnamese, Hindi).](figures/tokenization_compression_vs_optimal.png){#fig:tokenizers_crs width="\\linewidth"}

Notably, $\rho^\star_l$ is language-dependent (e.g., $\rho^\star_\text{AR}\approx75.8$; $\rho^\star_\text{RU}\approx96.3$). We also observe language-dependent differences in the compute-optimal compression rate (e.g., $T^\star_\text{AR}\approx4.58$; $T^\star_\text{RU}\approx5.67$). In Table `\ref{tab:xlingual_optimals_1e20}`{=latex}, we compare these compute-optimal values to cross-lingual `\pari{}`{=latex}. We observe that the optimal values depend on the language and its `\pari{}`{=latex}. Figure `\ref{fig:xlingual_bpbs_1e20}`{=latex} shows that language-specific BPB scales inversely with `\pari{}`{=latex}. This confirms the observation from @limisiewicz-etal-2024-myte: under optimal tokenization, similar information expressed across languages has similar likelihood. We further observe that `\pari{}`{=latex} correlates with the optimal `\bpp{}`{=latex} ratio: more coarsely encoded languages tend to benefit more from additional training data than from larger models (Figure `\ref{fig:xlingual_bpp_1e20}`{=latex}). In joint multilingual training, however, the optimal `\bpp{}`{=latex} ratios converge to the same value across languages (see Appendix `\ref{sec:beyond_english_mix}`{=latex}).

Lastly, we observe that higher `\pari{}`{=latex} translates to a higher optimal `\compr{}`{=latex}. For Latin-script languages, this relationship is close to a 1:1 increase (Figure `\ref{fig:xlingual_crs_1e20}`{=latex}). Importantly, the compression achieved by popular multilingual tokenizers (Llama 3 [@llama3herd2024], Qwen 3 [@qwen3technicalreport], and EuroLLM [@martins2024eurollm]) differs from the optimal value, as seen in Figure `\ref{fig:tokenizers_crs}`{=latex}. These tokenizers tend to over-compress high-resource languages while under-compressing lower-resource ones.
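
The relationships described above can be checked directly from the values reported in Table `\ref{tab:xlingual_optimals_1e20}`{=latex}. The sketch below computes Pearson correlations between `\pari{}`{=latex} and the two compute-optimal quantities; with only seven data points, the resulting coefficients are indicative rather than conclusive, and the helper function is written for this example only.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two aligned sequences with at least two points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Rows: English, French, Vietnamese, Arabic, Russian, English (x2), Hindi.
parity = [1.0, 1.2, 1.4, 1.6, 2.0, 2.0, 2.6]
rho_star = [62.1, 57.8, 61.2, 75.8, 96.3, 91.1, 95.5]
t_star = [3.71, 4.16, 5.11, 4.58, 5.67, 6.97, 8.09]

r_rho = pearson(parity, rho_star)
r_t = pearson(parity, t_star)
print(f"parity vs rho*: {r_rho:.2f}, parity vs T*: {r_t:.2f}")
```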

These results answer our last research question **R4**:

```{=latex}
\begin{keyfinding}[title={Finding 4}]The optimal bytes-of-data-to-parameter ratio ($\rho^\star$) and \compr{} ($T^\star$) vary across languages.
Both are correlated with the average information content of bytes in a given language (measured by \pari{}).
\end{keyfinding}
```
Related Work
============

Data Compression in Scaling Laws
--------------------------------

Foundational studies on neural scaling laws, such as those by @kaplan2020scaling and @hoffmann2022training, have primarily focused on the relationship between model size, dataset size (in tokens), and compute. Subsequent works [@Pearce2024reconciling; @porian2024resolving] have pointed out that the decision of whether to include vocabulary embeddings in the analysis was one of the causes of divergence between scaling laws derived in these studies. @hoffmann2022training propose a compute-optimal training ratio of approximately $20$ tokens per parameter. However, they assume a fixed tokenization scheme, overlooking the information content of the tokens themselves. We generalize this scaling rule across tokenizers and express it as a comprehensive byte-per-parameter ratio: $\rho^\star\approx60$ (for English data).

@tao2024scaling derived scaling laws for vocabulary size in BPE-tokenized models. Their study explores how varying vocabulary size impacts computational cost and performance. They also consider the importance of `\compr{}`{=latex} in model scaling, which is indirectly controlled by the vocabulary size. By considering a broad scope of compression values and compute budgets, we show that the benefits of scaling up vocabulary diminish at larger scales. We further discuss differences between experimental settings in Appendix `\ref{sec:comparison_tao_et_al}`{=latex}.

Multiple recent works discussed language model scaling trends across domains and languages. @yang2025scalinglawscode derived scaling laws across programming languages, showing that language-specific data composition significantly affects scaling behavior. In the multilingual space, @he-etal-2025-scaling established per-language scaling laws, while @longpre2026atlas studied the dynamics of cross-lingual transfer at scale. Overall, these works demonstrate that scaling laws differ across domains and languages, as we have also observed in our multilingual experiments.

Search for Optimal Tokenization
-------------------------------

The research community has long sought to identify tokenizer properties that correlate with language model performance. The compression rate, or its proxies such as *fertility*, have been identified as a significant factor, especially in the multilingual setting. @rust-etal-2021-good observed that in multilingual language models, monolingual tokenizers with higher in-language compression outperform multilingual ones. Similarly, @limisiewicz-etal-2023-tokenization [@goldman-etal-2024-unpacking] noted the benefits of higher `\compr{}`{=latex} for certain downstream tasks in multilingual models. @galle-2019-investigating show that higher compression is also beneficial for machine translation. However, in the subword tokenizers considered in these works, language-specific compression depends on the representation of the language in the training corpora. Thus, compression could be a proxy for the root cause of the performance differences, namely language frequency in the data mix.

In monolingual (English) models, @schmidt-etal-2024-tokenization argued that higher compression is not inherently beneficial. However, @liu2025superbpespacetravellanguage observed an upward trend in downstream task performance with higher compression, even when perplexity (measured in bits per byte) degraded. This discrepancy underscores the importance of evaluating downstream performance alongside language modeling metrics.

In contrast to prior work, our extensive search reveals that the impact of `\compr{}`{=latex} on performance is non-monotonic: there exists an optimal `\compr{}`{=latex} beyond which performance degrades. We also observe a preference for lower compression in longer training. The recurring prior-work assumption linking higher `\compr{}`{=latex} with performance improvement in the multilingual setting may stem from the fact that subword tokenizers typically result in a lower-than-optimal `\compr{}`{=latex} for low-resource languages, as shown in Figure `\ref{fig:tokenizers_crs}`{=latex}.

Scaling Latent Tokenized Models
-------------------------------

We employ BLT [@pagnoni-etal-2025-byte] as our primary framework for studying scaling laws with variable compression. Techniques such as entropy-based or static patching allow precise control of the compression rate across a wide range. While promising, the data efficiency of training and inference for dynamically tokenized models has not yet been comprehensively studied in the context of scaling laws.
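
To make the link between patching and compression concrete, the sketch below shows simplified versions of both strategies. It assumes that per-byte next-byte entropies are already available (in BLT they come from a small byte-level language model) and that a patch starts whenever the entropy exceeds a global threshold; the entropy values, the threshold, and the function names are hypothetical.

```python
def static_patch_starts(num_bytes: int, patch_size: int) -> list[int]:
    """Start a new patch every `patch_size` bytes."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return list(range(0, num_bytes, patch_size))


def entropy_patch_starts(entropies: list[float], threshold: float) -> list[int]:
    """Start a new patch at every byte whose predicted entropy exceeds the threshold."""
    if not entropies:
        return []
    starts = [0]  # the first byte always opens a patch
    for i, h in enumerate(entropies[1:], start=1):
        if h > threshold:
            starts.append(i)
    return starts


def compression_rate(num_bytes: int, starts: list[int]) -> float:
    """Average number of bytes per patch."""
    return num_bytes / len(starts)


# Hypothetical entropies (in bits) for an 8-byte sequence.
entropies = [3.0, 0.5, 0.4, 2.5, 0.3, 0.2, 0.1, 2.9]
dynamic = entropy_patch_starts(entropies, threshold=2.0)
static = static_patch_starts(len(entropies), patch_size=4)
print(dynamic, compression_rate(len(entropies), dynamic))
print(static, compression_rate(len(entropies), static))
```

Raising the threshold (or the static patch size) yields fewer, longer patches and therefore a higher compression rate, which is the knob varied in the scaling experiments.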

Recent results in latent tokenization suggest that this approach yields greater gains at scale. The works of @pagnoni-etal-2025-byte [@hwang2025dynamicchunkingendtoendhierarchical; @neitemeier2025hierarchical; @nawrot-etal-2023-efficient] demonstrate that, given sufficient training compute, hierarchical models can surpass their subword-tokenized counterparts. Furthermore, latent tokenization allows adjusting compression for specific languages [@ahia-etal-2024-magnet; @owodunni2025flexitokens]. Based on our findings, we expect such approaches to be particularly beneficial for multilingual language modeling.

Discussion
==========

The relationship between scaling laws and data compression highlights the importance of considering tokenizer `\compr{}`{=latex} in the optimal design of large language models. Our observations overlap with the @hoffmann2022training (Chinchilla) recipe, suggesting that data and model parameters should be scaled proportionally. Generalizing the Chinchilla rule, we show that the appropriate unit for data quantity is bytes, not tokens. Therefore, the widely accepted rule of using approximately $20$ tokens per parameter for compute-optimal training holds only under `\compr{}`{=latex} specific to a BPE tokenizer. We generalize this rule (Scaling Law I) by empirically showing that compute-optimal architectures for English text should use approximately $60$ `\bpp{}`{=latex}, regardless of data compression. This generalization makes it easy to transfer efficient training settings across different tokenization schemes, spanning from byte-level to superword-level tokens.
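
As a worked illustration of this transfer, a fixed byte budget per parameter translates into a different token budget for each tokenizer. The sketch below applies $\rho^\star\approx60$ to a few compression rates; the rates are round illustrative numbers rather than measurements, and the function name is ours.

```python
def tokens_per_parameter(bytes_per_parameter: float, bytes_per_token: float) -> float:
    """Translate a byte-per-parameter target into a token-per-parameter target."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_parameter / bytes_per_token


RHO_STAR_EN = 60.0  # approximate compute-optimal bytes per parameter for English

for rate in (1.0, 3.0, 6.0):  # illustrative compression rates in bytes per token
    print(f"{rate:.1f} bytes/token -> {tokens_per_parameter(RHO_STAR_EN, rate):.1f} tokens/parameter")
```

Under this arithmetic, the familiar 20 tokens per parameter corresponds to about 3 bytes per token, while a byte-level model would need about 60 tokens (bytes) per parameter and a tokenizer with twice the compression only about 10.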

Furthermore, Scaling Law II reveals the existence of an optimal `\compr{}`{=latex} that depends on the training domain and the compute budget. Interestingly, when training on English data with a small FLOP budget, the optimal `\compr{}`{=latex} is close to that of a BPE tokenizer. However, we observe a slow decrease as training compute increases. This observation also holds for subword-tokenized models. Strikingly, a model with $90\%$ of the BPE vocabulary masked performs slightly better than standard BPE in our largest runs (even though both spend the same compute in the embedding and de-embedding layers). This surprising result suggests that, for compute-efficient training of large models, it could be beneficial to decrease vocabulary size or apply techniques such as BPE-dropout [@provilkov-etal-2020-bpe]. Why do we observe such a counterintuitive result? Our hypothesis is that less-compressed tokenizers allow the model to use more compute at inference time by dividing each evaluation sample into more tokens that are processed by the model. It is important to keep in mind that lower compression naturally increases the cost of model usage (as shown in Section `\ref{sec:inference-flops}`{=latex}). Therefore, when controlling for `\compr{}`{=latex}, we should consider the trade-off between performance and inference cost. Specifically, it is advisable to use higher `\compr{}`{=latex} to decrease model usage cost, similarly to how model developers opt for over-training language models to boost the performance of relatively smaller (and thus cheaper) models compared to their training-compute-optimal counterparts.

The search for compute-optimal `\compr{}`{=latex} is especially important for languages other than English, where the compression obtained by subword tokenizers tends to diverge from the optimal value to a more extreme extent. Previously, it was thought that multilingual performance of language models is affected by over-segmentation, and many studies focused on increasing compression for better multilingual performance [@rust-etal-2021-good; @limisiewicz-etal-2023-tokenization]. We observe, across all considered languages, that overly high compression deteriorates results. Furthermore, for each language we find a specific optimal `\compr{}`{=latex}, the value of which is correlated with the relative information density of the text, i.e., *parity*. This observation highlights the importance of identifying and achieving an optimal `\compr{}`{=latex} for each of the modeled languages. For statistics-based subword tokenizers (such as BPE), `\compr{}`{=latex} is heavily impacted by the amount of in-language data in the training corpus [@ahia-etal-2023-languages] and encoding efficiency [@limisiewicz-etal-2024-myte], and thus cannot be easily controlled in a massively multilingual setting. This limitation provides a strong argument for latent tokenizers in multilingual language modeling, whose compression can be adapted for specific languages [@ahia-etal-2024-magnet; @owodunni2025flexitokens].

Future Work
-----------

While our study is the most comprehensive investigation of the impact of tokenization on scaling laws to date, there are multiple directions for future work in this area:

#### Optimal Compression for Other Modalities.

In this work, we focused on text data. We expect the impact of data compression to be equally relevant for other modalities, such as vision, speech, and code. Currently, each modality utilizes a different set of tokenization techniques, such as variational autoencoders [@oord2017vqvae] or vision transformers [@yu2024image] for images. Therefore, the scaling analysis requires considering the impact of modality-specific tokenization artifacts.

#### Sparse Architectures.

Another direction is considering architectures other than dense (hierarchical) transformer models, such as Mixture of Experts (MoE) models. Studying the role of data compression could answer the question of how it interacts with parameter sparsity, and thus could be an important contribution to MoE scaling laws [@ludziejewski2024moe].

Limitations
-----------

To keep the study tractable, we fixed several training hyperparameters across all runs. In particular, we did not tune the learning rate for specific training budgets or adapt any hyperparameters other than those named.

While we have examined a wide range of tokenization methods, spanning both latent and subword families, there could be other design choices that affect the results. Examples of such aspects include pre-tokenization rules, other subword algorithms (e.g., Unigram; [@kudo-2018-subword]), or token boundary prediction for latent tokenizers. We expect that such changes would have a minor effect on the main findings of this work. In Appendix `\ref{sec:tokenization_method_comparison}`{=latex}, we provide a further comparison across tokenization methods.

Conclusion
==========

In this work, we have systematically studied the role of data compression in the scaling trends of large language models. We have shown that the optimal ratio between training data bytes and model parameters, denoted as $\rho^\star$, remains approximately constant across varying compute budgets and compression rates. Consequently, when generalizing scaling recipes to models with different tokenizers, we advise matching the ratio of training bytes (not tokens) to model parameters. Additionally, we find that the optimal compression rate $T^\star$ is specific to the training domain and slowly decreases with the training budget. Finally, we show that these scaling trends with `\compr{}`{=latex} hold consistently for both latent and subword-tokenized models.

Acknowledgments {#acknowledgments .unnumbered}
===============

We thank Jonathan Hayase, Julie Kallini, Pedro Rodriguez, and Rylan Schaeffer for insightful discussions that helped shape and improve this work. We are grateful to Artyom Kozhevnikov, David Dale, and Marta R. Costa-jussà for their advice and practical assistance with multilingual experiments. Finally, we express special gratitude to Cody Ohlsen for going to great lengths to resolve technical obstacles at one of the project's most critical moments.

```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{assets/plainnat}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
```{=latex}
\beginappendix
```
Model Scaling: Technical Details {#sec:model-scaling-details}
================================

In this section, we describe the model architectures in detail.

The core experiments were conducted with BLT models [@pagnoni-etal-2025-byte]. We followed the original implementation with a few notable exceptions. As noted in Section `\ref{sec:methodology}`{=latex}, we find that the local modules should be wide (high number of heads) and shallow (low number of layers). To achieve such a shape, we set the number of layers in each local module to the ceiling of one-fourth of the number of global layers, and the local head count to the ceiling of one-fourth of the number of global heads, plus 8. The cross-attention key-query duplication factor $k$ is set to the ceiling of one-eighth of the global module's head count. The hidden dimension of the local modules is set to 64 times the number of heads. This scaling recipe ensures that the compute overhead introduced by the local modules is comparable to the embedding layers found in isotropic models of similar scale.
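
The recipe above can be written down as a short function. The sketch follows the stated rules; in addition, it rounds the local head count up to the next even number, which is an assumption inferred from the values listed in Table `\ref{tab:blt_scales}`{=latex} and is not part of the textual description. The function name and the returned dictionary layout are illustrative.

```python
import math


def local_module_shape(global_layers: int, global_heads: int) -> dict[str, int]:
    """Derive local (encoder/decoder) hyperparameters from the global module."""
    local_layers = math.ceil(global_layers / 4)
    local_heads = math.ceil(global_heads / 4) + 8
    # Assumption: round up to an even head count to match the table values.
    local_heads += local_heads % 2
    k = math.ceil(global_heads / 8)  # cross-attention key-query duplication factor
    return {
        "layers": local_layers,
        "heads": local_heads,
        "dim": 64 * local_heads,
        "k": k,
    }


for size in (5, 9, 17, 32):
    print(size, local_module_shape(global_layers=size, global_heads=size))
```

For the smallest and largest configurations (5 and 32 global layers and heads), this reproduces the local shapes listed in the table: 2 layers with 10 heads of total width 640, and 8 layers with 16 heads of total width 1024.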

An important divergence from the original BLT architecture of @pagnoni-etal-2025-byte is the omission of hash embeddings. To compensate for the reduced capacity for encoding the input, we increase the number of layers in the encoder to match the decoder (originally, the encoder has only one layer). Table `\ref{tab:blt_scales}`{=latex} presents the scales and architecture hyperparameters of all BLT models used in this work.

Similarly, Table `\ref{tab:isotropic_scales}`{=latex} outlines the scaling recipe for subword tokenized models.

We compare the compute spent in the latent module as a percentage of the total inference compute for both families of models in Table `\ref{tab:scales_flops_share}`{=latex}. With our scaling recipes, the global module takes up a similar share of compute in the BLT architecture as in the isotropic model when the model scale and `\compr{}`{=latex} are matched. Decreasing `\compr{}`{=latex} or increasing model size correlates with a higher relative utilization of the global model.

```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
```{=latex}
\rowcolors{6}{metabg}{white}
```
::: {#tab:blt_scales}
  ---------------------------------------------------------------------------------------------- ----------------------------- --------------------- ------------ ------------ ----------- --------- ------------ ----------- ------- ------------ -- --
                                    **Global (Latent Module)**                                    **Local (Encoder/Decoder)**   **Cross-Attention**   **Total**                                                                                       
   `\cmidrule`{=latex}(lr)1-4 `\cmidrule`{=latex}(lr)5-8 `\cmidrule`{=latex}(lr)9-10 **Layers**            **Heads**                  **Dim**         **Params**   **Layers**   **Heads**   **Dim**   **Params**   **Heads**   **k**   **Params**     
                                                5                                                              5                        640              25M           2           10         640        10M          10         1        50M         
                                                6                                                              6                        768              43M           2           10         640        10M          10         1        68M         
                                                7                                                              7                        896              67M           2           10         640        10M          10         1        93M         
                                                8                                                              8                       1024              101M          2           10         640        10M          10         1        127M        
                                                9                                                              9                       1152              143M          3           12         768        21M          12         2        199M        
                                                10                                                            10                       1280              197M          3           12         768        21M          12         2        253M        
                                                11                                                            11                       1408              262M          3           12         768        21M          12         2        318M        
                                                12                                                            12                       1536              340M          3           12         768        21M          12         2        396M        
                                                13                                                            13                       1664              432M          4           12         768        28M          12         2        506M        
                                                14                                                            14                       1792              540M          4           12         768        28M          12         2        613M        
                                                15                                                            15                       1920              664M          4           12         768        28M          12         2        738M        
                                                16                                                            16                       2048              805M          4           12         768        28M          12         2        880M        
                                                17                                                            17                       2176              966M          5           14         896        48M          14         3        1.1B        
                                                18                                                            18                       2304              1.1B          5           14         896        48M          14         3        1.3B        
                                                19                                                            19                       2432              1.3B          5           14         896        48M          14         3        1.5B        
                                                20                                                            20                       2560              1.6B          5           14         896        48M          14         3        1.7B        
                                                21                                                            21                       2688              1.8B          6           14         896        58M          14         3        2.0B        
                                                22                                                            22                       2816              2.1B          6           14         896        58M          14         3        2.2B        
                                                23                                                            23                       2944              2.4B          6           14         896        58M          14         3        2.5B        
                                                24                                                            24                       3072              2.7B          6           14         896        58M          14         3        2.9B        
                                                25                                                            25                       3200              3.1B          7           16        1024        88M          16         4        3.3B        
                                                26                                                            26                       3328              3.5B          7           16        1024        88M          16         4        3.7B        
                                                27                                                            27                       3456              3.9B          7           16        1024        88M          16         4        4.1B        
                                                28                                                            28                       3584              4.3B          7           16        1024        88M          16         4        4.6B        
                                                29                                                            29                       3712              4.8B          8           16        1024        101M         16         4        5.1B        
                                                30                                                            30                       3840              5.3B          8           16        1024        101M         16         4        5.6B        
                                                31                                                            31                       3968              5.9B          8           16        1024        101M         16         4        6.1B        
                                                32                                                            32                       4096              6.4B          8           16        1024        101M         16         4        6.7B        
  ---------------------------------------------------------------------------------------------- ----------------------------- --------------------- ------------ ------------ ----------- --------- ------------ ----------- ------- ------------ -- --

  : The configuration of latent tokenized models (BLT architecture) used in scaling experiments.
:::

```{=latex}
\rowcolors{6}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:isotropic_scales}
  ---------------------------------------------------------------------------------------------- ----------------------------------- ---------------------- ------------ ---------- --------- -------------- ---------- --------- -------------- -- --
                                    **Global (Latent Module)**                                                **Global**                   **Global**        **Global**   **Local** **Local**   **Local**     **Total** **Total**   **Total**
                                            **Layers**                                                        **Heads**                     **Dim**          **Params**   **Char**   **BPE**   **SuperBPE**   **Char**   **BPE**   **SuperBPE**     
                                                5                                                                 5                           640               25M         96M        82M         128M         121M      107M         153M         
                                                6                                                                 6                           768               43M         115M       98M         154M         158M      141M         196M         
                                                7                                                                 7                           896               67M         134M      115M         179M         202M      182M         247M         
                                                8                                                                 8                           1024              101M        154M      131M         205M         254M      232M         306M         
                                                9                                                                 9                           1152              143M        173M      148M         230M         316M      291M         374M         
                                                10                                                               10                           1280              197M        192M      164M         256M         389M      360M         453M         
                                                11                                                               11                           1408              262M        211M      180M         282M         473M      442M         543M         
                                                12                                                               12                           1536              340M        230M      197M         307M         570M      536M         647M         
                                                13                                                               13                           1664              432M        250M      213M         333M         682M      645M         765M         
                                                14                                                               14                           1792              540M        269M      229M         358M         808M      769M         898M         
                                                15                                                               15                           1920              664M        288M      246M         384M         952M      909M         1.0B         
                                                16                                                               16                           2048              805M        307M      262M         410M         1.1B      1.1B         1.2B         
                                                17                                                               17                           2176              966M        326M      279M         435M         1.3B      1.2B         1.4B         
                                                18                                                               18                           2304              1.1B        346M      295M         461M         1.5B      1.4B         1.6B         
                                                19                                                               19                           2432              1.3B        365M      311M         486M         1.7B      1.7B         1.8B         
                                                20                                                               20                           2560              1.6B        384M      328M         512M         2.0B      1.9B         2.1B         
                                                21                                                               21                           2688              1.8B        403M      344M         538M         2.2B      2.2B         2.4B         
                                                22                                                               22                           2816              2.1B        422M      360M         563M         2.5B      2.5B         2.7B         
                                                23                                                               23                           2944              2.4B        442M      377M         589M         2.8B      2.8B         3.0B         
                                                24                                                               24                           3072              2.7B        461M      393M         614M         3.2B      3.1B         3.3B         
                                                25                                                               25                           3200              3.1B        480M      410M         640M         3.6B      3.5B         3.7B         
                                                26                                                               26                           3328              3.5B        499M      426M         666M         4.0B      3.9B         4.1B         
                                                27                                                               27                           3456              3.9B        518M      442M         691M         4.4B      4.3B         4.6B         
                                                28                                                               28                           3584              4.3B        538M      459M         717M         4.9B      4.8B         5.0B         
                                                29                                                               29                           3712              4.8B        557M      475M         742M         5.4B      5.3B         5.5B         
                                                30                                                               30                           3840              5.3B        576M      492M         768M         5.9B      5.8B         6.1B         
                                                31                                                               31                           3968              5.9B        595M      508M         794M         6.5B      6.4B         6.7B         
                                                32                                                               32                           4096              6.4B        614M      524M         819M         7.1B      7.0B         7.3B         
  ---------------------------------------------------------------------------------------------- ----------------------------------- ---------------------- ------------ ---------- --------- -------------- ---------- --------- -------------- -- --

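  : The configuration of subword tokenized models (isotropic). Global columns describe the latent module, local parameters are the embeddings, and total parameters are their sum. Parameter differences across tokenizers arise from varying vocabulary sizes $V$: $V=148{,}000$ for Character, $V=128{,}000$ for BPE, and $V=200{,}000$ for SuperBPE.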
:::

```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
::: {#tab:scales_flops_share}
  ------------------------------------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------
                                                                             Latent Tokenization                  Latent Tokenization                  Latent Tokenization                  Latent Tokenization                  Latent Tokenization                  Latent Tokenization                  Subword Tokenization                 Subword Tokenization                 Subword Tokenization
                              **Scale**                                          **$T{=}1$**                          **$T{=}2$**                          **$T{=}4$**                          **$T{=}6$**                          **$T{=}8$**                          **$T{=}12$**                        **$T{=}1.01$**                       **$T{=}4.57$**                       **$T{=}6.16$**
                                  5                                   `\cellcolor{metabg!66}`{=latex}66%   `\cellcolor{metabg!43}`{=latex}43%   `\cellcolor{metabg!24}`{=latex}24%   `\cellcolor{metabg!17}`{=latex}17%   `\cellcolor{metabg!13}`{=latex}13%    `\cellcolor{metabg!9}`{=latex}9%    `\cellcolor{metabg!34}`{=latex}34%   `\cellcolor{metabg!27}`{=latex}27%   `\cellcolor{metabg!18}`{=latex}18%
                                  6                                   `\cellcolor{metabg!75}`{=latex}75%   `\cellcolor{metabg!55}`{=latex}55%   `\cellcolor{metabg!35}`{=latex}35%   `\cellcolor{metabg!25}`{=latex}25%   `\cellcolor{metabg!20}`{=latex}20%   `\cellcolor{metabg!14}`{=latex}14%   `\cellcolor{metabg!41}`{=latex}41%   `\cellcolor{metabg!34}`{=latex}34%   `\cellcolor{metabg!24}`{=latex}24%
                                  7                                   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!65}`{=latex}65%   `\cellcolor{metabg!45}`{=latex}45%   `\cellcolor{metabg!35}`{=latex}35%   `\cellcolor{metabg!28}`{=latex}28%   `\cellcolor{metabg!20}`{=latex}20%   `\cellcolor{metabg!47}`{=latex}47%   `\cellcolor{metabg!41}`{=latex}41%   `\cellcolor{metabg!30}`{=latex}30%
                                  8                                   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!73}`{=latex}73%   `\cellcolor{metabg!55}`{=latex}55%   `\cellcolor{metabg!44}`{=latex}44%   `\cellcolor{metabg!36}`{=latex}36%   `\cellcolor{metabg!27}`{=latex}27%   `\cellcolor{metabg!52}`{=latex}52%   `\cellcolor{metabg!47}`{=latex}47%   `\cellcolor{metabg!35}`{=latex}35%
                                  9                                   `\cellcolor{metabg!79}`{=latex}79%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!44}`{=latex}44%   `\cellcolor{metabg!34}`{=latex}34%   `\cellcolor{metabg!27}`{=latex}27%   `\cellcolor{metabg!20}`{=latex}20%   `\cellcolor{metabg!57}`{=latex}57%   `\cellcolor{metabg!52}`{=latex}52%   `\cellcolor{metabg!41}`{=latex}41%
                                  10                                  `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!70}`{=latex}70%   `\cellcolor{metabg!52}`{=latex}52%   `\cellcolor{metabg!41}`{=latex}41%   `\cellcolor{metabg!34}`{=latex}34%   `\cellcolor{metabg!25}`{=latex}25%   `\cellcolor{metabg!61}`{=latex}61%   `\cellcolor{metabg!57}`{=latex}57%   `\cellcolor{metabg!45}`{=latex}45%
                                  11                                  `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!75}`{=latex}75%   `\cellcolor{metabg!58}`{=latex}58%   `\cellcolor{metabg!48}`{=latex}48%   `\cellcolor{metabg!41}`{=latex}41%   `\cellcolor{metabg!31}`{=latex}31%   `\cellcolor{metabg!65}`{=latex}65%   `\cellcolor{metabg!62}`{=latex}62%   `\cellcolor{metabg!50}`{=latex}50%
                                  12                                  `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!79}`{=latex}79%   `\cellcolor{metabg!64}`{=latex}64%   `\cellcolor{metabg!54}`{=latex}54%   `\cellcolor{metabg!47}`{=latex}47%   `\cellcolor{metabg!37}`{=latex}37%   `\cellcolor{metabg!68}`{=latex}68%   `\cellcolor{metabg!65}`{=latex}65%   `\cellcolor{metabg!54}`{=latex}54%
                                  13                                  `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!78}`{=latex}78%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!53}`{=latex}53%   `\cellcolor{metabg!46}`{=latex}46%   `\cellcolor{metabg!36}`{=latex}36%   `\cellcolor{metabg!71}`{=latex}71%   `\cellcolor{metabg!69}`{=latex}69%   `\cellcolor{metabg!58}`{=latex}58%
                                  14                                  `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!68}`{=latex}68%   `\cellcolor{metabg!58}`{=latex}58%   `\cellcolor{metabg!51}`{=latex}51%   `\cellcolor{metabg!41}`{=latex}41%   `\cellcolor{metabg!73}`{=latex}73%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!62}`{=latex}62%
                                  15                                  `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!56}`{=latex}56%   `\cellcolor{metabg!46}`{=latex}46%   `\cellcolor{metabg!76}`{=latex}76%   `\cellcolor{metabg!74}`{=latex}74%   `\cellcolor{metabg!65}`{=latex}65%
                                  16                                  `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!76}`{=latex}76%   `\cellcolor{metabg!68}`{=latex}68%   `\cellcolor{metabg!61}`{=latex}61%   `\cellcolor{metabg!51}`{=latex}51%   `\cellcolor{metabg!78}`{=latex}78%   `\cellcolor{metabg!77}`{=latex}77%   `\cellcolor{metabg!67}`{=latex}67%
                                  17                                  `\cellcolor{metabg!90}`{=latex}90%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!69}`{=latex}69%   `\cellcolor{metabg!59}`{=latex}59%   `\cellcolor{metabg!52}`{=latex}52%   `\cellcolor{metabg!42}`{=latex}42%   `\cellcolor{metabg!80}`{=latex}80%   `\cellcolor{metabg!79}`{=latex}79%   `\cellcolor{metabg!70}`{=latex}70%
                                  18                                  `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!57}`{=latex}57%   `\cellcolor{metabg!46}`{=latex}46%   `\cellcolor{metabg!81}`{=latex}81%   `\cellcolor{metabg!81}`{=latex}81%   `\cellcolor{metabg!72}`{=latex}72%
                                  19                                  `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!86}`{=latex}86%   `\cellcolor{metabg!75}`{=latex}75%   `\cellcolor{metabg!67}`{=latex}67%   `\cellcolor{metabg!60}`{=latex}60%   `\cellcolor{metabg!50}`{=latex}50%   `\cellcolor{metabg!83}`{=latex}83%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!74}`{=latex}74%
                                  20                                  `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!88}`{=latex}88%   `\cellcolor{metabg!78}`{=latex}78%   `\cellcolor{metabg!70}`{=latex}70%   `\cellcolor{metabg!64}`{=latex}64%   `\cellcolor{metabg!54}`{=latex}54%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!76}`{=latex}76%
                                  21                                  `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!77}`{=latex}77%   `\cellcolor{metabg!70}`{=latex}70%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!53}`{=latex}53%   `\cellcolor{metabg!85}`{=latex}85%   `\cellcolor{metabg!85}`{=latex}85%   `\cellcolor{metabg!78}`{=latex}78%
                                  22                                  `\cellcolor{metabg!94}`{=latex}94%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!80}`{=latex}80%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!66}`{=latex}66%   `\cellcolor{metabg!57}`{=latex}57%   `\cellcolor{metabg!86}`{=latex}86%   `\cellcolor{metabg!86}`{=latex}86%   `\cellcolor{metabg!79}`{=latex}79%
                                  23                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!90}`{=latex}90%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!75}`{=latex}75%   `\cellcolor{metabg!69}`{=latex}69%   `\cellcolor{metabg!60}`{=latex}60%   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!81}`{=latex}81%
                                  24                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!77}`{=latex}77%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!88}`{=latex}88%   `\cellcolor{metabg!88}`{=latex}88%   `\cellcolor{metabg!82}`{=latex}82%
                                  25                                  `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!88}`{=latex}88%   `\cellcolor{metabg!79}`{=latex}79%   `\cellcolor{metabg!71}`{=latex}71%   `\cellcolor{metabg!65}`{=latex}65%   `\cellcolor{metabg!56}`{=latex}56%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!83}`{=latex}83%
                                  26                                  `\cellcolor{metabg!94}`{=latex}94%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!81}`{=latex}81%   `\cellcolor{metabg!74}`{=latex}74%   `\cellcolor{metabg!68}`{=latex}68%   `\cellcolor{metabg!59}`{=latex}59%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!89}`{=latex}89%   `\cellcolor{metabg!84}`{=latex}84%
                                  27                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!90}`{=latex}90%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!76}`{=latex}76%   `\cellcolor{metabg!70}`{=latex}70%   `\cellcolor{metabg!61}`{=latex}61%   `\cellcolor{metabg!90}`{=latex}90%   `\cellcolor{metabg!90}`{=latex}90%   `\cellcolor{metabg!85}`{=latex}85%
                                  28                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!84}`{=latex}84%   `\cellcolor{metabg!78}`{=latex}78%   `\cellcolor{metabg!73}`{=latex}73%   `\cellcolor{metabg!64}`{=latex}64%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!86}`{=latex}86%
                                  29                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!83}`{=latex}83%   `\cellcolor{metabg!77}`{=latex}77%   `\cellcolor{metabg!72}`{=latex}72%   `\cellcolor{metabg!63}`{=latex}63%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!91}`{=latex}91%   `\cellcolor{metabg!87}`{=latex}87%
                                  30                                  `\cellcolor{metabg!95}`{=latex}95%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!85}`{=latex}85%   `\cellcolor{metabg!79}`{=latex}79%   `\cellcolor{metabg!74}`{=latex}74%   `\cellcolor{metabg!66}`{=latex}66%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!88}`{=latex}88%
                                  31                                  `\cellcolor{metabg!96}`{=latex}96%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!86}`{=latex}86%   `\cellcolor{metabg!81}`{=latex}81%   `\cellcolor{metabg!76}`{=latex}76%   `\cellcolor{metabg!68}`{=latex}68%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!88}`{=latex}88%
                                  32                                  `\cellcolor{metabg!96}`{=latex}96%   `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!87}`{=latex}87%   `\cellcolor{metabg!82}`{=latex}82%   `\cellcolor{metabg!78}`{=latex}78%   `\cellcolor{metabg!70}`{=latex}70%   `\cellcolor{metabg!92}`{=latex}92%   `\cellcolor{metabg!93}`{=latex}93%   `\cellcolor{metabg!89}`{=latex}89%
  ------------------------------------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------ ------------------------------------

  : Compute cost per byte of the global model as a percentage of the compute cost per byte of the whole model. The first column (Scale) denotes the number of layers and heads of the global module. In latent tokenization, the `\compr{}`{=latex} $T \in \{1,2,4,6,8,12\}$ is set as a hyperparameter, whereas in subword tokenization it is determined by the tokenizer (Character $T=1.01$, BPE $T=4.57$, SuperBPE $T=6.16$).
:::

Scaling Laws: Technical Details {#sec:power-law-details}
===============================

We characterize compute-optimal scaling through a two-stage fitting procedure.

Scaling Law I
-------------

We fit the scaling laws to find optimal data and parameters as described in Equations `\ref{eq:optimal_data}`{=latex} and `\ref{eq:optimal_param}`{=latex}. As noted in the methodology, we restrict this fit to the parameters of the global latent model (excluding encoder/decoder and embeddings) to ensure consistency across tokenization methods.

We perform the fit using the L-BFGS-B [@zhu1997bfgsb] algorithm with a gradient tolerance of $10^{-10}$. To ensure robust convergence, we employ a grid search for initialization:

1.  We first compute an Ordinary Least Squares (OLS) solution ($\alpha_\text{OLS}, \beta_\text{OLS}, B_\text{OLS}, N_\text{OLS}$) on the log-transformed data to serve as a prior.

2.  We define a search grid by perturbing the OLS solution. We test 13 values for each parameter, resulting in $13^4$ total initialization points (the ranges for $\alpha$ and $\beta$ are tighter than those for $B_0$ and $N_0$).

The grid is constructed as follows:

-   $\log(B_\text{init}) \in \{\log(B_\text{OLS}) + \epsilon : \epsilon \in [-3, 3]\}$

-   $\log(N_\text{init}) \in \{\log(N_\text{OLS}) + \epsilon : \epsilon \in [-3, 3]\}$

-   $\alpha_\text{init} \in \{\alpha_\text{OLS} + \epsilon : \epsilon \in [-0.3, 0.3]\}$

-   $\beta_\text{init} \in \{\beta_\text{OLS} + \epsilon : \epsilon \in [-0.3, 0.3]\}$

We select the solution that minimizes the sum-of-squares objective. L-BFGS-B converged to parameter values similar to the OLS solution regardless of the starting point.
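
The initialization step can be sketched in a few lines of Python. The sketch is illustrative rather than the code used for the experiments: it assumes that the two power laws take the form $N^\star = N_0 C^\alpha$ and $B^\star = B_0 C^\beta$, that the 13 offsets per parameter are evenly spaced, and that logarithms are natural. The function names and the synthetic data are hypothetical. Each grid point would then seed one L-BFGS-B run, which is omitted here.

```python
import itertools
import math


def ols_power_law(compute, values):
    """OLS fit of log(values) = log(scale) + exponent * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    log_scale = y_mean - exponent * x_mean
    return log_scale, exponent


def offsets(half_width, num=13):
    """`num` evenly spaced offsets in [-half_width, half_width] (spacing is assumed)."""
    return [-half_width + 2 * half_width * i / (num - 1) for i in range(num)]


def init_grid(log_b_ols, log_n_ols, alpha_ols, beta_ols):
    """All 13^4 starting points (log B, log N, alpha, beta) around the OLS solution."""
    return [
        (log_b_ols + d_b, log_n_ols + d_n, alpha_ols + d_alpha, beta_ols + d_beta)
        for d_b, d_n, d_alpha, d_beta in itertools.product(
            offsets(3.0), offsets(3.0), offsets(0.3), offsets(0.3)
        )
    ]


# Synthetic, noise-free data for illustration only (not results from the paper).
budgets = [1e18, 1e19, 1e20, 1e21]
params_opt = [2.0 * c ** 0.5 for c in budgets]  # hypothetical N* = N_0 * C^alpha
bytes_opt = [5.0 * c ** 0.5 for c in budgets]   # hypothetical B* = B_0 * C^beta

log_n_ols, alpha_ols = ols_power_law(budgets, params_opt)
log_b_ols, beta_ols = ols_power_law(budgets, bytes_opt)
grid = init_grid(log_b_ols, log_n_ols, alpha_ols, beta_ols)
```

With evenly spaced offsets, the unperturbed OLS solution is itself one of the $13^4 = 28{,}561$ starting points.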

Scaling Law II
--------------

In the second stage, we fit the power law for optimal loss $L^\star(C,T) \simeq L_0 C^\gamma + f(T)$. Unlike Stage I, this fit uses the total compute budget, including the cost of the encoder, decoder, and embeddings.

We fit the parameters $L_0$, $\gamma$, and the compression-specific offsets $f(T)$ simultaneously. We again use L-BFGS-B with a grid search for initialization. The grid spans 13 values each for $L_0$ and $\gamma$ ($13^2 = 169$ combinations):

-   $\log(L_\text{init}) \in [-3, 3]$

-   $\gamma_\text{init} \in [-0.6, 0.0]$

The initial value for $f(T)$ is set to the mean loss observed at compression rate $T$. During optimization, we bound the parameters to physically plausible ranges: $\log(L_0) \in [-30, 30]$, $\gamma \in [-2, 0]$, and $f(T) \in [-5, 5]$.
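
A minimal sketch of the Stage II objective and its initialization is given below. It is not the authors' implementation: the function names, the even spacing of the grid, the use of natural logarithms, and the synthetic runs are all assumptions made for illustration. The bounded optimizer call itself is left out.

```python
import itertools
import math


def predicted_loss(log_l0, gamma, offset, compute):
    """L*(C, T) ~= L_0 * C^gamma + f(T), with L_0 parameterized by its logarithm."""
    return math.exp(log_l0 + gamma * math.log(compute)) + offset


def sum_of_squares(log_l0, gamma, offsets, runs):
    """Objective over runs given as (compute, compression_rate, observed_loss)."""
    return sum(
        (predicted_loss(log_l0, gamma, offsets[rate], compute) - loss) ** 2
        for compute, rate, loss in runs
    )


def initial_offsets(runs):
    """Initialize f(T) to the mean observed loss at each compression rate T."""
    by_rate = {}
    for _, rate, loss in runs:
        by_rate.setdefault(rate, []).append(loss)
    return {rate: sum(v) / len(v) for rate, v in by_rate.items()}


def linspace(lo, hi, num=13):
    """`num` evenly spaced values in [lo, hi] (even spacing is an assumption)."""
    return [lo + (hi - lo) * i / (num - 1) for i in range(num)]


# 13 x 13 = 169 starting points for (log L_0, gamma).
init_grid = list(itertools.product(linspace(-3.0, 3.0), linspace(-0.6, 0.0)))

# Box constraints quoted in the text, to be handed to a bounded optimizer.
bounds = {"log_l0": (-30.0, 30.0), "gamma": (-2.0, 0.0), "offset": (-5.0, 5.0)}

# Synthetic runs generated from the model itself (illustration only).
true_offsets = {2: 0.72, 4: 0.70}
runs = [
    (compute, rate, predicted_loss(1.0, -0.05, true_offsets[rate], compute))
    for compute in (1e19, 1e20, 1e21)
    for rate in true_offsets
]
f_init = initial_offsets(runs)
```

Initializing each offset to the mean loss at its compression rate starts $f(T)$ above its fitted value, because the mean still contains the power-law term.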

Derivation and Validation of Scaling Law II {#sec:scaling_law_2_derivation}
-------------------------------------------

```{=latex}
\rowcolors{2}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:scaling_law_validation}
  **Residual model**                                                               **$E$**   **$F$**   **$T_0$**   **$\delta$**      RMSE        $R^2$     $\bar{R}^2$
  ------------------------------------------------------------------------------- --------- --------- ----------- -------------- ------------ ----------- -------------
  Mean of residuals (Eq. `\ref{eq:optimal_loss_irreducible}`{=latex})              0.7075      ---        ---          ---          0.0260       0.903        0.896
  Constant $T^\star$ (Eq. `\ref{eq:optimal_const_compression}`{=latex})            0.7075    0.0341      3.73          ---          0.0115       0.996        0.995
  Compute-dependent $T^\star$ (Eq. `\ref{eq:optimal_compression_rate}`{=latex})    0.7075    0.0341      14.9         0.0302      **0.0086**   **0.997**    **0.996**

  : Comparison of the three candidate forms for modeling the residuals $f(C,T)$ in Equation `\ref{eq:optimal_loss}`{=latex}. All functions were fitted using the $48$ runs with compute budgets less than or equal to $1 \times 10^{21}$ FLOPs. To test extrapolation accuracy, the root mean square error (RMSE) was computed for models trained at $2 \times 10^{21}$ FLOPs across 8 different compression rates. Both the extrapolation error and the goodness-of-fit coefficients (standard $R^2$ and $\bar{R}^2$, adjusted for the number of variables) indicate that the model with compute-dependent `\compr{}`{=latex} offers the best fit and extrapolation accuracy in loss estimation.
:::

As described in Section `\ref{sec:scaling_law_2}`{=latex}, we begin the search for the scaling law equation by assuming the classical form from @kaplan2020scaling, disregarding the role of compression. It is given by Equation `\ref{eq:optimal_loss}`{=latex}: $$L^\star(C,T) \simeq L_0 \times C^\gamma + f(C,T)$$

We first fit the power-law term $L_0 \times C^\gamma$ and then look for the function that best approximates $f(C,T)$ (the residuals of that fit) with minimal complexity. We consider the following candidates for $f(C,T)$:

#### Mean of the residuals

is equivalent to the "irreducible loss" term or intercept used in many scaling fits. It is the simplest form of $f(C,T)$, but it completely disregards the effect of compression on loss. We consider the following form of irreducible loss: $$\label{eq:optimal_loss_irreducible}
    f(C,T) = E$$

#### Constant optimal compression ($T^\star$)

assumes that the loss is always minimal at one compression rate, regardless of the compute budget $C$. Inspecting the $f(C,T)$ residuals in Figure `\ref{fig:cr_vs_loss_blt}`{=latex}, we observe that they follow a non-monotonic convex function of $T$, with a minimum at some point $T_0$. We assume that a quadratic function fits this relation well. Because $T$ is on a logarithmic scale, we propose the following equation for the residuals (which also includes the irreducible loss): $$\label{eq:optimal_const_compression}
    f(C,T) = F \times \left(\log(T) - \log(T_0)\right)^2 +E = F \times \log^2\left(\frac{T}{T_0}\right) + E$$

#### Compute-dependent optimal compression ($T^\star$)

is based on the hypothesis that the optimal compression depends on the compute budget. We observe that the minimum of the quadratic function modeling $f(C,T)$ described in the previous paragraph shifts to a lower value as the training budget increases. To account for this, we include the effect of the compute budget in the log-quadratic function, arriving at the following formulation of Equation `\ref{eq:optimal_compression_rate}`{=latex}: $$f(C,T) = F \times \log^2\left(\frac{C^\delta T}{T_0} \right) + E$$
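
Setting the derivative of this expression with respect to $\log T$ to zero makes the compute dependence explicit. The short derivation below uses only the symbols of the equation above and assumes $F > 0$, so that the stationary point is a minimum:

$$\frac{\partial f}{\partial \log T} = 2F \log\left(\frac{C^\delta T}{T_0}\right) = 0 \quad\Longrightarrow\quad T^\star(C) = T_0 \, C^{-\delta}$$

For $\delta > 0$, $T^\star$ decreases as $C$ grows, and setting $\delta = 0$ recovers the constant optimum $T^\star = T_0$ of Equation `\ref{eq:optimal_const_compression}`{=latex}.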

To validate the extrapolation accuracy of the three candidate formulas, we fitted scaling laws to the results of models trained at 8 compute budgets from $5 \times 10^{18}$ to $1 \times 10^{21}$ FLOPs and 6 compression rates. For each compression rate and budget, we use the optimal model size (and training data) estimated by Scaling Law I. We then validate the obtained scaling laws by comparing the predicted and observed loss for models trained with a higher compute budget of $2 \times 10^{21}$ FLOPs. In Table `\ref{tab:scaling_law_validation}`{=latex}, we observe that the last formulation, which assumes that the optimal `\compr{}`{=latex} is compute-dependent, obtains the lowest root mean square error in extrapolation ($0.0086$, compared with $0.0115$ and $0.0260$ for the other candidates). Moreover, the fit using this formula obtains the highest goodness-of-fit coefficient, both standard ($R^2$) and adjusted for the number of fitted variables ($\bar{R}^2$). Therefore, we chose this formulation for the final version of the scaling law.
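
For reference, the three reported metrics can be computed as in the sketch below. It uses the textbook definitions, and the exact convention for counting the $p$ fitted variables in $\bar{R}^2$ is an assumption. The function names and the toy numbers are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted losses."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """R^2 adjusted for p fitted variables and n observations (textbook form)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Toy numbers for illustration only (not losses from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
fit_r2 = r_squared(observed, predicted)
fit_adjusted = adjusted_r_squared(fit_r2, n=len(observed), p=1)
fit_rmse = rmse(observed, predicted)
```

The adjustment matters when comparing the three candidates, because each successive form adds fitted variables and would otherwise be favored by $R^2$ alone.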

Loss Sensitivity to Compression Rate {#sec:loss_sensitivity}
------------------------------------

```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_marginal_sensitivity.pdf}
        \caption{Latent Tokenization }
    \end{subfigure}
    \begin{subfigure}{0.49\textwidth}
        \includegraphics[width=\linewidth]{figures/isotropic_marginal_sensitivity.pdf}
        \caption{Subword Tokenization}
    \end{subfigure}
    \caption{BPB deterioration across compression rates relative to the value at the optimal \compr{}.
    The $\Delta L ^\star$ curve is predicted from the Scaling Law II fit.
    }
    \label{fig:loss_sensitivity}
\end{figure*}
```
Figure `\ref{fig:loss_sensitivity}`{=latex} shows the marginal sensitivity of loss to the choice of compression rate. We observe that a `\compr{}`{=latex} close to the optimum has minimal impact on loss, yet moving further from the optimum can cause up to $0.2$ and $0.1$ deterioration in test BPB for subword and latent tokenized models, respectively.
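
The sketch below shows how such a sensitivity curve could be evaluated from the compute-dependent residual model $f(C,T) = F \times \log^2\left(C^\delta T / T_0\right) + E$. It is an illustration under stated assumptions: it plugs in the coefficients from the compute-dependent row of Table `\ref{tab:scaling_law_validation}`{=latex}, treats $C$ as a raw FLOP count, and uses natural logarithms. Whether these match the settings behind the figure is not confirmed by the text. The function names are hypothetical.

```python
import math


def optimal_compression(compute, t0, delta):
    """Minimizer over T of F * log^2(C^delta * T / T_0) + E: T* = T_0 * C^(-delta)."""
    return t0 * compute ** (-delta)


def loss_penalty(compute, rate, f_coef, t0, delta):
    """Delta L* = f(C, T) - f(C, T*) = F * log^2(T / T*); the offset E cancels."""
    t_star = optimal_compression(compute, t0, delta)
    return f_coef * math.log(rate / t_star) ** 2


# Coefficients from the compute-dependent row of the validation table.
F_COEF, T0, DELTA = 0.0341, 14.9, 0.0302

budget = 1e20  # assumed to be expressed in FLOPs
t_star = optimal_compression(budget, T0, DELTA)
penalties = {
    rate: loss_penalty(budget, rate, F_COEF, T0, DELTA)
    for rate in (1, 2, 4, 6, 8, 12)
}
```

Because the penalty is quadratic in $\log(T / T^\star)$, it is flat near the optimum and grows symmetrically on a logarithmic axis when moving away from it in either direction.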

Confidence Intervals
--------------------

We compute $95\%$ confidence intervals for the fitted parameters $\hat{\boldsymbol{\theta}} \in \mathbb{R}^p$ estimated from $n$ data points, where $p$ is the number of parameters. Let $\mathcal{L}(\boldsymbol{\theta})$ denote the sum-of-squares loss evaluated at $\boldsymbol{\theta}$, and let $\mathbf{e}_k$ be the $k$-th standard basis vector in $\mathbb{R}^p$.

The Hessian $H \in \mathbb{R}^{p \times p}$ of $\mathcal{L}$ is estimated via central finite differences with step size $\epsilon = 10^{-5}$: $$H_{ij}
  =
  \frac{
    \mathcal{L}_{ij}^{++} - \mathcal{L}_{ij}^{+-} - \mathcal{L}_{ij}^{-+} + \mathcal{L}_{ij}^{--}
  }{4\,\epsilon^{2}}$$ where $$\mathcal{L}_{ij}^{s_1 s_2}
  =
  \mathcal{L}\!\bigl(\hat{\boldsymbol{\theta}}
    + s_1\,\epsilon\,\mathbf{e}_i
    + s_2\,\epsilon\,\mathbf{e}_j\bigr),
  \qquad s_1, s_2 \in \{+, -\}$$ The residual variance is estimated as $$\hat{\sigma}^{2}
  =
  \frac{\displaystyle\sum_{i=1}^{n} r_i^{2}}{n - p}$$ where $r_i = y_i - \hat{y}_i$ is the $i$-th residual (observed minus predicted value). The parameter covariance matrix is $$\hat{\Sigma} = \hat{\sigma}^{2}\,H^{-1}$$ The $95\%$ confidence interval for each parameter $\hat{\boldsymbol{\theta}}_{k}$ is: $$\hat{\boldsymbol{\theta}}_{k} \pm t \cdot \sqrt{\hat{\Sigma}_{kk}}$$ where $\sqrt{\hat{\Sigma}_{kk}}$ is the standard error for estimation of $\hat{\boldsymbol{\theta}}_{k}$ and $t$ is the two-sided $95\%$ critical value of Student's $t$-distribution with $n - p$ degrees of freedom.
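
The procedure above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code used for the reported intervals: all function names are hypothetical, the toy model is a straight line with synthetic data, and the critical value $t$ is passed in by the caller (for example from a statistical table) rather than computed.

```python
import math


def finite_difference_hessian(loss, theta, eps=1e-5):
    """Central finite-difference Hessian of `loss` at `theta` (list of floats)."""
    p = len(theta)

    def shifted(i, s1, j, s2):
        point = list(theta)
        point[i] += s1 * eps
        point[j] += s2 * eps
        return loss(point)

    hessian = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            value = (
                shifted(i, +1, j, +1) - shifted(i, +1, j, -1)
                - shifted(i, -1, j, +1) + shifted(i, -1, j, -1)
            ) / (4.0 * eps ** 2)
            hessian[i][j] = hessian[j][i] = value
    return hessian


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("Hessian is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def confidence_intervals(residuals, theta_hat, t_crit, eps=1e-5):
    """Return (lower, upper) per parameter, following the formulas in the text."""
    def loss(theta):
        return sum(r * r for r in residuals(theta))

    n, p = len(residuals(theta_hat)), len(theta_hat)
    sigma2 = loss(theta_hat) / (n - p)             # residual variance
    h_inv = invert(finite_difference_hessian(loss, theta_hat, eps))
    intervals = []
    for k in range(p):
        std_err = math.sqrt(sigma2 * h_inv[k][k])  # square root of Sigma_kk
        intervals.append((theta_hat[k] - t_crit * std_err,
                          theta_hat[k] + t_crit * std_err))
    return intervals


# Synthetic toy problem: y = a + b * x with a small deterministic perturbation.
XS = [float(x) for x in range(10)]
YS = [2.0 + 0.5 * x + (0.05 if x % 2 == 0 else -0.05) for x in XS]


def line_residuals(theta):
    a, b = theta
    return [y - (a + b * x) for x, y in zip(XS, YS)]


# Closed-form least-squares estimate for the toy model.
X_MEAN, Y_MEAN = sum(XS) / len(XS), sum(YS) / len(YS)
SLOPE = (sum((x - X_MEAN) * (y - Y_MEAN) for x, y in zip(XS, YS))
         / sum((x - X_MEAN) ** 2 for x in XS))
THETA_HAT = [Y_MEAN - SLOPE * X_MEAN, SLOPE]

# 2.306 is the two-sided 95% Student's t critical value for n - p = 8.
INTERVALS = confidence_intervals(line_residuals, THETA_HAT, t_crit=2.306)
```

The sketch applies $\hat{\Sigma} = \hat{\sigma}^2 H^{-1}$ exactly as written above. The scaling of $\mathcal{L}$ matters when reusing it: for a plain sum of squares, $H \approx 2J^\top J$ (with $J$ the Jacobian of the predictions), so the common Gauss-Newton convention $\hat{\sigma}^2 (J^\top J)^{-1}$ corresponds to $2\hat{\sigma}^2 H^{-1}$.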

```{=latex}
\clearpage
```
Impact of Tokenization Method {#sec:tokenization_method_comparison}
=============================

```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_1e20_results_compression_contour_global_parameters.pdf}
        \caption{Latent Tokenization (Entropy)}
    \end{subfigure}
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_static_1e20_results_compression_contour_global_parameters.pdf}
        \caption{Latent Tokenization (Fixed Size)}
    \end{subfigure}
    \begin{subfigure}{0.32\textwidth}
        \includegraphics[width=\linewidth]{figures/isotropic_1e20_results_compression_contour_global_parameters.pdf}
        \caption{Subword Tokenization }
    \end{subfigure}
    \caption{Comparison of three-dimensional IsoFLOPs 
    ($C=10^{20}$) for three methods of tokenization: Latent Entropy, Latent Fixed Size (each latent token has fixed size of $T$ bytes), and Subword.
    The loss profile is visibly similar across the methods, with optimal loss achieved along constant \bpp{}.}
    \label{fig:stage_1_comparison_1e20}
\end{figure*}
```
```{=latex}
\rowcolors{2}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:tokenization_method_comparison}
  **Tokenization Method**    **$\rho^\star$**   **$T^\star$**   **BPB**
  ------------------------- ------------------ --------------- ---------
  Latent (Entropy)                 62.1             3.71         0.960
  Latent (Fixed)                   60.0             3.87         0.973
  Subword                          58.8             5.36         0.947

  : Compute-optimal `\bpp{}`{=latex} ($\rho^\star$) and `\compr{}`{=latex} ($T^\star$) for different methods. The values are close to each other, except for subword $T^\star$.
:::

We compare our results for different methods of tokenization: latent (with entropy supervision) and subword, as described in the main text. Moreover, we compare the results with another method of latent tokenization, in which all latent tokens have the same fixed size in bytes, equal to the `\compr{}`{=latex}. In Figure `\ref{fig:stage_1_comparison_1e20}`{=latex}, we see similar loss profiles across the methods. For all methods and compression rates, the optimal configurations fall at a `\bpp{}`{=latex} ratio of $\rho \approx 60$. In Table `\ref{tab:tokenization_method_comparison}`{=latex}, we further observe that the optimal `\compr{}`{=latex} is similar for the two latent tokenization methods, while it is higher for subword tokenization. This is due to an imperfect IsoFLOP paraboloid fit caused by the poor performance of character-level models ($T=1.01$) under the considered budget, which skews the estimated optimal $T$ upward. Based on the more reliable Scaling Law II estimate (see Section `\ref{sec:subword_scaling}`{=latex}), we expect a lower optimal `\compr{}`{=latex} for this budget: $T^\star=4.11$. For comparison, the optimal `\compr{}`{=latex} for latent models based on the scaling law is $T^\star=3.67$.

```{=latex}
\clearpage
```
Impact of Mixing Languages {#sec:beyond_english_mix}
==========================

```{=latex}
\rowcolors{3}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:xlingual_mix_optimals_1e20}
  **Language**    **Parity**   **$\rho^\star_l$**   **$\rho^\star_l$ ratio**   **$T^\star_l$**   **$T^\star_l$ ratio**   **BPB**
  -------------- ------------ -------------------- -------------------------- ----------------- ----------------------- ---------
  English            1.0              71.8                    1.00                  3.38                  1.00             1.101
  French             1.2              72.5                    1.01                  3.65                  1.08             0.931
  Vietnamese         1.4              70.3                    0.98                  4.12                  1.22             0.720
  Arabic             1.6              76.5                    1.07                  3.84                  1.14             0.667
  Russian            2.0              77.6                    1.08                  5.03                  1.49             0.532
  Hindi              2.6              68.9                    0.96                  6.32                  1.87             0.387

  : Compute-optimal `\bpp{}`{=latex} ($\rho^\star_l$) and `\compr{}`{=latex} ($T^\star_l$) compared to cross-lingual `\pari{}`{=latex}. Results are for multilingual models trained jointly on all six languages with a budget of $C=10^{20}$ FLOPs. The `\pari{}`{=latex} and the compute-optimal ratios are expressed relative to the English baseline.
:::

![`\bpp{}`{=latex}](figures/mix_multilingual_byte_per_parameter.png){#fig:mix_xlingual_bpp_1e20 width="\\linewidth"}

```{=latex}
\hfill
```
![`\compr{}`{=latex}](figures/mix_multilingual_compression_rate.png){#fig:mix_xlingual_crs_1e20 width="\\linewidth"}

```{=latex}
\hfill
```
![BPB](figures/mix_multilingual_bpb.png){#fig:mix_xlingual_bpbs_1e20 width="\\linewidth"}

To examine the impact of mixing languages during training, we train a set of models jointly on multilingual data in six languages (including English), described in Section `\ref{sec:beyond_english}`{=latex}. To enforce an equitable training signal across languages, we sample languages with weights equal to their `\pari{}`{=latex}. For instance, we train on $2.6\times$ more bytes in Hindi than in English, but we expect the two samples to be matched in information value. All training runs are constrained to a fixed budget of $C = 10^{20}$ FLOPs; thus, multilingual models see less in-language data per language than their monolingual counterparts.
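
As a small worked illustration of parity-weighted sampling, the snippet below normalizes the `\pari{}`{=latex} values into sampling probabilities and splits a byte budget accordingly. The parity values are the rounded ones from Table `\ref{tab:xlingual_mix_optimals_1e20}`{=latex}; the function names and the total byte count are hypothetical and chosen only for the example.

```python
# Rounded parity values relative to English, as listed in the table.
PARITY = {
    "English": 1.0,
    "French": 1.2,
    "Vietnamese": 1.4,
    "Arabic": 1.6,
    "Russian": 2.0,
    "Hindi": 2.6,
}


def sampling_weights(parity):
    """Normalize parity values into per-language sampling probabilities."""
    total = sum(parity.values())
    return {lang: value / total for lang, value in parity.items()}


def byte_allocation(parity, total_bytes):
    """Split a byte budget across languages in proportion to parity."""
    return {lang: w * total_bytes
            for lang, w in sampling_weights(parity).items()}


WEIGHTS = sampling_weights(PARITY)
# Hypothetical total of 1e11 training bytes, for illustration only.
ALLOCATION = byte_allocation(PARITY, total_bytes=1e11)
```

With these weights, the ratio of Hindi to English bytes equals the ratio of their parities, which matches the $2.6\times$ example in the text.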

The optimal values of `\bpp{}`{=latex} and `\compr{}`{=latex} for each language are computed based on fits to the in-language test set; the results are gathered in Table `\ref{tab:xlingual_mix_optimals_1e20}`{=latex}. Figure `\ref{fig:mix_xlingual_bpp_1e20}`{=latex} shows that the optimal `\bpp{}`{=latex} is similar across languages. This contrasts with the findings in Section `\ref{sec:beyond_english}`{=latex}, where the optimal `\bpp{}`{=latex} was language-dependent and correlated with `\pari{}`{=latex}. Notably, the multilingual optimal `\bpp{}`{=latex} ($\rho^\star \approx 70$) is close to the median of the language-specific optimal values, $\rho^\star_l$. As in the monolingual experiments, we observe that the optimal `\compr{}`{=latex} (Figure `\ref{fig:mix_xlingual_crs_1e20}`{=latex}) is correlated with `\pari{}`{=latex}. The multilingual optimal values are lower than the corresponding monolingual ones. Test BPB (Figure `\ref{fig:mix_xlingual_bpbs_1e20}`{=latex}) is inversely correlated with `\pari{}`{=latex}, in line with the results of Section `\ref{sec:beyond_english}`{=latex}. As expected, multilingual models perform worse than monolingual ones due to the smaller amount of in-language data.
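
The correlation statements above can be checked directly against the values in Table `\ref{tab:xlingual_mix_optimals_1e20}`{=latex}. The sketch below uses the Pearson coefficient as one possible measure; the choice of coefficient and the helper name `pearson` are assumptions made for this illustration, since the text does not specify how correlation was quantified.

```python
import math

# Per-language values copied from the table (English, French, Vietnamese,
# Arabic, Russian, Hindi).
PARITY = [1.0, 1.2, 1.4, 1.6, 2.0, 2.6]
RHO_STAR = [71.8, 72.5, 70.3, 76.5, 77.6, 68.9]
T_STAR = [3.38, 3.65, 4.12, 3.84, 5.03, 6.32]
BPB = [1.101, 0.931, 0.720, 0.667, 0.532, 0.387]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


R_RHO = pearson(PARITY, RHO_STAR)   # bytes per parameter vs. parity
R_T = pearson(PARITY, T_STAR)       # compression rate vs. parity
R_BPB = pearson(PARITY, BPB)        # test BPB vs. parity
```

On these six data points, the coefficient is roughly $0.97$ for $T^\star_l$, roughly $-0.95$ for BPB, and close to zero (about $-0.08$) for $\rho^\star_l$, consistent with the qualitative description in the text. With only six languages, these numbers should be read as indicative rather than as precise estimates.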

```{=latex}
\clearpage
```
Comparison with "Scaling Laws with Vocabulary" {#sec:comparison_tao_et_al}
==============================================

@tao2024scaling posed research questions similar to ours regarding the role of tokenization in scaling laws, yet reached significantly different conclusions, showing that vocabulary size (and thus `\compr{}`{=latex}) should increase with model scale. In contrast, we observe that the compute-optimal `\compr{}`{=latex} does not increase with model scale. We identify the following methodological differences that explain the discrepancies:

#### Approach to embedding-layer compute and vocabulary size

The main difference is how compression is connected to the size of the embedding layers. @tao2024scaling control `\compr{}`{=latex} by changing the vocabulary size, which affects the size of the embedding layer. This leads to a preference for smaller vocabularies at low compute and parameter budgets, so the FLOPs saved in embedding layers can be used for significantly longer training. In our experiments, vocabulary cost is (almost) the same regardless of compression, thanks to the use of BLT [@pagnoni-etal-2025-byte] or alternative subword methods such as SuperBPE [@liu2025superbpespacetravellanguage]. Therefore, our results extrapolate better to larger scales, where the cost of the embedding layer is negligible, as seen in Table `\ref{tab:scales_flops_share}`{=latex}.
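
To illustrate why the vocabulary-dependent cost matters mostly at small scale, the sketch below estimates the share of forward FLOPs per token spent in the output (vocabulary) projection of a standard transformer. It relies on the usual rough approximations of about $2Vd$ FLOPs for the projection and about $2 \cdot 12 L d^2$ FLOPs for the transformer blocks, ignoring attention-over-context terms. The function name and the three configurations are hypothetical and are not taken from either paper.

```python
def output_layer_flops_share(vocab_size, d_model, n_layers):
    """Approximate share of forward FLOPs per token spent in the output projection.

    Assumes ~2*V*d FLOPs for the vocabulary projection and ~2*12*L*d^2 FLOPs
    for the transformer blocks; context-length-dependent terms are ignored.
    """
    vocab_flops = 2 * vocab_size * d_model
    block_flops = 2 * 12 * n_layers * d_model ** 2
    return vocab_flops / (vocab_flops + block_flops)


# Hypothetical configurations, for illustration only.
SMALL_LARGE_VOCAB = output_layer_flops_share(128_000, d_model=768, n_layers=12)
LARGE_LARGE_VOCAB = output_layer_flops_share(128_000, d_model=4096, n_layers=32)
SMALL_BYTE_VOCAB = output_layer_flops_share(256, d_model=768, n_layers=12)
```

Under these approximations the share reduces to $V / (V + 12 L d)$, so it shrinks as depth and width grow, and it is small at any scale when the vocabulary itself is small.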

#### Considered compression range

BPE achieves a narrow `\compr{}`{=latex} range (by our estimates, $T\in[3,4.5]$ bytes per token). Considering only the compression rates attainable by BPE reveals only a portion of the loss profile, one that falls below the optimal compression value.

#### Evaluation

Both works use normalized negative log likelihood enabling a fair comparison across tokenizers. @tao2024scaling match validation context length in tokens, so the number of bytes in an evaluation example varies with vocabulary. We match the number of bytes across compression levels (e.g., if with `\compr{}`{=latex} $T=4$ we evaluate on 2048 tokens per example, then with `\compr{}`{=latex} $T=8$ we evaluate on 1024 tokens). Because early bytes are harder than later bytes, matching validation context in tokens can favor higher-compression tokenizers (more "late" bytes in an example). This could explain why we do not see the same preference for high compression (large vocabulary) at larger scales. For further reference, SuperBPE [@liu2025superbpespacetravellanguage] also matched evaluation context in bytes. Similarly to our results, they observed worse BPB scores for highly compressed SuperBPE compared to regular BPE.
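
The byte-matched evaluation described above amounts to simple arithmetic, sketched below together with the conversion from summed negative log-likelihood (in nats) to bits per byte. The function names are hypothetical; the 8192-byte context follows from the $T=4$, 2048-token example in the text.

```python
import math


def tokens_per_example(context_bytes, compression_rate):
    """Number of tokens needed to cover a fixed byte context at compression T."""
    return round(context_bytes / compression_rate)


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


# With T = 4 and 2048 tokens per example, each example spans 8192 bytes.
CONTEXT_BYTES = 2048 * 4
TOKENS_AT_T4 = tokens_per_example(CONTEXT_BYTES, 4)
TOKENS_AT_T8 = tokens_per_example(CONTEXT_BYTES, 8)
```

Normalizing by bytes rather than tokens makes the score independent of how the text is segmented, which is what allows tokenizers with different `\compr{}`{=latex} to be compared on the same examples.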

```{=latex}
\clearpage
```
Supplementary Results
=====================

IsoFLOP Analysis across Compute Budgets {#sec:isoflop-across-budgets}
---------------------------------------

We present the IsoFLOPs across multiple compute budgets and compression rates for latent tokenized models in Figure `\ref{fig:1d_isoflops_blt}`{=latex} and for subword tokenized models in Figure `\ref{fig:1d_isoflops_iso}`{=latex}. We observe that the optimal byte-per-parameter ratio $\rho^\star$ remains constant for most of the considered configurations. This trend is more visible for $C > 10^{20}$, where the compute of the global module becomes dominant; we therefore expect it to hold at larger scales as well.

![$C=10^{19}$ Latent Tokenization](figures/blt_entropy_1e19_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{19}$ Latent Tokenization](figures/blt_entropy_2e19_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{20}$ Latent Tokenization](figures/blt_entropy_1e20_results_global_parameters_fit_all.png){width="\\linewidth"}

```{=latex}
\ContinuedFloat
```
![$C=2\times10^{20}$ Latent Tokenization](figures/blt_entropy_2e20_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{21}$ Latent Tokenization](figures/blt_entropy_1e21_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{21}$ Latent Tokenization](figures/blt_entropy_2e21_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{19}$ Latent Tokenization](figures/blt_entropy_1e19_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{19}$ Latent Tokenization](figures/blt_entropy_2e19_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{20}$ Latent Tokenization](figures/blt_entropy_1e20_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

```{=latex}
\ContinuedFloat
```
![$C=2\times10^{20}$ Latent Tokenization](figures/blt_entropy_2e20_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{21}$ Latent Tokenization](figures/blt_entropy_1e21_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{21}$ Latent Tokenization](figures/blt_entropy_2e21_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=5\times10^{19}$ Subword Tokenization](figures/isotropic_2e19_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{20}$ Subword Tokenization](figures/isotropic_1e20_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{20}$ Subword Tokenization](figures/isotropic_2e20_results_global_parameters_fit_all.png){width="\\linewidth"}

![$C=5\times10^{19}$ Subword Tokenization](figures/isotropic_5e19_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=10^{20}$ Subword Tokenization](figures/isotropic_1e20_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

![$C=2\times10^{20}$ Subword Tokenization](figures/isotropic_2e20_results_compression_contour_global_parameters_fit_all.png){width="\\linewidth"}

```{=latex}
\clearpage
```
Optimal Data and Parameters across Compute Budgets
--------------------------------------------------

![Amount of data](figures/isotropic_optimal_configs_global_parameters_data.png){#fig:optimal_data_iso width="\\linewidth"}

```{=latex}
\hfill
```
![Model size](figures/isotropic_optimal_configs_global_parameters_params.png){#fig:optimal_params_iso width="\\linewidth"}

Figure `\ref{fig:optimal_data_params_iso}`{=latex} shows the optimal data in bytes $B^*$ and parameter counts $N^*$ across compressions and compute budgets for subword tokenized models.

Loss Obtained by Optimal Configurations
---------------------------------------

```{=latex}
\rowcolors{4}{metabg}{white}
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:blt_budget_bpb_comparison}
  **Compute (FLOPs)**    **$T=1$**   **$T=2$**   **$T=4$**    **$T=6$**   **$T=8$**   **$T=12$**
  --------------------- ----------- ----------- ------------ ----------- ----------- ------------
  $1 \times 10^{19}$      1.1790      1.1178     **1.1025**    1.1095      1.1272      1.1598
  $2 \times 10^{19}$      1.1200      1.0642     **1.0532**    1.0587      1.0727      1.1047
  $5 \times 10^{19}$      1.0606      1.0080     **0.9987**    1.0049      1.0165      1.0422
  $1 \times 10^{20}$      1.0158      0.9694     **0.9601**    0.9631      0.9751      0.9993
  $2 \times 10^{20}$      0.9771      0.9359     **0.9265**    0.9314      0.9427      0.9650
  $5 \times 10^{20}$      0.9333      0.8974     **0.8933**    0.8990      0.9085      0.9278
  $1 \times 10^{21}$      0.9008      0.8722     **0.8686**    0.8744      0.8843      0.9041
  $2 \times 10^{21}$      0.8741      0.8491     **0.8483**    0.8543      0.8650      0.8844

  : Lowest BPB obtained by latent tokenized (entropy) models for each compute budget. Columns correspond to the `\compr{}`{=latex} $T$; the best value per budget is in bold.
:::

In Tables `\ref{tab:blt_budget_bpb_comparison}`{=latex} and `\ref{tab:isotropic_budget_bpb_comparison}`{=latex}, we present the best scores obtained by trained models (i.e., measured rather than derived from the scaling law) for latent and subword tokenized models, respectively.
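
The measured values in Table `\ref{tab:blt_budget_bpb_comparison}`{=latex} can be inspected programmatically. The sketch below finds the best-performing `\compr{}`{=latex} among the six tested values for each budget and the BPB gap of every other setting relative to it. The numbers are copied from the table; the helper names are hypothetical.

```python
COMPRESSION = [1, 2, 4, 6, 8, 12]

# Lowest measured BPB per compute budget (rows) and compression rate (columns).
BPB_LATENT = {
    "1e19": [1.1790, 1.1178, 1.1025, 1.1095, 1.1272, 1.1598],
    "2e19": [1.1200, 1.0642, 1.0532, 1.0587, 1.0727, 1.1047],
    "5e19": [1.0606, 1.0080, 0.9987, 1.0049, 1.0165, 1.0422],
    "1e20": [1.0158, 0.9694, 0.9601, 0.9631, 0.9751, 0.9993],
    "2e20": [0.9771, 0.9359, 0.9265, 0.9314, 0.9427, 0.9650],
    "5e20": [0.9333, 0.8974, 0.8933, 0.8990, 0.9085, 0.9278],
    "1e21": [0.9008, 0.8722, 0.8686, 0.8744, 0.8843, 0.9041],
    "2e21": [0.8741, 0.8491, 0.8483, 0.8543, 0.8650, 0.8844],
}


def best_compression(row):
    """Return (compression rate, BPB) of the lowest-loss entry in a row."""
    idx = min(range(len(row)), key=row.__getitem__)
    return COMPRESSION[idx], row[idx]


def deterioration(row):
    """BPB gap of each compression rate relative to the best one in the row."""
    best = min(row)
    return [value - best for value in row]


BEST = {budget: best_compression(row) for budget, row in BPB_LATENT.items()}
GAPS = {budget: deterioration(row) for budget, row in BPB_LATENT.items()}
```

Among the tested values, $T=4$ gives the lowest BPB at every budget, while the gap between $T=2$ and $T=4$ shrinks from about $0.015$ at $10^{19}$ FLOPs to below $0.001$ at $2 \times 10^{21}$ FLOPs.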

Multilingual 2D IsoFLOP {#sec:xling_1d_isoflop}
-----------------------

```{=latex}
\begin{figure*}[!htb]
    
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_fra_Latn_1e20_results_global_parameters_fra_Latn_loss_curves.pdf}
        \caption{French (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_vie_Latn_1e20_results_global_parameters_vie_latn_loss_curves.pdf}
        \caption{Vietnamese (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_arb_Arab_1e20_results_global_parameters_arb_Arab_loss_curves.pdf}
        \caption{Arabic (Arabic)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_rus_Cyrl_1e20_results_global_parameters_rus_Cyrl_loss_curves.pdf}
        \caption{Russian (Cyrillic)}
    \end{subfigure}
        \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_bytes2_1e20_results_global_parameters_loss_curves.pdf}
        \caption{English x2 (Latin)}
    \end{subfigure}
    \begin{subfigure}{0.48\textwidth}
        \includegraphics[width=\linewidth]{figures/blt_entropy_hin_Deva_1e20_results_global_parameters_hin_Deva_loss_curves.pdf}
        \caption{Hindi (Devanagari)}
    \end{subfigure}
    \caption{2D IsoFLOP fits across languages ($C=10^{20}$); all models use latent tokenization to achieve the set compression. Parabolas are fitted for each compression line to interpolate values of the loss.}
    \label{fig:1d_xlingual_1e20}
\end{figure*}
```
In Figure `\ref{fig:1d_xlingual_1e20}`{=latex}, we present the 2-dimensional IsoFLOPs for the six considered languages. The visualization is based on the same data as the 3-dimensional IsoFLOPs in Figure `\ref{fig:2d_xlingual_1e20}`{=latex}.
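
A per-compression parabola fit of the kind mentioned in the figure caption can be sketched as follows. This is a minimal illustration under assumptions that are not confirmed by the text: the parabola is fitted by ordinary least squares in $\log_{10}$ of the x-axis quantity (here called `sizes`), the function names are hypothetical, and the data in the usage example are synthetic.

```python
import math


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*u^2 + b*u + c with u = x - mean(xs)."""
    center = sum(xs) / len(xs)
    us = [x - center for x in xs]            # centering improves conditioning
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum(y * u ** k for u, y in zip(us, ys)) for k in range(3)]
    normal = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    det = det3(normal)
    if det == 0.0:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    for col in range(3):                     # Cramer's rule on the normal equations
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / det)
    a, b, c = coeffs
    return a, b, c, center


def isoflop_minimum(sizes, losses):
    """Return (size, loss) at the vertex of a parabola fitted in log10(size)."""
    a, b, c, center = fit_parabola([math.log10(n) for n in sizes], losses)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    u_star = -b / (2.0 * a)
    return 10.0 ** (center + u_star), c - b * b / (4.0 * a)


# Synthetic IsoFLOP curve with a known minimum at log10(size) = 8.7.
LOG_SIZES = [8.0, 8.25, 8.5, 8.75, 9.0, 9.25, 9.5]
SIZES = [10.0 ** x for x in LOG_SIZES]
LOSSES = [0.2 * (x - 8.7) ** 2 + 0.95 for x in LOG_SIZES]
OPT_SIZE, OPT_LOSS = isoflop_minimum(SIZES, LOSSES)
```

Fitting in log space makes the curve close to symmetric around the optimum, which is why a parabola is a common choice for interpolating between the trained configurations.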

```{=latex}
\clearpage
```
Comparison between Character and Byte-level Models {#sec:char-vs-byte}
--------------------------------------------------

In our analysis of subword tokenized models, we focus on character-based rather than byte-based models to examine the properties of low compression. The main difference between these models is that the former have a much larger vocabulary (148,000 vs. 256), while achieving a similar compression rate. In our experiments, we use character models to keep the vocabulary size similar to that of BPE and SuperBPE.

We compare the loss of parameter-optimal character ($T=1.01$) and byte ($T=1.0$) models in Figure `\ref{fig:chat_vs_byte_isotropic}`{=latex}. Notably, the gap between them is large for small compute budgets due to the relatively high cost of the embedding layer in small models. As the training budget increases, the difference narrows. This suggests that character and byte tokenized models follow similar scaling trends at larger scales. Therefore, in most of the experiments we only consider character-based models.

![Comparison of optimal test losses for subword tokenized models: byte $T=1.00$; character $T=1.01$; BPE $T=4.57$; SuperBPE $T=6.16$.](figures/isotropic_optimal_compression_loss_vs_budget_global_parameters.png){#fig:chat_vs_byte_isotropic width="0.75\\linewidth"}

```{=latex}
\clearpage 
```
AI2 Reasoning Challenge Results {#sec:arc-results}
-------------------------------

![0-shot Accuracy on ARC-Easy](figures/endtask_arc_easy.png){width="\\linewidth"}

```{=latex}
\hfill
```
![0-shot Accuracy on ARC-Challenge](figures/endtask_arc_challenge.png){width="\\linewidth"}

Figure `\ref{fig:endtask_arc}`{=latex} presents evaluations on multiple-choice questions from the AI2 Reasoning Challenge [@clark2018arc]. Interestingly, we observe that for the easier version of the task, models with `\compr{}`{=latex} 8 and `\compr{}`{=latex} 4 achieve similar scores. The higher compression (`\compr{}`{=latex} 8) even obtains the best score for the 4.1B-parameter model, while being cheaper to run than the corresponding `\compr{}`{=latex} 4 model. On the harder "challenge" split, we observe a different pattern: `\compr{}`{=latex} 4 achieves higher scores than `\compr{}`{=latex} 8. We conclude that the choice of optimal compression can be task-dependent. More-compressed, and thus cheaper, tokenization may be adequate for easier tasks, while harder tasks may benefit from the additional inference compute associated with lower compression. We also note the underperformance of byte-level models, which we attribute to insufficient data seen during pre-training.

[^1]: Recent latent tokenized models allow achieving a wide range of `\compr{}`{=latex}, similarly to BLT. However, in other approaches `\compr{}`{=latex} cannot be precisely controlled due to reliance on a segment boundary predictor [@hwang2025dynamicchunkingendtoendhierarchical; @nawrot-etal-2023-efficient] or whitespace supervision [@neitemeier2025hierarchical; @slagle2024spacebyte; @videau2025bytesideaslanguagemodeling].

[^2]: Estimates of `\compr{}`{=latex} are computed for the DCLM corpus [@li2024datacomp], which consists of "plain English" texts.

[^3]: In this context, *isotropic* means that all modules of the model operate on sequences of the same granularity, unlike in hierarchical models.

[^4]: For low compression we choose character level tokenization instead of byte level to match the magnitude of vocabulary size across isotropic models. In Appendix `\ref{sec:char-vs-byte}`{=latex}, we show that character and byte models achieve similar performance at large scale.

[^5]: We discuss empirical derivation of this formula in Appendix `\ref{sec:scaling_law_2_derivation}`{=latex}.

[^6]: The results for jointly trained multilingual models are in Appendix `\ref{sec:beyond_english_mix}`{=latex}.
