Tokenizer Transfer
Summary
Tokenizer transfer is the problem of moving model capability across tokenization regimes without retraining from scratch.
What The Wiki Currently Believes
- Bolmo treats byteification as a special case of tokenizer transfer: a subword LM serves as a teacher for a byte-level latent-tokenizer architecture.
- Compute Optimal Tokenization adds a scaling constraint for tokenizer transfer: moving across tokenizers should preserve an appropriate byte-per-parameter training ratio, not a token-per-parameter ratio.
- FoNE and BitTokens raise the numeric version of the transfer problem: retrofitting a pretrained model with typed number tokens or numeric heads may require more than swapping the tokenizer.
- ReinPatch is a weaker but useful time-series analogue: a learned patching policy is pretrained on broad univariate time series and transferred zero-shot to downstream forecasting datasets while the backbone is trained separately.
- Pretrained Transformers as Universal Computation Engines is adjacent rather than a tokenizer-transfer method: it tests whether the frozen computation inside a language-pretrained Transformer transfers after only input/output adaptation.
Evidence
Bolmo’s main empirical claim is that byte-level LMs can become competitive quickly by exact distillation from existing subword LMs, rather than paying the full cost of from-scratch byte-level pretraining. Compute Optimal Tokenization gives that migration a budget check: if the target tokenizer has a different compression rate, the same number of tokens covers a different number of bytes, so the source model’s token count is no longer a stable scaling unit.
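As a worked illustration of that budget check, the sketch below holds the byte-per-parameter ratio fixed and asks how many target-tokenizer tokens cover the same bytes. The function names and all numbers (1B parameters, 20B source tokens, 4.0 and 1.0 bytes per token) are hypothetical and are not taken from Compute Optimal Tokenization.

```python
def bytes_per_param(tokens: float, bytes_per_token: float, params: float) -> float:
    """Training bytes seen per model parameter."""
    return tokens * bytes_per_token / params


def tokens_for_same_bytes(source_tokens: float,
                          source_bytes_per_token: float,
                          target_bytes_per_token: float) -> float:
    """Target-tokenizer token count that covers the same number of bytes."""
    total_bytes = source_tokens * source_bytes_per_token
    return total_bytes / target_bytes_per_token


# Hypothetical setup: a 1B-parameter model, a subword tokenizer that averages
# 4.0 bytes per token, and a byte-level target at 1.0 byte per token.
params = 1e9
source_tokens = 20e9
source_ratio = bytes_per_param(source_tokens, 4.0, params)

target_tokens = tokens_for_same_bytes(source_tokens, 4.0, 1.0)
target_ratio = bytes_per_param(target_tokens, 1.0, params)

print(f"source: {source_tokens:.0f} tokens, {source_ratio:.0f} bytes/param")
print(f"target: {target_tokens:.0f} tokens, {target_ratio:.0f} bytes/param")
```

Under these assumed numbers, keeping the token count at 20B after the switch would cut the byte-per-parameter ratio by a factor of four, which is the mismatch the budget check is meant to catch.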
The number-tokenization sources do not yet solve transfer, but they make the migration problem concrete. A pretrained model may already contain spectral number-token structure, but Convergent Evolution shows that this spectrum alone is not enough. Tokenizer migration should therefore preserve or relearn task-usable geometry, parser behavior, and numeric heads.
ReinPatch does not transfer a full forecasting model or a text tokenizer. Its transfer claim is narrower: the segmentation policy itself can be reused as a frozen front-end for continuous time-series observations. That makes it a useful case study for separating learned tokenization from downstream model weights.
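The separation described above can be sketched as an interface: a patching policy maps a series to boundary indices, is held fixed, and any downstream model consumes the resulting patches. The threshold rule below is a hypothetical stand-in chosen only to keep the example runnable; it is not ReinPatch's learned policy, and the names `threshold_patcher` and `apply_patcher` are assumptions.

```python
from typing import Callable, List, Sequence

# A patcher maps a series to the start indices of its patches.
Patcher = Callable[[Sequence[float]], List[int]]


def threshold_patcher(threshold: float, max_len: int) -> Patcher:
    """Build a fixed boundary policy (hypothetical stand-in for a learned one).

    A new patch starts when the step between neighbours exceeds `threshold`
    or when the current patch has reached `max_len` points.
    """
    def policy(series: Sequence[float]) -> List[int]:
        starts = [0]
        for i in range(1, len(series)):
            jump = abs(series[i] - series[i - 1]) > threshold
            too_long = i - starts[-1] >= max_len
            if jump or too_long:
                starts.append(i)
        return starts
    return policy


def apply_patcher(series: Sequence[float], patcher: Patcher) -> List[List[float]]:
    """Split a series into variable-length patches using a frozen patcher."""
    if len(series) == 0:
        return []
    starts = patcher(series)
    ends = starts[1:] + [len(series)]
    return [list(series[s:e]) for s, e in zip(starts, ends)]


# The same frozen policy is reused on a new series without any refitting;
# a downstream forecaster would be trained on `patches`, not on the policy.
frozen = threshold_patcher(threshold=1.0, max_len=3)
series = [1.0, 1.1, 1.2, 5.0, 5.1, 5.0, 5.2, 5.1]
patches = apply_patcher(series, frozen)
print(patches)
```

Because the policy only exposes boundary indices, swapping in a learned patcher would not change how the backbone is trained, which is the separation this case study highlights.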
The FPT source adds a more general caution: sometimes the useful transferable object is not the tokenizer at all, but the pretrained sequence computation behind it. For time series, this keeps two questions separate: how numeric observations should be tokenized, and whether pretrained Transformer blocks should be reused.
Relation To Foundation TSFM Agenda
Tokenizer transfer is adjacent to the Foundation Time-Series Model Research Agenda through adaptive tokenization and numeric interface migration. ReinPatch is the direct time-series case for transferring a learned patching policy; FoNE and BitTokens are warnings that numeric-token migration can require parser, embedding, and head changes rather than a tokenizer swap.
Open Questions
- Which capabilities are lost when moving from subword to byte-level latent patches?
- When changing tokenizers, should distillation preserve source logits, byte-level likelihood, byte-per-parameter training ratios, or serving FLOPs per byte?
- Can tokenizer transfer be combined with learned dynamic chunking rather than fixed byte patches?
- Can a learned time-series patcher be transferred across datasets without baking in univariate or benchmark-specific assumptions?
- Can existing LLMs be migrated to typed numeric tokens without losing text capabilities or memorized numeric facts?
- Can graph-token vocabularies transfer across service graphs, telemetry schemas, and organizations, or do they overfit to one topology and monitoring stack?
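For the distillation question above, one tokenizer-independent way to compare a source and a target model is to normalize likelihood by bytes rather than by tokens. The sketch below converts per-token negative log-likelihoods (in nats) into bits per byte; the function name and the toy inputs are hypothetical, and this is a generic evaluation convention rather than a method attributed to any source cited on this page.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Total negative log-likelihood of `text`, in bits per UTF-8 byte.

    `token_nll_nats` holds one NLL value per token under whatever tokenizer
    the model uses; dividing by the byte count removes the dependence on
    how many tokens that tokenizer produced.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    return sum(token_nll_nats) / (math.log(2) * n_bytes)


# Toy comparison on the same 4-byte string: a 2-token segmentation and a
# 4-token (byte-level) segmentation with the same total NLL.
text = "abcd"
subword_nll = [2 * math.log(2), 2 * math.log(2)]  # 2 tokens, 2 bits each
byte_nll = [math.log(2)] * 4                      # 4 tokens, 1 bit each

print(bits_per_byte(subword_nll, text))
print(bits_per_byte(byte_nll, text))
```

The two toy models differ in per-token loss but agree in bits per byte, which is why a per-byte metric is the fairer comparison when the tokenizer changes.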