Latent Tokenization
Summary
Latent tokenization is the pattern of replacing fixed tokens with learned chunks, concepts, or abstraction levels, or with specialized typed embeddings for values that fixed tokenizers handle poorly.
What The Wiki Currently Believes
- H-Net learns content- and context-dependent chunking inside a hierarchical network.
- ReinPatch ports the learned-boundary idea into time-series forecasting by training a detachable patch policy with downstream forecasting loss.
- Synergy learns a routing mechanism that bridges byte-level and higher-level abstraction.
- ConceptMoE merges semantically similar token sequences into concept representations before the expensive concept model.
- Compute Optimal Tokenization treats compression rate as a scaling variable and reports that the compute-optimal data/model relationship is more stable in bytes per parameter than in tokens per parameter.
- ELF keeps tokenization and final vocabulary decoding, but moves the generative trajectory into continuous contextual embedding space. This makes final readout timing part of the token-interface question.
- FoNE and BitTokens show the typed-value version of the same pressure: ordinary tokenizers fragment numbers and hide useful numeric structure.
- Pre-trained Large Language Models Use Fourier Features To Compute Addition suggests some useful numeric structure can emerge in pretrained token embeddings, but explicit numeric encodings may expose it more reliably.
- Graph Tokenization, GraphGPT, and GQT add the graph-structure version of this pressure: topology, node identity, edge attributes, local neighborhoods, and recurring subgraphs may need their own token interface before a standard Transformer can use them reliably.
- RAEv2 adds the visual-latent version: pretrained representation tokens are not just a compression artifact, and the layer aggregation rule can change reconstruction, generation, guidance, and action-conditioned video rollout behavior.
- Gemma 4 12B adds a production multimodal projection-frontend case: image patches and audio frames are projected into a shared decoder backbone instead of being processed by separate modality encoders.
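The typed-value pressure in the FoNE, BitTokens, and Fourier-features items above can be made concrete with a small sketch. It is an illustration in the spirit of Fourier-style number encodings, not the exact method of any of those papers: the function names, the choice of decimal periods 10, 100, 1000, ..., and the decoding routine are assumptions made for this example. It shows a single number mapped to one (cos, sin) pair per decimal period, so digit structure stays recoverable instead of being fragmented across subword tokens.

```python
import math


def fourier_number_features(value, num_digits=4):
    """Encode a non-negative integer as one (cos, sin) pair per decimal period.

    Hypothetical helper: periods 10, 100, ..., 10**num_digits are an assumption.
    """
    features = []
    for k in range(1, num_digits + 1):
        period = 10 ** k
        angle = 2.0 * math.pi * (value % period) / period
        features.append((math.cos(angle), math.sin(angle)))
    return features


def decode_digits(features):
    """Recover decimal digits (least significant first) from the features."""
    digits = []
    residue = 0  # value modulo the previous period, recovered so far
    for k, (cos_part, sin_part) in enumerate(features, start=1):
        period = 10 ** k
        # atan2 returns an angle in (-pi, pi]; shift it into [0, 2*pi).
        angle = math.atan2(sin_part, cos_part) % (2.0 * math.pi)
        value_mod_period = round(angle / (2.0 * math.pi) * period) % period
        digits.append((value_mod_period - residue) // (10 ** (k - 1)))
        residue = value_mod_period
    return digits


features = fourier_number_features(1234)
digits = decode_digits(features)  # least significant digit first
```

The design point is that the embedding is one fixed-size vector per number, and each digit is readable from a known slice of it. A learned model would consume these features through a projection; that step is omitted here.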
Hierarchy vs Sparse Communication
Latent tokenization should be separated from ordinary sparse attention. A sparse attention mask reduces communication edges while often preserving the original token count. Learned chunking, patching, concept compression, and U-Net-like hierarchy change the granularity of representation. A hierarchy becomes U-Net-like only when coarse processing is paired with a path back to fine resolution through decoding, upsampling, skip connections, or residual detail.
This distinction matters for time-series models. Local or sparse attention can make long histories cheaper, but it does not by itself create a representation stack such as samples -> motifs -> events -> regimes -> latent state.
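A minimal sketch of the distinction, assuming a toy one-dimensional series and hand-picked chunk boundaries (a learned patcher would predict them). All function names are hypothetical. The local mask removes attention edges but leaves six positions in place; pooling changes the sequence length itself, and the broadcast step stands in for the path back to fine resolution that a U-Net-like hierarchy needs.

```python
def local_attention_mask(num_tokens, window):
    """Boolean mask: position i may attend to position j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(num_tokens)]
            for i in range(num_tokens)]


def mean_pool_chunks(values, boundaries):
    """Collapse each chunk [start, end) into a single coarse value (its mean)."""
    return [sum(values[start:end]) / (end - start) for start, end in boundaries]


def broadcast_to_fine(coarse, boundaries):
    """Upsample by repeating each coarse value over its chunk.

    A skip connection would add fine-scale detail back on top of this.
    """
    fine = []
    for value, (start, end) in zip(coarse, boundaries):
        fine.extend([value] * (end - start))
    return fine


series = [1.0, 1.0, 5.0, 5.0, 5.0, 2.0]
boundaries = [(0, 2), (2, 5), (5, 6)]  # assumed, not learned

mask = local_attention_mask(len(series), window=1)
kept_edges = sum(sum(row) for row in mask)   # fewer edges, same 6 positions

coarse = mean_pool_chunks(series, boundaries)    # 3 positions instead of 6
restored = broadcast_to_fine(coarse, boundaries)  # back to 6 positions
```

Sparse attention only changes `kept_edges`; chunking changes `len(coarse)`. Stacking several pooling stages is what would produce a progression like samples -> motifs -> events.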
Evidence
These papers suggest segmentation is no longer just preprocessing. It can become a differentiable or reinforcement-trained compute-allocation and abstraction problem inside the model, or a typed-interface problem where numbers, bytes, concepts, and ordinary text tokens need different embedding contracts. Compute Optimal Tokenization adds the scaling-law version: once token granularity changes, “tokens per parameter” stops being portable, so compression should be tracked as part of the model budget.
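The portability point can be checked with simple arithmetic. The sketch below uses invented numbers (an 80 GB corpus, a 1B-parameter model, and two tokenizers at 4 and 8 bytes per token); none of these figures come from the paper, and the function name is hypothetical. It only shows why the same data budget looks different in tokens per parameter and identical in bytes per parameter.

```python
def data_to_model_ratios(corpus_bytes, bytes_per_token, num_params):
    """Express one training budget in both token-based and byte-based units."""
    num_tokens = corpus_bytes / bytes_per_token
    return {
        "tokens_per_param": num_tokens / num_params,
        "bytes_per_param": corpus_bytes / num_params,
    }


# Illustrative values only; they are assumptions for this example.
corpus_bytes = 80_000_000_000
num_params = 1_000_000_000

fine_tokenizer = data_to_model_ratios(corpus_bytes, bytes_per_token=4, num_params=num_params)
coarse_tokenizer = data_to_model_ratios(corpus_bytes, bytes_per_token=8, num_params=num_params)
```

Here the two tokenizers disagree by a factor of two in tokens per parameter while agreeing exactly in bytes per parameter, which is why a ratio quoted in tokens does not transfer across compression rates.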
RAEv2 widens the tokenization question from “which token granularity?” to “which latent layer interface?” For vision generation, the paper reports that summing multiple encoder layers can preserve more local detail while keeping semantic usefulness. Discussion on X leaves this unsettled: residual-stream summation acts like a fixed depth weighting, so learned or sparse layer selection may be the real token-interface problem.
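The "fixed depth weighting" reading can be sketched as follows. This is not RAEv2's implementation; the function names and toy feature vectors are assumptions. The sketch treats a plain sum over layers as the special case where every layer gets weight 1, and contrasts it with softmax-normalized weights (which could be learned) and a one-hot weight vector (sparse selection of a single layer).

```python
import math


def aggregate_layers(layer_features, weights=None):
    """Weighted sum of per-layer feature vectors; weights=None means a plain sum."""
    if weights is None:
        weights = [1.0] * len(layer_features)
    dim = len(layer_features[0])
    return [sum(w * layer[d] for w, layer in zip(weights, layer_features))
            for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax, used here to turn logits into layer weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features for one token from three encoder layers (invented values).
layers = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

summed = aggregate_layers(layers)                              # fixed, uniform weighting
weighted = aggregate_layers(layers, softmax([0.0, 0.0, 0.0]))  # logits would be learned
selected = aggregate_layers(layers, [0.0, 0.0, 1.0])           # sparse: last layer only
```

With equal logits the softmax version is just the sum rescaled by 1/3, so any difference between the two schemes has to come from training the logits or from making them depend on the input.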
Hierarchical Modeling with a Fixed FLOPs Budget records the active research hypothesis that hierarchical compression should be trained as a global compute-allocation problem: a learned router should choose compression under a fixed FLOPs budget instead of inheriting manually specified per-layer compression ratios.
Relation To Foundation TSFM Agenda
Latent tokenization maps to the adaptive patching, point-wise numeric embedding, and dynamic-compute slots in the Foundation Time-Series Model Research Agenda. ReinPatch is the direct time-series link, while H-Net, ConceptMoE, and Compute Optimal Tokenization are adjacent architecture and scaling analogs. The main agenda warning is preservation: learned chunks must not erase spikes, change points, rare regimes, missingness, or intervention effects that matter for latent-state modeling.
Open Questions
- Should learned tokenization be byte-native, token-compressive, or concept-level?
- Can a time-series patcher learn boundaries that transfer across domains without erasing native multivariate structure?
- How should learned chunking interact with attention, KV cache, and MoE routing?
- What information-density unit should replace bytes when compression is applied to continuous time-series channels or irregular event streams?
- Should numeric values be handled through learned chunks, typed number tokens, point-wise scalar embeddings, or separate numeric heads?
- When are lightweight projection frontends enough for continuous modalities, and when does a modality need learned patching, typed embeddings, or a stronger encoder to preserve scale, timing, and local events?
- Can a fixed-FLOPs router learn cross-modal compression without turning text, image/video, and time-series data into separate hand-tuned recipes?