Latent Tokenization

Summary

Latent tokenization is the broader pattern of replacing fixed tokens with learned chunks, concepts, abstraction levels, or specialized typed embeddings for values that fixed tokenizers handle poorly.

What The Wiki Currently Believes

H-Net learns content- and context-dependent chunking inside a hierarchical network.
ReinPatch ports the learned-boundary idea into time-series forecasting by training a detachable patch policy with downstream forecasting loss.
Synergy learns a routing mechanism that bridges byte-level and higher-level abstraction.
ConceptMoE merges semantically similar token sequences into concept representations before the expensive concept model.
Compute Optimal Tokenization treats compression rate as a scaling variable and reports that the compute-optimal data/model relationship is more stable in bytes per parameter than in tokens per parameter.
Latent Context Language Models compress long prompt spans into soft latent tokens before decoder prefill, keeping the decoder interface but changing the retained context granularity.
Compress & Attend Transformers keep current chunks at token resolution while replacing older chunks with compressed chunk representations; the chunk size becomes a test-time quality/efficiency knob.
ELF keeps tokenization and final vocabulary decoding, but moves the generative trajectory into continuous contextual embedding space. This makes final readout timing part of the token-interface question.
FoNE and BitTokens show the typed-value version of the same pressure: ordinary tokenizers fragment numbers and hide useful numeric structure.
Pre-trained Large Language Models Use Fourier Features To Compute Addition suggests some useful numeric structure can emerge in pretrained token embeddings, but explicit numeric encodings may expose it more reliably.
Graph Tokenization, GraphGPT, and GQT add the graph-structure version of this pressure: topology, node identity, edge attributes, local neighborhoods, and recurring subgraphs may need their own token interface before a standard Transformer can use them reliably.
RAEv2 adds the visual-latent version: pretrained representation tokens are not just a compression artifact, and the layer aggregation rule can change reconstruction, generation, guidance, and action-conditioned video rollout behavior.
Gemma 4 12B adds a production multimodal projection-frontend case: image patches and audio frames are projected into a shared decoder backbone instead of being processed by separate modality encoders.
MiniMax Sparse Attention is the useful negative/contrast case: it changes which context tokens are read by sparse attention, but it does not itself learn a new token granularity or compressed latent token.
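
The typed-value point made for FoNE and BitTokens above can be illustrated with a small sketch. It encodes an integer as one (cos, sin) pair per decimal period, in the general spirit of Fourier-style number encodings, and reads the digits back from the phases. The function names, the choice of periods 10, 100, 1000, ..., and the decoding rule are illustrative assumptions; this is not claimed to match the exact scheme of FoNE, BitTokens, or any other paper listed here.

```python
import math

def fourier_number_features(x: int, num_digits: int = 4) -> list[float]:
    """One (cos, sin) pair per decimal period 10, 100, ..., 10**num_digits."""
    features = []
    for k in range(1, num_digits + 1):
        period = 10 ** k
        angle = 2 * math.pi * (x % period) / period
        features += [math.cos(angle), math.sin(angle)]
    return features

def decode_digits(features: list[float]) -> list[int]:
    """Read digits back, least significant first, from the phase of each pair."""
    digits = []
    for k in range(1, len(features) // 2 + 1):
        cos_v, sin_v = features[2 * k - 2], features[2 * k - 1]
        period = 10 ** k
        phase = math.atan2(sin_v, cos_v) % (2 * math.pi)
        residue = round(phase / (2 * math.pi) * period) % period  # x mod 10**k
        digits.append(residue // 10 ** (k - 1))
    return digits

features = fourier_number_features(1234)
print(len(features))            # 8 values for 4 digit positions
print(decode_digits(features))  # [4, 3, 2, 1]
```

Unlike a subword split of "1234", every digit position here has a fixed slot in the embedding, so the numeric structure is exposed by construction rather than left for the model to discover.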

Hierarchy vs Sparse Communication

Latent tokenization should be separated from ordinary sparse attention. A sparse attention mask reduces communication edges while often preserving the original token count. Learned chunking, patching, concept compression, and U-Net-like hierarchy instead change the granularity of representation. A hierarchy becomes U-Net-like only when coarse processing is paired with a path back to fine resolution through decoding, upsampling, skip connections, or residual detail.

This distinction matters for time-series models. Local or sparse attention can make long histories cheaper, but it does not by itself create a representation stack such as samples -> motifs -> events -> regimes -> latent state.
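
A minimal sketch of the distinction, under simplifying assumptions: a causal sliding-window mask stands in for sparse attention, and mean pooling over given boundaries stands in for a learned chunker. Both helper names and the hand-picked boundaries are hypothetical. The mask keeps all eight positions and only removes edges; the pooling step changes the sequence length itself.

```python
def local_attention_mask(n: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask: position i may read positions i-window+1 .. i."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def mean_pool_chunks(values: list[float], boundaries: list[int]) -> list[float]:
    """Replace each chunk [start, end) with its mean; boundaries are chunk end indices."""
    pooled, start = [], 0
    for end in boundaries:
        segment = values[start:end]
        pooled.append(sum(segment) / len(segment))
        start = end
    return pooled

samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# Sparse communication: still 8 positions, but only 21 of 64 possible edges.
mask = local_attention_mask(len(samples), window=3)
edges = sum(sum(row) for row in mask)
print(len(mask), edges)  # 8 21

# Changed granularity: 8 samples become 3 coarse units.
coarse = mean_pool_chunks(samples, boundaries=[2, 5, 8])
print(coarse)  # [0.5, 3.0, 6.0]
```

In a learned-boundary model the `boundaries` argument would come from a trained policy or router rather than being fixed by hand.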

MiniMax Sparse Attention is the current concrete example of that boundary. MSA keeps exact attention on selected key-value blocks and leaves selected tokens at token resolution; the abstraction question moves to the selector, not to a learned tokenizer.

LCLM and CAT add a context-compression variant rather than a tokenizer-removal variant. They still use ordinary text tokens inside local spans, but they change what the model retains from older or longer context. For time-series systems, that distinction should stay explicit: compressing history into soft tokens is not the same as learning native sample, channel, event, or regime tokens.
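
The chunk-size knob described for context compression can be sketched as simple position accounting. The assumption here, which is an illustration and not a description of the LCLM or CAT implementations, is that every completed older chunk is replaced by a fixed number of latent positions while the most recent chunk stays at token resolution. `retained_slots` and `latents_per_chunk` are hypothetical names.

```python
def retained_slots(context_len: int, chunk_size: int, latents_per_chunk: int = 1) -> int:
    """Positions still visible to the decoder after older chunks are compressed."""
    if context_len < 1 or chunk_size < 1:
        raise ValueError("context_len and chunk_size must be positive")
    older_chunks = (context_len - 1) // chunk_size            # completed, compressed
    current_tokens = context_len - older_chunks * chunk_size  # kept at token resolution
    return older_chunks * latents_per_chunk + current_tokens

for chunk_size in (4, 16, 64):
    print(chunk_size, retained_slots(1000, chunk_size))
# 4 253
# 16 70
# 64 55
```

The count only measures how much context is retained, not what survives inside each latent, which is why the preservation question has to be tracked separately.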

Evidence

These papers suggest segmentation is no longer just preprocessing. It can become a differentiable or reinforcement-trained compute-allocation and abstraction problem inside the model, or a typed-interface problem where numbers, bytes, concepts, and ordinary text tokens need different embedding contracts. Compute Optimal Tokenization adds the scaling-law version: once token granularity changes, “tokens per parameter” stops being portable, so compression should be tracked as part of the model budget.
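
A small worked example of why "tokens per parameter" stops being portable. The compression rates below are made-up illustrative values, not figures from Compute Optimal Tokenization; the only relation used is that bytes per parameter equals tokens per parameter times bytes per token.

```python
def bytes_per_param(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a token-denominated data budget into a byte-denominated one."""
    return tokens_per_param * bytes_per_token

def tokens_per_param(bytes_per_param_target: float, bytes_per_token: float) -> float:
    """Token budget needed to hit a fixed bytes-per-parameter target."""
    return bytes_per_param_target / bytes_per_token

# Hypothetical compression rates (average bytes represented by one token).
tokenizers = {"byte_level": 1.0, "subword": 4.0, "learned_chunks": 8.0}

# The same "20 tokens per parameter" rule means very different amounts of data.
same_token_rule = {name: bytes_per_param(20, rate) for name, rate in tokenizers.items()}
print(same_token_rule)  # {'byte_level': 20.0, 'subword': 80.0, 'learned_chunks': 160.0}

# Holding 80 bytes per parameter fixed gives a different token rule per tokenizer.
same_byte_rule = {name: tokens_per_param(80, rate) for name, rate in tokenizers.items()}
print(same_byte_rule)   # {'byte_level': 80.0, 'subword': 20.0, 'learned_chunks': 10.0}
```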

RAEv2 widens the tokenization question from “which token granularity?” to “which latent layer interface?” For vision generation, the paper reports that summing multiple encoder layers can preserve more local detail while keeping semantic usefulness. The discussion on X leaves this unsettled: residual-stream summation acts like a fixed depth weighting, so learned or sparse layer selection may be the real token-interface problem.

LCLM and CAT sharpen the preservation question. Both show that learned compression can improve long-context language-model serving tradeoffs, but both also need safeguards: LCLM uses staged training, auxiliary reconstruction, and optional expansion; CAT reports that large chunks can miss retrieval-critical detail. For time-series work, compression evidence should name what is preserved: reconstruction, retrieval, downstream prediction, anomaly sensitivity, or control value.

Hierarchical Modeling with a Fixed FLOPs Budget records the active research hypothesis that compression should be trained as a global compute-allocation problem: a learned router should choose compression under a fixed FLOPs budget instead of inheriting manually specified per-layer compression ratios.
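
To make the budget framing concrete, the sketch below enumerates which compression-ratio pairs fit a fixed cost budget in a three-stage hierarchy. Everything in it is an assumption for illustration: the linear per-position cost proxy, the stage costs, the budget, and the exhaustive search, which merely stands in for a learned router. It is not the method of the hypothesis page.

```python
from itertools import product

def hierarchy_cost(seq_len, ratios, cost_per_position):
    """Linear-cost proxy: each stage first shortens the sequence by its ratio."""
    total, length = 0.0, float(seq_len)
    for ratio, cost in zip(ratios, cost_per_position):
        length /= ratio
        total += length * cost
    return total

def feasible_allocations(seq_len, cost_per_position, budget, choices=(1, 2, 4, 8)):
    """All (r1, r2) ratio pairs for the two coarser stages that fit the budget."""
    fits = []
    for r1, r2 in product(choices, repeat=2):
        # Stage 0 runs at full resolution (ratio 1); r1 and r2 are free.
        if hierarchy_cost(seq_len, (1, r1, r2), cost_per_position) <= budget:
            fits.append((r1, r2))
    return fits

allocations = feasible_allocations(1024, cost_per_position=(1, 4, 16), budget=4096)
print(allocations)
```

Under these toy numbers, (2, 8), (4, 2), and (8, 1) all fit the same budget while placing the compression at different depths, which is the degree of freedom a global router would have to learn rather than inherit.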

Relation To Foundation TSFM Agenda

Latent tokenization maps to the adaptive patching, point-wise numeric embedding, and dynamic-compute slots in the Foundation Time-Series Model Research Agenda. ReinPatch is the direct time-series link, while H-Net, ConceptMoE, and Compute Optimal Tokenization are adjacent architecture and scaling analogs. The main agenda warning is preservation: learned chunks must not erase spikes, change points, rare regimes, missingness, or intervention effects that matter for latent-state modeling.
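
The preservation warning can be checked on a toy series. The sketch below pools a 16-sample series with a single spike using two fixed patch summaries and measures how much of the peak magnitude survives. Fixed-size patches, the two reducers, and the `spike_retention` metric are illustrative assumptions, not part of ReinPatch or the agenda.

```python
def pool(values, size, reduce_fn):
    """Summarize consecutive fixed-size patches with the given reducer."""
    return [reduce_fn(values[i:i + size]) for i in range(0, len(values), size)]

def mean(chunk):
    return sum(chunk) / len(chunk)

def max_abs(chunk):
    """Keep the sample with the largest magnitude in the patch."""
    return max(chunk, key=abs)

def spike_retention(original, pooled):
    """Fraction of the original peak magnitude that survives pooling."""
    return max(abs(v) for v in pooled) / max(abs(v) for v in original)

series = [0.0] * 7 + [8.0] + [0.0] * 8   # one spike at index 7

mean_patches = pool(series, 8, mean)
peak_patches = pool(series, 8, max_abs)

print(mean_patches, spike_retention(series, mean_patches))  # [1.0, 0.0] 0.125
print(peak_patches, spike_retention(series, peak_patches))  # [8.0, 0.0] 1.0
```

A comparable check could be written for change points, missingness, or intervention effects, each with its own notion of what counts as preserved.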

Open Questions

Should learned tokenization be byte-native, token-compressive, or concept-level?
Can a time-series patcher learn boundaries that transfer across domains without erasing native multivariate structure?
How should learned chunking interact with attention, KV cache, and MoE routing?
What information-density unit should replace bytes when compression is applied to continuous time-series channels or irregular event streams?
Should numeric values be handled through learned chunks, typed number tokens, point-wise scalar embeddings, or separate numeric heads?
When are lightweight projection frontends enough for continuous modalities, and when does a modality need learned patching, typed embeddings, or a stronger encoder to preserve scale, timing, and local events?
Can a fixed-FLOPs router learn cross-modal compression without turning text, image/video, and time-series data into separate hand-tuned recipes?
