ConceptMoE: Adaptive Token-To-Concept Compression For Implicit Compute Allocation
Source
- Raw Markdown: paper_conceptmoe-2026.md
- PDF: paper_conceptmoe-2026.pdf
Core Claim
ConceptMoE merges semantically similar token sequences into concept representations before the expensive MoE computation, and the paper reports that this improves both efficiency and benchmark quality.
Key Contributions
- Introduces learnable token-to-concept chunking based on semantic similarity.
- Uses the MoE architecture as a controlled testbed: compared models are matched on total parameters and activated FLOPs.
- Reports improvements on language pretraining, long-context understanding, multimodal benchmarks, and continual conversion.
- Reduces attention computation and KV cache requirements at higher compression ratios.
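The first contribution can be illustrated with a minimal sketch. It is not the paper's learnable chunk module: the fixed cosine threshold, the mean pooling, and the function names `chunk_boundaries` and `pool_concepts` are assumptions chosen only to show how adjacent-token similarity can define chunk boundaries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_boundaries(tokens, threshold=0.5):
    """Return the start index of each chunk.

    Assumption: a new chunk starts wherever a token's similarity to its
    predecessor drops below a fixed threshold. A learned module would
    replace this hand-set rule.
    """
    if not tokens:
        return []
    starts = [0]
    for i in range(1, len(tokens)):
        if cosine(tokens[i - 1], tokens[i]) < threshold:
            starts.append(i)
    return starts


def pool_concepts(tokens, starts):
    """Mean-pool each chunk into one concept vector (pooling choice is an assumption)."""
    ends = starts[1:] + [len(tokens)]
    concepts = []
    for s, e in zip(starts, ends):
        span = tokens[s:e]
        dim = len(span[0])
        concepts.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return concepts


# Toy embeddings: two tokens pointing one way, then three pointing another.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
starts = chunk_boundaries(tokens)
concepts = pool_concepts(tokens, starts)
print(starts, len(tokens), "tokens ->", len(concepts), "concepts")
```

In this toy input, five tokens collapse into two concepts, so the downstream layers would process two positions instead of five.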
Method Notes
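ConceptMoE connects Latent Tokenization with Mixture Of Experts: compression is not only a preprocessing step but also an implicit compute-allocation mechanism, because the computation saved by merging similar spans is reallocated under matched activated FLOPs.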
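The compute-allocation reading can be made concrete with back-of-envelope arithmetic. The sketch below is an assumption-laden illustration, not the paper's accounting: it assumes every position-wise layer runs on the compressed sequence, uses the standard fact that attention-map cost grows with the square of sequence length, and the helper name `reallocation_summary` is hypothetical.

```python
import math


def reallocation_summary(seq_len, ratio):
    """Estimate what compressing seq_len tokens by `ratio` frees up.

    Assumptions: all compute-heavy layers see only the concept sequence,
    and the position-wise compute budget is held fixed.
    """
    if seq_len <= 0 or ratio < 1:
        raise ValueError("seq_len must be positive and ratio must be >= 1")
    concept_len = math.ceil(seq_len / ratio)
    return {
        "concept_len": concept_len,
        # Attention-map cost is quadratic in the number of positions.
        "attention_map_fraction": (concept_len / seq_len) ** 2,
        # At a fixed position-wise budget, each concept can receive this
        # many times the compute a single token would have received.
        "per_position_compute_multiplier": seq_len / concept_len,
    }


summary = reallocation_summary(seq_len=8192, ratio=2)
print(summary)
```

Under these assumptions, a compression ratio of 2 leaves a quarter of the attention-map cost and lets each concept position use twice the position-wise compute at the same total budget.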
Evidence And Results
The abstract reports gains of +0.9 points on language pretraining, +2.3 on long-context understanding, +0.6 on multimodal benchmarks, and +5.5 during continual-training conversion, all under controlled settings.
Limitations
ConceptMoE does not remove tokenization: it compresses already-tokenized streams into concepts. It should therefore be compared with byte-native methods such as H-Net and Synergy.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute allocation | adjacent | Dynamically merges semantically similar token spans before expensive MoE layers, reallocating saved computation under matched activated FLOPs. | Evidence is language and multimodal pretraining, not time-series spans, channels, regimes, or candidate futures. |
| Dynamic tokenization | adjacent | Learns token-to-concept chunk boundaries and tests compression ratio, router design, and dechunking. | Needs numeric-stream boundaries that preserve spikes, missingness, change points, and dense reconstruction. |
| Streaming state and long context | adjacent | Reduces token count and KV-cache pressure at higher compression ratios. | Does not maintain an always-on latent state or prove online update behavior. |
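To size the KV-cache effect noted in the last row, a rough estimate is sketched below. The model dimensions are hypothetical and not taken from the paper, and the sketch assumes every cached layer operates on the concept sequence; any layers that still run at token level would keep a full-length cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence.

    The factor 2 counts one key and one value vector per position,
    per layer, per KV head. bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
compressed = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768 // 2)
print(baseline / 1024**3, "GiB ->", compressed / 1024**3, "GiB")
```

Because cache size is linear in the number of cached positions, a compression ratio of 2 halves it under these assumptions, whereas the attention-map cost falls quadratically.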
Links Into The Wiki
Open Questions
- How stable are learned concept boundaries across domains?
- Can concept compression be combined with byte-level or pixel-level inputs?