ConceptMoE: Adaptive Token-To-Concept Compression For Implicit Compute Allocation

Source

Core Claim

ConceptMoE merges semantically similar token spans into concept representations before the expensive MoE computation, which the source claims improves both model quality and compute efficiency.

Key Contributions

  • Introduces learnable token-to-concept chunking based on semantic similarity.
  • Uses the MoE architecture for controlled comparison: saved computation is reallocated so that total parameters and activated FLOPs match the baseline.
  • Reports improvements on language pretraining, long-context understanding, multimodal benchmarks, and continual conversion.
  • Reduces attention computation and KV cache requirements at higher compression ratios.
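
The last bullet can be made concrete with a back-of-envelope calculation. The sketch below is not taken from the source. It assumes standard dense self-attention, where score computation grows with the square of sequence length and the KV cache grows linearly with it, and it applies only to layers that operate on the compressed concept sequence. The function name `attention_savings` and its parameters are hypothetical.

```python
def attention_savings(seq_len: int, compression_ratio: float) -> dict:
    """Relative attention and KV-cache cost after token-to-concept compression.

    Assumption (not from the source): dense self-attention, so the score
    matrix costs O(n^2) and the KV cache costs O(n) in sequence length n.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be >= 1")

    # Number of concept positions left after merging tokens.
    concept_len = seq_len / compression_ratio
    length_ratio = concept_len / seq_len

    return {
        "concept_len": concept_len,
        # Quadratic term: attention scores over the shorter sequence.
        "attention_cost_ratio": length_ratio ** 2,
        # Linear term: one key/value entry per remaining position.
        "kv_cache_ratio": length_ratio,
    }
```

Under these assumptions, a compression ratio of 2 on a 4096-token sequence leaves 2048 concept positions, a quarter of the attention-score cost, and half the KV cache in the compressed layers. Measured speedups depend on the rest of the model and are not estimated here.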

Method Notes

ConceptMoE connects Latent Tokenization with Mixture Of Experts: compression is not only a preprocessing step, but an implicit compute-allocation mechanism.
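
To illustrate the idea of similarity-based chunking, here is a minimal sketch in plain Python. It is not the source's learned chunk module. It assumes a fixed cosine-similarity threshold between adjacent token vectors, mean pooling inside each chunk, and a dechunking step that copies each concept back to its token positions. All function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_tokens(hidden, threshold=0.5):
    """Assign a chunk id to each token vector.

    Assumption: a new chunk starts whenever the similarity between
    adjacent tokens drops below a fixed threshold. A learned model
    would predict boundaries instead.
    """
    if not hidden:
        return []
    chunk_ids = [0]
    for prev, cur in zip(hidden, hidden[1:]):
        is_boundary = cosine(prev, cur) < threshold
        chunk_ids.append(chunk_ids[-1] + (1 if is_boundary else 0))
    return chunk_ids


def pool_concepts(hidden, chunk_ids):
    """Mean-pool the token vectors of each chunk into one concept vector."""
    if not hidden:
        return []
    n_chunks = chunk_ids[-1] + 1
    dim = len(hidden[0])
    sums = [[0.0] * dim for _ in range(n_chunks)]
    counts = [0] * n_chunks
    for vec, cid in zip(hidden, chunk_ids):
        counts[cid] += 1
        for j, x in enumerate(vec):
            sums[cid][j] += x
    return [[s / c for s in row] for row, c in zip(sums, counts)]


def dechunk(concepts, chunk_ids):
    """Copy each concept vector back to every token position in its chunk."""
    return [concepts[cid] for cid in chunk_ids]
```

With four token vectors where the first two and the last two point in similar directions, `chunk_tokens` yields two chunks, so the expensive layers would see two positions instead of four. The token-to-concept ratio, `len(hidden) / len(concepts)`, is the compression ratio that determines how much compute is freed for reallocation.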

Evidence And Results

The abstract reports gains of +0.9 points on language pretraining, +2.3 points on long-context understanding, +0.6 points on multimodal benchmarks, and +5.5 points for conversion during continual training, all under controlled settings.

Limitations

The source does not remove tokenization entirely; it compresses already-tokenized streams into concepts. It should be compared with byte-native methods such as H-Net and Synergy.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute allocation | adjacent | Dynamically merges semantically similar token spans before expensive MoE layers, reallocating saved computation under matched activated FLOPs. | Evidence is language and multimodal pretraining, not time-series spans, channels, regimes, or candidate futures. |
| Dynamic tokenization | adjacent | Learns token-to-concept chunk boundaries and tests compression ratio, router design, and dechunking. | Needs numeric-stream boundaries that preserve spikes, missingness, change points, and dense reconstruction. |
| Streaming state and long context | adjacent | Reduces token count and KV/cache pressure at higher compression ratios. | Does not maintain an always-on latent state or prove online update behavior. |

Open Questions

  • How stable are learned concept boundaries across domains?
  • Can concept compression be combined with byte-level or pixel-level inputs?