Beyond Language Modeling: An Exploration Of Multimodal Pretraining
Source
- Raw Markdown: paper_beyond-language-modeling-2026.md
- PDF: paper_beyond-language-modeling-2026.pdf
Core Claim
Native multimodal pretraining can move foundation models beyond text-only language modeling when visual representations, data mixtures, world-modeling data, and MoE scaling are controlled together.
Key Contributions
- Uses controlled from-scratch Transfusion pretraining over language, images, video, image-text pairs, and action-conditioned video.
- Finds representation autoencoders (RAE) useful for both visual understanding and generation.
- Reports complementarity between visual and language data.
- Treats MoE as a way to handle modality specialization and asymmetric scaling needs.
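
To make the first contribution concrete: a Transfusion-style model trains one sequence model with a next-token loss on text positions and a denoising loss on image positions. The sketch below is a minimal pure-Python illustration of that combined objective. The noise-prediction form of the image loss, the weight `lam`, and all function names are illustrative assumptions. This note does not record the paper's exact loss or weighting.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def combined_loss(text_logits, text_targets, noise_pred, noise_true, lam=1.0):
    """Return (total, lm, diff) with total = lm + lam * diff.

    lm   : mean next-token cross-entropy over text positions
    diff : mean denoising error over image patches (assumed noise-prediction form)
    lam  : hypothetical balancing weight between the two terms
    """
    lm = sum(cross_entropy(l, t) for l, t in zip(text_logits, text_targets)) / len(text_targets)
    diff = sum(mse(p, n) for p, n in zip(noise_pred, noise_true)) / len(noise_true)
    return lm + lam * diff, lm, diff
```

The two terms are returned separately so that a data-mixture ablation can track language and vision losses independently, which is the kind of comparison a controlled design-space study depends on.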
Method Notes
The paper is a design-space study. It is linked to Unified Multimodal Models, Mixture Of Experts, Vision-Language Models, and World Models.
Evidence And Results
The source emphasizes IsoFLOP-style comparisons and controlled changes to representation, modality mix, and architecture. Its main synthesis value is the claim that vision is more data-hungry than language and that MoE can reconcile this scaling mismatch.
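
For readers unfamiliar with IsoFLOP analysis, the usual recipe is to train several model sizes at one fixed compute budget, fit a parabola to final loss against log parameter count, and read the compute-optimal size off the vertex. The sketch below illustrates that step in pure Python. The function names are hypothetical, and the fitting details are an assumption about the general technique rather than the paper's actual procedure.

```python
import math

def fit_parabola(xs, ys):
    """Least-squares fit y = a*x^2 + b*x + c via the 3x3 normal equations (Cramer's rule)."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of y * x^k
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    coeffs = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]  # replace one column with the right-hand side
        coeffs.append(det3(m) / d)
    return tuple(coeffs)  # (a, b, c)

def isoflop_optimum(param_counts, losses):
    """Parameter count at the vertex of a parabola fitted to loss vs. log10(N)."""
    logs = [math.log10(n) for n in param_counts]
    x0 = sum(logs) / len(logs)  # center the inputs for numerical stability
    a, b, _ = fit_parabola([x - x0 for x in logs], losses)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return 10 ** (x0 - b / (2 * a))
```

Repeating this fit at several compute budgets and per modality gives optimal-size curves that can be compared across language and vision, which is one way to quantify the asymmetry the note describes. Any loss values fed to it for illustration should be labeled synthetic, since this note does not record the paper's numbers.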
Limitations
The conclusions are tied to the Transfusion setup and data mixture. They should be compared against pixel-space approaches such as Tuna-2 and spectral representation arguments in Prism.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Representation quality: semantic state vs dense detail | adjacent | Uses one high-dimensional RAE/SigLIP-style visual representation for understanding and generation, with explicit discussion of semantic latents versus VAE/raw-pixel choices. | Evidence is visual-language, not numeric time-series; no controllable preservation tradeoff for dense scalar detail. |
| Control and counterfactuals | adjacent | Navigation-world-model experiments condition future visual-state prediction on text-formatted navigation actions. | Actions are navigation prompts, not a general action/control-input interface for multivariate time series. |
| Dynamic compute allocation | adjacent | MoE routing is used to handle modality specialization and vision-language scaling asymmetry. | Routing is per modality/token, not adaptive compute over time-series spans, channels, regimes, or futures. |
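
The per-token routing referred to in the last row can be illustrated with a generic top-k MoE layer: a router scores every expert for each token, the k best experts run, and their outputs are mixed by renormalized gate weights. The sketch below is a pure-Python illustration with hypothetical names. It is not the paper's router, whose expert granularity and any modality-specific design are not recorded in this note.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gates to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]

def moe_forward(token, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out
```

Because the router sees one token at a time, compute is allocated per token rather than per span, channel, or regime, which is the gap the "Missing pieces" column points at for time-series use.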
Links Into The Wiki
- Foundation Time-Series Model Research Agenda
- Unified Multimodal Models
- Mixture Of Experts
- Vision-Language Models
- World Models
Open Questions
- Does RAE remain the best visual substrate when pixel-space end-to-end models scale further?
- How general is the reported world-modeling emergence across action-conditioned datasets?