Beyond Language Modeling: An Exploration Of Multimodal Pretraining

Source

Raw Markdown: paper_beyond-language-modeling-2026.md
PDF: paper_beyond-language-modeling-2026.pdf

Core Claim

Native multimodal pretraining can move foundation models beyond text-only language modeling when visual representations, data mixtures, world-modeling data, and MoE scaling are controlled together.

Key Contributions

Uses controlled from-scratch Transfusion pretraining over language, images, video, image-text pairs, and action-conditioned video.
Finds representation autoencoders (RAE) useful for both visual understanding and generation.
Reports complementarity between visual and language data.
Treats MoE as a way to handle modality specialization and asymmetric scaling needs.
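
The MoE point above can be made concrete with a minimal top-k routing sketch. It is not the paper's implementation: the router logits, the expert count, and the helper names `route_top_k` and `expert_load` are all hypothetical, chosen only to show how per-token routing could let different tokens (for example text versus image tokens) land on different experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the chosen experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(token_logits, k=2):
    """Count how many tokens are dispatched to each expert."""
    load = [0] * len(token_logits[0])
    for logits in token_logits:
        for idx, _ in route_top_k(logits, k):
            load[idx] += 1
    return load


# Hypothetical router logits for two tokens over four experts.
text_token = [2.0, 0.1, -1.0, 1.5]
image_token = [-1.0, 0.2, 2.5, 0.3]
print(route_top_k(text_token))
print(expert_load([text_token, image_token]))
```

Inspecting `expert_load` per modality is one simple way to check whether experts specialize; whether the paper measures specialization this way is not stated in this note.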

Method Notes

The paper is a design-space study. It is linked to Unified Multimodal Models, Mixture Of Experts, Vision-Language Models, and World Models.

Evidence And Results

The source emphasizes IsoFLOP-style comparisons and controlled changes to representation, modality mix, and architecture. For synthesis, its most useful claim is that vision is more data-hungry than language and that MoE can harmonize the mismatch.
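
As a rough illustration of what an IsoFLOP-style comparison involves, the sketch below holds a compute budget fixed and trades model size against training tokens. It assumes the common dense-transformer approximation C ≈ 6·N·D (N parameters, D tokens); the paper's own FLOP accounting, budgets, and model sizes are not given in this note, so every number and the helper names `isoflop_grid` and `pick_best` are hypothetical. For an MoE model, N would more plausibly be the active parameter count per token.

```python
def isoflop_grid(compute_budget, param_counts):
    """For a fixed FLOP budget, pair each model size with its token budget.

    Uses the approximation C ~= 6 * N * D, which is an assumption here.
    """
    return [(n, compute_budget / (6.0 * n)) for n in param_counts]


def pick_best(results):
    """results: list of (params, tokens, loss); return the lowest-loss entry."""
    return min(results, key=lambda r: r[2])


# Hypothetical budget and model sizes, for illustration only.
budget = 1e19
grid = isoflop_grid(budget, [1e8, 2e8, 4e8, 8e8])
for n_params, n_tokens in grid:
    print(f"params={n_params:.0e}  tokens={n_tokens:.2e}")
```

Training one model per grid point and passing the measured losses to `pick_best` gives the compute-optimal size for that budget. Repeating this per modality is how a data-hunger gap between vision and language could show up as different optimal token counts.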

Limitations

The conclusions are tied to the Transfusion setup and data mixture. They should be compared against pixel-space approaches such as Tuna-2 and against the spectral-representation arguments in Prism.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Representation quality: semantic state vs dense detail	adjacent	Uses one high-dimensional RAE/SigLIP-style visual representation for understanding and generation, with explicit discussion of semantic latents versus VAE/raw-pixel choices.	Evidence is visual-language, not numeric time-series; no controllable preservation tradeoff for dense scalar detail.
Control and counterfactuals	adjacent	Navigation-world-model experiments condition future visual-state prediction on text-formatted navigation actions.	Actions are navigation prompts, not a general action/control-input interface for multivariate time series.
Dynamic compute allocation	adjacent	MoE routing is used to handle modality specialization and vision-language scaling asymmetry.	Routing is per modality/token, not adaptive compute over time-series spans, channels, regimes, or futures.

Links Into The Wiki

Open Questions

Does RAE remain the best visual substrate when pixel-space end-to-end models scale further?
How general is the reported world-modeling emergence across action-conditioned datasets?
