Vision-Language Models

Summary

The vision-language thread asks which output contract a VLM should expose: text tokens, predicted embeddings with optional text readout, structured outputs, or unified multimodal tokens and embeddings.

What The Wiki Currently Believes

VL-JEPA predicts text embeddings and decodes selectively, reducing trainable parameters and decoding work.
VLWM uses a structured token-generative world-model contract: video context maps to a goal, a goal interpretation, and interleaved textual action/state-change trajectories that support direct or critic-ranked planning. It makes state readable but risks bottlenecking geometry, uncertainty, and executability through language.
Action100M is the later hierarchical video/text data-engine source from the same research line. It scales open-vocabulary segment captions and action/actor fields, but only a 120k-video preview is public and its generated labels are not ground-truth physical state.
LeVLJEPA removes contrastive negatives from end-to-end vision-language pretraining, using cross-modal prediction plus per-modality SIGReg; its main win is dense patch-token quality for frozen VLM backbones and segmentation, not pooled zero-shot alignment.
Florence-2 stays with sequence-to-sequence generation, but uses task prompts and location tokens to serialize diverse structured outputs such as captions, boxes, OCR, grounding, and segmentation.
Molmo and PixMo keep the standard image-encoder plus connector plus decoder-only LLM pattern, but push openness and data quality: PixMo provides open multimodal data and Molmo reports frontier-class VLM performance without relying on proprietary VLM distillation for the core data.
Perception Encoder shows that contrastive vision-language pretraining can learn broad visual features, but language alignment is needed to expose many of them at the output for MLLM use.
Aristotelian Representation Hypothesis revisits vision-language and video-language representation alignment: after width/depth calibration, global spectral convergence largely disappears, while local-neighborhood agreement remains the stronger cross-modal signal.
Beyond Language Modeling studies from-scratch multimodal pretraining with language, image, video, and action-conditioned data.
Tuna-2 argues that pixel-space unified modeling can support both understanding and generation without pretrained vision encoders.
Gemma 4 12B is the release-level encoder-free VLM/audio-language source: image patches and audio waveforms are projected into a shared decoder-only transformer instead of passing through separate multimodal encoders.
Gemini Robotics 1.5 adds an embodied-reasoning VLM contract: structured spatial outputs, progress/success detection, thinking traces, and subtask handoff to a VLA/action model.
MiniMax Sparse Attention adds a release-scale VLM/agentic model case through MiniMax-M3: native multimodal training, image/video/text input, and 1M-context support are tied to sparse attention rather than only to modality encoders or output contracts.

Evidence

These sources loosen or stress-test the standard “vision encoder plus autoregressive text decoder” pattern in different ways: embedding prediction, non-contrastive cross-modal prediction, prompt-and-location serialization, open data-engine scaling, alignment of hidden visual features, calibrated separation of local-neighborhood versus global-geometry convergence, unified multimodal pretraining, pixel-space modeling, production encoder-free projection into a shared decoder backbone, and sparse-attention-backed million-token multimodal context.

Output Contracts

The wiki should not treat “vision-language model” as synonymous with “vision encoder plus autoregressive language decoder.” Current sources suggest at least five contracts:

Token-generative: image or video plus query maps to text tokens. This is the standard VLM interface and remains useful when the primary product is natural language.
Embedding-predictive: image or video plus query maps to a target embedding, with text decoded only when needed. VL-JEPA is the selective-decoding anchor; LeVLJEPA is the non-contrastive cross-modal-pretraining anchor.
Structured token-generative: image or video plus task prompt maps to a parseable sequence such as boxes, points, regions, OCR, or captions. Florence-2 is the anchor case.
Structured world-model trajectory: visual context maps to a goal and interleaved textual actions plus predicted world-state changes. VLWM is the anchor; the decisive boundary is whether language preserves enough physical state for planning rather than only producing plausible prose.
Unified multimodal: image, video, text, audio, and sometimes action data are modeled under a shared objective or native multimodal representation. Beyond Language Modeling, Tuna-2, and Gemma 4 12B are current anchors.

This matters for robotics and time-series systems because language can be a readout rather than the system’s main internal state. A model can maintain continuous embeddings for monitoring, retrieval, classification, progress estimation, or candidate-action scoring, then decode language only for operator-facing explanations or external language interfaces. It can also expose structured answers first, then use a separate formatter when prose is needed.
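As a rough illustration of "structured answers first, prose second", the sketch below parses a serialized sequence into typed boxes and only then formats a sentence. The `<loc_N>` token syntax, the 1000-bin quantization, the label-then-four-tokens layout, and the names `Box`, `parse_boxes`, and `to_prose` are all assumptions made for this example; they are modeled loosely on location-token serialization and are not taken from any source's released code.

```python
import re
from dataclasses import dataclass
from typing import List

LOC_BINS = 1000  # assumed number of quantization bins per axis (hypothetical)


@dataclass
class Box:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed layout: a free-text label followed by exactly four location tokens.
_BOX_PATTERN = re.compile(r"([^<]+)((?:<loc_\d+>){4})")
_LOC_PATTERN = re.compile(r"<loc_(\d+)>")


def parse_boxes(sequence: str, width: int, height: int) -> List[Box]:
    """Turn a serialized model output into typed boxes in pixel coordinates."""
    boxes = []
    for match in _BOX_PATTERN.finditer(sequence):
        label = match.group(1).strip()
        bins = [int(b) for b in _LOC_PATTERN.findall(match.group(2))]
        if any(b >= LOC_BINS for b in bins):
            continue  # skip entries whose bins fall outside the assumed range
        x0, y0, x1, y1 = bins
        sx, sy = width / LOC_BINS, height / LOC_BINS
        boxes.append(Box(label, x0 * sx, y0 * sy, x1 * sx, y1 * sy))
    return boxes


def to_prose(boxes: List[Box]) -> str:
    """Separate formatter: prose is produced only when a reader needs it."""
    if not boxes:
        return "No objects detected."
    parts = [
        f"{b.label} at ({b.x0:.0f}, {b.y0:.0f})-({b.x1:.0f}, {b.y1:.0f})"
        for b in boxes
    ]
    return "Detected " + "; ".join(parts) + "."


if __name__ == "__main__":
    raw = "cat<loc_100><loc_200><loc_500><loc_600>"
    print(to_prose(parse_boxes(raw, width=1000, height=500)))
```

Downstream consumers such as a planner or monitor would read the `Box` records directly; `to_prose` is called only at the operator-facing boundary.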

Relation To Foundation TSFM Agenda

Vision-language models are adjacent to the Foundation Time-Series Model Research Agenda through context interfaces, selective language readout, and robotics handoff patterns. The transferable lesson is that a foundation TSFM can maintain latent or typed state internally and decode language only when an operator-facing answer, explanation, or plan handoff is needed.
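A minimal sketch of the selective-readout idea, under stated assumptions: the system keeps a stream of latent embeddings and calls a language decoder only when the current embedding has drifted far enough from the last decoded one. The cosine-distance trigger, the `threshold` value, and the `SelectiveReadout` class are hypothetical choices for illustration, not the mechanism of any cited source.

```python
import math
from typing import Callable, List, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class SelectiveReadout:
    """Decode language only when the latent state has changed enough."""

    def __init__(
        self,
        decode: Callable[[Sequence[float]], str],
        threshold: float = 0.2,  # assumed trigger level, to be tuned per task
    ) -> None:
        self.decode = decode
        self.threshold = threshold
        self.decode_calls = 0
        self._last_decoded: Optional[List[float]] = None

    def step(self, embedding: Sequence[float]) -> Optional[str]:
        """Return decoded text, or None when the previous readout still holds."""
        if (
            self._last_decoded is not None
            and cosine_distance(embedding, self._last_decoded) < self.threshold
        ):
            return None
        self._last_decoded = list(embedding)
        self.decode_calls += 1
        return self.decode(embedding)


if __name__ == "__main__":
    # Stand-in decoder; a real system would call a text decoder here.
    readout = SelectiveReadout(lambda e: "state-A" if e[0] > e[1] else "state-B")
    for emb in [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]:
        print(readout.step(emb))
```

The embedding stream stays available at every step for monitoring or scoring; only the operator-facing text is gated, which is where the decoding savings would come from.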

Open Questions

How much text decoding is actually needed for online multimodal tasks?
When is text serialization of non-text outputs a useful unifying interface, and when does it become a bottleneck?
Which hidden visual layers should be aligned to language decoders, and when does alignment erase dense or local state?
When does removing contrastive negatives improve dense feature quality, and when does it damage alignment or retrieval enough to hurt the system?
Which cross-modal alignment metrics should be trusted after width/depth and layer-search calibration: local neighborhoods, global geometry, downstream probe transfer, or task-specific utility?
Can embedding-space prediction and pixel-space unified modeling be combined?
Which VLM output contract is best for conditioning fast robot action policies without forcing continuous control through language tokens?
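To make the alignment-metric question above concrete, the following sketch contrasts one local measure (overlap of k-nearest-neighbor sets) with one global measure (Pearson correlation between pairwise-distance matrices). Both are generic stand-ins chosen for illustration; the function names are hypothetical, and neither is claimed to be the exact metric used in the sources cited on this page.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def pairwise_distances(points: Sequence[Vector]) -> List[List[float]]:
    """Full Euclidean distance matrix for a set of embeddings."""
    return [[math.dist(p, q) for q in points] for p in points]


def knn_sets(dist: List[List[float]], k: int) -> List[Set[int]]:
    """Indices of the k nearest neighbors of each point, excluding itself."""
    n = len(dist)
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(set(order[:k]))
    return neighbors


def knn_overlap(a: Sequence[Vector], b: Sequence[Vector], k: int) -> float:
    """Local agreement: mean fraction of shared k-nearest neighbors."""
    sets_a = knn_sets(pairwise_distances(a), k)
    sets_b = knn_sets(pairwise_distances(b), k)
    shared = sum(len(x & y) for x, y in zip(sets_a, sets_b))
    return shared / (k * len(a))


def distance_matrix_correlation(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Global agreement: Pearson correlation over all pairwise distances."""
    da, db = pairwise_distances(a), pairwise_distances(b)
    n = len(a)
    xs = [da[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [db[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data: the same three pairs of items, with two pairs swapping places.
    modality_a = [[0.0], [1.0], [10.0], [11.0], [20.0], [21.0]]
    modality_b = [[0.0], [1.0], [20.0], [21.0], [10.0], [11.0]]
    print(knn_overlap(modality_a, modality_b, k=1))
    print(distance_matrix_correlation(modality_a, modality_b))
```

In the toy data each item keeps the same nearest neighbor in both spaces, so local overlap is perfect, while the rearranged clusters pull the global correlation well below 1. This is the kind of divergence between the two families of metrics that the question is asking about.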

Alex Open Research Wiki
