Vision-Language Models

Summary

The vision-language thread asks which output contract a VLM should expose: text tokens, predicted embeddings with optional text readout, structured outputs, or unified multimodal tokens and embeddings.

What The Wiki Currently Believes

  • VL-JEPA predicts text embeddings and decodes selectively, reducing trainable parameters and decoding work.
  • Florence-2 stays with sequence-to-sequence generation, but uses task prompts and location tokens to serialize diverse structured outputs such as captions, boxes, OCR, grounding, and segmentation.
  • Molmo and PixMo keep the standard pattern of image encoder, connector, and decoder-only LLM, but push openness and data quality: PixMo provides open multimodal data, and Molmo reports frontier-class VLM performance without relying on proprietary VLM distillation for the core data.
  • Perception Encoder shows that contrastive vision-language pretraining can learn broad visual features, but language alignment is needed to surface many of those features at the output for use in multimodal LLMs (MLLMs).
  • Beyond Language Modeling studies from-scratch multimodal pretraining with language, image, video, and action-conditioned data.
  • Tuna-2 argues that pixel-space unified modeling can support both understanding and generation without pretrained vision encoders.
  • Gemma 4 12B is the release-level encoder-free VLM/audio-language source: image patches and audio waveforms are projected into a shared decoder-only transformer instead of passing through separate multimodal encoders.
  • Gemini Robotics 1.5 adds an embodied-reasoning VLM contract: structured spatial outputs, progress/success detection, thinking traces, and subtask handoff to a VLA/action model.
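
To make the Florence-2 entry above concrete, the following is a minimal sketch of turning a serialized detection string back into pixel-space boxes. The `<loc_N>` token spelling, the 1000-bin coordinate quantization, and the names `Box` and `parse_boxes` are assumptions for illustration; they are not taken from the sources summarized here and may not match the released tokenizer.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


# Assumed format: a text label followed by four quantized location tokens.
_PATTERN = re.compile(r"([^<>]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def _to_pixels(bin_index: str, size: int, bins: int) -> float:
    # Map a bin index to the center of its bin, then scale to pixel units.
    return (int(bin_index) + 0.5) / bins * size


def parse_boxes(text: str, width: int, height: int, bins: int = 1000) -> List[Box]:
    """Convert 'label<loc_a><loc_b><loc_c><loc_d>' runs into pixel-space boxes."""
    boxes = []
    for label, x1, y1, x2, y2 in _PATTERN.findall(text):
        boxes.append(
            Box(
                label.strip(),
                _to_pixels(x1, width, bins),
                _to_pixels(y1, height, bins),
                _to_pixels(x2, width, bins),
                _to_pixels(y2, height, bins),
            )
        )
    return boxes
```

The point of the sketch is that the model still emits an ordinary token sequence; the structure lives in the parser that sits after generation.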

Evidence

These sources loosen or stress-test the standard “vision encoder plus autoregressive text decoder” pattern in different ways: embedding prediction, prompt-and-location serialization, open data-engine scaling, alignment of hidden visual features, unified multimodal pretraining, pixel-space modeling, and production encoder-free projection into a shared decoder backbone.

Output Contracts

The wiki should not treat “vision-language model” as synonymous with “vision encoder plus autoregressive language decoder.” Current sources suggest at least four contracts:

  1. Token-generative: image or video plus query maps to text tokens. This is the standard VLM interface and remains useful when the primary product is natural language.
  2. Embedding-predictive: image or video plus query maps to a target embedding, with text decoded only when needed. VL-JEPA is the anchor case.
  3. Structured token-generative: image or video plus task prompt maps to a parseable sequence such as boxes, points, regions, OCR, or captions. Florence-2 is the anchor case.
  4. Unified multimodal: image, video, text, audio, and sometimes action data are modeled under a shared objective or native multimodal representation. Beyond Language Modeling, Tuna-2, and Gemma 4 12B are current anchors.
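
The four contracts above can be sketched as a small tagged union. This is only an illustrative typing exercise: every class name, field, and the `readout_step` helper is hypothetical and does not correspond to any listed model's API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TokenOutput:
    """Contract 1: plain text tokens."""
    tokens: List[str]


@dataclass
class EmbeddingOutput:
    """Contract 2: a predicted target embedding; text is optional."""
    vector: List[float]


@dataclass
class StructuredOutput:
    """Contract 3: a task prompt plus a parseable serialized sequence."""
    task: str
    sequence: str


@dataclass
class UnifiedOutput:
    """Contract 4: native multimodal tokens or embeddings, tagged by modality."""
    modality: str
    values: List[float]


VLMOutput = Union[TokenOutput, EmbeddingOutput, StructuredOutput, UnifiedOutput]


def readout_step(output: VLMOutput) -> str:
    """Name the extra step (hypothetical labels) needed before prose can be shown."""
    if isinstance(output, TokenOutput):
        return "none"        # already natural language
    if isinstance(output, StructuredOutput):
        return "formatter"   # parse the sequence, then format as prose
    return "decoder"         # embeddings need a text decoder for readout
```

Writing the contracts as distinct types makes the downstream cost explicit: only the token-generative contract yields prose directly, while the others need a formatter or decoder when prose is wanted.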

This matters for robotics and time-series systems because language can be a readout rather than the system’s main internal state. A model can maintain continuous embeddings for monitoring, retrieval, classification, progress estimation, or candidate-action scoring, then decode language only for operator-facing explanations or external language interfaces. It can also expose structured answers first, then use a separate formatter when prose is needed.
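
A minimal sketch of that readout pattern, assuming the model exposes a state embedding and that candidate actions have embeddings in the same space: the choice is made by cosine similarity, and the text decoder is called only when an explanation is requested. The functions `cosine`, `best_candidate`, and `step`, and the `decode` callable, are hypothetical stand-ins rather than any model's interface.

```python
import math
from typing import Callable, Dict, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_candidate(state: Sequence[float],
                   candidates: Dict[str, Sequence[float]]) -> str:
    """Pick the candidate whose embedding is closest to the current state."""
    return max(candidates, key=lambda name: cosine(state, candidates[name]))


def step(state: Sequence[float],
         candidates: Dict[str, Sequence[float]],
         decode: Optional[Callable[[Sequence[float]], str]] = None,
         explain: bool = False) -> Tuple[str, Optional[str]]:
    """Score candidates in embedding space; decode text only on request."""
    choice = best_candidate(state, candidates)
    text = decode(state) if (explain and decode is not None) else None
    return choice, text
```

In a control loop, `step` would typically run with `explain=False` on every tick, so the decoding cost is paid only on the occasional operator query.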

Relation To Foundation TSFM Agenda

Vision-language models are adjacent to the Foundation Time-Series Model Research Agenda through context interfaces, selective language readout, and robotics handoff patterns. The transferable lesson is that a time-series foundation model (TSFM) can maintain latent or typed state internally and decode language only when an operator-facing answer, explanation, or plan handoff is needed.

Open Questions

  • How much text decoding is actually needed for online multimodal tasks?
  • When is text serialization of non-text outputs a useful unifying interface, and when does it become a bottleneck?
  • Which hidden visual layers should be aligned to language decoders, and when does alignment erase dense or local state?
  • Can embedding-space prediction and pixel-space unified modeling be combined?
  • Which VLM output contract is best for conditioning fast robot action policies without forcing continuous control through language tokens?