VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language

Source

Core Claim

VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.

Key Contributions

  • Predicts target text embeddings from vision inputs and textual queries.
  • Uses a lightweight decoder only when text output is needed.
  • Supports selective decoding to reduce decoding operations.
  • Enables classification, retrieval, and discriminative VQA from the embedding space without architecture changes.
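The last contribution above can be illustrated with a minimal sketch: if the model outputs a predicted target-text embedding, classification and retrieval reduce to ranking candidate text embeddings by similarity. The 3-dimensional vectors, the label names, and the helper names `cosine` and `rank_candidates` are hypothetical and chosen only for illustration. Cosine similarity is an assumed scoring rule, since this note does not state which metric the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(predicted, candidates):
    """Return candidate names sorted from most to least similar."""
    scored = [(cosine(predicted, emb), name) for name, emb in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical text embeddings for three candidate labels (assumed values).
candidates = {
    "pouring water": [0.9, 0.1, 0.0],
    "cutting bread": [0.0, 1.0, 0.1],
    "opening door": [0.1, 0.0, 1.0],
}

# Hypothetical embedding predicted from a video clip and a textual query.
predicted = [0.8, 0.2, 0.1]

ranking = rank_candidates(predicted, candidates)
print(ranking[0])  # top-1 label: classification or discriminative VQA
print(ranking)     # full ordering: retrieval
```

The same ranking step would serve classification (take the top entry), retrieval (take the full ordering), and discriminative VQA (rank the answer options), which is one way to read "without architecture changes".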

Method Notes

VL-JEPA is filed under JEPA, Vision-Language Models, Latent-Space Predictive Learning, and Self-Supervised Representation Learning. It is also a useful source of interface patterns for Slow Thinking For Robotics And Time Series: maintain a continuous stream of target embeddings and decode language only when text is actually needed.
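As a rough sketch of that interface pattern, the loop below keeps a stream of embeddings and calls a decoder only when the current embedding has drifted from the last one that was decoded. The drift rule (cosine distance above a fixed `threshold`), the function names, and the toy decoder are all assumptions made for illustration. This note does not describe the paper's actual selection mechanism.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def selective_decode(stream, decode, threshold=0.2):
    """Decode only when the embedding drifts from the last decoded one.

    stream: iterable of embedding vectors, one per time step.
    decode: callable mapping an embedding to text (stand-in for a light decoder).
    Returns (step, text) pairs for the steps that triggered decoding.
    """
    outputs = []
    anchor = None  # embedding at the most recent decode
    for step, emb in enumerate(stream):
        if anchor is None or cosine_distance(anchor, emb) > threshold:
            outputs.append((step, decode(emb)))
            anchor = emb
    return outputs

def toy_decode(emb):
    """Hypothetical decoder: names the dominant axis of a 2-d embedding."""
    return "event A" if emb[0] >= emb[1] else "event B"

# Toy stream: three similar steps, a change, one similar step, a change back.
stream = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],
    [0.0, 1.0], [0.05, 0.99],
    [1.0, 0.0],
]

events = selective_decode(stream, toy_decode)
print(events)
print(f"{len(stream)} steps, {len(events)} decoder calls")
```

In this toy stream the decoder runs on three of six steps. The saving in a real system would depend on how often the semantic content of the stream changes and on how the trigger is defined.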

Evidence And Results

The abstract reports that VL-JEPA outperforms a controlled token-space VLM baseline while using 50% fewer trainable parameters, needs about 2.85x fewer decoding operations under selective decoding, and achieves competitive results across video classification, retrieval, and VQA datasets.

Limitations

The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Latent predictive learning | adjacent | VL-JEPA predicts continuous target-text embeddings from visual inputs and queries instead of training directly on token-space generation. | Evidence is vision-language; no numeric time-series or control-state target. |
| Dynamic compute and selective decoding | partially closes | The raw paper reports streaming semantic embeddings and about 2.85x fewer decoding operations through adaptive selective decoding. | Needs an analogous trigger for numeric events, regimes, and alarms. |
| Anti-collapse representation learning | adjacent | The training objective uses embedding alignment with InfoNCE/uniformity pressure and discusses collapse-avoidance alternatives. | Does not test TSFM-specific collapse across channels, regimes, or forecast horizons. |
| Causal and control modeling | warning | WorldPrediction and action-anticipation experiments test semantic action recognition and anticipation, not outcome rollouts under candidate actions. | Needs action-conditioned latent dynamics and future-state evaluation. |
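
To make the anti-collapse row concrete, here is a minimal sketch of a standard InfoNCE loss between predicted and target embeddings. It is the textbook form, not necessarily the exact objective used in the paper. The temperature value, the toy 2-d embeddings, and the function names are assumptions. The example compares well-aligned predictions with a collapsed predictor that emits the same vector for every input.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss where predicted[i] should match targets[i].

    Every other target in the batch acts as a negative for predicted[i].
    """
    preds = [l2_normalize(p) for p in predicted]
    tgts = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, p in enumerate(preds):
        logits = [dot(p, t) / temperature for t in tgts]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(preds)

# Toy target-text embeddings for a batch of two samples (assumed values).
targets = [[1.0, 0.0], [0.0, 1.0]]

# Predictions that point toward their own targets.
aligned = info_nce([[0.9, 0.1], [0.1, 0.9]], targets)

# Collapsed predictor: identical output regardless of input.
collapsed = info_nce([[1.0, 1.0], [1.0, 1.0]], targets)

print(f"aligned loss:   {aligned:.4f}")
print(f"collapsed loss: {collapsed:.4f}")
```

A collapsed predictor cannot tell the batch targets apart, so its loss sits at log(batch size), here log 2, while aligned predictions drive the loss toward zero. That gap is the pressure against collapse that the contrastive term supplies. Whether it carries over to channels, regimes, or forecast horizons is the open point noted in the table.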

Open Questions

  • Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
  • How should target text embeddings be trained and regularized to avoid semantic collapse?
  • Can target embeddings be grounded in action consequences strongly enough to become state variables for action-conditioned world models?