VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language
Source
- Raw Markdown: paper_vl-jepa-2025.md
- PDF: paper_vl-jepa-2025.pdf
Core Claim
VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.
Key Contributions
- Predicts target text embeddings from vision inputs and textual queries.
- Uses a lightweight decoder only when text output is needed.
- Supports selective decoding to reduce decoding operations.
- Enables classification, retrieval, and discriminative VQA from the embedding space without architecture changes.
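
As a rough illustration of the last contribution, classification and retrieval from the embedding space can be read as nearest-neighbour ranking: score each candidate text embedding against the predicted embedding and take the best match. The sketch below is a minimal, hypothetical example, not the paper's implementation. The names `cosine` and `rank_candidates` and the 3-dimensional toy vectors are assumptions made for illustration; in practice the vectors would come from the model's predictor and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(predicted, candidates):
    """Return candidate names sorted by similarity to the predicted embedding, best first."""
    scored = [(cosine(predicted, emb), name) for name, emb in candidates.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Toy vectors standing in for text-encoder outputs (made up for illustration).
label_embeddings = {
    "pouring water": [0.9, 0.1, 0.0],
    "cutting bread": [0.1, 0.9, 0.1],
    "opening door": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the embedding predicted from a video and a query.
predicted = [0.8, 0.3, 0.1]

ranking = rank_candidates(predicted, label_embeddings)
print(ranking[0])  # top-1 label acts as the classification result
```

The same ranking serves both uses: the top entry is a classification decision, and the full ordered list is a retrieval result. No decoder is involved in either case.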
Method Notes
VL-JEPA falls under the JEPA, Vision-Language Models, Latent-Space Predictive Learning, and Self-Supervised Representation Learning topics. It is also a useful source for an interface pattern in Slow Thinking For Robotics And Time Series: maintain a continuous stream of target embeddings and decode language only when it is needed.
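
The interface pattern can be sketched as a loop that watches an embedding stream and calls the decoder only when the stream moves away from the last decoded point. This is a hedged sketch under stated assumptions: the cosine-similarity trigger, the `threshold` value, and the names `selective_decode` and `toy_decode` are hypothetical choices for illustration, and this note does not confirm that the paper's selective decoding rule works this way.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def selective_decode(stream, decode, threshold=0.9):
    """Decode an embedding only when it drifts from the last decoded one.

    Assumed trigger: cosine similarity to the last decoded embedding
    falls below `threshold`. Returns a list of (step, decoded_text).
    """
    outputs = []
    anchor = None  # last embedding that was actually decoded
    for step, emb in enumerate(stream):
        if anchor is None or cosine(anchor, emb) < threshold:
            outputs.append((step, decode(emb)))
            anchor = emb
    return outputs

def toy_decode(emb):
    """Placeholder decoder: names the dominant axis of a 2-d embedding."""
    labels = ["idle", "moving"]
    return labels[max(range(len(emb)), key=lambda i: emb[i])]

# Five toy time steps with one semantic change between steps 2 and 3.
stream = [[1.0, 0.0], [0.98, 0.05], [0.95, 0.1], [0.1, 1.0], [0.05, 0.99]]
events = selective_decode(stream, toy_decode)
print(events)
```

On this toy stream the decoder runs at steps 0 and 3 only, so two decode calls cover five time steps. The embeddings remain available at every step for downstream use whether or not text is produced.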
Evidence And Results
The abstract reports stronger performance than a controlled token-space VLM baseline with 50% fewer trainable parameters, about 2.85x fewer decoding operations under selective decoding, and competitive results across video classification, retrieval, and VQA datasets.
Limitations
The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Latent predictive learning | adjacent | VL-JEPA predicts continuous target-text embeddings from visual inputs and queries instead of training directly on token-space generation. | Evidence is vision-language; no numeric time-series or control-state target. |
| Dynamic compute and selective decoding | partially closes | The full paper text reports streaming semantic embeddings and about 2.85x fewer decoding operations through adaptive selective decoding. | Needs an analogous trigger for numeric events, regimes, and alarms. |
| Anti-collapse representation learning | adjacent | The training objective uses embedding alignment with InfoNCE/uniformity pressure and discusses collapse-avoidance alternatives. | Does not test TSFM-specific collapse across channels, regimes, or forecast horizons. |
| Causal and control modeling | warning | WorldPrediction and action-anticipation experiments test semantic action recognition and anticipation, not outcome rollouts under candidate actions. | Needs action-conditioned latent dynamics and future-state evaluation. |
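
The anti-collapse row refers to InfoNCE-style alignment. As a hedged illustration of why a contrastive term discourages collapse, the sketch below computes a standard one-directional InfoNCE loss in plain Python. It is not the paper's exact objective: the function name `info_nce`, the `temperature` value, and the toy vectors are assumptions made for illustration.

```python
import math

def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss where predicted[i] should match targets[i].

    Each predicted embedding is scored against every target in the batch;
    the matching target is the positive and the rest are negatives.
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    p = [normalize(v) for v in predicted]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, pi in enumerate(p):
        logits = [sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(p)

targets = [[1.0, 0.0], [0.0, 1.0]]

# Predictions that point toward their own targets.
aligned = info_nce([[0.9, 0.1], [0.1, 0.9]], targets)

# Collapsed predictions: every input maps to the same vector.
collapsed = info_nce([[1.0, 1.0], [1.0, 1.0]], targets)

print(aligned < collapsed)
```

When all predictions collapse to one vector, every target scores equally and the loss sits at the logarithm of the batch size, here log 2. Aligned predictions drive it toward zero, which is the pressure against collapse that the row describes.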
Links Into The Wiki
- VL-JEPA
- Foundation Time-Series Model Research Agenda
- JEPA
- Vision-Language Models
- Latent-Space Predictive Learning
- Slow Thinking For Robotics And Time Series
- World Models
- Self-Supervised Representation Learning
Open Questions
- Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
- How should target text embeddings be trained and regularized to avoid semantic collapse?
- Can target embeddings be grounded in action consequences strongly enough to become state variables for action-conditioned world models?