VL-JEPA: Joint Embedding Predictive Architecture For Vision-Language

Source

Raw Markdown: paper_vl-jepa-2025.md
PDF: paper_vl-jepa-2025.pdf

Core Claim

VL-JEPA applies JEPA to vision-language learning by predicting continuous target-text embeddings rather than autoregressively generating tokens.

Key Contributions

Predicts target text embeddings from vision inputs and textual queries.
Uses a lightweight decoder only when text output is needed.
Supports selective decoding to reduce decoding operations.
Enables classification, retrieval, and discriminative VQA from the embedding space without architecture changes.
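
The embedding-space use in the last item can be illustrated with a minimal sketch. It assumes that candidate labels have already been embedded by a text encoder and that the predictor has produced one embedding per query; the function names and toy vectors below are hypothetical and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def classify_by_embedding(predicted, label_embeddings):
    """Pick, for each predicted embedding, the label embedding with highest cosine similarity."""
    sims = l2_normalize(predicted) @ l2_normalize(label_embeddings).T
    return sims.argmax(axis=1), sims

# Toy stand-ins (assumption): real inputs would come from the predictor and text encoder.
label_embeddings = np.array([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])
predicted = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 0.8]])

labels, sims = classify_by_embedding(predicted, label_embeddings)
print(labels)
```

Retrieval can reuse the same similarity matrix by ranking along the other axis, which is one way to read the "without architecture changes" claim: only the readout changes, not the model.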

Method Notes

VL-JEPA falls under JEPA, Vision-Language Models, Latent-Space Predictive Learning, and Self-Supervised Representation Learning. It is also a useful source of interface patterns for Slow Thinking For Robotics And Time Series: maintain a continuous target-embedding stream and decode it into language selectively.

LeVLJEPA adds an important boundary: it treats VL-JEPA as still contrastive because VL-JEPA uses an InfoNCE-style alignment signal, whereas LeVLJEPA trains image/text encoders from scratch with non-contrastive cross-modal prediction plus per-modality SIGReg. For TSL-JEPA, VL-JEPA is therefore the selective-readout interface precedent, while LeVLJEPA is the stronger non-contrastive objective precedent.
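
For readers unfamiliar with the term, the generic textbook form of an InfoNCE loss can be sketched as follows. This is not VL-JEPA's exact objective: the one-directional formulation, the function name, and the temperature of 0.07 are assumptions made for illustration. It shows why the signal counts as contrastive: every other item in the batch acts as a negative.

```python
import numpy as np

def info_nce(pred, target, temperature=0.07):
    """Generic InfoNCE: row i of `pred` should match row i of `target`;
    all other rows in the batch serve as negatives."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch (assumption): 8 random 16-dimensional embeddings.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))

aligned_loss = info_nce(target, target)           # predictions match their targets
mismatched_loss = info_nce(target, target[::-1])  # every prediction paired with a wrong target
print(aligned_loss, mismatched_loss)
```

A non-contrastive objective of the kind attributed to LeVLJEPA would drop the off-diagonal negatives and rely on a separate regularizer to prevent collapse.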

Follow-On Evidence From Action100M

Action100M provides direct follow-on data-scale evidence for VL-JEPA. The Action100M paper trains VL-JEPA in three stages, moving from image-text pretraining to 8-frame and then 32-frame video training on Action100M, and reports consistent gains in zero-shot action recognition and text-to-video retrieval as the video supervision scales.

This strengthens the case that VL-JEPA’s embedding-predictive interface can use large action-centric video corpora. It does not make VL-JEPA an action-conditioned world model: Action100M labels observed video segments rather than candidate control inputs and next states. Reproducibility also remains bounded because the official Action100M release is a 120,000-video preview, not the complete corpus used for the reported training.

Evidence And Results

The abstract reports stronger performance than a controlled token-space VLM baseline with 50% fewer trainable parameters, about 2.85x fewer decoding operations under selective decoding, and competitive results across video classification, retrieval, and VQA datasets.

Limitations

The model still needs a decoder when text output is required. It does not eliminate language generation; it makes generation selective.
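
A minimal sketch of what selective generation could look like: the decoder runs only when the streamed embedding has drifted away from the last decoded one. The cosine-distance trigger, the threshold of 0.2, and the `decode_fn` stand-in are assumptions for illustration and are not the paper's selection rule.

```python
import numpy as np

def selective_decode(stream, decode_fn, threshold=0.2):
    """Decode only when the embedding has drifted from the last decoded one.

    stream: iterable of 1-D embedding vectors, one per time step.
    decode_fn: hypothetical stand-in for the lightweight text decoder.
    threshold: cosine-distance trigger (assumed value).
    """
    outputs = []  # (time step, decoded text) pairs
    last = None   # unit-norm embedding at the most recent decode
    for t, emb in enumerate(stream):
        emb = emb / np.linalg.norm(emb)
        if last is None or 1.0 - float(emb @ last) > threshold:
            outputs.append((t, decode_fn(emb)))
            last = emb
    return outputs

# Toy stream (assumption): three steps near one direction, then three near another.
stream = [np.array([1.0, 0.0]), np.array([0.99, 0.05]), np.array([0.98, 0.1]),
          np.array([0.0, 1.0]), np.array([0.05, 0.99]), np.array([0.1, 0.98])]

decoded = selective_decode(stream, decode_fn=lambda e: "A" if e[0] > e[1] else "B")
print(decoded)
```

In this toy stream the decoder is called twice across six steps; the embedding is still computed at every step, so the saving applies to decoding only.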

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Latent predictive learning | adjacent | VL-JEPA predicts continuous target-text embeddings from visual inputs and queries instead of training directly on token-space generation. | Evidence is vision-language; no numeric time-series or control-state target. |
| Dynamic compute and selective decoding | partially closes | The raw paper reports streaming semantic embeddings and about 2.85x fewer decoding operations through adaptive selective decoding. | Needs an analogous trigger for numeric events, regimes, and alarms. |
| Anti-collapse representation learning | adjacent | The training objective uses embedding alignment with InfoNCE/uniformity pressure and discusses collapse-avoidance alternatives. | Does not test TSFM-specific collapse across channels, regimes, or forecast horizons. |
| Causal and control modeling | warning | WorldPrediction and action-anticipation experiments test semantic action recognition and anticipation, not outcome rollouts under candidate actions. | Needs action-conditioned latent dynamics and future-state evaluation. |

Open Questions

Can embedding prediction replace autoregressive decoding in broader multimodal assistant tasks?
How should target text embeddings be trained and regularized to avoid semantic collapse?
Can target embeddings be grounded in action consequences strongly enough to become state variables for action-conditioned world models?
