VL-JEPA

Summary

VL-JEPA is a vision-language model that predicts continuous target-text embeddings instead of autoregressively generating text tokens.
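To make the contrast with token generation concrete, here is a minimal sketch of a loss computed directly in embedding space. It assumes an InfoNCE-style contrastive objective over a batch of predicted and target text embeddings. The function names, the temperature value, and the toy vectors are illustrative assumptions, not details taken from the VL-JEPA paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss; row i of `predicted` should match row i of `targets`.

    Hypothetical helper: the loss lives in embedding space, so no
    vocabulary-sized softmax over text tokens is involved.
    """
    p = [normalize(v) for v in predicted]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, p_i in enumerate(p):
        logits = [dot(p_i, t_j) / temperature for t_j in t]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(p)


# Toy batch: predictions aligned with their targets versus swapped targets.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
```

In this toy batch the aligned pairing yields a much smaller loss than the swapped pairing, which is the behavior an embedding-prediction objective rewards.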

Role In The Wiki

VL-JEPA extends JEPA-style representation prediction to general-domain vision-language tasks and selective decoding. It anchors the wiki pattern where language is a readout from a continuous semantic embedding stream, not necessarily the system’s main internal representation.
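The idea of language as a readout can be sketched as a loop that calls a text decoder only when the embedding stream has moved far enough from the last decoded state. This is a rough illustration under assumptions: the cosine-distance trigger, the `threshold` value, and the `decode` callable are hypothetical and do not describe the model's actual selective-decoding rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def selective_decode(stream, decode, threshold=0.2):
    """Decode an embedding only when it drifts from the last decoded one.

    `decode` is a placeholder for any embedding-to-text readout.
    Returns a list of (step, text) pairs for the steps that were decoded.
    """
    outputs = []
    last = None
    for step, emb in enumerate(stream):
        if last is None or 1.0 - cosine(last, emb) > threshold:
            outputs.append((step, decode(emb)))
            last = emb
    return outputs


# Toy stream: two near-duplicate states, then a semantic shift, then a repeat.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
calls = selective_decode(
    stream,
    decode=lambda e: "x-dominant" if e[0] > e[1] else "y-dominant",
)
```

With this toy stream the decoder runs at steps 0 and 2 only; the other two steps reuse the previous readout, so the continuous stream remains the primary representation.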

LeVLJEPA should be treated as an adjacent successor with a non-contrastive objective, not as the same entity: it keeps cross-modal embedding prediction but replaces contrastive alignment with predictor/stop-gradient asymmetry plus per-modality SIGReg.

Action100M provides the main follow-on evidence on data scale: its paper trains VL-JEPA on large-scale action-centric video supervision and reports scaling gains in zero-shot action recognition and video retrieval. This is evidence about representation learning, not about candidate-action rollout, and the complete training corpus is not publicly released.
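Both evaluation settings named above can be run purely on embeddings, which is why they count as representation-learning evidence. The sketch below assumes cosine similarity as the scoring rule; the helper names, label strings, and vectors are made up for illustration and are not from Action100M or the VL-JEPA codebase.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def zero_shot_label(video_emb, label_embs):
    """Pick the label whose text embedding is closest to the video embedding."""
    return max(label_embs, key=lambda name: cosine(video_emb, label_embs[name]))


def retrieve(query_emb, video_embs, k=1):
    """Return the indices of the k videos most similar to a text query."""
    ranked = sorted(
        range(len(video_embs)),
        key=lambda i: cosine(query_emb, video_embs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical label and video embeddings in a shared 3-d space.
label_embs = {"pour water": [1.0, 0.0, 0.0], "chop onion": [0.0, 1.0, 0.0]}
video_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]

predicted = zero_shot_label(video_embs[0], label_embs)
top_hit = retrieve(label_embs["chop onion"], video_embs, k=1)
```

Neither function generates text or rolls out actions; each one only ranks candidates by similarity in the shared embedding space.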

Evidence

Relation To Foundation TSFM Agenda

Use the source-level agenda mapping in vl-jepa-2025 rather than duplicating verdict rows here.

At the entity level, this page should stay as the object card; source pages carry slot-level verdicts, evidence, and missing pieces.

Alex Open Research Wiki
