LeVLJEPA
Summary
LeVLJEPA is a non-contrastive, end-to-end vision-language pretraining method. It aligns image and text through cross-modal prediction with stop-gradient targets and uses per-modality SIGReg to prevent representation collapse, without negatives, a temperature parameter, momentum encoders, or a teacher-student schedule.
Role In The Wiki
LeVLJEPA is the current bridge between the wiki’s LeJEPA/SIGReg thread and the VL-JEPA/selective-readout thread. It is especially relevant to TSL-JEPA because it shows a concrete way to train cross-modal embeddings without making autoregressive text generation or contrastive matched-vs-unmatched discrimination the central objective.
The useful abstraction for time-series work is:
paired modality or query-target view -> predicted target embedding -> optional readout
Evidence
Use the source page “LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives” for detailed evidence, limitations, and TSFM relevance.
At the entity level, the most important results are:
- direct symmetric cross-modal MSE collapses in ablations;
- predictor plus stop-gradient plus per-modality SIGReg is the stable recipe;
- dense patch-token features beat InfoNCE/SigLIP baselines on ADE20K and COCO-Stuff segmentation;
- frozen VLM-backbone transfer beats contrastive baselines across GQA, VQAv2, and POPE under Llama-1B and Qwen-1.5B;
- global zero-shot classification remains stronger for contrastive objectives at DataComp-L scale.
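The first two results can be made concrete with a forward-only NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the predictors are plain linear maps, prediction is assumed to run in both directions, and `gaussian_moment_penalty` is a simplified moment-matching stand-in rather than the actual SIGReg statistic. NumPy has no autodiff, so the stop-gradient appears only as a comment; in a training framework the targets would be wrapped in something like `jax.lax.stop_gradient` or PyTorch's `.detach()`.

```python
import numpy as np


def gaussian_moment_penalty(z, num_directions=16, seed=0):
    """Simplified stand-in for a per-modality regularizer (NOT actual SIGReg).

    Projects embeddings onto random unit directions and penalizes each 1-D
    projection for deviating from zero mean and unit variance.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(z.shape[1], num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions                      # shape: (batch, num_directions)
    mean_term = proj.mean(axis=0) ** 2
    var_term = (proj.var(axis=0) - 1.0) ** 2
    return float((mean_term + var_term).mean())


def cross_modal_loss(z_img, z_txt, w_img2txt, w_txt2img, reg_weight=1.0):
    """Prediction loss plus a regularizer applied to each modality separately."""
    # Targets (z_txt, z_img) would be stop-gradient'ed in an autodiff framework;
    # here they are plain arrays, so this only shows the forward computation.
    pred_txt = z_img @ w_img2txt               # image embedding predicts text target
    pred_img = z_txt @ w_txt2img               # text embedding predicts image target
    prediction = np.mean((pred_txt - z_txt) ** 2) + np.mean((pred_img - z_img) ** 2)
    regularizer = gaussian_moment_penalty(z_img) + gaussian_moment_penalty(z_txt)
    return {
        "prediction": float(prediction),
        "regularizer": regularizer,
        "total": float(prediction + reg_weight * regularizer),
    }


# Collapsed embeddings (every sample identical) make the prediction term vanish,
# which is why an MSE-only objective can be minimized by collapse.
dim = 8
identity = np.eye(dim)
collapsed = np.ones((256, dim))
healthy = np.random.default_rng(0).normal(size=(256, dim))

print(cross_modal_loss(collapsed, collapsed, identity, identity))
print(cross_modal_loss(healthy, healthy, identity, identity))
```

In this toy setup the collapsed batch has zero prediction loss, so only the per-modality penalty distinguishes it from the well-spread batch. That is the role the source attributes to per-modality SIGReg.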
TSL-JEPA Relevance
LeVLJEPA does not solve time-series-language learning directly, but it provides a high-value design test for TSL-JEPA:
- replace contrastive query-label matching with non-contrastive prediction of target embeddings;
- apply distributional regularization independently per modality or target family;
- evaluate dense temporal state rather than only pooled class or caption metrics;
- use a frozen bridge/readout to isolate whether the time-series encoder already exposes the needed state.
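The last two items can be sketched as a frozen-readout probe: keep the encoder fixed, fit only a small linear readout on its per-timestep embeddings, and score held-out timesteps. The example below is a minimal sketch on synthetic data; the helper names are hypothetical, the random features merely stand in for frozen time-series embeddings, and ridge regression is one possible readout, not something prescribed by LeVLJEPA.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_ridge_readout(features, targets, ridge=1e-3):
    """Closed-form ridge regression from frozen features to targets.

    The ridge term is also applied to the bias column, for simplicity.
    """
    x = _with_bias(features)
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def apply_readout(features, weights):
    return _with_bias(features) @ weights


def r2_score(y_true, y_pred):
    """Coefficient of determination pooled over all target dimensions."""
    residual = np.sum((y_true - y_pred) ** 2)
    total = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)


rng = np.random.default_rng(0)
num_steps, dim = 400, 16

# Stand-in for frozen per-timestep embeddings; the encoder is never updated.
features = rng.normal(size=(num_steps, dim))
# Synthetic dense temporal state that is linearly decodable from the features.
state = features @ rng.normal(size=(dim, 1)) + 0.1 * rng.normal(size=(num_steps, 1))
# Control: features that carry no information about the state.
unrelated = rng.normal(size=(num_steps, dim))

train, test = slice(0, 200), slice(200, 400)
for name, feats in [("informative", features), ("unrelated", unrelated)]:
    weights = fit_ridge_readout(feats[train], state[train])
    print(name, r2_score(state[test], apply_readout(feats[test], weights)))
```

Because only the readout is trained, a high held-out score suggests the frozen encoder already exposes the state, while a low score on the control confirms the probe is not simply memorizing.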
Caveats
- Vision-language evidence should not be treated as proof for numeric time series, event streams, or control inputs.
- LeVLJEPA still uses stop-gradient targets, while some local LeNEPA questions ask whether SIGReg can remove stop-gradient/EMA stabilization.
- Summaries should preserve both its dense-feature advantage and its zero-shot weakness; describing it as simply “better than CLIP/SigLIP” would be misleading.