LeVLJEPA

Summary

LeVLJEPA is a non-contrastive, end-to-end vision-language pretraining method. It aligns image and text embeddings by cross-modal prediction against stop-gradient targets, and applies SIGReg to each modality separately to prevent representation collapse. It needs no negative pairs, temperature parameter, momentum encoder, or teacher-student schedule.
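One way to write this objective down, as a hedged sketch rather than the paper's exact formulation: let f_v and f_t be the image and text encoders, p_{v→t} and p_{t→v} the cross-modal predictors, sg the stop-gradient operator, and Z_v, Z_t the batches of image and text embeddings. The squared-error distance, the symmetric two-direction form, and the single weight λ are assumptions made for illustration; the source page is the reference for the actual loss.

```latex
\mathcal{L}
= \underbrace{\bigl\| p_{v \to t}(z_v) - \operatorname{sg}(z_t) \bigr\|_2^2
  + \bigl\| p_{t \to v}(z_t) - \operatorname{sg}(z_v) \bigr\|_2^2}_{\text{cross-modal prediction}}
+ \lambda \underbrace{\bigl( \operatorname{SIGReg}(Z_v) + \operatorname{SIGReg}(Z_t) \bigr)}_{\text{per-modality regularization}},
\qquad z_v = f_v(x_{\text{image}}),\quad z_t = f_t(x_{\text{text}})
```

Every term in this form is computed from matched pairs or from a single modality's batch, which is why no negatives or temperature appear.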

Role In The Wiki

LeVLJEPA is the current bridge between the wiki’s LeJEPA/SIGReg thread and the VL-JEPA/selective-readout thread. It is especially relevant to TSL-JEPA because it shows a concrete way to train cross-modal embeddings without making autoregressive text generation or contrastive matched-vs-unmatched discrimination the central objective.

The useful abstraction for time-series work is:

paired modality or query-target view -> predicted target embedding -> optional readout

Evidence

See the source page “LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives” for detailed evidence, limitations, and TSFM relevance.

At the entity level, the most important results are:

  • direct symmetric cross-modal MSE collapses in ablations;
  • predictor plus stop-gradient plus per-modality SIGReg is the stable recipe;
  • dense patch-token features beat InfoNCE/SigLIP baselines on ADE20K and COCO-Stuff segmentation;
  • frozen VLM-backbone transfer beats contrastive baselines across GQA, VQAv2, and POPE under Llama-1B and Qwen-1.5B;
  • global zero-shot classification remains stronger for contrastive objectives at DataComp-L scale.
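The first result above can be made concrete with a toy calculation. The sketch below uses plain Python and hypothetical hand-written embeddings, not the paper's ablation setup. It shows why direct symmetric MSE admits a trivial solution: encoders that emit one constant vector reach zero loss while carrying no information. The per-dimension variance check is only an illustrative collapse diagnostic and is not SIGReg.

```python
from statistics import pvariance


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def symmetric_mse(image_embs, text_embs):
    """Average direct cross-modal MSE over a batch of paired embeddings."""
    pairs = list(zip(image_embs, text_embs))
    return sum(mse(zi, zt) for zi, zt in pairs) / len(pairs)


def per_dim_variance(embs):
    """Batch variance of each embedding dimension (all zeros means collapse)."""
    return [pvariance(dim) for dim in zip(*embs)]


# Hypothetical collapsed encoders: every input maps to the same constant vector.
collapsed_images = [[0.5, -0.5]] * 4
collapsed_texts = [[0.5, -0.5]] * 4

# Hypothetical informative encoders: paired embeddings are close, not identical.
healthy_images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
healthy_texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1], [-0.1, -0.9]]

collapsed_loss = symmetric_mse(collapsed_images, collapsed_texts)
healthy_loss = symmetric_mse(healthy_images, healthy_texts)

# The collapsed pair "wins" on the alignment loss alone ...
print("collapsed loss:", collapsed_loss, "healthy loss:", healthy_loss)
# ... but its embeddings have no spread across the batch.
print("collapsed variance:", per_dim_variance(collapsed_images))
print("healthy variance:", per_dim_variance(healthy_images))
```

An alignment-only loss ranks the collapsed encoders above the informative ones, so some additional term or mechanism has to penalize the degenerate embedding distribution.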

TSL-JEPA Relevance

LeVLJEPA does not solve time-series-language learning directly, but it provides a high-value design test for TSL-JEPA:

  • replace contrastive query-label matching with non-contrastive prediction of target embeddings;
  • apply distributional regularization independently per modality or target family;
  • evaluate dense temporal state rather than only pooled class or caption metrics;
  • use a frozen bridge/readout to isolate whether the time-series encoder already exposes the needed state.
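The last two points can be sketched as a frozen-readout probe scored per timestep. Everything here is hypothetical and illustrative: `frozen_encoder` is a hand-written stand-in for a pretrained time-series encoder, the nearest-centroid readout stands in for whatever bridge/readout TSL-JEPA would actually use, and the toy series and labels are made up. The point is only the protocol: the encoder stays fixed, a small readout is fitted, and accuracy is measured on dense per-timestep state rather than on one pooled label per sequence.

```python
def frozen_encoder(series, window=2):
    """Hypothetical frozen encoder: one (local mean, local slope) embedding per timestep.

    Stands in for a pretrained encoder whose weights are never updated.
    """
    embs = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        embs.append((sum(chunk) / len(chunk), chunk[-1] - chunk[0]))
    return embs


def step_direction_labels(series):
    """Dense labels for timesteps 1..n-1: did the series rise or fall at this step?"""
    return ["up" if b > a else "down" for a, b in zip(series, series[1:])]


def fit_centroid_readout(embs, labels):
    """Fit a nearest-centroid readout; this is the only component that is trained."""
    sums, counts = {}, {}
    for emb, label in zip(embs, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, emb):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(emb, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def dense_accuracy(centroids, embs, labels):
    """Per-timestep accuracy of the readout on frozen embeddings."""
    hits = sum(predict(centroids, e) == y for e, y in zip(embs, labels))
    return hits / len(labels)


# Made-up toy series; timestep 0 is dropped because it has no previous step.
train_series = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
test_series = [5, 4, 3, 2, 3, 4, 5, 6]

centroids = fit_centroid_readout(
    frozen_encoder(train_series)[1:], step_direction_labels(train_series)
)
accuracy = dense_accuracy(
    centroids, frozen_encoder(test_series)[1:], step_direction_labels(test_series)
)
print("dense per-timestep accuracy:", accuracy)
```

When a readout this small scores well, the frozen embeddings already expose the state being probed. A low score is ambiguous and would call for a somewhat stronger readout before concluding that the encoder lacks the state.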

Caveats

  • Vision-language evidence should not be treated as proof for numeric time series, event streams, or control inputs.
  • LeVLJEPA still uses stop-gradient targets, so it does not answer the local LeNEPA question of whether SIGReg alone can replace stop-gradient/EMA stabilization.
  • Summaries should report both its dense-feature advantage and its zero-shot weakness; describing it as simply “better than CLIP/SigLIP” would be misleading.