LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

Source

Credibility

Current arXiv preprint from Lukas Kuhn, Giuseppe Serra, Randall Balestriero, and Florian Buettner, with DKFZ, DKTK, Goethe University Frankfurt, and Brown affiliations listed on the project page. The paper is very recent and comes with an official project page, public code, and a released Hugging Face checkpoint. No peer-reviewed venue is listed at ingest time, so claims should be treated as credible early evidence rather than settled benchmark consensus.

Core Claim

LeVLJEPA shows that end-to-end vision-language pretraining can be fully non-contrastive. It trains an image encoder and a text encoder through matched-pair cross-modal prediction with stop-gradient targets, while applying per-modality SIGReg so the vision and text embedding marginals stay close to isotropic Gaussian distributions.

The paper does not claim the best zero-shot image-text alignment: contrastive objectives remain stronger on zero-shot transfer at DataComp-L scale. Its core claim is that non-contrastive vision-language pretraining produces stronger dense semantic features: frozen patch tokens transfer better to semantic segmentation and to frozen-backbone VLM use.

Method Notes

LeVLJEPA trains on paired image-caption data with two symmetric prediction branches:

image encoder -> vision embedding -> vision-to-text predictor -> stop-gradient text embedding
text encoder  -> text embedding   -> text-to-vision predictor -> stop-gradient vision embedding

The objective combines a cross-modal prediction term in both directions with a per-modality SIGReg term.
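
One schematic way to write this combination is given here. The squared-error distance, the single weight \lambda, and the symbols p, sg, and Z are notational assumptions for illustration, not the paper's exact formulation:

```latex
\mathcal{L}
  = \big\lVert p_{v \to t}(z_v) - \operatorname{sg}(z_t) \big\rVert_2^2
  + \big\lVert p_{t \to v}(z_t) - \operatorname{sg}(z_v) \big\rVert_2^2
  + \lambda \,\big[ \operatorname{SIGReg}(Z_v) + \operatorname{SIGReg}(Z_t) \big]
```

Here z_v and z_t are the vision and text embeddings of a matched pair, p_{v \to t} and p_{t \to v} are the two predictors, sg(.) is stop-gradient, and Z_v and Z_t are the per-modality batches of embeddings that the regularizer sees.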

Important implementation details:

  • no negatives;
  • no temperature parameter;
  • no momentum encoder;
  • no teacher-student schedule;
  • stop-gradient targets are still used for cross-modal asymmetry;
  • SIGReg is applied independently to vision and text embeddings;
  • the main DataComp-L checkpoint uses a ViT-B/16 image encoder, a GPT-2 text encoder, and a 256-dimensional embedding space, trained on DataComp-large for 200k steps.
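
The ingredients listed above can be sketched as a forward-pass loss. This is a minimal NumPy sketch under stated assumptions, not the released implementation: the predictors are plain linear maps, the prediction distance is mean squared error, and `sliced_gaussianity_penalty` is a simplified stand-in for SIGReg (random one-dimensional projections compared against a standard Gaussian through their characteristic function). All function names, the frequency grid, and the weight `lam` are hypothetical. NumPy has no autodiff, so the stop-gradient appears only as a comment; in a framework with gradients the targets would be detached.

```python
import numpy as np


def sliced_gaussianity_penalty(z, num_slices=64, seed=0):
    """Simplified stand-in for SIGReg (an assumption, not the paper's code).

    Projects embeddings onto random unit directions and measures how far each
    projection's empirical characteristic function is from that of N(0, 1).
    """
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((z.shape[1], num_slices))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    projections = z @ directions                      # (batch, num_slices)
    t = np.linspace(-3.0, 3.0, 17)                    # assumed frequency grid
    # Empirical characteristic function: batch mean of exp(i * t * x).
    ecf = np.exp(1j * projections[:, :, None] * t).mean(axis=0)
    gaussian_cf = np.exp(-0.5 * t ** 2)               # CF of N(0, 1)
    return float(np.mean(np.abs(ecf - gaussian_cf) ** 2))


def cross_modal_loss(z_img, z_txt, w_v2t, w_t2v, lam=1.0):
    """Prediction in both directions plus a per-modality regularizer."""
    # In an autodiff framework, z_txt and z_img on the target side would be
    # wrapped in stop-gradient; here only the forward value is computed.
    pred_txt = z_img @ w_v2t                          # vision-to-text predictor
    pred_img = z_txt @ w_t2v                          # text-to-vision predictor
    prediction = np.mean((pred_txt - z_txt) ** 2) + np.mean((pred_img - z_img) ** 2)
    # The regularizer is applied to each modality independently.
    regularizer = sliced_gaussianity_penalty(z_img) + sliced_gaussianity_penalty(z_txt)
    return {
        "prediction": float(prediction),
        "regularizer": float(regularizer),
        "total": float(prediction + lam * regularizer),
    }
```

With this stand-in, a collapsed batch (all embeddings identical) scores a larger penalty than a batch drawn from an isotropic Gaussian, which is the direction of pressure the regularizer is meant to supply. No negatives, temperature, or momentum encoder appear anywhere in the computation.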

The objective ablation is important for TSL-JEPA: direct symmetric image-text MSE collapses; adding SIGReg alone improves rank but does not recover useful alignment; predictor plus stop-gradient without SIGReg is unstable or underperforms; the full predictor-plus-stop-gradient-plus-SIGReg recipe is the stable variant.
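
The "rank" in this ablation can be tracked with a simple collapse diagnostic. The sketch below assumes the entropy-based effective rank of the singular value spectrum, a common choice; this note does not record which rank measure the paper reports, and the function name is hypothetical.

```python
import numpy as np


def effective_rank(z, eps=1e-12):
    """Entropy-based effective rank of a (batch, dim) embedding matrix."""
    centered = z - z.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]                      # drop numerically zero directions
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))
```

A value near the embedding dimension means variance is spread across directions; a value near 1 indicates dimensional collapse. As the ablation warns, a high rank alone does not imply useful cross-modal alignment, so this check complements alignment metrics rather than replacing them.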

Evidence And Results

LeVLJEPA, InfoNCE, and SigLIP are compared with the same ViT-B/16 encoder, same data, same samples seen, and matching evaluation protocols.

Key reported results:

  • As a frozen VLM backbone with only an MLP bridge trained, LeVLJEPA is strongest across GQA, VQAv2, and POPE under both Llama-1B and Qwen-1.5B.
  • On semantic segmentation with frozen patch tokens and a linear head, LeVLJEPA reports 23.15 mIoU on ADE20K versus 20.90 for InfoNCE and 19.24 for SigLIP, and 31.10 mIoU on COCO-Stuff versus 29.02 and 28.88.
  • On ImageNet-9 background robustness, LeVLJEPA has the smallest reported drop under Mixed-Same and Mixed-Rand background swaps.
  • On global linear probing, the three objectives are close.
  • On DataComp-L zero-shot classification, contrastive objectives remain ahead; the paper interprets that as the metric rewarding the matched-vs-unmatched discrimination they optimize directly.
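
The segmentation margins follow directly from the reported numbers. The small sketch below does the arithmetic, assuming the COCO-Stuff baselines are listed in the same InfoNCE, SigLIP order as the ADE20K ones; the dictionary layout and function name are illustrative only.

```python
# Reported linear-probe mIoU on frozen patch tokens, as quoted in this note.
reported_miou = {
    "ADE20K": {"LeVLJEPA": 23.15, "InfoNCE": 20.90, "SigLIP": 19.24},
    # Baseline order for COCO-Stuff is assumed to match ADE20K.
    "COCO-Stuff": {"LeVLJEPA": 31.10, "InfoNCE": 29.02, "SigLIP": 28.88},
}


def margins(table, ours="LeVLJEPA"):
    """Absolute mIoU margin of `ours` over every other method, per dataset."""
    return {
        dataset: {
            method: round(scores[ours] - score, 2)
            for method, score in scores.items()
            if method != ours
        }
        for dataset, scores in table.items()
    }


segmentation_margins = margins(reported_miou)
```

Under that ordering assumption the margins are 2.25 and 3.91 mIoU on ADE20K and 2.08 and 2.22 mIoU on COCO-Stuff, so the dense-feature advantage is a few points in each case, not an order-of-magnitude gap.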

This makes LeVLJEPA evidence of an output-contract mismatch: pooled zero-shot alignment is not the same property as dense feature quality for downstream VLMs and dense prediction systems.

Relation To TSL-JEPA

LeVLJEPA is directly relevant to the local TSL-JEPA idea because it gives a fresh cross-modal JEPA recipe where language is part of the training interface but token generation is not the main objective. The transferable pattern is:

time-series window + query or target view -> predicted target representation -> optional typed/language readout

The strongest TSL-JEPA analogy is not zero-shot text classification. It is the frozen-transfer and dense-token lesson: a time-series-language model should test whether a frozen time-series encoder already contains dense numeric state, event timing, channel-local structure, rare regimes, and typed properties before asking a decoder to verbalize them.

Concrete TSL-JEPA hypotheses implied by LeVLJEPA:

  1. Compare contrastive query-label alignment against non-contrastive cross-modal prediction with stop-gradient target embeddings and per-modality temporal SIGReg.
  2. Evaluate dense temporal state and typed readouts, not only pooled labels or caption quality.
  3. Use a frozen-transfer protocol: freeze the time-series encoder and downstream language/structured-output model, train only a bridge/readout, and test retrieval, alerting, property extraction, and captioning.
  4. Distinguish this from VL-JEPA: Lukas Kuhn’s X reply says VL-JEPA uses InfoNCE to align a pretrained V-JEPA space to a pretrained language embedding model, while LeVLJEPA performs non-contrastive alignment for pretraining from scratch.
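
Hypothesis 3 can be prototyped before any real encoder exists. The sketch below assumes a random-projection `FrozenEncoder` as a hypothetical stand-in for a pretrained time-series encoder, a closed-form ridge `fit_bridge` as the only trained component, and the window mean as a toy numeric property to read out; none of these names come from the source. For brevity it trains only a readout and omits the frozen downstream language model.

```python
import numpy as np


class FrozenEncoder:
    """Hypothetical stand-in for a pretrained time-series encoder."""

    def __init__(self, window, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.standard_normal((window, dim)) / np.sqrt(window)
        self.weights.setflags(write=False)   # frozen: in-place writes raise

    def __call__(self, x):
        return np.tanh(x @ self.weights)


def fit_bridge(features, targets, ridge=1e-3):
    """Closed-form ridge regression; the bridge is the only trained part."""
    gram = features.T @ features + ridge * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ targets)


def r_squared(pred, true):
    """Coefficient of determination of predictions against targets."""
    residual = np.sum((true - pred) ** 2)
    total = np.sum((true - true.mean()) ** 2)
    return float(1.0 - residual / total)


def run_frozen_probe(seed=0, window=16, dim=64, n_train=2000, n_test=500):
    rng = np.random.default_rng(seed)
    encoder = FrozenEncoder(window, dim, seed=seed + 1)
    before = encoder.weights.copy()
    x_train = rng.standard_normal((n_train, window))
    x_test = rng.standard_normal((n_test, window))
    # Toy numeric property to recover from frozen features: the window mean.
    y_train, y_test = x_train.mean(axis=1), x_test.mean(axis=1)
    bridge = fit_bridge(encoder(x_train), y_train)
    score = r_squared(encoder(x_test) @ bridge, y_test)
    unchanged = bool(np.array_equal(before, encoder.weights))
    return score, unchanged
```

The design point is the separation of roles: the held-out score measures what the frozen representation already contains, because the bridge has too little capacity to learn the property on its own. Swapping the toy target for event timing, alerts, or typed properties keeps the same protocol.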

Relation To Foundation TSFM Agenda

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Context interface | adjacent | LeVLJEPA conditions cross-modal prediction on paired image-caption views and shows language can supervise embeddings without token generation. | Needs time-series query-target ontologies and typed readouts. |
| Latent-state prediction | adjacent | Prediction happens in representation space and can be consumed by a frozen downstream model. | Evidence is image/text, not multivariate numeric time series or event streams. |
| Anti-collapse regularization | partially closes | Per-modality SIGReg prevents collapse in the full recipe and direct MSE collapse is documented in ablations. | Need temporal SIGReg or VISReg-style variants under rare regimes, channels, irregular events, and actions. |
| Dense-state preservation | warning-positive | Dense patch-token results show non-contrastive objectives can beat pooled-contrastive objectives on per-token semantic structure. | Need analogous dense temporal probes for event timing, segment boundaries, numeric properties, and channel identity. |
| Decoder/readout grounding | adjacent | Frozen VLM-backbone protocol isolates whether patch tokens can be read by a frozen language model through a trained bridge. | Need frozen bridge/readout protocols for time-series labels, numeric values, alerts, retrieval, and captions. |

Limitations

  • Evidence is vision-language only; it does not prove the method works for time-series-language systems.
  • The method still uses stop-gradient targets, so it is not the same stabilization story as LeNEPA’s no-stop-gradient temporal SIGReg path.
  • Contrastive objectives still win on the zero-shot alignment protocol at DataComp-L scale.
  • Results use ViT-B/16 and released DataComp-L-scale settings; larger model/data scaling remains an open question in the paper.
  • The Hugging Face checkpoint is marked CC BY-NC 4.0, while the arXiv paper is CC BY 4.0; code/checkpoint reuse should respect artifact-specific licenses.

Open Questions

  • Can TSL-JEPA reproduce the LeVLJEPA dense-token lesson with dense temporal probes rather than visual patch-token probes?
  • Does temporal SIGReg alone suffice for cross-modal time-series/text prediction, or is stop-gradient/predictor asymmetry still required?
  • How should TSL-JEPA encode targets that are labels, numeric properties, intervals, events, captions, retrieval handles, or alerts without collapsing them into generic text embeddings?
  • Can a frozen bridge/readout evaluate time-series representation quality as cleanly as LeVLJEPA’s frozen VLM-backbone protocol evaluates patch-token features?