Next-Embedding Prediction Makes Strong Vision Learners

Source

Core Claim

NEPA trains visual models to predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive loss, and task-specific heads.
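As a sketch of what such an objective could look like, write $z_t$ for the embedding of patch $t$, $f_\theta$ for the causal Transformer, and $\mathrm{sg}(\cdot)$ for stop-gradient. The symbols and the cosine form of the loss are assumptions made for illustration; this note does not record the distance function the paper actually uses.

```latex
\hat{z}_{t+1} = f_\theta(z_1, \dots, z_t),
\qquad
\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1}
\frac{\hat{z}_{t+1}^{\top}\, \mathrm{sg}(z_{t+1})}
     {\lVert \hat{z}_{t+1} \rVert \, \lVert \mathrm{sg}(z_{t+1}) \rVert}
```

Under this reading, the target $z_{t+1}$ is treated as a constant during backpropagation, and no pixel decoder, tokenizer, or negative samples appear in the loss.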

Key Contributions

  • Introduces Next-Embedding Predictive Autoregression for visual SSL.
  • Uses causal masking and stop-gradient in an embedding-prediction setup.
  • Reports strong ImageNet-1K fine-tuning accuracy and semantic-segmentation transfer on ADE20K.
  • Frames generative pretraining from embeddings as a modality-agnostic alternative.
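The causal masking and shifted prediction targets listed above can be illustrated with a small, dependency-free Python sketch. Everything here is hypothetical: `toy_causal_predictor` is a stand-in for the Transformer (it only averages the visible embeddings), and the negative-cosine loss is an assumed choice, not a confirmed detail of NEPA. Plain Python has no gradients, so the stop-gradient is noted in a comment only.

```python
import math

def causal_mask(num_patches):
    """mask[i][j] is True when position i may see position j (j <= i)."""
    return [[j <= i for j in range(num_patches)] for i in range(num_patches)]

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def toy_causal_predictor(embeddings, mask):
    """Hypothetical stand-in for the Transformer: mean of visible embeddings."""
    dim = len(embeddings[0])
    preds = []
    for row in mask:
        visible = [e for e, keep in zip(embeddings, row) if keep]
        preds.append([sum(v[d] for v in visible) / len(visible) for d in range(dim)])
    return preds

def next_embedding_loss(predictions, embeddings):
    """Mean negative cosine between the prediction at t and the embedding at t+1.

    In an autodiff framework the targets (embeddings[1:]) would be wrapped in
    a stop-gradient; here they are plain numbers, so nothing needs detaching.
    """
    pairs = list(zip(predictions[:-1], embeddings[1:]))
    return -sum(cosine(p, z) for p, z in pairs) / len(pairs)

# Usage: three identical patch embeddings, so every prediction matches its target.
patches = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mask = causal_mask(len(patches))
preds = toy_causal_predictor(patches, mask)
loss = next_embedding_loss(preds, patches)
```

Because the mask is lower-triangular, the prediction at position t never depends on patches after t, which is what makes the shifted target at t+1 a genuine prediction rather than a copy.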

Method Notes

NEPA falls under four topic areas: Next-Embedding Prediction, Latent-Space Predictive Learning, Self-Supervised Representation Learning, and Vision Foundation Models.

Evidence And Results

The abstract reports top-1 ImageNet-1K accuracy of 83.8% with ViT-B and 85.3% with ViT-L after fine-tuning, plus effective transfer to semantic segmentation on ADE20K.

Limitations

NEPA still relies on stop-gradient for stable prediction rather than on an explicit anti-collapse regularizer, which makes it a useful contrast with SIGReg-centered sources such as LeJEPA.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Augmentation-free or dataset-aware self-supervision | adjacent | Trains by next-embedding prediction without pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. | Evidence is vision-only and uses image patch order, not temporal numeric streams or heterogeneous datasets. |
| Representation quality: semantic state vs dense detail | adjacent | Predicts future patch embeddings in continuous representation space with a causal Transformer and stop-gradient target. | Does not test dense generation/editing fidelity or action-relevant state in time series. |
| Anti-collapse regularization | insufficient evidence | Uses stop-gradient for stable prediction. | The raw markdown is only a TeX input stub, and the available PDF evidence does not establish time-series collapse diagnostics. |
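For the missing collapse diagnostics, one generic check is the per-dimension spread of embeddings across samples: if every dimension has near-zero standard deviation, the encoder is mapping all inputs to roughly the same vector. The sketch below is not taken from the NEPA paper; the function names and the `threshold` value are assumptions chosen for illustration.

```python
import math

def per_dim_std(embeddings):
    """Population standard deviation of each embedding dimension across samples."""
    n = len(embeddings)
    dim = len(embeddings[0])
    stds = []
    for d in range(dim):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        stds.append(math.sqrt(sum((x - mean) ** 2 for x in column) / n))
    return stds

def looks_collapsed(embeddings, threshold=1e-3):
    """Flag complete collapse: every dimension has spread below the assumed threshold."""
    return all(s < threshold for s in per_dim_std(embeddings))

# Usage: identical embeddings are flagged, varied embeddings are not.
collapsed = looks_collapsed([[0.5, 0.5], [0.5, 0.5]])
healthy = looks_collapsed([[0.0, 1.0], [1.0, 0.0]])
```

This only detects complete collapse. Partial (dimensional) collapse, where some directions stay informative while others die, would need a finer check such as inspecting the individual values returned by `per_dim_std`.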

Open Questions

  • Can NEPA be made collapse-resistant without stop-gradient?
  • How does next-embedding prediction compare to DINOv3 at larger data/model scales?