Next-Embedding Prediction Makes Strong Vision Learners
Source
- Raw Markdown: paper_nepa-2025.md
- PDF: paper_nepa-2025.pdf
Core Claim
NEPA trains visual models to predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive loss, and task-specific heads.
Key Contributions
- Introduces Next-Embedding Predictive Autoregression for visual SSL.
- Uses causal masking and stop-gradient in an embedding-prediction setup.
- Reports strong ImageNet fine-tuning and ADE20K transfer.
- Frames generative pretraining from embeddings as a modality-agnostic alternative.
Method Notes
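NEPA is a next-embedding prediction method: a causally masked Transformer predicts future patch embeddings in continuous representation space, with stop-gradient applied to the target. Within the wiki it falls under Next-Embedding Prediction, Latent-Space Predictive Learning, Self-Supervised Representation Learning, and Vision Foundation Models.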
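The ingredients named in this note (causal masking, a next-position target, and a target treated as a constant) can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names are hypothetical, and the negative cosine similarity loss is an assumption chosen for illustration. Without an autodiff framework, stop-gradient is only indicated by a comment.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(predictions, targets):
    """Mean negative cosine similarity between the prediction made at
    position t and the target embedding at position t + 1.

    Assumption: in an autodiff framework the targets would be wrapped in a
    stop-gradient so no gradient flows through them; here they are plain
    constants.
    """
    pairs = list(zip(predictions[:-1], targets[1:]))
    return -sum(cosine(p, t) for p, t in pairs) / len(pairs)


# Toy patch embeddings for a 3-patch sequence (made-up values).
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Predictions whose direction matches the next target at every position.
aligned = [[0.0, 2.0], [3.0, 3.0], [9.0, 9.0]]

mask = causal_mask(3)
loss = next_embedding_loss(aligned, targets)
```

With directionally perfect predictions the loss approaches its minimum of -1. The last prediction has no next target and is dropped by the shift.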
Evidence And Results
The abstract reports 83.8% and 85.3% top-1 ImageNet-1K accuracy with ViT-B and ViT-L after fine-tuning, plus effective semantic-segmentation transfer.
Limitations
NEPA still relies on stop-gradient for stable prediction, which makes it a useful contrast with SIGReg-centered sources such as LeJEPA.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Augmentation-free or dataset-aware self-supervision | adjacent | Trains by next-embedding prediction without pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. | Evidence is vision-only and uses image patch order, not temporal numeric streams or heterogeneous datasets. |
| Representation quality: semantic state vs dense detail | adjacent | Predicts future patch embeddings in continuous representation space with a causal Transformer and stop-gradient target. | Does not test dense generation/editing fidelity or action-relevant state in time series. |
| Anti-collapse regularization | insufficient evidence | Uses stop-gradient for stable prediction. | The raw markdown is only a TeX input stub, and the available PDF evidence does not establish time-series collapse diagnostics. |
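As an illustration of what a collapse diagnostic might look like for the anti-collapse slot, the sketch below checks whether embeddings have near-zero spread in every dimension. This is a generic variance-based check, not something the paper reports; the function names and the threshold value are hypothetical.

```python
import statistics


def per_dimension_std(embeddings):
    """Population standard deviation of each embedding dimension across samples."""
    return [statistics.pstdev(dim) for dim in zip(*embeddings)]


def looks_collapsed(embeddings, threshold=1e-4):
    """True when every dimension has almost no spread across samples.

    Assumption: the threshold is arbitrary and would need tuning to the
    scale of the embeddings being inspected.
    """
    return max(per_dimension_std(embeddings)) < threshold


# Identical embeddings for every sample: a fully collapsed representation.
collapsed = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
# Embeddings that vary across samples.
varied = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
```

A check like this only flags complete collapse to a single point. Dimensional collapse, where embeddings occupy a low-rank subspace, would need a covariance- or rank-based test instead.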
Links Into The Wiki
- Foundation Time-Series Model Research Agenda
- Next-Embedding Prediction
- Latent-Space Predictive Learning
- Self-Supervised Representation Learning
- Vision Foundation Models
Open Questions
- Can NEPA be made collapse-resistant without stop-gradient?
- How does next-embedding prediction compare to DINOv3 at larger data/model scales?