Next-Embedding Prediction Makes Strong Vision Learners

Source

Raw Markdown: paper_nepa-2025.md
PDF: paper_nepa-2025.pdf

Core Claim

NEPA trains visual models to predict future patch embeddings with a simple Transformer objective, avoiding pixel reconstruction, discrete tokens, contrastive loss, and task-specific heads.
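The objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the negative-cosine-similarity distance and the names `cosine` and `next_embedding_loss` are assumptions made for this example, and plain Python lists stand in for tensors. In an autodiff framework, the `targets` argument is where the stop-gradient would be applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(predicted, targets):
    """Mean negative cosine similarity between predicted[t] and targets[t + 1].

    predicted: one output vector per patch position (hypothetical model output).
    targets:   one embedding per patch position, treated as constants
               (the stop-gradient side; an assumption of this sketch).
    """
    # Shift by one: position t predicts position t + 1.
    # The last prediction has no target and the first target has no predictor.
    pairs = list(zip(predicted[:-1], targets[1:]))
    return -sum(cosine(p, z) for p, z in pairs) / len(pairs)
```

The loss approaches -1 when every prediction points in the same direction as the next patch's embedding, and it ignores vector magnitude, which is a property of the cosine choice assumed here rather than something this page confirms about NEPA.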

Key Contributions

Introduces Next-Embedding Predictive Autoregression for visual SSL.
Uses causal masking and stop-gradient in an embedding-prediction setup.
Reports strong ImageNet fine-tuning and ADE20K transfer.
Frames generative pretraining from embeddings as a modality-agnostic alternative.
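To make the causal-masking contribution concrete, the sketch below builds a lower-triangular mask and applies it to a matrix of attention scores, so that each patch position only attends to itself and earlier positions. The function names and the list-of-lists representation are assumptions for illustration; this page does not describe the paper's attention implementation.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def causal_attention_weights(scores):
    """Row-wise softmax over an n x n score matrix, with future positions zeroed."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        # Subtract the largest allowed score for numerical stability.
        top = max(scores[i][j] for j in range(n) if mask[i][j])
        exps = [math.exp(scores[i][j] - top) if mask[i][j] else 0.0
                for j in range(n)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With this mask, the output at position t cannot depend on patches after t, which is what makes predicting the embedding at t + 1 a non-trivial task.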

Method Notes

NEPA falls under four topics: Next-Embedding Prediction, Latent-Space Predictive Learning, Self-Supervised Representation Learning, and Vision Foundation Models.

Evidence And Results

The abstract reports 83.8% and 85.3% top-1 ImageNet-1K accuracy with ViT-B and ViT-L, respectively, after fine-tuning, plus effective semantic-segmentation transfer on ADE20K.

Limitations

NEPA still depends on stop-gradient for stable prediction, which makes it a useful contrast with SIGReg-centered sources such as LeJEPA.
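The failure that stop-gradient is usually meant to guard against in embedding-prediction setups is representational collapse, where every input maps to nearly the same embedding and the prediction task becomes trivial. One common, simple check is the per-dimension standard deviation of the embeddings across a batch. The sketch below is a generic diagnostic under that assumption, with a hypothetical function name; it is not taken from the NEPA paper.

```python
import math


def mean_feature_std(embeddings):
    """Average per-dimension standard deviation across a batch of embeddings.

    Values near zero suggest the embeddings have collapsed to a single point.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    stds = []
    for k in range(dim):
        column = [e[k] for e in embeddings]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        stds.append(math.sqrt(variance))
    return sum(stds) / dim
```

Tracking this value during training, with and without stop-gradient, is one way the contrast with regularizer-based approaches could be examined empirically.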

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Augmentation-free or dataset-aware self-supervision	adjacent	Trains by next-embedding prediction without pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads.	Evidence is vision-only and uses image patch order, not temporal numeric streams or heterogeneous datasets.
Representation quality: semantic state vs dense detail	adjacent	Predicts future patch embeddings in continuous representation space with a causal Transformer and stop-gradient target.	Does not test dense generation/editing fidelity or action-relevant state in time series.
Anti-collapse regularization	insufficient evidence	Uses stop-gradient for stable prediction.	The raw markdown is only a TeX input stub, and the available PDF evidence does not establish time-series collapse diagnostics.

Links Into The Wiki

Open Questions

Can NEPA be made collapse-resistant without stop-gradient?
How does next-embedding prediction compare to DINOv3 at larger data/model scales?
