Next-Embedding Prediction
Summary
Next-embedding prediction trains a sequence model to predict future embeddings rather than reconstructing raw observations. It sits between reconstruction-style modeling and JEPA: the target is already in representation space, but the basic recipe can be simpler than a full joint-embedding setup with separate context and target encoders.
For time series, the important design question is where the target embedding comes from. If the target embedding is too local, it may miss useful context. If it is already too contextual, the predictor may be fitting an easier target from which patch-level state has already been mixed away.
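A minimal sketch of the objective on a toy series, to make the "where does the target come from" question concrete. Everything here is an assumption for illustration: the linear patch embedder, the negative-cosine loss, and the persistence "predictor" are hypothetical stand-ins, not the NEPA or EIDOS implementations. In a real autograd setup the targets would also be detached (stop-gradient), which plain Python cannot show.

```python
import math
import random

def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches, dropping any remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def embed_patch(patch, weights):
    """Patch-independent embedding: a linear map that sees one patch only."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def next_embedding_loss(preds, targets):
    """Mean negative cosine similarity between preds[t] and targets[t + 1]."""
    pairs = list(zip(preds[:-1], targets[1:]))
    return -sum(cosine(p, z) for p, z in pairs) / len(pairs)

rng = random.Random(0)
series = [math.sin(0.3 * t) for t in range(64)]
patch_len, dim = 8, 4
# Hypothetical random embedder weights; a trained CNN embedder would go here.
weights = [[rng.gauss(0, 1) for _ in range(patch_len)] for _ in range(dim)]

# Targets are the patch embeddings themselves (conceptually stop-gradient).
targets = [embed_patch(p, weights) for p in patchify(series, patch_len)]

# Stand-in predictor: "the next embedding equals the current one".
# A causal sequence model over past embeddings would replace this.
preds = targets
loss = next_embedding_loss(preds, targets)
print(f"persistence-baseline loss: {loss:.3f}")
```

The loss lies in [-1, 1], and a predictor that outputs exactly the next target reaches -1. Swapping `embed_patch` for a function that sees several patches is the change the target-layer discussion is about.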
What The Wiki Currently Believes
- NEPA introduces next-embedding predictive autoregression for visual SSL: embed patches, then predict future patch embeddings without reconstructing pixels, discrete tokens, contrastive pairs, or task-specific heads.
- EIDOS adapts the next-embedding idea to time-series forecasting with point-wise scalar embeddings, stop-gradient on the target branch, and observation-space grounding.
- LeWorldModel uses next-embedding prediction inside a JEPA-style world model, adding Gaussian regularization to stabilize end-to-end latent prediction from pixels.
- The local dynamic-curriculum notes treat next-embedding prediction as a useful diagnostic for target-layer choice before applying surprise-based sampling to JEPA-style training.
Relation To JEPA
NEPA should be treated as a close relative of JEPA, but not be collapsed into it.
The overlap is the latent target: both avoid raw reconstruction as the main prediction target. The difference is the recipe. NEPA starts from an embedding layer and predicts the next embedding. JEPA is the broader joint-embedding family, where context and target views are encoded and the predictor learns to match target representations, often with additional anti-collapse or distribution-shaping constraints.
That means a NEPA failure mode is a warning for JEPA design, but it is not automatically evidence about JEPA. When a JEPA curriculum uses latent prediction surprise as a sampling signal, it should still ablate how the target representation is built.
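One way to picture "latent prediction surprise as a sampling signal" is a softmax over per-window prediction losses. This is a sketch under assumptions: the softmax form, the `temperature` parameter, and the loss values are all hypothetical, and the notes do not specify the actual sampler.

```python
import math
import random

def surprise_weights(losses, temperature=1.0):
    """Softmax over per-window losses: higher loss gets a higher sampling weight."""
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]

def sample_windows(losses, k, temperature=1.0, seed=0):
    """Draw k window indices with replacement, proportional to surprise."""
    rng = random.Random(seed)
    probs = surprise_weights(losses, temperature)
    return rng.choices(range(len(losses)), weights=probs, k=k)

# Made-up per-window latent prediction losses; window 2 is the "surprising" one.
window_losses = [0.10, 0.12, 0.95, 0.11]
probs = surprise_weights(window_losses, temperature=0.25)
batch = sample_windows(window_losses, k=8, temperature=0.25)
print([round(p, 3) for p in probs])
print(batch)
```

Because the weights are a direct function of the loss, any change in how the target is built changes which windows look surprising. That is the reason to ablate the target before trusting the sampler.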
Time-Series Target-Layer Note
In a NEPA-style setup, a CNN-style embedder encodes each patch independently. That baseline trains well.
Two changes make the setup fragile:
- replacing the independent CNN patch embedder with a more sequence-dependent path, such as a Mamba/H-Net-like dynamic patching path;
- predicting an internal Transformer layer instead of the initial embedding layer.
The dynamic-patching change gives mixed results: one case works, another does not. The internal-layer target is worse in the reported setup: moving from the embedding layer to an internal Transformer layer cuts quality roughly in half.
The practical conclusion is simple: this NEPA setup works best when the target embedding is still patch-independent. Once the target already mixes information across patches, the prediction problem changes and quality can collapse.
For JEPA, this is a design warning, not a direct result. Surprise-controlled curricula should check whether the target encoder introduces cross-patch dependence before using latent prediction loss as the value signal for selecting windows.
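The cross-patch dependence check can be run as a perturbation test: change one patch and see which target embeddings move. The sketch below uses two toy target functions (hypothetical stand-ins, not the CNN embedder or Transformer layer from the notes) to show what a patch-independent and a contextual target look like under that test.

```python
def independent_target(patches):
    """Each embedding depends on its own patch only (here: mean and range)."""
    return [[sum(p) / len(p), max(p) - min(p)] for p in patches]

def contextual_target(patches):
    """Causal running average of the independent embeddings.

    Position t mixes information from patches 0..t, loosely mimicking a
    target taken from a layer that has already attended across patches.
    """
    base = independent_target(patches)
    out, acc = [], [0.0] * len(base[0])
    for t, z in enumerate(base, start=1):
        acc = [a + v for a, v in zip(acc, z)]
        out.append([a / t for a in acc])
    return out

def affected_positions(target_fn, patches, index, bump=1.0, tol=1e-9):
    """Perturb one patch and return the positions whose target embedding moves."""
    before = target_fn(patches)
    perturbed = [list(p) for p in patches]
    perturbed[index][0] += bump
    after = target_fn(perturbed)
    return [
        t for t, (a, b) in enumerate(zip(before, after))
        if max(abs(x - y) for x, y in zip(a, b)) > tol
    ]

patches = [[float(4 * i + j) for j in range(4)] for i in range(5)]
print(affected_positions(independent_target, patches, index=1))  # [1]
print(affected_positions(contextual_target, patches, index=1))   # [1, 2, 3, 4]
```

A target that reports only the perturbed position is patch-independent in the sense used above. Any additional positions indicate that the target already carries cross-patch information.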
Relation To Foundation TSFM Agenda
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Latent-state prediction | adjacent | NEPA and EIDOS predict embeddings rather than raw observations, making latent prediction a first-class objective. | Need high-dimensional streaming time-series tests where the latent tracks regime, state, and rare events. |
| Representation quality | warning | The local time-series note shows target-layer choice can dominate whether next-embedding prediction preserves patch-level state. | Run public ablations over independent patch targets, contextual targets, and internal-layer targets. |
| Anti-collapse regularization | warning | NEPA relies on stop-gradient, while LeWorldModel adds Gaussian regularization in a JEPA-style world model. | Test whether SIGReg/Gaussian regularization can replace stop-gradient without losing rare or local state. |
| Data diversity, curriculum, and long tail | adjacent | Target quality affects whether latent prediction surprise is a useful sampler signal for long-tailed temporal corpora. | Validate surprise-based curricula with target-layer ablations and rare-state metrics. |
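For the anti-collapse row, it helps to see what a Gaussian-shaped penalty reacts to. The sketch below is a deliberately simple moment-matching stand-in (zero mean, unit variance per dimension). It is not SIGReg and not the LeWorldModel regularizer; the function name and the penalty form are assumptions chosen only to illustrate the idea.

```python
import random

def gaussian_moment_penalty(embeddings):
    """Mean over dimensions of mean^2 + (variance - 1)^2 across the batch.

    Zero when every dimension has mean 0 and variance 1; large when the
    batch collapses to a single point.
    """
    n, dim = len(embeddings), len(embeddings[0])
    total = 0.0
    for d in range(dim):
        col = [e[d] for e in embeddings]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        total += mean ** 2 + (var - 1.0) ** 2
    return total / dim

rng = random.Random(0)
# Roughly standard-normal embeddings versus a fully collapsed batch.
healthy = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
collapsed = [[0.5, 0.5, 0.5, 0.5] for _ in range(2000)]

penalty_healthy = gaussian_moment_penalty(healthy)
penalty_collapsed = gaussian_moment_penalty(collapsed)
print(f"healthy: {penalty_healthy:.4f}  collapsed: {penalty_collapsed:.4f}")
```

A penalty of this kind only constrains batch-level statistics, so it says nothing by itself about whether rare or local state survives. That gap is what the missing-pieces column asks to test.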
Open Questions
- Which target layer is best for next-embedding prediction on numeric time series: independent patch embeddings, point-wise embeddings, contextual embeddings, or internal Transformer layers?
- Can a contextual target be made useful without erasing patch-level state?
- Does observation-space grounding, as in EIDOS, prevent the target from drifting away from dense numeric detail?
- Can SIGReg-style regularization replace stop-gradient in NEPA-style time-series training?
- When is next-embedding prediction enough, and when does the model need a fuller JEPA-style context and target interface?