Latent-Space Predictive Learning

Summary

Latent-space predictive learning trains models to predict future representations, not only future raw observations. Its central promise is noise suppression; its central risk is learning the easiest predictable latent factors rather than the factors needed by the downstream task.

In the wiki’s latent-state time-series framing, latent prediction is valuable when the latent target tracks system state, regime, constraints, and plausible futures better than raw observation prediction. It is not automatically sufficient: predictable latents can still ignore rare but decision-relevant changes.

What The Wiki Currently Believes

CHARM applies latent prediction to multivariate time-series embeddings with channel descriptions, keeping the output at the time-channel representation level rather than reconstructing raw values.
EIDOS uses latent-space predictive learning for time-series forecasting robustness, with point-wise SiGLU scalar embeddings, a future-segment target aggregator, stop-gradient, and observation grounding.
VJEPA adds the probabilistic branch: predict an explicit distribution over future latent states rather than one embedding. The evidence boundary is that the evaluated diagonal-Gaussian head is unimodal, so it does not demonstrate separated multi-modal futures.
Joint Embedding Predictive Architectures Focus on Slow Features warns that latent prediction can prefer static or slowly changing distractors over action-relevant state.
When Does LeJEPA Learn a World Model? gives a positive latent-prediction condition: when positive pairs come from a Gaussian OU-style latent process and the embedding is whitened or Gaussian, alignment penalizes nonlinear components enough to recover true latent state up to rotation.
Learn From Your Own Latents And Not From Tokens gives a complementary sample-complexity condition: on a hidden hierarchical grammar, predicting the model’s own recovered latents can avoid the exponential surface-token bottleneck.
NextLat turns own-latent prediction into a practical Transformer training objective: predict the model’s own next hidden state from the current hidden state plus next token, while retaining ordinary next-token supervision. Treat it as JEPA-adjacent belief-state pressure, not as a pure external target-encoder JEPA.
LeNEPA is the local MILETS 2026 time-series result for this cluster: no-augmentation next-latent prediction, temporal SIGReg, and fixed-recipe frozen-probe evaluation across PTB-XL and Aionoscope Diag. The LeNEPA idea page now tracks follow-up target-family comparisons against EIDOS-style grounding and NextLat-style own-hidden targets.
The Illusion of Superposition is not a predictive-learning source, but it is an important latent-interface warning: continuous latent thoughts do not automatically preserve multiple candidate continuations.
LeWorldModel predicts future latent states conditioned on actions for control.
Temporal Straightening separates next-latent predictability from planner-facing geometry: a local curvature loss aligns consecutive latent velocities and tests whether Euclidean latent distance becomes a better goal cost for gradient-based planning.
AdaJEPA keeps the same latent prediction interface but updates selected model parameters at test time from observed action-conditioned transitions before replanning.
Sensorimotor World Models predicts future latent states while using inverse dynamics to make latent transitions action-informative and compact around controllable degrees of freedom.
SkyJEPA predicts future latent dynamics for quadrotor control and uses a physics-inspired prober to lift latent rollouts back into metric state before MPPI planning.
Looped World Models predicts through action-conditioned latent transitions with repeated shared-block refinement, spectral retention, and deferred terminal decoding.
stable-worldmodel turns that action-conditioned latent-prediction line into a reproducible platform question: evaluation must compare latent prediction quality, planning success, solver behavior, and distribution-shift factors separately.
Genie adds a latent-action discovery boundary case: it learns discrete action-like codes from image/video transitions, then predicts future video tokens conditioned on those codes. Treat this as latent action/interface evidence, not as direct numeric TSFM evidence.
Next-Embedding Prediction predicts future visual patch embeddings and tracks the NEPA-style target-layer sensitivity note separately from the broader JEPA page.
RAEv2 is not a JEPA source, but it matters here because it treats REPA as x-prediction in RAE latent space and uses that prediction head for internal guidance.
Reconstruction or Semantics? evaluates which latent spaces make robotic diffusion world models useful.
Self-Teaching Autoencoder is adjacent rather than predictive over time: it trains a decoder by matching transformed latent representations instead of reconstructing pixels directly.
Time Series Forecasting Using Manifold Learning is a classical embed-predict-lift baseline: create a low-dimensional manifold embedding, forecast in latent space, then lift the forecast back to observation space.
VL-JEPA extends latent-space predictive learning to vision-language targets by predicting a target text embedding instead of reconstructing tokens.
LeVLJEPA tests the non-contrastive version of cross-modal latent prediction: image and text embeddings predict each other with stop-gradient/predictor asymmetry and SIGReg as the stabilizer.
VLA-JEPA extends latent-space predictive learning to VLA pretraining by predicting future V-JEPA2 state embeddings from current observations, language context, and latent-action tokens.
OTF-LAM-Dino adds a latent-action variant where future prediction happens in a frozen DINOv2 representation space, reducing pressure to reconstruct pixel-level nuisance detail while testing whether factorized observed-transition primitives can support action-like latent dynamics.
World Models is the historical action-conditioned latent prediction anchor: encode pixels into $z_t$, predict $z_{t+1}$ from $a_t$, $z_t$, $h_t$, then use the recurrent state for control.
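The action-conditioned interface in the World Models entry above can be sketched in a few lines. This is a minimal illustration, not the original architecture: a deterministic linear map stands in for the predictive head, a single tanh layer stands in for the recurrent cell, and the names `latent_step`, `imagine`, `W_z`, and `W_h` as well as all sizes and weights are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def latent_step(z, a, h, W_z, W_h):
    """Predict z_{t+1} from (a_t, z_t, h_t) and update the recurrent state."""
    inp = z + a + h                                    # list concatenation [z_t, a_t, h_t]
    z_next = matvec(W_z, inp)                          # predicted next latent
    h_next = [math.tanh(v) for v in matvec(W_h, inp)]  # toy recurrent update
    return z_next, h_next

def imagine(z0, h0, actions, W_z, W_h):
    """Roll the latent model forward on its own predictions, one action per step."""
    z, h, traj = z0, h0, []
    for a in actions:
        z, h = latent_step(z, a, h, W_z, W_h)
        traj.append(z)
    return traj, h

# Hypothetical sizes: 2-D latent, 1-D action, 2-D recurrent state.
W_z = [[0.9, 0.0, 0.5, 0.1, 0.0],
       [0.0, 0.8, 0.0, 0.0, 0.1]]
W_h = [[0.3, 0.0, 0.2, 0.5, 0.0],
       [0.0, 0.3, 0.0, 0.0, 0.5]]
traj, h_final = imagine([1.0, -1.0], [0.0, 0.0], [[1.0], [0.0], [-1.0]], W_z, W_h)
```

The rollout never decodes back to pixels: each predicted latent is fed straight into the next step, and only `h_final` (or the latent trajectory) would be handed to a controller.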

Evidence

The corpus repeatedly treats latent prediction as a way to suppress irrelevant surface noise while retaining task-relevant dynamics. World Models (2018) is the early action-conditioned version of that claim: predict the next latent visual state under an action, then use the dynamics state for control or imagined training.

JEPA Slow Features shows why this should be tested rather than assumed: fixed distractors can be more predictable than the state variable. LeJEPA Identifiability gives the complementary positive condition: with a suitable latent process and Gaussian/whitening constraint, latent alignment recovers state rather than only an arbitrary predictable feature. Own Latents gives a different positive condition: when data has a recoverable hidden hierarchy, predicting a level's latent once that level is learned can make the next level statistically local rather than surface-token-limited. Illusion of Superposition supplies the negative reasoning-side analogue: a continuous latent interface can still collapse to a discrete commitment or answer shortcut unless the objective and capacity make the latent state causally useful.

CHARM adds an early multivariate time-series variant with channel descriptions and an EMA target encoder. EIDOS adds a forecasting-specific variant: it removes the auxiliary target encoder common in some JEPA-style systems and keeps latent targets grounded in observed numeric values. VL-JEPA and LeVLJEPA add the vision-language versions: predict continuous answer or cross-modal embeddings, with LeVLJEPA testing whether negatives can be replaced by stop-gradient/predictor asymmetry plus SIGReg. The manifold-learning source is older and non-neural, but it makes the same decomposition explicit enough to serve as a baseline vocabulary: embed, predict, lift.
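The embed, predict, lift vocabulary can be made concrete with a toy example. This is a sketch under loud assumptions: a single linear direction found by power iteration stands in for the manifold-learning step of the source, an AR(1) fit stands in for the latent forecaster, and the data, loadings, and function names are all hypothetical.

```python
import math

def top_direction(X, iters=50):
    """Power iteration on the uncentered second-moment matrix X^T X."""
    d = len(X[0])
    u = [1.0] * d
    for _ in range(iters):
        proj = [sum(xi * ui for xi, ui in zip(x, u)) for x in X]        # X u
        w = [sum(p * x[j] for p, x in zip(proj, X)) for j in range(d)]  # X^T (X u)
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    return u

def embed(x, u):
    """Embed: project an observation onto the latent direction."""
    return sum(xi * ui for xi, ui in zip(x, u))

def fit_ar1(z):
    """Predict: least-squares AR(1) coefficient for z_t = phi * z_{t-1}."""
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    return num / den

def lift(z, u):
    """Lift: map a latent value back to observation space."""
    return [z * ui for ui in u]

# Toy two-channel series driven by one decaying hidden state (hypothetical data).
loadings = [1.0, 0.5]
state = [0.9 ** t for t in range(21)]
series = [[s * w for w in loadings] for s in state]
history, truth = series[:-1], series[-1]

u = top_direction(history)
z = [embed(x, u) for x in history]
phi = fit_ar1(z)
forecast = lift(phi * z[-1], u)   # one-step forecast in observation space
```

Because the toy data is exactly rank one and noise-free, the lifted forecast matches the held-out observation; real series would need a nonlinear embedding and a lift step whose reconstruction error is measured separately from the latent forecast error.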

NextLat adds a training-side bridge between own-latent theory and ordinary language-model training. It keeps token prediction as the surface task but supervises the hidden trajectory so that $h_t$ becomes a compact state from which $h_{t+1}$ is predictable. That makes it useful evidence for belief-state pressure, while still leaving the TSFM question open: replace the next-token transition input with typed events, exogenous variables, actions, control inputs, or interventions and check whether dense numeric state survives.
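One way to write the combined objective described above, as a sketch under stated assumptions rather than NextLat's published loss: the predictor $g_\phi$, the weight $\lambda$, the squared-error distance, and the stop-gradient $\operatorname{sg}$ on the target are all illustrative choices not specified on this page.

$$
\mathcal{L}(\theta, \phi) = -\sum_{t} \log p_\theta(x_{t+1} \mid h_t) + \lambda \sum_{t} \left\lVert g_\phi(h_t, x_{t+1}) - \operatorname{sg}(h_{t+1}) \right\rVert_2^2
$$

The first term is ordinary next-token supervision; the second asks that $h_{t+1}$ be predictable from $h_t$ and the next token. The TSFM variant raised above would swap $x_{t+1}$ in the second term for a typed event, exogenous variable, action, or intervention $u_{t+1}$.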

LeNEPA now operationalizes part of this bridge in time-series SSL: external next-latent targets plus temporal SIGReg can work under a fixed-recipe stress test. The remaining target-choice question moves to the follow-up LeNEPA idea: external embeddings may preserve local numeric detail, own-hidden targets may improve compact belief state, and distribution regularization may reduce collapse but can still erase rare or action-relevant state.

Temporal Straightening adds a downstream-optimization criterion: a latent can predict the next state yet remain awkward for action-sequence optimization. Its evidence is visual control, and its linear-theory scope does not establish that all useful time-series latent trajectories should be smooth or Euclidean; abrupt events and irreversible transitions may require event-aware or directional geometry.
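A local curvature loss of the kind described above can be sketched as one minus the cosine similarity between consecutive latent velocities. The cosine form, the function names, and the toy trajectories are assumptions for illustration; the source's exact loss may differ.

```python
import math

def velocities(traj):
    """Latent velocities v_t = z_{t+1} - z_t along a trajectory."""
    return [[b - a for a, b in zip(z0, z1)] for z0, z1 in zip(traj, traj[1:])]

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def curvature_loss(traj):
    """Mean of 1 - cos(v_t, v_{t+1}); zero when the trajectory is straight."""
    v = velocities(traj)
    pairs = list(zip(v, v[1:]))
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical 2-D latent trajectories (at least three points are required).
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]

loss_straight = curvature_loss(straight)
loss_zigzag = curvature_loss(zigzag)
```

The zigzag trajectory is penalized even though each of its steps could be perfectly predictable, which is the separation the paragraph draws. The same penalty would also fire on a genuine abrupt regime change, which is why the caveat about event-aware geometry matters for time series.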

AdaJEPA contributes the deployment-time version of the interface: latent prediction loss is not only an offline objective but also the self-supervised update used after each executed action. Sensorimotor World Models contributes an objective-choice version of the interface: the latent transition is regularized by the ability to recover the action, so the latent state is encouraged to preserve controllable structure rather than an externally prescribed embedding distribution. Looped World Models contributes a transition-compute version of the interface: latent prediction is not only a target choice but also a repeated refinement process under actions. SkyJEPA contributes the structured-readout version: latent prediction can suppress direct state-prediction compounding error, but a physically constrained prober is needed before the rollout is useful for control costs and constraints.

RAEv2 contributes a generation-side version of the same interface: the REPA head predicts the clean latent representation, and the model can use that weaker prediction as an internal guidance branch. VLA-JEPA contributes the VLA-policy version: latent prediction is used to shape transition tokens that later condition continuous control-input generation. Self-Teaching Autoencoder adds another generation-side variant: use transformed latent agreement as the reconstruction signal itself.

For time-series work, the transferable question is whether auxiliary latent heads can regularize useful intermediate structure without requiring a separate model or a second inference pass, while still grounding outputs enough to preserve dense values.
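The deployment-time use of latent prediction loss attributed to AdaJEPA above can be illustrated with a toy loop: after each executed action, take one gradient step on the squared latent prediction error for the transition just observed. This is a sketch only, not AdaJEPA's procedure; the scalar linear predictor, the learning rate, the stand-in environment, and every name below are hypothetical.

```python
def predict(w, b, z, a):
    """Scalar latent predictor: z_next_hat = w * z + b * a."""
    return w * z + b * a

def adapt_step(w, b, z, a, z_next, lr=0.1):
    """One gradient step on 0.5 * (prediction - z_next)^2 for one transition."""
    err = predict(w, b, z, a) - z_next
    return w - lr * err * z, b - lr * err * a

# Hypothetical true latent dynamics, used only to generate observed transitions.
TRUE_W, TRUE_B = 0.8, 0.5

def environment(z, a):
    return TRUE_W * z + TRUE_B * a

w, b = 0.5, 0.0            # miscalibrated model at deployment time
z = 1.0
errors = []
for a in [1.0, 1.0, -1.0, 0.0] * 5:
    z_next = environment(z, a)                          # execute action, observe
    errors.append(abs(predict(w, b, z, a) - z_next))    # error before adapting
    w, b = adapt_step(w, b, z, a, z_next)               # self-supervised update
    z = z_next                                          # replanning would follow here
```

In this noise-free toy the parameters move toward the true dynamics, but a falling one-step error is only a proxy: it says nothing by itself about whether the ranking of candidate action sequences improves.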

On Training in Imagination adds a control-facing representation criterion: smoother latent rollout geometry and lower Lipschitz constants can tighten return-error bounds, but only if they do not increase dynamics error.
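The trade-off can be seen in the standard error-propagation argument, given here as a textbook sketch rather than the source's exact theorem. Assume the learned latent dynamics $\hat{f}$ is $K$-Lipschitz in the latent, its one-step error against the true dynamics $f$ is at most $\varepsilon$, and both rollouts start from the same $z_0$ under the same actions; the symbols $K$, $\varepsilon$, and $e_k = \lVert \hat{z}_k - z_k \rVert$ are notation introduced for this example.

$$
e_{k+1} = \lVert \hat{f}(\hat{z}_k, a_k) - f(z_k, a_k) \rVert \le \lVert \hat{f}(\hat{z}_k, a_k) - \hat{f}(z_k, a_k) \rVert + \lVert \hat{f}(z_k, a_k) - f(z_k, a_k) \rVert \le K e_k + \varepsilon
$$

$$
e_0 = 0 \;\Longrightarrow\; e_k \le \varepsilon \sum_{i=0}^{k-1} K^{i}
$$

Lowering $K$ shrinks the geometric factor, but the bound tightens only if the constraint that lowers $K$ does not raise $\varepsilon$ by more, which is the condition stated above.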

Relation To Foundation TSFM Agenda

This page is central to the latent-state and representation-quality slots in the Foundation Time-Series Model Research Agenda. It partially closes the agenda only when latent targets preserve regime, state, dense numeric detail, and plausible futures; latent prediction by itself remains insufficient when it learns slow shortcuts or drops action-relevant variables.

Open Questions

Which latent targets are most stable: learned online targets, pretrained semantic encoders, distribution-regularized embeddings, or action-recovery-grounded embeddings?
How should latent objectives stay grounded enough for high-fidelity generation?
Can transformed latent agreement ground a decoder without sacrificing representation utility?
Which stress tests reveal when latent prediction has learned slow shortcuts rather than transition dynamics?
Which tests reveal whether a latent predictor is linearly identifiable rather than only useful under nonlinear probes?
Can next-hidden-state supervision make a Transformer maintain compact belief state on multivariate time series without erasing rare events, dense numeric detail, or action history?
Can the published LeNEPA baseline identify when an external next-embedding target is better than a NextLat-style own-hidden target, and when a hybrid target is needed?
Which tests reveal whether latent prediction is recovering a real hierarchy rather than clustering a synthetic or nuisance equivalence class?
When are semantic embedding streams sufficient state, and when must they be paired with explicit actions, control inputs, or interventions?
Can auxiliary latent x-prediction heads regularize useful intermediate structure and guide sampling without a separate model or second inference pass in time-series or action-conditioned world models?
Can LeVLJEPA-style non-contrastive cross-modal prediction be adapted to time-series/query-target pairs while preserving local numeric state?
Which latent-state probes show that continuous hidden computation preserves multiple plausible futures rather than collapsing to one shortcut trajectory?
Can latent prediction loss serve as a reliable online adaptation signal for control, or can it improve one-step alignment while harming candidate-action ranking?
Can time-normalized latent-velocity regularization improve candidate-intervention planning on irregular multivariate time series without smoothing away real regime changes and rare events?
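For the slow-shortcut question above, one possible stress test is to report latent prediction loss next to a probe of the true state, since the two can disagree. The sketch below is a toy under stated assumptions, not the protocol of any cited source: the two candidate "encoders" simply select one observation channel, the predictor is copy-last, and the data and names are hypothetical.

```python
import random

def persistence_loss(z):
    """Latent prediction loss of a copy-last predictor: mean (z_{t+1} - z_t)^2."""
    return sum((b - a) ** 2 for a, b in zip(z, z[1:])) / (len(z) - 1)

def linear_probe_mse(z, target):
    """MSE of the best linear fit (with intercept) from latent z to the target."""
    n = len(z)
    mz, mt = sum(z) / n, sum(target) / n
    var_z = sum((v - mz) ** 2 for v in z)
    cov = sum((v - mz) * (t - mt) for v, t in zip(z, target))
    slope = cov / var_z if var_z > 0 else 0.0   # a constant latent explains nothing
    return sum((mt + slope * (v - mz) - t) ** 2 for v, t in zip(z, target)) / n

# Toy observation with a moving state channel and a static distractor channel.
rng = random.Random(0)
state = [0.0]
for _ in range(199):
    state.append(state[-1] + rng.gauss(0.0, 1.0))
distractor = [1.0] * 200

# Each entry: (latent prediction loss, state-probe error) for one candidate latent.
report = {
    name: (persistence_loss(z), linear_probe_mse(z, state))
    for name, z in [("distractor", distractor), ("state", state)]
}
```

The distractor latent reaches zero prediction loss while its probe error equals the full variance of the state, and the state latent shows the reverse pattern. A latent objective scored on prediction loss alone would prefer the distractor.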

Alex Open Research Wiki
