Joint Embedding Predictive Architecture

Summary

JEPA is the wiki’s central pattern for learning by predicting in representation space instead of reconstructing raw observations or generating tokens. The key caveat is that “predictable” does not automatically mean “task-relevant”: latent prediction can focus on slow, stable nuisance features unless the objective or evaluation protocol checks for that.
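
The caveat above can be made concrete with a toy check. This is a minimal sketch, not a reproduction of any cited experiment: the episode simulator, the two hand-written "encoders", and the action-blind identity predictor are all hypothetical stand-ins chosen so the arithmetic is easy to follow. It shows that a latent which keeps only a fixed background value gets a perfect prediction score while carrying no information about the moving state.

```python
import random

def simulate_episode(steps, rng):
    """Return (background, positions): a fixed distractor and a dot moved by random actions."""
    background = rng.uniform(-1.0, 1.0)      # constant within the episode (slow feature)
    pos = 0.0
    positions = []
    for _ in range(steps):
        pos += rng.choice([-1.0, 1.0])       # action-driven change (fast feature)
        positions.append(pos)
    return background, positions

def identity_prediction_error(latents):
    """Mean squared error of predicting z[t+1] with z[t] (assumed action-blind predictor)."""
    diffs = [(b - a) ** 2 for a, b in zip(latents, latents[1:])]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
background, positions = simulate_episode(200, rng)

# Hypothetical encoder that keeps only the static background.
slow_latents = [background] * len(positions)
# Hypothetical encoder that keeps only the dot position.
fast_latents = positions

slow_error = identity_prediction_error(slow_latents)
fast_error = identity_prediction_error(fast_latents)

# The slow latent is perfectly predictable but takes a single value,
# so it cannot distinguish any two dot positions.
distinct_slow_states = len(set(slow_latents))
print(slow_error, fast_error, distinct_slow_states)
```

A prediction loss alone would prefer the slow latent here, which is why the evaluation protocol needs a separate probe for the action-relevant state.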

The local Dynamic Curriculum Learning For JEPA idea adds a second caveat: a representation-space loss can only drive useful sampling if the target representation still preserves the local state distinctions that matter.
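
A small sketch of why the second caveat matters, under stated assumptions: the softmax weighting, the `temperature` parameter, and the three toy windows are illustrative choices of this note, not the Dynamic Curriculum Learning For JEPA method itself. When the target encoder separates a rare window from common ones, surprise concentrates sampling on it; when the target encoder maps every window to the same latent, surprise is identical everywhere and the curriculum degenerates to uniform sampling.

```python
import math
import random

def surprise(pred, target):
    """Squared latent prediction error for one window."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def sampling_weights(surprises, temperature=1.0):
    """Softmax over surprises: windows with higher error are sampled more often."""
    top = max(surprises)
    exps = [math.exp((s - top) / temperature) for s in surprises]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical predicted latents for three windows.
preds = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
# Informative target encoder: the third window is a rare state far from the prediction.
informative_targets = [[0.1, 0.0], [0.0, 0.1], [2.0, 0.0]]
# Collapsed target encoder: every window maps to the same latent.
collapsed_targets = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

w_informative = sampling_weights(
    [surprise(p, t) for p, t in zip(preds, informative_targets)])
w_collapsed = sampling_weights(
    [surprise(p, t) for p, t in zip(preds, collapsed_targets)])

# Draw a small batch of window indices using the informative weights.
rng = random.Random(0)
batch = rng.choices(range(3), weights=w_informative, k=8)
print(w_informative, w_collapsed, batch)
```

The collapsed case still reports a nonzero loss for every window, so a nonzero representation-space loss is not by itself evidence that the sampling signal is useful.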

What The Wiki Currently Believes

A Path Towards Autonomous Machine Intelligence frames JEPA as a building block for predictive world models and hierarchical planning.
Introduction to Latent Variable Energy-Based Models presents H-JEPA as a hierarchical stack of joint embedding predictors for multi-level prediction under uncertainty.
VJEPA makes the uncertainty interface explicit: deterministic squared-error JEPA is an implicit fixed-variance Gaussian, while VJEPA predicts a conditional distribution over future latent states and BJEPA combines learned dynamics with a modular prior expert. The evaluated diagonal-Gaussian head is unimodal, so separated multi-modal futures remain a model-family option rather than a demonstrated result.
Joint Embedding Predictive Architectures Focus on Slow Features is the main failure-mode source: VICReg- and SimCLR-style JEPA can learn fixed background distractors instead of action-relevant state in a moving-dot world-model setup.
LeJEPA argues that JEPA needs a target embedding distribution, specifically an isotropic Gaussian, and proposes SIGReg as a scalable way to enforce it.
VISReg is a current SIGReg-family pressure test: it claims regularization-based JEPA can scale toward DINO-like visual OOD performance, but it also argues vanilla SIGReg has a vanishing-gradient weakness under collapse and should be split into variance/scale plus SWD shape regularization.
When Does LeJEPA Learn a World Model? sharpens that claim: under Gaussian latent variables, stationary isotropic additive-noise transitions, and successful Gaussian/whitening constraints, LeJEPA recovers the true latent state up to rotation; outside that regime, non-Gaussian or policy-shaped trajectories can distort the learned state.
LeWorldModel applies JEPA to action-conditioned pixel world modeling with a two-term objective.
Temporal Straightening adds explicit latent-trajectory geometry to a JEPA world model: next-latent prediction remains the dynamics objective, while a local curvature term makes the learned state easier to optimize through during goal-conditioned planning.
AdaJEPA adds test-time adaptation to JEPA world models: after executing an MPC action chunk, the observed transition becomes a self-supervised latent-prediction update, and the model replans with the adapted parameters.
Sensorimotor World Models adds inverse dynamics regularization as an action-grounded alternative to SIGReg for end-to-end JEPA world models: it can intentionally collapse action-irrelevant variation while preserving controllable latent structure.
Learn From Your Own Latents And Not From Tokens gives a sample-complexity theory for the JEPA/data2vec intuition: on the Random Hierarchy Model, own-latent prediction can recover hidden hierarchy at the local clustering scale $v m^{3}$ rather than the token-level $v m^{L + 1}$ scale.
NextLat is the adjacent autoregressive-Transformer branch: it predicts the model’s own next hidden state while keeping next-token supervision, so it supports the own-latent/belief-state intuition but should not be collapsed into pure JEPA.
LeNEPA is the local published bridge between NEPA and LeJEPA for time series: use next-latent-token prediction, remove augmentations and stop-gradient/EMA stabilization, and add temporal SIGReg. The LeNEPA idea page now tracks follow-up target-family comparisons, including NextLat-style own-hidden targets.
stable-worldmodel is the shared evaluation and implementation surface around the JEPA world-model line: it includes LeWM, PLDM, DINO-WM, and related baselines so objective changes can be compared under common data, solver, and robustness protocols.
Self-Teaching Autoencoder is not a standard JEPA source, but it imports the latent-agreement and SIGReg framing into an autoencoder with a decoder constrained by transformed views.
CHARM applies JEPA to multivariate time-series representation learning with channel descriptions, causal/smoothing augmentations, and a multi-resolution latent embedding loss.
EIDOS adapts next-embedding prediction to time-series forecasting with a point-wise scalar tokenizer and lightweight future-segment target aggregation instead of a full auxiliary target encoder.
Next-Embedding Prediction is adjacent to JEPA but should stay separate: NEPA predicts future embeddings from an embedding stream, while JEPA is the broader joint-embedding predictive family.
VL-JEPA is the strongest current vision-language example of replacing token-space supervision with embedding-space prediction: it predicts a target text embedding conditioned on visual input and a query, then decodes text only when a readout is needed.
LeVLJEPA is the current non-contrastive vision-language extension: it trains image/text encoders from scratch with cross-modal prediction, stop-gradient targets, and per-modality SIGReg, and reports stronger dense patch-token features despite weaker DataComp-L zero-shot alignment than contrastive objectives.
VLA-JEPA is the current robotics example of applying JEPA-style latent prediction to VLA pretraining: future video frames create target-side V-JEPA2 state embeddings, while the learner path sees only current observation, language, and latent-action tokens.
SkyJEPA is the quadrotor-control branch: it uses JEPA-style latent dynamics over explicit state and motor-control histories, a physics-inspired prober for metric-state rollouts, and MPPI for real-time candidate control-input search.
TSL-JEPA is the local time-series idea that applies the same query-conditioned and selective-decoding pattern to retrieval, alerting, captioning, and structured time-series readouts.
Dynamic Curriculum Learning For JEPA records an internal, unpublished time-series and video research direction: use the current model’s latent prediction surprise to select useful windows from large unlabeled temporal corpora, then validate on public time-series and video benchmarks.
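
The VJEPA entry above describes deterministic squared-error JEPA as an implicit fixed-variance Gaussian. A short worked identity makes that reading explicit. The symbols are assumptions of this note, not the paper's notation: $\hat{z}$ is the predicted next latent, $z'$ the target latent, $d$ the latent dimension, and $\sigma^2$ a fixed variance.

$$
-\log \mathcal{N}\!\left(z' \mid \hat{z}, \sigma^{2} I\right)
= \frac{1}{2\sigma^{2}} \left\lVert z' - \hat{z} \right\rVert^{2}
+ \frac{d}{2} \log\!\left(2\pi\sigma^{2}\right)
$$

With $\sigma^2$ held fixed, the second term is constant, so minimizing this negative log-likelihood is the same as minimizing squared error. A diagonal-Gaussian head with predicted mean $\mu$ and per-dimension variances $\sigma_i^2$ generalizes it to

$$
-\log \mathcal{N}\!\left(z' \mid \mu, \operatorname{diag}(\sigma_1^{2}, \dots, \sigma_d^{2})\right)
= \sum_{i=1}^{d} \left[ \frac{(z'_i - \mu_i)^{2}}{2\sigma_i^{2}} + \frac{1}{2} \log\!\left(2\pi\sigma_i^{2}\right) \right]
$$

This density has a single mode at $\mu$, which is why separated multi-modal futures need a more expressive predictive family than the diagonal-Gaussian head.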

Evidence

The source set shows JEPA moving from architecture proposal to theory, then to probabilistic semantics, domain-specific systems, and shared evaluation. On the architecture and theory side: autonomous intelligence in APTAMI, lecture-note grounding in LVEBM, distributional latent prediction and filtering/control semantics in VJEPA, early failure-mode analysis in JEPA Slow Features, theory and regularization in LeJEPA, a SIGReg-family scale/shape regularizer pressure test in VISReg, state-identifiability theory in When Does LeJEPA Learn a World Model?, and sample-complexity theory for own-latent targets in Learn From Your Own Latents And Not From Tokens. On the world-model side: pixel control in LeWorldModel, online plan-execute-adapt-replan updates in AdaJEPA, action-grounded inverse-dynamics regularization in Sensorimotor World Models, and infrastructure and robustness evaluation in stable-worldmodel. On the domain side: time-series representation learning in CHARM; time-series forecasting and next-embedding variants in EIDOS, LeNEPA, and NEPA; vision-language tasks in VL-JEPA and non-contrastive cross-modal pretraining in LeVLJEPA; robotics VLA pretraining in VLA-JEPA; and real-time quadrotor control in SkyJEPA. Self-Teaching Autoencoder sits on the boundary: it is an autoencoder project, but it asks whether a latent-agreement objective can keep a decoder grounded without making raw reconstruction loss the primary teacher.

NextLat should be treated as a boundary case for this page. It shares the representation-space prediction intuition with JEPA and own-latent learning, but the target is an internal hidden state generated by the same autoregressive Transformer, and the transition input includes the next token. That makes it especially useful for comparing token-space LLM training against latent-state supervision, while keeping the action-conditioned world-model question separate.

LeNEPA is the local time-series realization of the NEPA-adjacent boundary: use next-latent-token prediction, but add temporal SIGReg so the latent state is not only predictable but also controlled against collapse and dimensional degeneration. The remaining LeNEPA idea is now a follow-up agenda for target selection and grounding, not an unpublished-method placeholder. Temporal Straightening adds a distinct constraint: non-collapsed, predictive latents can still have geometry that is poor for differentiable planning.
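
The Temporal Straightening point can be illustrated with a toy curvature measure. This is a sketch under an assumption: this page does not state the paper's exact curvature term, so the code uses one common choice, one minus the cosine similarity between consecutive latent velocity vectors. The function names and the two example trajectories are hypothetical.

```python
import math

def velocities(latents):
    """Differences between consecutive latent states."""
    return [[b - a for a, b in zip(z0, z1)]
            for z0, z1 in zip(latents, latents[1:])]

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def curvature_penalty(latents):
    """Mean of (1 - cosine) over consecutive velocity pairs; 0 for a straight path."""
    vels = velocities(latents)
    terms = [1.0 - cosine(u, v) for u, v in zip(vels, vels[1:])]
    return sum(terms) / len(terms)

# Both trajectories are non-collapsed and take four distinct latent values.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]

straight_penalty = curvature_penalty(straight)
zigzag_penalty = curvature_penalty(zigzag)
print(straight_penalty, zigzag_penalty)
```

In a training objective this term would be added to the next-latent prediction loss with a weighting coefficient. Both toy trajectories could be equally predictable, and only the penalty separates them, which is the sense in which latent geometry is a distinct constraint from collapse avoidance.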

NEPA Boundary

Next-Embedding Prediction belongs next to JEPA, but it should have its own topic page. NEPA’s target-layer sensitivity is a useful warning for JEPA-style curricula: before using latent prediction surprise as a sampling signal, ablate whether the target path is patch-independent, contextual, internal-layer, or task-conditioned. That warning is not direct evidence about pure JEPA.

VL-JEPA And Selective Decoding

VL-JEPA is useful because it separates semantic prediction from language generation. A classical token-generative VLM learns a distribution over answer tokens. A CLIP-style model aligns independently encoded image and text embeddings. VL-JEPA sits between them: it predicts the embedding of the target answer from visual input and a textual query, then optionally decodes that embedding into text.

This makes JEPA relevant to fast/slow system design. The high-rate internal stream can stay continuous and non-autoregressive, while language becomes a selective readout for humans or external language interfaces. In robotics and time-series systems, this suggests a useful middle layer: maintain compact task, state, or incident embeddings continuously, then decode explanations or labels sparsely when the embedding changes enough to matter.
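
A minimal sketch of that selective-readout pattern, assuming a cosine-distance trigger: the `threshold` value, the `toy_decode` stand-in, and the example stream are hypothetical and are not taken from VL-JEPA. The embedding stream is consumed at every step, but the decoder is only called when the current embedding has drifted far enough from the last one that was decoded.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def selective_readout(stream, decode, threshold=0.2):
    """Decode only when the embedding has moved away from the last decoded one."""
    last_decoded = None
    events = []
    for step, emb in enumerate(stream):
        if last_decoded is None or cosine_distance(emb, last_decoded) > threshold:
            events.append((step, decode(emb)))
            last_decoded = emb
    return events

def toy_decode(emb):
    """Hypothetical stand-in for a text decoder over a 2-D embedding."""
    return "rising" if emb[0] >= emb[1] else "falling"

# Five steps of a continuous embedding stream with one real change at step 3.
stream = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.1, 1.0], [0.05, 1.0]]
events = selective_readout(stream, toy_decode)
print(events)
```

Comparing against the last decoded embedding, rather than the previous step, is a deliberate choice in this sketch: slow drift accumulates until it crosses the threshold instead of being missed one small step at a time.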

VL-JEPA should not be collapsed into action-conditioned world modeling. It predicts target text embeddings, not future state embeddings under candidate actions, control inputs, or interventions. Its main contribution to the wiki’s world-model thread is the interface pattern: prediction in representation space first, human-readable language second.

TSL-JEPA Extension

TSL-JEPA translates the VL-JEPA interface to time series. The intended object is not a chat model over time-series plots. It is a query-conditioned representation system where retrieval, alerting, classification, structured property extraction, and optional captioning share a predictive embedding interface.

The practical comparison against next-token prediction should therefore test query and candidate-label reformulation, not only fixed-prompt answer quality. If the representation-space target is useful, the model should be less sensitive to superficial wording changes than a pure token-generation pipeline trained on the same query/answer data.
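
One way to operationalize that comparison is a paraphrase-consistency score. The sketch below is an assumption of this note, not a published TSL-JEPA protocol: the metric definition, the query groups, and both toy "models" are hypothetical. It counts the fraction of paraphrase groups for which a system gives the same answer to every rewording of the same query.

```python
def paraphrase_consistency(predict, paraphrase_groups):
    """Fraction of groups where all paraphrases of a query get the same output."""
    consistent = 0
    for group in paraphrase_groups:
        outputs = {predict(query) for query in group}
        if len(outputs) == 1:
            consistent += 1
    return consistent / len(paraphrase_groups)

# Each inner list holds rewordings of one underlying question about a series.
groups = [
    ["is there a spike?", "does a spike occur?"],
    ["is the trend upward?", "is the series going up?"],
]

def wording_sensitive_model(query):
    """Hypothetical pipeline that keys on surface wording."""
    return "yes" if query.startswith("is there") or "upward" in query else "no"

def wording_invariant_model(query):
    """Hypothetical pipeline that maps rewordings to the same property."""
    return "spike" if "spike" in query else "trend"

sensitive_score = paraphrase_consistency(wording_sensitive_model, groups)
invariant_score = paraphrase_consistency(wording_invariant_model, groups)
print(sensitive_score, invariant_score)
```

The same function can score candidate-label reformulation by treating each group as rewordings of the label set, with `predict` returning the index of the chosen label. Consistency should be reported next to accuracy, since a system that always gives the same wrong answer also scores 1.0.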

Relation To Foundation TSFM Agenda

JEPA maps to the Foundation Time-Series Model Research Agenda through latent prediction, anti-collapse regularization, and the semantic-state-versus-dense-detail tension.

Agenda slot	Verdict	Evidence	Missing pieces
Latent-state prediction	partially closes	CHARM, EIDOS, and LeNEPA give time-series variants of predictive representation learning; LeWorldModel gives action-conditioned evidence outside numeric time series; Own Latents gives a synthetic sample-complexity mechanism for recovering hidden hierarchy by predicting learned latents.	Need broader high-dimensional and streaming time-series state-maintenance tests.
Multi-modal future distributions	adjacent	VJEPA replaces one latent target estimate with an explicit predictive distribution and supports sampled belief rollouts.	Its evaluated diagonal-Gaussian head is unimodal and the evidence is not numeric time series; needs separated-mode, calibration, and decision-utility benchmarks with expressive predictive families.
Data diversity, curriculum, and long tail	adjacent	Dynamic Curriculum Learning For JEPA proposes using latent prediction surprise to select useful windows from large unlabeled temporal corpora in which useful signal is sparse. Evidence is internal and unpublished.	Need public matched-compute time-series and video experiments with rare-state metrics and normal-retention checks.
Anti-collapse regularization	partially closes	LeJEPA and LeWorldModel add Gaussian regularization; VISReg splits SIGReg-family regularization into variance scale plus SWD shape matching and reports stronger collapse-stage gradients; LeVLJEPA applies per-modality SIGReg in non-contrastive image/text pretraining and documents direct symmetric alignment collapse; LeNEPA adds temporal SIGReg evidence for no-stop-gradient time-series next-latent prediction; Sensorimotor World Models adds inverse dynamics as an action-grounded anti-collapse signal; LeJEPA Identifiability gives conditions under which Gaussian/whitened alignment recovers true latent state; JEPA Slow Features gives a failure-mode warning.	Need rare-regime, weak-action, non-Gaussian, and intervention-sensitive probes in time-series domains.
Control and counterfactuals	adjacent	LeWorldModel is an action-conditioned world-model anchor outside this page’s time-series evidence; Temporal Straightening regularizes planner-facing latent geometry for GD/MPC; AdaJEPA adds within-episode test-time model updates during MPC; VLA-JEPA adds robotics VLA evidence for leakage-free latent state prediction plus a flow-matching control-input head; SkyJEPA adds explicit motor-control rollout through MPPI for real-time quadrotor flight.	Need candidate-action rollout evidence for numeric or operational time series.
Decoder grounding	adjacent	Self-Teaching Autoencoder keeps a decoder in the latent objective through transformed self-consistency.	Needs stronger vision baselines, downstream representation tests, and time-series grounding experiments.

Open Questions

Can SIGReg-style Gaussian regularization replace stop-gradient and teacher-student stabilizers at very large multimodal scale, or does VISReg-style scale/shape decoupling become necessary when collapse-stage gradients matter?
When should action-grounded inverse dynamics replace, complement, or be rejected in favor of SIGReg-style distribution control?
Can LeJEPA-style identifiability be extended from passive/OU positive pairs to action-conditioned world models with typed actions or interventions?
Which domains require latent variables beyond deterministic embeddings?
Which JEPA predictive family can preserve separated future regimes without averaging incompatible latent states: mixtures, flows, diffusion, latent variables, or energy-based scoring?
How should time-series JEPA preserve both slow regime state and fast transition state without letting static exogenous context dominate the representation?
Can VL-JEPA-style target embeddings become state variables for action-conditioned world models?
Can decoder-grounded latent consistency preserve dense outputs without making pixel or observation loss dominate the representation?
How should embedding targets be grounded in control consequences rather than only caption, retrieval, or VQA semantics?
Which target paths are safe for surprise-based JEPA curricula: patch-independent embeddings, contextual embeddings, internal Transformer layers, or task-conditioned target states?
Can a time-series encoder plus text encoder plus predictor become a publishable text-conditioned time-series JEPA system without erasing dense numeric detail?
Can TSL-JEPA turn structured time-series queries into typed outputs without making free-form text generation the main objective?
Can LeVLJEPA’s non-contrastive image/text prediction recipe transfer to time-series/text targets, and do dense temporal probes reverse conclusions from pooled query-answer metrics the way dense patch-token probes do in vision-language?
Can SkyJEPA-style physics-inspired probing be adapted to typed non-robotic systems where constraints and state variables are known but not governed by rigid-body dynamics?
Can AdaJEPA-style test-time adaptation be made safe, calibrated, and persistent enough for real robots or operational time-series systems?
Can Temporal-Straightening-style latent-velocity regularization transfer to irregular time series while respecting actions, exogenous variables, real discontinuities, and asymmetric transition costs?
Does the own-latent sample-complexity advantage survive outside RHM-style synthetic hierarchy when real data has ambiguous parses, recursion, long tails, and context-dependent rules?
Can NextLat-style hidden-state prediction be combined with JEPA-style distribution regularization so the learned state is both recursively predictable and identifiable?
Can the published LeNEPA baseline make that comparison operational by testing external next embeddings, contextual embeddings, and own-hidden targets under the same SIGReg/grounding/probe suite?
