AdaJEPA

Source

Raw Markdown: paper_adajepa-2026.md
PDF: paper_adajepa-2026.pdf
Preprint: arXiv 2606.32026
Paper-listed project URL: agenticlearning.ai/adajepa
Local artifact-discovery notes: papers/adajepa-2026/official_artifacts_snapshot.md

Status And Credibility

AdaJEPA is a 2026 arXiv preprint, submitted on 2026-06-30 by Ying Wang, Oumayma Bounou, Yann LeCun, and Mengye Ren of New York University and AMI Labs. It is credible enough to track as an important Alex-provided source for three reasons: it is a fresh JEPA/world-model paper, its authors are already central to the JEPA and latent-planning thread, and it extends the locally tracked LeWorldModel, stable-worldmodel, and Temporal Straightening latent-planning line that the paper cites.

Credibility caveats: it is not peer reviewed in the local evidence record, no official code repository was verified at ingest time, the paper-listed project URL currently resolves to the Agentic Learning AI Lab site rather than a dedicated AdaJEPA page, and no official author/lab X announcement was verified. Treat the numerical claims as paper-reported preprint evidence until code, data, and independent reproduction appear.

Core Claim

AdaJEPA argues that a latent world model used by model predictive control should not remain frozen at deployment. After each MPC step, the agent executes the first action chunk, observes the resulting transition, performs a lightweight self-supervised latent-prediction update, and replans with the adapted model. The paper’s main claim is that this plan-execute-adapt-replan loop improves goal-reaching under visual, geometric, physical-dynamics, and maze-layout shifts, often with only one gradient step and a tiny recent-transition buffer.

flowchart LR
  O["current observation"]
  WM["JEPA latent world model"]
  MPC["MPC planner"]
  A["execute first action chunk"]
  N["next observation"]
  B["recent transition buffer"]
  U["one or few gradient updates"]

  O --> WM --> MPC --> A --> N --> B --> U --> WM
  N --> O
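The loop in the flowchart can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's implementation: the state is a single number, the encoder is the identity (so the "latent" is the raw state), the world model is one scalar `gain`, the planner is an exhaustive search over a few candidate actions standing in for GD or CEM, and the names `plan`, `adapt`, and `run_episode` as well as the learning rate of 0.2 are hypothetical.

```python
from collections import deque


def plan(gain, obs, goal, candidates):
    """Pick the candidate action whose predicted next state is closest to the goal."""
    return min(candidates, key=lambda a: abs(obs + gain * a - goal))


def adapt(gain, buffer, lr=0.2):
    """One gradient step on the mean squared one-step prediction error."""
    grad = sum(2 * (o + gain * a - o_next) * a for o, a, o_next in buffer) / len(buffer)
    return gain - lr * grad


def run_episode(pretrained_gain, true_gain, goal, adapt_online, steps=7):
    gain = pretrained_gain                          # episode-local copy of the model
    buffer = deque(maxlen=5)                        # tiny recent-transition buffer
    candidates = [i / 10 for i in range(-10, 11)]   # bounded actions in [-1, 1]
    obs = 0.0
    for _ in range(steps):
        action = plan(gain, obs, goal, candidates)  # plan with the current model
        next_obs = obs + true_gain * action         # execute in the shifted environment
        buffer.append((obs, action, next_obs))      # store the observed transition
        if adapt_online:
            gain = adapt(gain, buffer)              # adapt before the next replan
        obs = next_obs
    return obs, gain


# Toy dynamics shift: the model was "pretrained" with gain 1.0, the environment has gain 0.3.
frozen_obs, frozen_gain = run_episode(1.0, 0.3, goal=2.0, adapt_online=False)
adapted_obs, adapted_gain = run_episode(1.0, 0.3, goal=2.0, adapt_online=True)
print(f"frozen:  final state {frozen_obs:.2f}, model gain {frozen_gain:.2f}")
print(f"adapted: final state {adapted_obs:.2f}, model gain {adapted_gain:.2f}")
```

In this toy the frozen planner keeps overestimating how far each action moves the state, so it approaches the goal more slowly than the adapted one within the same step budget. The sketch only illustrates the control flow; it says nothing about the magnitude of gains in the paper's environments.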

Model Interface

The paper assumes trajectories of observations and actions/control inputs. A sensory encoder maps observations to latent states, an action encoder maps actions to latent action embeddings, and a predictor forecasts the next latent state:

$$
z_t = E_\phi^{s}(o_t), \qquad u_t = E_\psi^{a}(a_t), \qquad \hat{z}_{t+1} = f_\theta(z_t, u_t).
$$

At test time, AdaJEPA stores recent observed transitions and minimizes the same kind of latent prediction loss used during pretraining:

$$
L_{\mathrm{ada}}(B) = \frac{1}{|B|} \sum_{(o_i, a_i, o_{i+1}) \in B} \ell\big(f_\theta(z_i, E_\psi^{a}(a_i)),\ \mathrm{sg}(z_{i+1})\big).
$$

Only a subset of parameters $\Omega \subseteq \{\phi, \psi, \theta\}$ is updated before the next replan. The default experiments update selected final encoder/predictor layers with one gradient step, a recent buffer of 5 transitions, and the training learning rates. Each episode starts from the same pretrained model and maintains an episode-local adapted copy.
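The adaptation loss and the subset update can be made concrete with a small pure-Python sketch. Everything here is an illustrative assumption rather than the paper's code: each of the sensory encoder, action encoder, and predictor is reduced to scalar weights (`phi`, `psi`, `theta_z`, `theta_u`), the loss is squared error, the optimizer is plain gradient descent with a made-up learning rate, and the stop-gradient is implemented by simply not differentiating through the target.

```python
from collections import deque


def adaptation_loss(params, batch):
    """Mean squared latent prediction error over (o, a, o_next) transitions."""
    total = 0.0
    for o, a, o_next in batch:
        z = params["phi"] * o                      # sensory encoder
        u = params["psi"] * a                      # action encoder
        z_hat = params["theta_z"] * z + params["theta_u"] * u  # predictor
        target = params["phi"] * o_next            # sg(z_{i+1}): treated as a constant
        total += (z_hat - target) ** 2
    return total / len(batch)


def adaptation_grads(params, batch):
    """Hand-derived gradients; no term flows through the stop-gradient target."""
    grads = {name: 0.0 for name in params}
    for o, a, o_next in batch:
        z, u = params["phi"] * o, params["psi"] * a
        err = params["theta_z"] * z + params["theta_u"] * u - params["phi"] * o_next
        grads["theta_z"] += 2 * err * z
        grads["theta_u"] += 2 * err * u
        grads["psi"] += 2 * err * params["theta_u"] * a
        grads["phi"] += 2 * err * params["theta_z"] * o
    return {name: g / len(batch) for name, g in grads.items()}


def adapt_once(params, batch, trainable, lr):
    """One gradient step that only touches the parameters named in `trainable`."""
    grads = adaptation_grads(params, batch)
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


pretrained = {"phi": 1.0, "psi": 1.0, "theta_z": 1.0, "theta_u": 1.0}
buffer = deque(maxlen=5)                           # keeps only the 5 latest transitions
for o, a in [(0.0, 1.0), (0.5, 1.0), (1.0, -1.0), (0.5, 0.5), (0.75, 1.0), (1.25, -0.5)]:
    buffer.append((o, a, o + 0.5 * a))             # toy shifted dynamics

episode_params = dict(pretrained)                  # episode-local copy
loss_before = adaptation_loss(episode_params, list(buffer))
episode_params = adapt_once(episode_params, list(buffer), trainable={"theta_u"}, lr=0.1)
loss_after = adaptation_loss(episode_params, list(buffer))
print(loss_before, loss_after)
```

Passing a different `trainable` set is the toy analogue of choosing which layers to adapt, and copying `pretrained` at the start mirrors the per-episode reset.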

Evidence And Results

The experiments use PushT/PushObj visual manipulation and PointMaze navigation variants. The paper separates in-distribution evaluation from several shift types:

Shape shifts: PushObj trains on four shapes and tests on both seen and held-out shapes.
Visual shifts: PushT observations receive blur, salt-and-pepper noise, dark lighting, or color shifts.
Dynamics shifts: PointMaze changes mass or damping.
Layout shifts: PointMaze uses held-out random maze layouts.

Key paper-reported results:

On in-distribution tasks, adaptation is reported as safe: it improves suboptimal frozen models and does not materially harm already strong frozen models.
On PointMaze held-out layouts, the frozen model reports 53.3% GD and 49.3% CEM success, while adapting the first predictor block plus last encoder stage reports 78.7% GD and 70.7% CEM success.
On PushT validation trajectories, AdaJEPA improves several base JEPA world models while adding only about 0.01 to 0.03 seconds per MPC replan on an H200. For example, a global-feature Temporal Straightening world model improves CEM success from 74.0% to 81.3%, and a spatial-feature version improves CEM success from 89.3% to 93.3%.
In the PushObj data-scaling study, adaptation is most valuable in low-data regimes. For seen shapes with one training shape and 1k trajectories, test-time adaptation raises success from about 28.1% to 60.8%, exceeding a frozen model trained with 16x more trajectories per shape. The appendix also reports that an adapted single-shape model can exceed the best four-shape frozen model once the per-shape trajectory count reaches 2k for unseen shapes.
Shape diversity still matters. Under a fixed 16k total trajectory budget, spreading data across four shapes is reported better than concentrating it on one shape for both seen and unseen shapes after adaptation.
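To compare the paper-reported figures above on one scale, the following sketch computes the absolute gain in percentage points for each frozen/adapted pair. It also computes the share of frozen-model failures removed by adaptation; that second number is a derived convenience metric introduced here, not one the paper reports, and the dictionary keys are shorthand labels.

```python
# (frozen success %, adapted success %) as quoted in the results above.
reported = {
    "PointMaze held-out layouts, GD": (53.3, 78.7),
    "PointMaze held-out layouts, CEM": (49.3, 70.7),
    "PushT global-feature Temporal Straightening, CEM": (74.0, 81.3),
    "PushT spatial-feature Temporal Straightening, CEM": (89.3, 93.3),
    "PushObj seen shapes, one shape, 1k trajectories (approximate)": (28.1, 60.8),
}


def gain_in_points(frozen, adapted):
    """Absolute improvement in percentage points."""
    return round(adapted - frozen, 1)


def failure_reduction(frozen, adapted):
    """Percent of the frozen model's failures that adaptation removes (derived metric)."""
    return round(100 * (adapted - frozen) / (100 - frozen), 1)


gains = {name: gain_in_points(f, a) for name, (f, a) in reported.items()}
reductions = {name: failure_reduction(f, a) for name, (f, a) in reported.items()}

for name in reported:
    print(f"{name}: +{gains[name]} points, {reductions[name]}% of failures removed")
```

The point gains come out as 25.4, 21.4, 7.3, 4.0, and 32.7, which makes the pattern in the text visible: the largest gains appear where the frozen model is weakest.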

Limitations And Gotchas

The source is a preprint; no venue acceptance, official code release, or independent reproduction was verified during ingest.
The evidence is visual manipulation and maze navigation, not physical robot deployment, graph time series, observability telemetry, healthcare, power-grid control, or other numeric multivariate time-series domains.
AdaJEPA adapts within an episode from the latest observed transitions; the paper does not claim a persistent continual-learning memory across episodes.
Adaptation is bounded by the pretrained representation. The paper notes that visual corruptions such as red-anchor/red-block shifts show modest gains when the model’s color reliance remains a bottleneck.
Test-time gradient updates add a new safety and reproducibility axis: which parameters are updated, buffer policy, learning rate, number of steps, per-step latency, reset policy, and possible update instability must be reported.
The paper improves goal-reaching success but does not provide a calibrated uncertainty model, causal counterfactual protocol, or general intervention-validity test.
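The reporting axes named above could be captured in a small record so that missing items are easy to spot. This is a hypothetical sketch: the class `AdaptationProtocol` and its field names are invented for illustration, and the example instance is filled only with values stated on this page, leaving anything not stated numerically here as `None`.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AdaptationProtocol:
    """Hypothetical reporting record for one test-time adaptation setup."""

    updated_parameters: Tuple[str, ...]
    buffer_policy: str
    buffer_size: int
    gradient_steps_per_replan: int
    reset_policy: str
    learning_rate: Optional[float] = None
    latency_seconds_per_replan: Optional[Tuple[float, float]] = None
    instability_notes: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that still need to be reported."""
        return [name for name, value in asdict(self).items() if value is None]


paper_default = AdaptationProtocol(
    updated_parameters=("selected final encoder layers", "selected final predictor layers"),
    buffer_policy="most recent transitions of the current episode",
    buffer_size=5,
    gradient_steps_per_replan=1,
    reset_policy="restart from the pretrained model each episode",
    # Learning rate: described only as "the training learning rates", so left unset.
    latency_seconds_per_replan=(0.01, 0.03),  # PushT figure, reported on an H200
)

print(paper_default.missing_fields())
```

For this instance the unset fields are the numeric learning rate and any notes on update instability, which are exactly the items a reader would need to look up in the paper or request from the authors.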

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Latent-state prediction | partially closes outside time series | Predicts and adapts future latent states from observation/action transitions instead of reconstructing pixels. | Needs numeric multivariate time-series evidence, dense-value probes, irregular event streams, and state-identifiability checks under policy-shaped data. |
| Control and counterfactuals | partially closes outside time series | Uses MPC over candidate action/control-input sequences, then updates the world model from the transition caused by the chosen action. | Needs typed digital interventions, calibrated counterfactual validation, real or simulator-backed operational benchmarks, and safety constraints for online adaptation. |
| Streaming state and constant updates | adjacent | The model is revised repeatedly inside a closed control loop from newly observed transitions. | It is episodic test-time adaptation, not an always-on bounded-memory streaming TSFM with eviction, abstention, and long-horizon state-refresh evaluation. |
| Data diversity and long tail | adjacent | Data-scale experiments show adaptation can compensate for low training coverage while still benefiting from shape diversity. | Needs rare-regime preservation tests, no-adaptation controls, and matched-compute scaling in non-robotic time-series corpora. |
| Benchmark hygiene | warning | The paper separately reports shift types, adapted layers, buffer/step choices, planner type, latency, and data scale. | Needs public code/data, independent reproduction, and a protocol separating frozen, within-episode adaptation, persistent continual learning, and data-collection effects. |

Links Into The Wiki

Open Questions

Can test-time adaptation be made safe enough for real robots or operational systems when a bad update can change future action choices?
Which online-update targets are best for action-conditioned time series: encoder layers, predictor layers, LoRA adapters, recurrent state, fast weights, or explicit environment parameters?
How should adaptation interact with calibrated uncertainty, safety shields, and no-op decisions?
Can this loop become persistent continual learning without catastrophic drift, privacy leaks, or forgetting of rare but safety-critical states?
Does latent prediction loss reliably indicate control utility under larger shifts, or can adaptation reduce prediction error while hurting plan ranking?
What is the TSFM analogue of PushObj/PointMaze where typed interventions, exogenous variables, event streams, and dense numeric state are all present?
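The question about prediction loss versus plan ranking can be turned into a simple diagnostic. The sketch below is an assumption-laden illustration, not something the paper proposes: it uses a hand-written Kendall tau-a to compare how a model orders candidate plans against their true costs, and all the numbers are invented toy values chosen to show that lower prediction error does not guarantee a better ranking.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a rank agreement between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


def mean_abs_error(predicted, actual):
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


# Invented toy numbers: true cost of four candidate plans and two models' predictions.
true_cost = [1.0, 2.0, 3.0, 4.0]
frozen_pred = [1.5, 2.5, 3.5, 4.5]    # biased, but preserves the plan order
adapted_pred = [1.0, 2.6, 2.5, 4.0]   # smaller error, but swaps two plans

frozen_err, adapted_err = mean_abs_error(frozen_pred, true_cost), mean_abs_error(adapted_pred, true_cost)
frozen_tau, adapted_tau = kendall_tau(frozen_pred, true_cost), kendall_tau(adapted_pred, true_cost)
print(f"frozen:  error {frozen_err:.3f}, rank agreement {frozen_tau:.3f}")
print(f"adapted: error {adapted_err:.3f}, rank agreement {adapted_tau:.3f}")
```

Here the toy "adapted" predictions have lower error yet worse rank agreement, which is the failure mode the question asks about. A real evaluation would need ground-truth plan costs from the environment or simulator, and would measure latent prediction loss rather than the cost error used as a stand-in here.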

AdaJEPA: An Adaptive Latent World Model