LeWorldModel: Stable End-To-End Joint-Embedding Predictive Architecture From Pixels

Source

Core Claim

LeWorldModel trains a stable end-to-end JEPA world model from raw pixels using next-embedding prediction and Gaussian-distribution regularization.

Key Contributions

  • Presents a two-term objective (next-embedding prediction plus a Gaussian regularizer) for stable world modeling from pixels.
  • Avoids EMA, pretrained encoders, auxiliary supervision, and multi-loss heuristic stacks.
  • Regularizes latent embeddings toward a Gaussian distribution to prevent representation collapse.
  • Reports fast planning and meaningful physical latent structure on control tasks.

Method Notes

LeWorldModel operationalizes ideas from APTAMI, LeJEPA, and World Models.
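As a rough illustration of how a two-term objective of this kind discourages collapse, the sketch below combines a next-embedding prediction loss with a Gaussian regularizer. It is a minimal sketch under stated assumptions, not the paper's implementation: the moment-matching penalty (mean near 0, variance near 1 per dimension) is a simple stand-in for whatever Gaussian regularizer the paper uses, and the names `predictor`, `gaussian_moment_penalty`, `two_term_loss`, and the weight `lam` are hypothetical.

```python
import random

def predictor(z, a, w):
    # Hypothetical predictor: shifts every latent dimension by w * action.
    return [zi + w * a for zi in z]

def prediction_loss(pred, target):
    # Mean squared error between predicted and actual next embedding.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def gaussian_moment_penalty(batch):
    # Stand-in regularizer (assumption): per dimension, penalize a batch mean
    # away from 0 and a batch variance away from 1.
    n, d = len(batch), len(batch[0])
    penalty = 0.0
    for j in range(d):
        col = [z[j] for z in batch]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / d

def two_term_loss(z_t, a_t, z_next, w, lam=1.0):
    # Term 1: next-embedding prediction error, averaged over the batch.
    preds = [predictor(z, a, w) for z, a in zip(z_t, a_t)]
    pred_term = sum(prediction_loss(p, t) for p, t in zip(preds, z_next)) / len(preds)
    # Term 2: Gaussian regularizer over all embeddings in the batch.
    reg_term = gaussian_moment_penalty(z_t + z_next)
    return pred_term + lam * reg_term, pred_term, reg_term

# A fully collapsed encoder (every embedding identical) predicts perfectly,
# so only the regularizer term can flag it.
collapsed = [[0.0, 0.0] for _ in range(8)]
total, pred_term, reg_term = two_term_loss(collapsed, [0.0] * 8, collapsed, w=0.0)
print("collapsed:", total, pred_term, reg_term)

# Embeddings drawn from a standard Gaussian incur a much smaller penalty.
rng = random.Random(0)
spread = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(2000)]
print("gaussian penalty:", gaussian_moment_penalty(spread))
```

The point of the toy case is that prediction loss alone is minimized by a constant embedding, which is why the second term is needed when no EMA target or pretrained encoder is available to break the symmetry.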

Evidence And Results

The abstract reports training with about 15M parameters on a single GPU, planning up to 48x faster than foundation-model-based world models, and competitive control performance across 2D and 3D tasks.

Limitations

The paper notes short-horizon planning, offline data coverage, and action-label reliance as remaining limitations.

stable-worldmodel is the follow-up evaluation substrate to watch for this line of work: it places LeWM-style latent world-model baselines in a shared data, solver, and factor-of-variation protocol, so that prediction error and planning success can be measured separately.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Causal structure, counterfactuals, and control | partially closes | Trains on offline trajectories of observations and actions, predicts next latent states, and uses MPC to optimize candidate action sequences. | Evidence is visual control, not multivariate operational time series or explicit causal intervention logs. |
| Anti-collapse regularization | partially closes | Uses Gaussian-distributed latent embeddings to stabilize end-to-end JEPA world-model training from pixels. | Needs evidence that the regularizer preserves rare regimes and dense numeric state in time-series domains. |
| Representation quality: semantic state vs dense numeric detail | partially closes | Physical probes and violation-of-expectation tests show latent structure captures some physical state and implausible events. | Short-horizon latent planning and no dense numeric reconstruction/editing interface. |
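The MPC step named in the first table row (optimizing candidate action sequences against predicted latent states) can be sketched as follows. This is a minimal sketch that assumes random shooting as the sampler, which is the simplest sampling-based planner and is not confirmed to be the solver the paper uses; `toy_step`, `rollout_cost`, and `plan_random_shooting` are hypothetical names, and the additive latent dynamics are purely illustrative.

```python
import random

def toy_step(z, a):
    # Hypothetical latent dynamics: the action displaces the latent state.
    return [zi + 0.5 * ai for zi, ai in zip(z, a)]

def rollout_cost(z0, actions, step, goal):
    # Roll the latent state forward through the action sequence, then score
    # the final state by squared distance to the goal embedding.
    z = z0
    for a in actions:
        z = step(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))

def plan_random_shooting(z0, goal, step, horizon=5, n_candidates=256, seed=0):
    # Sample candidate action sequences uniformly in [-1, 1] and keep the one
    # whose predicted final latent state is closest to the goal.
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [[rng.uniform(-1, 1) for _ in z0] for _ in range(horizon)]
        cost = rollout_cost(z0, actions, step, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

z0, goal = [0.0, 0.0], [1.0, -1.0]
plan, cost = plan_random_shooting(z0, goal, toy_step)
first_action = plan[0]  # MPC executes only this action, then replans
print("planned cost:", cost, "first action:", first_action)
```

Because every candidate is scored entirely in latent space, planning cost here depends on the size of the predictor rather than on decoding pixels, and a separate measurement of one-step prediction error would not by itself indicate whether the chosen plans succeed.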

Open Questions

  • Can LeWorldModel scale to long-horizon hierarchical planning?
  • Can inverse dynamics reduce dependence on explicit action labels?