LeWorldModel: Stable End-To-End Joint-Embedding Predictive Architecture From Pixels
Source
- Raw Markdown: paper_leworldmodel-2026.md
- PDF: paper_leworldmodel-2026.pdf
Core Claim
LeWorldModel trains a stable end-to-end JEPA world model from raw pixels using next-embedding prediction and Gaussian-distribution regularization.
Key Contributions
- Presents a two-term objective for stable pixel world modeling.
- Avoids EMA, pretrained encoders, auxiliary supervision, and multi-loss heuristic stacks.
- Regularizes latent embeddings toward a Gaussian distribution to prevent representation collapse.
- Reports fast planning and meaningful physical latent structure on control tasks.
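To make the two-term objective above concrete, here is a minimal NumPy sketch that adds a next-embedding prediction loss to a Gaussian-style regularizer. It is an illustration under stated assumptions, not the paper's implementation. The regularizer is a simplified stand-in that only matches the mean and variance of random 1-D projections, and the names `prediction_loss`, `gaussian_regularizer`, `two_term_objective`, `num_projections`, and `reg_weight` are hypothetical.

```python
import numpy as np


def prediction_loss(pred_next, target_next):
    """Mean squared error between predicted and encoded next embeddings."""
    return float(np.mean((pred_next - target_next) ** 2))


def gaussian_regularizer(z, num_projections=64, seed=0):
    """Simplified stand-in regularizer (an assumption, not the paper's exact term).

    Projects the batch of embeddings onto random unit directions and penalizes
    deviation of each 1-D projection from zero mean and unit variance.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(z.shape[1], num_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions  # shape: (batch, num_projections)
    mean_penalty = np.mean(proj.mean(axis=0) ** 2)
    var_penalty = np.mean((proj.var(axis=0) - 1.0) ** 2)
    return float(mean_penalty + var_penalty)


def two_term_objective(pred_next, target_next, z, reg_weight=1.0):
    """Prediction term plus weighted regularization term."""
    return prediction_loss(pred_next, target_next) + reg_weight * gaussian_regularizer(z)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    healthy = rng.normal(size=(4096, 16))   # roughly Gaussian embeddings
    collapsed = np.zeros((128, 16))         # fully collapsed embeddings
    print("healthy penalty:  ", gaussian_regularizer(healthy))
    print("collapsed penalty:", gaussian_regularizer(collapsed))
```

In this toy version, a collapsed batch (all embeddings identical) has zero variance along every projection and therefore receives a large penalty, while a batch that is already close to standard Gaussian receives a penalty near zero. That is the intuition behind using a distributional regularizer in place of EMA targets or stop-gradient tricks.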
Method Notes
LeWorldModel operationalizes ideas from APTAMI, LeJEPA, and World Models.
Evidence And Results
The abstract reports training with about 15M parameters on a single GPU, planning up to 48x faster than foundation-model-based world models, and competitive control performance across 2D and 3D tasks.
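The relevance table on this page cites physical probes as evidence of latent structure. As a rough illustration of what such a probe involves, the sketch below fits a linear map from frozen latent embeddings to a physical quantity and scores it on held-out samples. The data is synthetic, the function names `fit_linear_probe` and `probe_r2` are hypothetical, and the paper's actual probing protocol may differ.

```python
import numpy as np


def _with_bias(latents):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([latents, np.ones((latents.shape[0], 1))])


def fit_linear_probe(latents, targets):
    """Least-squares linear map from frozen latents to a physical quantity."""
    weights, *_ = np.linalg.lstsq(_with_bias(latents), targets, rcond=None)
    return weights


def probe_r2(latents, targets, weights):
    """Coefficient of determination of the probe on the given samples."""
    pred = _with_bias(latents) @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(200, 8))                   # stand-in for encoder outputs
    position = latents @ rng.normal(size=8) + 0.5         # synthetic "physical state"
    weights = fit_linear_probe(latents[:150], position[:150])
    print("held-out R^2:", probe_r2(latents[150:], position[150:], weights))
```

A high held-out score from a probe this simple suggests the quantity is linearly decodable from the latent space; a low score does not rule out that it is encoded nonlinearly.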
Limitations
The paper names three remaining limitations: short-horizon planning, dependence on offline data coverage, and reliance on action labels.
stable-worldmodel is the follow-up evaluation platform to watch for this line of work: it places LeWM-style latent world-model baselines under a shared data, solver, and factor-of-variation protocol, so that prediction error and planning success can be assessed separately.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | partially closes | Trains on offline trajectories of observations and actions, predicts next latent states, and uses MPC to optimize candidate action sequences. | Evidence is visual control, not multivariate operational time series or explicit causal intervention logs. |
| Anti-collapse regularization | partially closes | Uses Gaussian-distributed latent embeddings to stabilize end-to-end JEPA world-model training from pixels. | Needs evidence that the regularizer preserves rare regimes and dense numeric state in time-series domains. |
| Representation quality: semantic state vs dense numeric detail | partially closes | Physical probes and violation-of-expectation tests show latent structure captures some physical state and implausible events. | Short-horizon latent planning and no dense numeric reconstruction/editing interface. |
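The first table row describes MPC over candidate action sequences in latent space. The sketch below illustrates that loop with random shooting, a deliberately simple solver swapped in for illustration; the source does not say which optimizer the paper uses. The function `plan_random_shooting`, the `predict_next` callable, the toy dynamics, and all parameter names are hypothetical.

```python
import numpy as np


def plan_random_shooting(z0, z_goal, predict_next, horizon=5,
                         num_candidates=256, action_dim=2, seed=0):
    """Pick the first action of the best sampled action sequence.

    predict_next(z, a) is assumed to map a batch of latents and a batch of
    actions to the batch of predicted next latents.
    """
    rng = np.random.default_rng(seed)
    # Sample candidate action sequences uniformly in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    # Roll every candidate forward in latent space, in parallel.
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    for t in range(horizon):
        z = predict_next(z, candidates[:, t])
    # Score each rollout by squared distance to the goal embedding.
    costs = np.sum((z - z_goal) ** 2, axis=1)
    best = int(np.argmin(costs))
    # MPC executes only the first action, then replans from the new state.
    return candidates[best, 0], float(costs[best])


def toy_predictor(z, a):
    """Toy latent dynamics used only for this example: z' = z + 0.1 * a."""
    return z + 0.1 * a


if __name__ == "__main__":
    z0 = np.zeros(2)
    z_goal = np.array([0.3, -0.2])
    action, cost = plan_random_shooting(z0, z_goal, toy_predictor)
    print("first action:", action, "predicted terminal cost:", cost)
```

Because every candidate is rolled out through the predictor alone, with no decoding back to pixels, the cost of planning scales with the size of the latent predictor rather than with image resolution.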
Links Into The Wiki
- JEPA
- World Models
- stable-worldmodel
- Representation Collapse
- Latent-Space Predictive Learning
- Foundation Time-Series Model Research Agenda
Open Questions
- Can LeWorldModel scale to long-horizon hierarchical planning?
- Can inverse dynamics reduce dependence on explicit action labels?