LeWorldModel: Stable End-To-End Joint-Embedding Predictive Architecture From Pixels

Source

Raw Markdown: paper_leworldmodel-2026.md
PDF: paper_leworldmodel-2026.pdf

Core Claim

LeWorldModel trains a stable end-to-end JEPA world model from raw pixels using next-embedding prediction and Gaussian-distribution regularization.

Key Contributions

Presents a two-term objective for stable pixel world modeling.
Avoids EMA, pretrained encoders, auxiliary supervision, and multi-loss heuristic stacks.
Regularizes latent embeddings toward a Gaussian distribution to prevent representation collapse.
Reports fast planning and meaningful physical latent structure on control tasks.
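A minimal sketch of what a two-term objective of this shape could look like, to make the collapse argument concrete. This page does not spell out the paper's exact regularizer, so `gaussian_moment_penalty` below is a swapped-in stand-in (simple moment matching toward zero mean and identity covariance), and all function names and the `reg_weight` parameter are hypothetical.

```python
import numpy as np

def prediction_loss(pred_next, target_next):
    """Mean squared error between predicted and encoded next embeddings."""
    return float(np.mean((pred_next - target_next) ** 2))

def gaussian_moment_penalty(z):
    """Stand-in regularizer (assumption, not the paper's method):
    penalize a batch whose mean differs from 0 or whose covariance
    differs from the identity matrix."""
    mean = z.mean(axis=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    eye = np.eye(z.shape[1])
    return float(np.sum(mean ** 2) + np.mean((cov - eye) ** 2))

def two_term_objective(pred_next, target_next, z, reg_weight=1.0):
    """Prediction term plus weighted distribution term."""
    return (prediction_loss(pred_next, target_next)
            + reg_weight * gaussian_moment_penalty(z))

rng = np.random.default_rng(0)
healthy = rng.standard_normal((512, 8))   # spread-out embeddings
collapsed = np.zeros((512, 8))            # every input maps to one point

# A collapsed encoder makes next-embedding prediction trivially perfect...
collapsed_pred = prediction_loss(collapsed, collapsed)
# ...so only the distribution term can tell the two cases apart.
collapsed_reg = gaussian_moment_penalty(collapsed)
healthy_reg = gaussian_moment_penalty(healthy)
```

In this toy setup the collapsed batch has zero prediction loss but a larger penalty than the healthy batch, which is the intuition behind pairing a prediction term with a distribution-matching term.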

Method Notes

LeWorldModel operationalizes ideas from APTAMI, LeJEPA, and World Models.

Temporal Straightening makes an adjacent emergent property explicit. LeWorldModel reports that its PushT latent paths become straighter during training without a dedicated curvature loss; its appendix also calls this a possible form of temporal collapse while associating it with useful downstream structure. Temporal Straightening, by contrast, directly optimizes the alignment of consecutive latent velocities and tests whether that geometry improves GD/MPC planning. The Temporal Straightening paper predates LeWorldModel’s arXiv release, so the comparison should be read as one between complementary objectives rather than as a LeWorldModel follow-up experiment.
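One way to quantify "straightness" in the sense described above is the average cosine similarity between consecutive latent velocities. The sketch below is an illustrative measurement under that assumption; the function name and the toy trajectories are hypothetical and are not taken from either paper.

```python
import numpy as np

def mean_velocity_cosine(latents):
    """Average cosine similarity between consecutive latent velocities.

    latents: array of shape (T, D) with T >= 3.
    A value of 1.0 means the path is a straight line; lower values
    mean the path bends more between steps.
    """
    v = np.diff(latents, axis=0)                 # velocities, shape (T-1, D)
    norms = np.linalg.norm(v, axis=1)
    dots = np.sum(v[:-1] * v[1:], axis=1)
    cos = dots / (norms[:-1] * norms[1:] + 1e-12)
    return float(np.mean(cos))

# Toy trajectories: a straight line and a half circle in a 3-D latent space.
t = np.linspace(0.0, 1.0, 20)[:, None]
straight = t * np.array([[1.0, 2.0, -1.0]])
theta = np.linspace(0.0, np.pi, 20)
curved = np.stack([np.cos(theta), np.sin(theta), np.zeros(20)], axis=1)

straight_score = mean_velocity_cosine(straight)
curved_score = mean_velocity_cosine(curved)
```

A dedicated curvature loss could minimize `1 - mean_velocity_cosine(latents)`, whereas an emergent-straightening claim would only track this quantity during training without optimizing it.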

Evidence And Results

The abstract reports training with about 15M parameters on a single GPU, planning up to 48x faster than foundation-model-based world models, and competitive control performance across 2D and 3D tasks.

Limitations

The paper notes short-horizon planning, offline data coverage, and action-label reliance as remaining limitations.

stable-worldmodel is the follow-up evaluation substrate to watch for this line of work: it puts LeWM-style latent world-model baselines into a shared data, solver, and factor-of-variation protocol in which prediction error and planning success can be measured separately.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Causal structure, counterfactuals, and control	partially closes	Trains on offline trajectories of observations and actions, predicts next latent states, and uses MPC to optimize candidate action sequences.	Evidence is visual control, not multivariate operational time series or explicit causal intervention logs.
Anti-collapse regularization	partially closes	Uses Gaussian-distributed latent embeddings to stabilize end-to-end JEPA world-model training from pixels.	Needs evidence that the regularizer preserves rare regimes and dense numeric state in time-series domains.
Representation quality: semantic state vs dense numeric detail	partially closes	Physical probes and violation-of-expectation tests show latent structure captures some physical state and implausible events.	Short-horizon latent planning and no dense numeric reconstruction/editing interface.
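The control row above describes predicting next latent states and using MPC to optimize candidate action sequences. A minimal sketch of that loop follows, assuming a random-shooting solver and a toy additive predictor. Both are stand-ins chosen for illustration: this page does not specify the paper's solver, and `toy_predictor`, `plan_random_shooting`, and `run_mpc` are hypothetical names.

```python
import numpy as np

def toy_predictor(z, a):
    """Hypothetical stand-in for a learned latent predictor: z_next = z + a."""
    return z + a

def plan_random_shooting(z0, z_goal, predictor, rng, horizon=3, n_candidates=512):
    """Sample action sequences, roll them out in latent space, and return
    the sequence with the lowest summed squared distance to the goal."""
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, z0.shape[0]))
    z = np.tile(z0, (n_candidates, 1))
    costs = np.zeros(n_candidates)
    for step in range(horizon):
        z = predictor(z, actions[:, step])
        costs += np.sum((z - z_goal) ** 2, axis=1)   # running cost per step
    best = int(np.argmin(costs))
    return actions[best], float(costs[best])

def run_mpc(z0, z_goal, predictor, steps=10, seed=0):
    """Receding-horizon control: replan every step, execute the first action."""
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(steps):
        plan, _ = plan_random_shooting(z, z_goal, predictor, rng)
        z = predictor(z, plan[0])
    return z

z_start = np.zeros(2)
z_goal = np.array([2.0, -1.5])
z_final = run_mpc(z_start, z_goal, toy_predictor)
start_dist = float(np.linalg.norm(z_start - z_goal))
final_dist = float(np.linalg.norm(z_final - z_goal))
```

Because every rollout happens in the latent space, the cost of planning is dominated by calls to the predictor, which is one plausible reason a small latent model can plan quickly. Separating `plan_random_shooting` from the predictor also mirrors the idea of evaluating prediction error and planning success independently.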

Links Into The Wiki

Open Questions

Can LeWorldModel scale to long-horizon hierarchical planning?
Can inverse dynamics reduce dependence on explicit action labels?
