Next-Latent Prediction Transformers Learn Compact World Models
Source
- Raw Markdown: paper_nextlat-2026.md
- PDF: paper_nextlat-2026.pdf
- Preprint: arXiv 2511.05963
- OpenReview: PLAN-FM Bridge @ AAAI 2026
- Official blog: Next-Latent Prediction Transformers
- Official code: JaydenTeoh/NextLat
- Official X thread: Jayden Teoh announcement
Local artifact snapshots are stored as raw provenance under papers/nextlat-2026/github-readme-nextlat.md, papers/nextlat-2026/official-blog-nextlat-2026.md, papers/nextlat-2026/openreview-planfm-2026.md, and papers/nextlat-2026/x-thread-jayden_teoh-2066905213328605612.md.
Status And Credibility
This is a current arXiv preprint: v1 was submitted on 2025-11-08 and the latest checked revision is v4 from 2026-06-15. The arXiv page labels it a Microsoft Research preprint. OpenReview lists it as published at PLAN-FM Bridge @ AAAI 2026, which is useful venue/status context but not a top-tier main-conference archival acceptance. The official repository releases code, training/evaluation scripts, baseline implementations, configs, and reproducibility notes under an MIT license; the paper artifact is CC BY 4.0. This makes the source credible, recent evidence for latent-space supervision in Transformers, though it still requires independent replication and larger-scale evidence.
Core Claim
NextLat augments ordinary autoregressive next-token training with a self-supervised latent transition objective. The base Transformer still encodes a prefix into hidden states and predicts the next token, but training also fits a lightweight latent dynamics model that predicts the Transformer’s next hidden state from the current hidden state and the next token.
flowchart LR
    Ht[hidden state h_t] --> Dyn[latent dynamics p_psi]
    Xt[next token X_{t+1}] --> Dyn
    Dyn --> Hhat[predicted hidden state h_hat_{t+1}]
    Hnext[target hidden state h_{t+1}] -. stop-gradient target .-> Loss[next-hidden loss]
    Hhat --> Loss
    Hhat --> KL[token-distribution KL]
The paper’s theoretical claim is that if next-token consistency and transition consistency are optimized, the hidden state must become a belief state: a sufficient statistic of the token history for predicting the future. The intended effect is to add a recurrent inductive bias to a Transformer without changing the base architecture or ordinary inference path.
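Transition consistency can be illustrated with a toy, non-neural example. The sketch below is not from the paper: `encode`, `transition`, and `is_transition_consistent` are hypothetical names, and the "hidden state" is just a running sum modulo 3. It checks the recurrent property the claim relies on, namely that applying the transition to the state of a prefix and the next token reproduces the state of the extended prefix.

```python
from typing import Callable, Sequence


def encode(prefix: Sequence[int], modulus: int = 3) -> int:
    """Toy 'encoder' (assumption): maps a whole prefix to a state, here sum mod `modulus`."""
    return sum(prefix) % modulus


def transition(state: int, next_token: int, modulus: int = 3) -> int:
    """Toy latent dynamics (assumption): updates the state from the next token alone."""
    return (state + next_token) % modulus


def is_transition_consistent(
    tokens: Sequence[int],
    encode_fn: Callable[[Sequence[int]], int],
    transition_fn: Callable[[int, int], int],
) -> bool:
    """Check transition(encode(x[:t]), x[t]) == encode(x[:t+1]) for every position t."""
    for t in range(len(tokens)):
        predicted = transition_fn(encode_fn(tokens[:t]), tokens[t])
        if predicted != encode_fn(tokens[: t + 1]):
            return False
    return True


tokens = [1, 2, 0, 2]
# The running-sum encoder is consistent with the additive transition.
print(is_transition_consistent(tokens, encode, transition))
# An encoder that only tracks prefix-length parity is not.
print(is_transition_consistent(tokens, lambda p: len(p) % 2, transition))
```

In this toy setting the consistent encoder behaves like a recurrent state: the full history is never needed once the current state and the next token are known, which is the inductive bias the latent transition objective is meant to encourage.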
Objective
The practical objective combines next-token loss, next-hidden-state regression, and an optional token-distribution KL term.
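In schematic form, a combined objective of this kind can be written as follows. The weights $\lambda_h$ and $\lambda_{\mathrm{KL}}$, the horizon symbol $d$, the stop-gradient operator $\operatorname{sg}$, and the choice of a plain sum over rollout steps are assumed notation for illustration, not necessarily the paper's own; $p_\psi$, $h_t$, and $\hat{h}_{t+1}$ follow the diagram above.

```latex
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{\text{next-token}}
  + \lambda_h\,\mathcal{L}_{\text{next-hidden}}
  + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}}, \\
\mathcal{L}_{\text{next-hidden}} &= \sum_{k=1}^{d}
  \operatorname{SmoothL1}\!\big(\hat{h}_{t+k},\ \operatorname{sg}(h_{t+k})\big), \\
\hat{h}_{t+k} &= p_\psi\big(\hat{h}_{t+k-1},\, x_{t+k}\big),
  \qquad \hat{h}_{t} = h_t .
\end{aligned}
```

Setting $\lambda_{\mathrm{KL}} = 0$ in this sketch corresponds to dropping the optional KL term.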
The next-hidden term rolls out a simple latent dynamics model over a multi-step horizon and matches detached target hidden states with Smooth L1 loss. The paper emphasizes that belief-state convergence in the idealized theorem needs only a one-step horizon; larger horizons mainly add richer empirical supervision.
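The multi-step rollout can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the official implementation: `smooth_l1` and `rollout_loss` are hypothetical helpers, hidden states are lists of floats, `hidden[t]` is the target state after the first `t` tokens, and averaging over all rollout steps is an assumed reduction. Because there is no autodiff here, the targets are simply constants; in a real framework they would be detached.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean Smooth L1 (Huber-style) loss: quadratic below `beta`, linear above."""
    total = 0.0
    for p, y in zip(pred, target):
        diff = abs(p - y)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def rollout_loss(
    hidden: Sequence[Vector],
    tokens: Sequence[int],
    dynamics: Callable[[Vector, int], Vector],
    horizon: int,
) -> float:
    """Roll the dynamics forward up to `horizon` steps from every position.

    tokens[t] is the token that moves hidden[t] to hidden[t + 1], so
    len(hidden) == len(tokens) + 1. Predictions are fed back recursively;
    targets come from `hidden` and are treated as fixed.
    """
    losses = []
    for start in range(len(tokens)):
        state = hidden[start]
        for step in range(horizon):
            idx = start + step
            if idx >= len(tokens):
                break  # rollout would run past the end of the sequence
            state = dynamics(state, tokens[idx])
            losses.append(smooth_l1(state, hidden[idx + 1]))
    return sum(losses) / len(losses) if losses else 0.0


def toy_dynamics(state: Vector, token: int) -> Vector:
    """Hypothetical dynamics: add the token value to every coordinate."""
    return [v + token for v in state]


hidden_states = [[0.0], [1.0], [3.0]]
print(rollout_loss(hidden_states, [1, 2], toy_dynamics, horizon=2))
```

With a horizon of 1 only one-step predictions are scored; larger horizons also score predictions built on earlier predictions, which is where the additional supervision comes from.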
Evidence And Results
- World modeling / Manhattan taxi rides: all models reach 100% next-token legality, but NextLat gives the best reported valid-trajectory score, sequence-compression score, and effective latent-rank score. The paper uses this to show that next-token accuracy alone can hide incoherent internal maps.
- Reasoning / Countdown: NextLat outperforms GPT, BST, MTP, and JTP in the reported setup. At a matched prediction horizon, NextLat reaches 54.8% accuracy versus 42.3% for BST and 39.2%/39.0% for MTP/JTP.
- Planning / Path-Star: NextLat maintains close to 100% solve rate across reported graph topologies, while token-space baselines degrade on harder settings.
- TinyStories probing: frozen NextLat hidden states show stronger future-token predictive probe performance up to 20 tokens ahead while preserving next-token performance better than MTP/JTP.
- FineWeb-Edu / language modeling: 1.3B models are trained on 100B FineWeb-Edu tokens. NextLat reports a modest average zero-shot accuracy gain over GPT, but not a consistent task-level improvement. The clearer result is that NextLat preserves perplexity better than MTP/JTP and enables stronger self-speculative decoding.
- Self-speculative decoding: NextLat reports a decoding speedup in the evaluated language-model setting because the latent dynamics model can recursively draft beyond its training horizon; MTP/JTP are more constrained by fixed token-space horizons.
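The self-speculative decoding idea in the last item can be sketched with a toy greedy draft-and-verify loop. This is an illustration under assumptions, not the paper's decoding procedure: `speculative_step`, `toy_encode`, `toy_readout`, and the two dynamics functions are hypothetical, states are integers, and verification is exact-match against the base model's greedy token. The draft comes from recursive latent rollouts, so its length is not tied to a fixed token-space horizon.

```python
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    encode: Callable[[List[int]], int],
    readout: Callable[[int], int],
    dynamics: Callable[[int, int], int],
    draft_len: int,
) -> List[int]:
    """Draft `draft_len` tokens by latent rollout, then verify greedily.

    encode(prefix) stands in for the base model's hidden state, readout(state)
    for its greedy next token, and dynamics(state, token) for the latent
    transition. The output always equals what greedy base decoding would emit.
    """
    # Draft phase: no base-model call after the initial encoding.
    state = encode(prefix)
    draft = []
    for _ in range(draft_len):
        token = readout(state)
        draft.append(token)
        state = dynamics(state, token)

    # Verify phase: a real Transformer would score all positions in one pass.
    accepted: List[int] = []
    for token in draft:
        target = readout(encode(prefix + accepted))
        if token != target:
            accepted.append(target)  # replace the first wrong draft token
            return accepted
        accepted.append(token)
    accepted.append(readout(encode(prefix + accepted)))  # bonus token
    return accepted


def toy_encode(prefix: List[int]) -> int:
    return sum(prefix) % 7


def toy_readout(state: int) -> int:
    return (state + 1) % 7


def exact_dynamics(state: int, token: int) -> int:
    return (state + token) % 7  # matches toy_encode, so every draft is accepted


def stale_dynamics(state: int, token: int) -> int:
    return state  # ignores the token, so drafts go wrong after the first


print(speculative_step([2], toy_encode, toy_readout, exact_dynamics, draft_len=3))
print(speculative_step([2], toy_encode, toy_readout, stale_dynamics, draft_len=3))
```

In this sketch an accurate dynamics model yields `draft_len + 1` tokens per verification pass, while an inaccurate one falls back toward one or two tokens per pass without changing the decoded output.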
Why It Matters
NextLat is important for this wiki because it sits directly between next-token LLM training, JEPA-style latent prediction, recurrent-state learning, and world-model objectives. It does not merely add a better probe; it uses the model’s own hidden states as training targets so the Transformer is pressured to make them recursively predictable.
Alex’s LeNEPA note should be kept next to this source. The closest comparison is not ordinary JEPA alone, but the NEPA/LeNEPA question of target construction: should a time-series model predict an external next embedding, a distribution-regularized embedding, its own next hidden state as in NextLat, or a hybrid of these targets?
For Latent-Space Predictive Learning, the source is a strong language/sequence-modeling example of predicting learned internal state rather than raw tokens alone. For World Models, it offers a compact belief-state objective and a latent transition interface, but the action variable is the next token rather than a typed control input or intervention. For Foundation Time-Series Model Research Agenda, the useful transfer is the belief-state pressure and the diagnostics that separate next-token accuracy from latent world-model quality.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Latent-state prediction | partially closes outside TSFM | Adds a next-hidden-state objective with an idealized belief-state theorem and evidence that hidden states become more compact and future-predictive. | Needs numeric time-series, event-stream, and high-dimensional multivariate tests where the latent tracks regimes, rare events, cross-channel state, and dense values. |
| World-model transition interface | adjacent | Learns a latent dynamics model and tests internal world-model consistency on Manhattan taxi trajectories. | The transition input is a next token, not a typed action, control input, intervention, or candidate action sequence; no closed-loop operational control benchmark. |
| Benchmark hygiene | warning | Shows 100% next-token legality on Manhattan can coexist with poor internal maps; reports valid trajectories, sequence compression, effective latent rank, and detour robustness. | Need TSFM benchmarks that similarly separate observation loss from latent-state quality and action consequence quality. |
| Dynamic compute and serving | partially closes outside TSFM | Variable-length self-speculative decoding uses latent rollouts to draft beyond the training horizon and reports a decoding speedup. | Draft length is still selected statically in experiments; serving evidence is language-model/B200-specific and not a learned fixed-FLOPs controller. |
| Control and counterfactuals | insufficient evidence | The belief-state framing is control-relevant, and the paper uses planning tasks. | No explicit intervention or control-input channels for numeric systems; no counterfactual rollout under candidate actions. |
Limitations
- The main evidence is sequence modeling, synthetic/world-model tests, and language modeling, not numeric time series or operational telemetry.
- The paper’s theorem is idealized: it assumes successful optimization of next-token and transition consistency.
- The latent dynamics model is a simple MLP; the authors leave richer transition architectures and hierarchical belief states for future work.
- Stop-gradient placement, KL self-distillation, Smooth L1 loss, optimizer behavior, and multi-step supervision remain empirical design choices with scale sensitivity.
- Reported language-model gains are clearer for representation quality, perplexity preservation, and speculative decoding than for consistent zero-shot benchmark accuracy.
- Self-speculative decoding experiments use fixed draft lengths selected between 2 and 10 tokens; adaptive draft-length policies remain future work.
- The official code release improves reproducibility, but the README notes torch.compile() sensitivity on Path-Star, B200-specific throughput measurement details, and checkpoint-dependent post-hoc evaluation steps.
Links Into The Wiki
- NextLat
- Latent-Space Predictive Learning
- JEPA
- Next-Embedding Prediction
- LeNEPA
- World Models
- Looped Transformers And Test-Time Memory
- Foundation Time-Series Model Research Agenda
- Learn From Your Own Latents And Not From Tokens
- LeWorldModel
- Looped World Models
- Pretraining Recurrent Networks without Recurrence
Open Questions
- Does NextLat still improve state quality at larger language-model scales where next-token training already learns stronger long-range representations?
- Which latent target layer is best for time-series or world-model use: final hidden state, intermediate state, multi-layer aggregate, channel-specific state, or task-conditioned state?
- In a LeNEPA-style time-series experiment, when does an own-hidden target beat an external next-embedding target under matched compute and dense-value preservation checks?
- Can a NextLat-style objective preserve dense numeric detail while compressing history into belief states?
- How should the transition input be generalized from next token to typed actions, control inputs, interventions, events, or exogenous variables?
- Can latent speculative decoding become an adaptive compute-allocation mechanism rather than a fixed draft-length serving trick?
- Does co-training a latent dynamics model provide a practical route for training nonlinear recurrent state models without full BPTT on long sequences?