Next-Latent Prediction Transformers Learn Compact World Models

Source

Raw Markdown: paper_nextlat-2026.md
PDF: paper_nextlat-2026.pdf
Preprint: arXiv 2511.05963
OpenReview: PLAN-FM Bridge @ AAAI 2026
Official blog: Next-Latent Prediction Transformers
Official code: JaydenTeoh/NextLat
Official X thread: Jayden Teoh announcement

Local artifact snapshots are stored as raw provenance under papers/nextlat-2026/github-readme-nextlat.md, papers/nextlat-2026/official-blog-nextlat-2026.md, papers/nextlat-2026/openreview-planfm-2026.md, and papers/nextlat-2026/x-thread-jayden_teoh-2066905213328605612.md.

Status And Credibility

This is a current arXiv preprint: v1 was submitted on 2025-11-08, and the latest checked revision is v4 from 2026-06-15. The arXiv page labels it a Microsoft Research preprint. OpenReview lists it as published at PLAN-FM Bridge @ AAAI 2026, which is useful venue and status context but not a top-tier main-conference archival acceptance. The official repository releases code, training/evaluation scripts, baseline implementations, configs, and reproducibility notes under an MIT license; the paper artifact is CC BY 4.0. This makes the source credible, recent evidence for latent-space supervision in Transformers, though it still requires independent replication and larger-scale evidence.

Core Claim

NextLat augments ordinary autoregressive next-token training with a self-supervised latent transition objective. The base Transformer still encodes a prefix into hidden states and predicts the next token, but training also fits a lightweight latent dynamics model that predicts the Transformer’s next hidden state from the current hidden state and the next token.

flowchart LR
  Ht[hidden state h_t] --> Dyn[latent dynamics p_psi]
  Xt[next token X_{t+1}] --> Dyn
  Dyn --> Hhat[predicted hidden state h_hat_{t+1}]
  Hnext[target hidden state h_{t+1}] -. stop-gradient target .-> Loss[next-hidden loss]
  Hhat --> Loss
  Hhat --> KL[token-distribution KL]

The paper’s theoretical claim is that if next-token consistency and transition consistency are both fully optimized, the hidden state must become a belief state: a sufficient statistic of the token history for predicting the future. The intended effect is to add a recurrent inductive bias to a Transformer without changing the base architecture or the ordinary inference path.
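The two consistency conditions can be illustrated with a toy example that is not taken from the paper. It assumes a hypothetical binary-token task whose next-token target becomes 1 once two 1s have appeared, and it checks three hand-written "hidden state" encoders by brute force over short prefixes. All names here (`target_next`, `saturating_count`, `threshold_flag`, `last_token`) are illustrative assumptions. The point is only that each condition alone can hold for a state that is not a sufficient statistic, while a state satisfying both behaves like a belief state.

```python
from itertools import product

VOCAB = (0, 1)

def target_next(prefix):
    """Toy rule (an assumption): the target is 1 once two 1s have been seen."""
    return int(sum(prefix) >= 2)

def all_prefixes(max_len):
    """Enumerate every token tuple of length 0..max_len."""
    for n in range(max_len + 1):
        yield from product(VOCAB, repeat=n)

def next_token_consistent(encode, max_len=6):
    """Prefixes that share a state must share the next-token target."""
    seen = {}
    for p in all_prefixes(max_len):
        if seen.setdefault(encode(p), target_next(p)) != target_next(p):
            return False
    return True

def transition_consistent(encode, max_len=6):
    """The next state must be a function of (current state, next token)."""
    seen = {}
    for p in all_prefixes(max_len):
        for x in VOCAB:
            nxt = encode(p + (x,))
            if seen.setdefault((encode(p), x), nxt) != nxt:
                return False
    return True

def saturating_count(p):
    """Number of 1s seen, capped at 2: sufficient for this toy task."""
    return min(sum(p), 2)

def threshold_flag(p):
    """Only remembers whether the threshold was crossed."""
    return int(sum(p) >= 2)

def last_token(p):
    """Only remembers the most recent token (0 for the empty prefix)."""
    return p[-1] if p else 0

for enc in (saturating_count, threshold_flag, last_token):
    print(enc.__name__, next_token_consistent(enc), transition_consistent(enc))
```

In this toy setting `threshold_flag` predicts the current target but cannot be updated recursively, and `last_token` updates recursively but does not predict the target; only `saturating_count` passes both checks.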

Objective

The practical objective combines next-token loss, next-hidden-state regression, and an optional token-distribution KL term:

$$\mathcal{L}_{\text{NextLat}} = \mathcal{L}_{\text{next-token}} + \lambda_{\text{next-h}} \, \mathcal{L}_{\text{next-h}} + \lambda_{\text{KL}} \, \mathcal{L}_{\text{KL}}$$

The next-hidden term rolls out a simple latent dynamics model over horizon $d$ and matches detached target hidden states with Smooth L1 loss. The paper emphasizes that belief-state convergence needs only $d = 1$ in the idealized theorem; larger horizons mainly add richer empirical supervision.
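A minimal pure-Python sketch of how the next-hidden term and the weighted sum could be computed is given below. It is not the official implementation. The indexing convention (`hidden[i]` is the state after reading `tokens[:i + 1]`), the averaging over rollout steps, `beta = 1.0`, and the default loss weights are all assumptions made for illustration. Because plain Python has no autograd, the stop-gradient on the targets appears only as a comment.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def next_hidden_loss(dynamics, hidden, tokens, t, d):
    """Roll the dynamics model d steps from hidden[t] and average the losses.

    hidden[i] is assumed to be the state after reading tokens[:i + 1].
    In a real framework the targets hidden[t + k] would be detached.
    """
    h_hat = hidden[t]
    total = 0.0
    for k in range(1, d + 1):
        # Feed the prediction back in, so errors compound over the horizon.
        h_hat = dynamics(h_hat, tokens[t + k])
        total += smooth_l1(h_hat, hidden[t + k])
    return total / d

def nextlat_loss(l_token, l_next_h, l_kl=0.0, lam_next_h=1.0, lam_kl=0.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return l_token + lam_next_h * l_next_h + lam_kl * l_kl

# Toy check: the "hidden state" is a running sum of the tokens.
tokens = [1, 2, 3, 4]
hidden = [[1.0], [3.0], [6.0], [10.0]]

def exact(h, x):
    return [h[0] + x]

def biased(h, x):
    return [h[0] + x + 0.5]

print(next_hidden_loss(exact, hidden, tokens, t=0, d=2))
print(next_hidden_loss(biased, hidden, tokens, t=0, d=2))
```

The biased dynamics model shows why longer horizons add supervision: its error is 0.5 after one step and 1.0 after two, so the second rollout step contributes a larger penalty than the first.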

Evidence And Results

World modeling / Manhattan taxi rides: all models reach 100% next-token legality, but NextLat gives the best reported valid-trajectory score, sequence-compression score, and effective latent-rank score. The paper uses this to show that next-token accuracy alone can hide incoherent internal maps.
Reasoning / Countdown: NextLat outperforms GPT, BST, MTP, and JTP in the reported setup. With horizon $d = 1$, NextLat reaches 54.8% accuracy versus 42.3% for BST and 39.2%/39.0% for MTP/JTP at the same horizon.
Planning / Path-Star: NextLat maintains close to 100% solve rate across reported graph topologies, while token-space baselines degrade on harder settings.
TinyStories probing: frozen NextLat hidden states show stronger future-token predictive probe performance up to 20 tokens ahead while preserving next-token performance better than MTP/JTP.
FineWeb-Edu / language modeling: 1.3B models are trained on 100B FineWeb-Edu tokens. NextLat with $d = 2$ reports a modest average zero-shot accuracy gain over GPT, but not a consistent task-level improvement. The clearer result is that NextLat preserves perplexity better than MTP/JTP and enables stronger self-speculative decoding.
Self-speculative decoding: NextLat reports up to about $3.3 \times$ speedup in the evaluated language-model setting because the latent dynamics model can recursively draft beyond its training horizon; MTP/JTP are more constrained by fixed token-space horizons.
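The draft-then-verify loop behind speculative decoding can be sketched generically as follows. This is the standard greedy variant, not the paper's code: in NextLat the drafter would be the latent dynamics model rolled out recursively and decoded through the output head, whereas here `draft_next` and `verify_next` are hypothetical toy functions over integer tokens, and the draft is deliberately wrong at some positions. In a real model the verification of all drafted positions would be one batched forward pass rather than a Python loop.

```python
def speculative_step(prefix, draft_next, verify_next, k):
    """Draft k tokens, then keep the longest run the verifier agrees with.

    Always returns at least one token, and every returned token is the
    verifier's own greedy choice, so the output matches plain greedy decoding.
    """
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in drafted:
        true_tok = verify_next(ctx)
        accepted.append(true_tok)
        ctx.append(true_tok)
        if true_tok != tok:
            break  # first mismatch: keep the correction, drop later drafts
    else:
        accepted.append(verify_next(ctx))  # all drafts accepted: bonus token
    return accepted

def generate(prefix, n_new, draft_next, verify_next, k):
    """Repeat speculative steps until n_new tokens exist; also count rounds."""
    out, rounds = list(prefix), 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft_next, verify_next, k))
        rounds += 1
    return out[: len(prefix) + n_new], rounds

def verify_next(ctx):
    """Toy 'full model' (an assumption): a deterministic rule over the context."""
    return (3 * sum(ctx) + len(ctx)) % 5

def draft_next(ctx):
    """Toy drafter: agrees with the verifier except at every third position."""
    tok = verify_next(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 5

def greedy(prefix, n_new):
    """Reference decoding that calls the verifier once per token."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(verify_next(out))
    return out

tokens, rounds = generate([1], 12, draft_next, verify_next, k=3)
print(tokens == greedy([1], 12), rounds)
```

The speedup in practice comes from needing fewer verification rounds than generated tokens; the draft length `k` is fixed here, matching the statically selected draft lengths described for the paper's experiments.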

Why It Matters

NextLat is important for this wiki because it sits directly between next-token LLM training, JEPA-style latent prediction, recurrent-state learning, and world-model objectives. It does not merely add a better probe; it uses the model’s own hidden states as training targets so the Transformer is pressured to make them recursively predictable.

Alex’s LeNEPA note should be kept next to this source. The closest comparison is not ordinary JEPA alone, but the NEPA/LeNEPA question of target construction: should a time-series model predict an external next embedding, a distribution-regularized embedding, its own next hidden state as in NextLat, or a hybrid of these targets?

For Latent-Space Predictive Learning, the source is a strong language/sequence-modeling example of predicting learned internal state rather than raw tokens alone. For World Models, it offers a compact belief-state objective and a latent transition interface, but the action variable is the next token rather than a typed control input or intervention. For Foundation Time-Series Model Research Agenda, the useful transfer is the belief-state pressure and the diagnostics that separate next-token accuracy from latent world-model quality.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Latent-state prediction | partially closes outside TSFM | Adds a next-hidden-state objective with an idealized belief-state theorem and evidence that hidden states become more compact and future-predictive. | Needs numeric time-series, event-stream, and high-dimensional multivariate tests where the latent tracks regimes, rare events, cross-channel state, and dense values. |
| World-model transition interface | adjacent | Learns a latent dynamics model $p_{\psi}(h_{t}, X_{t+1}) \to h_{t+1}$ and tests internal world-model consistency on Manhattan taxi trajectories. | The transition input is a next token, not a typed action, control input, intervention, or candidate action sequence; no closed-loop operational control benchmark. |
| Benchmark hygiene | warning | Shows 100% next-token legality on Manhattan can coexist with poor internal maps; reports valid trajectories, sequence compression, effective latent rank, and detour robustness. | Need TSFM benchmarks that similarly separate observation loss from latent-state quality and action consequence quality. |
| Dynamic compute and serving | partially closes outside TSFM | Variable-length self-speculative decoding uses latent rollouts to draft beyond the training horizon and reports up to about $3.3 \times$ speedup. | Draft length is still selected statically in experiments; serving evidence is language-model/B200-specific and not a learned fixed-FLOPs controller. |
| Control and counterfactuals | insufficient evidence | The belief-state framing is control-relevant, and the paper uses planning tasks. | No explicit intervention or control-input channels for numeric systems; no counterfactual rollout under candidate actions. |

Limitations

The main evidence is sequence modeling, synthetic/world-model tests, and language modeling, not numeric time series or operational telemetry.
The paper’s theorem is idealized: it assumes successful optimization of next-token and transition consistency.
The latent dynamics model is a simple MLP; the authors leave richer transition architectures and hierarchical belief states for future work.
Stop-gradient placement, KL self-distillation, Smooth L1 loss, optimizer behavior, and multi-step supervision remain empirical design choices with scale sensitivity.
Reported language-model gains are clearer for representation quality, perplexity preservation, and speculative decoding than for consistent zero-shot benchmark accuracy.
Self-speculative decoding experiments use fixed draft lengths selected between 2 and 10 tokens; adaptive draft-length policies remain future work.
The official code release improves reproducibility, but the README notes torch.compile() sensitivity on Path-Star and $A_{5}$, B200-specific throughput measurement details, and checkpoint-dependent post-hoc evaluation steps.

Open Questions

Does NextLat still improve state quality at larger language-model scales where next-token training already learns stronger long-range representations?
Which latent target layer is best for time-series or world-model use: final hidden state, intermediate state, multi-layer aggregate, channel-specific state, or task-conditioned state?
In a LeNEPA-style time-series experiment, when does an own-hidden target beat an external next-embedding target under matched compute and dense-value preservation checks?
Can a NextLat-style objective preserve dense numeric detail while compressing history into belief states?
How should the transition input be generalized from next token to typed actions, control inputs, interventions, events, or exogenous variables?
Can latent speculative decoding become an adaptive compute-allocation mechanism rather than a fixed draft-length serving trick?
Does co-training a latent dynamics model provide a practical route for training nonlinear recurrent state models without full BPTT on long sequences?
