NextLat

Summary

NextLat is the next-latent-prediction method introduced in the paper “Next-Latent Prediction Transformers Learn Compact World Models”. It trains an autoregressive Transformer with ordinary next-token prediction plus an auxiliary objective that predicts the model’s own next hidden state.

The method matters because it turns a Transformer’s hidden state into an explicitly supervised latent transition object. The base Transformer and its ordinary autoregressive inference path remain unchanged, while a lightweight latent dynamics model is used during training and for optional self-speculative decoding.

Method Contract

  • Base model: decoder-only autoregressive Transformer.
  • Main objective: next-token cross-entropy.
  • Auxiliary objective: predict the next hidden state h_t+1 from the current hidden state h_t and the next token with a latent dynamics model.
  • Target handling: detached hidden-state targets and stop-gradient choices are used to avoid collapse and reduce extra backward cost.
  • Optional semantic alignment: KL matching between token distributions from true and predicted hidden states.
  • Claimed latent semantics: optimized hidden states become belief states, i.e. compact sufficient statistics for predicting future observations.
  • Serving hook: recursively rolling the latent dynamics model can draft variable-length continuations for self-speculative decoding.

```mermaid
flowchart LR
  Prefix[token history] --> Tr[Transformer]
  Tr --> Ht[h_t]
  Ht --> LM[next-token head]
  LM --> Xt[next token]
  Ht --> Psi[latent dynamics]
  Xt --> Psi
  Psi --> Hhat[h_hat_t+1]
  Hnext[h_t+1 target] -. detached .-> Loss[NextLat loss]
  Hhat --> Loss
```
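
The contract above can be written as one training objective. This is a sketch under stated assumptions, not the paper's exact formulation: the indexing convention, the distance d, the weights λ_h and λ_KL, the direction of the KL term, and the symbols f_θ (Transformer), ψ (latent dynamics model), p_θ(· | h) (token distribution read out from a hidden state), and sg (stop-gradient) are notation chosen here for illustration.

```latex
\begin{aligned}
h_t &= f_\theta(x_{\le t}), \qquad \hat{h}_{t+1} = \psi(h_t, x_{t+1}), \\
\mathcal{L}_{\text{token}} &= -\sum_t \log p_\theta(x_{t+1} \mid h_t), \\
\mathcal{L}_{\text{latent}} &= \sum_t d\big(\hat{h}_{t+1}, \operatorname{sg}(h_{t+1})\big), \\
\mathcal{L}_{\text{KL}} &= \sum_t \mathrm{KL}\big(p_\theta(\cdot \mid \operatorname{sg}(h_{t+1})) \,\|\, p_\theta(\cdot \mid \hat{h}_{t+1})\big), \\
\mathcal{L} &= \mathcal{L}_{\text{token}} + \lambda_h \, \mathcal{L}_{\text{latent}} + \lambda_{\text{KL}} \, \mathcal{L}_{\text{KL}}.
\end{aligned}
```

Setting λ_KL = 0 drops the optional semantic alignment term. The stop-gradient on h_t+1 corresponds to the detached target in the contract: the prediction is pulled toward the target, while the target is not pulled toward the prediction.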

Official Artifacts

The official repository includes NextLat plus GPT, MTP, JTP, and BST baselines, along with training and evaluation scripts, configs, and data instructions. The availability of code does not by itself mean the paper’s claims have been independently replicated.

Relevance To This Wiki

NextLat belongs on the latent-space predictive learning, JEPA-adjacent, and world-model branches. It is not a pure JEPA system because it keeps next-token prediction and uses the Transformer’s own hidden states as targets rather than a separate target encoder. It is also not a complete action-conditioned world model because the transition is over hidden state plus next token, not over typed external actions or interventions.

It should also be read as a close neighbor of Alex’s LeNEPA idea: LeNEPA asks whether NEPA-style next-embedding prediction plus LeJEPA-style distribution regularization should use external embeddings, own hidden states, or both. NextLat supplies the own-hidden-state side of that comparison.

For time-series and operational world-model work, the useful transfer is the pressure toward compact belief states and the evaluation lesson: next-observation accuracy is not enough. A time-series foundation model (TSFM) analogue should check whether latent states preserve regimes, rare events, channel dependencies, exogenous variables, and action history, not only whether forecast loss improves.
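
One hypothetical way to run such a check is a probe on frozen latent states: fit a very simple classifier from latent vectors to a property of interest (here a regime label) and measure held-out accuracy. The sketch below uses a nearest-centroid probe in plain Python. The function names, the toy latent vectors, and the "calm"/"volatile" labels are all invented for illustration and do not come from the paper or its repository.

```python
from collections import defaultdict
from math import dist


def fit_centroids(states, labels):
    """Average the latent vectors that share a label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(states, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def probe_accuracy(centroids, states, labels):
    """Fraction of held-out states whose nearest centroid has the true label."""
    hits = 0
    for vec, label in zip(states, labels):
        predicted = min(centroids, key=lambda name: dist(centroids[name], vec))
        hits += predicted == label
    return hits / len(labels)


# Toy latent states (hypothetical): two regimes that are separable in latent space.
train_states = [[0.0, 0.1], [0.2, -0.1], [1.0, 0.9], [0.8, 1.1]]
train_labels = ["calm", "calm", "volatile", "volatile"]
test_states = [[0.1, 0.0], [0.9, 1.0]]
test_labels = ["calm", "volatile"]

centroids = fit_centroids(train_states, train_labels)
accuracy = probe_accuracy(centroids, test_states, test_labels)
print(f"regime probe accuracy: {accuracy:.2f}")
```

A probe this weak is deliberate: if even a nearest-centroid rule recovers the regime, the information is plainly present in the latent state. The same pattern applies to the other properties listed above by swapping the labels.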

Caveats

  • Evidence is language and synthetic/sequence-world-model evidence, not numeric time-series evidence.
  • The belief-state claim rests on an idealized theorem that assumes successful optimization; it does not remove empirical target/loss-design questions.
  • The latent dynamics model is simple and underexplored.
  • Self-speculative decoding is promising but currently evaluated with fixed draft-length sweeps rather than learned adaptive budgets.
  • The official README records reproducibility caveats around torch.compile(), Triton/Liger kernels, and hardware-specific throughput measurement.
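
To make the fixed draft-length sweep concrete, here is a minimal greedy draft-then-verify sketch in plain Python. Everything in it is a toy assumption: `base_next_token` stands in for the base Transformer, `draft_tokens` stands in for recursively rolling the latent dynamics model (with a deliberate periodic error so that some drafts are rejected), and none of these names come from the official repository. It also omits the bonus token that full speculative decoding emits when every drafted token is accepted.

```python
def base_next_token(prefix):
    """Toy stand-in for the base Transformer's greedy next token."""
    return (sum(prefix) + 1) % 5


def draft_tokens(prefix, draft_len):
    """Toy stand-in for recursively rolling the latent dynamics model.

    It mostly agrees with the base rule but errs at every fourth position,
    so some drafted tokens are rejected during verification.
    """
    context, drafted = list(prefix), []
    for _ in range(draft_len):
        guess = (sum(context) + 1) % 5
        if len(context) % 4 == 3:
            guess = (guess + 1) % 5  # deliberate drafting error
        drafted.append(guess)
        context.append(guess)
    return drafted


def greedy_decode(prompt, num_new_tokens):
    """Reference decoding: one base-model call per generated token."""
    out = list(prompt)
    for _ in range(num_new_tokens):
        out.append(base_next_token(out))
    return out


def speculative_decode(prompt, num_new_tokens, draft_len):
    """Draft draft_len tokens, then keep the base model's token at each position.

    Returns the decoded sequence and the number of verification rounds. A real
    system would score all drafted positions in one batched base-model pass;
    here that pass is simulated position by position.
    """
    if draft_len < 1:
        raise ValueError("draft_len must be at least 1")
    out = list(prompt)
    target_len = len(prompt) + num_new_tokens
    verify_rounds = 0
    while len(out) < target_len:
        drafted = draft_tokens(out, draft_len)
        verify_rounds += 1
        for token in drafted:
            expected = base_next_token(out)
            out.append(expected)  # output always follows the base model
            if token != expected or len(out) == target_len:
                break  # first mismatch (or reaching the target) ends the round
    return out, verify_rounds


# Fixed draft-length sweep: same output, different number of verification rounds.
prompt, num_new = [1], 12
rounds_by_len = {
    k: speculative_decode(prompt, num_new, k)[1] for k in (1, 2, 4, 8)
}
print(rounds_by_len)
```

Because every emitted token is the base model's own greedy choice, the output matches plain greedy decoding for any draft length; only the number of verification rounds changes. A learned adaptive budget would replace the fixed `draft_len` with a per-step decision, which this sketch does not attempt.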