MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Source

Core Claim

MesaNet uses a chunkwise-parallelizable Mesa layer whose in-context regression objective is solved to near-optimality at every time step using conjugate gradients.
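
To make the claim concrete, here is a minimal sequential sketch of a Mesa-style layer. It is not the paper's chunkwise-parallel implementation. It assumes the in-context objective is ridge regression from keys to values, so the output at step t is G_t H_t^{-1} q_t with H_t = lam*I + sum_i k_i k_i^T and G_t = sum_i v_i k_i^T. The names `conjugate_gradient`, `mesa_layer_sequential`, `lam`, `max_steps`, and `tol` are hypothetical and chosen for illustration only.

```python
import numpy as np

def conjugate_gradient(H, b, max_steps=50, tol=1e-16):
    """Solve H x = b for symmetric positive-definite H.

    Returns (x, steps), where steps is the number of iterations used.
    tol is a threshold on the squared residual norm (an assumption here).
    """
    x = np.zeros_like(b)
    r = b.copy()              # residual b - H @ x for x = 0
    rs = r @ r
    if rs < tol:              # b is (numerically) zero, nothing to solve
        return x, 0
    p = r.copy()              # first search direction
    steps = 0
    for steps in range(1, max_steps + 1):
        Hp = H @ p
        alpha = rs / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new < tol:      # converged
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, steps

def mesa_layer_sequential(K, V, Q, lam=1.0, max_steps=50):
    """Naive per-token loop: K is (T, d_k), V is (T, d_v), Q is (T, d_k)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    H = lam * np.eye(d_k)     # regularized key covariance (assumed ridge term)
    G = np.zeros((d_v, d_k))  # value-key cross moment
    out = np.zeros((T, d_v))
    for t in range(T):
        H += np.outer(K[t], K[t])
        G += np.outer(V[t], K[t])
        x, _ = conjugate_gradient(H, Q[t], max_steps=max_steps)  # x ~ H^{-1} q_t
        out[t] = G @ x
    return out
```

With a small `max_steps` the solve is only approximate, which is where "close to optimality" enters. The step count returned by `conjugate_gradient` also varies with how well conditioned H is, which is one way the inference cost becomes input-dependent.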

Relevance To This Wiki

It belongs to the memory-as-optimization branch: inference compute is spent solving local sequential optimization problems inside the model rather than only applying fixed layers.

Limitations

The main evidence comes from language modeling and long-context tasks. The extra inference FLOPs are intrinsic to the method, so any comparison against other architectures should be made at a matched compute budget.
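
As a rough aid for matched-budget accounting, the sketch below estimates the extra per-token FLOPs of a naive conjugate-gradient solve. It assumes one dense d_key x d_key matrix-vector product per iteration plus a handful of vector operations. The function name and parameters are hypothetical, and the paper's chunkwise implementation may have a different cost profile.

```python
def cg_extra_flops_per_token(d_key, cg_steps, n_heads=1):
    """Rough FLOP estimate for cg_steps conjugate-gradient iterations per head."""
    matvec = 2 * d_key * d_key   # multiply-adds for one dense H @ p
    vector_ops = 10 * d_key      # two dot products and three axpy updates
    return n_heads * cg_steps * (matvec + vector_ops)
```

An estimate like this can be added to a baseline's per-token budget, for example as extra layers or extra inference steps, before comparing quality.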

Foundation TSFM Relevance

MesaNet is adjacent to dynamic compute and test-time adaptation for time-series systems, where extra inference effort may be worthwhile for rare regimes or high-uncertainty windows.

Open Questions

  • What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
  • Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?