Language Models Need Sleep

Source

Credibility And Scope

This is an arXiv v1 preprint dated 2026-05-25 by Sangyun Lee, Sean McLeish, Tom Goldstein, and Giulia Fanti, with Carnegie Mellon University and University of Maryland affiliations in the paper. The team is credible for language-model architecture and training work, but no peer-reviewed venue version was found in the evidence available at ingest time.

Official narrative discovery did not find a verified paper-specific blog post, project page, code repository, or author X thread during the ingest. Search results also surfaced an unrelated 2025 OpenReview submission with a similar title and different authors; that source is not used here.

Core Claim

LLM Sleep argues that SSM-attention hybrid models do not fail on evicted context only because their fast-weight memory is too small. They can also fail because converting evicted context into a useful fast-weight state requires computation. The proposed fix is to add a sleep phase: before clearing the attention cache, the model performs offline recurrent passes over the current context and updates SSM fast weights, then resumes wake-time prediction with the cache cleared and no extra prediction-time loop.

flowchart LR
  Context[recent context window]
  Attention[attention cache]
  Sleep[sleep: N recurrent passes]
  FastWeights[SSM fast weights]
  Evict[clear KV cache]
  Predict[wake-time single forward pass]
  Context --> Attention
  Context --> Sleep
  Sleep --> FastWeights
  Attention --> Evict
  FastWeights --> Predict
  Evict --> Predict

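The useful abstraction is state refresh before eviction: consolidate the current window into SSM fast weights while it is still attendable, then clear the cache.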

Mechanism In Plain Terms

The starting point is a model with two memory systems. Attention keeps a high-fidelity KV cache for the current window, while SSM-style blocks maintain a fixed-size fast-weight state for information that must survive after the window is gone. In the paper’s controlled experiments this is a GDN-attention hybrid; in pretrained experiments it is Jet-Nemotron, or Ouro augmented with Jet layers so that it has SSM fast-weight memory. The method depends on that SSM-style persistent state, because the fast weights are the only thing sleep can write to. With a single sleep pass, it reduces to the ordinary SSM-attention hybrid baseline.

The sleep loop is not a loop over answer generation. It happens at a window boundary, before the attention cache is cleared. The model takes the current context window and runs the selected model blocks over it multiple times. During those passes, attention can still use the current window, but the only state meant to survive the boundary is the SSM fast-weight state. Hidden activations and the KV cache are then discarded, and later prediction proceeds with a normal single forward pass.
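The control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `HybridState`, `sleep_and_evict`, the `BlockFn` signature, and `toy_block` are all hypothetical names, and the toy update rule is an arbitrary stand-in for the real model blocks. The sketch only shows which state is looped over and which state survives the window boundary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class HybridState:
    """Two memory systems: a per-window KV cache and a persistent fast-weight state."""
    kv_cache: List[Vector] = field(default_factory=list)
    fast_weights: Vector = field(default_factory=list)


# Hypothetical block signature (an assumption for this sketch):
# (hidden, kv_cache, fast_weights) -> (new_hidden, new_fast_weights)
BlockFn = Callable[[List[Vector], List[Vector], Vector], Tuple[List[Vector], Vector]]


def sleep_and_evict(state: HybridState, window: List[Vector],
                    block_fn: BlockFn, n_passes: int) -> HybridState:
    """Run n_passes over the current window, then keep only the fast weights."""
    state.kv_cache = list(window)   # attention can still read the current window
    hidden = list(window)
    for _ in range(n_passes):
        hidden, state.fast_weights = block_fn(hidden, state.kv_cache, state.fast_weights)
    # Hidden activations go out of scope here; the KV cache is cleared explicitly.
    state.kv_cache = []
    return state


def toy_block(hidden: List[Vector], kv_cache: List[Vector],
              fast_weights: Vector) -> Tuple[List[Vector], Vector]:
    """Toy stand-in: move the fast weights halfway toward the window mean."""
    dim = len(fast_weights)
    mean = [sum(h[i] for h in hidden) / len(hidden) for i in range(dim)]
    new_fast = [0.5 * f + 0.5 * m for f, m in zip(fast_weights, mean)]
    return hidden, new_fast


state = HybridState(fast_weights=[0.0, 0.0])
window = [[1.0, 2.0], [3.0, 4.0]]
state = sleep_and_evict(state, window, toy_block, n_passes=3)
print(state.kv_cache, state.fast_weights)
```

With `n_passes=1` the function performs a single ordinary pass before eviction, which mirrors the statement that one sleep pass is the non-looped baseline.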

So the operational picture is:

| Question | Short answer |
| --- | --- |
| Is the base model a mixture of SSM and Transformer attention? | Yes for the main hybrid setting: attention handles the retained window, while SSM blocks carry fast-weight memory across evictions. |
| Are they running the same SSM again and again? | Not exactly. They loop the model blocks for sleep passes over the current window; the persistent effect of those passes is the updated SSM fast weights. |
| Does sleep change slow model parameters at serving time? | No. Sleep updates fast weights/state, not durable trained parameters. The slow parameters are learned during training. |
| Does it add chain-of-thought or repeated prediction-time reasoning? | No. Extra compute is spent before eviction during consolidation; wake-time answer prediction remains a single forward pass. |
| Why does this help? | Compressing context into useful state is itself a computation problem, not just a storage problem. More sleep passes give the model more compute to organize evicted context into a state that supports later reasoning. |

Training is modified so the model learns to use this sleep phase. The sequence is split into context windows. Chunks whose loss mask is all zero are treated as consolidation chunks: the model performs sleep passes and updates fast weights, but does not take a prediction loss there. Prediction chunks run once and compute the masked cross-entropy loss. Backpropagation goes through the whole graph, including the sleep passes and the fast-weight updates, so the slow parameters learn how to perform useful consolidation. This makes training sequential across windows, with cost that grows roughly in proportion to the number of sleep passes.
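The chunk scheduling in that training procedure can be sketched as a small planning function. This is an illustration under assumptions: `plan_chunks` and `total_block_passes` are hypothetical names, and treating any chunk with at least one unmasked position as a prediction chunk is this sketch's reading of the rule, not a detail confirmed by the paper.

```python
from typing import List, Tuple

# (start, end, kind, passes) for one context window
Chunk = Tuple[int, int, str, int]


def plan_chunks(loss_mask: List[int], window: int, n_passes: int) -> List[Chunk]:
    """Split a sequence into windows and decide how many passes each one gets."""
    plan: List[Chunk] = []
    for start in range(0, len(loss_mask), window):
        chunk = loss_mask[start:start + window]
        end = start + len(chunk)
        if any(chunk):
            # Some position contributes to the loss: run once and predict.
            plan.append((start, end, "prediction", 1))
        else:
            # All-zero mask: sleep passes only, no prediction loss.
            plan.append((start, end, "consolidation", n_passes))
    return plan


def total_block_passes(plan: List[Chunk]) -> int:
    """Rough cost proxy: number of window-level passes through the blocks."""
    return sum(passes for _, _, _, passes in plan)


mask = [0] * 8 + [1] * 4          # two context windows, then one answer window
for n in (1, 3):
    plan = plan_chunks(mask, window=4, n_passes=n)
    print(n, plan, total_block_passes(plan))
```

In this toy count the consolidation windows dominate, so the pass total moves from 3 to 7 as the number of sleep passes goes from 1 to 3, while the prediction window always costs one pass.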

For this wiki, the important abstraction is state refresh before eviction. In a streaming or infinite-context system, raw history cannot remain directly attendable forever. LLM Sleep suggests one possible serving contract: spend bounded extra compute when a finite retained window is about to roll off, write a better compact state, and keep normal online prediction cheap.

Key Contributions

  • Identifies a reasoning-over-evicted-context failure mode for SSM-attention hybrids: fixed-size fast weights can have enough storage capacity while still lacking enough consolidation compute.
  • Introduces a sleep phase that loops the model blocks during memory consolidation rather than during answer prediction.
  • Keeps prediction-phase latency fixed: after sleep, answers are produced with a standard single forward pass.
  • Tests the mechanism on controlled synthetic tasks where reasoning depth can be increased while memory load is held fixed.
  • Extends the pattern to pretrained model initializations on GSM-Infinite using Jet-Nemotron 2B and Ouro 1.4B variants.

Evidence And Results

On the Rule 110 cellular-automaton task, the paper holds the stored information fixed and increases rollout depth. Under hard KV-cache eviction, the non-looped GDN-attention hybrid stays near random guessing on the harder setting, reaching about 10% exact accuracy after nearly 5B training tokens. Adding sleep-time recurrent passes improves learning under the same token budget: two loops reach roughly 20% exact accuracy, while three and four loops exceed 30%.
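To make "rollout depth" concrete, here is a minimal Rule 110 simulator. Rule 110 itself is standard: each new cell is looked up from its (left, center, right) neighborhood in the binary expansion of 110. The periodic boundary and the function names are assumptions of this sketch; the paper's exact task encoding is not reproduced here.

```python
from typing import List


def rule110_step(cells: List[int]) -> List[int]:
    """Apply one Rule 110 update with wrap-around (periodic) boundaries."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right   # neighborhood as a 3-bit number
        out.append((110 >> index) & 1)                # bit `index` of 110 is the new cell
    return out


def rollout(cells: List[int], depth: int) -> List[int]:
    """Apply `depth` sequential updates; each step depends on the previous one."""
    for _ in range(depth):
        cells = rule110_step(cells)
    return cells


start = [0, 0, 0, 1, 0, 0, 0, 0]
print(rollout(start, 1))  # [0, 0, 1, 1, 0, 0, 0, 0]
print(rollout(start, 2))  # [0, 1, 1, 1, 0, 0, 0, 0]
```

The stored information is the same initial row at every depth, but the number of sequential steps needed to reach the answer grows with `depth`, which is the property the task uses to separate compute from storage.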

On Depo multi-hop graph retrieval, the model sees a directed cycle fragmented across four cache windows and must answer unseen multi-hop queries after the graph has left the KV cache. Increasing sleep passes improves learning especially for harder 4-, 8-, and 16-hop queries; within the reported budget, only the 4-loop model begins improving on the hardest 16-hop setting.
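The shape of this task can be illustrated with a toy directed cycle and a k-hop lookup. This is a sketch of the task structure only: the node names, the round-robin split into windows, and the helper names are assumptions, and the actual Depo data format may differ.

```python
from typing import Dict, List, Tuple


def make_cycle(order: List[str]) -> Dict[str, str]:
    """Directed cycle: each node points to the next, and the last points to the first."""
    return {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}


def fragment_edges(succ: Dict[str, str], n_windows: int) -> List[List[Tuple[str, str]]]:
    """Scatter the edges across cache windows (round-robin, as an assumption)."""
    edges = list(succ.items())
    return [edges[i::n_windows] for i in range(n_windows)]


def k_hop(succ: Dict[str, str], start: str, k: int) -> str:
    """Follow the successor pointer k times; each hop depends on the previous hop."""
    node = start
    for _ in range(k):
        node = succ[node]
    return node


succ = make_cycle(["a", "b", "c", "d", "e", "f", "g", "h"])
windows = fragment_edges(succ, n_windows=4)
print(windows)
print(k_hop(succ, "a", 4))   # e
print(k_hop(succ, "a", 16))  # a, after two full trips around the 8-cycle
```

Answering a k-hop query needs k dependent lookups over edges that were seen in different windows, so after eviction all of that chaining has to be supported by the consolidated state.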

On GSM-Infinite, extra sleep passes improve pretrained-model fine-tuning most clearly as arithmetic depth increases. For Jet-Nemotron 2B, six loops improve final accuracy from 0.742 to 0.812 on six-operation problems and from 0.351 to 0.388 on eight-operation problems. For Ouro 1.4B, four loops improve final accuracy from 0.419 to 0.615 on six-operation problems and from 0.210 to 0.272 on eight-operation problems.
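The size of these gains is easier to compare as absolute and relative deltas. The short script below only does arithmetic on the four accuracy pairs quoted above; the dictionary layout and the `gains` helper are illustrative.

```python
from typing import Dict, Tuple

# (model, number of operations) -> (baseline accuracy, accuracy with extra sleep loops)
results: Dict[Tuple[str, int], Tuple[float, float]] = {
    ("Jet-Nemotron 2B", 6): (0.742, 0.812),
    ("Jet-Nemotron 2B", 8): (0.351, 0.388),
    ("Ouro 1.4B", 6): (0.419, 0.615),
    ("Ouro 1.4B", 8): (0.210, 0.272),
}


def gains(baseline: float, looped: float) -> Tuple[float, float]:
    """Return (absolute gain in accuracy, gain relative to the baseline)."""
    absolute = looped - baseline
    return absolute, absolute / baseline


for (model, ops), (baseline, looped) in results.items():
    absolute, relative = gains(baseline, looped)
    print(f"{model}, {ops} ops: +{absolute:.3f} absolute, {relative:.1%} relative")
```

The absolute gains are 0.070 and 0.037 for Jet-Nemotron 2B and 0.196 and 0.062 for Ouro 1.4B, so the Ouro variant benefits more in both settings, with a relative improvement of roughly 47% on six-operation problems.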

In the sliding-window variant with Ouro 1.4B and a window size of 512, increasing sleep passes also helps retrieval under distractors: the paper reports two-operation accuracy improving from 0.596 to 0.905 when moving from the no-loop baseline to longer sleep.

Why It Matters

This source adds a useful missing axis to the wiki’s memory and dynamic-compute map. Long-context systems are often discussed as a storage problem: KV cache, compressed cache, recurrent state, memory tokens, or retrieval summaries. This paper says the conversion process itself is a compute problem. A model may need extra offline passes to transform transient observations into state that can support later reasoning.

The durable reading for this wiki is that LLM Sleep is a test-time or serving-time compute allocation pattern for SSM-attention hybrids: spend extra compute at the moment when a finite context window is about to roll off, consolidate that window into SSM fast weights, and keep later wake-time prediction cheap. This makes it relevant to infinite-context or streaming-data systems where raw history cannot remain directly attendable forever.

For time-series and world-model readers, the analogy is immediate but still unproven: an always-on time-series model may need a consolidation phase that turns recent high-resolution observations, events, and context into a compact latent state before the raw window rolls off. The paper does not show numeric time-series or action-conditioned control evidence, so the transfer claim should stay adjacent.

Limitations

  • The strongest evidence is on language, synthetic cellular automata, synthetic graph retrieval, and GSM-Infinite math reasoning, not numeric time series, telemetry, robot trajectories, or action-conditioned world models.
  • Sleep preserves wake-time prediction latency but moves cost into consolidation and training. Training becomes sequential across context windows and roughly scales with the number of sleep passes.
  • The method updates SSM fast weights, not the model’s durable slow parameters during deployment. It is a fast-state consolidation mechanism rather than lifelong parameter learning.
  • The paper reports modest-scale pretrained experiments and does not settle production scheduling, cache management, or matched wall-clock serving tradeoffs.
  • No official implementation was verified during ingest, so reproducibility rests on the paper details and converted source for now.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Streaming state and long context | adjacent | Converts evicted token context into SSM fast weights before clearing the KV cache, preserving single-pass wake-time prediction. This is directly analogous to state refresh at finite-window boundaries in streaming systems. | Needs numeric time-series, event streams, graph telemetry, and always-on serving evidence. |
| Dynamic compute allocation | adjacent | Allocates extra recurrent compute to consolidation before eviction rather than every prediction token. | Needs matched latency, throughput, and memory-budget comparisons against wider, deeper, retrieval, and cache-compression baselines. |
| Latent state tracking | warning | Shows that fixed-size fast weights can fail when the bottleneck is deep computation over stored context rather than storage capacity alone. | Need probes that verify preserved regimes, rare events, dense numeric detail, and action effects. |

Open Questions

  • Can sleep-time consolidation preserve dense numeric detail and rare regimes in multivariate time series, or does it mostly help discrete reasoning tasks?
  • When should an infinite-context or streaming system spend compute on consolidation before eviction instead of using retrieval, larger windows, recurrent state, or prediction-time loops?
  • Can a sleep schedule be triggered adaptively by uncertainty, cache pressure, event density, or state-change magnitude?
  • How should sleep-time fast-weight updates interact with explicit action histories, intervention status, and delayed outcomes?
  • What is the right matched-budget baseline: longer KV cache, compressed KV cache, recurrent memory tokens, wider SSM state, unique-depth layers, or more wake-time reasoning?