Generative Recursive Reasoning

Source

Status And Credibility

The arXiv record lists the paper as cs.AI, first submitted on 2026-05-19 and revised as v2 on 2026-05-20. A related OpenReview version, Generative Recursive Reasoning Models, is marked as an ICLR 2026 Workshop on AI with Recursive Self-Improvement poster. This is therefore a current 2026 preprint plus workshop-poster source, not a main-conference archival result.

Credibility is strong enough for an important ingest because the author team spans KAIST, Mila, and NYU and includes Yoshua Bengio and Sungjin Ahn; the paper has an official project page; and it directly extends the HRM and TRM recursive-reasoning branch already tracked in this wiki. The project page currently says code is coming soon, so third-party GRAM repositories should not be treated as official code.

Core Claim

GRAM turns recursive latent reasoning from a deterministic latent-state refinement process into a stochastic latent-variable generative process. Instead of making every run follow one hidden trajectory toward one answer, GRAM samples multiple latent reasoning trajectories and marginalizes over them. This gives recursive reasoning models both a conditional reasoning interface and an unconditional generation interface for cases where the input is fixed or absent.

Method Notes

GRAM starts from the recursive-reasoning contract used by Hierarchical Reasoning Model and Tiny Recursive Model: repeatedly refine a persistent latent state with shared transition functions and deep supervision. The change is that high-level latent transitions become stochastic.

For each transition, GRAM computes a deterministic proposal and then samples learned stochastic guidance.

The prior samples trajectories conditioned on the input, while a variational posterior can condition on both input and target during training. The objective is an amortized variational-inference surrogate: the full trajectory ELBO is approximated through deep supervision and truncated gradient propagation over each supervision step.
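
The transition and training signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's parameterization: the diagonal-Gaussian guidance, the additive way guidance is combined with the proposal, the fixed noise scale, and every name (`proposal`, `gaussian_head`, `transition`, the weight matrices) are assumptions made for this note. It only shows the shape of the idea: a deterministic proposal, a prior that sees the input, a posterior that also sees the target, and a per-step KL term of the kind an ELBO-style surrogate would penalize.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent width (hypothetical)

# Random fixed weights standing in for learned networks (assumption).
W_f = rng.normal(scale=0.3, size=(D, D))          # deterministic proposal
W_prior = rng.normal(scale=0.3, size=(D, 2 * D))  # prior head: proposal + input
W_post = rng.normal(scale=0.3, size=(D, 3 * D))   # posterior head: + target


def proposal(z, x):
    """Deterministic refinement of the latent state given the input."""
    return np.tanh(W_f @ z + x)


def gaussian_head(W, features):
    """Mean and log-std of a diagonal Gaussian; the scale is fixed for simplicity."""
    mean = W @ features
    log_std = np.full_like(mean, -1.0)
    return mean, log_std


def kl_diag_gaussians(mu_q, log_std_q, mu_p, log_std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    var_q = np.exp(2.0 * log_std_q)
    var_p = np.exp(2.0 * log_std_p)
    per_dim = log_std_p - log_std_q + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return float(np.sum(per_dim))


def transition(z, x, y=None):
    """One stochastic step: proposal plus sampled guidance.

    With y=None the guidance comes from the prior (inference).
    With a target y it comes from the posterior and the KL to the prior is returned.
    """
    h = proposal(z, x)
    mu_p, ls_p = gaussian_head(W_prior, np.concatenate([h, x]))
    if y is None:
        mu, ls, kl = mu_p, ls_p, 0.0
    else:
        mu, ls = gaussian_head(W_post, np.concatenate([h, x, y]))
        kl = kl_diag_gaussians(mu, ls, mu_p, ls_p)
    guidance = mu + np.exp(ls) * rng.standard_normal(D)  # reparameterized sample
    return h + guidance, kl
```

Calling `transition(z, x)` repeatedly from the same state yields different next states, which is what lets one input fan out into several latent trajectories; calling `transition(z, x, y)` during training additionally returns the KL term for that step.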

At inference, GRAM exposes two budget knobs:

  • Depth: run more recursive transitions or supervision steps.
  • Width: sample multiple stochastic latent trajectories in parallel, then choose by majority vote or a Latent Process Reward Model (LPRM) that predicts trajectory quality from latent state.
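
The two knobs can be illustrated with a small selection harness. This is a sketch under stated assumptions, not GRAM's implementation: `toy_rollout` is a hypothetical stand-in for one stochastic latent trajectory, the latent is reduced to a single number, and `score_latent` plays the role of an LPRM-style scorer. Only the control flow (sample `width` candidates at a given `depth`, then pick by vote or by score) mirrors the description above.

```python
import random
from collections import Counter


def toy_rollout(x, depth, rng):
    """Hypothetical stand-in for one stochastic latent trajectory.

    Returns (answer, final_latent). More depth gives more draws, so the
    toy latent tends to be higher and the toy answer more often correct.
    """
    latent = max(rng.random() for _ in range(depth))
    answer = x if latent > 0.3 else x[::-1]
    return answer, latent


def sample_width(rollout, x, depth, width, seed=0):
    """Run `width` independent rollouts, each `depth` steps deep."""
    rng = random.Random(seed)
    return [rollout(x, depth, rng) for _ in range(width)]


def select_majority(candidates):
    """Pick the most frequent answer among the sampled candidates."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def select_by_score(candidates, score_latent):
    """Pick the answer whose final latent gets the highest score."""
    best_answer, _ = max(candidates, key=lambda c: score_latent(c[1]))
    return best_answer


if __name__ == "__main__":
    candidates = sample_width(toy_rollout, "abc", depth=4, width=20)
    print(select_majority(candidates))
    print(select_by_score(candidates, score_latent=lambda h: h))
```

Majority vote needs only the decoded answers, while score-based selection needs access to the latent state and a trained scorer; the two can disagree when a rare trajectory is rated highly.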

Evidence And Results

The paper evaluates controlled structured-reasoning and generation tasks, not broad language-model reasoning or numeric time-series tasks.

  • On Sudoku-Extreme and ARC-AGI, GRAM is reported to outperform deterministic recursive baselines such as Looped TF, HRM, and TRM under matched task-specific settings.
  • The width-scaling result is the key new evidence: on Sudoku-Extreme, GRAM with 20 samples at 16 iterations is reported at 97.0%, above TRM at 320 iterations at 90.5%, while using comparable compute and lower sequential latency.
  • On multi-solution constraint-satisfaction tasks, deterministic recursive models collapse to a small subset of solutions. GRAM reports 99.7% accuracy and 90.3% coverage on 8x8 N-Queens with 20 samples, plus substantially lower graph-coloring conflicts than autoregressive sampling.
  • For unconditional generation, GRAM reports 99.05% valid Sudoku-board generation with 10.9M parameters and 16 supervision steps, and binarized-MNIST FID/IS comparable to a D3PM baseline as recursion depth increases.
  • Ablations support the claim that learned stochastic guidance matters: removing stochastic guidance, using stochasticity without guidance, or adding naive randomness to TRM does not reproduce the full GRAM gains.
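
The "comparable compute and lower sequential latency" reading of the width-scaling bullet follows from simple counting. The sketch below assumes that one recursive iteration costs roughly the same in both models and that the sampled trajectories run fully in parallel; both are simplifications made for this note, and the `budget` helper is hypothetical.

```python
def budget(width, depth):
    """Count iterations for `width` parallel trajectories of `depth` steps each."""
    return {
        "total_iterations": width * depth,  # proxy for total compute
        "sequential_iterations": depth,     # proxy for wall-clock latency
    }


gram = budget(width=20, depth=16)   # GRAM: 20 samples at 16 iterations
trm = budget(width=1, depth=320)    # TRM: one trajectory at 320 iterations

print(gram["total_iterations"], trm["total_iterations"])            # 320 320
print(trm["sequential_iterations"] // gram["sequential_iterations"])  # 20
```

Under these assumptions both settings spend 320 iterations in total, but the GRAM setting has a sequential chain 20 times shorter.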

Limitations

  • The evidence is concentrated in small, controlled, discrete puzzle and constraint-satisfaction domains.
  • The paper explicitly avoids treating frontier LLMs as controlled baselines because training data, inference budgets, prompting, tool use, and external scaffolding are not comparable.
  • There is no numeric time-series, event-stream, telemetry, control-input, intervention, or action-conditioned world-model experiment.
  • Deep supervision and recurrent latent-variable training add sequential training cost; the paper names training efficiency as a barrier to larger foundation-model scaling.
  • Official code was not verified at ingest time; the official project page says code is coming soon.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute allocation | adjacent | GRAM separates test-time scaling into recursive depth and parallel trajectory width, and uses an LPRM to select candidates. | Needs numeric time-series, streaming-state, serving-latency, and matched unique-depth or wider-baseline tests. |
| Multi-modal future distributions | adjacent | Stochastic latent trajectories preserve multiple valid puzzle solutions instead of collapsing to one deterministic attractor. | Needs calibrated multiple plausible future trajectories for numeric observations, events, and action-conditioned rollouts. |
| Latent-state refinement | adjacent | The method iteratively refines a persistent latent puzzle state and adds probabilistic transitions. | Needs evidence that the refined state preserves regimes, dense numeric detail, exogenous context, and action history in time-series systems. |
| Control and counterfactuals | insufficient evidence | The benchmark tasks involve constraints and multiple solutions but not logged actions, control inputs, or interventions. | Needs candidate-action or intervention-conditioned next-state prediction and downstream decision utility. |
| Benchmark level | warning | Strong puzzle-task gains can test recurrence and constraint propagation, but they do not prove general reasoning, forecasting, or world-model ability. | Needs benchmark hygiene across passive forecasting, state prediction, generation, and action-conditioned control tasks. |

Open Questions

  • Can width-based latent-trajectory sampling become a practical scenario-generation interface for multivariate time series, or is it mainly useful for small discrete search spaces?
  • What selection signal replaces LPRM when candidate futures involve safety, cost, calibration, and downstream action value rather than exact puzzle correctness?
  • Can stochastic recursive latent transitions be trained without expensive sequential deep supervision, perhaps through denoising, blockwise objectives, or parallel recurrent solvers?
  • How should GRAM-style sampling compare against energy-based candidate optimization, diffusion/flow future generation, and ordinary ensemble forecasting under matched wall-clock latency?