Pretraining Recurrent Networks without Recurrence

Source

Raw Markdown: paper_pretraining-recurrent-networks-without-recurrence-2026.md
PDF: paper_pretraining-recurrent-networks-without-recurrence-2026.pdf
Preprint: arXiv:2606.06479
Project page: akarshkumar.com/smt
Official code: akarshkumar0101/smt. Local metadata: papers/pretraining-recurrent-networks-without-recurrence-2026/official_artifacts_metadata.json.
Gonzo ML discussion: t.me/gonzo_ML/5572. Local public-post snapshot: papers/pretraining-recurrent-networks-without-recurrence-2026/telegram-post-gonzo-ml-5572.md.
Review link from the post: ArXivIQ

Status And Credibility

This is an arXiv v1 preprint submitted on 2026-06-04 by Akarsh Kumar and Phillip Isola from MIT. The arXiv metadata lists cs.LG and cs.AI, a CC BY 4.0 license, and the comment "30 pages, 23 figures".

Treat it as current and interesting evidence that has not yet been peer reviewed. Positive credibility signals are the public MIT project page, the Apache-2.0 minimal PyTorch implementation, explicit comparisons to BPTT on synthetic, character-level language, and pixel-sequence tasks, and a frank discussion section that names the Transformer-teacher expressivity limit and the need for post-training. Caveats: no venue is visible, no large-scale language-model checkpoint is released, and the main experiments are small-to-medium research runs rather than ParaRNN-style billion-parameter LM training.

Core Claim

The paper proposes Supervised Memory Training (SMT) and DAgger Memory Training (DMT) as a way to pretrain nonlinear RNNs without standard BPTT.

The training decomposition is:

```mermaid
flowchart LR
  Past[Past context with timestamps]
  Enc[Transformer encoder]
  Mem[Predictive memory m_t]
  Dec[Transformer decoder]
  Future[Future prediction loss]
  Upd[RNN updater f_theta]
  Next[Next memory m_{t+1}]
  DMT[DMT: on-policy memory imitation]

  Past --> Enc --> Mem
  Mem --> Dec --> Future
  Mem --> Upd --> Next
  Enc --> Next
  Next --> DMT --> Upd
```

SMT first trains a Transformer encoder-decoder to produce a compressed predictive state of the past. The RNN is then trained as a supervised one-step memory-transition model:

$$(m_t, x_{t+1}) \mapsto m_{t+1}.$$

This avoids backpropagating through an unrolled recurrent trajectory during pretraining. DMT then fixes train-test mismatch by rolling out the RNN’s own memory states and training it to imitate the encoder trajectory from those on-policy states.
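The one-step training setup can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: `encoder_memory` is a hypothetical stand-in for the frozen Transformer encoder (here an exponentially weighted sum of the prefix), the memory is a scalar, the updater is linear, and a least-squares fit replaces gradient training. The constant `DECAY` and all names are invented for the example.

```python
import random

DECAY = 0.9  # toy constant chosen for illustration, not from the paper


def encoder_memory(prefix):
    """Hypothetical stand-in for the frozen encoder: maps a whole prefix to a
    scalar memory without any recurrence."""
    n = len(prefix)
    return sum(DECAY ** (n - 1 - k) * x for k, x in enumerate(prefix))


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Teacher memories m_1..m_T. Each call sees only its own prefix, so the calls
# are independent of one another and could run in parallel.
mems = [encoder_memory(xs[: t + 1]) for t in range(len(xs))]

# One-step rows (m_t, x_{t+1}, m_{t+1}); m_0 = 0 is the empty-prefix memory.
rows = list(zip([0.0] + mems[:-1], xs, mems))

# Fit m_{t+1} ~ w_m * m_t + w_x * x_{t+1} by least squares (2x2 normal
# equations solved with Cramer's rule). No sequence is unrolled here.
smm = sum(m * m for m, x, y in rows)
smx = sum(m * x for m, x, y in rows)
sxx = sum(x * x for m, x, y in rows)
smy = sum(m * y for m, x, y in rows)
sxy = sum(x * y for m, x, y in rows)
det = smm * sxx - smx * smx
w_m = (smy * sxx - sxy * smx) / det
w_x = (sxy * smm - smy * smx) / det

# Evaluation: roll the learned updater out on its own memories and record the
# largest gap to the teacher trajectory.
m_hat, worst_gap = 0.0, 0.0
for x, target in zip(xs, mems):
    m_hat = w_m * m_hat + w_x * x
    worst_gap = max(worst_gap, abs(m_hat - target))

print(w_m, w_x, worst_gap)
```

In this idealized case the teacher memory is exactly Markov and the updater class contains the true transition, so the one-step fit also rolls out cleanly. The train-test mismatch that DMT targets arises when the one-step fit is imperfect, because the updater then visits memories it was never trained on.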

Evidence And Results

Claim	Paper evidence	Wiki interpretation
SMT changes credit assignment	The paper defines BPTT credit paths as $O (T)$ and SMT credit paths as $O (1)$ between tokens.	This is the central positive claim: SMT bypasses recurrent credit propagation instead of merely accelerating it.
One-step SMT is not enough by itself	The method section says SMT creates train-test mismatch: at evaluation the RNN consumes its own memories, so one-step errors compound and the memory drifts.	The comparison discussion is right to foreground this. DMT is not an optional detail; it is the mechanism that makes rollout usable.
DMT mitigates drift	The paper reports that DMT reduces rollout drift and significantly improves RNN performance across SMT hyperparameters.	DMT is a lightweight post-training phase, but it is not fully time-parallel and therefore reintroduces some sequential rollout cost.
Synthetic long-range tasks	SMT→DMT beats BPTT across retrieval, string copy, stack state tracking, key-value recall, and modular arithmetic settings.	Good evidence for the credit-assignment framing, but still controlled tasks.
Pixel-sequence modeling	SMT→DMT captures MNIST and Sketchy raster-scan structure where BPTT RNNs show recency bias.	Useful evidence for long-horizon finite-state memory, not yet general image modeling.
Length generalization	On synthetic stack state tracking, SMT→DMT RNN generalizes better than its Transformer teacher at lengths longer than training.	This is the paper’s strongest answer to the “teacher upper-bound” objection, but only for one synthetic state-tracking setup.
Scaling	TinyStories experiments show smooth improvements with larger context length, memory state size, and model size.	Encouraging scaling shape; not comparable to Apple ParaRNN’s 7B language-model scale.
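The $O (T)$ versus $O (1)$ row in the table above can be made concrete with a short derivation. This is a sketch under assumptions: $h_t = f_\theta(h_{t-1}, x_t)$ is a generic RNN with a loss $\mathcal{L}_T$ at step $T$, and the SMT one-step loss $\ell_t$ is taken to be squared error against frozen teacher memories, which may differ from the paper's exact loss.

$$
\frac{\partial \mathcal{L}_T}{\partial h_t}
= \frac{\partial \mathcal{L}_T}{\partial h_T}
\cdot \frac{\partial h_T}{\partial h_{T-1}}
\cdot \frac{\partial h_{T-1}}{\partial h_{T-2}}
\cdots
\frac{\partial h_{t+1}}{\partial h_t}
\qquad \text{(BPTT: } T - t \text{ Jacobian factors)}
$$

$$
\ell_t(\theta) = \lVert f_\theta(m_t, x_{t+1}) - m_{t+1} \rVert^2,
\qquad
\frac{\partial \ell_t}{\partial \theta}
= 2 \left( f_\theta(m_t, x_{t+1}) - m_{t+1} \right)^{\top}
\frac{\partial f_\theta}{\partial \theta}(m_t, x_{t+1})
\qquad \text{(SMT: one factor)}
$$

Because $m_t$ and $m_{t+1}$ are fixed labels in the second expression, the RNN update receives no gradient that passes through earlier recurrent steps. The chained Jacobian product, which is where vanishing or exploding gradients come from in BPTT, does not appear.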

Comparison With Apple ParaRNN

Alex’s follow-up discussion frames the right comparison as credit-assignment replacement vs trajectory-solver parallelization.

Axis	SMT / DMT	Apple ParaRNN	Takeaway
What is parallelized	SMT parallelizes pretraining by replacing recurrent credit propagation with supervised predictive-memory labels.	ParaRNN parallelizes nonlinear RNN application by solving the hidden-state trajectory as a nonlinear system with Newton iterations and parallel reductions.	SMT changes the learning problem; ParaRNN changes the solver for the original recurrent trajectory.
Relationship to BPTT	SMT pretraining avoids BPTT; DMT may unroll the RNN but uses imitation labels rather than ordinary long-horizon task credit.	ParaRNN remains closer to BPTT-style end-to-end training, but makes the forward/backward RNN application parallelizable in practice.	If the primary problem is vanishing/exploding credit assignment, SMT is conceptually cleaner. If the primary problem is wall-clock sequential unroll, ParaRNN is cleaner.
Expressivity ceiling	The teacher is a time-parallel Transformer encoder-decoder, so its bounded sequential depth may fail on tasks where nonlinear recurrence is required. The paper explicitly says BPTT finetuning may be needed to go beyond the teacher.	No Transformer teacher upper-bound: the nonlinear RNN cell itself is trained.	This is the strongest objection to SMT as a universal route. SMT is a pretraining method, not a proof that teacher-supervised RNNs can learn every recurrent behavior.
Rollout mismatch	One-step SMT labels can fail over long rollouts because memory errors compound; DMT addresses this by on-policy imitation.	The hidden trajectory is solved for the actual recurrent equations, so there is no separate teacher-memory rollout distribution.	The “one-step does not generalize” objection is real for SMT alone; the paper’s answer is DMT plus possible BPTT post-training.
Practical scale	Experiments are research-scale: synthetic tasks, TinyStories character LM, MNIST/Sketchy pixel sequences, one-H200 runs.	The ParaRNN source reports adapted GRU/LSTM language models up to 7B parameters and smooth 7B training curves.	ParaRNN has stronger large-scale LM feasibility evidence today.
Solver assumptions	No Newton solver; depends instead on the quality and expressivity of the learned predictive states.	Depends on fast Newton convergence and structured Jacobians. Predictability Enables Parallelization sharpens this: predictable systems can be polylog-parallelizable, while chaotic systems become ill-conditioned.	ParaRNN-style methods look stronger when the target dynamics are predictable and the cell is solver-friendly; SMT looks attractive when BPTT credit paths are the bottleneck.
TSFM/world-model relevance	Predictive memory states are directly aligned with latent-state time-series modeling: remember only what is needed for future prediction.	Nonlinear recurrent state is an attractive substrate for physical, operational, and action-conditioned dynamics if solver constraints hold.	For this wiki, the two methods are complementary candidates, not substitutes.
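The "trajectory as a nonlinear system" idea in the ParaRNN column can be illustrated on a scalar toy cell. This is a hedged sketch, not Apple's implementation: the cell `tanh(W * h + x)`, the weight `W`, and the function names are assumptions made for the example. Each Newton step evaluates the cell and its derivative at all time steps independently, then solves a bidiagonal linear system, which reduces to a linear recurrence. That recurrence is the part a parallel reduction (scan) would handle; it is written as a plain loop here for readability.

```python
import math
import random

W = 0.9  # assumed toy recurrent weight


def cell(h_prev, x):
    """Toy nonlinear RNN cell."""
    return math.tanh(W * h_prev + x)


def sequential_rollout(xs):
    """Reference: ordinary step-by-step application with h_0 = 0."""
    h, out = 0.0, []
    for x in xs:
        h = cell(h, x)
        out.append(h)
    return out


def newton_rollout(xs, tol=1e-12):
    """Solve h_t - cell(h_{t-1}, x_t) = 0 for all t at once with Newton steps."""
    steps = len(xs)
    hs = [0.0] * steps  # initial guess for h_1..h_T
    iteration = 0
    for iteration in range(1, steps + 1):
        prev = [0.0] + hs[:-1]                       # h_{t-1} under the current guess
        f = [cell(p, x) for p, x in zip(prev, xs)]   # independent across t
        a = [W * (1.0 - v * v) for v in f]           # d cell / d h_{t-1}
        r = [h - v for h, v in zip(hs, f)]           # residuals of the system
        # Newton system is lower bidiagonal: delta_t - a_t * delta_{t-1} = -r_t.
        delta, deltas = 0.0, []
        for a_t, r_t in zip(a, r):
            delta = a_t * delta - r_t
            deltas.append(delta)
        hs = [h + d for h, d in zip(hs, deltas)]
        if max(abs(d) for d in deltas) < tol:
            break
    return hs, iteration


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]
hs, iterations = newton_rollout(xs)
reference = sequential_rollout(xs)
max_gap = max(abs(a - b) for a, b in zip(hs, reference))
print(iterations, max_gap)
```

The solved trajectory satisfies the original recurrent equations, which is why this family of methods has no separate teacher-memory distribution to drift away from. The cost moves into the number of Newton iterations and the conditioning of the linear solve.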

Comparison Verdict

ParaRNN is the stronger current baseline for large-scale nonlinear RNN language modeling: it keeps the recurrent objective, avoids a Transformer teacher ceiling, and has reported 7B-scale results. SMT/DMT is more interesting as a credit-assignment and predictive-state pretraining idea: it says a compressed Markovian memory should first be learned from future-prediction labels by a Transformer encoder-decoder, and the RNN should then learn to follow that memory as a one-step updater.

The skeptical synthesis is:

SMT alone is too weak because one-step imitation can drift over long rollouts.
DMT is the paper’s answer to drift, but DMT is no longer the clean fully time-parallel story.
A Transformer teacher can be an expressivity ceiling for tasks where nonlinear recurrence is exactly the needed computation class.
The paper acknowledges this and suggests lightweight post-training or BPTT finetuning.
ParaRNN/Newton-style methods are likely the better engineering route when predictable dynamics and structured cells make the solver fast.
SMT/DMT remains valuable if the goal is to pretrain a useful predictive memory state before applying a stronger recurrent fine-tuning or ParaRNN-style solver.

A useful hybrid question is therefore: can SMT initialize the memory geometry of a nonlinear recurrent model, then ParaRNN/DEER-style parallel trajectory solving perform the final end-to-end training without starting from scratch?

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state, long context, and constant updates	adjacent	Trains fixed-memory RNNs whose state is a predictive abstraction of the past and whose inference cost is constant per token.	Evidence is language/pixel sequence modeling, not numeric telemetry or always-on TSFM serving.
Latent-state predictive learning	adjacent	The encoder objective explicitly learns predictive states sufficient for future prediction.	Needs tests on multivariate time series with rare regimes, exogenous variables, event streams, actions, and interventions.
Control and counterfactuals	insufficient evidence	The project page says SMT applies to seq2seq and behavioral cloning but not directly to RL reward optimization.	Needs action-conditioned rollouts, reward/control objectives, and post-training beyond predictive observation modeling.
Benchmark validity	warning	The paper exposes the teacher-expressivity ceiling and drift after one-step memory training.	Claims about replacing BPTT should separate SMT-only, SMT→DMT, and BPTT/ParaRNN-style post-training.

Limitations And Gotchas

This is an arXiv preprint with no peer-reviewed venue visible at ingest time.
SMT’s teacher is time-parallel and therefore may be less expressive than a nonlinear RNN on tasks needing deep sequential computation.
DMT is needed because one-step memory prediction drifts under rollout.
DMT is not fully time-parallel, so the final recipe is less clean than “no recurrence during training.”
The paper’s own discussion says BPTT finetuning may be required to go beyond the teacher.
The GRU SMT runs fail in the reported sequential compute/data experiments because the GRU architecture induces memory-space collapse during SMT training.
Experiments are far below the 7B LM scale reported by ParaRNN.
The converted raw Markdown originally exposed deleted author-note blocks from the LaTeX source; those non-rendered blocks were removed from the committed Markdown snapshot.
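The DMT items in the list above can be pictured as a DAgger-style loop. The sketch below is a toy under loud assumptions, not the paper's algorithm: `teacher_memories` is a hypothetical stand-in for the frozen encoder, the memory is a scalar, the updater is linear (deliberately misspecified against a `tanh` teacher so that the one-step fit is imperfect), and least-squares refitting on an aggregated dataset replaces gradient training. It prints the rollout error after each round and makes no claim about what those numbers will be.

```python
import math
import random


def teacher_memories(xs, decay=0.8):
    """Hypothetical frozen encoder: m_t computed directly from the prefix x_1..x_t."""
    mems = []
    for t in range(len(xs)):
        s = sum(decay ** (t - k) * xs[k] for k in range(t + 1))
        mems.append(math.tanh(s))
    return mems


def fit_linear(rows):
    """Least-squares fit of target ~ w_m * m + w_x * x over (m, x, target) rows."""
    smm = sum(m * m for m, x, y in rows)
    smx = sum(m * x for m, x, y in rows)
    sxx = sum(x * x for m, x, y in rows)
    smy = sum(m * y for m, x, y in rows)
    sxy = sum(x * y for m, x, y in rows)
    det = smm * sxx - smx * smx
    return (smy * sxx - sxy * smx) / det, (sxy * smm - smy * smx) / det


def rollout(w, xs):
    """Sequential rollout on the updater's own memories (not time-parallel)."""
    w_m, w_x = w
    m, states = 0.0, []
    for x in xs:
        m = w_m * m + w_x * x
        states.append(m)
    return states


def rollout_mse(w, xs, mems):
    own = rollout(w, xs)
    return sum((a - b) ** 2 for a, b in zip(own, mems)) / len(xs)


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
mems = teacher_memories(xs)

# SMT phase: inputs are teacher memories, targets are the next teacher memory.
data = list(zip([0.0] + mems[:-1], xs, mems))
w = fit_linear(data)
errors = [rollout_mse(w, xs, mems)]

# DMT-style rounds: inputs are the updater's own visited memories, targets are
# still the teacher's next memory; rows are aggregated across rounds.
for _ in range(3):
    own = rollout(w, xs)
    data += list(zip([0.0] + own[:-1], xs, mems))
    w = fit_linear(data)
    errors.append(rollout_mse(w, xs, mems))

print(errors)
```

The `rollout` loop is the sequential part: each on-policy memory depends on the previous one, which is why this phase is not fully time-parallel even though its labels still come from the teacher trajectory.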

Open Questions

Can SMT initialize nonlinear recurrent memory states and then use ParaRNN or DEER-style parallel trajectory solving for final end-to-end training?
Which tasks actually require teacher-exceeding nonlinear recurrence rather than Transformer-learned predictive states plus DMT?
Can DMT be parallelized with a predictable-dynamics solver without losing its on-policy correction benefit?
What metrics predict long-rollout memory drift better than one-step $R^{2}$ or MSE?
Does SMT’s predictive-state objective preserve rare regimes, event timing, exogenous variables, and action history in multivariate time series?
Can SMT support control settings where the objective is reward or intervention utility, not only future observation prediction or behavioral cloning?
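The gap behind the drift-metric question above can be seen in a small experiment. This is an illustrative sketch with made-up numbers: the teacher memory follows a linear transition with weight `TRUE_DECAY`, the learned updater uses a slightly wrong weight `LEARNED_DECAY`, and both constants are assumptions. For brevity the teacher trajectory is generated by a loop here, which a real encoder would not need. The script compares one-step MSE, measured from teacher memories, with free-rollout MSE, measured from the updater's own memories.

```python
import random

TRUE_DECAY = 0.9      # transition implied by the toy teacher (assumption)
LEARNED_DECAY = 0.95  # slightly mis-estimated learned weight (assumption)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Toy teacher trajectory m_1..m_T with m_0 = 0.
teacher, m = [], 0.0
for x in xs:
    m = TRUE_DECAY * m + x
    teacher.append(m)


def update(memory, x):
    """Learned one-step updater with a small weight error."""
    return LEARNED_DECAY * memory + x


# One-step error: every prediction starts from the teacher's previous memory.
prev_teacher = [0.0] + teacher[:-1]
one_step_sq = [
    (update(p, x) - t) ** 2 for p, x, t in zip(prev_teacher, xs, teacher)
]

# Rollout error: every prediction starts from the updater's own previous memory.
rollout_sq, m_hat = [], 0.0
for x, t in zip(xs, teacher):
    m_hat = update(m_hat, x)
    rollout_sq.append((m_hat - t) ** 2)

one_step_mse = sum(one_step_sq) / len(xs)
rollout_mse = sum(rollout_sq) / len(xs)
print(one_step_mse, rollout_mse)
```

In this toy the one-step error at each step is 0.05 times the previous teacher memory, while the rollout error carries 0.95 of its previous value forward and adds that same term on top, so the small per-step error accumulates. A metric that only inspects `one_step_mse` would not reveal how large `rollout_mse` becomes.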
