Pretraining Recurrent Networks without Recurrence
Source
- Raw Markdown: paper_pretraining-recurrent-networks-without-recurrence-2026.md
- PDF: paper_pretraining-recurrent-networks-without-recurrence-2026.pdf
- Preprint: arXiv:2606.06479
- Project page: akarshkumar.com/smt
- Official code: akarshkumar0101/smt. Local metadata: papers/pretraining-recurrent-networks-without-recurrence-2026/official_artifacts_metadata.json.
- Gonzo ML discussion: t.me/gonzo_ML/5572. Local public-post snapshot: papers/pretraining-recurrent-networks-without-recurrence-2026/telegram-post-gonzo-ml-5572.md.
- Review link from the post: ArXivIQ
Status And Credibility
This is an arXiv v1 preprint submitted on 2026-06-04 by Akarsh Kumar and Phillip Isola of MIT. The arXiv metadata lists the categories cs.LG and cs.AI, a CC BY 4.0 license, and the comment "30 pages, 23 figures".
Treat it as current and interesting but not yet peer-reviewed evidence. Positive credibility signals are the public MIT project page, the Apache-2.0 minimal PyTorch implementation, explicit comparison to BPTT on synthetic, character-level language, and pixel-sequence tasks, and a frank discussion section that names the Transformer-teacher expressivity limit and the need for post-training. Caveats: no venue is visible, no large-scale language-model checkpoint is released, and the main experiments are small-to-medium research runs rather than ParaRNN-style billion-parameter LM training.
Core Claim
The paper proposes Supervised Memory Training (SMT) and DAgger Memory Training (DMT) as a way to pretrain nonlinear RNNs without standard BPTT.
The training decomposition is:
flowchart LR
    Past[Past context with timestamps]
    Enc[Transformer encoder]
    Mem[Predictive memory m_t]
    Dec[Transformer decoder]
    Future[Future prediction loss]
    Upd[RNN updater f_theta]
    Next[Next memory m_{t+1}]
    DMT[DMT: on-policy memory imitation]
    Past --> Enc --> Mem
    Mem --> Dec --> Future
    Mem --> Upd --> Next
    Enc --> Next
    Next --> DMT --> Upd
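SMT first trains a Transformer encoder-decoder to produce a compressed predictive state of the past, the memory m_t. The RNN updater f_theta is then trained as a supervised one-step memory-transition model: from the current memory m_t it predicts the next memory m_{t+1}, with the encoder's memories serving as the supervised labels.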
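One way to write this one-step objective uses the flowchart's symbols $m_t$ and $f_\theta$. The squared-error loss, the encoder parameters $\phi$, and the updater reading the incoming token $x_{t+1}$ are assumptions made here for illustration, so the paper's exact form may differ:

$$
m_t = \mathrm{Enc}_\phi(x_{\le t}), \qquad
\mathcal{L}_{\mathrm{SMT}}(\theta) = \sum_{t} \left\| f_\theta(m_t, x_{t+1}) - m_{t+1} \right\|^2
$$

Under this form every term depends only on encoder memories, so all time steps can be evaluated independently of one another.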
This avoids backpropagating through an unrolled recurrent trajectory during pretraining. DMT then addresses the resulting train-test mismatch: it rolls out the RNN on its own memory states and trains it to imitate the encoder trajectory from those on-policy states.
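The practical difference between the two phases is which memory is fed to the updater. The sketch below is a toy illustration, not the paper's code: `teacher_memories`, `updater`, the scalar running-sum memory, and the constant `bias` are all hypothetical stand-ins. SMT pairs use the teacher's own memory as input, while DMT pairs use the memory the RNN actually reached; both are labeled with the teacher's next memory.

```python
def teacher_memories(xs):
    """Hypothetical stand-in for the encoder: memory m_t is the running sum of xs[:t]."""
    mems, total = [0.0], 0.0
    for x in xs:
        total += x
        mems.append(total)
    return mems


def updater(m, x, bias=0.01):
    """Hypothetical imperfect RNN updater f_theta: correct up to a small constant bias."""
    return m + x + bias


def smt_pairs(xs):
    """SMT-style pairs: inputs are teacher memories, so each step is independent."""
    mems = teacher_memories(xs)
    return [((mems[t], xs[t]), mems[t + 1]) for t in range(len(xs))]


def dmt_pairs(xs):
    """DMT-style pairs: inputs are the RNN's own rolled-out memories, labels from the teacher."""
    mems = teacher_memories(xs)
    pairs, m_hat = [], mems[0]
    for t, x in enumerate(xs):  # sequential: each input depends on the previous output
        pairs.append(((m_hat, x), mems[t + 1]))
        m_hat = updater(m_hat, x)
    return pairs


xs = [1.0, 2.0, 3.0]
smt = smt_pairs(xs)
dmt = dmt_pairs(xs)
```

In this toy, the SMT list can be built for all `t` at once, whereas the DMT loop has to run in order because each input memory is produced by the previous updater call.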
Evidence And Results
| Claim | Paper evidence | Wiki interpretation |
|---|---|---|
| SMT changes credit assignment | The paper contrasts the credit paths between tokens under BPTT with those under SMT. | This is the central positive claim: SMT bypasses recurrent credit propagation instead of merely accelerating it. |
| One-step SMT is not enough by itself | The method section says SMT creates train-test mismatch: at evaluation the RNN consumes its own memories, so one-step errors drift. | The user’s comparison discussion is right to foreground this. DMT is not an optional detail; it is the mechanism that makes rollout usable. |
| DMT mitigates drift | The paper reports that DMT reduces rollout drift and significantly improves RNN performance across SMT hyperparameters. | DMT is a lightweight post-training phase, but it is not fully time-parallel and therefore reintroduces some sequential rollout cost. |
| Synthetic long-range tasks | SMT→DMT beats BPTT across retrieval, string copy, stack state tracking, key-value recall, and modular arithmetic settings. | Good evidence for the credit-assignment framing, but still controlled tasks. |
| Pixel-sequence modeling | SMT→DMT captures MNIST and Sketchy raster-scan structure where BPTT RNNs show recency bias. | Useful evidence for long-horizon finite-state memory, not yet general image modeling. |
| Length generalization | On synthetic stack state tracking, SMT→DMT RNN generalizes better than its Transformer teacher at lengths longer than training. | This is the paper’s strongest answer to the “teacher upper-bound” objection, but only for one synthetic state-tracking setup. |
| Scaling | TinyStories experiments show smooth improvements with larger context length, memory state size, and model size. | Encouraging scaling shape; not comparable to Apple ParaRNN’s 7B language-model scale. |
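
The drift rows in the table can be made concrete with a toy calculation. This is a sketch under stated assumptions, not a result from the paper: the teacher memory is a running count, the learned updater is off by a hypothetical constant `bias` per step, and the function name is made up for illustration. It compares the error when the updater is fed the teacher's memory with the error when it is fed its own.

```python
def one_step_and_rollout_error(steps, bias=0.01):
    """Return (worst teacher-forced one-step error, final free-running rollout error).

    Toy assumption: the teacher memory follows m_{t+1} = m_t + 1 and the
    updater adds a constant `bias` on top of the correct transition.
    """
    teacher, student = 0.0, 0.0
    worst_one_step = 0.0
    for _ in range(steps):
        target = teacher + 1.0
        one_step_pred = teacher + 1.0 + bias  # updater applied to the teacher's memory
        student = student + 1.0 + bias        # updater applied to its own memory
        worst_one_step = max(worst_one_step, abs(one_step_pred - target))
        teacher = target
    return worst_one_step, abs(student - teacher)


one_step_err, rollout_err = one_step_and_rollout_error(steps=100)
```

With `steps=100` the one-step error stays near `bias` while the rollout error grows to roughly `steps * bias`, which is the kind of gap that a one-step metric alone does not reveal.
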
Comparison With Apple ParaRNN
Alex’s follow-up discussion frames the right comparison as credit-assignment replacement vs trajectory-solver parallelization.
| Axis | SMT / DMT | Apple ParaRNN | Takeaway |
|---|---|---|---|
| What is parallelized | SMT parallelizes pretraining by replacing recurrent credit propagation with supervised predictive-memory labels. | ParaRNN parallelizes nonlinear RNN application by solving the hidden-state trajectory as a nonlinear system with Newton iterations and parallel reductions. | SMT changes the learning problem; ParaRNN changes the solver for the original recurrent trajectory. |
| Relationship to BPTT | SMT pretraining avoids BPTT; DMT may unroll the RNN but uses imitation labels rather than ordinary long-horizon task credit. | ParaRNN remains closer to BPTT-style end-to-end training, but makes the forward/backward RNN application parallelizable in practice. | If the primary problem is vanishing/exploding credit assignment, SMT is conceptually cleaner. If the primary problem is wall-clock sequential unroll, ParaRNN is cleaner. |
| Expressivity ceiling | The teacher is a time-parallel Transformer encoder-decoder, so its bounded sequential depth may fail on tasks where nonlinear recurrence is required. The paper explicitly says BPTT finetuning may be needed to go beyond the teacher. | No Transformer teacher upper bound: the nonlinear RNN cell itself is trained. | This is the strongest objection to SMT as a universal route. SMT is a pretraining method, not a proof that teacher-supervised RNNs can learn every recurrent behavior. |
| Rollout mismatch | One-step SMT labels can fail over long rollouts because memory errors compound; DMT addresses this by on-policy imitation. | The hidden trajectory is solved for the actual recurrent equations, so there is no separate teacher-memory rollout distribution. | The “one-step does not generalize” objection is real for SMT alone; the paper’s answer is DMT plus possible BPTT post-training. |
| Practical scale | Experiments are research-scale: synthetic tasks, TinyStories character LM, MNIST/Sketchy pixel sequences, one-H200 runs. | The ParaRNN source reports adapted GRU/LSTM language models up to 7B parameters and smooth 7B training curves. | ParaRNN has stronger large-scale LM feasibility evidence today. |
| Solver assumptions | No Newton solver; depends instead on the quality and expressivity of learned predictive states. | Depends on fast Newton convergence and structured Jacobians. Predictability Enables Parallelization sharpens this point: predictable systems can be polylog-parallelizable, while chaotic systems become ill-conditioned. | ParaRNN-style methods look stronger when the target dynamics are predictable and the cell is solver-friendly; SMT looks attractive when BPTT credit paths are the bottleneck. |
| TSFM/world-model relevance | Predictive memory states are directly aligned with latent-state time-series modeling: remember only what is needed for future prediction. | Nonlinear recurrent state is an attractive substrate for physical, operational, and action-conditioned dynamics if solver constraints hold. | For this wiki, the two methods are complementary candidates, not substitutes. |
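
The "Newton iterations" entry in the table can be illustrated on a scalar toy recurrence. This is a minimal sketch of the general idea of solving the whole hidden trajectory as one nonlinear system, not ParaRNN's implementation: the cell `h_t = tanh(a * h_{t-1} + x_t)`, the function names, and the tolerance are assumptions. Each Newton step reduces to a linear recurrence, which is solved with a plain loop here; that linear recurrence is the part a parallel scan or reduction would handle in a real system.

```python
import math


def sequential_rollout(xs, a=0.9, h0=0.0):
    """Reference: apply the toy cell one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = math.tanh(a * h + x)
        hs.append(h)
    return hs


def newton_rollout(xs, a=0.9, h0=0.0, iters=50, tol=1e-12):
    """Solve h_t - tanh(a * h_{t-1} + x_t) = 0 for all t jointly with Newton's method."""
    hs = [0.0] * len(xs)  # initial guess for the whole trajectory
    for it in range(iters):
        new, prev_new, prev_old = [], h0, h0
        for t, x in enumerate(xs):
            f = math.tanh(a * prev_old + x)  # cell evaluated at the old guess
            jac = a * (1.0 - f * f)          # d f / d h_{t-1} at the old guess
            h_new = f + jac * (prev_new - prev_old)  # linearized (Newton) update
            new.append(h_new)
            prev_new, prev_old = h_new, hs[t]
        delta = max(abs(n - o) for n, o in zip(new, hs))
        hs = new
        if delta < tol:
            return hs, it + 1
    return hs, iters


xs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2]
seq = sequential_rollout(xs)
par, n_iters = newton_rollout(xs)
```

In this toy the Newton solution matches the sequential rollout, and the iteration count is at most the sequence length plus one because each iteration makes at least one more leading state exact. How quickly the iterations converge in practice depends on the dynamics, which is the solver assumption the table refers to.
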
Comparison Verdict
ParaRNN is the stronger current baseline for large-scale nonlinear RNN language modeling: it keeps the recurrent objective, avoids a Transformer teacher ceiling, and has reported 7B-scale results. SMT/DMT is more interesting as a credit-assignment and predictive-state pretraining idea: it says the RNN should first learn a compressed Markovian memory from future-prediction labels, then learn to follow that memory with a one-step updater.
The skeptical synthesis is:
- SMT alone is too weak because one-step imitation can drift over long rollouts.
- DMT is the paper’s answer to drift, but DMT is no longer the clean fully time-parallel story.
- A Transformer teacher can be an expressivity ceiling for tasks where nonlinear recurrence is exactly the needed computation class.
- The paper acknowledges this and suggests lightweight post-training or BPTT finetuning.
- ParaRNN/Newton-style methods are likely the better engineering route when predictable dynamics and structured cells make the solver fast.
- SMT/DMT remains valuable if the goal is to pretrain a useful predictive memory state before applying a stronger recurrent fine-tuning or ParaRNN-style solver.
A useful hybrid question is therefore: can SMT initialize the memory geometry of a nonlinear recurrent model, then ParaRNN/DEER-style parallel trajectory solving perform the final end-to-end training without starting from scratch?
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state, long context, and constant updates | adjacent | Trains fixed-memory RNNs whose state is a predictive abstraction of the past and whose inference cost is constant per token. | Evidence is language/pixel sequence modeling, not numeric telemetry or always-on TSFM serving. |
| Latent-state predictive learning | adjacent | The encoder objective explicitly learns predictive states sufficient for future prediction. | Needs tests on multivariate time series with rare regimes, exogenous variables, event streams, actions, and interventions. |
| Control and counterfactuals | insufficient evidence | The project page says SMT applies to seq2seq and behavioral cloning but not directly to RL reward optimization. | Needs action-conditioned rollouts, reward/control objectives, and post-training beyond predictive observation modeling. |
| Benchmark validity | warning | The paper exposes the teacher-expressivity ceiling and drift after one-step memory training. | Claims about replacing BPTT should separate SMT-only, SMT→DMT, and BPTT/ParaRNN-style post-training. |
Limitations And Gotchas
- This is an arXiv preprint with no peer-reviewed venue visible at ingest time.
- SMT’s teacher is time-parallel and therefore may be less expressive than a nonlinear RNN on tasks needing deep sequential computation.
- DMT is needed because one-step memory prediction drifts under rollout.
- DMT is not fully time-parallel, so the final recipe is less clean than “no recurrence during training.”
- The paper’s own discussion says BPTT finetuning may be required to go beyond the teacher.
- The GRU SMT runs fail in the reported sequential compute/data experiments because the GRU architecture induces memory-space collapse during SMT training.
- Experiments are far below the 7B LM scale reported by ParaRNN.
- The converted raw Markdown originally exposed deleted author-note blocks from the LaTeX source; those non-rendered blocks were removed from the committed Markdown snapshot.
Open Questions
- Can SMT initialize nonlinear recurrent memory states and then use ParaRNN or DEER-style parallel trajectory solving for final end-to-end training?
- Which tasks actually require teacher-exceeding nonlinear recurrence rather than Transformer-learned predictive states plus DMT?
- Can DMT be parallelized with a predictable-dynamics solver without losing its on-policy correction benefit?
- What metrics predict long-rollout memory drift better than one-step prediction error or MSE?
- Does SMT’s predictive-state objective preserve rare regimes, event timing, exogenous variables, and action history in multivariate time series?
- Can SMT support control settings where the objective is reward or intervention utility, not only future observation prediction or behavioral cloning?