Supervised Memory Training
Summary
Supervised Memory Training (SMT) is a pretraining method for nonlinear RNNs that replaces recurrent credit propagation with supervised one-step memory-transition labels. A Transformer encoder-decoder first learns predictive memory states for each context; the RNN updater then learns the one-step transition from the previous memory state and the current input to the next memory state. DAgger Memory Training (DMT) is the follow-up on-policy imitation phase that reduces the rollout drift left by SMT-only one-step training.
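The two phases can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the method's actual setup: a fixed random recurrence stands in for the Transformer teacher's memory states, the updater is a single tanh layer, and the names (step, fit, rollout, W_teacher) are hypothetical. The first phase fits the updater on teacher (previous state, input) -> next state pairs, with no gradient flowing through time. The second phase is a DAgger-style loop that rolls the updater out on its own states, relabels those states with the teacher's next state, and refits on the aggregated data.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MEM, D_IN, T, N_SEQ = 4, 3, 20, 32

def step(W, m_prev, x):
    """One nonlinear update m_t = tanh(W [m_{t-1}; x_t]), batched over rows."""
    return np.tanh(np.concatenate([m_prev, x], axis=-1) @ W.T)

# ASSUMPTION: a fixed random recurrence stands in for the teacher's memory
# states; in SMT these would come from a trained Transformer encoder-decoder.
W_teacher = rng.normal(scale=0.5, size=(D_MEM, D_MEM + D_IN))
xs = rng.normal(size=(N_SEQ, T, D_IN))
teacher = np.zeros((N_SEQ, T + 1, D_MEM))
for t in range(T):
    teacher[:, t + 1] = step(W_teacher, teacher[:, t], xs[:, t])

def one_step_mse(W, m_prev, x, target):
    return float(np.mean((step(W, m_prev, x) - target) ** 2))

def fit(W, m_prev, x, target, lr=0.3, n_steps=300):
    """Gradient descent on the one-step squared error (no unrolling in time)."""
    z = np.concatenate([m_prev, x], axis=-1)
    for _ in range(n_steps):
        pred = np.tanh(z @ W.T)
        # Error signal at the pre-activation, up to a constant factor.
        delta = (pred - target) * (1.0 - pred ** 2)
        W = W - lr * delta.T @ z / len(z)
    return W

def rollout(W):
    """Run the updater on its own states; shape (N_SEQ, T + 1, D_MEM)."""
    m = np.zeros((N_SEQ, T + 1, D_MEM))
    for t in range(T):
        m[:, t + 1] = step(W, m[:, t], xs[:, t])
    return m

def rollout_mse(W):
    return float(np.mean((rollout(W)[:, 1:] - teacher[:, 1:]) ** 2))

# Phase 1 (SMT-style): inputs are teacher states, targets are teacher states.
prev_star = teacher[:, :-1].reshape(-1, D_MEM)
x_flat = xs.reshape(-1, D_IN)
next_star = teacher[:, 1:].reshape(-1, D_MEM)

W0 = rng.normal(scale=0.1, size=W_teacher.shape)
loss_before = one_step_mse(W0, prev_star, x_flat, next_star)
W_smt = fit(W0, prev_star, x_flat, next_star)
loss_after = one_step_mse(W_smt, prev_star, x_flat, next_star)

# Phase 2 (DAgger-style, an assumed form of DMT): inputs are the updater's
# own rolled-out states, targets are still the teacher's next states.
data_prev, data_x, data_next = [prev_star], [x_flat], [next_star]
W_dmt = W_smt
for _ in range(3):
    own_prev = rollout(W_dmt)[:, :-1].reshape(-1, D_MEM)
    data_prev.append(own_prev)
    data_x.append(x_flat)
    data_next.append(next_star)
    W_dmt = fit(W_dmt, np.concatenate(data_prev),
                np.concatenate(data_x), np.concatenate(data_next))

print("one-step MSE before/after SMT phase:", loss_before, loss_after)
print("rollout MSE after SMT / DMT phases:", rollout_mse(W_smt), rollout_mse(W_dmt))
```

The one-step loss only scores transitions that start from teacher states, while the rollout error scores the updater on states it produced itself. The gap between the two is the drift that the on-policy phase is meant to shrink. Because the toy teacher here is exactly representable by the student, that gap may be very small in this sketch.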
Role In The Wiki
SMT belongs to the nonlinear recurrent-state branch of Efficient Recurrent Sequence Models. It complements ParaRNN: ParaRNN parallelizes the actual nonlinear recurrent trajectory with Newton iterations and parallel reduction, while SMT changes the training target into predictive-state imitation.
For time-series and world-model work, the interesting transfer is the predictive memory interface: learn a compact latent state that is sufficient for future prediction, then train a recurrent updater to maintain that state. The caveat is that the limits of Transformer-teacher expressivity, rollout drift, and the extension to reward/control objectives remain open problems.
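A small worked example may make "sufficient for future prediction" concrete. It is a hand-built illustration, not something SMT learns: for a noise-free second-order linear recurrence, the two most recent observations form a compact state that determines the next value, and a recurrent updater only has to shift that window. The coefficients A and B and the function names are hypothetical.

```python
import numpy as np

A, B = 0.6, -0.2  # hypothetical recurrence coefficients

def update(state, x_t):
    """Recurrent updater: keep only the two most recent observations."""
    return np.array([x_t, state[0]])

def predict_next(state):
    """Readout: the next value depends on the history only through the state."""
    return A * state[0] + B * state[1]

# Generate a noise-free series x_{t+1} = A * x_t + B * x_{t-1}.
xs = [1.0, 0.5]
for _ in range(10):
    xs.append(A * xs[-1] + B * xs[-2])

# Maintain the state online and predict each next value from the state alone.
state = np.array([xs[1], xs[0]])
preds = []
for t in range(2, len(xs)):
    preds.append(predict_next(state))
    state = update(state, xs[t])

print(np.allclose(preds, xs[2:]))
```

In the learned setting, the state would be a teacher-provided latent vector rather than a hand-chosen window, and the updater would be trained to track it rather than written by hand.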