Universal Transformers

Source

Core Claim

The Universal Transformer applies a single Transformer block repeatedly across depth with shared weights, combining self-attention with a recurrent inductive bias. It adds Adaptive Computation Time (ACT) so that each position can halt after a different number of refinement steps.
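
To make the two mechanisms concrete, here is a minimal sketch of weight sharing across depth combined with per-position halting in the style of Adaptive Computation Time. Everything in it is an illustrative assumption rather than the source's implementation: each position holds a single scalar state, `shared_block` replaces real self-attention with a mean over positions, and the names `halting_prob`, `universal_transformer_act`, `max_steps`, and `eps` are hypothetical.

```python
import math

def shared_block(states, w_self=0.5, w_ctx=0.5):
    """Toy stand-in for one Transformer block (assumption, not real attention):
    each position mixes its own value with the mean over all positions."""
    ctx = sum(states) / len(states)
    return [math.tanh(w_self * s + w_ctx * ctx) for s in states]

def halting_prob(s, w=1.0, b=0.0):
    """Per-position sigmoid halting unit (hypothetical fixed weights)."""
    return 1.0 / (1.0 + math.exp(-(w * s + b)))

def universal_transformer_act(inputs, max_steps=8, eps=0.01):
    states = list(inputs)        # copy so the caller's list is untouched
    n = len(states)
    cum = [0.0] * n              # accumulated halting probability per position
    out = [0.0] * n              # halting-weighted mix of intermediate states
    steps = [0] * n              # number of updates each position received
    running = [True] * n
    for _ in range(max_steps):
        if not any(running):
            break
        # The same block (same weights) is applied at every depth step.
        new_states = shared_block(states)
        for i in range(n):
            if not running[i]:
                continue         # halted positions keep their last state
            p = halting_prob(new_states[i])
            steps[i] += 1
            if cum[i] + p >= 1.0 - eps or steps[i] == max_steps:
                weight = 1.0 - cum[i]   # remainder, so weights sum to 1
                running[i] = False
            else:
                weight = p
            cum[i] += weight
            out[i] += weight * new_states[i]
            states[i] = new_states[i]
    return out, steps
```

Calling `universal_transformer_act([0.1, -2.0, 3.0])` returns one output and one step count per position. Halted positions stay visible to the others through the context term, which loosely mirrors how halted states are carried forward while the remaining positions keep refining.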

Relevance To This Wiki

This is the root source for the looped and recurrent-depth Transformer branch. It gives the ancestor interface: shared layers, iterative state refinement, optional dynamic compute, and a claim that depth recurrence improves systematic generalization.

Limitations

The evidence predates modern large-scale pretraining and does not settle whether shared-depth models beat unique-depth Transformers under matched memory, latency, and training compute.
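
A rough piece of arithmetic shows why the matched-budget question is not trivial. The sketch below assumes the common approximation of about 4·d_model² attention weights plus 2·d_model·d_ff feed-forward weights per block, ignoring biases, layer norms, and embeddings. The function names and the example sizes are hypothetical, not taken from the source.

```python
def block_params(d_model, d_ff):
    """Approximate weight count of one Transformer block
    (assumption: biases, layer norms, and embeddings are ignored)."""
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # two feed-forward matrices
    return attn + ffn

def compare_budgets(d_model, d_ff, depth):
    """Contrast a unique-depth stack with a shared-depth loop of equal depth."""
    per_block = block_params(d_model, d_ff)
    return {
        "unique_params": depth * per_block,  # one weight set per layer
        "shared_params": per_block,          # one weight set reused each step
        "block_applications": depth,         # same per-token compute in both
    }
```

Under these assumptions, `compare_budgets(512, 2048, 6)` gives a shared-depth model one sixth of the block weights of the unique-depth stack while both apply a block six times per token. Matching on parameters and matching on compute or latency therefore yield different baselines, which is the ambiguity the limitation points at.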

Foundation TSFM Relevance

Use as architecture background for dynamic compute and compact latent-state updates, not as direct evidence for numeric time-series or action-conditioned world models.

Open Questions

  • What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
  • Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?