Universal Transformers

Source

Raw Markdown: paper_universal-transformers-2018.md
PDF: paper_universal-transformers-2018.pdf
Preprint: arXiv 1807.03819
Official code: tensorflow/tensor2tensor

Core Claim

The Universal Transformer applies one shared Transformer block repeatedly across depth, which combines self-attention with a recurrent inductive bias, and adds Adaptive Computation Time (ACT) so that each position can halt after a different number of steps.
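The shared block and per-position halting can be illustrated with a toy loop. This is a minimal sketch, not the paper's or tensor2tensor's implementation: `shared_step`, `halting_prob`, `run_with_act`, the scalar per-position states, and the `threshold` value are all hypothetical stand-ins chosen to keep the example dependency-free. The halting rule follows the general ACT pattern of accumulating a halting probability until it crosses a threshold and giving the final step the remaining weight.

```python
import math

def shared_step(states):
    """Toy stand-in for the shared block (assumption): each position mixes
    with the sequence mean, a crude proxy for self-attention, then tanh."""
    mean = sum(states) / len(states)
    return [math.tanh(0.5 * s + 0.5 * mean + 0.1) for s in states]

def halting_prob(state):
    """Hypothetical halting unit: a sigmoid of the scalar state."""
    return 1.0 / (1.0 + math.exp(-2.0 * state))

def run_with_act(inputs, max_steps=8, threshold=0.99):
    """Apply the same step function until every position has halted.

    Returns (output, steps): the halting-weighted average state per
    position and the number of updates each position received.
    """
    states = list(inputs)          # copy so the caller's list is untouched
    n = len(states)
    cum = [0.0] * n                # accumulated halting probability
    output = [0.0] * n             # weighted average of intermediate states
    steps = [0] * n
    halted = [False] * n
    for t in range(max_steps):
        if all(halted):
            break
        proposed = shared_step(states)   # same function at every depth step
        for i in range(n):
            if halted[i]:
                continue           # halted state is frozen but still feeds the mean
            p = halting_prob(proposed[i])
            if cum[i] + p >= threshold or t == max_steps - 1:
                weight = 1.0 - cum[i]    # remainder, so weights sum to 1
                halted[i] = True
            else:
                weight = p
            cum[i] += weight
            output[i] += weight * proposed[i]
            states[i] = proposed[i]
            steps[i] += 1
    return output, steps

if __name__ == "__main__":
    out, steps = run_with_act([-2.0, 0.0, 3.0])
    print(steps, out)
```

Positions with a high halting probability stop early, while others keep being refined by the same function, which is the dynamic-compute behavior the claim describes.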

Relevance To This Wiki

This is the root source for the looped and recurrent-depth Transformer branch. It gives the ancestor interface: shared layers, iterative state refinement, optional dynamic compute, and a claim that depth recurrence improves systematic generalization.

Limitations

The evidence predates modern large-scale pretraining and does not settle whether shared-depth models beat unique-depth Transformers under matched memory, latency, and training compute.
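Why the matched-budget question is hard can be seen with rough arithmetic. The sketch below assumes a standard block with four `d_model x d_model` attention projections and a two-layer feed-forward network, and ignores biases, layer norms, embeddings, and the attention-score cost; `block_params` and `budget` are hypothetical helper names, not taken from the source.

```python
def block_params(d_model, d_ff):
    """Approximate weight count of one Transformer block (assumption:
    4 attention projections plus a 2-layer feed-forward network)."""
    return 4 * d_model * d_model + 2 * d_model * d_ff

def budget(d_model, d_ff, depth, shared):
    """Return weight count and per-token multiply-accumulates for a stack
    of `depth` block applications, with or without weight sharing."""
    per_block = block_params(d_model, d_ff)
    params = per_block if shared else per_block * depth
    macs_per_token = per_block * depth   # each application costs the same
    return {"params": params, "macs_per_token": macs_per_token}

if __name__ == "__main__":
    print("shared:", budget(512, 2048, depth=6, shared=True))
    print("unique:", budget(512, 2048, depth=6, shared=False))
```

At equal width and depth the shared model stores one block's weights but spends the same per-token compute as the unique-depth model, so matching parameters, matching latency, and matching training compute lead to different baselines.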

Foundation TSFM Relevance

Use as architecture background for dynamic compute and compact latent-state updates, not as direct evidence for numeric time-series or action-conditioned world models.

Links Into The Wiki

Open Questions

What matched-budget baseline should this source be compared against: unique-depth Transformer layers, recurrent state, explicit memory, or extra inference steps?
Which claims transfer from token-sequence reasoning to multivariate time-series state tracking, event streams, or action-conditioned world models?
