FAST: Efficient Action Tokenization for Vision-Language-Action Models

Source

Core Claim

FAST is not itself a robot policy; it is an action tokenizer that compresses continuous action trajectories into discrete tokens by exploiting their frequency-domain structure. Its goal is to make autoregressive VLA training workable for high-rate continuous control.

Method Notes

  • The tokenizer normalizes action dimensions, transforms action chunks into a frequency representation, keeps low-frequency structure first, and applies BPE-style compression.
  • FAST is a bridge between action-token VLAs and continuous action experts: it preserves the next-token training interface while encoding smoother control-input trajectories more compactly.
  • The source is therefore a key counterexample to the claim that fast modern robot action heads are always diffusion- or flow-based; improved tokenization is a competing path.
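
The pipeline described in the notes above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names (`dct_matrix`, `encode_chunk`, `decode_chunk`, `most_frequent_pair`, `merge_pair`), the min/max normalization bounds `lo` and `hi`, and the quantization `scale` are all assumptions made for the example. The DCT is written out with NumPy so the block has no other dependencies, and the BPE part shows a single merge step only.

```python
from collections import Counter

import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th frequency basis."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m


def encode_chunk(chunk, lo, hi, scale=10.0):
    """Map a (T, D) action chunk to a flat integer sequence.

    lo, hi: per-dimension bounds used for normalization (hypothetical choice).
    scale:  quantization resolution (hypothetical choice).
    """
    norm = 2.0 * (chunk - lo) / (hi - lo) - 1.0        # normalize to [-1, 1]
    coeffs = dct_matrix(chunk.shape[0]) @ norm         # (T, D), row index = frequency
    quant = np.round(coeffs * scale).astype(np.int64)  # lossy rounding step
    # Row-major flatten: frequency 0 for every dimension, then frequency 1, ...
    return quant.reshape(-1)


def decode_chunk(tokens, T, lo, hi, scale=10.0):
    """Approximately invert encode_chunk (exact up to rounding error)."""
    coeffs = tokens.reshape(T, -1).astype(np.float64) / scale
    norm = dct_matrix(T).T @ coeffs                    # inverse of an orthonormal DCT
    return (norm + 1.0) * (hi - lo) / 2.0 + lo


def most_frequent_pair(tokens):
    """Return the most common adjacent pair in a list of token ids."""
    return Counter(zip(tokens[:-1], tokens[1:])).most_common(1)[0][0]


def merge_pair(tokens, pair, new_id):
    """One BPE-style merge: replace each occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For a smooth chunk, most high-frequency coefficients tend to round to zero, so the tail of the flattened sequence is dominated by repeated values that a pair-merging step such as `merge_pair` can shorten. A real tokenizer would learn many merges over a large corpus of chunks and map the resulting symbols into the model's vocabulary; those steps are omitted here.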

Evidence And Limitations

The source reports strong compression and improved performance for action-token VLAs, including in comparisons against diffusion/flow-style policies. The tradeoff is that autoregressive decoding of action tokens can still be slower or less direct than dedicated continuous action experts in some real-time control loops.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Patch size, dynamic tokenization, and point-wise numeric embeddings | adjacent | Compresses continuous action chunks with DCT coefficients and BPE, ordering low-frequency components first for autoregressive prediction. | It tokenizes robot actions, not observed numeric time-series streams or adaptive per-span telemetry. |
| Control and counterfactuals | adjacent | Makes high-rate continuous control-input trajectories compatible with VLA next-token training. | Tokenization improves policy training but does not model action consequences or counterfactual futures. |
| Dynamic compute allocation | adjacent | Reduces action-token length and training compute for autoregressive VLA policies. | Inference can remain slower than flow/diffusion action experts and is not adaptive to state difficulty. |

Open Questions

  • Is frequency-space tokenization enough for high-dimensional humanoid hands and whole-body control?
  • Should time-series foundation models tokenize future numeric blocks in frequency space instead of predicting raw values?