FAST: Efficient Action Tokenization for Vision-Language-Action Models
Source
- Raw Markdown: paper_fast-2025.md
- PDF: paper_fast-2025.pdf
- Preprint: arXiv 2501.09747
- Project page: pi.website/research/fast
Core Claim
FAST is not itself a robot policy; it is an action tokenizer that compresses continuous action trajectories into discrete tokens by exploiting their frequency-space structure. Its goal is to make autoregressive VLA training workable for high-rate continuous control, where naive per-step tokenization yields long, highly correlated token sequences.
Method Notes
- The tokenizer normalizes each action dimension, applies a discrete cosine transform (DCT) to each action chunk, quantizes the coefficients, orders them low-frequency first, and applies BPE-style compression.
- FAST is a bridge between action-token VLAs and continuous action experts: it preserves the next-token training interface while encoding smoother control-input trajectories more compactly.
- The paper is therefore a key counterexample to the claim that fast action heads in modern robot policies must be diffusion- or flow-based; improved tokenization is a competing path.
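
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the min/max normalization bounds (`low`, `high`), the `scale` factor, and the function names `dct_matrix`, `encode_chunk`, and `decode_chunk` are assumptions made for the example, and the final BPE-style compression step is omitted.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n); row k is frequency k."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return m


def encode_chunk(actions, low, high, scale=10.0):
    """Map an action chunk of shape (T, D) to a flat array of integer tokens.

    `low`, `high` (per-dimension bounds) and `scale` are assumed
    hyperparameters for this sketch.
    """
    norm = 2.0 * (actions - low) / (high - low) - 1.0  # roughly [-1, 1]
    coeffs = dct_matrix(actions.shape[0]) @ norm       # (T, D), row k = frequency k
    quantized = np.rint(coeffs * scale).astype(np.int64)
    # Row-major flatten: every dimension's frequency-0 coefficient comes
    # first, then frequency 1, and so on (low-frequency first).
    return quantized.reshape(-1)


def decode_chunk(tokens, horizon, low, high, scale=10.0):
    """Invert encode_chunk up to quantization error."""
    coeffs = tokens.reshape(horizon, -1).astype(np.float64) / scale
    norm = dct_matrix(horizon).T @ coeffs  # inverse of an orthonormal DCT
    return (norm + 1.0) * 0.5 * (high - low) + low


# Hypothetical smooth 2-DoF trajectory over a 50-step chunk.
t = np.linspace(0.0, np.pi, 50)
actions = np.stack([np.sin(t), np.cos(t)], axis=1)
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

tokens = encode_chunk(actions, low, high)
recon = decode_chunk(tokens, horizon=50, low=low, high=high)

zero_fraction = float(np.mean(tokens == 0))
max_error = float(np.max(np.abs(recon - actions)))
print(f"tokens: {tokens.size}, zero fraction: {zero_fraction:.2f}, max error: {max_error:.3f}")
```

For a smooth chunk like this one, most high-frequency coefficients round to zero, so the token stream ends in long runs of repeated values that a BPE-style pass can merge into a much shorter sequence.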
Evidence And Limitations
The paper reports strong compression and improved action-token VLA performance, including comparisons against diffusion/flow-style policies. The tradeoff is at inference: autoregressive decoding emits tokens one at a time, so it can still be slower than a dedicated continuous action expert in some real-time control loops.
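The inference tradeoff can be made concrete with a back-of-the-envelope check: an autoregressive policy keeps up only if decoding one chunk's tokens takes no longer than the time that chunk covers. The sketch below is illustrative only; the function names and every number in the usage lines are hypothetical, not measurements from the paper.

```python
def decode_latency_ms(num_tokens: int, ms_per_token: float, prefill_ms: float = 0.0) -> float:
    """Time to emit one action chunk with token-by-token decoding."""
    return prefill_ms + num_tokens * ms_per_token


def chunk_duration_ms(chunk_steps: int, control_hz: float) -> float:
    """Wall-clock time covered by one action chunk at the given control rate."""
    return 1000.0 * chunk_steps / control_hz


def keeps_up(num_tokens: int, ms_per_token: float, chunk_steps: int,
             control_hz: float, prefill_ms: float = 0.0) -> bool:
    """True if a chunk can be decoded before the previous chunk runs out."""
    latency = decode_latency_ms(num_tokens, ms_per_token, prefill_ms)
    return latency <= chunk_duration_ms(chunk_steps, control_hz)


# Hypothetical setup: 50-step chunks at 50 Hz, 15 ms per decoded token,
# 100 ms of prefill for the image and text prompt.
compressed = keeps_up(num_tokens=40, ms_per_token=15.0,
                      chunk_steps=50, control_hz=50.0, prefill_ms=100.0)
# One token per step per dimension (7 dims assumed), with no compression.
per_step = keeps_up(num_tokens=50 * 7, ms_per_token=15.0,
                    chunk_steps=50, control_hz=50.0, prefill_ms=100.0)
print(compressed, per_step)
```

Under these assumed numbers, shortening the token sequence is what moves decoding inside the control budget. An action expert that produces the whole chunk in a fixed number of passes has no per-token term at all, which is why the comparison depends on the specific control loop.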
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Patch size, dynamic tokenization, and point-wise numeric embeddings | adjacent | Compresses continuous action chunks with DCT coefficients and BPE, ordering low-frequency components first for autoregressive prediction. | It tokenizes robot actions, not observed numeric time-series streams or adaptive per-span telemetry. |
| Control and counterfactuals | adjacent | Makes high-rate continuous control-input trajectories compatible with VLA next-token training. | Tokenization improves policy training but does not model action consequences or counterfactual futures. |
| Dynamic compute allocation | adjacent | Reduces action-token length and training compute for autoregressive VLA policies. | Inference can remain slower than flow/diffusion action experts and is not adaptive to state difficulty. |
Links Into The Wiki
- Foundation Time-Series Model Research Agenda
- FAST
- Robotics Time-Series Modeling
- Robotics Text Conditioning
Open Questions
- Is frequency-space tokenization enough for high-dimensional humanoid hands and whole-body control?
- Should time-series foundation models tokenize future numeric blocks in frequency space instead of predicting raw values?