FAST: Efficient Action Tokenization for Vision-Language-Action Models

Source

Core Claim

FAST is not itself a robot policy; it is an action tokenizer that compresses chunks of continuous actions into discrete tokens by exploiting their frequency-space structure. Its aim is to make autoregressive VLA training less hostile to high-rate continuous control.

Method Notes

The tokenizer normalizes each action dimension, transforms action chunks into a frequency representation using the discrete cosine transform (DCT), orders low-frequency coefficients first, and applies byte-pair encoding (BPE) to compress the result.
FAST is a bridge between action-token VLAs and continuous action experts: it preserves the next-token training interface while encoding smoother control-input trajectories more compactly.
The source is therefore a key caveat to the claim that fast action heads in modern robot policies are always diffusion- or flow-based; improved tokenization is a competing path.
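
The pipeline described in these notes can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the FAST implementation: the function names (`encode_chunk`, `decode_chunk`, `bpe_merge_once`), the rounding `scale` of 10, and the toy two-dimensional chunk are all hypothetical, the input is assumed to be already normalized to roughly [-1, 1], and a single greedy pair merge stands in for a BPE vocabulary that would normally be trained over many chunks.

```python
import math
from collections import Counter

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out

def idct_ii(c):
    """Inverse of dct_ii (the transform matrix is orthogonal)."""
    n = len(c)
    return [
        sum(c[k] * math.sqrt((1.0 if k == 0 else 2.0) / n)
            * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
        for t in range(n)
    ]

def encode_chunk(chunk, scale=10.0):
    """chunk[t][d] -> flat list of ints, lowest frequency first.

    `scale` is an assumed quantization knob: larger means finer resolution.
    """
    horizon, dims = len(chunk), len(chunk[0])
    # One DCT per action dimension, taken along the time axis.
    coeffs = [dct_ii([chunk[t][d] for t in range(horizon)]) for d in range(dims)]
    quant = [[round(scale * c) for c in row] for row in coeffs]
    # Frequency index is the outer loop, so low frequencies come first.
    return [quant[d][k] for k in range(horizon) for d in range(dims)]

def decode_chunk(tokens, horizon, dims, scale=10.0):
    """Invert encode_chunk up to quantization error."""
    per_dim = [
        idct_ii([tokens[k * dims + d] / scale for k in range(horizon)])
        for d in range(dims)
    ]
    return [[per_dim[d][t] for d in range(dims)] for t in range(horizon)]

def bpe_merge_once(seq, new_symbol):
    """Replace the most frequent adjacent pair with new_symbol (one BPE-style step)."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return list(seq), None
    best = pairs.most_common(1)[0][0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out, best

# Toy chunk: a linear ramp in dimension 0, a constant in dimension 1.
horizon, dims = 8, 2
chunk = [[t / (horizon - 1), 0.5] for t in range(horizon)]
tokens = encode_chunk(chunk)
recon = decode_chunk(tokens, horizon, dims)
merged, pair = bpe_merge_once(tokens, new_symbol="<m0>")
print(tokens)
print(merged, pair)
```

For a smooth chunk like this one, most high-frequency coefficients round to zero, so the flattened sequence ends in long runs of repeated symbols. That redundancy is what a pair-merging compressor can exploit, and it is the intuition behind pairing a frequency transform with BPE.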

Evidence And Limitations

FAST reports strong compression and improved action-token VLA performance, including comparisons against diffusion/flow-style policies. Its tradeoff is that tokenized autoregressive generation can still be slower or less direct than dedicated continuous action experts for some real-time control loops.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Patch size, dynamic tokenization, and point-wise numeric embeddings	adjacent	Compresses continuous action chunks with DCT coefficients and BPE, ordering low-frequency components first for autoregressive prediction.	It tokenizes robot actions, not observed numeric time-series streams or adaptive per-span telemetry.
Control and counterfactuals	adjacent	Makes high-rate continuous control-input trajectories compatible with VLA next-token training.	Tokenization improves policy training but does not model action consequences or counterfactual futures.
Dynamic compute allocation	adjacent	Reduces action-token length and training compute for autoregressive VLA policies.	Inference can remain slower than flow/diffusion action experts and is not adaptive to state difficulty.

Links Into The Wiki

Open Questions

Is frequency-space tokenization enough for high-dimensional humanoid hands and whole-body control?
Should time-series foundation models tokenize future numeric blocks in frequency space instead of predicting raw values?
