---
abstract: |
  Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, `\ModelAcronymLong`{=latex}, enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on `\ModelAcronym{}`{=latex}, we release `\ModelUniversalAcronym`{=latex}, a *universal* robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the $\bm{\pi_0}$ VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
author:
- Anonymous Submission
- |
  Karl Pertsch$^{\ast, 1, 2, 3}$, Kyle Stachowicz$^{\ast, 2}$,\
  Brian Ichter$^{1}$, Danny Driess$^{1}$, Suraj Nair$^{1}$, Quan Vuong$^{1}$, Oier Mees$^{2}$, Chelsea Finn$^{1, 3}$, Sergey Levine$^{1, 2}$\
  $^{1}$Physical Intelligence, $^{2}$UC Berkeley, $^{3}$Stanford\
  [^1] <https://pi.website/research/fast>
bibliography:
- references.bib
title: ' `\ModelAcronym`{=latex}: Efficient Action Tokenization for Vision-Language-Action Models '
---

```{=latex}
\newcommand{\red}[1]{\textcolor{red}{#1}}
```
```{=latex}
\newcommand{\ba}{\mathbf{a}}
```
```{=latex}
\newcommand{\bv}{\mathbf{v}}
```
```{=latex}
\newcommand{\bu}{\mathbf{u}}
```
```{=latex}
\newcommand{\bo}{\mathbf{o}}
```
```{=latex}
\newcommand{\bq}{\mathbf{q}}
```
```{=latex}
\newcommand{\bI}{\mathbf{I}}
```
```{=latex}
\newcommand{\bA}{\mathbf{A}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\renewcommand{\cref}{\Cref}
```
```{=latex}
\def \ModelAcronym {FAST}
```
```{=latex}
\def \ModelUniversalAcronym {FAST+}
```
```{=latex}
\def \ModelAcronymLong {\underline{\textbf{F}}requency-space \underline{\textbf{A}}ction \underline{\textbf{S}}equence \underline{\textbf{T}}okenization (\textbf{FAST})}
```
```{=latex}
\def \GeneralistModelAcronym {$\pi_0$-FAST}
```
```{=latex}
\def \ModelSymbol {$\pi_0$}
```
```{=latex}
\def \ModelSymbolBold {$\bm{\pi_0}$}
```
```{=latex}
\def \Robots {7}
```
```{=latex}
\def \Tasks {68}
```
```{=latex}
\pdfinfo{
   /Author (Physical Intelligence)
   /Title  (@title)
   /Subject (Robot Foundation Models)
   /Keywords (Robot Foundation Models)
}
```
```{=latex}
\def\cameraready{0}
```
```{=latex}
\ifx
```
```{=latex}
\cameraready
```
```{=latex}
\undefined
```
```{=latex}
\else
```
```{=latex}
\fi
```
```{=latex}
\maketitle
```
```{=latex}
\IEEEpeerreviewmaketitle
```
Introduction {#sec:intro}
============

```{=latex}
\centering
```
```{=latex}
\vspace{-0.5cm}
```
![We propose `\ModelAcronym`{=latex}, a simple yet effective approach for tokenization of robot action trajectories via time-series compression. `\ModelAcronym{}`{=latex} enables training of autoregressive VLAs that solve complex dexterous manipulation tasks and generalize broadly to new scenes. We use it to train `\GeneralistModelAcronym{}`{=latex}, a generalist robot policy that matches the performance of the state-of-the-art $\pi_0$ diffusion VLA on dexterous and long-horizon manipulation tasks, while training 5x faster (**top**). ](figures/convergence_2.jpg){#fig:convergence width="\\linewidth"}

```{=latex}
\vspace{-1.5em}
```
Large, high-capacity Transformer models can be tremendously effective for capturing complex and generalizable robotic behaviors, both when trained from scratch [@rt12022arxiv; @zhao2023learning; @octo_2023; @bharadhwaj2023roboagent; @Doshi24-crossformer; @wangscaling] and when initialized from models pre-trained for next-token prediction on Internet-scale image-text corpora [@rt22023arxiv; @kim2024openvla; @wen2024tinyvlafastdataefficientvisionlanguageaction; @black2024pi_0; @ye2024latent]. However, these models require choosing a tokenization of the continuous action signal, which determines how the discrete symbols predicted by the model map to continuous robot actions [@yan2024elastictok; @jang2024efficient; @lee2024behavior; @chen2022beats]. It is widely known that a good choice of tokenization can be critical to the performance of sequence models [@radford2019language; @sennrich2015neural]. Prior robotic policies of this sort typically use naïve tokenization strategies based on a per-dimension, per-timestep binning scheme [@brohan2022rt; @rt22023arxiv; @kim2024openvla]. We find that such methods perform poorly when learning dexterous skills with high-frequency control (see `\cref{fig:teaser}`{=latex}, right). We observe that correlations between time steps are a major challenge for naïve tokenization strategies when predicting sequences of future actions, i.e., action "chunks", as is common for high-frequency control. Highly correlated action tokens *diminish* the effectiveness of the next-token prediction objective used in autoregressive VLAs. Intuitively, in such cases low token prediction loss can often be achieved with mappings as trivial as simply copying the most recent action token, leaving models in poor local optima.

In this work, we propose a new tokenization strategy from first principles. Our key insight is that robot action signals need to be *compressed* before training, to reduce correlation between consecutive tokens. We take inspiration from compression-based tokenization strategies, such as the byte-pair encoding method commonly used by language models [@gage1994new; @sennrich2015neural]. However, since robotic actions are continuous, the compression strategy must be suited to continuous signals. We therefore base our method on the discrete cosine transform (DCT) encoding, which is widely used for compressing continuous signals such as images (e.g., JPEG compression). We find that the resulting tokenization approach, `\ModelAcronymLong`{=latex}, enables us to train autoregressive VLA policies via simple next-token prediction (see `\cref{fig:teaser}`{=latex}, left) for highly dexterous and high-frequency tasks where standard discretization methods fail entirely. Additionally, `\ModelAcronym{}`{=latex} enables, for the first time, efficient VLA training on the recently introduced DROID dataset [@khazatsky2024droid], a large-scale multitask "in-the-wild" robot manipulation dataset. The resulting policy is the first language-conditioned generalist manipulation policy that can be successfully evaluated *zero-shot* in unseen environments, simply by prompting it in natural language.

```{=latex}
\centering
```
![**Left**: `\ModelAcronym{}`{=latex} tokenization enables training of autoregressive Transformers for dexterous robot control via simple next token prediction. **Right**: `\ModelAcronym{}`{=latex} outperforms popular binning tokenization schemes, e.g., used in OpenVLA [@kim2024openvla], particularly for high-frequency robot data. ](figures/teaser.png){#fig:teaser width="\\linewidth"}

Based on `\ModelAcronym{}`{=latex}, we develop `\ModelUniversalAcronym`{=latex}, a **[universal]{.underline} robot action tokenizer**, trained on 1M real robot action trajectories that cover a large diversity of robot embodiments, action spaces and control frequencies. We demonstrate that the `\ModelUniversalAcronym`{=latex} tokenizer effectively tokenizes a wide range of robot action sequences, from single-arm to bi-manual and mobile robots, and is a good off-the-shelf tokenizer for training autoregressive VLA models. When integrated with the $\pi_0$ VLA, FAST-based autoregressive VLAs scale to training on 10k hours of robot data and achieve performance comparable to diffusion-based VLAs across a variety of tasks, while reducing training time by up to 5x (see `\cref{fig:convergence}`{=latex}).

Related Work {#sec:related}
============

`\noindent`{=latex}**Tokenization for language, text, and audio.** Tokenization is a key component of training pipelines for modern transformer-based autoregressive sequence models, and the choice of tokenization approach can have significant impact on model training and downstream performance [@radford2019language]. While there are multiple works exploring the training of "tokenization-free" language models [@gillick2016bytelm; @meta_blt] that directly operate on bit streams, most language models today rely on a text tokenization stage prior to training. A common approach is byte pair encoding [@gage1994new; @radford2019language], which compresses input text by merging frequently occurring token sequences into new tokens. For images, *learned* compression schemes present an effective approach: input images can be represented as "soft tokens" produced by a pre-trained vision encoder [@liu2023llava], and full autoregressive image input-output can be achieved with a vector-quantizing autoencoder [@esser2020taming; @vqvae]. Similar approaches can be extended to the video domain [@yu2023magvit]. In audio generation and speech synthesis, which share the time-series structure of action prediction, state-of-the-art models typically encode time-series audio data using either frequency-domain spectrogram images [@gong21b_interspeech] or using learned vector quantizers [@zeghidour2021soundstreamendtoendneuralaudio].

`\noindent`{=latex}**Vision-language-action models.** Recently, multiple works have developed *generalist* robot policies [@brohan2022rt; @octo_2023; @bharadhwaj2023roboagent; @rt22023arxiv; @Doshi24-crossformer; @kim2024openvla; @wangscaling; @cheang2024gr2generativevideolanguageactionmodel] that are trained on increasingly large robot learning datasets [@open_x_embodiment_rt_x_2023; @khazatsky2024droid; @walke2023bridgedata; @fang2024rh20t; @mandlekar2018roboturk; @jiang2024dexmimicgen]. One promising approach for training generalist policies is the use of vision-language-action models (VLAs; [@rt22023arxiv; @collaboration2023open; @kim2024openvla; @Zawalski24-ecot; @black2024pi_0; @wen2024tinyvlafastdataefficientvisionlanguageaction; @zheng2024tracevla; @zhen20243d; @cheng2024navila; @cheang2024gr2generativevideolanguageactionmodel]). VLAs fine-tune vision-language models that are pre-trained on internet-scale image and text data for robot control. This has multiple benefits: using large vision-language model backbones, with billions of parameters, provides policies with the necessary expressivity for fitting large robot datasets. Reusing weights pre-trained on internet-scale datasets also improves the ability of VLAs to follow diverse language commands and generalize, e.g., to new objects and scene backgrounds [@rt22023arxiv; @kim2024openvla; @Zawalski24-ecot; @wen2024tinyvlafastdataefficientvisionlanguageaction; @jones2025beyond]. Most VLA models today are confined to rather simple, low-frequency control tasks, particularly models that use the most common autoregressive VLA design [@rt22023arxiv; @kim2024openvla]. We show that this is a direct consequence of the *action tokenization* schemes employed by these models, which make training on dexterous tasks challenging. We introduce a new action tokenization approach that allows us to train the first autoregressive VLAs on dexterous and high-frequency robot data.

`\noindent`{=latex}**Action representations for VLA training.** Prior works have explored various action parameterizations for training robot policies, including VLAs. One line of work uses "semantic" action representations like language sub-tasks [@driess2023palm; @saycan2022arxiv; @belkhale2024rthactionhierarchiesusing], or keypoints [@nasirianypivot; @huang2024rekep; @fangandliu2024moka; @dipalo2024kat]. Such approaches can often learn from few examples or even perform tasks *zero-shot* without any robot examples [@nasirianypivot; @huang2024rekep; @fangandliu2024moka], but require hand-designed low-level controllers for task execution, limiting their generality. An alternative approach directly trains VLAs to output low-level robot control commands given image and language instruction inputs. The most common design directly embeds actions into discrete tokens that can be generated with standard autoregressive sequence models, like any popular vision-language model. Existing approaches map from continuous robot actions to discrete action tokens using a simple per-dimension, per-timestep binning scheme [@brohan2022rt; @rt22023arxiv; @kim2024openvla]. We find that this scheme struggles to scale to high-frequency robot control tasks. We propose a new tokenization scheme for robot actions, based on time-series compression techniques, that allows us to train autoregressive VLAs on high-frequency data. A number of works have also proposed alternatives to tokenization, for example by using regression heads or introducing new weights for diffusion decoding [@Doshi24-crossformer; @black2024pi_0; @lee2024behavior; @wen2024tinyvlafastdataefficientvisionlanguageaction].
In comparison, our approach does not require modifications of the underlying pre-trained transformer model, can easily be applied to any pre-trained autoregressive transformer model, and achieves performance competitive with state-of-the-art diffusion-based VLAs [@black2024pi_0] across many tasks, while being significantly more compute-efficient to train.

Another set of related work explores *vector-quantized* action representations [@lee2024behavior; @belkhale2024minivla; @mete2024questselfsupervisedskillabstractions]. Such approaches train a vector-quantized encoder-decoder network, for which reconstruction quality can be sensitive to hyperparameter choices and structure [@yu2023magvit]. We find that these methods perform well at coarse, low-fidelity reconstruction tasks, but fail on high-frequency tasks when fine-grained control is required. In comparison, our `\ModelAcronym`{=latex} tokenization scheme has few hyperparameters and can reconstruct actions with high precision while offering strong compression properties.

Preliminaries {#sec:prelim}
=============

`\noindent`{=latex}**Problem formulation.** Our goal is to train policies $\pi(a_{1:H} \vert o)$ that map an observation $o$ to a sequence of future robot actions $a_{1:H}$. We assume that policies output an "action chunk" [@zhao2023learning; @laiaction], a *sequence* of $H$ actions [@chi2023diffusionpolicy; @black2024pi_0; @zhao2023learning], which makes it easier to produce temporally-consistent actions and reduces compounding error. The goal of *action tokenization* is to define a mapping $\mathcal{T}_a: a_{1:H} \rightarrow [T_1, \dots, T_n]$ from a sequence of continuous actions $a_{1:H}$, with dimensionality $\vert \mathcal{A}\vert$, to a sequence of $n$ discrete tokens $T_i \in \mathcal{V}$ from a vocabulary $\mathcal{V}$ of size $\vert \mathcal{V}\vert$. Note that the number of tokens $n$ may differ between action sequences, just like sentences of the same length may be tokenized into a variable number of text tokens.

`\noindent`{=latex}**Binning-based action tokenization.** The most commonly used approach for action tokenization is a simple binning discretization scheme [@rt12022arxiv; @rt22023arxiv; @kim2024openvla; @zhen20243dvla; @reed2022generalist]. For a given action $a$, this approach discretizes each dimension independently, dividing the range of values in the training dataset into $N$ uniform bins, most commonly using $N=256$. For a *sequence* of $D$-dimensional actions $a_{1:H}$ (i.e., $D = \vert \mathcal{A}\vert$), this tokenization scheme is applied to each time step, resulting in a final token sequence $\mathcal{T}_a\big(a_{1:H}\big) = [T_{1, 1}, \dots, T_{1, D}, \dots, T_{H, 1}, \dots, T_{H, D}]$ of length $H \cdot D$. For high-frequency robot data, this tokenization scheme is sub-optimal: it can easily produce hundreds of tokens per action chunk, which makes training challenging and leads to slow inference.
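
The binning scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any of the cited VLAs: the function names `bin_tokenize` and `bin_detokenize`, the random example chunk, and the choice of 50 Hz with 7 action dimensions are all hypothetical.

```python
import numpy as np

def bin_tokenize(actions, low, high, num_bins=256):
    """Per-dimension, per-timestep binning: (H, D) floats -> H*D integer tokens."""
    actions = np.asarray(actions, dtype=np.float64)
    # Map each dimension's [low, high] range onto [0, 1], then onto bin indices.
    unit = (actions - low) / (high - low)
    bins = np.clip(np.floor(unit * num_bins), 0, num_bins - 1).astype(np.int64)
    # Row-major flattening yields [T_{1,1}, ..., T_{1,D}, ..., T_{H,1}, ..., T_{H,D}].
    return bins.reshape(-1)

def bin_detokenize(tokens, low, high, horizon, num_bins=256):
    """Map tokens back to the centers of their bins, shape (H, D)."""
    bins = np.asarray(tokens).reshape(horizon, -1)
    return low + (bins + 0.5) / num_bins * (high - low)

# Hypothetical example: a 1-second chunk of 7-dimensional actions at 50 Hz.
H, D = 50, 7
rng = np.random.default_rng(0)
chunk = rng.uniform(-1.0, 1.0, size=(H, D))
low, high = np.full(D, -1.0), np.full(D, 1.0)

tokens = bin_tokenize(chunk, low, high)
recon = bin_detokenize(tokens, low, high, horizon=H)
print(len(tokens))  # H * D = 350 tokens for a single chunk
print(np.abs(recon - chunk).max())  # at most half a bin width
```

Even this modest setup already yields 350 tokens per chunk, which illustrates why the token count grows quickly with control frequency and action dimensionality.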

Case Study: How Does Tokenization Affect VLA Training? {#sec:educational_example}
======================================================

```{=latex}
\centering
```
![**Effect of sampling rate on prediction performance.** We train a small autoregressive transformer model on a didactic interpolation task, in which the network must predict the black dashed curve given the four circles. We find that models trained with the binning tokenization approach used in prior VLAs [@rt22023arxiv; @kim2024openvla] produce increasingly poor predictions as we increase the sampling frequency of the underlying signal, due to strong correlation between consecutive tokens at high frequencies. Our `\ModelAcronym{}`{=latex} tokenization approach, based on the discrete cosine transform (DCT), addresses the problem and leads to high-quality predictions across all sampling rates. ](figures/case_study.png){#fig:toy_example width="\\linewidth"}

To illustrate the challenge of training autoregressive policies with current action tokenization approaches, we start with a simple didactic example. We create a synthetic time-series dataset where the goal is to predict a cubic spline that interpolates four randomly-generated points (see `\cref{fig:toy_example}`{=latex}, bottom). This toy problem reflects the challenge faced by policies trained on high-frequency action chunks, which must predict a sequence of continuous actions given some conditioning information. We tokenize the target sequences using the naïve tokenization scheme employed in previous VLA policies, which discretizes each element in the sequence separately into one of 256 bins (see `\cref{sec:prelim}`{=latex}). We then train a small, autoregressive transformer policy to predict the tokenized signal given the conditioning points. We repeat this experiment for different *sampling rates* of the target signal, from 25 to 800 timesteps per sequence, without changing the underlying dataset. This emulates training autoregressive policies on action data collected at different frequencies.

The average prediction MSE of autoregressive models trained at different frequencies is shown in `\cref{fig:toy_example}`{=latex}, top ("naive"). We observe that the model with binning tokenization achieves good prediction performance (i.e., low MSE) for low sampling rates. But as the sampling rate increases, the prediction error steeply increases, until eventually the model simply copies the first action, as seen in the qualitative visualization in `\cref{fig:toy_example}`{=latex}, bottom left. Note that this issue *cannot* be attributed to the data itself: the complexity of the underlying data distribution does not change, and we would expect a model with the same capacity trained for the same number of steps to achieve comparable performance across all sampling rates. So what happened?

To understand how the tokenization scheme impacts learning performance, we need to look at the learning objective itself. Fundamentally, autoregressive models are trained to predict the next token, given all previous tokens. As such, their learning signal is proportional to the marginal information content of $T_i$ given $T_{1:i-1}$. Crucially, when using the naïve per-timestep tokenization scheme, this marginal information *approaches zero* as the control frequency of the training signal increases: for smooth signals, as timesteps get shorter the change per timestep decreases proportionally. This greatly *slows down* the rate of convergence during training and can make it challenging to fit complex, high-frequency datasets. Indeed, such challenges have been observed in prior work. For instance, OpenVLA worked well on the low-frequency BridgeV2 and RT-1 datasets, but has struggled to fit the higher-frequency DROID dataset [@kim2024openvla]. The result of our case study underlines the importance of designing better tokenization schemes for robot actions.
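
The shrinking per-token information can be made concrete with a crude proxy: the fraction of binned tokens that are identical to their predecessor. The sketch below is an illustration only, not the experiment from the case study; the sine signal, the helper name `repeat_fraction`, and the use of token repetition as a stand-in for marginal information are assumptions.

```python
import numpy as np

def repeat_fraction(num_steps, num_bins=256):
    """Fraction of binned tokens equal to the previous token for a smooth signal."""
    t = np.linspace(0.0, 1.0, num_steps)
    signal = np.sin(2.0 * np.pi * t)  # smooth stand-in for one action dimension
    unit = (signal + 1.0) / 2.0       # map [-1, 1] onto [0, 1]
    tokens = np.clip(np.floor(unit * num_bins), 0, num_bins - 1).astype(int)
    return float(np.mean(tokens[1:] == tokens[:-1]))

# Same underlying signal, sampled at increasing rates.
for num_steps in (25, 100, 800):
    print(num_steps, repeat_fraction(num_steps))
```

As the sampling rate grows, a larger share of tokens can be predicted by copying the previous one, which is exactly the trivial solution the autoregressive model falls into.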

Efficient Action Tokenization via Time-Series Compression {#sec:method}
=========================================================

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=\linewidth]{figures/dct_method.pdf}
    \caption{\textbf{Overview of the \ModelAcronym~action tokenization pipeline.} Given a normalized chunk of actions, we apply discrete cosine transform (DCT) to convert the signal to the frequency domain. We then quantize the DCT coefficients and use byte-pair encoding (BPE) to compress the flattened sequence of per-dimension DCT coefficients into the final action token sequence. See \cref{sec:dct_tokenizer} for a detailed description.}
    \label{fig:method_overview}
\end{figure*}
```
We saw in the previous section how redundancy in high-frequency action trajectories can lead to low marginal information for each action token, and thereby poor training performance. To address this, we need a tokenization approach that compresses the highly redundant action signal into a smaller number of high-information tokens. In this section, we will first describe a simple approach for compressing continuous time series (`\cref{sec:dct}`{=latex}), then use it to design an action tokenization algorithm (`\cref{sec:dct_tokenizer}`{=latex}), and finally explain how we train a *universal* tokenizer for robot actions (`\cref{sec:universal_tokenizer}`{=latex}).

Time-Series Compression via Discrete Cosine Transform {#sec:dct}
-----------------------------------------------------

There is a rich body of work on effectively compressing continuous time series, from approaches that compress signals after transforming them into the frequency domain [@fft; @dct; @jpeg] to *learned* compression approaches, e.g., based on vector quantization [@vqvae; @fsq]. One key takeaway of our work is that *any* sufficiently effective compression approach, when applied to the action targets, is suited to improve the training speed of VLA models. In practice, there are a few considerations that may still lead us to favor some compression algorithms over others, e.g., the complexity of training the tokenizer, and how efficient it is at tokenizing and detokenizing actions.

In this work, we use a compression algorithm based on the discrete cosine transform (DCT) [@dct]. DCT is a frequency-space transform that represents a continuous signal as a sum of cosine elements of various frequencies. Low frequencies capture the overall shape of the signal, while high-frequency components reflect sharp jumps. DCT is a commonly used transformation for compression algorithms, e.g., for JPEG image compression [@jpeg], due to its simplicity and computational efficiency, and its strong compression property on practical images: since pixels often vary smoothly, DCT can often represent most of the information of an input signal in only a few coefficients. Signals can be compressed by omitting frequency components with low weights. Compared to learned compression approaches based on vector quantization, DCT-based compression is an analytical approach, thus extremely simple and fast.
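
The energy-compaction property can be checked numerically. The following is a minimal sketch that builds an orthonormal DCT-II matrix by hand in NumPy (a library routine such as a SciPy DCT could be used instead); the toy polynomial trajectory and the choice of keeping 5 coefficients are assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix M: coeffs = M @ signal, signal = M.T @ coeffs."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)       # rescale the constant row for orthonormality
    return m

n = 50
t = np.linspace(0.0, 1.0, n)
signal = 0.5 * t**3 - 0.3 * t + 0.2  # smooth toy trajectory (assumed)

M = dct_matrix(n)
coeffs = M @ signal

# Keep only the 5 lowest-frequency coefficients and reconstruct.
truncated = np.where(np.arange(n) < 5, coeffs, 0.0)
recon = M.T @ truncated

energy_kept = np.sum(truncated**2) / np.sum(coeffs**2)
max_error = np.abs(recon - signal).max()
print(f"energy in first 5 coefficients: {energy_kept:.4f}")
print(f"max reconstruction error: {max_error:.4f}")
```

For a smooth signal like this one, a handful of low-frequency coefficients carries nearly all of the signal energy, which is what makes dropping the remaining coefficients a cheap form of compression.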

The `\ModelAcronym`{=latex} Tokenization Algorithm {#sec:dct_tokenizer}
--------------------------------------------------

We use the discrete cosine transform to design `\ModelAcronym`{=latex}, a quick and effective tokenization approach for robot actions. We detail the steps from raw robot actions to action tokens in `\cref{fig:method_overview}`{=latex}. We first normalize the input actions, such that the 1st and 99th percentiles of the values in the training dataset for each action dimension map to $-1$ and $1$, respectively, so that most values fall in the range $[-1, 1]$. This initial normalization step is useful to bring the data into a specified range and also makes tokenization of cross-embodied datasets with different action scales easier. We use quantiles to be robust to outlier actions which occasionally occur in large robot datasets. After the data is normalized, we apply the discrete cosine transform to each action dimension separately. To compress the DCT-converted signal we can simply omit insignificant coefficients, which we implement through a scale-and-round operation, where the scaling coefficient is a hyperparameter that trades off between lossiness and compression rate of the tokenization operation.
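
The normalization, DCT, and scale-and-round steps can be sketched as follows. This is an illustrative sketch under assumptions, not the released implementation: the helper names, the synthetic sine dataset, and the hand-built DCT matrix are hypothetical, and the coefficient array is laid out as (frequency, action dimension), i.e., the transpose of the $\vert \mathcal{A} \vert \times H$ convention used in the text.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (rows: frequencies, columns: time steps)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def fit_quantiles(dataset):
    """Per-dimension 1st/99th percentiles over a dataset of shape (N, H, D)."""
    flat = dataset.reshape(-1, dataset.shape[-1])
    return np.percentile(flat, 1, axis=0), np.percentile(flat, 99, axis=0)

def normalize(chunk, q01, q99):
    """Map the [q01, q99] range of each dimension onto [-1, 1]."""
    return 2.0 * (chunk - q01) / (q99 - q01) - 1.0

def quantized_dct(chunk, scale=10.0):
    """(H, D) normalized chunk -> (H, D) integer DCT coefficients."""
    M = dct_matrix(chunk.shape[0])
    return np.rint(scale * (M @ chunk)).astype(np.int64)

# Synthetic dataset of smooth trajectories (assumed for illustration).
rng = np.random.default_rng(0)
N, H, D = 200, 50, 3
t = np.linspace(0.0, 1.0, H)[None, :, None]
amp = rng.uniform(0.5, 2.0, size=(N, 1, D))
freq = rng.uniform(0.25, 1.0, size=(N, 1, D))
phase = rng.uniform(0.0, 2.0 * np.pi, size=(N, 1, D))
dataset = amp * np.sin(2.0 * np.pi * freq * t + phase)

q01, q99 = fit_quantiles(dataset)
chunk = normalize(dataset[0], q01, q99)
coeffs = quantized_dct(chunk, scale=10.0)
zero_fraction = float(np.mean(coeffs == 0))
print(coeffs[:4])      # a few low-frequency coefficients per dimension
print(zero_fraction)   # share of coefficients rounded to zero
```

A larger `scale` keeps more coefficients nonzero (less loss, less compression), while a smaller one rounds more of them away.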

After the rounding operation, the DCT coefficient matrix is typically sparse, with most entries being zero and only a few significant coefficients remaining per action dimension. To actually realize the compression, we must convert this sparse matrix into a sequence of dense tokens. We flatten the matrix into a 1-dimensional vector of integers, interleaving action dimensions by including all low-frequency components first, and train a byte pair encoding (BPE) tokenizer [@gage1994new] to losslessly compress it into dense action tokens. The BPE step "squashes" the zero-valued components and merges frequently-occurring coefficient combinations across action dimensions. We choose BPE to compress the DCT matrix, since many efficient implementations exist and it can produce a fixed-size output vocabulary that can be easily integrated into the existing vocabulary of vision-language models for VLA training. Other lossless compression algorithms like Huffman coding [@huffmancode] or Lempel-Ziv methods [@lempelziv] (the algorithms underlying the gzip compression approach) could be used instead, but we leave this investigation for future work.
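
To make the role of BPE concrete, the following toy byte-pair encoder operates directly on integer sequences. It is a deliberately simplified sketch, not the tokenizer implementation used in this work: the function names, the tiny hand-written coefficient sequences, and the number of merges are assumptions for illustration.

```python
from collections import Counter

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` by `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(sequences, num_merges):
    """Learn merges as a list of ((left, right), new_symbol) entries."""
    seqs = [list(s) for s in sequences]
    next_id = max(max(s) for s in seqs) + 1  # new symbols never collide with coefficients
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left that occurs more than once
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges

def bpe_encode(seq, merges):
    """Apply the learned merges in training order."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq

def bpe_decode(tokens, merges):
    """Losslessly expand merged symbols back into the original integers."""
    expand = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(tokens))
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.extend([right, left])
        else:
            out.append(tok)
    return out

# Hypothetical flattened, quantized DCT coefficients with trailing zeros.
sequences = [
    [12, -3, 1, 0, 0, 0, 0, 0],
    [11, -3, 2, 0, 0, 0, 0, 0],
    [12, -2, 1, 0, 0, 0, 0, 0],
]
merges = train_bpe(sequences, num_merges=4)
encoded = bpe_encode(sequences[0], merges)
print(len(sequences[0]), len(encoded))  # the encoded sequence is shorter
```

The runs of zeros are the first thing to be merged, which is the "squashing" effect described above; decoding recovers the integer coefficients exactly.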

Note that the *order* of flattening the $\vert \mathcal{A} \vert \times H$ DCT coefficient matrix prior to BPE encoding can have significant impact on policy training. There are two options: column-first flattening, i.e., concatenating the lowest-frequency components for each dimension first, or row-first flattening, i.e., concatenating all frequency components for a single action dimension first. We choose the former, since we find that predicting the *low-frequency* components, which characterize the overall shape of the output sequence, first during autoregressive prediction leads to more stable policy rollouts.
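
The two flattening orders correspond to NumPy's Fortran-style and C-style memory orders. The small coefficient matrix below is a made-up example with $\vert \mathcal{A} \vert = 3$ action dimensions (rows) and $H = 4$ frequency components (columns, low to high).

```python
import numpy as np

# Hypothetical quantized DCT coefficients, shape (|A|, H) = (3, 4).
coeffs = np.array([
    [12,  4, 0, 0],   # action dimension 1, frequencies low -> high
    [-7,  0, 1, 0],   # action dimension 2
    [ 3, -1, 0, 0],   # action dimension 3
])

# Column-first: lowest-frequency component of every dimension comes first.
column_first = coeffs.flatten(order="F")
# Row-first: all frequency components of one dimension come first.
row_first = coeffs.flatten(order="C")

print(column_first.tolist())  # [12, -7, 3, 4, 0, -1, 0, 1, 0, 0, 0, 0]
print(row_first.tolist())     # [12, 4, 0, 0, -7, 0, 1, 0, 3, -1, 0, 0]
```

With column-first flattening, the trailing zeros from high-frequency components also end up in one contiguous run, which is convenient for the subsequent BPE step.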

```{=latex}
\begin{algorithm}[t]

\caption{\ModelAcronym~Tokenizer}
\label{alg:fast}
\begin{algorithmic}
\Require scale $\gamma$, (for inference) BPE dictionary $\phi$
\Procedure{FASTTokenizer}{$a_{1:H}$}
    \State $C^i_{j} \gets \texttt{DCT}\left(a^{i}_{1:H}\right)$ \Comment{Compute DCT coefficients}
    \State $\bar C^i_{j} \gets \texttt{round}\left(\gamma \cdot C^i_{j}\right)$ \Comment{Quantize coefficients}
    \State $\left[T_k\right] \gets \left[\bar C^1_1, \bar C^2_1, \dots, \bar C^1_2, \dots, \bar C^{\vert \mathcal{A} \vert}_H\right]$ \Comment{Flatten, low frequencies first}

\noindent \textbf{BPE Training}:
    \State $\phi \gets \texttt{TrainBPE}(\mathcal{D} := \{[T_k]\})$

\noindent \textbf{Tokenization}:
    \State $\left[{\bar T}_1, \dots, {\bar T}_{\bar k}\right] \gets \texttt{BPE}\left([T_1, \dots, T_k], \phi\right)$
    \State \Return $\left[{\bar T}_1, \dots, {\bar T}_{\bar k}\right]$
\EndProcedure
\end{algorithmic}
\end{algorithm}
```
All operations in our tokenization pipeline are easily invertible, allowing fast decoding of predicted actions. The tokenizer has only two hyperparameters: the scale applied to the DCT coefficients before rounding, and the vocabulary size of the BPE compression step. We find that performance is not very sensitive to either parameter, and we use the same values across all our single-dataset tokenization experiments (rounding scale 10, BPE vocabulary size 1024). This is in contrast to end-to-end *learned* compression modules that rely on vector quantization [@vqvae]. Such networks are often tedious to train, and require careful dataset-specific hyperparameter selection to achieve good reconstruction [@yu2023magvit; @fsq]. Our experiments show that our DCT-based tokenization approach trains higher-performing policies than VQ-based approaches, while being significantly simpler and easier to tune.
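
The invertibility of the DCT stage can be illustrated with a short round trip: divide by the scale, apply the inverse DCT, and undo the normalization. This is a sketch under assumptions (hand-built DCT matrix, made-up two-dimensional chunk, and assumed normalization statistics `q01`/`q99`); the only loss comes from rounding, and the BPE stage, omitted here, is lossless.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def encode(chunk, q01, q99, scale=10.0):
    """Normalize, transform, and quantize an (H, D) chunk of actions."""
    normalized = 2.0 * (chunk - q01) / (q99 - q01) - 1.0
    return np.rint(scale * (dct_matrix(len(chunk)) @ normalized)).astype(np.int64)

def decode(coeffs, q01, q99, scale=10.0):
    """Invert each step of `encode` in reverse order."""
    normalized = dct_matrix(len(coeffs)).T @ (coeffs / scale)  # inverse DCT
    return (normalized + 1.0) / 2.0 * (q99 - q01) + q01        # undo normalization

H = 50
t = np.linspace(0.0, 1.0, H)
chunk = np.stack([0.3 * np.sin(2.0 * np.pi * t), 0.1 * t**2], axis=1)  # (H, 2)
q01 = np.array([-0.3, 0.0])  # assumed per-dimension normalization statistics
q99 = np.array([0.3, 0.1])

coeffs = encode(chunk, q01, q99)
recon = decode(coeffs, q01, q99)

# Error measured in normalized units; rounding bounds its RMS by 0.5 / scale.
rms_normalized = np.sqrt(np.mean(((recon - chunk) / (q99 - q01) * 2.0) ** 2))
print(rms_normalized)
```

Because the DCT matrix is orthonormal, the rounding error in coefficient space carries over unchanged in the root-mean-square sense to the reconstructed, normalized actions.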

We empirically demonstrate the benefits of our DCT-based tokenization in the toy example from `\cref{sec:educational_example}`{=latex}. `\cref{fig:toy_example}`{=latex} shows that training the autoregressive model on DCT-compressed target tokens achieves consistently low prediction error across a wide range of sampling frequencies. We provide a concise summary of our tokenization approach in `\cref{alg:fast}`{=latex} and test the effectiveness of `\ModelAcronym`{=latex} tokenization on robot control problems in `\cref{sec:experiments}`{=latex}.

A Universal Robot Action Tokenizer {#sec:universal_tokenizer}
----------------------------------

The only *learned* component of our tokenizer is the vocabulary of the BPE encoder, which needs to be trained for each new dataset that the tokenizer is being applied to. While this learning process is fast (typically only a few minutes), it adds friction to using `\ModelAcronym`{=latex} tokenization. Thus, we aim to train a **universal** action tokenizer that can encode chunks of robot actions from *any* robot. To this end, we train a tokenizer using the pipeline described above on a large, cross-embodied robot action dataset, consisting of approximately one million 1-second action chunks from single-arm, bi-manual and mobile manipulation robots, with joint and end-effector control action spaces and various control frequencies. We provide a detailed breakdown of the data mixture used for training the universal tokenizer in `\cref{sec:app_universal_data_mix}`{=latex}. Once trained, our universal action tokenizer, `\ModelUniversalAcronym`{=latex}, can be applied as a black-box tokenizer on 1-second action sequences from any robot setup. Our experimental evaluation shows that it is competitive with tokenizers tuned for individual datasets.

`\noindent`{=latex}**Code release.** We release our pre-trained universal action tokenizer, `\ModelUniversalAcronym`{=latex}, as a convenient HuggingFace `AutoProcessor` class that makes it easy to apply the tokenizer to any new robot action chunk in three lines of code:

```{=latex}
\lstset{
  language=Python,
  basicstyle=\ttfamily\small,
  keywordstyle=[1]\color{green!50!black}\bfseries,   %
  keywordstyle=[2]\color{blue}\bfseries,   %
  commentstyle=\color{grey},                      %
  stringstyle=\color{red!70!black},               %
  identifierstyle=\color{black},       %
  showstringspaces=false,
}
```
``` {.python language="Python"}
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", 
    trust_remote_code=True
)
tokens = tokenizer(action_chunk)
```

For best compression results, we recommend normalizing input actions to the range $[-1, 1]$ via quantile normalization as described in `\cref{sec:dct_tokenizer}`{=latex}, and tokenizing 1-second action chunks at a time. Our module also makes it easy to train a *new* `\ModelAcronym`{=latex} tokenizer on a given dataset of action chunks:

``` {.python language="Python"}
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", 
    trust_remote_code=True
)
new_tokenizer = tokenizer.fit(action_dataset)
```

Experiments {#sec:experiments}
===========

In our experiments, we test `\ModelAcronym{}`{=latex} with two VLA backbones: $\pi_0$ [@black2024pi_0] and OpenVLA [@kim2024openvla]. We compare `\ModelAcronym{}`{=latex} to alternative action tokenization schemes and ablate key design decisions. We then compare $\pi_0$ models trained with `\ModelAcronym{}`{=latex} tokenization to the state-of-the-art $\pi_0$ flow-matching (diffusion) VLA, and test the scaling of autoregressive VLA training with `\ModelAcronym{}`{=latex} to large, cross-embodied datasets with 10k hours of dexterous robot manipulation data.

Experimental Setup {#sec:exp_setup}
------------------

`\noindent`{=latex}**Policy implementation.** We test different tokenization schemes for autoregressive VLA training with popular VLA backbones. For most of our experiments, we use $\pi_0$ [@black2024pi_0], a VLA based on PaliGemma-3B [@beyer2024paligemma]. We also test with OpenVLA [@kim2024openvla], which is built on Prismatic 7B [@karamcheti2024prismatic]. During training, we tokenize 1-second action chunks and overwrite the least used tokens in the VLM vocabulary with the resulting action tokens, following prior VLAs [@rt22023arxiv; @kim2024openvla]. We fine-tune the VLA models for robot action prediction, without weight freezing. We provide more details on the policy training setup in `\cref{sec:app_policy_training_details}`{=latex}.
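
The vocabulary overwriting step can be pictured with a small sketch. It assumes that per-token usage counts for the VLM vocabulary are available; the function name `build_action_token_map` and the tie-breaking rule are hypothetical and are not taken from any VLA code base.

```python
def build_action_token_map(vocab_usage_counts, num_action_tokens):
    """Assign each action token id to one of the least-used vocabulary ids.

    `vocab_usage_counts[i]` is a (hypothetical) usage count for vocabulary
    id `i`. Ties are broken by vocabulary id to keep the mapping deterministic.
    """
    by_usage = sorted(
        range(len(vocab_usage_counts)),
        key=lambda idx: (vocab_usage_counts[idx], idx),
    )
    reserved = by_usage[:num_action_tokens]
    action_to_vocab = {a: v for a, v in enumerate(reserved)}
    vocab_to_action = {v: a for a, v in action_to_vocab.items()}
    return action_to_vocab, vocab_to_action


# Toy vocabulary with six entries; three action tokens take the rarest slots.
counts = [50, 0, 7, 3, 0, 99]
action_to_vocab, vocab_to_action = build_action_token_map(counts, 3)
```

Keeping both directions of the mapping lets the training pipeline encode action tokens into vocabulary ids and lets the inference pipeline decode predicted vocabulary ids back into action tokens.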

```{=latex}
\centering
```
![**Evaluation environments.** We test `\ModelAcronym{}`{=latex} across 7 evaluation environments: 6 real-robot tasks and 1 simulation environment. The tasks are designed to test VLA performance on highly dexterous tasks, like folding clothes from a laundry basket ("Laundry Folding"), and generalization, e.g., zero-shot table-top manipulation in unseen environments ("DROID"). ](figures/environments.jpg){#fig:environments width="0.9\\linewidth"}

`\noindent`{=latex}**Evaluation tasks.** We develop a suite of 7 evaluation tasks (6 real robot, 1 simulated; see `\cref{fig:environments}`{=latex}), designed to test VLA performance on both highly dexterous tasks, like laundry folding, and generalization tasks, like performing table-top manipulation zero-shot in unseen environments.

-   **Libero**: We test on the Libero [@liu2024libero] simulated benchmark suites. We measure average performance across Libero-Spatial, Libero-Object, Libero-Goal, and Libero-10.

-   **Table bussing** [@black2024pi_0] (20 Hz): a UR5 single-arm robot needs to clean a table, sorting 12 objects into a trash bin (for trash) and a plastic container (for plates, bowls, cups and cutlery). The task requires precise grasping of various objects.

-   **T-Shirt folding** [@black2024pi_0] (50 Hz): a bi-manual ARX robot setup needs to fold various shirts on a stationary table top. At the beginning of the task, the shirts are placed flat on the table. Succeeding at the task requires precise grasps and movements to fold the shirt.

-   **Grocery bagging** [@black2024pi_0] (20 Hz): a UR5 single-arm robot needs to pack seven objects from a table into a grocery bag, taking care to not topple or rip the bag in the process. This task requires picking a diverse set of objects and carefully inserting them into the bag.

-   **Toast out of toaster** [@black2024pi_0] (50 Hz): a bi-manual Trossen Viper-X robot needs to remove two slices of bread from a toaster and place them on a plate. This task requires precise grasping and placement of the bread slices.

-   **Laundry folding** [@black2024pi_0] (50 Hz): a bi-manual ARX robot needs to take shirts and shorts from a basket, flatten them on a table, and fold and stack them. This is the most dexterous task we test. It requires precise grasps, dynamic motions to flatten the clothes, retries and corrections when clothes get tangled up, and precise placement of the folded clothes on the existing stack. We report success rate on individual clothing items.

-   **Zero-shot DROID tabletop manipulation** [@khazatsky2024droid] (15 Hz): we test a policy trained on the full DROID dataset across various table-top manipulation tasks like picking and placing objects, wiping, and opening and closing drawers. Importantly, we test the policy in a completely *unseen* environment, with a new table setup, background, novel objects, viewpoint and table height. To our knowledge, this is the first "zero-shot" evaluation of DROID policies in a completely unseen environment, without co-training or fine-tuning, simply by prompting a pre-trained model with natural language.

Following @black2024pi_0, we use grocery bagging, the toaster task, and laundry folding only to evaluate our most powerful, generalist VLA in `\cref{sec:generalist_vlas}`{=latex}. We provide additional details on training datasets and evaluation tasks in `\cref{sec:app_exp_task_details}`{=latex}.

`\noindent`{=latex}**Comparisons.** We test **`\ModelAcronym`{=latex}**, our DCT-based action tokenization approach, trained on each evaluation dataset individually, and **`\ModelUniversalAcronym`{=latex}**, our universal DCT-based action tokenizer, trained on a large dataset of 1M action sequences. Note that we trained the universal tokenizer on the most diverse real robot dataset we could assemble, which includes data from our real-robot evaluation tasks. We compare both tokenizers to the per-dimension binning scheme used by prior autoregressive VLAs like RT-2 [@rt22023arxiv], RT-2-X [@open_x_embodiment_rt_x_2023] and OpenVLA [@kim2024openvla], dubbed **naïve tokenization**. We apply the binning tokenization to each time step in the action chunk separately and then concatenate. Finally, while our approach provides a compressed tokenization without the need to train any separate model, we can consider an alternative compression scheme that instead trains a model to produce a quantized representation of the action chunk via **FSQ** [@fsq], a simpler alternative to VQ-VAE [@vqvae]. This tokenization strategy has been previously used to tokenize high-dimensional image data [@fsq; @yu2023magvit], and can be viewed as an ablation of our compression-based approach, utilizing compressed representations but with a more complex learning-based alternative to our relatively simple DCT-based method.
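
For concreteness, the naïve tokenization baseline can be sketched in a few lines. This is an illustration under stated assumptions, not the code of any prior VLA: it assumes actions already normalized to $[-1, 1]$ and 256 uniform bins, and the function names are hypothetical.

```python
def naive_tokenize(action_chunk, num_bins=256, low=-1.0, high=1.0):
    """Per-dimension, per-timestep binning; returns a flat list of bin indices."""
    tokens = []
    for step in action_chunk:        # iterate over timesteps
        for value in step:           # iterate over action dimensions
            clipped = min(max(value, low), high)
            frac = (clipped - low) / (high - low)
            tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def naive_detokenize(tokens, num_dims, num_bins=256, low=-1.0, high=1.0):
    """Map bin indices back to bin centers and restore the (T, D) layout."""
    width = (high - low) / num_bins
    values = [low + (t + 0.5) * width for t in tokens]
    return [values[i:i + num_dims] for i in range(0, len(values), num_dims)]


# A 1-second chunk at 50 Hz with 14 action dimensions yields 50 * 14 tokens.
chunk = [[0.0] * 14 for _ in range(50)]
tokens = naive_tokenize(chunk)
```

Because every timestep contributes one token per action dimension, the sequence length grows linearly with the control frequency, which is the property the compression-based tokenizers are designed to avoid.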

Comparing Action Tokenizers for VLA Training {#sec:main_comparison}
--------------------------------------------

```{=latex}
\centering
```
::: {#tab:compression_ratios}
  Dataset      Action Dimension   Control Frequency   Avg. Tokens (Naïve)   Avg. Tokens (`\ModelAcronym`{=latex})   Compression Ratio
  ------------ ------------------ ------------------- --------------------- --------------------------------------- -----------------
  BridgeV2     7                  5`\;`{=latex}Hz     35                    `\cellcolor{lightgreen}`{=latex}20      1.75
  DROID        7                  15`\;`{=latex}Hz    105                   `\cellcolor{lightgreen}`{=latex}29      3.6
  Bussing      7                  20`\;`{=latex}Hz    140                   `\cellcolor{lightgreen}`{=latex}28      5.0
  Shirt Fold   14                 50`\;`{=latex}Hz    700                   `\cellcolor{lightgreen}`{=latex}53      13.2

  : **Comparison of the average token count per action chunk** for naïve tokenization and `\ModelAcronym`{=latex}. We use 1-second chunks in all datasets. With our method, each chunk requires many fewer tokens, particularly for high-frequency domains such as the T-shirt folding task, indicating that it is more effective at removing redundancy.
:::

We first provide a comparison of compression rates between our proposed `\ModelAcronym`{=latex} tokenizer and the naïve binning scheme used in prior works in `\cref{tab:compression_ratios}`{=latex}. We use 1-second action chunks from datasets with various action dimensionalities and control frequencies. For both approaches we use the default hyperparameters, which have comparable tokenization errors. We see that `\ModelAcronym`{=latex} achieves a significant compression of the input action sequences across all datasets. The compression benefits are especially pronounced for datasets with high-frequency action data. Interestingly, `\ModelAcronym`{=latex} consistently generates roughly 30 action tokens per chunk per robot arm (i.e., 60 tokens for the bi-manual setup) in each of the domains. This suggests that `\ModelAcronym`{=latex} finds a representation that approximates the complexity of the underlying action signal, and is largely independent of the frequency of the action data.
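
The naïve token counts and compression ratios in `\cref{tab:compression_ratios}`{=latex} follow from simple arithmetic, which the short sketch below reproduces. The `\ModelAcronym`{=latex} token counts are copied from the table; the helper name `naive_token_count` is introduced here for illustration only.

```python
# name: (action dimension, control frequency in Hz, avg. compressed tokens per chunk)
datasets = {
    "BridgeV2": (7, 5, 20),
    "DROID": (7, 15, 29),
    "Bussing": (7, 20, 28),
    "Shirt Fold": (14, 50, 53),
}


def naive_token_count(action_dim, control_hz, chunk_seconds=1.0):
    """One token per action dimension per timestep in the chunk."""
    return round(action_dim * control_hz * chunk_seconds)


for name, (dim, hz, compressed_tokens) in datasets.items():
    naive = naive_token_count(dim, hz)
    ratio = naive / compressed_tokens
    print(f"{name}: naive={naive}, compressed={compressed_tokens}, ratio={ratio:.2f}")
```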

We note that this compression is lossy: the scale parameter $\gamma$ from `\cref{alg:fast}`{=latex} determines the trade-off between compression ratio and reconstruction accuracy. The token counts in `\cref{tab:compression_ratios}`{=latex} are reported at comparable reconstruction accuracy. Please see `\cref{sec:app_compression_plots}`{=latex} for plots showing the trade-off between compression and fidelity for each of the tokenizers we compare.
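
The role of the scale parameter can be illustrated with a toy experiment on a single action dimension. The sketch below is an assumption-laden stand-in for the full pipeline: it applies an orthonormal DCT, multiplies by $\gamma$, and rounds to integers, omitting the BPE stage entirely; the helper names and the sinusoidal test signal are hypothetical.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D list of floats."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k, c in enumerate(coeffs) if k > 0)
        out.append(total)
    return out


def compress(signal, gamma):
    """Scale and round DCT coefficients; return (#nonzero coefficients, max abs error)."""
    quantized = [round(gamma * c) for c in dct(signal)]
    recon = idct([q / gamma for q in quantized])
    nonzero = sum(1 for q in quantized if q != 0)
    max_err = max(abs(a - b) for a, b in zip(signal, recon))
    return nonzero, max_err


# A smooth 50-step signal, standing in for one action dimension at 50 Hz.
signal = [math.sin(2 * math.pi * t / 50) for t in range(50)]
for gamma in (1, 10, 100):
    print(gamma, compress(signal, gamma))
```

A larger $\gamma$ keeps more nonzero coefficients and lowers the reconstruction error, while a smaller $\gamma$ zeroes out more coefficients and therefore compresses more aggressively.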

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=0.8\linewidth]{figures/main_result.pdf}
    \caption{\textbf{Comparison of policy performance using different tokenization approaches.} We find that tokenization approaches that compress action targets (\ModelAcronym, FSQ) lead to substantially more efficient training than the na\"{i}ve binning tokenization used in prior VLAs. Overall, we find that \ModelAcronym~leads to more effective policy training than FSQ, particularly on dexterous real-robot tasks. Our universal tokenizer, \ModelUniversalAcronym, matches the performance of dataset-specific tokenizers. We report mean and 95\% CI.
    }
    \label{fig:tokenization_comparison}
\end{figure*}
```
Next, we train policies using the policy architecture and tokenization approaches described in `\cref{sec:exp_setup}`{=latex}. We report results in `\cref{fig:tokenization_comparison}`{=latex}.

Overall, we find that the naïve tokenization applied in prior works struggles to learn effective policies on high-frequency robot data. This is particularly apparent for the highest frequency tasks in our evaluations: Table Bussing (20 Hz) and T-Shirt Folding (50 Hz). On both tasks, policies trained with naïve tokenization are unable to make progress on the task.

In contrast, we find that compression-based tokenization leads to effective training. Comparing `\ModelAcronym{}`{=latex} to our FSQ baseline, we find that `\ModelAcronym{}`{=latex} performs as well as, and at times better than, FSQ, particularly on the dexterous, high-frequency tasks, despite being much simpler and requiring no separate neural network training.

```{=latex}
\centering
```
![**Evaluation environments of `\ModelAcronym{}`{=latex} policy trained on DROID [@khazatsky2024droid].** We find that the same policy checkpoint generalizes robustly, and performs various simple table-top tasks *zero-shot* across three university campuses. ](figures/droid_quali.png){#fig:droid_quali width="\\linewidth"}

Notably, `\ModelAcronym{}`{=latex} tokenization enables the first successful training of a strong generalist policy on the DROID dataset [@khazatsky2024droid], which can be evaluated *zero-shot* in unseen environments, without fine-tuning, by simply prompting it in natural language. All prior works, including the original DROID paper [@khazatsky2024droid] and OpenVLA [@kim2024openvla], did not show zero-shot results and focused entirely on co-training or fine-tuning evaluations instead. We demonstrate the generality of our DROID policy by testing it on various table-top manipulation tasks in environments across three university campuses (`\cref{fig:droid_quali}`{=latex}). Out of the box, the policy can competently perform simple manipulation tasks, like picking and placing objects, opening and closing cupboards and turning on faucets, across a wide range of scenes and camera viewpoints. Even unsuccessful trials show sensible behavior, like approaching the handles of microwave and dishwasher doors, even if the policy ultimately fails to open them. We show success and failure videos on our website. While far from perfect, the level of generality and robustness of this policy substantially exceeds that of prior DROID policies.

Universal Action Tokenizer {#sec:uniact}
--------------------------

```{=latex}
\centering
```
![**Universal tokenizer.** We test the compression rate achieved by our `\ModelUniversalAcronym`{=latex} tokenizer vs. naïve tokenization across diverse robot datasets, *unseen* during tokenizer training. We find that `\ModelAcronym`{=latex} is effective across a wide range of robot morphologies, action spaces and control frequencies. ](figures/bench_test_compression.png){#fig:universal_tokenizer_results width="\\linewidth"}

In this section, we evaluate the performance of our *universal* action tokenizer, `\ModelUniversalAcronym`{=latex}, which we trained on 1M real robot action sequences (see `\cref{sec:universal_tokenizer}`{=latex}). To test the *generality* of the tokenizer, we assemble a diverse set of small testing datasets. This set spans a wide range of robot morphologies, action spaces, and control frequencies (see `\cref{fig:universal_tokenizer_results}`{=latex}, with a full list of datasets in `\cref{sec:app_univeral_test_set}`{=latex}). Note that none of these datasets is part of the tokenizer training set. They thus test a scenario in which the tokenizer is applied to a completely new robot setup without recomputing the tokenization. We find that the `\ModelUniversalAcronym{}`{=latex} tokenizer achieves good compression performance across a wide range of robot datasets, reducing the number of action tokens by at least 2x on all datasets, and by significantly more on some.

We also test performance of the universal tokenizer for policy training, and report results alongside the per-dataset tokenizers in `\cref{fig:tokenization_comparison}`{=latex}. Across all tasks, the *universal* tokenizer closely matches the performance of the dataset-specific `\ModelAcronym`{=latex} tokenizers, suggesting that the universal tokenizer can be used as a strong default for robot action tokenization.

Ablation Studies
----------------

We analyze two key aspects of our method: (1) Is our `\ModelAcronym`{=latex} tokenization approach *independent* of the underlying VLA backbone? (2) How important is the BPE compression step, the only learned component of our tokenization pipeline?

```{=latex}
\begin{wrapfigure}{r}{0.4\linewidth}
    \centering
    \vspace{-0.3cm}
    \includegraphics[width=\linewidth]{figures/openvla_results.pdf}
    \label{fig:openvla_results}
    \vspace{-0.3cm}
\end{wrapfigure}
```
To answer the first question, we train an OpenVLA policy [@kim2024openvla] on the challenging high-frequency T-shirt folding dataset, comparing the naïve tokenization approach originally used in OpenVLA to our `\ModelUniversalAcronym{}`{=latex} tokenizer. To comply with the task setup, we modify the OpenVLA model code to accept multiple input images and predict 1-second action chunks. The results on the right demonstrate that `\ModelAcronym{}`{=latex} is able to significantly boost performance of OpenVLA, enabling it to train effectively on high-frequency robot manipulation data. This suggests that our tokenization approach is *independent* of the underlying model backbone, and may be easily applied to a wide range of pre-trained autoregressive transformer models.

```{=latex}
\begin{wrapfigure}{r}{0.4\linewidth}
    \centering
    \vspace{-0.3cm}
    \includegraphics[width=\linewidth]{figures/nobpe_results.pdf}
    \label{fig:nobpe_results}
    \vspace{-0.3cm}
\end{wrapfigure}
```
Second, we ablate the BPE encoding step on the table bussing and T-shirt folding tasks. The figure on the right shows that the resulting policies *without BPE encoding* achieve worse rollout performance (but still outperform naïve tokenization). Intuitively, the DCT still concentrates most of the signal's information in a few tokens, improving the learning signal. However, without BPE, there are many repeated 0-tokens, which dilute the learning signal and also significantly slow down inference, since models need to autoregressively predict hundreds of action tokens. Together, these effects lead to worse policy performance.
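
To see why BPE helps with long runs of 0-tokens, consider the toy byte-pair encoder below. It is a simplified sketch, not the tokenizer's actual BPE stage: merges are learned and applied on a single sequence rather than on a training corpus, a merged pair is represented by the tuple itself, and the quantized coefficient values are invented for illustration.

```python
from collections import Counter


def bpe_compress(tokens, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    seq = list(tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing is worth merging
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(pair)  # the pair tuple acts as the new symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        merges.append(pair)
    return seq, merges


def bpe_expand(seq):
    """Undo the merges by recursively flattening pair symbols."""
    out = []
    for sym in seq:
        if isinstance(sym, tuple):
            out.extend(bpe_expand(sym))
        else:
            out.append(sym)
    return out


# Hypothetical quantized DCT coefficients: a few low-frequency values, then zeros.
quantized = [12, -3, 1] + [0] * 29
compressed, merges = bpe_compress(quantized, num_merges=10)
print(len(quantized), "->", len(compressed))
```

The run of zeros collapses into a handful of merged symbols while the informative low-frequency coefficients are left untouched, and `bpe_expand` recovers the original sequence exactly.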

Comparing `\ModelAcronym{}`{=latex} to Diffusion {#sec:ar_vs_diffusion}
------------------------------------------------

In this section, we compare $\pi_0$, a state-of-the-art diffusion VLA, to our model that combines $\pi_0$ with `\ModelAcronym{}`{=latex} and uses autoregressive decoding. We compare the performance of both models on the tasks from `\cref{sec:main_comparison}`{=latex}.

```{=latex}
\centering
```
![**Comparison of diffusion $\pi_0$ [@black2024pi_0] to our $\pi_0$ model with `\ModelAcronym{}`{=latex} decoding on single-task training.** On small datasets (Libero, T-Shirt Folding), both perform comparably. On large datasets (Table Bussing), `\ModelAcronym`{=latex} converges faster. In DROID, we find that `\ModelAcronym`{=latex} follows language instructions better. We report mean and 95% CI. ](figures/pi0_single_task.png){#fig:pi0_single_task_comparison width="\\linewidth"}

We report results in `\cref{fig:pi0_single_task_comparison}`{=latex}. We find that on small datasets (Libero, T-Shirt Folding; $<$50h), both VLAs perform comparably. However, on large datasets like Table Bussing, we find that the `\ModelAcronym`{=latex}-based VLA converges significantly faster, reaching high performance with 3x fewer training steps than the diffusion variant of $\pi_0$. Additionally, we find that the autoregressive $\pi_0$ model trained with `\ModelAcronym`{=latex} tokenization follows language instructions more closely: in the DROID evaluations, the diffusion $\pi_0$ model often ignores the language instructions, leading to a lower score. We will leave a detailed investigation of the language following abilities of diffusion and autoregressive VLAs to future work.

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=\linewidth]{figures/quali_rollout.pdf}
    \caption{\textbf{Rollout of \GeneralistModelAcronym{} on the laundry folding task.} \ModelAcronym{} tokenization enables autoregressive VLAs to perform complex, long-horizon, and dexterous tasks that were impossible with previous tokenization schemes.
    }
    \label{fig:quali_rollout}
\end{figure*}
```
One current limitation of the autoregressive VLA is its inference speed: while $\pi_0$ with diffusion typically predicts one-second action chunks within 100ms on an NVIDIA 4090 GPU, the $\pi_0$ model with `\ModelAcronym{}`{=latex} tokenization needs approximately 750ms of inference time per chunk, since it must perform more autoregressive decoding steps (typically 30-60 action tokens need to be decoded, vs. 10 diffusion steps for diffusion $\pi_0$) and use the full 2B parameter language model backbone for autoregressive decoding (vs. a 300M parameter "action expert" for diffusion $\pi_0$). While we did not find this slower inference to hurt performance on the static manipulation tasks we evaluated, it made evaluations significantly slower. Going forward, there are many techniques for accelerating the inference of discrete, autoregressive transformer models that are used extensively in the LLM literature (e.g., speculative decoding, quantization, and custom inference kernels), but we leave an investigation of these to future work.

Scaling Autoregressive VLAs to Large Robot Datasets {#sec:generalist_vlas}
---------------------------------------------------

We have demonstrated `\ModelAcronym`{=latex}'s effectiveness for training autoregressive VLAs on individual robot datasets, but does it scale to training dexterous *generalist* policies? To test this, we train the `\GeneralistModelAcronym{}`{=latex} model from the previous section on the cross-embodied robot data mixture used by $\pi_0$ [@black2024pi_0], the largest dexterous robot manipulation dataset to date. It includes 903M timesteps from our own datasets. Additionally, 9.1% of the training mixture consists of the open-source datasets BRIDGE v2 [@walke2023bridgedata], DROID [@khazatsky2024droid], and OXE [@open_x_embodiment_rt_x_2023].

```{=latex}
\centering
```
![**Comparison of `\GeneralistModelAcronym`{=latex} and diffusion $\pi_0$ [@black2024pi_0] generalist policies.** `\GeneralistModelAcronym{}`{=latex} matches the performance of diffusion $\pi_0$ while requiring significantly less compute for training. Reported: mean and 95% CI. ](figures/pi0_multi_task.png){#fig:pi0_multi_task_comparison width="\\linewidth"}

We compare zero-shot performance to the diffusion $\pi_0$ model on the tasks from @black2024pi_0 in `\cref{fig:pi0_multi_task_comparison}`{=latex}. Overall, we find that the autoregressive `\GeneralistModelAcronym{}`{=latex} model matches the performance of the diffusion $\pi_0$ model, including on the most challenging *laundry folding* task, **while requiring significantly less compute for training**. We show a qualitative example of `\GeneralistModelAcronym{}`{=latex} performing the laundry folding task in `\cref{fig:quali_rollout}`{=latex} and include additional videos on our website.

Importantly, we find that `\GeneralistModelAcronym{}`{=latex} converges significantly faster than the diffusion $\pi_0$ model: the model in the evaluations above required 5x fewer GPU hours for training than the $\pi_0$ model from @black2024pi_0. We show robot evaluation results for multiple checkpoints throughout the course of training in `\cref{fig:convergence}`{=latex} (averaging performance on two representative tasks: table bussing and T-shirt folding). The results show clearly that `\GeneralistModelAcronym{}`{=latex} achieves high performance significantly faster. For state-of-the-art VLA training runs, which can often use thousands of GPU hours, a 5x reduction in required compute is significant. We include a full comparison across all tasks for a compute-matched $\pi_0$ checkpoint in the appendix (`\cref{fig:pi0_compute_matched}`{=latex}) and find that the same conclusions hold: `\GeneralistModelAcronym{}`{=latex} clearly outperforms compute-matched $\pi_0$ due to its faster convergence.

To summarize, we have demonstrated that `\ModelAcronym{}`{=latex} tokenization allows us to train autoregressive VLAs on complex, dexterous robot tasks that prior tokenization schemes completely fail on. We have also shown that `\ModelAcronym{}`{=latex}, when combined with state-of-the-art VLAs like $\pi_0$, scales to training generalist, cross-embodied policies that rival the performance of the best diffusion VLAs while being significantly faster to train.

Discussion and Future Work {#sec:conclusion}
==========================

In this paper, we introduced `\ModelAcronym{}`{=latex}, an efficient action tokenizer for high-frequency robotic control data. `\ModelAcronym{}`{=latex} uses the discrete cosine transform (DCT) followed by byte-pair encoding (BPE) to compress action chunks, leading to significantly better compression than existing action tokenizers across a range of robotics domains. Our real-world and simulated VLA experiments show that `\ModelAcronym{}`{=latex} leads to dramatically improved performance over the previously used naïve action discretization approaches, and outperforms more complex learned tokenization methods based on vector quantization. We also showed that we can train `\ModelUniversalAcronym{}`{=latex}, a *universal* action tokenizer, that can serve as a strong default tokenizer for any robot action sequence. Using it, we trained `\GeneralistModelAcronym`{=latex}, a dexterous generalist policy that can match performance of state-of-the-art diffusion VLAs, while being significantly more efficient to train.

There are many exciting directions for future work:

**Action tokenizers.** While we believe that `\ModelAcronym{}`{=latex} is a significant step toward general purpose robot action tokenizers, many questions remain. In this work, we tested `\ModelAcronym{}`{=latex} on static robot manipulators. Our offline experiments demonstrated promising compression capabilities of `\ModelUniversalAcronym{}`{=latex} on other robot morphologies like mobile robots, dexterous hands, and humanoids. Testing actual policy performance on these platforms is an exciting direction for future work. Additionally, exploring alternative compression schemes, and testing the combination of compression-based action encodings with non-autoregressive decoding approaches like diffusion [@black2024pi_0] are interesting directions for future investigation.

**VLA architectures.** Our paper has taken initial steps to explore the trade-offs between two major classes of VLA architectures, autoregressive and diffusion decoding VLAs, but the jury on the best VLA architecture is still out. Future work should carefully explore trade-offs in training speed, language grounding abilities, and expressiveness of either approach.

**Inference speed.** While `\GeneralistModelAcronym{}`{=latex} matches the overall performance of diffusion $\pi_0$, it is slower at inference time (see `\cref{sec:ar_vs_diffusion}`{=latex}). While the slower inference speed was acceptable on the static tasks we evaluated, future work should explore approaches for speeding up inference of autoregressive VLA models to enable them to solve highly dynamic tasks. There is a large literature of inference optimizations for large language models that can be readily applied to autoregressive VLAs.

Acknowledgements {#acknowledgements .unnumbered}
================

We thank Ury Zhilinsky and Kevin Black for their help with setting up data and training infrastructure used in this project. We also thank Pranav Atreya, Haohuan Wang, Lucy Shi, Arhan Jain and Andy Yun for help with DROID policy evaluations at UC Berkeley, Stanford and the University of Washington, and Will Chen for testing and debugging our open-source implementation of `\ModelUniversalAcronym`{=latex}. We thank Noah Brown, Szymon Jakubczak, Adnan Esmail, Tim Jones, Mohith Mothukuri and James Tanner for help with robot maintenance, and Anna Walling for help with robot, data and eval operations. We are grateful to the whole team of robot operators at Physical Intelligence for their enormous contributions to running data collection and policy evaluations. Finally, we thank Claudio Guglieri, Lachy Groom and Karol Hausman for their help with visualizations used in this paper and on the project website.

```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Data Mixture for Training Universal Tokenizer {#sec:app_universal_data_mix}
---------------------------------------------

The training mixture for the universal tokenizer mainly consists of the $\pi_0$ [@black2024pi_0] datasets described in Section `\ref{sec:generalist_vlas}`{=latex}. For many datasets, we include versions with multiple action space parametrizations: joint space, end-effector world frame, and end-effector camera frame, to ensure the generality of the resulting tokenizer. Open X-Embodiment [@open_x_embodiment_rt_x_2023], DROID [@khazatsky2024droid], and Bridge`\;`{=latex}V2 [@walke2023bridgedata] are included in their original form. Before tokenization, all actions are padded to 32 dimensions to accommodate action spaces of different dimensionality.
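
The padding step can be sketched as below. The text only states that actions are padded to 32 dimensions; padding with zeros at the end of each action vector, as well as the function name `pad_action_chunk`, are assumptions made for this illustration.

```python
def pad_action_chunk(action_chunk, target_dim=32, pad_value=0.0):
    """Right-pad every timestep of a chunk to `target_dim` action dimensions.

    Zero padding is an assumption; the fill value is not specified in the text.
    """
    padded = []
    for step in action_chunk:
        if len(step) > target_dim:
            raise ValueError(
                f"action has {len(step)} dimensions, more than {target_dim}"
            )
        padded.append(list(step) + [pad_value] * (target_dim - len(step)))
    return padded


# A 7-dimensional single-arm chunk and a 14-dimensional bi-manual chunk
# end up with the same per-timestep width.
single_arm = pad_action_chunk([[0.1] * 7 for _ in range(5)])
bi_manual = pad_action_chunk([[0.2] * 14 for _ in range(50)])
```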

```{=latex}
\centering
```
```{=latex}
\resizebox{\linewidth}{!}{
    \begin{tabular}{lcccc}
        \toprule
        Dataset Name & Morphology & Action Space & \makecell{Control \\Frequency\\ (Hz)}& \makecell{Mixture \\Weight\\ (\%)}\\
        \midrule
        ARX & Bi-manual & Joint & 50 & 7.2 \\
        AgileX & Bi-manual & Joint & 50 & 1.8 \\
        Fibocom & Mobile & Joint & 50 & 2.9 \\
        Franka FR3 & Single arm & Joint & 20 & 3.7 \\
        Mobile Trossen & Mobile & Joint & 50 & 2.5 \\
        Trossen Biarm & Bi-manual & Joint & 50 & 4.3 \\
        UR5 single & Single arm & Joint & 20 & 10.3 \\
        UR5 biarm & Bi-manual & Joint & 20 & 2.4 \\
        ARX slate mobile & Mobile & Joint & 50 & 2.5 \\
        \midrule
        ARX EE & Bi-manual & EE & 50 & 3.6 \\
        AgileX EE & Bi-manual & EE & 50 & 0.9 \\
        Fibocom EE & Mobile & EE & 50 & 1.4 \\
        Franka FR3 EE & Single arm & EE & 20 & 1.9 \\
        Mobile Trossen EE & Mobile & EE & 50 & 1.2 \\
        Trossen Biarm EE & Bi-manual & EE & 50 & 2.1 \\
        UR5 single EE & Single arm & EE & 20 & 5.2 \\
        UR5 biarm EE & Bi-manual & EE & 20 & 1.2 \\
        ARX slate mobile EE & Mobile & EE & 50 & 1.2 \\
        \midrule
        ARX Cam & Bi-manual & CamFrame & 50 & 3.6 \\
        AgileX Cam & Bi-manual & CamFrame & 50 & 0.9 \\
        Fibocom Cam & Mobile & CamFrame & 50 & 1.4 \\
        Franka FR3 Cam & Single arm & CamFrame & 20 & 1.9 \\
        Mobile Trossen Cam & Mobile & CamFrame & 50 & 1.2 \\
        Trossen Biarm Cam & Bi-manual & CamFrame & 50 & 2.1 \\
        UR5 single Cam & Single arm & CamFrame & 20 & 5.2 \\
        UR5 biarm Cam & Bi-manual & CamFrame & 20 & 1.2 \\
        ARX slate mobile Cam & Mobile & CamFrame & 50 & 1.2 \\
        \midrule
        ALOHA~\citep{zhao2023learning} & Bi-manual & Joint & 50 & 5.0 \\
        DROID~\citep{khazatsky2024droid} & Single arm & Joint & 15 & 11.2 \\
        Bridge\;V2~\citep{walke2023bridgedata} & Single arm & EE & 5 & 5.0 \\
        OpenX~\citep{open_x_embodiment_rt_x_2023} & Single arm & EE & mixed & 3.8 \\ 
        \bottomrule
    \end{tabular}
    }
```
Trading off Between Compression and Reconstruction {#sec:app_compression_plots}
--------------------------------------------------

```{=latex}
\centering
```
![Comparison of compression-reconstruction tradeoff on six training datasets. Any discretization method includes some hyperparameter that controls the tradeoff between reconstruction fidelity and compression level, represented here as number of tokens in the output (vocab size is held constant across all tokenizers). We sweep this hyperparameter (`\ModelAcronym`{=latex}: rounding scale; naïve tokenization: subsampling frequency; FSQ: number of latent tokens) and find that `\ModelAcronym`{=latex} performs well across a wide range of scales. In particular, although it is less efficient than VQ-based tokenizers at low fidelities, it exhibits much better scaling to higher reconstruction fidelity, making `\ModelAcronym`{=latex} much more applicable to fine-grained control problems. Specific instantiations of each tokenizer (`\ModelUniversalAcronym`{=latex}, and naïve tokenization without subsampling) are also shown.](figures/fig_dataset_comparison.png){#fig:dataset-comparison width="\\linewidth"}

Policy Training {#sec:app_policy_training_details}
---------------

We train policies with $\pi_0$ [@black2024pi_0] and OpenVLA [@kim2024openvla] backbones. Depending on the task, policies are conditioned on two or three input images (one third-person camera, and one wrist camera per robot arm), using a resolution of 224x224 pixels. The VLA backbones encode each image separately via the pre-trained vision encoder and concatenate the resulting tokens. We additionally condition on a natural language task instruction and the robot's proprioceptive state. Both are tokenized via the LLM's language tokenizer, treating them as strings. For the proprioceptive state, we apply a bin tokenization pre-processing, akin to RT-2's action tokenization [@rt22023arxiv], discretizing into 256 bins. We then tokenize the integers as part of the text input sequence. Note that a simple bin tokenization scheme is sufficient for the proprioceptive state, since it is an *input* to the policy (as opposed to the action *outputs*, which require more advanced tokenization, as our experiments demonstrate).
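
The proprioceptive pre-processing can be sketched as follows. The sketch assumes a state already normalized to $[-1, 1]$, uniform bins, and space-separated integers; the prompt template in `build_prompt` is purely hypothetical and is not the format used by any of the backbones.

```python
def discretize_state(state, num_bins=256, low=-1.0, high=1.0):
    """Map each state dimension to an integer bin index in [0, num_bins - 1]."""
    bins = []
    for value in state:
        clipped = min(max(value, low), high)
        frac = (clipped - low) / (high - low)
        bins.append(min(int(frac * num_bins), num_bins - 1))
    return bins


def build_prompt(instruction, state):
    """Compose a text prompt; the template is a hypothetical example."""
    state_text = " ".join(str(b) for b in discretize_state(state))
    return f"Task: {instruction}, State: {state_text}"


prompt = build_prompt("fold the shirt", [-1.0, 0.0, 1.0])
```

The resulting string is then passed through the language tokenizer like any other text, so no new input vocabulary is needed for the state.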

We train all policies using a short linear learning rate warm-up (1k steps) and then a constant learning rate of 5e-5. We use the AdamW optimizer [@loshchilov2017decoupled] ($\beta_1 = 0.9$, $\beta_2 = 0.95$) without weight decay, clip gradient magnitude to 1 and compute an EMA of the network weights with weight 0.999.
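
The learning rate schedule and the weight averaging can be written down compactly. The sketch below is framework-agnostic and makes two assumptions that the text does not pin down: the warm-up rises linearly from `peak_lr / warmup_steps` at step 0, and the EMA is updated once per optimizer step.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=1000):
    """Linear warm-up for `warmup_steps` steps, then a constant learning rate."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Example: track the EMA of a single scalar weight that jumps from 0 to 1.
ema = [0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0])
```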

During inference, we use simple greedy autoregressive decoding, except for the bi-manual robot tasks (T-shirt folding, toast out of toaster, laundry folding), where we found a small temperature of $\beta = 0.7$ to be helpful to get policies to move out of the home position (since some of the data included stationary chunks of actions where the robot hovers in the initial position at the beginning of training episodes).
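
The difference between greedy decoding and sampling with a small temperature can be sketched for a single decoding step. The function below is a generic softmax-with-temperature sampler written for illustration; the name `sample_token` and the convention that a temperature of 0 means greedy decoding are assumptions, not part of any released code.

```python
import math
import random


def sample_token(logits, temperature=0.7, rng=random):
    """Pick the next token id from `logits`; temperature 0 means greedy decoding."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
greedy_token = sample_token(logits, temperature=0.0)
sampled_token = sample_token(logits, temperature=0.7, rng=random.Random(0))
```

With greedy decoding, a policy whose most likely prediction is to stay at the home position will always stay there, whereas sampling occasionally selects a less likely token that moves the robot away from it.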

DROID Policy Setup {#sec:app_droid_exp_details}
------------------

Here, we provide further details about our DROID training setup to make it easy for others to reproduce and build on our results. For training on the DROID dataset, we condition the policy on a single third-person view and the wrist camera view. Since DROID provides two external camera views per episode, we randomly sample the third-person view during training. Similarly, DROID provides three natural language annotations for each training episode, and we randomize over them during training. We do not use the camera calibration information. Thus, the trained policy can be tested on new viewpoints out of the box, without the need for calibration. We use a joint-velocity and absolute-gripper-position action space, and train the policy to predict 15-step action chunks (we execute 8- or 15-step chunks open-loop at inference time). We apply light data curation: we train only on the episodes marked as "success" (75k episodes) and filter out any idle timesteps with all-zero actions during training (usually timesteps in which the teleoperators reset the position of the VR controller during data collection). Other than that, we found training on the full dataset to work well, though there is likely potential for improving performance with more careful curation. We train policies for three epochs (240k iterations @ 256 batch size), which takes approximately 4 days on 8xH100 GPUs for the 3B parameter VLAs we are using.
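A minimal sketch of these pre-processing steps is given below. The episode dictionary layout, the key names, the stride-1 chunking, and the choice to filter idle steps before chunking are all assumptions for illustration and do not reflect the DROID data format or our data loader.

```python
import random

CHUNK_LEN = 15


def drop_idle_steps(actions):
    """Remove timesteps whose action vector is all zeros (idle teleoperation steps)."""
    return [a for a in actions if any(x != 0.0 for x in a)]


def action_chunks(actions, chunk_len=CHUNK_LEN):
    """Return every window of chunk_len consecutive actions (stride 1 is assumed)."""
    return [actions[t:t + chunk_len] for t in range(len(actions) - chunk_len + 1)]


def sample_conditioning(episode, rng=random):
    """Pick one external view and one language annotation at random."""
    return {
        "external_image": rng.choice(episode["external_views"]),  # hypothetical key
        "wrist_image": episode["wrist_view"],                     # hypothetical key
        "instruction": rng.choice(episode["instructions"]),       # hypothetical key
    }
```

Because the external view is resampled for every training example, the policy never sees a fixed camera pose, which is consistent with the observation above that it can be deployed on new viewpoints without calibration.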

Evaluation Tasks and Training Datasets {#sec:app_exp_task_details}
--------------------------------------

```{=latex}
\centering
```
![Table Bussing](figures/task_bus.jpeg){#fig:experimental_setup_bus width="\\linewidth"}

```{=latex}
\centering
```
![T-Shirt Folding](figures/task_shirt.jpeg){#fig:experimental_setup_shirt width="\\linewidth"}

```{=latex}
\centering
```
![Grocery Bagging](figures/task_grocery.jpeg){#fig:experimental_setup_grocery width="\\linewidth"}

```{=latex}
\centering
```
![Toast out of Toaster](figures/task_toast.jpeg){#fig:experimental_setup_toast width="\\linewidth"}

```{=latex}
\centering
```
![Laundry Folding](figures/task_laundry.jpeg){#fig:experimental_setup_laundry width="\\linewidth"}

Below, we describe all evaluation tasks and training datasets used in our experiments. We detail the distribution of initial conditions and scoring criteria.

**Libero.** We follow the training and evaluation setup of @liu2024libero. We evaluate on the Libero-Spatial, Libero-Object, Libero-Goal and Libero-Long benchmarking suites and use the corresponding datasets provided by the authors for training. We combine all datasets into one dataset with 270k samples, and train one policy jointly on all to reduce the number of policies that need to be trained. We train all policies for a total of 40k iterations ($\approx40$ epochs). We use the re-rendered datasets of @kim2024openvla for our experiments. Success is evaluated as a binary criterion per episode.

**Table Bussing.** This task requires a single UR5e robot arm to clean a table by bussing objects (a mixture of trash, plates, and dishes) into a trash can or bussing bin. The training dataset contains demonstrations in randomized bussing scenes with approximately 70 objects. The evaluation scene, shown in Figure `\ref{fig:experimental_setup_bus}`{=latex}, contains twelve objects on a table in an unseen configuration. The scene was created to stress the capability of the model, with utensils intentionally placed on top of trash, objects obstructing each other, and challenging objects such as chopsticks, transparent plastic, and reflective containers. The overall score is calculated as the percentage of objects correctly thrown away or placed in the bin.

**T-Shirt Folding.** This task requires a bimanual ARX robot to fold a t-shirt. The training dataset has demonstrations of shirt folding with approximately 150 shirts, varying in size, color, and style. The evaluation scene, shown in Figure `\ref{fig:experimental_setup_shirt}`{=latex}, cycles through five seen shirts of varying colors and sizes, each starting from a flat configuration. The overall score is calculated as the percentage of shirts successfully folded, as determined by a human rater.

**Grocery Bagging.** This task requires a single UR5e robot arm to bag groceries. This task was evaluated out-of-the-box on models pretrained with the full mixture detailed in @black2024pi_0. The evaluation scene, shown in Figure `\ref{fig:experimental_setup_grocery}`{=latex}, contains seven items (with varying shapes, sizes, materials, and weights) and a large paper grocery bag. The overall score is calculated as the percentage of items placed into the grocery bag.

**Toast out of Toaster.** This task requires a bi-manual Trossen ViperX robot, mirroring the ALOHA [@zhao2024alohaunleashedsimplerecipe] setup, to take two pieces of toast out of a toaster and place them onto a plate. This task was evaluated out-of-the-box on models pretrained with the full mixture detailed in @black2024pi_0. The evaluation scene is shown in Figure `\ref{fig:experimental_setup_toast}`{=latex} and the overall score tracks task progress, with one point for removing each piece of toast and one point for placing it on the plate, for a score out of four.

**Laundry Folding.** This task requires a bi-manual ARX robot to take a piece of clothing (shorts or a t-shirt) out of a laundry bin and fold it. It is a very challenging task, since successfully folding the tangled-up laundry requires multiple steps of unfurling and flattening before folding can start. Following @black2024pi_0, this task was evaluated with models pretrained on the full $\pi_0$ training mixture and fine-tuned with a small amount of high-quality, task-specific data. The evaluation scene, shown in Figure `\ref{fig:experimental_setup_laundry}`{=latex}, contains five items of clothing randomly placed in a laundry hamper. The overall score is calculated as the percentage of clothing successfully folded and stacked, as determined by a human rater.

**DROID.** We train on all successful episodes from the DROID dataset (75k episodes, 21M samples) for 240k iterations ($\approx$3 epochs). We apply light data curation (see `\cref{sec:app_droid_exp_details}`{=latex}). After training, we deploy the policy *zero-shot* in new scenes, with unseen scene backgrounds, camera angles, and objects. For quantitative evaluation, we design an evaluation suite with 17 tasks and 44 trials total per policy (see `\cref{tab:droid_eval_tasks}`{=latex}). Each trial is scored with a task progress rubric (e.g., 1 point for picking up the correct object, 1 point for placing it in the correct receptacle). We show example scenes from the quantitative evaluation in `\cref{fig:app_droid_eval_setups}`{=latex}. We further run qualitative tests of the policy across various real-world setups on three different university campuses (see `\cref{fig:droid_quali}`{=latex}). We do not measure success rates during these evaluations, but provide numerous qualitative videos of successes and failures to help readers get a sense of the policy's capabilities.
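As a quick sanity check on the training length quoted above, the number of passes over the data follows from the iteration count, the batch size of 256 stated in the DROID policy setup, and the dataset size. The variable names below are illustrative.

```python
iterations = 240_000
batch_size = 256               # from the DROID policy setup section
dataset_samples = 21_000_000   # "21M samples"

samples_seen = iterations * batch_size      # 61,440,000
passes_over_data = samples_seen / dataset_samples

print(f"{samples_seen:,} samples seen, about {passes_over_data:.2f} passes over the data")
```

This gives roughly 2.9 passes, consistent with the approximate figure of three.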

```{=latex}
\centering
```
::: {#tab:droid_eval_tasks}
  Task                                                         Trials
  ----------------------------------------------------------- --------
  Put the spoon in the dish rack                                 4
  Put carrot in bowl                                             4
  Put plate in dish rack                                         2
  Wipe the table                                                 2
  Put the plate on the table                                     2
  Clean up the table                                             2
  Close the drawer                                               4
  Put the stapler on the notebook                                2
  Put stapler in the drawer                                      4
  Clean the whiteboard                                           2
  Put the marker in the cup                                      4
  Put the black sponge in the blue bowl                          2
  Put the red bottle in the black bowl                           2
  Put the watermelon in the purple bowl                          2
  Move the watermelon from the purple bowl to the blue bowl      2
  Put the tape in the purple bowl                                2
  Put the water bottle on the left side of the table             2
  **Total**                                                    **44**

  : DROID evaluation tasks.
:::

```{=latex}
\centering
```
![Setups used for quantitative DROID evaluation. ](figures/droid_eval_setups.png){#fig:app_droid_eval_setups width="\\linewidth"}

```{=latex}
\begin{table*}[]\renewcommand{\arraystretch}{1.5}
\caption{Universal Tokenizer Evaluation Datasets.}
\label{sec:app_univeral_test_set}
\resizebox{\textwidth}{!}{%
\begin{tabular}{ccccccc}
\hline
  Morphology &
  Dataset Name &
  Platform &
  Action Space &
  Action Dim &
  Control Frequency (Hz) &
  Task \\ \hline
  \multirow{5}{*}{Single Arm} &
  SOAR~\cite{zhou2024soar} &
  WidowX &
  EEF &
  7 &
  5 &
  Pick/place \\
   &
  DROID-Eval EEF~\cite{khazatsky2024droid} &
  Franka &
  EEF &
  7 &
  15 &
  Pick/place \\
   &
  DROID-Eval Joint~\cite{khazatsky2024droid} &
  Franka &
  Joint &
  8 &
  15 &
  Pick/place \\
   &
  SERL~\cite{luo2024serl} &
  Franka &
  EEF &
  7 &
  10 &
  Insertion \\
   &
  $\pi$ Table Bussing~\cite{black2024pi_0} &
  UR5 &
  Joint &
  8 &
  20 &
  Pick/place \\ \hline %
  \multirow{4}{*}{Dexterous} &
  NYU DexHand~\cite{guzey2024bridging} &
  ALLEGRO &
  Joint+EEF &
  30 &
  16 &
  Dexterous manipulation \\
   &
  Berkeley DexHand~\cite{qi2022inhand} &
  ALLEGRO &
  Joint &
  16 &
  20 &
  In-hand manipulation \\ %
   &
  Berkeley DexArm~\cite{singh2024hop} &
  xArm+ALLEGRO &
  Joint &
  23 &
  20 &
  Dexterous pick/place \\ %
   &
  HATO~\cite{lin2024learning} &
  UR5+Psyonic Hand &
  EEF+Joint &
  24 &
  10 &
  Dexterous pick/place \\ \hline %
  \multirow{2}{*}{UMI} &
  UMI~\cite{chi2024universal} &
  UMI &
  EEF &
  7 &
  20 &
  Pick/place \\
   &
  UMI on Legs~\cite{ha2024umilegs} &
  UMI &
  EEF &
  7 &
  20 &
  Whole-body manipulation \\ \hline %
  \multirow{2}{*}{Humanoid} &
  HumanPlus~\cite{fu2024humanplus} &
  Unitree H1 &
  Joint &
  40 &
  50 &
  Whole-body manipulation \\
   &
  UCSD TeleVision~\cite{cheng2024tv} &
  Unitree H1 w/Neck &
  Joint &
  28 &
  60 &
  Manipulation+active perception \\ \hline
  \multirow{1}{*}{Navigation} &
  Waymo~\cite{Ettinger_2021_ICCV} &
  Waymo Car &
  2D delta &
  2 &
  10 &
  Autonomous Driving\\ \hline
\end{tabular}%
}
\end{table*}
```
```{=latex}
\centering
```
![**Comparison of `\GeneralistModelAcronym`{=latex} and *compute-matched* diffusion $\pi_0$ [@black2024pi_0] generalist policies.** `\GeneralistModelAcronym{}`{=latex} clearly outperforms the diffusion VLA when trained with the same amount of training compute, due to its faster convergence. Reported: mean and 95% CI. ](figures/pi0_compute_matched.png){#fig:pi0_compute_matched width="\\linewidth"}

[^1]: $^\ast$: Core contributors`\newline `{=latex}Correspondence to: `research@physicalintelligence.company`
