---
abstract: |
  ```{=latex}
  \renewcommand*{\thefootnote}{\fnsymbol{footnote}}
  ```
  ```{=latex}
  \footnotetext{
  $^\ast$: denotes equal contribution\\Correspondence to: \href{mailto:moojink@stanford.edu,pertsch@berkeley.edu,skaramcheti@stanford.edu}{\texttt{moojink@stanford.edu, pertsch@berkeley.edu, skaramcheti@stanford.edu}}\\
      $^1$Stanford University, $^2$UC Berkeley, $^3$Toyota Research Institute, $^4$Google DeepMind, $^5$Physical Intelligence, $^6$MIT, $^\dagger$Work done in part while at Google DeepMind
  }
  ```
  ```{=latex}
  \renewcommand*{\thefootnote}{\arabic{footnote}}
  ```
  Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce `\name{}`{=latex}, a `\modelsize{}`{=latex}-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. `\name{}`{=latex} builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, `\name{}`{=latex} demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune `\name{}`{=latex} for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that `\name{}`{=latex} can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
author:
- |
  `\hspace{-1cm}`{=latex}Moo Jin Kim$^{\ast, 1}$ `\quad `{=latex}Karl Pertsch$^{\ast, 1, 2}$ `\quad `{=latex}Siddharth Karamcheti$^{\ast, 1, 3}$\
  `\hspace{-1.5cm}`{=latex}**Ted Xiao**$^{4}$ `\;`{=latex}`\;`{=latex} **Ashwin Balakrishna**$^{3}$ `\;`{=latex}`\;`{=latex} **Suraj Nair**$^{3}$ `\;`{=latex}`\;`{=latex} **Rafael Rafailov**$^{1}$ `\;`{=latex}`\;`{=latex} **Ethan Foster**$^{1}$ `\;`{=latex}`\;`{=latex} **Grace Lam**\
  `\hspace{-1.5cm}`{=latex}**Pannag Sanketi**$^{4}$ `\;`{=latex}`\;`{=latex} **Quan Vuong**$^{5,\dagger}$ `\;`{=latex}`\;`{=latex} **Thomas Kollar**$^{3}$ `\;`{=latex}`\;`{=latex} **Benjamin Burchfiel**$^{3}$ `\;`{=latex}`\;`{=latex} **Russ Tedrake**$^{3,6}$ `\;`{=latex}`\;`{=latex} **Dorsa Sadigh**$^{1}$\
  `\hspace{-1.5cm}`{=latex}**Sergey Levine**$^{2}$ `\;`{=latex}`\;`{=latex} **Percy Liang**$^{1}$ `\;`{=latex}`\;`{=latex} **Chelsea Finn**$^{1}$\
  `\hspace{-2cm}`{=latex}`\large`{=latex}`\website`{=latex} `\vspace{-0.5cm}`{=latex}
bibliography:
- bibref\_definitions\_long.bib
- bibtex.bib
title: |
  `\name{}`{=latex}:\
  An Open-Source Vision-Language-Action Model `\vspace{-0.3cm}`{=latex}
---

```{=latex}
\renewcommand{\floatpagefraction}{.95}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand{\pl}[1]{\textcolor{red}{[PL: #1]}}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters~\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\todo}[1]{ \textcolor{red}{\bf #1}}
```
```{=latex}
\newcommand{\needcite}[1]{ \textcolor{magenta}{\textbf{[Cite?]} #1}}
```
```{=latex}
\newcommand{\KP}[1]{ \textcolor{blue}{K: \bf #1}}
```
```{=latex}
\newcommand{\sk}[1]{ \textcolor{skviolet}{\textbf{[Sidd]: #1}}}
```
```{=latex}
\newcommand{\russt}[1]{ \textcolor{orange}{RT: \bf #1}}
```
```{=latex}
\newcommand{\bb}[1]{ \textcolor{ForestGreen}{BB: \bf #1}}
```
```{=latex}
\newcommand{\ie}{i.e., }
```
```{=latex}
\newcommand{\eg}{e.g., }
```
```{=latex}
\newcommand{\Skip}[1]{}
```
```{=latex}
\newcommand{\name}{OpenVLA}
```
```{=latex}
\newcommand{\modelsize}{7B}
```
```{=latex}
\newcommand{\ngpus}{64~A100}
```
```{=latex}
\newcommand{\ntraindays}{14}
```
```{=latex}
\newcommand{\trainlearningrate}{2e-5}
```
```{=latex}
\newcommand{\batchsize}{2048}
```
```{=latex}
\newcommand{\niterations}{150k}
```
```{=latex}
\newcommand{\nepisodes}{970k}
```
```{=latex}
\newcommand{\nepochs}{27}
```
```{=latex}
\newcommand{\license}{MIT}
```
```{=latex}
\newcommand{\website}{\url{https://openvla.github.io}}
```
```{=latex}
\makeatletter
```
```{=latex}
\let\@oldmaketitle\@maketitle
```
```{=latex}
\renewcommand{\@maketitle}{\@oldmaketitle%
  \begin{center}
  \captionsetup{type=figure}
  \includegraphics[width=\textwidth]{figures/openvla_teaser.pdf}
    \captionof{figure}{We present \name{}, a 7B-parameter open-source vision-language-action model (VLA), trained 
    on 970k robot episodes from the Open X-Embodiment dataset~\citep{open_x_embodiment_rt_x_2023}. \name{} sets a new state of the art for generalist robot manipulation policies. It supports controlling multiple robots out of the box and can be quickly adapted to new robot domains via parameter-efficient fine-tuning. The \name{} checkpoints and PyTorch training pipeline are fully open-source and models can be downloaded and fine-tuned from HuggingFace.
    } 
    \label{fig:teaser}
    \end{center}
}
```
```{=latex}
\makeatother
```
```{=latex}
\maketitle
```
Introduction {#sec:intro}
============

A key weakness of learned policies for robotic manipulation is their inability to generalize beyond their training data: while existing policies trained for individual skills or language instructions have the capacity to extrapolate behaviors to new initial conditions such as object positions or lighting [@rt12022arxiv; @chi2023diffusionpolicy], they lack robustness to scene distractors or novel objects [@xie2023decomposing; @octo_2023] and struggle to execute unseen task instructions [@walke2023bridgedata; @rt22023arxiv]. Yet beyond robotics, existing foundation models for vision and language such as CLIP [@radford2021clip], SigLIP [@zhai2023siglip], and Llama 2 [@touvron2023llama2] are capable of these types of generalization and more, stemming from the priors captured by their Internet-scale pretraining datasets. While reproducing this scale of pretraining for robotics is still an open challenge (even the largest robot manipulation datasets [@open_x_embodiment_rt_x_2023; @khazatsky2024droid] only have 100K to 1M examples), this imbalance suggests an opportunity: using existing foundation models for vision and language *as a core building block* for training robotic policies that can generalize to objects, scenes, and tasks beyond their training data.

Towards this goal, existing work has explored integrating pretrained language and vision-language models for robotic representation learning [@nair2022r3m; @Karamcheti2023LanguageDrivenRL; @shridhar2022cliport] and as a component in modular systems for task planning and execution [@stone2023open; @driess2023palm]. More recently, they have been used for directly learning vision-language-action models [VLAs; @rt22023arxiv; @open_x_embodiment_rt_x_2023; @covariant_ai_2024; @wayve_ai_2024] for control. VLAs provide a direct instantiation of using pretrained vision-and-language foundation models for robotics, directly fine-tuning visually-conditioned language models (VLMs) such as PaLI [@chen2022pali; @chen2023pali3] to generate robot control actions. By building off of strong foundation models trained on Internet-scale data, VLAs such as RT-2 [@rt22023arxiv] demonstrate impressive robustness results, as well as an ability to generalize to novel objects and tasks, setting a new standard for generalist robot policies. Yet, there are two key reasons preventing the widespread use of existing VLAs: 1) current models [@rt22023arxiv; @open_x_embodiment_rt_x_2023; @covariant_ai_2024; @wayve_ai_2024] are *closed*, with limited visibility into model architecture, training procedures, and data mixture, and 2) existing works do not provide best practices for *deploying and adapting VLAs* to new robots, environments, and tasks --- especially on commodity hardware (e.g., consumer-grade GPUs). We argue that to develop a rich foundation for future research and development, robotics needs open-source, generalist VLAs that support effective fine-tuning and adaptation, akin to the existing ecosystem around open-source language models [@wolf-etal-2020-transformers; @touvron2023llama; @jiang2023mistral; @team2024gemma].

To this end, we introduce `\name{}`{=latex}, a `\modelsize{}`{=latex}-parameter open-source VLA that establishes a new state of the art for generalist robot manipulation policies.[^1] `\name{}`{=latex} consists of a pretrained visually-conditioned language model backbone that captures visual features at multiple granularities, fine-tuned on a large, diverse dataset of `\nepisodes{}`{=latex} robot manipulation trajectories from the Open X-Embodiment dataset [@open_x_embodiment_rt_x_2023], which spans a wide range of robot embodiments, tasks, and scenes. As a product of increased data diversity and new model components, `\name{}`{=latex} outperforms the 55B-parameter RT-2-X model [@rt22023arxiv; @open_x_embodiment_rt_x_2023], the prior state-of-the-art VLA, by 16.5% absolute success rate across 29 evaluation tasks on the WidowX and Google Robot embodiments. We additionally investigate *efficient fine-tuning strategies for VLAs*, a new contribution not explored in prior work, across 7 diverse manipulation tasks spanning behaviors from object pick-and-place to cleaning a table. We find that fine-tuned `\name{}`{=latex} policies clearly outperform fine-tuned pretrained policies such as Octo [@octo_2023]. Compared to from-scratch imitation learning with diffusion policies [@chi2023diffusionpolicy], fine-tuned `\name{}`{=latex} shows substantial improvement on tasks involving grounding language to behavior in multi-task settings with multiple objects. Following these results, we are the first to demonstrate the effectiveness of compute-efficient fine-tuning methods leveraging low-rank adaptation [LoRA; @hu2021lora] and model quantization [@dettmers2024qlora] to facilitate adapting `\name{}`{=latex} models on consumer-grade GPUs instead of large server nodes without compromising performance.
As a final contribution, we open-source all models, deployment and fine-tuning notebooks, and the `\name{}`{=latex} codebase for training VLAs at scale, with the hope that these resources enable future work exploring and adapting VLAs for robotics.
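
To make the low-rank adaptation idea concrete, the following is a minimal, dependency-free sketch of a LoRA-style linear layer in the spirit of @hu2021lora. It is an illustration under stated assumptions, not the `\name{}`{=latex} implementation: the class name `LoRALinear`, the helper `matmul`, the initialization scale, and the toy layer sizes are all hypothetical. The pretrained weight stays frozen, and only the two small factors `A` and `B` would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration class; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Apply the layer to a vector x given as a list of length d_in."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        update = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_params(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy comparison: a 64 x 64 layer adapted with rank 4 (sizes are illustrative).
d = 64
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
layer = LoRALinear(identity, rank=4, alpha=8.0)
print("full fine-tuning parameters:", d * d)
print("LoRA trainable parameters:", layer.trainable_params())
```

Because `B` is initialized to zero, the adapted layer reproduces the pretrained layer exactly before any training step, and the number of trainable parameters grows with the rank rather than with the full weight matrix.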

Related Work {#sec:related_work}
============

#### Visually-Conditioned Language Models

Visually-conditioned language models (VLMs), which are trained on Internet-scale data to generate natural language from input image(s) and language prompts, have been adopted for myriad applications from visual question answering [@goyal2017making; @hudson2019gqa; @singh2019textvqa; @bigham2010vizwiz] to object localization [@kazemzadeh2014refcoco; @yu2016refcoco]. One of the key advances fueling recent VLMs is the development of model architectures that bridge features from pretrained vision encoders [@radford2021clip; @zhai2023siglip; @oquab2023dinov2] with pretrained language models [@touvron2023llama2; @jiang2023mistral; @google-gemma; @microsoft-phi; @bai2023qwen], directly building on advances in both computer vision and natural language modeling to create powerful multimodal models. While early work explored various architectures for cross-attending between vision and language features [@li2022blip; @li2023blip2; @dai2023instructblip; @tan2019lxmert; @laurencon2023obelics], new open-source VLMs [@liu2023llava; @liu2023llavav15; @chen2023pali3; @karamcheti2024prismatic] have converged on a simpler "patch-as-token" approach, in which patch features from pretrained visual transformers are treated as tokens, and are then projected into the input space of a language model. This simplicity makes it easy to repurpose existing tools for training language models at scale for VLM training. We employ these tools in our work to scale VLA training, and specifically use VLMs from @karamcheti2024prismatic as our pretrained backbone, as they are trained from *multi-resolution visual features*, fusing low-level spatial information from DINOv2 [@oquab2023dinov2] with higher-level semantics from SigLIP [@zhai2023siglip] to aid in visual generalization.
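
The "patch-as-token" idea can be sketched in a few lines. This is a simplified illustration, not the architecture of any specific VLM: it assumes that the two encoders produce the same number of patches, that their features are fused by channel-wise concatenation, that a single linear layer serves as the projector, and that projected patches are placed ahead of the prompt embeddings. All function names and toy dimensions are hypothetical.

```python
def linear(vec, weight, bias):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def patches_to_tokens(feats_a, feats_b, weight, bias):
    """Fuse two per-patch feature lists and project each patch into the
    language model's embedding space (assumed fusion: concatenation)."""
    assert len(feats_a) == len(feats_b), "encoders must agree on patch count"
    fused = [a + b for a, b in zip(feats_a, feats_b)]  # list concatenation per patch
    return [linear(p, weight, bias) for p in fused]


def build_input_sequence(patch_tokens, text_embeddings):
    """Place projected patch tokens ahead of the prompt's token embeddings
    (assumed ordering)."""
    return patch_tokens + text_embeddings


# Toy sizes: 3 patches, feature widths 2 and 3, language model width 4.
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for spatial features
feats_b = [[0.5, 0.5, 0.5] for _ in range(3)]   # stand-in for semantic features
weight = [[0.1] * 5 for _ in range(4)]          # projector: (2 + 3) -> 4
bias = [0.0] * 4
text_embeddings = [[0.0] * 4, [1.0] * 4]        # two prompt tokens

tokens = patches_to_tokens(feats_a, feats_b, weight, bias)
sequence = build_input_sequence(tokens, text_embeddings)
print(len(sequence), len(sequence[0]))
```

Once the patches live in the same space as the text embeddings, the language model needs no architectural change to consume them, which is what allows existing language-model training tools to be reused.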

#### Generalist Robot Policies

A recent trend in robotics works towards training multi-task "generalist" robot policies [@kalashnikov2018qt; @mtopt2021arxiv; @ebert2021bridge; @walke2023bridgedata; @rt12022arxiv; @ehsani2023imitating; @bharadhwaj2023roboagent] on large diverse robot datasets [@pinto2016supersizing; @mandlekar2018roboturk; @kalashnikov2018qt; @gupta2018robot; @dasari2019robonet; @cabi2019; @jang2022bc; @walke2023bridgedata; @rt12022arxiv; @fang2023rh20t; @bharadhwaj2023roboagent; @khazatsky2024droid; @open_x_embodiment_rt_x_2023], spanning many different robot embodiments [@devin2017learning; @dasari2019robonet; @hu2022know; @yang2023polybot; @reed2022a; @salhotra2023bridging; @radosavovic2023robot; @shahGNMGeneralNavigation2023; @bousmalis2023robocat; @shah2023vint; @open_x_embodiment_rt_x_2023; @octo_2023; @yang2024pushing]. Notably, Octo [@octo_2023] trains a generalist policy that can control multiple robots out-of-the-box and allows for flexible fine-tuning to new robot setups. A key difference between these approaches and `\name{}`{=latex} is the model architecture. Prior works like Octo typically compose pretrained components such as language embeddings or visual encoders with additional model components initialized from scratch [@walke2023bridgedata; @rt12022arxiv; @octo_2023], learning to "stitch" them together during the course of policy training. Unlike these works, `\name{}`{=latex} adopts a more end-to-end approach, directly fine-tuning VLMs to generate robot actions by treating them as tokens in the language model vocabulary. Our experimental evaluation shows that this simple yet scalable pipeline substantially boosts performance and generalization ability over prior generalist policies.
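
One way to treat continuous robot actions as tokens in a language model vocabulary is to discretize each action dimension into bins and assign each bin a token id. The sketch below illustrates that general idea only. The uniform binning, the choice of 256 bins, the per-dimension bounds, and the reserved vocabulary offset `VOCAB_OFFSET` are assumptions made for this example, not details stated in this section, and all function names are hypothetical.

```python
N_BINS = 256          # assumed number of bins per action dimension
VOCAB_OFFSET = 31744  # hypothetical start of a reserved slice of token ids


def discretize(value, low, high, n_bins=N_BINS):
    """Map a continuous value to a bin index in [0, n_bins - 1], clipping to bounds."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)


def undiscretize(bin_index, low, high, n_bins=N_BINS):
    """Return the center of the given bin as the decoded continuous value."""
    width = (high - low) / n_bins
    return low + (bin_index + 0.5) * width


def action_to_token_ids(action, bounds, vocab_offset=VOCAB_OFFSET):
    """Encode an action vector as one token id per dimension."""
    return [vocab_offset + discretize(a, lo, hi) for a, (lo, hi) in zip(action, bounds)]


def token_ids_to_action(token_ids, bounds, vocab_offset=VOCAB_OFFSET):
    """Decode predicted token ids back into a continuous action vector."""
    return [undiscretize(t - vocab_offset, lo, hi) for t, (lo, hi) in zip(token_ids, bounds)]


# Toy 7-dimensional action (e.g., end-effector deltas plus gripper), bounds assumed.
bounds = [(-1.0, 1.0)] * 7
action = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.75]
token_ids = action_to_token_ids(action, bounds)
decoded = token_ids_to_action(token_ids, bounds)
print(token_ids)
print([round(v, 3) for v in decoded])
```

With this encoding, predicting an action becomes ordinary next-token prediction over a short sequence, and the decoded action differs from the original by at most half a bin width per dimension.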

#### Vision-Language-Action Models

A number of works have explored the use of VLMs for robotics, `\eg `{=latex}for visual state representations [@nair2022r3m; @Karamcheti2023LanguageDrivenRL], object detection [@gadre2023cows], high-level planning [@driess2023palm], and for providing a feedback signal [@du2023vision; @ma2023liv; @zhang2023grounding; @sontakke2024roboclip]. Others integrate VLMs directly into end-to-end visuomotor manipulation policies [@shridhar2022cliport; @stone2023open], but incorporate significant structure into the policy architecture or require calibrated cameras, which limits their applicability. A number of recent works have explored similar recipes to ours and directly fine-tuned large pretrained VLMs for predicting robot actions [@rt22023arxiv; @open_x_embodiment_rt_x_2023; @huang2023embodied; @li2023vision; @covariant_ai_2024; @wayve_ai_2024; @zhen20243dvla]. Such models are often referred to as vision-language-action models (VLAs), since they fuse robot control actions directly into VLM backbones. This has three key benefits: (1) it performs alignment of pretrained vision and language components on a large, Internet-scale vision-language dataset, (2) the use of a generic architecture, not custom-made for robot control, allows us to leverage the scalable infrastructure underlying modern VLM training [@pytorch_amp; @dao2023flashattention; @zhao2023pytorch] and scale to training billion-parameter policies with minimal code modifications, and (3) it provides a direct pathway for robotics to benefit from the rapid improvements in VLMs. Existing works on VLAs either focus on training and evaluating in single robot or simulated setups [@huang2023embodied; @li2023vision; @dorka2024matters; @zhen20243dvla] and thus lack generality, or are closed and do not support efficient fine-tuning to new robot setups [@rt22023arxiv; @open_x_embodiment_rt_x_2023; @covariant_ai_2024; @wayve_ai_2024].

Most closely related, RT-2-X [@open_x_embodiment_rt_x_2023] trains a 55B-parameter VLA policy on the Open X-Embodiment dataset and demonstrates state-of-the-art generalist manipulation policy performance. However, our work differs from RT-2-X in multiple important aspects: (1) by combining a strong open VLM backbone with a richer robot pretraining dataset, `\name{}`{=latex} outperforms RT-2-X in our experiments while being an order of magnitude smaller; (2) we thoroughly investigate fine-tuning of `\name{}`{=latex} models to new target setups, while RT-2-X does not investigate the fine-tuning setting; (3) we are the first to demonstrate the effectiveness of modern parameter-efficient fine-tuning and quantization approaches for VLAs; and (4) `\name{}`{=latex} is the first generalist VLA that is open-source and thus supports future research on VLA training, data mixtures, objectives, and inference.

The `\name{}`{=latex} Model {#sec:model}
===========================

We introduce the `\name{}`{=latex} model, a `\modelsize{}`{=latex}-parameter vision-language-action model (VLA) trained on `\nepisodes{}`{=latex} robot demonstrations from the Open X-Embodiment dataset [@open_x_embodiment_rt_x_2023]. There are many, largely unexplored, questions around best practices for developing VLA models, `\eg `{=latex}what are the best model backbones, datasets, and hyperparameters to use for training. Below, we detail our approach for developing `\name{}`{=latex} and summarize our key learnings. Concretely, we first provide a brief overview of modern VLMs, which form the backbone of `\name{}`{=latex} (`\cref{sec:prismatic_preliminaries}`{=latex}); then describe our basic training recipe and dataset (`\cref{sec:train_procedure}`{=latex} and `\cref{sec:data_mix}`{=latex}); discuss key design decisions (`\cref{sec:train_details}`{=latex}); and provide details of the infrastructure used for training and inference (`\cref{sec:infra}`{=latex}).

```{=latex}
\centering
```
![ **`\name{}`{=latex} model architecture.** Given an image observation and a language instruction, the model predicts 7-dimensional robot control actions. The architecture consists of three key components: (1) a **vision encoder** that concatenates DinoV2 [@oquab2023dinov2] and SigLIP [@zhai2023sigmoid] features, (2) a **projector** that maps visual features to the language embedding space, and (3) the **LLM backbone**, a Llama 2 7B-parameter large language model [@touvron2023llama2]. ](figures/openvla_model.png){#fig:architecture width="\\linewidth"}

Preliminaries: Vision-Language Models {#sec:prismatic_preliminaries}
-------------------------------------

The architecture of most recent VLMs [@liu2023llava; @liu2023llavav15; @chen2023pali3; @karamcheti2024prismatic] consists of three main parts (see `\cref{fig:architecture}`{=latex}): (1) a **visual encoder** that maps image inputs to a number of \`\`image patch embeddings", (2) a **projector** that takes the output embeddings of the visual encoder and maps them into the input space of a language model, and (3) a **large language model (LLM) backbone**. During VLM training, the model is trained end-to-end with a next text token prediction objective on paired or interleaved vision and language data curated from various Internet sources.

In this work, we build on the Prismatic-7B VLM [@karamcheti2024prismatic]. Prismatic follows the same standard architecture described above, with a 600M-parameter **visual encoder**, a small 2-layer MLP **projector**, and a 7B-parameter Llama 2 **language model backbone** [@touvron2023llama2]. Notably, Prismatic uses a *two-part* visual encoder, consisting of pretrained SigLIP [@zhai2023sigmoid] and DinoV2 [@oquab2023dinov2] models. Input image patches are passed separately through both encoders and the resulting feature vectors are concatenated channel-wise. In contrast to the more commonly used vision encoders such as CLIP- [@radford2021learning] or SigLIP-only encoders, the addition of DinoV2 features has been shown to be helpful for improved spatial reasoning [@karamcheti2024prismatic], which can be particularly helpful for robot control.
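
The fusion step described above can be sketched in a few lines of NumPy. This is only an illustration of channel-wise concatenation followed by a 2-layer MLP projector: the feature sizes, the random weights, and the GELU activation are assumptions made for the example, and `fuse_and_project` is a hypothetical helper rather than a function from the Prismatic codebase.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fuse_and_project(dino_feats, siglip_feats, w1, b1, w2, b2):
    """Concatenate per-patch features channel-wise, then apply a 2-layer MLP."""
    # (P, d_dino) and (P, d_siglip) -> (P, d_dino + d_siglip)
    fused = np.concatenate([dino_feats, siglip_feats], axis=-1)
    hidden = gelu(fused @ w1 + b1)
    return hidden @ w2 + b2  # (P, d_llm): one "visual token" per image patch

rng = np.random.default_rng(0)
P, D_DINO, D_SIGLIP, D_LLM = 256, 32, 48, 64  # illustrative sizes only

# Random arrays stand in for the outputs of the two pretrained encoders.
dino = rng.normal(size=(P, D_DINO))
siglip = rng.normal(size=(P, D_SIGLIP))

# Random projector weights stand in for learned parameters.
w1 = rng.normal(size=(D_DINO + D_SIGLIP, D_LLM)) * 0.02
b1 = np.zeros(D_LLM)
w2 = rng.normal(size=(D_LLM, D_LLM)) * 0.02
b2 = np.zeros(D_LLM)

visual_tokens = fuse_and_project(dino, siglip, w1, b1, w2, b2)
print(visual_tokens.shape)  # (256, 64)
```

Both encoders see the same patches, so the concatenation leaves the number of visual tokens unchanged and only widens each token's feature vector before projection.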

SigLIP, DinoV2, and Llama 2 do not release details about their training data, which likely consists of trillions of tokens of Internet-sourced image-text, image-only, and text-only data respectively. The Prismatic VLM is fine-tuned on top of these components using the LLaVA 1.5 data mixture [@liu2023llavav15], which contains a total of approximately 1M image-text and text-only data samples from open-source datasets [@sharma2018conceptual; @schuhmann2021laion; @hudson2019gqa; @sidorov2020textcaps; @liu2023llava].

`\name{}`{=latex} Training Procedure {#sec:train_procedure}
------------------------------------

To train `\name{}`{=latex}, we fine-tune a pretrained Prismatic-7B VLM backbone for robot action prediction (see `\cref{fig:architecture}`{=latex}). We formulate the action prediction problem as a \`\`vision-language" task, where an input observation image and a natural language task instruction are mapped to a string of predicted robot actions [@rt22023arxiv]. To enable the VLM's language model backbone to predict robot actions, we represent the actions in the output space of the LLM by mapping continuous robot actions to discrete tokens used by the language model's tokenizer. Following @rt22023arxiv, we discretize each dimension of the robot actions separately into one of 256 bins. For each action dimension, we set the bin width to uniformly divide the interval between the 1^st^ and 99^th^ quantile of the actions in the training data. Using quantiles instead of the min-max bounds @rt22023arxiv used allows us to ignore outlier actions in the data that could otherwise drastically expand the discretization interval and reduce the effective granularity of our action discretization.
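
As a minimal sketch of this discretization, assuming the training actions are available as a `(T, N)` array, the quantile bounds and uniform binning could be written as follows. The function names are hypothetical, and mapping bin indices back to bin centers is one reasonable choice for decoding, not necessarily the one used in the released code.

```python
import numpy as np

N_BINS = 256

def fit_bounds(actions, lo_q=0.01, hi_q=0.99):
    """Per-dimension 1st and 99th quantiles of the training actions.

    actions: array of shape (T, N). Returns two arrays of shape (N,).
    """
    actions = np.asarray(actions, dtype=np.float64)
    return np.quantile(actions, lo_q, axis=0), np.quantile(actions, hi_q, axis=0)

def discretize(action, low, high, n_bins=N_BINS):
    """Map continuous actions to integer bin indices in [0, n_bins - 1].

    Values outside [low, high] are clipped into the first or last bin.
    Assumes high > low in every dimension.
    """
    action = np.clip(action, low, high)
    width = (high - low) / n_bins
    idx = np.floor((action - low) / width).astype(np.int64)
    return np.clip(idx, 0, n_bins - 1)

def undiscretize(idx, low, high, n_bins=N_BINS):
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / n_bins
    return low + (np.asarray(idx) + 0.5) * width
```

With min-max bounds, a single extreme action would stretch the interval and widen every bin. The quantile bounds keep the bin width set by the bulk of the data, at the cost of clipping the most extreme 2% of values in each dimension.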

Using this discretization, we obtain $N$ discrete integers $\in [0 \dots 255]$ for an $N$-dimensional robot action. Unfortunately, the tokenizer used by `\name{}`{=latex}'s language backbone, the Llama tokenizer [@touvron2023llama2], only reserves 100 \`\`special tokens" for tokens newly introduced during fine-tuning, which is too few for the 256 tokens of our action discretization. Instead, we again opt for simplicity and follow @rt22023arxiv's approach by simply overwriting the 256 *least used* tokens in the Llama tokenizer's vocabulary (which correspond to the last 256 tokens) with our action tokens. Once the actions are processed into a sequence of tokens, `\name{}`{=latex} is trained with a standard next-token prediction objective, evaluating the cross-entropy loss on the predicted action tokens only. We discuss key design decisions for implementing this training procedure in `\cref{sec:train_details}`{=latex}. Next, we describe the robot dataset we use for `\name{}`{=latex} training.
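
The token mapping and the action-only loss can be illustrated with the sketch below. It assumes the Llama 2 tokenizer's vocabulary size of 32,000, so that the last 256 token ids are 31,744 through 31,999. The ordering of bins within that block and the helper names are assumptions for illustration, and the logits are assumed to be already aligned with their target tokens.

```python
import numpy as np

VOCAB_SIZE = 32000                        # Llama 2 tokenizer vocabulary size
N_BINS = 256
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # first of the last 256 token ids

def bins_to_token_ids(bins):
    """Map bin indices in [0, 255] onto the last 256 token ids (assumed order)."""
    return ACTION_TOKEN_START + np.asarray(bins)

def token_ids_to_bins(token_ids):
    """Inverse of bins_to_token_ids."""
    return np.asarray(token_ids) - ACTION_TOKEN_START

def action_only_cross_entropy(logits, targets, is_action):
    """Mean cross-entropy over the positions flagged as action tokens.

    logits: (L, V) scores, targets: (L,) token ids, is_action: (L,) booleans.
    Prompt positions (is_action == False) do not contribute to the loss.
    """
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[np.asarray(is_action, dtype=bool)].mean()
```

Masking the prompt positions means the gradient signal comes only from how well the model predicts the action tokens, even though the image and instruction tokens still condition those predictions.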

Training Data {#sec:data_mix}
-------------

The goal in constructing the `\name{}`{=latex} training dataset is to capture a large diversity of robot embodiments, scenes, and tasks. This enables the final model to control various robots out of the box *and* admits efficient fine-tuning to new robot setups. We leverage the Open X-Embodiment dataset [@open_x_embodiment_rt_x_2023] (OpenX) as a base to curate our training dataset. The full OpenX dataset, at the time of writing, consists of more than 70 individual robot datasets, with more than 2M robot trajectories, that were pooled into a coherent and easy-to-use data format in a large community effort. To make training on this data practical, we apply multiple steps of data curation to the raw dataset.

The goals of this curation are to ensure (1) a coherent input and output space across all training datasets, and (2) a balanced mix of embodiments, tasks, and scenes in the final training mixture.[^2] To address (1), we follow [@open_x_embodiment_rt_x_2023; @octo_2023] and restrict our training dataset to manipulation datasets that include at least one 3^rd^ person camera and use single-arm end-effector control. For (2), we leverage the data mixture weights of Octo [@octo_2023] for all datasets that pass the first round of filtering. Octo heuristically down-weights or removes less diverse datasets and up-weights datasets with larger task and scene diversity; see @octo_2023 for details.

We also experimented with incorporating a few additional datasets into our training mixture that were added to the OpenX dataset since the release of Octo, including the DROID dataset [@khazatsky2024droid], although at a conservative mixture weight of 10%. In practice, we found that the action token accuracy on DROID remained low throughout training, suggesting a larger mixture weight or model may be required to fit its diversity in the future. To avoid jeopardizing the quality of the final model, we removed DROID from the data mixture for the final third of training. We provide a complete overview of the datasets used and their mixture weights in `\cref{sec:app:data_mix}`{=latex}.

`\name{}`{=latex} Design Decisions {#sec:train_details}
----------------------------------

When developing the `\name{}`{=latex} model, we explored various design decisions in smaller-scale experiments before starting the final model training run. Concretely, we trained and evaluated `\name{}`{=latex} models on BridgeData V2 [@walke2023bridgedata] for our initial experiments, instead of training on the full OpenX mixture, to increase iteration speed and reduce computational cost. We summarize key learnings from these explorations below.

**VLM Backbone.** Initially, we experimented with multiple VLM backbones. Apart from Prismatic [@karamcheti2024prismatic], we tested fine-tuning IDEFICS-1 [@idefics2024] and LLaVA [@liu2024visual] for robot action prediction. We found that LLaVA and IDEFICS-1 performed comparably on tasks with only one object in the scene, but LLaVA demonstrated stronger language grounding in tasks that involved multiple objects in the scene and required the policy to manipulate the *correct* object, `\ie `{=latex}the object specified in the language instruction. Concretely, LLaVA improved upon IDEFICS-1 by 35% in absolute success rate, averaged across five language grounding tasks in a BridgeData V2 sink environment. The fine-tuned Prismatic VLM policy achieved further improvements, outperforming the LLaVA policy by roughly 10% in absolute success rate across both simple single-object tasks and multi-object, language grounding tasks. We attribute this performance delta to improved spatial reasoning capabilities afforded by the fused SigLIP-DinoV2 backbones (see `\cref{sec:prismatic_preliminaries}`{=latex}). In addition to the performance enhancements, Prismatic also provides a modular and easy-to-use codebase, so we ultimately chose it to be the backbone for the `\name{}`{=latex} model.

**Image Resolution.** The resolution of input images has a significant impact on the computational requirements of VLA training, since higher-resolution images result in more image patch tokens and thus longer context lengths that quadratically increase training compute. We compared VLAs with $224 \times 224$px and $384 \times 384$px inputs, but found no performance difference in our evaluations, while the latter takes 3x longer to train. We thus opt for a resolution of $224 \times 224$px for the final `\name{}`{=latex} model. Note that on many VLM benchmarks, increased resolution does improve performance [@karamcheti2024prismatic; @mckinzie2024mm1; @lin2023vila], but we did not see this trend (yet) for VLAs.
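
To make the cost of higher resolution concrete, the following back-of-the-envelope sketch counts image patch tokens at both resolutions. The patch size of 14 pixels and the floor division for image sizes that are not a multiple of the patch size are assumptions of this example, not values stated in the text, and the count ignores the language tokens that share the context.

```python
def num_patch_tokens(image_size, patch_size=14):
    """Number of square patches for a square image (patch size is an assumption)."""
    per_side = image_size // patch_size
    return per_side * per_side

low_res = num_patch_tokens(224)     # 16 x 16 patches
high_res = num_patch_tokens(384)    # 27 x 27 patches
token_ratio = high_res / low_res    # growth in visual context length
pairwise_ratio = token_ratio ** 2   # growth in token-pair attention interactions

print(low_res, high_res)            # 256 729
print(round(token_ratio, 2), round(pairwise_ratio, 2))
```

Under these assumptions the visual context grows by roughly 2.8x and the number of attended token pairs by roughly 8x, which is of the same order as the 3x increase in training time reported above.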

**Fine-Tuning Vision Encoder.** Prior work on VLMs found that freezing vision encoders during VLM training typically leads to higher performance [@karamcheti2024prismatic]. Intuitively, a frozen vision encoder may better preserve the robust features learned from its Internet-scale pretraining. However, we found fine-tuning the vision encoder during VLA training to be crucial for good VLA performance. We hypothesize that the pretrained vision backbone may not capture sufficient fine-grained spatial details about important parts of the scene to enable precise robotic control.

**Training Epochs.** Typical LLM or VLM training runs complete at most one or two epochs through their training dataset. In contrast, we found it important for VLA training to iterate through the training dataset significantly more times, with real robot performance continually increasing until training action token accuracy surpasses 95%. Our final training run completes `\nepochs{}`{=latex} epochs through its training dataset.

**Learning Rate.** We swept the learning rate across multiple orders of magnitude for VLA training, and achieved the best results using a fixed learning rate of `\trainlearningrate{}`{=latex} (the same learning rate used during VLM pretraining [@karamcheti2024prismatic]). We did not find learning rate warmup to provide benefits.

Infrastructure for Training and Inference {#sec:infra}
-----------------------------------------

The final `\name{}`{=latex} model is trained on a cluster of `\ngpus{}`{=latex} GPUs for `\ntraindays{}`{=latex} days, or a total of 21,500 A100-hours, using a batch size of `\batchsize{}`{=latex}. During inference, `\name{}`{=latex} requires 15GB of GPU memory when loaded in `bfloat16` precision (`\ie `{=latex}without quantization) and runs at approximately 6Hz on one NVIDIA RTX 4090 GPU (without compilation, speculative decoding, or other inference speed-up tricks). We can further reduce the memory footprint of `\name{}`{=latex} during inference via quantization, without compromising performance in real-world robotics tasks, as shown in `\cref{sec:quantization_exp}`{=latex}. We report inference speed on various consumer- and server-grade GPUs in `\cref{fig:inference_speed}`{=latex}. For convenience, we implement a remote VLA inference server to allow real-time remote streaming of action predictions to the robot -- removing the requirement of having access to a powerful local compute device to control the robot. We release this remote inference solution as part of our open-source code release (`\cref{sec:code}`{=latex}).
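
As a rough sanity check on these memory figures, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count of 7,188.1 million is taken from the full fine-tuning row of `\cref{tab:partial_finetune_results}`{=latex}; the helper name is hypothetical, and the estimate deliberately ignores activations, the key-value cache, and framework overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7_188.1e6  # total parameter count, from the full fine-tuning table row

for name, bits in [("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The `bfloat16` estimate comes to about 14.4 GB, a lower bound that sits slightly below the measured footprint; the remaining gap is presumably runtime overhead, although that attribution is an assumption of this sketch.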

The `\name{}`{=latex} Codebase {#sec:code}
==============================

Along with our model, we release the `\name{}`{=latex} codebase, a modular PyTorch codebase for training VLA models (see `\website{}`{=latex}). It scales from fine-tuning VLAs on individual GPUs to training billion-parameter VLAs on multi-node GPU clusters, and supports modern techniques for large transformer model training such as automatic mixed precision (AMP, @pytorch_amp), FlashAttention [@dao2023flashattention], and fully sharded data parallelism (FSDP, @zhao2023pytorch). Out of the box, the `\name{}`{=latex} codebase has full support for training on the Open X dataset, integrates with HuggingFace's [@wolf-etal-2020-transformers] `AutoModel` class, and supports LoRA fine-tuning [@hu2021lora] and quantized model inference [@dettmers2022gpt3; @dettmers2024qlora].

Experiments {#sec:experiments}
===========

The goal of our experimental evaluations is to test `\name{}`{=latex}'s ability to serve as a powerful multi-robot control policy out of the box, as well as be a good initialization for fine-tuning to new robot tasks. Concretely, we aim to answer the following questions:

1.  How does `\name{}`{=latex} compare to prior generalist robot policies, when evaluating on multiple robots and various types of generalization?

2.  Can `\name{}`{=latex} be effectively fine-tuned on a new robot setup and task, and how does it compare to state-of-the-art data-efficient imitation learning approaches?

3.  Can we use parameter-efficient fine-tuning and quantization to reduce the computational requirements for training and inference of `\name{}`{=latex} models and make them more accessible? What are the performance-compute trade-offs?

Direct Evaluations on Multiple Robot Platforms {#sec:zero_shot_exp}
----------------------------------------------

```{=latex}
\centering
```
![**BridgeData V2 WidowX robot evaluation tasks and results.** We evaluate `\name{}`{=latex} and prior state-of-the-art generalist robot policies on a comprehensive suite of tasks covering several axes of generalization, as well as tasks that specifically assess language conditioning ability. `\name{}`{=latex} achieves the highest overall performance and even outperforms the closed-source RT-2-X model in all categories except semantic generalization. Average success rates $\pm$ StdErr are computed across 170 total rollouts per approach. See `\cref{app:tab:bridge_results_detailed}`{=latex} for detailed results.](figures/bridge_results.png){#fig:bridge_results width="\\linewidth"}

**Robot Setups and Tasks.** We evaluate `\name{}`{=latex}'s performance \`\`out-of-the-box" on two robot embodiments: the WidowX robot from the BridgeData V2 evaluations [@walke2023bridgedata] (see `\cref{fig:teaser}`{=latex}, left) and the mobile manipulation robot from the RT-1 and RT-2 evaluations [@rt12022arxiv; @rt22023arxiv] (\`\`Google robot"; see `\cref{fig:teaser}`{=latex}, middle). Both platforms have been extensively used in prior works for evaluating generalist robot policies [@rt12022arxiv; @rt22023arxiv; @open_x_embodiment_rt_x_2023; @octo_2023]. We define a comprehensive set of evaluation tasks in each environment that covers various axes of generalization, such as **visual** (unseen backgrounds, distractor objects, colors/appearances of objects); **motion** (unseen object positions/orientations); **physical** (unseen object sizes/shapes); and **semantic** (unseen target objects, instructions, and concepts from the Internet) generalization. We also assess language conditioning ability in scenes with multiple objects, testing whether the policy can manipulate the correct target object, as specified in the user's prompt. See bottom row of `\cref{fig:bridge_results}`{=latex} and `\cref{fig:rt1_results}`{=latex} for example task images in the BridgeData V2 and Google robot evaluations, respectively. Overall, we evaluated each method in 170 rollouts (17 tasks with 10 trials each) for BridgeData V2 experiments and 60 rollouts (12 tasks with 5 trials each) for Google robot experiments. A detailed breakdown of all tasks and how they differ from the training data is in `\cref{sec:app:detailed_setups}`{=latex}. All evaluations in this and the following sections are conducted as A/B evaluations, using the same tasks with the same sets of initial robot and object states, to ensure fair comparison.

**Comparisons.** We compare `\name{}`{=latex}'s performance to three prior generalist manipulation policies: **RT-1-X** [@open_x_embodiment_rt_x_2023], **RT-2-X** [@open_x_embodiment_rt_x_2023], and **Octo** [@octo_2023]. **RT-1-X** (35M parameters) and **Octo** (93M parameters) are transformer policies trained from scratch on subsets of the OpenX dataset; Octo is the state-of-the-art model among open-source manipulation policies. **RT-2-X** (55B parameters) is a state-of-the-art, closed-source VLA that leverages Internet-pretrained vision and language backbones.

The results are summarized in `\cref{fig:bridge_results}`{=latex} for BridgeData V2 evaluations and `\cref{fig:rt1_results}`{=latex} for Google robot evaluations (per-task breakdown in Appendix, `\cref{app:tab:bridge_results_detailed}`{=latex} and `\cref{app:tab:rt1_robot_results_detailed}`{=latex}). We find that both RT-1-X and Octo struggle on the tested tasks, often failing to manipulate the correct object, especially when distractors are present, and in some cases causing the robot to wave its arm around aimlessly. Note that our evaluations test even larger degrees of generalization than the evaluations performed in those prior works to challenge the Internet-pretrained VLA models. Thus, lower performance of models without Internet pretraining is expected. RT-2-X clearly outperforms both RT-1-X and Octo, demonstrating the benefits of large, pretrained VLMs for robotics.

```{=latex}
\begin{wrapfigure}{r}{0.5\textwidth}
    \centering
    \vspace{-0.2cm}
    \includegraphics[width=\linewidth]{figures/rt1_results.pdf}
    \caption{\textbf{Google robot evaluation results.} We evaluate generalist robot policies on in-distribution and out-of-distribution (OOD) tasks on the mobile manipulator used in RT-1 and RT-2 evaluations~\citep{rt12022arxiv,rt22023arxiv}. We find that \name{} and RT-2-X attain comparable performance and significantly outperform RT-1-X and Octo overall. Average success rates $\pm$ StdErr are computed across 60 total rollouts per approach. See \cref{app:tab:rt1_robot_results_detailed} for detailed results.}
    \label{fig:rt1_results}
    \vspace{-0.2cm}
\end{wrapfigure}
```
Notably, `\name{}`{=latex} performs comparably to RT-2-X on Google robot evaluations and significantly outperforms RT-2-X on BridgeData V2 evaluations despite being an order of magnitude smaller (7B vs. 55B parameters). Qualitatively, we find that both RT-2-X and `\name{}`{=latex} exhibit markedly more robust behaviors than the other tested models, such as approaching the correct object when distractor objects are present, properly orienting the robot's end-effector to align with the orientation of the target object, and even recovering from mistakes such as insecurely grasping objects (see `\website{}`{=latex} for qualitative rollout examples). RT-2-X achieves higher performance in semantic generalization tasks, as shown in `\cref{fig:bridge_results}`{=latex}, which is expected given that it uses larger-scale Internet pretraining data and is co-fine-tuned with both robot action data and Internet pretraining data to better preserve the pretraining knowledge, rather than being fine-tuned solely on robot data, like `\name{}`{=latex}. However, `\name{}`{=latex} performs comparably or better in all other task categories in both BridgeData V2 and Google robot evaluations. The performance difference can be attributed to a combination of factors: we curated a much larger training dataset for `\name{}`{=latex} with `\nepisodes{}`{=latex} trajectories (vs. 350k for RT-2-X); we cleaned the training dataset more carefully, `\eg `{=latex}by filtering out all-zero actions in the Bridge dataset (see `\cref{sec:app:rt2x_vs_openvla_in_bridge}`{=latex} for a detailed discussion); and `\name{}`{=latex} uses a fused vision encoder that combines pretrained semantic *and* spatial features. See `\cref{sec:app:additional_ablation_experiments}`{=latex} for ablation analyses of these components.

Data-Efficient Adaptation to New Robot Setups {#sec:finetuning_exp}
---------------------------------------------

While prior works mainly focused on directly evaluating VLAs \`\`out-of-the-box" [@driess2023palm; @rt22023arxiv; @open_x_embodiment_rt_x_2023], effective *fine-tuning* of VLA models to new tasks and robot setups is largely unexplored, yet is key for their widespread adoption. In this section, we investigate `\name{}`{=latex}'s ability to be quickly adapted to a new *real-world* robot setup. (See `\cref{sec:app:libero_sim_experiments}`{=latex} for fine-tuning experiments in simulation.)

**Robot setups and tasks.** We test a simple fine-tuning recipe for the `\name{}`{=latex} model: full fine-tuning of all model parameters, using small datasets with 10--150 demonstrations of a target task (see `\cref{fig:finetune_results}`{=latex}; we explore parameter-efficient fine-tuning approaches in `\cref{sec:param_efficient_finetuning}`{=latex}). We test `\name{}`{=latex} in two setups: **Franka-Tabletop**, a stationary, table-mounted Franka Emika Panda 7-DoF robot arm; and **Franka-DROID**, the Franka robot arm setup from the recently released DROID dataset [@khazatsky2024droid], mounted on a movable standing desk. The setups use 5 Hz and 15 Hz non-blocking controllers, respectively. We choose Franka robot arms as the target embodiment for our fine-tuning experiments since they are widely used in the robot learning community and thus a likely \`\`target" of `\name{}`{=latex} fine-tuning. We test on setups with different control frequencies to test `\name{}`{=latex}'s applicability to a range of use cases.

```{=latex}
\centering
```
![**Adapting to new robot setups.** We evaluate the state-of-the-art Diffusion Policy trained from scratch on seven Franka Emika Panda tasks (10--150 demonstrations each), as well as generalist robot policies Octo and `\name{}`{=latex} fine-tuned on the same data. Diffusion Policy exhibits strong performance on narrow single-instruction tasks, while Octo and `\name{}`{=latex} perform better on diverse fine-tuning tasks involving multiple instructions and distractor objects. Overall, `\name{}`{=latex} achieves the highest aggregate performance across both setups, suggesting that it is an effective default for learning a policy on a downstream task. Average success rates $\pm$ StdErr are computed across 129 rollouts per approach (99 for Franka-Tabletop tasks and 30 for Franka-DROID tasks). See `\cref{app:tab:detailed_finetune_results}`{=latex} for detailed results.](figures/finetune_results.png){#fig:finetune_results width="\\linewidth"}

```{=latex}
\vspace{-0.3cm}
```
**Comparisons.** We compare to **Diffusion Policy** [@chi2023diffusionpolicy], a state-of-the-art data-efficient imitation learning approach, trained from scratch. We also compare to **Diffusion Policy (matched)**, a version of Diffusion Policy that matches the input and output specifications of `\name{}`{=latex}.[^3] Additionally, we evaluate **Octo** [@octo_2023] fine-tuned on the target dataset, since it is currently the best generalist policy that supports fine-tuning (fine-tuning of RT-2-X is not supported through its inference API). We also fine-tune `\name{}`{=latex} on the same target dataset, and the resulting policy is denoted by **OpenVLA**. Finally, as an ablation experiment, we compare to **`\name{}`{=latex} (scratch)**, where we directly fine-tune the underlying base Prismatic VLM on the target robot setup -- rather than fine-tuning the OpenX-pretrained `\name{}`{=latex} model -- to assess the benefit of large-scale robot pretraining.

We present the results in `\cref{fig:finetune_results}`{=latex} (per-task breakdown in Appendix, `\cref{app:tab:detailed_finetune_results}`{=latex}). We find that both versions of Diffusion Policy are competitive with or outperform the generalist policies Octo and `\name{}`{=latex} on narrower single-instruction tasks like \`\`Put Carrot in Bowl" and \`\`Pour Corn into Pot", but the pretrained generalist policies perform better in more diverse fine-tuning tasks that involve multiple objects in the scene and require language conditioning. OpenX pretraining for Octo and `\name{}`{=latex} enables the models to better adapt to these more diverse tasks where language grounding is important; we see evidence for this in the lower performance of `\name{}`{=latex} (scratch).

Overall, we find that `\name{}`{=latex} achieves the highest average performance. Notably, most prior works achieve strong performance only in *either* narrow single-instruction *or* diverse multi-instruction tasks, resulting in widely varying success rates. `\name{}`{=latex} is the only approach that achieves at least 50% success rate across all tested tasks, suggesting that it can be a strong default option for imitation learning tasks, particularly if they involve a diverse set of language instructions. For narrower but highly dexterous tasks, Diffusion Policy still shows smoother and more precise trajectories; incorporating action chunking and temporal smoothing, as implemented in Diffusion Policy, may help `\name{}`{=latex} attain the same level of dexterity and may be a promising direction for future work (see `\cref{sec:conclusion}`{=latex} for a detailed discussion of current limitations).

Parameter-Efficient Fine-Tuning {#sec:param_efficient_finetuning}
-------------------------------

The full fine-tuning runs of `\name{}`{=latex} in the previous section used 8 A100 GPUs for 5-15 hours per task (depending on the dataset size) to achieve high performance. While this is substantially less compute than what is required for VLA pretraining, in this section we explore even more compute- and parameter-efficient fine-tuning approaches and investigate their effectiveness.

```{=latex}
\begin{wraptable}{r}{0.6\textwidth}
\centering
\vspace{-0.3cm}
\caption{\textbf{Parameter-efficient fine-tuning evaluation.} LoRA fine-tuning achieves the best performance-compute trade-off, matching full fine-tuning performance while training only 1.4\% of the model parameters. Mean success $\pm$ StdErr computed across 33 rollouts per approach on select Franka-Tabletop tasks (see \cref{app:tab:detailed_peft_results} for details).\newline$^\ast$: Sharded across 2~GPUs with FSDP~\citep{zhao2023pytorch}.}
\label{tab:partial_finetune_results}
\resizebox{0.6\textwidth}{!}{\begin{tabular}{lccc}
\toprule
Strategy & Success Rate & Train Params ($\times 10^6$) & VRAM (batch 16) \\
\midrule
Full FT & \textbf{69.7 $\pm$ 7.2 \%} & 7,188.1 & 163.3 GB* \\
\midrule
Last layer only & 30.3 $\pm$ 6.1 \% & 465.1 & 51.4 GB \\
Frozen vision & 47.0 $\pm$ 6.9 \% & 6,760.4 & 156.2 GB* \\
Sandwich & 62.1 $\pm$ 7.9 \% & 914.2 & 64.0 GB \\
LoRA, rank=32 & \textbf{68.2 $\pm$ 7.5\%} & \textbf{97.6} & \textbf{59.7 GB} \\
\hspace{1.05cm}rank=64 & \textbf{68.2 $\pm$ 7.8\%} & 195.2 & 60.5 GB \\
\bottomrule
\end{tabular}}
\end{wraptable}
```
Concretely, we compare the following fine-tuning approaches: **full fine-tuning** updates all weights during fine-tuning, as described in `\cref{sec:finetuning_exp}`{=latex}; **last layer only** fine-tunes only the last layer of `\name{}`{=latex}'s transformer backbone and the token embedding matrix; **frozen vision** freezes the vision encoder but fine-tunes all other weights; **sandwich fine-tuning** unfreezes the vision encoder, token embedding matrix, and last layer; and **LoRA** uses the popular low-rank adaptation technique of @hu2021lora with multiple rank values $r$, applied to all linear layers of the model.

We report fine-tuning success rates across multiple Franka-Tabletop tasks, as well as training parameter count and GPU memory requirements, in `\cref{tab:partial_finetune_results}`{=latex}.[^4] We find that fine-tuning only the network's last layer or freezing the vision encoder leads to poor performance, suggesting that further adaptation of the visual features to the target scene is crucial. In contrast, "sandwich fine-tuning" achieves better performance since it fine-tunes the vision encoder, and it consumes less GPU memory since it does not fine-tune the full LLM backbone. Lastly, LoRA achieves the best trade-off between performance and training memory consumption, outperforming "sandwich fine-tuning" and matching full fine-tuning performance while fine-tuning only 1.4% of the parameters. We find that the LoRA rank has negligible effect on policy performance and thus recommend using a default rank of $r=32$. With LoRA, we can fine-tune `\name{}`{=latex} on a new task within 10-15 hours on a *single* A100 GPU -- an 8x reduction in compute compared to full fine-tuning.
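
The trainable-parameter counts reported for LoRA are consistent with how the technique scales: a linear layer with input width $d_{in}$ and output width $d_{out}$ receives two low-rank factors holding $r(d_{in} + d_{out})$ parameters in total, so doubling the rank doubles the count. The following is a minimal sketch of that arithmetic. The layer shapes in `toy_layers` are hypothetical and chosen only for illustration; they are not the model's actual layers.

```python
def lora_trainable_params(layer_shapes, rank):
    """Count LoRA parameters: each (d_in, d_out) linear layer gets two
    low-rank factors, A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


# Hypothetical layer shapes, for illustration only (not the actual model).
toy_layers = [(4096, 4096)] * 4 + [(4096, 11008), (11008, 4096)]

params_r32 = lora_trainable_params(toy_layers, rank=32)
params_r64 = lora_trainable_params(toy_layers, rank=64)

# Trainable fraction using the counts reported in the table (in millions).
lora_fraction = 97.6 / 7188.1

print(params_r64 / params_r32)   # ratio between rank 64 and rank 32
print(f"{100 * lora_fraction:.1f}% of parameters are trainable at rank 32")
```

The linear scaling matches the table, where rank 64 trains exactly twice as many parameters as rank 32 (195.2M versus 97.6M), while the success rate stays the same.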

Memory-Efficient Inference via Quantization {#sec:quantization_exp}
-------------------------------------------

```{=latex}
\centering
```
![image](figures/inference_speed.png){width="\\linewidth"} `\captionof{figure}{
        \textbf{\name{} inference speed for various GPUs.} Both bfloat16 and int4 quantization achieve high throughput, especially on recent GPU architectures (RTX 4090 with Ada Lovelace, H100 with Hopper). Further speed-ups are possible with modern LLM inference frameworks like TensorRT-LLM~\citep{tensorrt_llm}. $\spadesuit$: Model sharded across two GPUs to fit.
    }`{=latex} `\label{fig:inference_speed}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{\begin{tabular}{lcc}
        \toprule
        Precision & Bridge Success & VRAM \\
        \midrule
        bfloat16 & 71.3 $\pm$ 4.8\% & 16.8 GB \\
        int8 & 58.1 $\pm$ 5.1\% & 10.2 GB \\
        int4 & 71.9 $\pm$ 4.7\% & 7.0 GB \\
        \bottomrule
    \end{tabular}}
```
```{=latex}
\captionof{table}{\textbf{Performance with quantized inference.} 4-bit quantization matches the performance of bfloat16 inference (our default approach) while reducing the GPU memory footprint by more than half. Mean success $\pm$ StdErr computed across 8~representative BridgeData~V2 tasks~\citep{walke2023bridgedata} and 80~rollouts per approach (see \cref{app:tab:detailed_quantized_inference_results} for details).}
```
`\label{tab:quantized_results}`{=latex}

`\name{}`{=latex}, a 7B-parameter model, consumes more memory at inference time than prior open-source generalist policies such as Octo, which has $<$100M parameters. We follow best practices from LLM serving by saving and loading `\name{}`{=latex} in `bfloat16` precision for inference (our default approach), which cuts the memory footprint in half relative to 32-bit precision, allowing us to serve `\name{}`{=latex} on GPUs with only 16GB of GPU memory. In this section, we test whether we can further reduce the required memory for policy inference and broaden accessibility of VLA policies, by using modern quantization techniques developed for serving LLMs [@dettmers2022gpt3; @dettmers2024qlora]. These approaches load the weights of the network at lower precision, accepting potentially reduced inference speed and accuracy in exchange for lower memory requirements.
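
As a rough guide to what lower-precision loading can save, the sketch below estimates the memory taken by the weights alone at each precision. It assumes the total parameter count reported for full fine-tuning (7,188.1M) and counts only bytes per weight. It is a back-of-the-envelope estimate, not a measurement, and the VRAM actually measured at inference is higher, presumably because of activations, runtime overhead, and any components that are not stored at the reduced precision.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

# Total parameter count reported for full fine-tuning (7,188.1 million).
NUM_PARAMS = 7_188.1e6


def weight_memory_gb(num_params, precision):
    """Weights-only memory in GB (10^9 bytes); ignores activations and overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


estimates = {p: round(weight_memory_gb(NUM_PARAMS, p), 1) for p in BYTES_PER_PARAM}
for precision, gigabytes in estimates.items():
    print(f"{precision:>8}: {gigabytes} GB")
```

Under these assumptions the weights take about 28.8 GB in float32, 14.4 GB in bfloat16, 7.2 GB in int8, and 3.6 GB in int4.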

Concretely, we investigate serving the `\name{}`{=latex} model with 8-bit and 4-bit precision on 8 representative BridgeData V2 tasks. We report memory footprint and rollout performance in `\cref{tab:quantized_results}`{=latex}. We also report achievable control frequencies on various consumer- and server-grade GPUs in `\cref{fig:inference_speed}`{=latex}. We observe that 8-bit quantization slows down inference across most GPUs, due to the overhead of the added quantization operations. 4-bit inference achieves higher throughput, since reduced GPU memory transfer compensates for the quantization overhead.

As a result of the reduced inference speed, we observe a substantial performance decrease with 8-bit quantization: on the A5000 GPU we use for our evaluations, we can only run the model at 1.2Hz, which significantly changes the system dynamics compared to the training dataset for the 5Hz non-blocking controller used in the BridgeData V2 tasks.[^5] Notably, 4-bit quantization results in performance similar to bfloat16 half-precision inference despite requiring less than half the amount of GPU memory. 4-bit quantized models can run at 3Hz on the A5000, thus more closely matching the system dynamics during data collection.
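
To make the mismatch concrete, the following sketch converts the control frequencies quoted above into the time between consecutive actions and compares each against the 5Hz controller used during data collection. The helper names are illustrative and not part of any released code.

```python
def action_period_ms(control_hz):
    """Time between consecutive actions, in milliseconds."""
    return 1000.0 / control_hz


def relative_slowdown(inference_hz, controller_hz=5.0):
    """How many times longer each policy step takes than at the data-collection rate."""
    return controller_hz / inference_hz


for label, hz in [("int8", 1.2), ("int4", 3.0)]:
    print(f"{label}: {action_period_ms(hz):.0f} ms per action, "
          f"{relative_slowdown(hz):.1f}x slower than the 5 Hz controller")
```

At 1.2Hz the policy issues an action roughly every 833 ms instead of every 200 ms, about 4.2 times slower than during data collection, whereas at 3Hz the gap shrinks to about 1.7 times.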

Discussion and Limitations {#sec:conclusion}
==========================

In this work, we presented `\name{}`{=latex}, a state-of-the-art, open-source vision-language-action model that obtains strong performance for cross-embodiment robot control out-of-the-box. We also demonstrated that `\name{}`{=latex} can be easily adapted to new robot setups via parameter-efficient fine-tuning techniques.

The current `\name{}`{=latex} model has several limitations. First, it currently only supports single-image observations. Real-world robot setups, however, are heterogeneous, with a wide range of possible sensory inputs [@octo_2023]. Expanding `\name{}`{=latex} to support multiple image and proprioceptive inputs as well as observation history is an important avenue for future work. Exploring the use of VLMs pretrained on *interleaved* image and text data may facilitate such flexible-input VLA fine-tuning.

Second, improving the inference throughput of `\name{}`{=latex} is critical to enable VLA control for high-frequency control setups such as ALOHA [@zhao2023learning], which runs at 50Hz. This will also enable testing VLAs on more dexterous, bi-manual manipulation tasks than what we investigated in this work. Action chunking or alternative inference-time optimization techniques such as speculative decoding [@leviathan2023fast] offer potential remedies.

Additionally, there is room for further performance improvements. While `\name{}`{=latex} outperforms prior generalist policies, it does not yet offer very high reliability on the tested tasks, typically achieving \<90% success rate.

Finally, due to compute limitations, many VLA design questions remain underexplored: What effect does the size of the base VLM have on VLA performance? Does co-training on robot action prediction data and Internet-scale vision-language data substantially improve VLA performance? What visual features are best-suited for VLA models? We hope that the release of the `\name{}`{=latex} model and codebase will enable the community to jointly investigate these questions.

```{=latex}
\acknowledgments{
We are grateful to the Toyota Research Institute for providing significant funding and compute resources required to carry out this research. We also thank the Stanford Center for Research on Foundation Models for providing additional compute resources and Google DeepMind for alpha access to the RT-2-X API for our evaluations. We acknowledge additional support from Volkswagen, Physical Intelligence, ONR grants N00014-22-1-2621 and N00014-22-1-2293, the National Science Foundation through IIS-2246811, and DARPA ANSR.
}
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
Data Mixture Details {#sec:app:data_mix}
====================

We list the data mixture we used in `\cref{tab:data_mix}`{=latex}. The mixture mostly follows [@octo_2023], with a few additional datasets.

```{=latex}
\centering
```
::: {#tab:data_mix}
  `\multicolumn{2}{c}{\textbf{\name{} Training Dataset Mixture}}`{=latex}   
  ------------------------------------------------------------------------- -----------
  Fractal [@brohan2022rt]                                                         12.7%
  Kuka [@kalashnikov2018qt]                                                       12.7%
  Bridge [@ebert2021bridge; @walke2023bridgedata]                                 13.3%
  Taco Play [@rosetebeas2022latent; @mees2023grounding]                            3.0%
  Jaco Play [@dass2023jacoplay]                                                    0.4%
  Berkeley Cable Routing [@luo2023multistage]                                      0.2%
  Roboturk [@DBLP:journals/corr/abs-1811-02790]                                    2.3%
  Viola [@zhu2023viola]                                                            0.9%
  Berkeley Autolab UR5 [@BerkeleyUR5Website]                                       1.2%
  Toto [@zhou2023train]                                                            2.0%
  Language Table [@lynch2023interactive]                                           4.4%
  Stanford Hydra Dataset [@belkhale2023hydra]                                      4.4%
  Austin Buds Dataset [@zhu2022bottom]                                             0.2%
  NYU Franka Play Dataset [@cui2022play]                                           0.8%
  Furniture Bench Dataset [@heo2023furniturebench]                                 2.4%
  UCSD Kitchen Dataset [@ucsd_kitchens]                                          \<0.1%
  Austin Sailor Dataset [@nasiriany2022sailor]                                     2.2%
  Austin Sirius Dataset [@liu2022robot]                                            1.7%
  DLR EDAN Shared Control [@quere_shared_2020]                                   \<0.1%
  IAMLab CMU Pickup Insert [@saxena2023multiresolution]                            0.9%
  UTAustin Mutex [@shah2023mutex]                                                  2.2%
  Berkeley Fanuc Manipulation [@fanuc_manipulation2023]                            0.7%
  CMU Stretch [@mendonca2023structured]                                            0.2%
  BC-Z [@jang2022bc]                                                               7.5%
  FMB Dataset [@luo2024fmb]                                                        7.1%
  DobbE [@shafiullah2023dobbe]                                                     1.4%
  DROID [@khazatsky2024droid]                                                 10.0%[^6]

  : `\name{}`{=latex} training data mixture using datasets from the Open X-Embodiment dataset [@open_x_embodiment_rt_x_2023], following [@octo_2023] with a few additions.
:::
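
One common way to realize such a mixture is to draw each training example's source dataset with probability proportional to its weight. The sketch below illustrates this under the assumption that the listed percentages act as sampling weights; the actual data-loading code is not described in this section. It uses only a handful of entries from the table, so the weights are renormalized over that subset.

```python
import random

# A few entries from the mixture table (percent weights); the rest are omitted for brevity.
mixture = {"Fractal": 12.7, "Kuka": 12.7, "Bridge": 13.3, "BC-Z": 7.5, "DROID": 10.0}


def sample_datasets(weights, num_samples, seed=0):
    """Draw dataset names with probability proportional to their mixture weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_samples)


draws = sample_datasets(mixture, num_samples=10_000)
bridge_share = draws.count("Bridge") / len(draws)
expected_share = mixture["Bridge"] / sum(mixture.values())
print(f"Bridge: sampled {bridge_share:.3f}, expected {expected_share:.3f}")
```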

Evaluation Tasks and Detailed Results {#sec:app:detailed_setups}
=====================================

In this section, we provide more details on the BridgeData V2 WidowX and Google robot evaluations discussed in `\cref{sec:zero_shot_exp}`{=latex}, as well as the Franka-Tabletop and Franka-DROID fine-tuning evaluations discussed in `\cref{sec:finetuning_exp}`{=latex}.

BridgeData V2 WidowX Evaluation Details {#sec:app:bridge_evaluation}
---------------------------------------

Here we focus specifically on BridgeData V2 evaluations discussed in `\cref{sec:zero_shot_exp}`{=latex}.

### BridgeData V2 Evaluation Tasks {#sec:app:bridge_evaluation_tasks}

As described in `\cref{sec:zero_shot_exp}`{=latex}, we evaluate each generalist robot manipulation policy on 17 tasks with 10 trials each. In this section, we provide details on the task categories and individual tasks.

```{=latex}
\centering
```
![**BridgeData V2 WidowX robot evaluation tasks.** We evaluate every generalist robot policy on 4 types of out-of-distribution (OOD) generalization tasks: **visual**, **motion**, **physical**, and **semantic** (as defined in `\cref{sec:zero_shot_exp}`{=latex}). Every pair of images shows the start state and an example end state after the robot completes the task. We also rigorously assess **language grounding** in the 3 tasks shown in the bottom 3 rows, by changing the prompt while fixing the initial state and testing whether the policy can approach the correct target object.](figures/bridge_tasks_vertical.png){#fig:app:bridge_tasks width="0.65\\linewidth"}

In total, we evaluate on 5 visual generalization tasks, 2 motion generalization tasks, 3 physical generalization tasks, 4 semantic generalization tasks, and 3 language grounding tasks. Note that all tasks we evaluate on introduce some form of distribution shift since we are unable to procure the exact objects used in the original dataset (other distribution shifts naturally arise as we reproduce a real-world test environment originally constructed at a different location; see `\cref{sec:app:compare_to_orig_bridge}`{=latex} for a detailed discussion on such distribution shifts). All 17 tasks are depicted in `\cref{fig:app:bridge_tasks}`{=latex}. Each rollout is marked as a failure (0) or success (1). In some more difficult tasks, we record partial successes (0.5); we describe the conditions for partial credit in the task descriptions below.

Below we describe each of the 17 tasks, in the order shown in `\cref{fig:app:bridge_tasks}`{=latex}:

1.  **Put Eggplant into Pot (Easy Version)**: The robot's goal is to pick up the eggplant and drop it into the pot. This is a **visual generalization** task because we use a handcrafted paper pot that has a different appearance than the pot used in the original BridgeData V2 training dataset (since we are unable to procure the original pot). Unlike all 16 other tasks, for this particular task we initialize the robot's end-effector directly above the eggplant before rolling out the policy; hence, we call this the "Easy Version" of the "Put Eggplant into Pot" task.

2.  **Put Eggplant into Pot**: This is the same task as described above, except that the robot's end-effector is not initialized directly above the eggplant. Instead, we initialize it in a position that is fixed across all rollouts, which means that the robot must horizontally reach for the eggplant first before manipulating it. (Note: The same applies to all other tasks described below.) This is a **visual generalization** task for the same reason as above.

3.  **Put Cup from Counter into Sink**: The robot's goal is to pick up the pink cup from either the kitchen countertop or drying rack and place it into the sink on the right. This is a **visual generalization** task because we use a pink cup rather than a blue cup (a blue cup is used in the original BridgeData V2 dataset, but we find that none of the methods we evaluate is able to manipulate it reliably -- most likely because the color of the cup blends in with the color of the sink).

4.  **Put Eggplant into Pot (w/ Clutter)**: This is the same task as the "Put Eggplant into Pot" task, except that it is more difficult due to the presence of several distractor objects. It is a **visual generalization** task for the same reason discussed in the normal "Put Eggplant into Pot" task, and even more so given unseen distractors in the scene. *Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.*

5.  **Put Yellow Corn on Pink Plate**: The robot's goal is to pick up the yellow corn and place it on the pink plate. This is a **visual generalization** task due to the presence of unseen distractor objects in the scene, such as a green dinosaur on the countertop in the back section of the sink. *Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.*

6.  **Lift Eggplant**: The robot's goal is to grasp and lift the eggplant into the air. This is a **motion generalization** task because the eggplant is initialized in unseen positions and/or orientations, and the robot is forced to move beyond its training distribution of positions and/or orientations and often perform long-range reaching in order to complete the task. (Note: Long-range reaching is not demonstrated in this environment in the original BridgeData V2 demonstrations; see `\cref{sec:app:compare_to_orig_bridge}`{=latex} for details.) We find that this task, though seemingly simple, is challenging for many policies. *Partial credit (0.5 out of 1) is awarded when the robot makes contact with the eggplant.*

7.  **Put Carrot on Plate (w/ Height Change)**: The robot's goal is to pick up the carrot and place it on the yellow plate. This is a **motion generalization** task because the plate is elevated from its usual position at the bottom of the sink, and the robot must adjust its trajectory to correctly place the carrot on the elevated platform (without knocking down the plate in the process). *Partial credit (0.5 out of 1) is awarded when the robot grasps the carrot and touches the plate with it.*

8.  **Put Carrot on Plate**: This is the same task as above, except that the plate is at its normal position (at the bottom of the sink or drying rack). We consider this a **physical generalization** task because the carrot has a different size and shape than the one used in the original BridgeData V2 dataset, which is shorter and narrower. (Note that the previous version of this task listed above would also technically be a physical generalization task since it involves the same carrot, but we list it under the "motion generalization" category since that is the focus there.)

9.  **Flip Pot Upright**: The robot's goal is to manipulate the pot such that it is oriented upright in the sink at the end of the episode. This is a **physical generalization** task because this pot has a different size and shape than the one used in the original BridgeData V2 training demonstrations (the pot we use is wider and shorter).

10. **Lift AAA Battery**: The robot's goal is simply to grasp the AAA battery and lift it up into the air. This is considered a **physical generalization** task because the battery is much smaller and thinner than target objects seen in the BridgeData V2 training demonstrations in this environment; see `\cref{sec:app:compare_to_orig_bridge}`{=latex} for details. (Note that this target object does not exist in the original BridgeData V2 demonstrations in this environment, so this is also an instance of "semantic generalization", but we classify it solely as "physical generalization" since that is the main focus here.)

11. **Move Skull into Drying Rack**: The robot's goal is to grasp the skull windup toy and drop it into the yellow drying rack in the left part of the sink. This is a **semantic generalization** task since the skull is an unseen target object (does not appear in the BridgeData V2 training demonstrations).

12. **Lift White Tape**: The robot's goal is to grasp and lift the white roll of tape into the air. This is a **semantic generalization** task since the white tape roll is an unseen target object (does not appear in the BridgeData V2 training demonstrations). (Note that this task may also be considered "physical generalization" because the tape's shape differs from that of the objects seen in the training demonstrations in this environment; most policies struggle to grasp objects with this ring structure, and they often move the robot's end-effector directly into the center region.)

13. **Take Purple Grapes out of Pot**: The robot's goal is to grasp the purple grapes lying inside the steel pot and remove them from the pot (by lifting them out and/or dropping them anywhere outside the pot). This is a **semantic generalization** task because it is an unseen language instruction; the robot has never seen this task in the original BridgeData V2 training dataset.

14. **Stack Blue Cup on Pink Cup**: The robot's goal is to grasp the blue cup and place it securely on top of the pink cup. This is a **semantic generalization** task because it is an unseen language instruction; the robot has never seen this task in this environment in the original BridgeData V2 training dataset. *Partial credit (0.5 out of 1) is awarded when the robot grasps the blue cup and touches the pink cup with the blue cup.*

15. **Put {Eggplant, Red Bottle} into Pot**: This is a **language grounding** task. The robot's goal is to put the specified target object into the pot. Both the eggplant and red bottle are present in the scene. We conduct paired evaluations: for the same initial state, we prompt the policy to target the eggplant in one episode, and then the red bottle in the next episode. We test each method 5 times with the eggplant and 5 times with the red bottle, using the same set of 5 initial states for both target objects. *Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.*

16. **Lift {Cheese, Red Chili Pepper}**: This is a **language grounding** task. The robot's goal is to grasp and lift the specified target object. We conduct paired evaluations as described in the task above. *Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.*

17. **Put {Blue Cup, Pink Cup} on Plate**: This is a **language grounding** task. The robot's goal is to grasp the specified target object and place it onto the plate. We conduct paired evaluations as described in other language grounding tasks. *Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.*
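
The paired evaluation protocol and the scoring rule used in the language grounding tasks above can be sketched as follows. The state labels and helper names are hypothetical and serve only to illustrate the procedure: every initial state is evaluated once per target prompt, and each trial contributes 1, 0.5, or 0 to the task's success count.

```python
from itertools import product


def paired_trials(initial_states, targets):
    """Pair every initial state with every target prompt."""
    return [(state, target) for state, target in product(initial_states, targets)]


def task_score(outcomes):
    """Sum per-trial scores: 1 = success, 0.5 = partial credit, 0 = failure."""
    assert all(o in (0, 0.5, 1) for o in outcomes)
    return sum(outcomes)


states = [f"state_{i}" for i in range(5)]   # hypothetical labels for the 5 initial states
targets = ["eggplant", "red bottle"]
trials = paired_trials(states, targets)     # 10 trials: each state is used once per target

print(len(trials), trials[:2])
print(task_score([1, 0.5, 0, 1]))
```

Because consecutive trials share an initial state and differ only in the prompt, any difference in behavior between them can be attributed to the language instruction.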

### Comparing Evaluation Tasks to Original BridgeData V2 Training Data {#sec:app:compare_to_orig_bridge}

We conduct our evaluations in a sink environment used in the original BridgeData V2 dataset [@walke2023bridgedata]. We reproduce the environment to match the original environment in the BridgeData V2 dataset with rough approximations for the robot's location relative to the sink, as well as the camera's placement relative to the scene. Given the lack of precise measurements of these positions in the original dataset, we are unable to reproduce the *exact* environment setup, and natural distribution shifts arise due to slightly different robot, sink, and camera placements. In addition, since we evaluate robot policies in a different location than where the training demonstrations were collected from, other natural distribution shifts arise. For example, the lighting conditions and background (`\eg `{=latex}visible areas behind the sink) are inevitably different than what was seen in the training dataset. Furthermore, we are unable to procure the exact set of objects used in the original BridgeData V2 dataset, so there are distribution shifts between the objects used at train time and those used at test time.

Despite all these challenges, we find that certain generalist policies, such as `\name{}`{=latex} and RT-2-X, can still generalize and perform various tasks fairly reliably "out-of-the-box". Other generalist policies, such as RT-1-X and Octo, can also complete some tasks, though they struggle when tested with more difficult generalization tasks in our BridgeData V2 evaluation suite.

The original BridgeData V2 dataset includes demonstrations of the following seven tasks in this specific sink environment: "Flip Pot Upright", "Put Carrot on Plate", "Put Cup from Counter (or Drying Rack) into Sink", "Put Eggplant into Pot", "Put Knife on Cutting Board", "Put Spoon in Pot", and "Turn Lever Vertical to Front". See `\cref{fig:app:original_bridge_tasks}`{=latex} for sample images of all these tasks from the original dataset. Note that all training demonstrations collected in this environment are initialized such that the robot's end-effector is positioned directly above the target object in the beginning of the episode. (However, this is not the case across all environments in the BridgeData V2 dataset; in some other environments, the robot is initialized farther away from the target object, so it must horizontally reach for the object first before manipulating it.)

```{=latex}
\centering
```
![**Original BridgeData V2 sink environment tasks.** Images from sample demonstrations in the sink environment from the original BridgeData V2 dataset reveal that all demonstrations in this environment were initialized such that the robot's end-effector was positioned immediately above the target object. Note that these initial states are different from the initial states we use in our BridgeData V2 evaluation tasks shown in `\cref{fig:app:bridge_tasks}`{=latex}. In our evaluations, we always initialize the robot's end-effector to a fixed location above the sink, rather than positioning it directly above the target object (except for one task: "Put Eggplant into Pot (Easy Version)").](figures/original_bridge_tasks.png){#fig:app:original_bridge_tasks width="\\linewidth"}

In our BridgeData V2 evaluation suite, only one task -- "Put Eggplant into Pot (Easy Version)" -- is initialized with the robot's end-effector hovering directly over the target object; in all 16 other tasks, the end-effector is initialized at a fixed location above the sink such that the robot must horizontally reach towards the object. This initial condition, in combination with the distribution shifts we introduce in the various types of OOD generalization in our evaluation suite, challenges the generalist policies and requires a high degree of robustness in order to complete the tasks successfully. Hence, the success rates for policies like RT-1-X and Octo are lower than what is reported in prior works. However, we find that other policies such as RT-2-X and `\name{}`{=latex} still achieve relatively strong performance despite all these distribution shifts and challenges.

### Detailed BridgeData V2 Evaluation Results {#sec:app:detailed_bridge_eval_results}

See `\cref{app:tab:bridge_results_detailed}`{=latex} for the full BridgeData V2 WidowX evaluation results. The number of successes for each method, out of 10 trials, is listed for each of the 17 tasks. `\name{}`{=latex} achieves the strongest performance in the majority of the tasks and has the highest aggregate success rate among the generalist policies. RT-2-X also shows good performance, outperforming RT-1-X and Octo, though it does not perform as well as `\name{}`{=latex}. RT-1-X and Octo generally experience difficulty in these generalization tasks.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llccccc}
\toprule
Category & Task & \# Trials & \makecell{RT-1-X\\\# Successes} & \makecell{Octo\\\# Successes} & \makecell{RT-2-X\\\# Successes} & \makecell{\name{}~(ours)\\\# Successes} \\
\midrule
Visual gen & Put Eggplant into Pot (Easy Version) & 10 & 1 & 5 & 7 & \textbf{10} \\
Visual gen & Put Eggplant into Pot & 10 & 0 & 1 & 5 & \textbf{10} \\
Visual gen & Put Cup from Counter into Sink & 10 & 1 & 1 & 0 & \textbf{7} \\
Visual gen & Put Eggplant into Pot (w/ Clutter) & 10 & 1 & 3.5 & 6 & \textbf{7.5} \\
Visual gen & Put Yellow Corn on Pink Plate & 10 & 1 & 4 & 8 & \textbf{9} \\
Motion gen & Lift Eggplant & 10 & 3 & 0.5 & 6.5 & \textbf{7.5} \\
Motion gen & Put Carrot on Plate (w/ Height Change) & 10 & 2 & 1 & \textbf{4.5} & \textbf{4.5} \\
Physical gen & Put Carrot on Plate & 10 & 1 & 0 & 1 & \textbf{8} \\
Physical gen & Flip Pot Upright & 10 & 2 & 6 & 5 & \textbf{8} \\
Physical gen & Lift AAA Battery & 10 & 0 & 0 & 2 & \textbf{7} \\
Semantic gen & Move Skull into Drying Rack & 10 & 1 & 0 & \textbf{5} & \textbf{5} \\
Semantic gen & Lift White Tape & 10 & \textbf{3} & 0 & 0 & 1 \\
Semantic gen & Take Purple Grapes out of Pot & 10 & \textbf{6} & 0 & 5 & 4 \\
Semantic gen & Stack Blue Cup on Pink Cup & 10 & 0.5 & 0 & \textbf{5.5} & 4.5 \\
Language grounding & Put \{Eggplant, Red Bottle\} into Pot & 10 & 2.5 & 4 & \textbf{8.5} & 7.5 \\
Language grounding & Lift \{Cheese, Red Chili Pepper\} & 10 & 1.5 & 2.5 & 8.5 & \textbf{10} \\
Language grounding & Put \{Blue Cup, Pink Cup\} on Plate & 10 & 5 & 5.5 & 8.5 & \textbf{9.5} \\
\midrule
&& Mean Success Rate & \cellcolor{lightlightgray}18.5$\pm$2.7\% & \cellcolor{lightlightgray}20.0$\pm$2.6\% & \cellcolor{lightlightgray}50.6$\pm$3.5\% & \cellcolor{lightlightgray}\textbf{70.6$\pm$3.2\%} \\
\bottomrule
\end{tabular}}
```
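
The mean success rates in the last row of the table above follow directly from the per-task success counts: each method's counts are summed and divided by the total number of trials (17 tasks with 10 trials each). The short sketch below reproduces that aggregation from the values in the table. It does not reproduce the standard errors, which would require the individual rollout outcomes.

```python
# Per-task success counts (out of 10 trials each), copied from the table above.
successes = {
    "RT-1-X": [1, 0, 1, 1, 1, 3, 2, 1, 2, 0, 1, 3, 6, 0.5, 2.5, 1.5, 5],
    "Octo": [5, 1, 1, 3.5, 4, 0.5, 1, 0, 6, 0, 0, 0, 0, 0, 4, 2.5, 5.5],
    "RT-2-X": [7, 5, 0, 6, 8, 6.5, 4.5, 1, 5, 2, 5, 0, 5, 5.5, 8.5, 8.5, 8.5],
    "ours": [10, 10, 7, 7.5, 9, 7.5, 4.5, 8, 8, 7, 5, 1, 4, 4.5, 7.5, 10, 9.5],
}
TRIALS_PER_TASK = 10


def mean_success_rate(counts, trials_per_task=TRIALS_PER_TASK):
    """Aggregate success rate in percent across all tasks and trials."""
    return 100.0 * sum(counts) / (trials_per_task * len(counts))


rates = {name: round(mean_success_rate(c), 1) for name, c in successes.items()}
print(rates)
```
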
Additionally, in `\cref{app:tab:detailed_quantized_inference_results}`{=latex}, we provide the full evaluation results for the quantized inference experiments that were summarized in `\cref{tab:quantized_results}`{=latex}. For these evaluations, we test policies on 8 representative BridgeData V2 tasks spanning all task categories in the full evaluation suite.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llcccc}
\toprule
Category & Task & \# Trials & \makecell{bfloat16\\\# Successes} & \makecell{int8\\\# Successes} & \makecell{int4\\\# Successes} \\
\midrule
Visual gen & Put Eggplant into Pot (Easy Version) & 10 & \textbf{9} & 7 & \textbf{9} \\
Visual gen & Put Eggplant into Pot & 10 & \textbf{7} & \textbf{7} & \textbf{7} \\
Visual gen & Put Cup from Counter into Sink & 10 & 5 & 3 & \textbf{7} \\
Motion gen & Lift Eggplant & 10 & 6 & 4 & \textbf{7.5} \\
Physical gen & Put Carrot on Plate & 10 & 6 & 5 & \textbf{7} \\
Physical gen & Lift AAA Battery & 10 & \textbf{7} & 5 & 3 \\
Semantic gen & Take Purple Grapes out of Pot & 10 & 8 & 8 & \textbf{9} \\
Language grounding & Put \{Eggplant, Red Bottle\} into Pot & 10 & \textbf{9} & 7.5 & 8 \\
\midrule
&& Mean Success Rate & \cellcolor{lightlightgray}\textbf{71.3 $\pm$ 4.8\%} & \cellcolor{lightlightgray}58.1 $\pm$ 5.1\% & \cellcolor{lightlightgray}\textbf{71.9 $\pm$ 4.7\%} \\
\bottomrule
\end{tabular}}
```
Google Robot Evaluation Details {#sec:app:rt1_robot_evaluation}
-------------------------------

In this section, we provide more details on the Google robot evaluations introduced in `\cref{sec:zero_shot_exp}`{=latex}.

### Google Robot Evaluation Tasks {#sec:app:rt1_robot_evaluation_tasks}

```{=latex}
\centering
```
![**Google robot evaluation tasks.** We evaluate every generalist robot policy on in-distribution tasks and out-of-distribution (OOD) generalization tasks. OOD tasks involve unseen backgrounds, target objects, instructions/object relations, and semantic concepts (e.g., photos from the Internet that do not appear in robot action data).](figures/rt1_robot_tasks.jpeg){#fig:app:rt1_robot_tasks width="\\linewidth"}

On the Google robot, we evaluate each generalist robot policy on 12 tasks with 5 rollouts each, for a total of 60 rollouts. The first five tasks test on in-distribution conditions, and the last seven tasks test on more difficult out-of-distribution (OOD) conditions. All tasks are depicted in `\cref{fig:app:rt1_robot_tasks}`{=latex}. Each rollout is marked as a failure (0) or success (1).

We describe the 12 tasks below:

1.  **Pick Coke Can** (in-distribution): The robot is positioned in front of a platform with a can of Coke on top of it. The robot's goal is to grasp and lift the Coke can.

2.  **Move Apple near Green Can** (in-distribution): The robot is positioned in front of a platform with an apple and a green soda can on top of it. The robot's goal is to grasp the apple and move it next to the green can.

3.  **Move Blue Chip Bag near Apple** (in-distribution): The robot is positioned in front of a platform with a blue bag of chips and an apple on top of it. The robot's goal is to grasp the blue bag of chips and move it close to the apple.

4.  **Place Coke Can Upright** (in-distribution): The robot is positioned in front of a platform with a can of Coke on top of it, and the can is oriented horizontally on its side. The robot's goal is to grasp the Coke can and orient it to be in a vertical position.

5.  **Open Middle Drawer** (in-distribution): The robot is positioned in front of a set of three drawers. The robot's goal is to grasp the middle drawer handle and pull the drawer open.

6.  **Move Orange near Brown Chip Bag** (OOD): The robot is positioned in front of a platform with a brown bag of chips and an orange on top of it. A tablecloth with blue sky and white cloud patterns covers the platform underneath the objects. The robot's goal is to grasp the orange and bring it next to the bag of chips. This task is OOD because the orange is an unseen object relative to the training dataset, and the tablecloth is an unseen background.[^7]

7.  **Pick Pepsi Can** (OOD): The robot is positioned in front of a platform with a can of Pepsi on top of it. A tablecloth with bright yellow/brown patterns covers the platform underneath the can. The robot's goal is to grasp and lift the can. This task is OOD because the Pepsi can is an unseen object, and the tablecloth is an unseen background.

8.  **Pick Banana** (OOD): The robot is positioned in front of a platform with an apple, a can of Coke, and a banana. The robot's goal is to grasp and lift the banana. This task is OOD because the banana is an unseen target object.

9.  **Pick Green Cup** (OOD): The robot is positioned in front of a platform with a banana, a can of Pepsi, and a green cup. The robot's goal is to grasp and lift the green cup. This task is OOD because all objects in the scene are unseen in the training data.

10. **Place Apple on Plate** (OOD): The robot is positioned in front of a platform with a plate and an apple. The robot's goal is to grasp the apple and move it onto the plate. This task is OOD because it is a novel instruction describing an unseen object relation: training demonstrations only cover moving the apple *near* the plate, rather than placing it *on top of* the plate.

11. **Place Banana in Pan** (OOD): The robot is positioned in front of a platform with a pan and a banana. The robot's goal is to grasp the banana and move it into the pan. This task is OOD because the banana is an unseen target object, and it is a novel instruction describing an unseen object relation, as explained in the previous task.

12. **Move Coke Can to Taylor Swift** (OOD): The robot is positioned in front of a platform with a can of Coke and photos of three different celebrities, including Taylor Swift. The robot's goal is to grasp the can and move it to the photo of Taylor Swift. This task is OOD because the photos of the celebrities are unseen in the robot interaction data.

### Detailed Google Robot Evaluation Results {#sec:app:detailed_rt1_robot_results}

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llccccc}
\toprule
Category & Task & \# Trials & \makecell{RT-1-X\\\# Successes} & \makecell{Octo\\\# Successes} & \makecell{RT-2-X\\\# Successes} & \makecell{\name{}~(ours)\\\# Successes} \\
\midrule
In-distribution & Pick Coke Can & 5 & \textbf{5} & 1 & \textbf{5} & \textbf{5} \\
In-distribution & Move Apple near Green Can & 5 & 3 & 3 & 3 & \textbf{5} \\
In-distribution & Move Blue Chip Bag near Apple & 5 & 0 & 3 & 4 & \textbf{5} \\
In-distribution & Place Coke Can Upright & 5 & 0 & 0 & \textbf{4} & \textbf{4} \\
In-distribution & Open Middle Drawer & 5 & 0 & \textbf{4} & 2 & 3 \\
OOD & Move Orange near Brown Chip Bag & 5 & 1 & 2 & \textbf{5} & \textbf{5} \\
OOD & Pick Pepsi Can & 5 & 3 & 0 & \textbf{5} & 4 \\
OOD & Pick Banana & 5 & \textbf{5} & 3 & \textbf{5} & \textbf{5} \\
OOD & Pick Green Cup & 5 & 1 & 0 & \textbf{5} & \textbf{5} \\
OOD & Place Apple on Plate & 5 & 0 & 0 & \textbf{4} & \textbf{4} \\
OOD & Place Banana in Pan & 5 & 0 & 0 & 2 & \textbf{4} \\
OOD & Move Coke Can near Taylor Swift & 5 & 2 & 0 & \textbf{3} & 2 \\
\midrule
&& Mean Success Rate & \cellcolor{lightlightgray}33.3$\pm$6.1\% & \cellcolor{lightlightgray}26.7$\pm$5.8\% & \cellcolor{lightlightgray}\textbf{78.3$\pm$5.4\%} & \cellcolor{lightlightgray}\textbf{85.0$\pm$4.6\%}\\
\bottomrule
\end{tabular}}
```
Full results for the Google robot evaluations are shown in `\cref{app:tab:rt1_robot_results_detailed}`{=latex}. Overall, we find that RT-1-X and Octo experience difficulty on the evaluation tasks: on several tasks, they fail to achieve a single success in five trials. On the other hand, RT-2-X and `\name{}`{=latex} demonstrate strong performance, completing every task at least two times out of five trials; these two VLA policies perform comparably with each other on this particular evaluation suite.
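
The $\pm$ values in the Mean Success Rate row appear consistent with the standard error of the mean over all 60 binary rollout outcomes, computed with the sample standard deviation. This is an inference from the reported numbers rather than a procedure stated in the text. The minimal sketch below reproduces the reported values from the per-task success counts; the function name `mean_and_stderr` and the dictionary layout are illustrative assumptions.

```python
import math

# Per-task success counts (out of 5 trials each), copied from the table above.
SUCCESSES = {
    "RT-1-X": [5, 3, 0, 0, 0, 1, 3, 5, 1, 0, 0, 2],
    "Octo": [1, 3, 3, 0, 4, 2, 0, 3, 0, 0, 0, 0],
    "RT-2-X": [5, 3, 4, 4, 2, 5, 5, 5, 5, 4, 2, 3],
    "OpenVLA": [5, 5, 5, 4, 3, 5, 4, 5, 5, 4, 4, 2],
}
TRIALS_PER_TASK = 5


def mean_and_stderr(task_successes, trials_per_task=TRIALS_PER_TASK):
    """Return (mean success rate, standard error), both in percent.

    Assumption: every rollout is a binary outcome, and the standard error
    uses the sample variance (n - 1 in the denominator).
    """
    n = len(task_successes) * trials_per_task  # total number of rollouts
    p = sum(task_successes) / n                # mean success rate in [0, 1]
    sample_variance = p * (1.0 - p) * n / (n - 1)
    stderr = math.sqrt(sample_variance / n)
    return 100.0 * p, 100.0 * stderr


for policy, counts in SUCCESSES.items():
    mean, stderr = mean_and_stderr(counts)
    print(f"{policy}: {mean:.1f} +/- {stderr:.1f}%")
```

With only 60 rollouts per policy, the resulting intervals for RT-2-X and `\name{}`{=latex} overlap, which matches the statement that the two policies perform comparably on this suite.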

Data-Efficient Adaptation Experiment Details {#sec:app:franka_evaluation}
--------------------------------------------

In this section, we provide more details on the data-efficient adaptation experiments discussed in `\cref{sec:finetuning_exp}`{=latex}, where we investigate the effectiveness of fine-tuned `\name{}`{=latex} policies on new robot setups such as Franka-Tabletop and Franka-DROID.

### Franka-Tabletop and Franka-DROID Tasks {#sec:app:franka_evaluation_tasks}

We collect 10--150 demonstrations of each of seven tasks. The first six tasks correspond to a robot setup which we denote as "Franka-Tabletop" (Franka Emika Panda robot mounted on top of a table), and the final task corresponds to a robot setup which we call "Franka-DROID".

In the Franka-Tabletop setup, the first three of six tasks correspond to single-instruction tasks and are narrow, while the last three tasks correspond to multi-instruction tasks in which multiple objects are present in the scene and the robot must manipulate the correct one depending on the language instruction.

```{=latex}
\centering
```
![**Franka-Tabletop fine-tuning tasks.** Franka-Tabletop tasks used in the data-efficient adaptation experiments in `\cref{sec:finetuning_exp}`{=latex} and described in detail in `\cref{sec:app:franka_evaluation_tasks}`{=latex} are depicted above. The first three of six tasks, shown in the top three rows, only involve a single instruction, while the last three tasks in the bottom three rows involve multiple objects and instructions (the instructions specify the target object or target location). The first column shows sample initial states matching the training data distribution, while the second column shows out-of-distribution (OOD) initial states (`\eg `{=latex}unseen backgrounds, target objects, distractors, and object positions/orientations). Every policy in `\cref{sec:finetuning_exp}`{=latex} is evaluated with 10--12 rollouts on in-distribution tasks and 5--6 rollouts on OOD tasks.](figures/franka_tabletop_tasks.png){#fig:app:franka_tabletop_tasks width="0.8\\linewidth"}

Below we describe each of the six Franka-Tabletop tasks shown in `\cref{fig:app:franka_tabletop_tasks}`{=latex}:

1.  **Put Carrot in Bowl** (single-instruction): The robot's goal is to grasp the carrot and place it into the bowl. We collect 50 demonstrations of this task for the training dataset, randomly placing the carrot and the bowl at different locations on the table in every episode. The carrot is always initialized on the left side of the bowl. During evaluation, each trial is recorded as a success (1) or failure (0); there is no partial credit.

2.  **Pour Corn into Pot** (single-instruction): The robot's goal is to grasp the red bowl, move towards the steel pot, and pour the contents (a yellow corn) into the pot. We collect 50 demonstrations of this task for the training dataset, randomly placing the bowl and the pot at different locations on the table in every episode. The bowl is always initialized on the right side of the pot. During evaluation, each trial is recorded as a success (1) or failure (0); there is no partial credit.

3.  **Flip Pot Upright** (single-instruction): The robot's goal is to grasp the steel pot (which is initially oriented vertically), rotate it to be in the upright position, and place it back onto the table. We collect only 10 demonstrations of this task for the training dataset, randomly placing the steel pot at various locations within a small section of the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial successes include grasping the pot but not orienting it upright, or knocking it over to the upright position but not carefully guiding it. The robot must release the pot at the end of the episode for full credit.

4.  **Move \<object\> onto Plate** (multi-instruction): The robot's goal is to grasp one out of three objects (depending on the target specified in the language instruction) and place it on the plate on the right side of the table. We collect 150 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table and selecting one as the target. The plate is always initialized on the right side of the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot makes contact with is the correct target object (`\ie `{=latex}the object specified in the language instruction), but the robot does not complete the task.

5.  **Knock \<object\> Over** (multi-instruction): The robot's goal is to approach one out of three objects (depending on the target specified in the language instruction) and push it until it falls over. We collect 70 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table and selecting one as the target. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot makes contact with is the correct target object (`\ie `{=latex}the object specified in the language instruction), but the robot does not complete the task.

6.  **Cover \<object\> with Towel** (multi-instruction): The robot's goal is to grasp the blue towel and place it on one out of three objects (depending on the target specified in the language instruction). We collect 45 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot touches with the towel is the correct target object (`\ie `{=latex}the object specified in the language instruction), but the robot does not complete the task (`\eg `{=latex}it drops the towel onto the table instead of on top of the target object). Full credit is given when any part of the towel is resting over the top surface of the target object, `\ie `{=latex}the object does not need to be fully covered.

For every Franka-Tabletop task, we evaluate each method with 10--12 in-distribution trials and 5--6 OOD generalization trials. The in-distribution and OOD test conditions are depicted in the first and second columns of `\cref{fig:app:franka_tabletop_tasks}`{=latex}, respectively.

We describe the OOD test conditions for each of the six tasks below:

1.  Put Carrot in Bowl (OOD): An eggplant (unseen object) replaces the carrot.

2.  Pour Corn into Pot (OOD): An unseen brown tablecloth covers the tabletop.

3.  Flip Pot Upright (OOD): An unseen white tablecloth covers the tabletop.

4.  Move \<object\> onto Plate (OOD): A set of three unseen objects is placed on the table.

5.  Knock \<object\> Over (OOD): Two unseen distractor objects (red plastic cup and brown box) are positioned behind the set of three seen objects.

6.  Cover \<object\> with Towel (OOD): The three objects on the table are placed upside-down and at unseen positions.

Finally, in the Franka-DROID environment, we experiment with one task and variants of it: **Wipe Table** (see `\cref{fig:app:droid_wipe_task}`{=latex}). In this task, the robot's goal is to grab the brush and sweep all three small brown objects into the dustpan. We collect 70 demonstrations for this task for the training dataset, varying the positions of all the objects.

```{=latex}
\centering
```
![**Franka-DROID fine-tuning task.** The "Wipe Table" task shown here is the final task used in the data-efficient adaptation experiments in `\cref{sec:finetuning_exp}`{=latex}. The left image shows the initial conditions for an in-distribution trial. The right image shows an out-of-distribution trial in which unseen distractor objects are present on the table. To fully complete the task, the robot must grab the brush and sweep all three objects into the dustpan.](figures/droid_wipe_task.jpeg){#fig:app:droid_wipe_task width="0.6\\linewidth"}

At test time, we evaluate on in-distribution conditions matching the training data (`\cref{fig:app:droid_wipe_task}`{=latex}, left), as well as out-of-distribution (OOD) conditions in which distractor objects are also present in the scene on the table (`\cref{fig:app:droid_wipe_task}`{=latex}, right). Since there are various possible outcomes for each trial, we define a scoring rubric as follows: The maximum score for each trial is 2 points. The policy receives the full 2 points if the robot sweeps all three objects into the dustpan. It receives 1 point for successfully sweeping one or two objects into the dustpan. Otherwise, it receives 0 points. We evaluate each policy with 18 in-distribution trials and 12 OOD trials, so each policy receives an aggregate score out of 60 points.
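
The scoring rubric above can be sketched as a small function. This is a minimal illustration rather than the evaluation code used for the experiments; the names `wipe_table_score` and `aggregate_percent`, and the example outcomes, are hypothetical.

```python
MAX_POINTS_PER_TRIAL = 2
NUM_OBJECTS = 3


def wipe_table_score(num_objects_swept: int) -> int:
    """Score one "Wipe Table" trial according to the rubric in the text."""
    if not 0 <= num_objects_swept <= NUM_OBJECTS:
        raise ValueError("num_objects_swept must be between 0 and 3")
    if num_objects_swept == NUM_OBJECTS:
        return 2  # all three objects swept into the dustpan
    if num_objects_swept >= 1:
        return 1  # one or two objects swept into the dustpan
    return 0      # nothing swept into the dustpan


def aggregate_percent(objects_swept_per_trial) -> float:
    """Aggregate trial scores into a percentage of the maximum score."""
    total = sum(wipe_table_score(n) for n in objects_swept_per_trial)
    return 100.0 * total / (MAX_POINTS_PER_TRIAL * len(objects_swept_per_trial))


# Made-up outcomes for four trials (not real results): 3, 2, 0, and 1 objects swept.
print(aggregate_percent([3, 2, 0, 1]))  # 4 of 8 possible points
```

With 18 in-distribution and 12 OOD trials, the maximum aggregate is 2 x 30 = 60 points, matching the denominator stated above.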

### Detailed Franka-Tabletop and Franka-DROID Evaluation Results {#sec:app:detailed_franka_evaluation_results}

Full evaluation results for both Franka-Tabletop and Franka-DROID evaluations are shown in `\cref{app:tab:detailed_finetune_results}`{=latex}. We evaluate the methods discussed in `\cref{sec:finetuning_exp}`{=latex}. We find that Diffusion Policy demonstrates strong performance on the single-instruction Franka-Tabletop tasks (`\eg `{=latex}"Put Carrot in Bowl" and "Pour Corn into Pot"), outperforming other methods. However, `\name{}`{=latex} and Octo achieve higher performance in the more diverse multi-instruction tasks ("Move \<object\> onto Plate", "Knock \<object\> Over", and "Cover \<object\> with Towel"). In the Franka-DROID environment, `\name{}`{=latex} obtains the best results. Overall, we find that `\name{}`{=latex} achieves the highest average performance across both robot setups.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{\begin{tabular}{llcccccc}
\toprule
&& \# trials & Diffusion Policy & \makecell{Diffusion Policy\\(matched)} & Octo & \makecell{\name{}\\(scratch)} & \makecell{\name{}\\(ours)} \\
\midrule
Franka-Tabletop (5Hz) & ``Put Carrot in Bowl'' (in-distribution) & 10 & \textbf{90.0\%} & 80.0\% & 40.0\% & 70.0\% & 70.0\% \\
& ``Put Carrot in Bowl'' (OOD) & 5 & 20.0\% & 0.0\% & 20.0\% & 0.0\% & \textbf{40.0\%} \\
& ``Pour Corn into Pot'' (in-distribution) & 10 & \textbf{100.0\%} & 90.0\% & 0.0\% & 10.0\% & 50.0\% \\
& ``Pour Corn into Pot'' (OOD) & 5 & \textbf{80.0\%} & 60.0\% & 0.0\% & 20.0\% & 60.0\% \\
& ``Flip Pot Upright'' (in-distribution) & 10 & \textbf{100.0\%} & 85.0\% & 40.0\% & 85.0\% & \textbf{100.0\%} \\
& ``Flip Pot Upright'' (OOD) & 5 & 50.0\% & 20.0\% & 0.0\% & 40.0\% & \textbf{80.0\%} \\
& ``Move <object> onto Plate'' (in-distribution) & 12 & 25.0\% & 25.0\% & 41.7\% & 8.3\% & \textbf{75.0\%} \\
& ``Move <object> onto Plate'' (OOD) & 6 & 8.3\% & 33.3\% & 8.3\% & 33.3\% & \textbf{58.3\%} \\
& ``Knock <object> Over'' (in-distribution) & 12 & 33.3\% & 25.0\% & \textbf{83.3\%} & 75.0\% & 75.0\% \\
& ``Knock <object> Over'' (OOD) & 6 & 16.7\% & 16.7\% & 33.3\% & 58.3\% & \textbf{83.3\%} \\
& ``Cover <object> with Towel'' (in-distribution) & 12 & 16.7\% & 20.8\% & \textbf{91.7\%} & 41.7\% & 50.0\% \\
& ``Cover <object> with Towel'' (OOD) & 6 & 16.7\% & 33.3\% & \textbf{91.7\%} & 50.0\% & 50.0\% \\
\midrule
& Average & & \cellcolor{lightlightgray} 48.5$\pm$4.9\% & \cellcolor{lightlightgray} 43.4$\pm$4.7\% & \cellcolor{lightlightgray} 43.4$\pm$4.4\% & \cellcolor{lightlightgray} 43.4$\pm$4.6\% & \cellcolor{lightlightgray} \textbf{67.2$\pm$4.0\%} \\
\midrule
Franka-DROID (15Hz) & ``Wipe Table'' (in-distribution) & 18 & 50.0\% & 27.8\% & 52.8\% & 25.0\% & 55.6\% \\
& ``Wipe Table'' + Distractors (OOD) & 12 & 12.5\% & 25.0\% & 16.7\% & 16.7\% & 62.5\% \\
\midrule
& Average & & \cellcolor{lightlightgray} 35.0$\pm$8.0\% & \cellcolor{lightlightgray} 26.7$\pm$7.5\% & \cellcolor{lightlightgray} 38.3$\pm$8.5\% & \cellcolor{lightlightgray} 21.7$\pm$6.6\% & \cellcolor{lightlightgray} \textbf{58.3$\pm$7.2\%} \\
\bottomrule
\end{tabular}}
```
Additionally, in `\cref{app:tab:detailed_peft_results}`{=latex}, we show the detailed version of the parameter-efficient fine-tuning experiment results summarized in `\cref{tab:partial_finetune_results}`{=latex}. In these experiments, we use a representative subset of two Franka-Tabletop tasks, with both in-distribution and OOD variants: one narrow single-instruction task ("Put Carrot in Bowl") and one diverse multi-instruction task ("Move \<object\> onto Plate"). We use the same number of training demonstrations used in `\cref{sec:finetuning_exp}`{=latex} (50 and 150, respectively), which is delineated in `\cref{sec:app:franka_evaluation_tasks}`{=latex}.
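
The Average row of `\cref{app:tab:detailed_peft_results}`{=latex} appears to be a trial-weighted mean of the per-condition success rates, so that conditions with more trials count proportionally more. This is an inference from the reported numbers, not a procedure stated in the text. The sketch below checks it against the table; `trial_weighted_average` is an illustrative name.

```python
# Trial counts and per-condition success rates (%), copied from the table.
# Row order: carrot (in-dist), carrot (OOD), plate (in-dist), plate (OOD).
TRIALS = [10, 5, 12, 6]
RATES = {
    "Full FT": [90.0, 40.0, 79.2, 41.7],
    "Last layer only": [40.0, 0.0, 33.3, 33.3],
    "Frozen vision": [40.0, 40.0, 50.0, 58.3],
    "Sandwich": [90.0, 0.0, 75.0, 41.7],
    "LoRA, r=32": [60.0, 60.0, 75.0, 75.0],
    "LoRA, r=64": [90.0, 40.0, 62.5, 66.7],
}


def trial_weighted_average(rates, trials):
    """Average success rate (%) in which each condition is weighted by its trial count."""
    return sum(r * t for r, t in zip(rates, trials)) / sum(trials)


for method, rates in RATES.items():
    print(f"{method}: {trial_weighted_average(rates, TRIALS):.1f}%")
```

An unweighted mean of the four rows would give a different figure (for example, about 62.7% instead of 69.7% for full fine-tuning), so the weighting matters when comparing against the reported averages.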

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{\begin{tabular}{llcc|ccccc}
\toprule
&& \# trials & \makecell{Full FT} & Last layer only & \makecell{Frozen vision} & \makecell{Sandwich} & LoRA, r=32 & LoRA, r=64 \\
\midrule
Franka-Tabletop (5Hz) & ``Put Carrot in Bowl'' (in-distribution) & 10 & 90.0 & 40.0 & 40.0 & 90.0 & 60.0 & 90.0 \\
& ``Put Carrot in Bowl'' (OOD) & 5 & 40.0 & 0.0 & 40.0 & 0.0 & 60.0 & 40.0 \\
& ``Move <object> onto Plate'' (in-distribution) & 12 & 79.2 & 33.3 & 50.0 & 75.0 & 75.0 & 62.5 \\
& ``Move <object> onto Plate'' (OOD) & 6 & 41.7 & 33.3 & 58.3 & 41.7 & 75.0 & 66.7 \\
\midrule
& Average & & \cellcolor{lightlightgray}\textbf{69.7$\pm$7.2\%} & \cellcolor{lightlightgray} 30.3$\pm$6.1\% & \cellcolor{lightlightgray} 47.0$\pm$6.9\% & \cellcolor{lightlightgray}62.1$\pm$7.9\% & \cellcolor{lightlightgray} \textbf{68.2$\pm$7.5\%} & \cellcolor{lightlightgray} \textbf{68.2$\pm$7.8\%} \\
\bottomrule
\end{tabular}}
```
RT-2-X vs. OpenVLA in BridgeData V2 Evaluations {#sec:app:rt2x_vs_openvla_in_bridge}
===============================================

In this section, we provide additional details on RT-2-X vs. `\name{}`{=latex} comparisons in BridgeData V2 evaluations discussed in `\cref{sec:zero_shot_exp}`{=latex}. As discussed previously, `\name{}`{=latex} is pretrained on a larger subset of OpenX data than RT-2-X and uses a fused SigLIP-DinoV2 vision backbone rather than a single visual encoder. However, in addition to these factors, we believe that `\name{}`{=latex}'s significant improvement upon RT-2-X specifically in BridgeData V2 evaluations (as shown in `\cref{fig:bridge_results}`{=latex}) also stems from more careful preprocessing of the Bridge dataset.

During the development of the `\name{}`{=latex} model, we discovered that the original version of the BridgeData V2 dataset contained many transitions with all-zero (no-op) actions. For instance, in every demonstration, an all-zero action was recorded as the ground-truth action in the first timestep. Consequently, training a highly expressive VLA model on the original dataset without any data preprocessing led to a policy that frequently predicted all-zero actions and froze during evaluations. Therefore, we simply filtered out the first transition in every demonstration when training the `\name{}`{=latex} model, and this was sufficient for mitigating the freezing behavior in most cases.
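
The filtering step described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the released preprocessing code: an episode is represented here as a dictionary of per-timestep lists (a hypothetical layout), and the helper names are our own.

```python
from typing import Dict, List, Sequence

Episode = Dict[str, list]  # hypothetical layout: each key maps to a per-timestep list


def is_noop(action: Sequence[float], atol: float = 1e-8) -> bool:
    """True if every action dimension is (numerically) zero."""
    return all(abs(a) <= atol for a in action)


def drop_first_transition(episode: Episode) -> Episode:
    """Remove the first timestep from every field of the episode."""
    return {key: values[1:] for key, values in episode.items()}


def noop_fraction(episodes: List[Episode]) -> float:
    """Fraction of all transitions whose action is all zeros."""
    actions = [a for ep in episodes for a in ep["actions"]]
    return sum(is_noop(a) for a in actions) / len(actions)


# Toy data (not real robot data): each episode starts with an all-zero action.
episodes = [
    {"observations": ["o0", "o1", "o2"],
     "actions": [[0.0, 0.0], [0.1, 0.0], [0.0, -0.2]]},
    {"observations": ["o0", "o1", "o2"],
     "actions": [[0.0, 0.0], [0.3, 0.1], [0.2, 0.0]]},
]
filtered = [drop_first_transition(ep) for ep in episodes]
print(noop_fraction(episodes), noop_fraction(filtered))
```

Dropping only the first transition leaves any later all-zero actions in place, which is consistent with the observation above that this step mitigated the freezing behavior in most, but not all, cases.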

However, the RT-2-X model was trained without such data preprocessing, so it often suffers from the aforementioned freezing behavior if deployed out of the box without modifying the model querying procedure, which severely deteriorates rollout performance. Since this is a proprietary model that is infeasible for us to re-train (e.g., with our preprocessed version of the BridgeData V2 dataset), we mitigated this issue by simply querying the *second-most-likely* action from the model, since the first-most-likely action was often all zeros while the second-most-likely action was not. (Note that this is the same workaround that was applied by the developers of the RT-2-X model for BridgeData V2 evaluations reported in the Open X-Embodiment experiments [@open_x_embodiment_rt_x_2023].) This workaround led to much stronger RT-2-X performance on BridgeData V2 evaluations, though we believe that it is still suboptimal compared to re-training the model on the preprocessed version of the dataset.

We also tried to *dynamically* query RT-2-X, i.e., by first sampling the first-most-likely action and then sampling the second-most-likely action if the first one was all zeros. However, we empirically found that dynamic querying led to worse performance than simply querying the second-most-likely action at all times. We hypothesize that this is due to a change in the robot's dynamics that arises from dynamic querying: pausing in the middle of a trajectory to re-query the model leads to slight interruptions in the robot's movement due to non-negligible latency in the querying pipeline, and this leads to subtle performance degradation. Therefore, we report the performance of RT-2-X when always querying the second-most-likely action, as done in the Open X-Embodiment project [@open_x_embodiment_rt_x_2023].
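
The two querying strategies can be contrasted with a toy sketch. The RT-2-X querying interface is proprietary and is not described here, so everything below is a hypothetical stand-in: the model output is represented as a list of `(action, probability)` candidates, and all function names are our own.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[Sequence[float], float]  # (action, probability); hypothetical format


def kth_most_likely(candidates: List[Candidate], k: int) -> Sequence[float]:
    """Return the action with the k-th highest probability (k = 1 is the mode)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[k - 1][0]


def always_second(candidates: List[Candidate]) -> Sequence[float]:
    """Strategy used for the reported results: always take the second-ranked action."""
    return kth_most_likely(candidates, 2)


def dynamic(candidates: List[Candidate]) -> Sequence[float]:
    """Take the top action unless it is all zeros, in which case fall back to rank 2.

    In a real pipeline the fallback would require a second model query; that
    extra latency is not modeled in this sketch.
    """
    first = kth_most_likely(candidates, 1)
    if all(a == 0 for a in first):
        return kth_most_likely(candidates, 2)
    return first


# Toy candidates (not real model outputs): the most likely action is a no-op.
frozen_step = [((0.0, 0.0), 0.5), ((0.1, 0.0), 0.3), ((0.0, 0.2), 0.2)]
print(always_second(frozen_step), dynamic(frozen_step))
```

The two strategies agree whenever the top-ranked action is all zeros and differ otherwise. The difference in rollout performance reported above is attributed to the latency of re-querying, which a function-level sketch like this cannot capture.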

Additional Experiments and Ablations {#sec:app:additional_ablation_experiments}
====================================

In this section, we conduct several additional experiments to analyze the effects of individual components of the `\name{}`{=latex} model architecture and training scheme, as well as provide quantitative evidence for claims made in earlier sections of this work. We aim to answer the following questions:

1.  How important is OpenX training and how does it impact OpenVLA's performance (`\cref{sec:app:openx_pretraining_ablation}`{=latex})?

2.  What effect does using a fused SigLIP-DinoV2 vision encoder have on `\name{}`{=latex}'s performance, compared to using a SigLIP-only vision encoder (`\cref{sec:app:dual_vs_single_vision_encoder}`{=latex})?

3.  Is it better to fine-tune or freeze the vision encoder in `\name{}`{=latex} (`\cref{sec:app:finetuned_vs_frozen_vision_encoder}`{=latex})?

4.  How do the quantized inference results discussed in `\cref{sec:param_efficient_finetuning}`{=latex} change when policy performance is disentangled from model inference speed (`\cref{sec:app:additional_quantized_inference_experiments}`{=latex})?

We discuss the experimental setup and results addressing each of the above questions sequentially in the following sections.

OpenX Training Data Ablation Experiments {#sec:app:openx_pretraining_ablation}
----------------------------------------

As discussed in `\cref{sec:data_mix}`{=latex}, `\name{}`{=latex} is trained on a large dataset of robot embodiments, scenes, and tasks from the Open X-Embodiment dataset [@open_x_embodiment_rt_x_2023] (OpenX). In this section, we ablate the OpenX mixture and train a VLA policy solely on one robot dataset, to assess the impact of OpenX training on policy performance. Note that we have already observed the negative effect of ablating OpenX training in the fine-tuning regime, as discussed in `\cref{sec:finetuning_exp}`{=latex} (see `\name{}`{=latex} (Scratch)), but we discuss additional experiments on another robot embodiment in this section to provide more supporting evidence.

**Experimental setup and tasks.** We compare the original `\name{}`{=latex} model with **`\name{}`{=latex}-Bridge**, which is produced by taking the same pretrained VLM as `\name{}`{=latex} (Prismatic VLM [@karamcheti2024prismatic]) and fine-tuning it solely on BridgeData V2 [@walke2023bridgedata] rather than the entire OpenX training mixture discussed in `\cref{sec:app:data_mix}`{=latex}. We evaluate `\name{}`{=latex} and `\name{}`{=latex}-Bridge on a subset of 8 representative tasks from the BridgeData V2 WidowX robot evaluation suite discussed in `\cref{sec:app:bridge_evaluation_tasks}`{=latex}. The tasks are listed in `\cref{app:tab:bridge_ablation_results}`{=latex}.

**Results.** Results for the OpenX training mixture ablation are shown in `\cref{app:tab:bridge_ablation_results}`{=latex}. Comparing `\name{}`{=latex} with `\name{}`{=latex}-Bridge, we see that performance drops drastically without OpenX training (a reduction of roughly 30 percentage points in absolute success rate, from 76.3% to 45.6%), which demonstrates the importance of OpenX pretraining for final policy performance. Although language grounding performance is not impacted, we observe performance reductions across all generalization categories. This result suggests that the large diversity of scenes, objects, and tasks in the OpenX training mixture is essential for unlocking improved generalization capabilities in the `\name{}`{=latex} model.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llcccc}
\toprule
Category & Task & \# Trials & \makecell{\name{}\\\# Successes} & \makecell{\name{}-Bridge\\\# Successes} & \makecell{\name{}-Bridge-SigLIP\\\# Successes} \\
\midrule
Visual gen & Put Eggplant into Pot (Easy Version) & 10 & 10 & 8 & 8 \\
Visual gen & Put Eggplant into Pot & 10 & 10 & 2 & 3 \\
Visual gen & Put Cup from Counter into Sink & 10 & 7 & 4 & 2 \\
Motion gen & Lift Eggplant & 10 & 7.5 & 5.5 & 6.5 \\
Physical gen & Put Carrot on Plate & 10 & 8 & 4 & 1 \\
Physical gen & Lift AAA Battery & 10 & 7 & 2 & 2 \\
Semantic gen & Take Purple Grapes out of Pot & 10 & 4 & 3 & 3 \\
Language grounding & Put \{Eggplant, Red Bottle\} into Pot & 10 & 7.5 & 8 & 7 \\
\midrule
&& Mean Success Rate & \cellcolor{lightlightgray}76.3 $\pm$ 4.8\% & \cellcolor{lightlightgray}45.6 $\pm$ 5.6\% & \cellcolor{lightlightgray}40.6 $\pm$ 5.5\% \\
\bottomrule
\end{tabular}}
```
Dual vs. Single Vision Encoder Experiments {#sec:app:dual_vs_single_vision_encoder}
------------------------------------------

The `\name{}`{=latex} model architecture consists of a fused vision backbone that combines the SigLIP [@zhai2023siglip] and DinoV2 [@oquab2023dinov2] encoders. In this section, we ablate the DinoV2 component to assess the importance of using a dual vision encoder.

**Experimental setup and tasks.** We instantiate a model, **`\name{}`{=latex}-Bridge-SigLIP**, which is a version of `\name{}`{=latex} that is trained only on BridgeData V2 and consists of only the SigLIP encoder as the vision backbone. We compare this model with the `\name{}`{=latex}-Bridge model discussed in the previous section (`\cref{sec:app:openx_pretraining_ablation}`{=latex}), which shares the same model architecture as the original `\name{}`{=latex} model and is only trained on Bridge robot data. Therefore, the only difference between `\name{}`{=latex}-Bridge-SigLIP and `\name{}`{=latex}-Bridge is that the former omits the DinoV2 encoder in the vision backbone. We evaluate these models on the same subset of 8 Bridge tasks described in the previous section.

**Results.** Results for the dual vision encoder ablation are shown in `\cref{app:tab:bridge_ablation_results}`{=latex}. The drop in performance from `\name{}`{=latex}-Bridge to `\name{}`{=latex}-Bridge-SigLIP implies that additionally including the DinoV2 encoder in the vision backbone improves policy performance. However, the 5 percentage point reduction in performance here (45.6% to 40.6%) is not as significant as the roughly 30 percentage point drop observed from ablating OpenX training. The low-level spatial features represented in DinoV2 appear to aid generalization in only some cases.
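
As a quick arithmetic check, the mean success rates and the two gaps discussed above can be recomputed from the per-task scores in `\cref{app:tab:bridge_ablation_results}`{=latex}. The sketch below is illustrative only; the dictionary keys and the helper name `mean_success_percent` are our own.

```python
TRIALS_PER_TASK = 10

# Per-task scores (out of 10 trials; 0.5 denotes partial credit), copied from the table.
SCORES = {
    "OpenVLA": [10, 10, 7, 7.5, 8, 7, 4, 7.5],
    "OpenVLA-Bridge": [8, 2, 4, 5.5, 4, 2, 3, 8],
    "OpenVLA-Bridge-SigLIP": [8, 3, 2, 6.5, 1, 2, 3, 7],
}


def mean_success_percent(scores, trials_per_task=TRIALS_PER_TASK):
    """Mean success rate (%) over all trials of all tasks."""
    return 100.0 * sum(scores) / (trials_per_task * len(scores))


means = {name: mean_success_percent(s) for name, s in SCORES.items()}
openx_gap = means["OpenVLA"] - means["OpenVLA-Bridge"]
dino_gap = means["OpenVLA-Bridge"] - means["OpenVLA-Bridge-SigLIP"]
print(means, openx_gap, dino_gap)
```

Both gaps are differences in absolute success rate (percentage points), not relative changes.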

Fine-Tuned vs. Frozen Vision Encoder Experiments {#sec:app:finetuned_vs_frozen_vision_encoder}
------------------------------------------------

As discussed in `\cref{sec:train_details}`{=latex}, prior work on VLMs observed higher performance from freezing the vision encoder than fine-tuning its parameters [@karamcheti2024prismatic]. However, when training `\name{}`{=latex}, we fine-tuned all 7B parameters in the model, including the SigLIP-DinoV2 vision backbone, as we discovered early on during development that fine-tuning the vision encoder led to higher-performing VLAs --- a finding which held across various pretrained VLMs and model architectures. We discuss details of such findings below.

**Experimental setup and tasks.** In this section, we report the performance of two VLA policies produced by fine-tuning two different pretrained models from the Prismatic VLMs [@karamcheti2024prismatic] repository on BridgeData V2. The two pretrained models are named **SigLIP ViT-SO 224px** and **LLaVa v1.5 7B (Reproduction)**; see @karamcheti2024prismatic for details on their architectures and training mixtures. We evaluate both policies on various Bridge tasks shown in `\cref{app:tab:frozen_vision_encoder_results}`{=latex}. Note that the evaluation configurations here differ from previously discussed Bridge evaluations, so the results are not directly comparable to results in other similar experiments.

**Results.** Results for the fine-tuned vs. frozen vision encoder experiments are shown in `\cref{app:tab:frozen_vision_encoder_results}`{=latex}. We find that for both VLAs tested, fine-tuning the vision encoder leads to significantly higher success rates across various tasks. Qualitatively, in some cases, deploying the frozen vision encoder policies leads to unstable robot behaviors that are clearly suboptimal. Consequently, we decided early on during development to not conduct further experimentation with frozen vision encoders.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{ll|cc|cc}
\toprule
&& \multicolumn{2}{c|}{SigLIP ViT-SO 224px} & \multicolumn{2}{c}{LLaVa v1.5 7B (Reproduction)} \\
\cline{3-4} \cline{5-6}
Task & \# Trials & \makecell{Frozen Vision\\\# Successes} & \makecell{Fine-Tuned\\\# Successes} & \makecell{Frozen Vision\\\# Successes} & \makecell{Fine-Tuned\\\# Successes} \\
\midrule
Put Eggplant into Pot & 10 & 7 & 10 & 5 & 9 \\
Put Corn on Plate & 10 & 10 & 9 & 0 & 9 \\
\midrule
& Mean Success Rate & \cellcolor{lightlightgray}85 & \cellcolor{lightlightgray}\textbf{95} & \cellcolor{lightlightgray}25 & \cellcolor{lightlightgray}\textbf{90} \\
\midrule
Put \{ Eggplant, Red Bottle \} into Pot & 4 & 2 & 4 & -- & 3 \\
Put \{ Blue Cup, Pink Cup \} on Plate & 4 & 0 & 0 & -- & 0 \\
Lift \{ Cheese, Red Chili Pepper \} & 4 & 0 & 3 & -- & 2 \\
Put \{ Strawberry, Lime \} into Pot & 4 & 1 & 0 & -- & 3 \\
Move \{ Sushi, Grapes \} & 4 & 3 & 4 & -- & 3 \\
\midrule
& Mean Success Rate & \cellcolor{lightlightgray}30 & \cellcolor{lightlightgray}\textbf{55} & \cellcolor{lightlightgray}-- & \cellcolor{lightlightgray}55 \\
\bottomrule
\end{tabular}}
```
Additional Quantized Inference Experiments: Disentangling Policy Performance and Model Inference Speed {#sec:app:additional_quantized_inference_experiments}
------------------------------------------------------------------------------------------------------

In `\cref{sec:param_efficient_finetuning}`{=latex}, we evaluated `\name{}`{=latex} with different levels of precision at inference time: half precision (bfloat16), 8-bit quantization, and 4-bit quantization. 8-bit quantization led to lower BridgeData V2 performance relative to the other two approaches, and we hypothesized that the reduction in performance was caused by lower model inference speed from the operations used in 8-bit quantization. In this section, we conduct experiments to assess the veracity of this claim.

Specifically, we evaluate `\name{}`{=latex} again with the three different levels of precision listed above, but now with *blocking control*. In other words, each action is fully executed on the robot before the next one is predicted by the policy and executed by the controller. This scheme controls system dynamics across methods with varying amounts of latency and thus allows us to test the quality of a policy's action predictions, independent of its prediction speed. Effectively, the precision levels that have higher throughput -- bfloat16 and 4-bit quantization -- are forced to run slower to match the dynamics observed when deploying `\name{}`{=latex} with 8-bit precision. Therefore, we expect `\name{}`{=latex}'s performance with 8-bit precision to match the performance of bfloat16 and 4-bit precision under blocking control.
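
The blocking control scheme can be sketched with a toy control loop. This is a minimal illustration, not the deployment code: `ToyRobot`, `ToyPolicy`, and `run_blocking_episode` are hypothetical stand-ins for a real robot interface and policy.

```python
import time


class ToyRobot:
    """One-dimensional stand-in for a robot; actions are position deltas."""

    def __init__(self, goal: int = 3):
        self.position = 0
        self.goal = goal
        self.trajectory = []

    def get_observation(self) -> int:
        return self.position

    def execute_blocking(self, action: int) -> None:
        # Returns only after the action has been fully executed.
        self.position += action
        self.trajectory.append(self.position)

    def done(self) -> bool:
        return self.position >= self.goal


class ToyPolicy:
    """Always moves one step forward; latency_s mimics model inference time."""

    def __init__(self, latency_s: float):
        self.latency_s = latency_s

    def predict(self, observation: int, instruction: str) -> int:
        time.sleep(self.latency_s)
        return 1


def run_blocking_episode(policy, robot, instruction: str, max_steps: int = 10):
    """Predict and execute strictly in sequence: the robot waits during inference."""
    for _ in range(max_steps):
        action = policy.predict(robot.get_observation(), instruction)
        robot.execute_blocking(action)
        if robot.done():
            break
    return robot.trajectory


fast = run_blocking_episode(ToyPolicy(0.0), ToyRobot(), "reach the goal")
slow = run_blocking_episode(ToyPolicy(0.01), ToyRobot(), "reach the goal")
print(fast, slow)
```

Because the robot is stationary while the policy runs, the fast and slow policies produce the same trajectory in this toy setting. Only wall-clock time differs, which is the property that lets blocking control isolate prediction quality from inference speed.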

**Experimental setup and tasks.** We report the performance of `\name{}`{=latex} with blocking control and quantized inference on the same subset of 8 BridgeData V2 tasks used in `\cref{sec:app:openx_pretraining_ablation}`{=latex} and `\cref{sec:app:dual_vs_single_vision_encoder}`{=latex}.

**Results.** Quantized inference experiment results with blocking control are shown in `\cref{app:tab:blocking_control_results}`{=latex}. Unlike in `\cref{tab:quantized_results}`{=latex}, where 8-bit quantization led to the worst rollout performance due to low inference speed, here we observe that 8-bit quantization performs comparably to bfloat16 precision and 4-bit quantization given that we evaluate with blocking control to remove the influence of varying inference speeds on task performance. This confirms our hypothesis about the effect of inference speed on 8-bit quantization performance in previous experiments (when using non-blocking control). We also see no substantial performance degradation when using the lowest precision, 4-bit, as also observed in `\cref{sec:param_efficient_finetuning}`{=latex}.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{llcccc}
\toprule
Category & Task & \# Trials & \makecell{bfloat16\\\# Successes} & \makecell{int8\\\# Successes} & \makecell{int4\\\# Successes} \\
\midrule
Visual gen & Put Eggplant into Pot (Easy Version) & 10 & 10 & 10 & 10 \\
Visual gen & Put Eggplant into Pot & 10 & 9 & 10 & 10 \\
Visual gen & Put Cup from Counter into Sink & 10 & 5 & 5 & 3 \\
Motion gen & Lift Eggplant & 10 & 8 & 7 & 7.5 \\
Physical gen & Put Carrot on Plate & 10 & 10 & 10 & 10 \\
Physical gen & Lift AAA Battery & 10 & 3 & 6 & 4 \\
Semantic gen & Take Purple Grapes out of Pot & 10 & 2 & 2 & 2 \\
Language grounding & Put \{Eggplant, Red Bottle\} into Pot & 10 & 9 & 9.5 & 8.5 \\
\midrule
&& Mean Success Rate & \cellcolor{lightlightgray}70.0 $\pm$ 5.1\% & \cellcolor{lightlightgray}74.4 $\pm$ 4.9\% & \cellcolor{lightlightgray}68.8 $\pm$ 5.2\% \\
\bottomrule
\end{tabular}}
```
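
As a check on the summary row, the sketch below recomputes the mean success rates from the per-task counts in the table. The $\pm$ values are reproduced here under the assumption that they are binomial standard errors over all 80 trials; the text does not state how they were computed, so this is an inference from the numbers rather than a documented procedure.

```python
import math

TRIALS_PER_TASK = 10

# Per-task success counts copied from the table (fractional entries are taken verbatim).
successes = {
    "bfloat16": [10, 9, 5, 8, 10, 3, 2, 9],
    "int8": [10, 10, 5, 7, 10, 6, 2, 9.5],
    "int4": [10, 10, 3, 7.5, 10, 4, 2, 8.5],
}


def mean_and_stderr(counts, trials_per_task):
    """Return (mean success rate, assumed binomial standard error), both in percent."""
    n = trials_per_task * len(counts)
    p = sum(counts) / n
    return 100 * p, 100 * math.sqrt(p * (1 - p) / n)


summary = {name: mean_and_stderr(c, TRIALS_PER_TASK) for name, c in successes.items()}
for name, (mean, stderr) in summary.items():
    print(f"{name}: {mean:.1f} +/- {stderr:.1f}%")
```

Under this assumption the recomputed values agree with the table's summary row to one decimal place.
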
LIBERO Simulation Experiments {#sec:app:libero_sim_experiments}
=============================

Our previous discussions in `\cref{sec:finetuning_exp}`{=latex} and `\cref{sec:param_efficient_finetuning}`{=latex} focused on adapting `\name{}`{=latex} to novel *real-world* robot setups and tasks. This section explores adapting `\name{}`{=latex} to *simulated* robot setups and tasks, specifically utilizing the **LIBERO** benchmark [@liu2024libero]. Our experimentation in simulation offers two key advantages:

1.  Demonstration of versatility: We show that `\name{}`{=latex}, despite having been pretrained exclusively on real-world robot data, can effectively adapt to simulated domains, overcoming potential disparities between real-world and simulated environments and dynamics.

2.  Enhanced accessibility and reproducibility: Integration of `\name{}`{=latex} into a publicly available simulation platform makes our model more accessible to other researchers, especially those who may not have access to robotic hardware. Additionally, simulated experiments are more easily reproduced than their real-world counterparts.

We discuss the experimental setup in `\cref{sec:app:libero_sim_setup}`{=latex} and the results in `\cref{sec:app:libero_sim_results}`{=latex}. We release the materials required to reproduce the experiments along with the `\name{}`{=latex} codebase.

LIBERO Simulation Experimental Setup {#sec:app:libero_sim_setup}
------------------------------------

**Simulation setup and tasks.** The LIBERO benchmark [@liu2024libero] consists of four task suites designed for studying lifelong learning in robotic manipulation, and the original paper therefore investigates both forward and backward transfer to a variety of tasks. In our experiments, we focus solely on supervised fine-tuning on the target task suite, measuring the performance of various policies trained via behavioral cloning on successful demonstrations of the tasks.

We perform experiments with the following four task suites, each of which contains 10 tasks with 50 human-teleoperated demonstrations per task:

-   **LIBERO-Spatial** consists of the same set of objects but different layouts, and tests the model's understanding of spatial relationships.

-   **LIBERO-Object** consists of the same scene layouts but different objects, and tests the model's understanding of object types.

-   **LIBERO-Goal** consists of the same objects and layouts but different task goals, and tests the model's knowledge of different task-oriented behaviors.

-   **LIBERO-Long** (also called **LIBERO-10**) consists of *long-horizon* tasks with diverse objects, layouts, and tasks.

We make the following modifications to each of the training datasets above:

1.  To accommodate methods requiring higher-resolution images (such as $256 \times 256$px or $224 \times 224$px), we regenerate all demonstrations at an increased resolution of $256 \times 256$px. Originally, the dataset provided by the benchmark consists of $128 \times 128$px images. We find that simply upscaling these images to $256 \times 256$px results in poor image quality. Therefore, we choose to begin with higher-resolution images, which can be downscaled as necessary, ensuring higher image quality across various resolution requirements. These higher-resolution images were obtained by stepping through the simulation environments with the actions stored in the provided human-collected demonstrations and saving the images rendered by the simulator.

2.  We filter out all "no-op" actions from the dataset, i.e., actions that have near-zero magnitude in the translation and rotation components and do not change the state of the robot's gripper. We find that this simple data cleaning step is crucial for highly expressive single-step policies such as `\name{}`{=latex}, which otherwise learn to imitate these no-op actions and consequently freeze indefinitely at certain states during evaluation.

3.  We rotate all third-person images at both train and test time by 180 degrees because we observe that the LIBERO environments return images that are upside down on our hardware.

4.  Since we train policies via imitation learning, which expects demonstrations to be successful, we replay all demonstrations in the corresponding simulation environments and filter out the demonstrations that fail to complete the task (as determined by the environments' success criteria). As a result, we remove 68 of 500 LIBERO-Spatial demonstrations, 46 of 500 LIBERO-Object demonstrations, 72 of 500 LIBERO-Goal demonstrations, and 121 of 500 LIBERO-Long demonstrations.

5.  For all methods in our comparisons, we only utilize the static third-person camera images; we do not use the wrist camera images that are additionally provided in the original datasets. We do this to keep the comparisons fair, since `\name{}`{=latex}'s visual inputs consist only of third-person camera images.
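
Two of the modifications above, no-op filtering and the 180-degree image rotation, can be sketched in a few lines. This is an illustration under stated assumptions, not the released preprocessing code: the 7-dimensional action layout (three translation, three rotation, one gripper component), the `1e-4` threshold, and the function names are all hypothetical. The last lines simply subtract the removed-demonstration counts reported in step 4.

```python
import math


def is_noop(action, prev_action=None, threshold=1e-4):
    """True if the motion components are near zero and the gripper command is unchanged.

    Assumes action = (dx, dy, dz, droll, dpitch, dyaw, gripper).
    """
    motion = math.sqrt(sum(a * a for a in action[:6]))
    if motion >= threshold:
        return False
    if prev_action is None:
        return True  # no earlier gripper command to compare against
    return action[6] == prev_action[6]


def filter_noops(actions, threshold=1e-4):
    """Drop no-op actions, comparing each gripper command with the preceding step."""
    kept, prev = [], None
    for action in actions:
        if not is_noop(action, prev, threshold):
            kept.append(action)
        prev = action
    return kept


def rotate_180(image):
    """Rotate an image, given as a list of pixel rows, by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


demo_actions = [
    (0.1, 0, 0, 0, 0, 0, -1),   # arm moves: kept
    (0, 0, 0, 0, 0, 0, -1),     # no motion, gripper unchanged: dropped
    (0, 0, 0, 0, 0, 0, 1),      # no motion, but gripper command changes: kept
    (0, 0, 0, 0, 0, 0, 1),      # no motion, gripper unchanged: dropped
    (0, 0, 0.05, 0, 0, 0, 1),   # arm moves: kept
]
kept_actions = filter_noops(demo_actions)
rotated = rotate_180([[1, 2], [3, 4]])

# Demonstrations remaining after removing failed replays (counts from step 4).
removed = {"LIBERO-Spatial": 68, "LIBERO-Object": 46, "LIBERO-Goal": 72, "LIBERO-Long": 121}
remaining = {suite: 500 - n for suite, n in removed.items()}
print(len(kept_actions), rotated, remaining)
```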

**Comparisons.** The methods that we compare include **Diffusion Policy[^8]** [@chi2023diffusionpolicy] trained from scratch, **Octo** [@octo_2023] fine-tuned on the target dataset, and **`\name{}`{=latex}** fine-tuned on the target dataset via LoRA ($r=32$) as described in `\cref{sec:param_efficient_finetuning}`{=latex}. Each policy is trained independently on each of the task suites above (rather than training a single policy on all four suites combined). All policies are trained with the same set of demonstrations, so all methods benefit from the data cleaning steps described above.

**Evaluation details.** To ensure lower variance in the experimental results, all methods are evaluated across 500 trials for each task suite, and the reported performance is the average success rate over three random seeds (resulting in 1500 total trials per statistic). Although we modify the training datasets, as described earlier, we do not change the test environments but rather use the same initial environment configurations provided by the original LIBERO benchmark.

LIBERO Simulation Experimental Results {#sec:app:libero_sim_results}
--------------------------------------

We present the LIBERO experimental results in `\cref{app:tab:libero_sim_results_table}`{=latex}. Importantly, we observe that `\name{}`{=latex} can be effectively adapted to tasks in the LIBERO simulation environments, as it obtains the highest average success rate and the best average rank among the tested methods. However, we find that the overall margins between `\name{}`{=latex} and the other methods are narrower here than in the real-world fine-tuning experiments discussed in `\cref{sec:finetuning_exp}`{=latex}. We attribute this to the fact that `\name{}`{=latex} was pretrained with purely real-world robot data and no simulation data, which suggests that fine-tuning the model on simulated robot tasks may not be as effective as fine-tuning it on real-world tasks due to the domain gap between simulated and real-world environments and dynamics. We see evidence for this notion in the results obtained by Octo -- another policy pretrained on large amounts of real-world robot data -- which also achieves only a small boost in overall performance relative to a simple, strong baseline such as Diffusion Policy trained from scratch. We expect increased gains in performance for the pretrained and fine-tuned methods if simulation data is added to the pretraining data mixture.

```{=latex}
\centering
```
```{=latex}
\resizebox{\textwidth}{!}{\begin{tabular}{l|cc|cc|cc|cc|cc}
\toprule
& \multicolumn{2}{c|}{LIBERO-Spatial} & \multicolumn{2}{c|}{LIBERO-Object} & \multicolumn{2}{c|}{LIBERO-Goal} & \multicolumn{2}{c|}{LIBERO-Long} & \multicolumn{2}{c}{Average} \\
\cline{2-3} \cline{4-5} \cline{6-7} \cline{8-9} \cline{10-11}
& SR ($\uparrow$) & Rank ($\downarrow$)& SR ($\uparrow$) & Rank ($\downarrow$)& SR ($\uparrow$) & Rank ($\downarrow$)& SR ($\uparrow$) & Rank ($\downarrow$)& SR ($\uparrow$) & Rank ($\downarrow$)\\
\midrule
Diffusion Policy from scratch & 78.3 $\pm$ 1.1\% & 3 & \textbf{92.5 $\pm$ 0.7\%} & 1 & 68.3 $\pm$ 1.2\% & 3 & 50.5 $\pm$ 1.3\% & 3 & 72.4 $\pm$ 0.7\% & 2.5 \\
Octo fine-tuned & 78.9 $\pm$ 1.0\% & 2 & 85.7 $\pm$ 0.9\% & 3 & \textbf{84.6 $\pm$ 0.9\%} & 1 & 51.1 $\pm$ 1.3\% & 2 & 75.1 $\pm$ 0.6\% & 2 \\
OpenVLA fine-tuned (ours) & \textbf{84.7 $\pm$ 0.9\%} & 1 & 88.4 $\pm$ 0.8\% & 2 & 79.2 $\pm$ 1.0\% & 2 & \textbf{53.7 $\pm$ 1.3\%} & 1 & \textbf{76.5 $\pm$ 0.6\%} & \textbf{1.5} \\
\bottomrule
\end{tabular}}
```
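
The "Average" columns in the table can be recomputed from the per-suite entries. The sketch below uses the success rates copied from the table and assumes that the average rank is the unweighted mean of the per-suite ranks, which is consistent with the reported values; the variable and function names are illustrative.

```python
SUITES = ["Spatial", "Object", "Goal", "Long"]

# Per-suite success rates (percent) copied from the table.
success_rates = {
    "Diffusion Policy from scratch": {"Spatial": 78.3, "Object": 92.5, "Goal": 68.3, "Long": 50.5},
    "Octo fine-tuned": {"Spatial": 78.9, "Object": 85.7, "Goal": 84.6, "Long": 51.1},
    "OpenVLA fine-tuned": {"Spatial": 84.7, "Object": 88.4, "Goal": 79.2, "Long": 53.7},
}


def rank_methods(rates, suite):
    """Rank methods on one suite: 1 is the highest success rate (no ties in this data)."""
    ordered = sorted(rates, key=lambda method: rates[method][suite], reverse=True)
    return {method: position + 1 for position, method in enumerate(ordered)}


ranks = {suite: rank_methods(success_rates, suite) for suite in SUITES}

summary = {}
for method, per_suite in success_rates.items():
    avg_sr = sum(per_suite[s] for s in SUITES) / len(SUITES)
    avg_rank = sum(ranks[s][method] for s in SUITES) / len(SUITES)
    summary[method] = (avg_sr, avg_rank)
    print(f"{method}: average SR {avg_sr:.1f}%, average rank {avg_rank}")
```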

[^1]: `\name{}`{=latex} uses multiple pretrained model components: SigLIP [@zhai2023siglip] and DINOv2 [@oquab2023dinov2] vision encoders and a Llama 2 [@touvron2023llama2] language model backbone. For all three models, weights are open, but not their training data or code. We release training data, code and model weights for reproducing `\name{}`{=latex} on top of these components.

[^2]: Octo [@octo_2023] demonstrated training across datasets with heterogeneous sensory inputs. While very promising, we leave an investigation of VLA training across heterogeneous sensor modalities and action spaces to future work.

[^3]: The full Diffusion Policy uses a two-step observation history with both images and proprioceptive state, and performs receding horizon control by predicting a chunk of $T$ future actions and executing the first $X$ actions in open-loop fashion before predicting the next chunk (for 15Hz control, we set $T=16, X=8$ like in the DROID prior work [@khazatsky2024droid]; for 5Hz control, we reduce the chunk sizes to $T=8, X=3$). It is also the only method in `\cref{sec:finetuning_exp}`{=latex} that predicts *absolute* Cartesian coordinates to control the robot; all other methods use *relative* position control. Diffusion Policy (matched) uses a single image as input, has no proprioceptive information and no observation history, and predicts a single relative position control action without action chunking.

[^4]: In `\cref{sec:param_efficient_finetuning}`{=latex} and `\cref{sec:quantization_exp}`{=latex}, we experiment with a version of the `\name{}`{=latex} model that is pretrained with a smaller robot data mixture (the same OpenX dataset mixture as Octo) and has a slightly smaller architecture which only uses a SigLIP [@zhai2023sigmoid] vision backbone instead of the fused DinoSigLIP encoder. We find that this simpler architecture still achieves strong performance in both fine-tuning tasks and "out-of-the-box" tasks.

[^5]: We attribute the performance loss to low inference speed, since both 8-bit and 4-bit quantization achieve comparable token accuracy to bfloat16 inference when evaluated offline on training data. See `\cref{sec:app:additional_quantized_inference_experiments}`{=latex} for supporting details.

[^6]: We remove DROID for the last third of training due to slow learning progress (see `\cref{sec:data_mix}`{=latex}) and re-distribute its mixture weights across all other datasets.

[^7]: See Appendix of @rt22023arxiv for a detailed list of OOD conditions in Google robot evaluations.

[^8]: We use the implementation of Diffusion Policy that is described in the DROID dataset paper [@khazatsky2024droid], which conditions action generation on DistilBERT [@sanh2019distilbert] language embeddings of the task label.
