OpenVLA: An Open-Source Vision-Language-Action Model

Source

Core Claim

OpenVLA makes the RT-2-style VLA recipe open: a pretrained VLM is fine-tuned on robot trajectories so image observations and language instructions map to discretized end-effector action tokens.

Method Notes

  • The model uses an autoregressive action-token interface rather than a diffusion or flow action expert.
  • It is a strong baseline for semantic transfer and open reproducibility, but actions are still quantized into discrete bins, which caps the precision of each predicted action.
  • The paper is a useful counterweight to the diffusion/flow trend: standard Transformer/VLM next-token prediction remains a workable control interface when action precision and control-frequency demands are modest.
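
The action-token interface described above can be sketched as uniform binning of each action dimension. This is a minimal illustration, not the OpenVLA code: the bin count of 256, the per-dimension bounds `low`/`high`, and the function names are assumptions chosen for the example.

```python
from typing import List, Sequence

N_BINS = 256  # assumed number of bins per action dimension


def discretize(action: Sequence[float], low: Sequence[float],
               high: Sequence[float], n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)          # clip to the assumed bounds
        frac = (clipped - lo) / (hi - lo)      # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: Sequence[float],
                 high: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to the center value of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 7-DoF end-effector action (xyz delta, rotation delta, gripper).
low = [-1.0] * 7
high = [1.0] * 7
action = [0.12, -0.40, 0.05, 0.0, 0.31, -0.99, 1.0]

tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
max_error = max(abs(a - r) for a, r in zip(action, recovered))
print(tokens)
print(f"max round-trip error: {max_error:.5f}")
```

For in-range values the round-trip error is at most half a bin width, here (high - low) / (2 * N_BINS), which is the precision ceiling that the quantization caveat refers to.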

Evidence And Limitations

OpenVLA reports competitive results across several robot policy evaluations and releases its code and weights. Reported limitations include single-image observations, limited native support for observation history and proprioception in the initial version, and lower inference rates than compact, control-specialized policies.
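
One reason for the lower inference rate is that an autoregressive policy decodes one token per action dimension, so each action costs a prompt pass plus several sequential decode steps. The back-of-envelope model below is a sketch under that assumption; the function name and all latency values are hypothetical placeholders, not measurements from the paper.

```python
def control_rate_hz(n_action_tokens: int, prefill_s: float,
                    per_token_s: float) -> float:
    """Estimate actions per second for a policy that decodes tokens sequentially.

    prefill_s:   assumed time to process the image and instruction prompt
    per_token_s: assumed time for each sequential decode step
    """
    latency_s = prefill_s + n_action_tokens * per_token_s
    return 1.0 / latency_s


# Hypothetical timings for illustration only.
rate_7_tokens = control_rate_hz(n_action_tokens=7, prefill_s=0.08, per_token_s=0.012)
rate_1_step = control_rate_hz(n_action_tokens=1, prefill_s=0.08, per_token_s=0.012)
print(f"7 action tokens: {rate_7_tokens:.1f} Hz, single decode step: {rate_1_step:.1f} Hz")
```

Under this model the control rate falls as the number of action tokens grows, which is why shorter token sequences or parallel action heads are attractive for high-frequency control.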

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and action interface | adjacent | Fine-tunes a VLM backbone to emit discretized robot action tokens conditioned on image and language. | Does not model future trajectories under candidate actions or provide counterfactual rollouts. |
| Context interface | adjacent | Combines visual observations and natural-language instructions as policy context. | Lacks channel context, topology, event streams, and numeric system context for TSFMs. |
| Benchmarks | adjacent | Evaluates multi-robot out-of-the-box control and fine-tuning across BridgeData, Google Robot, and Franka tasks. | Physical manipulation metrics do not test latent-state time-series modeling or digital-world control. |

Open Questions

  • Can action-token VLAs match diffusion/flow action experts after better tokenization such as FAST?
  • Which robotics tasks are bottlenecked by semantic understanding rather than continuous-control fidelity?
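
To make the tokenization question concrete, the sketch below illustrates the frequency-domain idea behind FAST-style tokenizers: transform a chunk of actions with a discrete cosine transform, then quantize the coefficients. It is a simplified illustration rather than the FAST implementation; it omits the subsequent byte-pair-encoding step, and `quant_scale`, the chunk length, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def dct(x: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(coeffs: Sequence[float]) -> List[float]:
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def compress_chunk(chunk: Sequence[float], quant_scale: float = 10.0) -> List[int]:
    """Scale and round DCT coefficients to integers; small ones round to 0."""
    return [round(c * quant_scale) for c in dct(chunk)]


def decompress_chunk(codes: Sequence[int], quant_scale: float = 10.0) -> List[float]:
    """Undo the scaling and invert the transform."""
    return idct([v / quant_scale for v in codes])


# A smooth, hypothetical 16-step trajectory for one action dimension.
chunk = [math.sin(2 * math.pi * t / 32) for t in range(16)]
codes = compress_chunk(chunk)
recon = decompress_chunk(codes)
nonzero = sum(1 for c in codes if c != 0)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
print(f"{nonzero} of {len(codes)} coefficients are nonzero; max error {max_err:.4f}")
```

For smooth trajectories, many high-frequency coefficients tend to round to zero, so a chunk can be described with fewer informative tokens than per-step binning would need. Whether that closes the gap to diffusion/flow action experts is the open question.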