OpenVLA: An Open-Source Vision-Language-Action Model

Source

Raw Markdown: paper_openvla-2024.md
PDF: paper_openvla-2024.pdf
Preprint: arXiv 2406.09246
Project page: openvla.github.io
Official code: github.com/openvla/openvla
Official weights: openvla/openvla-7b

Core Claim

OpenVLA makes the RT-2-style VLA recipe open: a pretrained VLM is fine-tuned on robot trajectories so image observations and language instructions map to discretized end-effector action tokens.

Method Notes

The model uses an autoregressive action-token interface rather than a diffusion or flow action expert.
It is a strong baseline for semantic transfer and open reproducibility, but the action representation is still quantized, so action precision is bounded by the bin width.
The source is a useful counterweight to the diffusion/flow trend: plain next-token prediction on a Transformer VLM remains a workable policy interface when a task's action-precision and control-frequency demands are manageable.
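
To make the quantized action interface concrete, here is a minimal sketch of per-dimension action binning. It follows the scheme described in the OpenVLA paper (256 uniform bins per action dimension, with bounds taken from the 1st and 99th percentiles of the training actions), but it is not the official implementation. The function names, the pure-Python quantile helper, and the choice to decode to bin centers are assumptions made for illustration.

```python
N_BINS = 256  # bins per action dimension, as described in the OpenVLA paper


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of a pre-sorted list, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def fit_bounds(train_values, q_low=0.01, q_high=0.99):
    """Per-dimension bin range from training data; ignores extreme outliers."""
    s = sorted(train_values)
    return quantile(s, q_low), quantile(s, q_high)


def encode(value, low, high, n_bins=N_BINS):
    """Map a continuous value to a bin index in [0, n_bins - 1], clipping outliers."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # value == high would otherwise land in bin n_bins


def decode(idx, low, high, n_bins=N_BINS):
    """Map a bin index back to the center of its bin (illustrative choice)."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


if __name__ == "__main__":
    low, high = -1.0, 1.0  # hypothetical bounds for one action dimension
    token = encode(0.3, low, high)
    print(token, decode(token, low, high))
```

Under this sketch the round-trip error for an in-range value is at most half a bin width, (high - low) / (2 * 256), which is the precision ceiling that the note above refers to. Values outside the fitted range are clipped to the first or last bin.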

Evidence And Limitations

OpenVLA reports competitive results across several robot policy evaluations and provides code/weights. Reported limitations include single-image observations, limited native history/proprioception support in the initial version, and lower inference rates than compact control-specialized policies.
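
The inference-rate limitation follows partly from the autoregressive interface: each action needs one decoding step per action token, and those steps run sequentially. The sketch below is a back-of-the-envelope model only. The function name and all latency figures are hypothetical and are not measurements from the paper; the 7 tokens per action assumes one token per dimension of a 7-DoF end-effector action.

```python
def control_rate_hz(n_action_tokens, per_token_s, prefill_s=0.0):
    """Actions per second when each action costs one prefill pass over the
    image and instruction plus n sequential token-decoding steps."""
    if n_action_tokens <= 0 or per_token_s <= 0 or prefill_s < 0:
        raise ValueError("token count and per-token latency must be positive")
    return 1.0 / (prefill_s + n_action_tokens * per_token_s)


if __name__ == "__main__":
    # Hypothetical latencies: 60 ms prefill, 20 ms per decoded action token.
    print(control_rate_hz(n_action_tokens=7, per_token_s=0.02, prefill_s=0.06))
```

Under these assumed numbers the policy emits roughly 5 actions per second. The model makes the trade-off visible: with per-token latency fixed, the rate falls as the number of tokens per action grows, which is one reason tokenization schemes that shorten action sequences matter for this family of policies.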

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Control and action interface	adjacent	Fine-tunes a VLM backbone to emit discretized robot action tokens conditioned on image and language.	Does not model future trajectories under candidate actions or provide counterfactual rollouts.
Context interface	adjacent	Combines visual observations and natural-language instructions as policy context.	Lacks channel context, topology, event streams, and numeric system context for TSFMs.
Benchmarks	adjacent	Evaluates multi-robot out-of-the-box control and fine-tuning across BridgeData, Google Robot, and Franka tasks.	Physical manipulation metrics do not test latent-state time-series modeling or digital-world control.

Links Into The Wiki

Open Questions

Can action-token VLAs match diffusion/flow action experts after better tokenization such as FAST?
Which robotics tasks are bottlenecked by semantic understanding rather than continuous-control fidelity?
