OpenVLA
Basic information
Website: https://arxiv.org/abs/2406.09246
Short description: OpenVLA is a 7-billion-parameter open-source Vision-Language-Action (VLA) model. This transformer-based model accepts a textual description of a task together with visual input, and outputs actions that can be executed by a robot. OpenVLA can control many different types of manipulation robots, making it a generalist robot policy.
Intended uses: What does the developer say it’s for? Robotic control.
Date(s) deployed: First paper release June 13, 2024 [source]
Developer
Website: https://web.archive.org/web/20241222012031/https://openvla.github.io
Legal name: Stanford University (et al.) [source]
Entity type: Academic Institution(s)
Country (location of developer or first author’s first affiliation): California, USA [source]
Safety policies: What safety and/or responsibility policies are in place? None
System components
Backend model: What model(s) are used to power the system? A Llama 2 7B language-model backbone with SigLIP and DINOv2 vision encoders.
Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None
Reasoning, planning, and memory implementation: How does the system ‘think’? It maps the language task description and the current visual state directly to robot actions. There is no explicit planning beyond what is learned internally from the training data.
Observation space: What is the system able to observe while ‘thinking’? Textual inputs describing the robot manipulation task, and 224 × 224 pixel images of the current world state the robot is acting in.
Action space/tools: What direct actions can the system take? A 7-dimensional robot control action (end-effector position deltas, end-effector orientation deltas, and gripper state) represented as discrete tokens.
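A minimal control-loop sketch of the observation-to-action mapping described in the three fields above (not the authors' code). The helper names (`control_step`, `generate_action_tokens`, `detokenize_action`), the 256-bin granularity, and the per-dimension bounds are illustrative assumptions; the general scheme of emitting discrete action tokens and mapping them back to a continuous 7-D command follows the paper's description.

```python
# Sketch of a VLA control step: (instruction, image) -> 7 discrete action tokens -> 7-D command.
import numpy as np

NUM_BINS = 256  # assumed per-dimension discretization granularity


def detokenize_action(action_token_ids, low, high):
    """Map 7 discrete action token ids back to a continuous 7-D action.

    low/high are per-dimension bounds estimated from training data
    (e.g., percentile ranges); here they are placeholders.
    """
    bin_centers = (np.asarray(action_token_ids, dtype=np.float64) + 0.5) / NUM_BINS  # in (0, 1)
    return low + bin_centers * (high - low)


def control_step(model, processor, image, instruction, low, high):
    """One closed-loop step: observe (224x224 image, text instruction) -> act (7-D command)."""
    # Hypothetical interfaces: `processor` tokenizes the text and encodes the image;
    # `model.generate_action_tokens` autoregressively emits 7 token ids in [0, NUM_BINS).
    inputs = processor(text=instruction, image=image)
    action_token_ids = model.generate_action_tokens(**inputs)
    return detokenize_action(action_token_ids, low, high)


# Example bounds for the 7 dims: 3 position deltas, 3 orientation deltas, 1 gripper value.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [1.0])
```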
User interface: How do users interact with the system? N/A; an engineering project
Development cost and compute: What is known about the development costs? “The final OpenVLA model is trained on a cluster of 64 A100 GPUs for 14 days, or a total of 21,500 A100-hours, using a batch size of 2048.” [source]
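As a sanity check on the reported figure: 64 GPUs × 24 hours/day × 14 days = 21,504 A100-hours, consistent with the stated total of roughly 21,500 A100-hours.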
Guardrails and oversight
Accessibility of components:
- Weights: Are model parameters available? Open source [source]; a hedged loading-and-inference sketch follows this list.
- Data: Is data available? OpenVLA starts from a pretrained language-model backbone and vision encoders, then fine-tunes on a curated subset of the Open X-Embodiment robotics dataset (which comprises “more than 70 individual robot datasets, with more than 2M robot trajectories” [source]).
- Code: Is code available? Available [source].
- Scaffolding: Is system scaffolding available? Available [source].
- Documentation: Is documentation available? No standalone documentation, but a technical report is available [source]
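A sketch of loading the released checkpoint with the Hugging Face transformers AutoClasses. The prompt template, the `predict_action` helper, and the `unnorm_key` argument follow the project README; treat them as assumptions if the repository has since changed.

```python
# Hedged sketch: load the open OpenVLA weights and run one inference step.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("camera_frame.png")  # placeholder third-person camera frame
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# `predict_action` (per the README) returns a de-normalized 7-D continuous command:
# position deltas, orientation deltas, and gripper state for the robot controller.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```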
Controls and guardrails: What notable methods are used to protect against harmful actions? None
Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None
Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? The model has no monitoring or shutdown procedures; however, it is released as a base model.
Evaluation
Notable benchmark evaluations: On BridgeData V2 and Google robot evaluations, OpenVLA outperforms the previous state-of-the-art open-source manipulation policy, Octo (93M parameters), and the state-of-the-art closed-source manipulation policy, RT-2-X (55B parameters).
Bespoke testing: The authors test the ability of OpenVLA to be adapted to new settings using small datasets of 10-150 demonstrations of some target task. They find that generalist policies like OpenVLA and Octo perform better on target tasks “that involve multiple objects in the scene and require language conditioning,” while Diffusion Policy imitation-learning techniques work better on “narrower single-instruction tasks” [source].
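A sketch of parameter-efficient adaptation with LoRA via the PEFT library, in the spirit of the few-demonstration fine-tuning experiments described above. The rank, dropout, and target-module choices are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: wrap the OpenVLA backbone in LoRA adapters for small-data fine-tuning.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=32,                         # adapter rank (assumption)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # adapt all linear projections (assumption)
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()

# The adapted model is then trained with the usual next-token objective on action
# tokens extracted from the 10-150 target-task demonstrations.
```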
Safety: Have safety evaluations been conducted by the developers? What were the results? None
Publicly reported external red-teaming or comparable auditing:
- Personnel: Who were the red-teamers/auditors? None
- Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
- Findings: What did the red-teamers/auditors conclude? None
Ecosystem information
Interoperability with other systems: What tools or integrations are available? OpenVLA can be used to control robots within its training dataset and can be fine-tuned for new systems.
Usage statistics and patterns: Are there any notable observations about usage? The GitHub repository for OpenVLA has 173 forks and 1.4k stars [source].
Additional notes
None