OpenVLA
Basic Information
Website: https://arxiv.org/abs/2406.09246
Short description: OpenVLA is a 7-billion-parameter open-source vision-language-action (VLA) model. This transformer-based model takes a textual task description and visual input and outputs actions that a robot can execute. OpenVLA can control many different types of manipulation robots, making it a generalist robot policy.
Intended uses: What does the developer state that the system is intended for?: Robotic control.
Date(s) deployed: Paper first released June 13, 2024 [source]
Developer
Website: https://openvla.github.io
Legal name: Stanford University (et al.) [source]
Entity type: Academic Institution(s)
Country (location of developer or first author's first affiliation): California, USA [source]
Safety policies: What safety and/or responsibility policies are in place?: None
System Components
Backend model(s): What model(s) are used to power the system?: A Llama 2 7B language-model backbone, with SigLIP and DINOv2 vision encoders.
Public model specification: Is there formal documentation on the system’s intended behavior?: None
Description of reasoning, planning, and memory implementation: How does the system implement reasoning, planning, and memory?: Maps the language task description and the current visual state directly to robot actions. There is no explicit planning beyond what is learned implicitly from the training data.
Observation space: What is the system able to observe while 'thinking'?: A textual instruction describing the robot manipulation task and 224 × 224 pixel images of the current world state the robot is acting in.
Action space/tools: What direct actions can the system take?: A 7-dimensional robot control action (end-effector position, orientation, and gripper state) represented as discrete tokens (an illustrative inference sketch follows at the end of this section).
User interface: How do users interact with the system?: N/A; an engineering project
Development cost and compute: What is known about the development costs?: "The final OpenVLA model is trained on a cluster of 64 A100 GPUs for 14 days, or a total of 21,500 A100-hours, using a batch size of 2048." [source]
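For concreteness, below is a minimal inference sketch based on the project's published Hugging Face integration. The checkpoint name, `predict_action` helper, and `unnorm_key` argument follow the project README; the placeholder image and instruction are hypothetical, so treat this as an illustration rather than the authors' exact pipeline.

```python
# Minimal inference sketch for OpenVLA (placeholders flagged in comments).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the openly released checkpoint; trust_remote_code pulls in the
# OpenVLA-specific action prediction and de-tokenization logic.
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

# Placeholder observation: in practice this is a 224x224 RGB camera frame.
image = Image.new("RGB", (224, 224))
prompt = "In: What action should the robot take to pick up the red cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# predict_action decodes the discrete action tokens into a 7-D continuous action
# (end-effector position, orientation, gripper), un-normalized with statistics
# keyed by a training dataset (here BridgeData V2, per the README).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # 7-element action vector to hand to the robot controller
```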
Guardrails & Oversight
Accessibility of components
Weights: Are model parameters available?: Open source [source].
Data: Is data available?: OpenVLA starts from a pretrained language-model backbone and pretrained vision encoders; it is then fine-tuned on a curated subset of the Open X-Embodiment robotics dataset, which consists of "more than 70 individual robot datasets, with more than 2M robot trajectories" [source]
Code: Is code available?: Available [source].
Documentation: Is documentation available?: No standalone documentation, but the developers provide a technical report [source]
Scaffolding: Is system scaffolding available?: Available [source].
Controls and guardrails: What notable methods are used to protect against harmful use?: None
Monitoring and shutdown procedures: Are there any notable methods or protocols to monitor or shut down the system?: The model has no shutdown procedures; however, it is released as a base model rather than a deployed system.
Customer and usage restrictions: Are there know-your-customer measures or other usage restrictions?: None
Evaluation
Notable benchmark evaluations (e.g., on SWE-Bench Verified): On the BridgeData V2 and Google robot evaluation suites, OpenVLA outperforms the previous state-of-the-art open-source manipulation policy, Octo (93M parameters), and the state-of-the-art closed-source manipulation policy, RT-2-X (55B parameters).
Bespoke testing (e.g., demos): The authors test OpenVLA's ability to adapt to new settings using small datasets of 10-150 demonstrations of a target task. They find that generalist policies like OpenVLA and Octo perform better on target tasks "that involve multiple objects in the scene and require language conditioning," while Diffusion Policy imitation-learning methods work better on "narrower single-instruction tasks" [source] (an illustrative fine-tuning sketch follows below).
Safety: Have safety evaluations been conducted by the developers? What were the results?: None
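To illustrate the small-dataset adaptation described under bespoke testing, the sketch below shows one plausible parameter-efficient fine-tuning setup using the Hugging Face peft library. This is not the authors' own fine-tuning script; the LoRA rank, learning rate, and the placeholder demonstration dataloader are assumptions made purely for illustration.

```python
# Illustrative LoRA fine-tuning sketch (not the authors' script; hyperparameters
# and the demonstration data pipeline are hypothetical).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Wrap the model with low-rank adapters so only a small fraction of parameters
# is updated on the 10-150 target-task demonstrations.
lora_config = LoraConfig(r=32, lora_alpha=16, target_modules="all-linear")
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()

# Placeholder: an iterable of preprocessed demonstration batches, each holding
# pixel_values, input_ids, attention_mask, and labels over the action tokens.
demo_dataloader = []

optimizer = torch.optim.AdamW(vla.parameters(), lr=2e-5)
vla.train()
for batch in demo_dataloader:
    # Standard next-token prediction loss over the discretized action tokens.
    loss = vla(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```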
Publicly reported external red-teaming or comparable auditing
Personnel: Who were the red-teamers/auditors?: None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what methods did they use?: None
Findings: What did the red-teamers/auditors conclude?: None
Ecosystem
Interoperability with other systems: What tools or integrations are available?: OpenVLA can control the robot embodiments represented in its training data and can be fine-tuned for new robot setups.
Usage statistics and patterns: Are there any notable observations about usage?: The GitHub repository for OpenVLA has 173 forks and 1.4k stars [source].
Other notes (if any): --