Aguvis
Basic information
Website: https://arxiv.org/abs/2412.04454
Short description: Aguvis is a fully autonomous, pure-vision GUI agent capable of performing tasks independently without relying on proprietary models, and it can operate across various platforms (web, desktop, mobile) [source]
Intended uses: What does the developer say it’s for? GUI automation – Autonomously navigate and interact with complex digital environments [source]
Date(s) deployed: No apparent official deployment or release, but the earliest GitHub commits date to Dec 23, 2024 [source]
Developer
Website: https://aguvis-project.github.io/
Legal name: University of Hong Kong (et al.) [source]
Entity type: Academic Institution, Corporation
Country (location of developer or first author’s first affiliation): Hong Kong, China [source]
Safety policies: What safety and/or responsibility policies are in place? None
System components
Backend model: What model(s) are used to power the system? Uses Qwen2-VL as the backend Vision-Language Model. Aguvis also evaluates LLaVA-OneVision as an alternative backend, demonstrating that its performance is independent of the VLM used. Aguvis can operate autonomously or serve as a grounding model when paired with GPT-4o for planning [source]
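For context, the Qwen2-VL backbone can be loaded with Hugging Face transformers as sketched below. This is an illustration only: the model ID shown is Qwen's public base checkpoint, not a released Aguvis checkpoint, so the Aguvis weights would need to be substituted where appropriate.

```python
# Hedged example: loading the Qwen2-VL backbone with Hugging Face transformers.
# "Qwen/Qwen2-VL-7B-Instruct" is the public base model, not an Aguvis checkpoint.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```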
Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None
Reasoning, planning, and memory implementation: How does the system ‘think’? An Aguvis agent receives an image observation from the GUI environment and generates an inner monologue based on its previous actions and observations. The inner monologue consists of three components: a natural-language description of the current observation; internal reasoning grounded in the high-level goal, the observation description, and previous thoughts; and a low-level natural-language action instruction that specifies the next action. The agent then executes that action, receives a new observation, and repeats the process until it either achieves the goal or reaches a terminal state. Aguvis is trained on a collection of GUI interaction trajectories called the Aguvis Collection (Appendix B.1 in [source]). Training proceeds in two stages: the first stage teaches the model to understand and interact with objects within a single GUI screenshot (grounding); the second stage introduces more complex decision-making and reasoning, teaching the model to execute multi-step tasks by reasoning through agent trajectories that vary in complexity and environment and encompass diverse reasoning modes [source]
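As an illustrative sketch only (not the authors' implementation), the observe-reason-act loop described above might be wired roughly as follows; the `Step` dataclass and all callback names (`screenshot_fn`, `model_fn`, `execute_fn`, `done_fn`) are hypothetical placeholders.

```python
# Hypothetical sketch of the observe-reason-act loop described above.
# The Step fields mirror the three inner-monologue components plus the action itself.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    observation_desc: str  # natural-language description of the current screenshot
    reasoning: str         # reasoning over the high-level goal, observation, and prior thoughts
    instruction: str       # low-level natural-language instruction for the next action
    action: str            # executable action, e.g. a pyautogui command string

def run_episode(
    goal: str,
    screenshot_fn: Callable[[], bytes],                   # capture the current GUI as an image
    model_fn: Callable[[str, bytes, List[Step]], Step],   # VLM produces the inner monologue + action
    execute_fn: Callable[[str], None],                    # run the emitted action in the environment
    done_fn: Callable[[], bool],                          # check for goal completion / terminal state
    max_steps: int = 30,
) -> List[Step]:
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = screenshot_fn()                # pure-vision observation
        step = model_fn(goal, screenshot, history)  # conditioned on goal, screenshot, and prior steps
        execute_fn(step.action)
        history.append(step)
        if done_fn():
            break
    return history
```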
Observation space: What is the system able to observe while ‘thinking’? The current image observation from the GUI environment; the system also has access to previous observations and actions [source]
Action space/tools: What direct actions can the system take? The system emits pyautogui commands [source], with support for both basic and pluggable action systems [source]
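The snippet below shows the kind of standard pyautogui calls such an action space maps onto; it illustrates the library's ordinary API only, and the exact command set and coordinate convention used by Aguvis may differ.

```python
# Standard pyautogui calls of the kind a GUI agent's actions can map onto
# (illustrative only; Aguvis' exact action set may differ).
import pyautogui

pyautogui.click(x=640, y=360)             # click a GUI element at pixel coordinates
pyautogui.write("aguvis", interval=0.05)  # type text into the focused field
pyautogui.press("enter")                  # press a single key
pyautogui.scroll(-500)                    # scroll down
pyautogui.hotkey("ctrl", "l")             # keyboard shortcut
```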
User interface: How do users interact with the system? Users provide task prompts to the system and can observe the system’s reasoning process and outputs [source]
Development cost and compute: What is known about the development costs? Aguvis is trained on a cluster of H100-80G GPUs: Aguvis-7B uses 8 nodes, completing grounding training within 5 hours and planning-and-reasoning training within 1 hour; Aguvis-72B uses 16 nodes, completing grounding training within 30 hours and planning-and-reasoning training within 6 hours [source]
Guardrails and oversight
Accessibility of components:
- Weights: Are model parameters available? Available [source]
- Data: Is data available? Aguvis collection splits used for both training stages are available [source]
- Code: Is code available? Available [source]
- Scaffolding: Is system scaffolding available? Available [source]
- Documentation: Is documentation available? Preliminary [source]
Controls and guardrails: What notable methods are used to protect against harmful actions? Unknown
Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None
Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Unknown
Evaluation
Notable benchmark evaluations: OSWorld (14.79% average for the 7B model with GPT-4o planner and 10.26% average for the 72B model) [source]
Bespoke testing: Available [source]
Safety: Have safety evaluations been conducted by the developers? What were the results? None
Publicly reported external red-teaming or comparable auditing:
- Personnel: Who were the red-teamers/auditors? None
- Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
- Findings: What did the red-teamers/auditors conclude? None
Ecosystem information
Interoperability with other systems: What tools or integrations are available? Aguvis provides cross-platform operability and supports both offline and real-world online scenarios [source].
Usage statistics and patterns: Are there any notable observations about usage? 11 forks and 170 stars on GitHub [source]
Additional notes
None