OpenWebVoyager

Basic information

Website: https://arxiv.org/abs/2410.19609

Short description: OpenWebVoyager is an agent for accomplishing open-ended tasks on the internet. It supports multimodal observations and requires minimal human guidance.

Intended uses: What does the developer say it’s for? Designed to handle more complicated web scenarios, including multimodal input and sparse supervision. Standard internet interaction benchmarks are text-only; this work aims to go beyond that limitation.

Date(s) deployed: Paper posted to arXiv Oct 25, 2024; initial GitHub code commit Oct 21, 2024 [source]

Developer

Website: https://web.archive.org/web/20250115070638/https://github.com/MinorJerry/OpenWebVoyager/tree/main

Legal name: Zhejiang University (et al.) [source]

Entity type: Academic Institution, Corporation

Country (location of developer or first author’s first affiliation): China [source]

Safety policies: What safety and/or responsibility policies are in place? None

System components

Backend model: What model(s) are used to power the system? Uses WebVoyager [source] (which is based on GPT-4o) to obtain imitation data for training multimodal web navigation, and uses GPT-4o directly to evaluate the correctness of web trajectories in subsequent self-improvement loops. The agent itself is built using Idefics2 [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? The system first bootstraps basic web browsing capability by imitating WebVoyager [source], a SOTA web browsing agent. It then explores real-world web environments in an open-ended way, and the successful trajectories are added to the initial imitation learning (IL) dataset to further improve training.
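
The bootstrap-then-explore procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `self_improve`, `train`, `judge`, and `Trajectory`, the fixed query list, and the number of rounds are all hypothetical, and the sketch assumes the policy is retrained on the accumulated dataset after every round.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a trajectory is a list of action strings.
Trajectory = List[str]
Policy = Callable[[str], Trajectory]


def self_improve(
    imitation_data: List[Trajectory],
    queries: List[str],
    train: Callable[[List[Trajectory]], Policy],
    judge: Callable[[str, Trajectory], bool],
    rounds: int = 3,
) -> Tuple[Policy, List[Trajectory]]:
    """Bootstrap from imitation data, then add judged-successful rollouts."""
    dataset = list(imitation_data)      # start from the imitation (IL) data
    policy = train(dataset)             # initial policy learned by imitation
    for _ in range(rounds):
        # Explore: roll out the current policy and keep only the
        # trajectories that the judge marks as successful.
        successes = [t for q in queries if judge(q, (t := policy(q)))]
        dataset.extend(successes)       # augment the dataset with successes
        policy = train(dataset)         # assumption: retrain on everything
    return policy, dataset
```

In the described system, `judge` would correspond to the GPT-4o correctness check and `train` to fine-tuning the Idefics2-based agent; here both are placeholders supplied by the caller.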

Observation space: What is the system able to observe while ‘thinking’? The 3 past screenshots, and the accessibility tree of the current webpage.
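
One way to picture this observation space is a small buffer that keeps the most recent three screenshots plus the latest accessibility tree. The sketch below is only an illustration under assumptions: the class name `ObservationBuffer`, its field names, and the choice to count the current screenshot among the three are hypothetical and not taken from the paper or repository.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Union


@dataclass
class ObservationBuffer:
    """Hypothetical container for the agent's observations."""

    max_screenshots: int = 3            # the card states 3 past screenshots
    accessibility_tree: str = ""        # tree of the current page only
    screenshots: Deque[bytes] = field(init=False)

    def __post_init__(self) -> None:
        # A bounded deque silently drops the oldest screenshot when full.
        self.screenshots = deque(maxlen=self.max_screenshots)

    def update(self, screenshot: bytes, accessibility_tree: str) -> None:
        """Record a new page state after an action."""
        self.screenshots.append(screenshot)
        self.accessibility_tree = accessibility_tree

    def as_model_input(self) -> Dict[str, Union[List[bytes], str]]:
        """Return what the model would see at this step."""
        return {
            "screenshots": list(self.screenshots),
            "accessibility_tree": self.accessibility_tree,
        }
```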

Action space/tools: What direct actions can the system take? Can navigate web pages autonomously, and interact with the user (i.e. the agent can click, input, scroll, go back, restart, wait, and provide an answer to the user).
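
The listed actions can be written down as a small enumeration. The sketch below is illustrative only: the `Action` enum, the `parse_action` helper, and the `name: argument` text format are assumptions made for this example, not the output format actually used by OpenWebVoyager.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    """The seven actions named in this card."""

    CLICK = "click"
    INPUT = "input"
    SCROLL = "scroll"
    GO_BACK = "go back"
    RESTART = "restart"
    WAIT = "wait"
    ANSWER = "answer"


def parse_action(text: str) -> Tuple[Action, Optional[str]]:
    """Parse a hypothetical 'name: argument' string into (Action, argument).

    Raises ValueError if the name is not one of the known actions.
    """
    name, _, arg = text.partition(":")   # split on the first colon only
    action = Action(name.strip().lower())
    arg = arg.strip()
    return action, (arg or None)
```

Restricting the parser to a closed enumeration means any output outside the seven actions is rejected rather than executed.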

User interface: How do users interact with the system? The user can provide the system a query in natural language. However, since the reward signal is grounded in GPT-4o, it is unclear how the model grounds that query (and the authors self-report that the system often hallucinates).

Development cost and compute: What is known about the development costs? Unknown

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? Available [source]
Data: Is data available? Available [source]
Code: Is code available? Available [source]
Scaffolding: Is system scaffolding available? Available [source]
Documentation: Is documentation available? Available [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? Unknown

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Unknown

Evaluation

Notable benchmark evaluations: Mind2Web (20%) [source]

Bespoke testing: See Figure 3 of paper [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? None

Usage statistics and patterns: Are there any notable observations about usage? 7 forks and 65 stars on GitHub [source]

Additional notes

None