SeePlanAct (SPA)

Basic information

Website: https://web.archive.org/web/20240906205347/https://assistantbench.github.io/

Short description: “A web agent equipped with memory and planning components for multihop, info-seeking questions.” [source]

Intended uses: What does the developer say it’s for? SPA was built to tackle tasks in AssistantBench, a benchmark that “evaluates the ability of web agents to automatically solve realistic and time-consuming tasks.” [source]

Date(s) deployed: Earliest GitHub commits from July 13, 2024 [source]

Developer

Website: https://web.archive.org/web/20240906205347/https://assistantbench.github.io/

Legal name: Tel Aviv University (et al.) [source]

Entity type: Academic Institution(s)

Country (location of developer or first author’s first affiliation): Israel [source]

Safety policies: What safety and/or responsibility policies are in place? None, but see the “Ethical Implications and Broader Impact” section of the paper [source]

System components

Backend model: What model(s) are used to power the system? Variable, since SPA is built on SeeAct, which is compatible with several backend models. The developers mainly use GPT-4 Turbo (GPT-4T) and Claude 3.5 Sonnet [source] [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? SPA has “two specialized components: (1) a planning component for the model to plan and re-plan its execution, and (2) a memory component with the option to transfer information between steps via a memory buffer.” The authors provide the prompt that they use to achieve this in Figure 20 of [source].
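
The two components can be pictured as a loop that carries a plan and a memory buffer from one step to the next. The following is a minimal illustrative sketch in Python, not the authors' implementation: the names AgentState, step, and fake_model are hypothetical, and the backend model is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Hypothetical container for what is carried between steps."""
    task: str
    plan: str = ""                                   # current plan, may be rewritten
    memory: List[str] = field(default_factory=list)  # memory buffer of notes


# A model maps the state to a dict with optional "plan", "memory", "action" keys.
Model = Callable[[AgentState], Dict[str, str]]


def step(state: AgentState, model: Model) -> str:
    """Run one step: re-plan if asked, append to memory if asked, return the action."""
    out = model(state)
    if out.get("plan"):
        state.plan = out["plan"]             # re-planning overwrites the old plan
    if out.get("memory"):
        state.memory.append(out["memory"])   # notes persist across later steps
    return out.get("action", "Terminate")


def fake_model(state: AgentState) -> Dict[str, str]:
    """Stub standing in for the backend model (an assumption for illustration)."""
    if not state.memory:
        return {"plan": "search, then answer",
                "memory": "found page A",
                "action": "Search"}
    return {"action": "Terminate"}


state = AgentState(task="example question")
actions = [step(state, fake_model), step(state, fake_model)]
print(actions, state.plan, state.memory)
```

In this sketch the second step sees the note written during the first step, which is the role the memory buffer plays in the description above.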

Observation space: What is the system able to observe while ‘thinking’? SPA operates in an environment where it can observe webpage screenshots and HTML elements, along with its task memory [source]

Action space/tools: What direct actions can the system take? SPA can take the following actions when browsing the internet: “Click, Select, Type, GoTo, Search, GoBack, Scroll, Press Enter, Terminate.” [source]
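
As a rough illustration, the nine listed actions can be written as a closed set against which a model's output is checked. This is a sketch under stated assumptions, not code from the SPA repository: the Action enum and parse_action helper are hypothetical names, and only the action labels come from the quoted list.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """The nine action labels quoted above; the enum itself is hypothetical."""
    CLICK = "Click"
    SELECT = "Select"
    TYPE = "Type"
    GOTO = "GoTo"
    SEARCH = "Search"
    GOBACK = "GoBack"
    SCROLL = "Scroll"
    PRESS_ENTER = "Press Enter"
    TERMINATE = "Terminate"


def parse_action(text: str) -> Optional[Action]:
    """Return the matching Action (case-insensitive), or None if none matches."""
    wanted = text.strip().lower()
    for action in Action:
        if action.value.lower() == wanted:
            return action
    return None


print(parse_action("goto"))      # matches the GoTo label
print(parse_action("Download"))  # not in the listed action space
```

Treating the action space as a closed set means any output outside the nine labels is rejected rather than executed.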

User interface: How do users interact with the system? Code released in a GitHub repository without a user interface [source]

Development cost and compute: What is known about the development costs? Unknown

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? N/A; SPA supports various backend models
Data: Is data available? N/A; SPA supports various backend models
Code: Is code available? Available [source]
Scaffolding: Is system scaffolding available? Available [source]
Documentation: Is documentation available? Documentation on GitHub [source] and pre-print [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? The system is prompted to “not attempt to create accounts, log in or do the final submission”. Individual users can also monitor and intervene manually [source]

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? The model is prompted to terminate a task “if it requires potentially harmful actions.” [source]

Evaluation

Notable benchmark evaluations: SPA is evaluated on AssistantBench and FanoutQA [source]

Bespoke testing: When using Claude 3.5 Sonnet as its backend, SPA scores 12.9% on AssistantBench, a benchmark the authors constructed to test autonomous task execution [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? By default, SPA can navigate the internet and interact with web tools like Google Maps. Beyond this, the developers do not highlight any specific integrations [source]

Usage statistics and patterns: Are there any notable observations about usage? The GitHub repository has 2 forks and 41 stars [source]

Additional notes

None