The AI Agent Index

Documenting the technical and safety features of deployed agentic AI systems

AutoWebGLM


Basic information

Website: https://arxiv.org/abs/2404.03648

Short description: AutoWebGLM is an automated web navigation agent [source]

Intended uses: What does the developer say it’s for? Web browsing tasks [source]

Date(s) deployed: Paper posted to arXiv on April 4, 2024; initial GitHub code commit on April 3, 2024 [source] [source]


Developer

Website: https://github.com/THUDM/AutoWebGLM

Legal name: Tsinghua University (et al.) [source]

Entity type: Academic Institution, Corporation

Country (location of developer or first author’s first affiliation): China [source]

Safety policies: What safety and/or responsibility policies are in place? None


System components

Backend model: What model(s) are used to power the system? ChatGLM3-6B model [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? Paper posted to arXiv on April 4, 2024; initial GitHub code commit on April 3, 2024 [source] [source]

Observation space: What is the system able to observe while ‘thinking’? The observation space consists of states that include simplified HTML information, current location within webpage, and past operation records [source].
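To make the three observation components concrete, here is a minimal Python sketch of how one step's state might be bundled and flattened into a text prompt. The class name `Observation`, its field names, the fractional scroll position, and the prompt layout are all illustrative assumptions, not the data structures used in the AutoWebGLM code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    """Hypothetical container for one step's observation (names are illustrative)."""

    simplified_html: str    # simplified HTML of the current page
    scroll_position: float  # location within the page as a fraction in [0, 1] (assumed encoding)
    history: List[str] = field(default_factory=list)  # past operation records

    def to_prompt(self) -> str:
        """Flatten the observation into a single text prompt."""
        past = "\n".join(self.history) if self.history else "(none)"
        return (
            f"HTML:\n{self.simplified_html}\n\n"
            f"Position: {self.scroll_position:.0%}\n\n"
            f"Previous operations:\n{past}"
        )


obs = Observation(
    simplified_html='<button id="1">Search</button>',
    scroll_position=0.25,
    history=["type(id=0, text='laptops')"],
)
print(obs.to_prompt())
```

Keeping the history as plain strings is only a convenience for the sketch; a real agent may store structured records instead.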

Action space/tools: What direct actions can the system take? The action space includes the following: Click an element, Hover over an element, Select an option in an element, Type into an element, Scroll the page up or down, Go forward or backward in the page history, Jump to a URL, Switch to the i-th tab, Notify the user to interact, Stop with an answer [source]
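The ten actions listed above can be sketched as a small typed action space with a well-formedness check. This is an illustrative Python sketch only: the enum values, the `target`/`value` fields, and the `REQUIRED` table are assumptions made for this example and do not reproduce the action syntax used by AutoWebGLM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    CLICK = "click"
    HOVER = "hover"
    SELECT = "select"
    TYPE = "type"
    SCROLL = "scroll"            # up or down
    GO = "go"                    # forward or backward
    JUMP = "jump_to"             # open a URL
    SWITCH_TAB = "switch_tab"    # switch to the i-th tab
    NOTIFY_USER = "notify_user"  # hand control back to the user
    STOP = "stop"                # finish with an answer


@dataclass(frozen=True)
class Action:
    kind: ActionType
    target: Optional[str] = None  # element id, URL, or tab index (assumed encoding)
    value: Optional[str] = None   # typed text, option, direction, message, or answer


# Which fields each action needs in this sketch (an assumption, not the paper's spec).
REQUIRED = {
    ActionType.CLICK: ("target",),
    ActionType.HOVER: ("target",),
    ActionType.SELECT: ("target", "value"),
    ActionType.TYPE: ("target", "value"),
    ActionType.SCROLL: ("value",),
    ActionType.GO: ("value",),
    ActionType.JUMP: ("target",),
    ActionType.SWITCH_TAB: ("target",),
    ActionType.NOTIFY_USER: ("value",),
    ActionType.STOP: ("value",),
}


def is_well_formed(action: Action) -> bool:
    """Return True if every field the action type requires is present."""
    return all(getattr(action, name) is not None for name in REQUIRED[action.kind])


print(is_well_formed(Action(ActionType.TYPE, target="12", value="laptops")))
print(is_well_formed(Action(ActionType.TYPE, target="12")))
```

Validating model output against a closed set like this is one way to reject malformed actions before they reach the browser.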

User interface: How do users interact with the system? A Chrome extension where users write prompts to perform operations on websites [source]

Development cost and compute: What is known about the development costs? Unknown


Guardrails and oversight

Accessibility of components:

  • Weights: Are model parameters available? Available [source]
  • Data: Is data available? Available [source]
  • Code: Is code available? Available [source]
  • Scaffolding: Is system scaffolding available? Available [source]
  • Documentation: Is documentation available? Some documentation on GitHub [source] and arXiv [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? Unknown

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Unknown


Evaluation

Notable benchmark evaluations: Mind2Web (59.5), MiniWoB++ (89.3%), WebArena (18.2%) [source]

Bespoke testing: Reported results on their own benchmark called AutoWebBench [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? The authors “identify errors that occasionally occur during task execution, which can be broadly categorized into four types: hallucinations, poor graphical recognition, misinterpretation of task context, and pop-up interruptions”. In Table 7, they report how often these types of errors occur [source]

Publicly reported external red-teaming or comparable auditing:

  • Personnel: Who were the red-teamers/auditors? None
  • Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
  • Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? Chrome [source]

Usage statistics and patterns: Are there any notable observations about usage? 65 forks and 784 stars on GitHub [source]


Additional notes

None