Agent Workflow Memory
Basic information
Website: https://arxiv.org/pdf/2409.07429
Short description: AWM is a general-purpose web agent that learns reusable routines (workflows) for common tasks and stores them in a ‘workflow memory’ that guides future task solving.
Intended uses: What does the developer say it’s for? General-purpose web tasks.
Date(s) deployed: Paper posted to arXiv on September 11, 2024 [source]
Developer
Website: https://arxiv.org/pdf/2409.07429
Legal name: Carnegie Mellon University (et al.) [source]
Entity type: Academic Institutions [source]
Country (location of developer or first author’s first affiliation): USA [source]
Safety policies: What safety and/or responsibility policies are in place? None
System components
Backend model: What model(s) are used to power the system? They use GPT-4 (gpt-4-0613) [source]
Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None
Reasoning, planning, and memory implementation: How does the system ‘think’? AWM is a typical agent built on BrowserGym [source]. It can build its workflow memory either offline, by inducing workflows from training examples before inference, or online, by inducing them from test queries on the fly. To add new workflows, “the agent takes actions to solve given queries, induces workflows from successful ones, and integrates them into memory.” [source]
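The solve, induce, and integrate loop quoted above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Workflow`, `WorkflowMemory`, and `run_online`, and the three `toy_*` callables, are all hypothetical stand-ins. In the real system, solving, success judgment, and workflow induction would be handled by the backend model rather than by the simple stubs used here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """A reusable routine: a short description plus an ordered list of steps."""
    description: str
    steps: List[str]


@dataclass
class WorkflowMemory:
    """Hypothetical store of induced workflows that is rendered into the prompt."""
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        # Skip workflows whose description is already stored.
        if all(w.description != workflow.description for w in self.workflows):
            self.workflows.append(workflow)

    def as_prompt(self) -> str:
        # Render every stored workflow as plain text for the agent's context.
        return "\n\n".join(
            f"## {w.description}\n" + "\n".join(w.steps) for w in self.workflows
        )


def run_online(
    queries: List[str],
    solve: Callable[[str, str], List[str]],
    is_success: Callable[[str, List[str]], bool],
    induce: Callable[[str, List[str]], List[Workflow]],
    memory: WorkflowMemory,
) -> WorkflowMemory:
    """Solve each query, then induce workflows only from successful attempts."""
    for query in queries:
        trajectory = solve(query, memory.as_prompt())
        if is_success(query, trajectory):
            for workflow in induce(query, trajectory):
                memory.add(workflow)
    return memory


# Toy stand-ins (assumptions for illustration only).
def toy_solve(query: str, memory_prompt: str) -> List[str]:
    return [f"click('{query}')", "stop()"]


def toy_is_success(query: str, trajectory: List[str]) -> bool:
    return "fail" not in query


def toy_induce(query: str, trajectory: List[str]) -> List[Workflow]:
    return [Workflow(description=f"How to {query}", steps=trajectory)]
```

Calling `run_online(["search product", "fail checkout"], toy_solve, toy_is_success, toy_induce, WorkflowMemory())` stores a workflow only for the first query, because the second is judged unsuccessful by the toy check.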
Observation space: What is the system able to observe while ‘thinking’? The system observes accessibility tree representations from each website [source]
Action space/tools: What direct actions can the system take? Webpage actions based on the page’s accessibility tree. This includes post/submission actions. The system has the additional ability to query and write to the agent workflow memory.
User interface: How do users interact with the system? A coding IDE and terminal.
Development cost and compute: What is known about the development costs? Unknown
Guardrails and oversight
Accessibility of components:
- Weights: Are model parameters available? N/A; the backend is an external model accessed via API
- Data: Is data available? N/A; the backend is an external model accessed via API
- Code: Is code available? Available [source]
- Scaffolding: Is system scaffolding available? Available [source]
- Documentation: Is documentation available? Unavailable (none beyond paper and GitHub)
Controls and guardrails: What notable methods are used to protect against harmful actions? None
Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None
Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? None
Evaluation
Notable benchmark evaluations: WebArena (35.5% task success rate) and Mind2Web (4.8% task success rate) [source]
Bespoke testing: None
Safety: Have safety evaluations been conducted by the developers? What were the results? None
Publicly reported external red-teaming or comparable auditing:
- Personnel: Who were the red-teamers/auditors? None
- Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
- Findings: What did the red-teamers/auditors conclude? None
Ecosystem information
Interoperability with other systems: What tools or integrations are available? None
Usage statistics and patterns: Are there any notable observations about usage? The GitHub repo has 227 stars and 20 forks [source]
Additional notes
None