DynaSaur
Basic Information
Website: https://arxiv.org/abs/2411.01747
Short description: An LLM agent which can dynamically synthesize its own actions by writing Python code.
Intended uses: What does the developer state that the system is intended for?: Designed for domains where the set of possible actions is too large to enumerate exhaustively in advance; particularly applicable to strongly open-ended domains. Positioned as a general-purpose tool.
Date(s) deployed: Paper posted to arXiv Nov 4, 2024; initial commit to the GitHub repo Nov 8, 2024 [source]
Developer
System Components
Backend model(s): What model(s) are used to power the system?: LLM backend uses GPT-4o and GPT-4o mini
Public model specification: Is there formal documentation on the system’s intended behavior?: None
Description of reasoning, planning, and memory implementation: How does the system reason, plan, and remember?: The agent is pre-populated with a set of predefined actions and a task description. Conditioning on these, it interacts with an IPython kernel and generates new actions for itself to execute in the course of completing the task.
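The generate-and-execute loop described above can be sketched as follows. This is a simplified stand-in, not the DynaSaur implementation: the real agent uses an LLM to propose code and a persistent IPython kernel, whereas this sketch uses a scripted proposer and Python's built-in `exec` with a shared namespace standing in for kernel state.

```python
import io
import contextlib

def execute_action(code: str, namespace: dict) -> str:
    """Run agent-generated Python code in a persistent namespace,
    returning captured stdout (or the error) as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # state persists across steps, like a kernel
    except Exception as exc:
        return f"Error: {exc}"
    return buffer.getvalue().strip()

# Hypothetical trajectory: each step is a (thought, action, observation) triple.
trajectory = []
namespace = {}  # shared state, standing in for the IPython kernel

steps = [
    ("Define a helper action.",
     "def add(a, b):\n    return a + b\nprint('defined add')"),
    ("Use the newly defined action.", "print(add(2, 3))"),
]
for thought, action in steps:
    observation = execute_action(action, namespace)
    trajectory.append((thought, action, observation))

for thought, action, obs in trajectory:
    print(obs)  # "defined add", then "5"
```

Because the namespace persists across steps, a function defined by one generated action remains callable by all later actions, which is what lets the agent accumulate a growing action library within a task.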
Observation space: What is the system able to observe while 'thinking'?: The agent observes the set of predefined actions, the task description (provided by the end user), and its current trajectory (represented as a chain of thought-action-observation steps).
Action space/tools: What direct actions can the system take?: The agent proposes actions represented as Python functions, which may be, but are not limited to, the user-defined actions or compositions thereof. These Python functions can in turn interact with the internet, the OS, or anything else. Notably, the agent can call an action retriever, which fetches Python descriptions of previously generated actions (not included in context by default due to context-window concerns); these descriptions are returned in the subsequent step as part of the observation.
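The action-retrieval step might look like the following sketch. This is an illustrative assumption, not DynaSaur's actual retriever; here, relevance is scored by simple token overlap between the query and each stored function's name plus docstring, and only descriptions (not function bodies) are returned, mirroring the context-window motivation above.

```python
def describe(fn) -> str:
    """Return a short Python description of an action: its name
    and docstring, standing in for what the retriever hands back."""
    doc = (fn.__doc__ or "").strip()
    return f"def {fn.__name__}(...): \"{doc}\""

def retrieve_actions(query: str, library: list, k: int = 2) -> list:
    """Score each previously generated action by token overlap with the
    query and return descriptions of the top-k matches."""
    q = set(query.lower().split())
    scored = []
    for fn in library:
        text = f"{fn.__name__} {(fn.__doc__ or '')}".lower()
        score = len(q & set(text.split()))
        scored.append((score, describe(fn)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [desc for score, desc in scored[:k] if score > 0]

# Hypothetical previously generated actions.
def download_page(url):
    """Fetch the HTML of a web page."""

def parse_csv(path):
    """Read a CSV file and return its rows."""

library = [download_page, parse_csv]
print(retrieve_actions("fetch a web page", library, k=1))
```

A production retriever would more plausibly use embedding similarity, but the shape is the same: the agent queries its own accumulated action library and receives compact descriptions it can then invoke or compose.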
User interface: How do users interact with the system?: The user provides an initial task description, but otherwise doesn't interact with the system once deployed.
Development cost and compute: What is known about the development costs?: Unknown
Guardrails & Oversight
Accessibility of components
Weights: Are model parameters available?: Available [source]
Data: Is data available?: Open sourced on GitHub [source]
Code: Is code available?: Open sourced on GitHub [source]
Documentation: Is documentation available?: Open sourced on GitHub [source]
Scaffolding: Is system scaffolding available?: Open sourced on GitHub [source]
Controls and guardrails: What notable methods are used to protect against harmful use?: Unknown
Monitoring and shutdown procedures: Are there any notable methods or protocols to monitor or shut down the system?: Unknown
Customer and usage restrictions: Are there know-your-customer measures or other usage restrictions?: None
Evaluation
Notable benchmark evaluations (e.g., on SWE-Bench Verified): Ranked 7th on the GAIA leaderboard [source] (1st at time of publication); 27% accuracy for the GPT-4o-mini-powered agent and 38% for the GPT-4o-powered agent.
Bespoke testing (e.g., demos): GAIA benchmark [source]
Safety: Have safety evaluations been conducted by the developers? What were the ...: None
Publicly reported external red-teaming or comparable auditing
Personnel: Who were the red-teamers/auditors?: None
Scope, scale, access, and methods: What access did red-teamers/auditors have and...: None
Findings: What did the red-teamers/auditors conclude?: None
Ecosystem
Interoperability with other systems: What tools or integrations are available?: Can work in principle with anything that interacts with Python.
Usage statistics and patterns: Are there any notable observations about usage?: 18 forks and 228 stars on GitHub [source]
Other notes (if any): --