DynaSaur

Basic information

Website: https://arxiv.org/abs/2411.01747

Short description: An LLM agent which can dynamically synthesize its own actions by writing Python code.

Intended uses: What does the developer say it’s for? Designed for domains where the set of possible actions is too large to enumerate in advance. Particularly applicable in strongly open-ended domains. General-purpose tool.

Date(s) deployed: Paper posted to arXiv Nov 4, 2024; initial commit to GitHub repo Nov 8, 2024 [source]

Developer

Website: https://web.archive.org/web/20241206200508/https://github.com/adobe-research/dynasaur

Legal name: University of Maryland (et al.) [source]

Entity type: Academic Institution, Corporation

Country (location of developer or first author’s first affiliation): USA [source]

Safety policies: What safety and/or responsibility policies are in place? Unknown

System components

Backend model: What model(s) are used to power the system? LLM backend uses GPT-4o and GPT-4o mini

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? The agent starts with a set of initial actions and a task description. Observing these, it interacts with an IPython kernel and generates new actions for itself to execute in the process of completing the task.
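
The generate-execute-observe cycle described above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain `exec` call on a persistent namespace stands in for the IPython kernel, and `scripted_policy` is a hypothetical stub standing in for the LLM. The names `run_action` and `run_agent` are likewise assumptions made for illustration.

```python
import contextlib
import io


def run_action(code: str, namespace: dict) -> str:
    """Execute code in a persistent namespace; return captured stdout or the error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # stand-in for sending code to an IPython kernel
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def run_agent(task: str, policy, max_steps: int = 5) -> list:
    """Loop: ask the policy for code, execute it, record (code, observation)."""
    namespace = {}   # functions defined in one step stay available in later steps
    trajectory = []
    for _ in range(max_steps):
        code = policy(task, trajectory)
        if code is None:  # the policy signals that it is done
            break
        observation = run_action(code, namespace)
        trajectory.append((code, observation))
    return trajectory


def scripted_policy(task: str, trajectory: list):
    """Hypothetical stub for the LLM: define a new action, then call it."""
    script = [
        "def square(x):\n    return x * x",
        "print(square(7))",
    ]
    return script[len(trajectory)] if len(trajectory) < len(script) else None


if __name__ == "__main__":
    for code, observation in run_agent("Square the number 7", scripted_policy):
        print(repr(code), "->", repr(observation))
```

Because the namespace persists across steps, the function defined in the first step is callable in the second, which mirrors how newly generated actions become available to the agent later in the same task.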

Observation space: What is the system able to observe while ‘thinking’? The agent observes the set of predefined actions, the task description (provided by the end user), and its current trajectory (represented as a chain of thought-action-observation steps).
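
One way to picture this observation space is as a small data structure that is rendered into text at each step. The sketch below is an assumption made for illustration, not the format used by the developers: the class names `Step` and `AgentContext`, the `render` method, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str      # the agent's reasoning for this step
    action: str       # Python code proposed by the agent
    observation: str  # result of executing that code


@dataclass
class AgentContext:
    task: str
    predefined_actions: list
    trajectory: list = field(default_factory=list)

    def render(self) -> str:
        """Flatten the task, action names, and trajectory into one text block."""
        lines = [
            "Task: " + self.task,
            "Available actions: " + ", ".join(self.predefined_actions),
        ]
        for i, step in enumerate(self.trajectory, start=1):
            lines.append(f"Thought {i}: {step.thought}")
            lines.append(f"Action {i}: {step.action}")
            lines.append(f"Observation {i}: {step.observation}")
        return "\n".join(lines)


if __name__ == "__main__":
    ctx = AgentContext("Sum 2 and 3", ["submit_final_answer"])
    ctx.trajectory.append(Step("Add the two numbers.", "print(2 + 3)", "5"))
    print(ctx.render())
```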

Action space/tools: What direct actions can the system take? The agent proposes actions represented as Python functions, which can be, but are not limited to, the user-defined actions or compositions thereof. These Python functions can in turn interact with the internet, the OS, or anything else. Notably, the agent can call an action retriever, which retrieves Python descriptions of previously generated actions (these are not included by default due to context window concerns); the retrieved descriptions are returned in the subsequent step as part of the observation.
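
The idea of accumulating generated actions and retrieving them on demand can be sketched as follows. This is a simplified illustration under stated assumptions: the class `ActionLibrary` and its methods are hypothetical, and ranking by word overlap is a stand-in chosen to keep the example self-contained; it is not claimed to be the retrieval method the system uses.

```python
import ast


class ActionLibrary:
    """Hypothetical store for Python functions the agent has generated."""

    def __init__(self):
        self._sources = {}  # function name -> source code of that function

    def add(self, source: str) -> list:
        """Record every top-level function defined in `source`; return their names."""
        names = []
        for node in ast.parse(source).body:
            if isinstance(node, ast.FunctionDef):
                self._sources[node.name] = ast.get_source_segment(source, node)
                names.append(node.name)
        return names

    def retrieve(self, query: str, k: int = 3) -> list:
        """Return up to k stored sources, ranked by word overlap with the query."""
        query_words = set(query.lower().split())
        scored = []
        for name, src in self._sources.items():
            doc = ast.get_docstring(ast.parse(src).body[0]) or ""
            words = set((name.replace("_", " ") + " " + doc).lower().split())
            overlap = len(query_words & words)
            if overlap:
                scored.append((overlap, name))
        # Highest overlap first; ties broken alphabetically by name.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [self._sources[name] for _, name in scored[:k]]


if __name__ == "__main__":
    library = ActionLibrary()
    library.add(
        'def count_words(text):\n'
        '    """Count the words in a text string."""\n'
        '    return len(text.split())\n'
    )
    for source in library.retrieve("count words in document"):
        print(source)
```

Keeping the library outside the prompt and returning only the top matches reflects the context-window concern noted above: the agent sees a stored action only when it asks for it.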

User interface: How do users interact with the system? The user provides an initial task description, but otherwise doesn’t interact with the system once deployed.

Development cost and compute: What is known about the development costs? Unknown

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? Available [source]
Data: Is data available? Open sourced on GitHub [source]
Code: Is code available? Open sourced on GitHub [source]
Scaffolding: Is system scaffolding available? Open sourced on GitHub [source]
Documentation: Is documentation available? Open sourced on GitHub [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? Unknown

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Unknown

Evaluation

Notable benchmark evaluations: GAIA: ranked 7th on the leaderboard [source] (1st at time of publication), scoring 27% with the GPT-4o-mini-powered agent and 38% with the GPT-4o-powered agent.

Bespoke testing: GAIA benchmark [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? Can work in principle with anything that interacts with Python.

Usage statistics and patterns: Are there any notable observations about usage? 18 forks and 228 stars on GitHub [source]

Additional notes

None