DynaSaur
Basic Information
Website: https://arxiv.org/abs/2411.01747
Short description: An LLM agent which can dynamically synthesize its own actions by writing Python code.
Intended uses: What does the developer state that the system is intended for?: Designed for domains where the set of possible actions is too large to enumerate exhaustively in advance; particularly applicable to strongly open-ended domains. Positioned as a general-purpose tool.
Date(s) deployed: Paper posted to arXiv Nov 4, 2024; initial commit to the GitHub repo Nov 8, 2024 [source]
Developer
System Components
Backend model(s): What model(s) are used to power the system?: LLM backend uses GPT-4o and GPT-4o mini
Public model specification: Is there formal documentation on the system’s intended behavior?: None
Description of reasoning, planning, and memory implementation: How does the system reason, plan, and remember?: The agent is pre-populated with a set of predefined actions and a task description. Conditioning on these, it interacts with an IPython kernel and generates new actions for itself to execute in the course of completing the task.
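The generate-and-execute loop described above can be sketched as follows. This is a simplified stand-in, not the DynaSaur implementation: the real agent uses an LLM to propose code and a persistent IPython kernel, whereas this sketch uses a scripted proposer and Python's built-in `exec` with a shared namespace standing in for kernel state.

```python
import io
import contextlib

def execute_action(code: str, namespace: dict) -> str:
    """Run agent-generated Python code in a persistent namespace,
    returning captured stdout (or the error) as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # state persists across steps, like a kernel
    except Exception as exc:
        return f"Error: {exc}"
    return buffer.getvalue().strip()

# Hypothetical trajectory: each step is a (thought, action, observation) triple.
trajectory = []
namespace = {}  # shared state, standing in for the IPython kernel

steps = [
    ("Define a helper action.",
     "def add(a, b):\n    return a + b\nprint('defined add')"),
    ("Use the newly defined action.", "print(add(2, 3))"),
]
for thought, action in steps:
    observation = execute_action(action, namespace)
    trajectory.append((thought, action, observation))

for thought, action, obs in trajectory:
    print(obs)  # "defined add", then "5"
```

Because the namespace persists across steps, a function defined by one generated action remains callable by all later actions, which is what lets the agent accumulate a growing action library within a task.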
Observation space: What is the system able to observe while 'thinking'?: The agent observes the set of predefined actions, the task description (provided by the end user), and its current trajectory (represented as a chain of thought-action-observation steps).
Action space/tools: What direct actions can the system take?: The agent proposes actions represented as Python functions, which may be, but are not limited to, the user-defined actions or compositions thereof. These Python functions can in turn interact with the internet, the OS, or anything else. Notably, the agent can call an action retriever, which fetches Python descriptions of previously generated actions (not included in context by default due to context-window concerns); these descriptions are returned in the subsequent step as part of the observation.
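The action-retrieval step might look like the following sketch. This is an illustrative assumption, not DynaSaur's actual retriever; here, relevance is scored by simple token overlap between the query and each stored function's name plus docstring, and only descriptions (not function bodies) are returned, mirroring the context-window motivation above.

```python
def describe(fn) -> str:
    """Return a short Python description of an action: its name
    and docstring, standing in for what the retriever hands back."""
    doc = (fn.__doc__ or "").strip()
    return f"def {fn.__name__}(...): \"{doc}\""

def retrieve_actions(query: str, library: list, k: int = 2) -> list:
    """Score each previously generated action by token overlap with the
    query and return descriptions of the top-k matches."""
    q = set(query.lower().split())
    scored = []
    for fn in library:
        text = f"{fn.__name__} {(fn.__doc__ or '')}".lower()
        score = len(q & set(text.split()))
        scored.append((score, describe(fn)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [desc for score, desc in scored[:k] if score > 0]

# Hypothetical previously generated actions.
def download_page(url):
    """Fetch the HTML of a web page."""

def parse_csv(path):
    """Read a CSV file and return its rows."""

library = [download_page, parse_csv]
print(retrieve_actions("fetch a web page", library, k=1))
```

A production retriever would more plausibly use embedding similarity, but the shape is the same: the agent queries its own accumulated action library and receives compact descriptions it can then invoke or compose.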
User interface: How do users interact with the system?: The user provides an initial task description, but otherwise doesn't interact with the system once deployed.
Development cost and compute: What is known about the development costs?: Unknown
Guardrails & Oversight
Accessibility of components
Weights: Are model parameters available?: Available [source]
Data: Is data available?: Open sourced on GitHub [source]
Code: Is code available?: Open sourced on GitHub [source]
Documentation: Is documentation available?: Open sourced on GitHub [source]
Scaffolding: Is system scaffolding available?: Open sourced on GitHub [source]
Controls and guardrails: What notable methods are used to protect against harmful use?: Unknown
Monitoring and shutdown procedures: Are there any notable methods or protocols to monitor or shut down the system?: Unknown
Customer and usage restrictions: Are there know-your-customer measures or other usage restrictions?: None
Evaluation
Notable benchmark evaluations (e.g., on SWE-Bench Verified): Ranked 7th on the GAIA leaderboard [source] (1st at time of publication); 27% accuracy for the GPT-4o-mini-powered agent and 38% for the GPT-4o-powered agent.
Bespoke testing (e.g., demos): GAIA benchmark [source]
Safety: Have safety evaluations been conducted by the developers? What were the ...: None
Publicly reported external red-teaming or comparable auditing
Personnel: Who were the red-teamers/auditors?: None
Scope, scale, access, and methods: What access did red-teamers/auditors have and...: None
Findings: What did the red-teamers/auditors conclude?: None
Ecosystem
Interoperability with other systems: What tools or integrations are available?: Can work in principle with anything that interacts with Python.
Usage statistics and patterns: Are there any notable observations about usage?: 18 forks and 228 stars on GitHub [source]
Other notes (if any): --