CodeActAgent

Basic information

Website: https://arxiv.org/abs/2402.01030

Short description: CodeActAgent can autonomously execute code and self-debug to carry out programming tasks [source]

Intended uses: What does the developer say it’s for? “CodeActAgent, designed for seamless integration with Python, can carry out sophisticated tasks (e.g., model training, data visualization) using existing Python packages.” [source]

Date(s) deployed: February 1, 2024 [source]

Developer

Website: https://web.archive.org/web/20241223175319/https://github.com/xingyaoww/code-act

Legal name: University of Illinois Urbana-Champaign (et al.) [source]

Entity type: Academic Institution, Industry Organization [source]

Country (location of developer or first author’s first affiliation): USA [source]

Safety policies: What safety and/or responsibility policies are in place? None, but see the “Impact Statement” in the paper [source]

System components

Backend model: What model(s) are used to power the system? Variable, defaulting to Llama2 and Mistral [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? CodeActAgent plans its actions through chain-of-thought reasoning. It also uses automated feedback from programming terminals (e.g., error messages) to self-debug its code [source]
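
The self-debugging behavior described above can be pictured as a retry loop in which each error message becomes the next input to the model. The following is a minimal sketch, not the authors' implementation: run_code, self_debug, and toy_propose are hypothetical names, and toy_propose is a hard-coded stand-in for the language model.

```python
import traceback


def run_code(code, namespace):
    """Execute a code string and return (succeeded, feedback text)."""
    try:
        exec(code, namespace)
        return True, "ok"
    except Exception:
        # The formatted error is what gets fed back to the proposer.
        return False, traceback.format_exc(limit=1)


def self_debug(propose, max_turns=3):
    """Ask `propose` for code, run it, and feed errors back until it succeeds."""
    namespace = {}
    feedback = None
    history = []
    for _ in range(max_turns):
        code = propose(feedback)
        ok, feedback = run_code(code, namespace)
        history.append((code, ok))
        if ok:
            break
    return namespace, history


def toy_propose(feedback):
    """Hypothetical stand-in for the model: the first attempt is buggy."""
    if feedback is None:
        return "result = total / count"  # fails: names are undefined
    # A real agent would read `feedback`; this stub simply returns a fix.
    return "total, count = 10, 4\nresult = total / count"


namespace, history = self_debug(toy_propose)
print(namespace["result"], [ok for _, ok in history])
```

The max_turns cap is an assumption of this sketch; it keeps a loop that never produces working code from running indefinitely.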

Observation space: What is the system able to observe while ‘thinking’? “For each turn of interaction, the agent receives an observation (input) either from the user (e.g., natural language instruction) or the environment (e.g., code execution result).” [source]

Action space/tools: What direct actions can the system take? CodeActAgent can execute code actions in a Python terminal, which may then call an available API [source]
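
To illustrate how a code action can produce the execution-result observation mentioned above, here is a minimal sketch under stated assumptions. It runs a code string in a persistent namespace and returns the captured output, or the error message, as text. The function name execute_action and the use of exec in place of a sandboxed interpreter are assumptions made for illustration, not details taken from the paper.

```python
import contextlib
import io


def execute_action(code, namespace):
    """Run one code action and return the text the agent would observe."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # Errors are returned as observations rather than raised.
        buffer.write(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue()


# Variables persist between turns because the namespace is reused.
session = {}
first_observation = execute_action("x = 21\nprint(x * 2)", session)
second_observation = execute_action("print(x + undefined_name)", session)
print(first_observation, second_observation)
```

Running untrusted generated code with exec is unsafe outside a sandbox; the sketch shows only the action-to-observation flow.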

User interface: How do users interact with the system? While the developers’ demos sometimes include a user interface (UI), there is no functioning, publicly available UI.

Development cost and compute: What is known about the development costs? “All SFT [supervised fine-tuning] experiments are performed on one 4xA100 40GB SXM node using a fork of Megatron-LLM with a training throughput of around 9k tokens per second.” [source]

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? Backend models are external. Weights are available [source]
Data: Is data available? Backend models are external. Fine-tuning data is open-sourced [source]
Code: Is code available? Available [source]
Scaffolding: Is system scaffolding available? Available [source]
Documentation: Is documentation available? Basic documentation on GitHub [source] and pre-print [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? None

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? None

Evaluation

Notable benchmark evaluations: 46.2% on MiniWoB++ when the backend model is Mistral 7B [source].

Bespoke testing: CodeActAgent performs well on a benchmark that was constructed by the authors to test for tool composition (M3ToolEval) [source]. A demo is on the GitHub repo [source].

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? By default, CodeActAgent is integrated with a Python interpreter and can leverage existing Python packages. By providing API calls as Python functions, CodeActAgent can search Wikipedia and control robots. As CodeActAgent is open-source, it can be modified to integrate with other systems [source]
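
As a rough illustration of providing API calls as Python functions, the sketch below places a tool function in the namespace where the agent's code runs, so that a single code action can call it inside ordinary control flow. Everything here is hypothetical: search_wiki is an offline stub backed by a small dictionary, not a real Wikipedia client, and the TOOLS registry is an assumption of this example.

```python
def search_wiki(query):
    """Hypothetical tool: offline stand-in for a Wikipedia search call."""
    pages = {
        "Python": "Python is a programming language.",
        "Llama": "The llama is a domesticated camelid.",
    }
    return pages.get(query, "No result.")


# Tools are exposed by name in the namespace the code action runs in.
TOOLS = {"search_wiki": search_wiki}

# Code the agent might emit: one action calls the tool several times in a loop.
code_action = """
summaries = {}
for topic in ["Python", "Llama", "Unknown"]:
    summaries[topic] = search_wiki(topic)
"""

namespace = dict(TOOLS)
exec(code_action, namespace)
summaries = namespace["summaries"]
print(summaries)
```

Adding another integration in this sketch amounts to adding another function to TOOLS.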

Usage statistics and patterns: Are there any notable observations about usage? GitHub repo has 519 stars and 40 forks [source]

Additional notes

None