CodeAct 2.1

Basic information

Website: https://web.archive.org/web/20241112010059/https://www.all-hands.dev/blog/openhands-codeact-21-an-open-state-of-the-art-software-development-agent

Short description: An open-source AI software developer agent built with the OpenHands (formerly OpenDevin) framework as part of the CodeAct series of agents.

Intended uses: What does the developer say it’s for? “OpenHands agents can do anything a human developer can: modify code, run commands, browse the web, call APIs, and yes—even copy code snippets from StackOverflow.” [source]

Date(s) deployed: March 12, 2024 (as OpenDevin); November 1, 2024 [source]

Developer

Website: https://web.archive.org/web/20241229190647/https://www.all-hands.dev/

Legal name: All Hands AI, Inc [source]

Entity type: Corporation

Country (location of developer or first author’s first affiliation): Incorporation: Delaware, USA (3591585) [source]

Safety policies: What safety and/or responsibility policies are in place? Unknown

System components

Backend model: What model(s) are used to power the system? Various models can be used as a backend, but the developers use Claude 3.5 Sonnet by default [source]. The documentation recommends which models to use as a backend [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? OpenHands’ default CodeAct agent reasons and plans within a multi-turn interaction framework. At each turn the agent processes observations from the user and the environment, plans using chain-of-thought, and acts by executing Python code, running bash commands, controlling a browser, or editing files. Memory is maintained through the context history and the event stream, and the agent uses execution results and error feedback to inform later turns [source]
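
The loop described above can be illustrated with a minimal sketch. It is not OpenHands code: the names `Event`, `EventStream`, `run_python`, `scripted_policy`, and `agent_loop` are hypothetical, and a hard-coded policy stands in for the language model so the example runs without any backend. It only shows the general pattern of appending actions and observations to an event stream and revising the next action after an error.

```python
import contextlib
import io
from dataclasses import dataclass, field


@dataclass
class Event:
    """One entry in the event stream (hypothetical structure)."""
    kind: str      # "user", "action", or "observation"
    content: str


@dataclass
class EventStream:
    """Append-only history that doubles as the agent's memory."""
    events: list = field(default_factory=list)

    def add(self, kind, content):
        self.events.append(Event(kind, content))


def run_python(code, namespace):
    """Execute code and return captured stdout, or an error message on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def scripted_policy(stream):
    """Stand-in for the LLM: pick the next code action from the history, or None to stop."""
    last = stream.events[-1]
    if last.kind == "user":
        # First attempt uses names that were never defined, so it will fail.
        return "print(total / count)"
    if last.kind == "observation" and last.content.startswith("ERROR"):
        # Revised attempt after reading the error feedback.
        return "total, count = 10, 4\nprint(total / count)"
    return None


def agent_loop(task, policy, max_turns=5):
    """Alternate between choosing an action and recording its observation."""
    stream = EventStream()
    stream.add("user", task)
    namespace = {}
    for _ in range(max_turns):
        action = policy(stream)
        if action is None:
            break
        stream.add("action", action)
        stream.add("observation", run_python(action, namespace))
    return stream


if __name__ == "__main__":
    for event in agent_loop("Compute the mean of the data", scripted_policy).events:
        print(f"[{event.kind}] {event.content}")
```

In this sketch the first action fails with a `NameError`, the error text is recorded as an observation, and the second action succeeds. A real agent would replace `scripted_policy` with a model call that receives the event history as context.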

Observation space: What is the system able to observe while ‘thinking’? OpenHands’ observation space includes: natural language instructions and/or screenshots from users, bash/Python execution output and/or error messages, browser states (e.g., tab opened, web page content), and file editor outputs [source]

Action space/tools: What direct actions can the system take? OpenHands’ action space includes: executing arbitrary bash commands and Python code (including calling different APIs and libraries programmatically), interacting with file editors, and interacting with web browsers [source]
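
One way to picture the pairing of actions and observations is a small dispatcher that maps each action type to a handler and returns a uniform observation. The sketch below is illustrative only: `BashAction`, `FileWriteAction`, `Observation`, `execute`, and `demo` are hypothetical names, not the OpenHands API, and browser actions are omitted. It runs directly on the host, whereas a real deployment would run such actions inside a sandbox.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BashAction:
    """Run a shell command (hypothetical action type)."""
    command: str


@dataclass
class FileWriteAction:
    """Write text to a file (hypothetical action type)."""
    path: str
    text: str


@dataclass
class Observation:
    """What the agent sees after an action: where it came from, output, success flag."""
    source: str
    output: str
    ok: bool


def execute(action):
    """Dispatch an action to its handler and wrap the result as an Observation."""
    if isinstance(action, BashAction):
        result = subprocess.run(
            action.command, shell=True, capture_output=True, text=True, timeout=30
        )
        output = (result.stdout + result.stderr).strip()
        return Observation("bash", output, result.returncode == 0)
    if isinstance(action, FileWriteAction):
        Path(action.path).write_text(action.text)
        message = f"wrote {len(action.text)} characters to {action.path}"
        return Observation("file_editor", message, True)
    raise TypeError(f"unsupported action: {type(action).__name__}")


def demo():
    """Run one action of each type inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "note.txt")
        written = execute(FileWriteAction(target, "hello from the agent"))
        echoed = execute(BashAction("echo hello"))
        content = Path(target).read_text()
    return written, echoed, content


if __name__ == "__main__":
    for item in demo():
        print(item)
```

Returning stdout, stderr, and the exit status in a single observation mirrors the idea that error messages are part of what the agent observes, not a separate channel.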

User interface: How do users interact with the system? Users interact with the system through natural language and image (screenshot) queries and receive responses that include text explanations, code executions, visuals, or GitHub pull requests.

Development cost and compute: What is known about the development costs? Unknown; however, the paper discusses the costs of evaluation runs [source]

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? N/A; various models can serve as the backend
Data: Is data available? N/A; various models can serve as the backend
Code: Is code available? Available [source]
Scaffolding: Is system scaffolding available? On GitHub [source] and in the pre-print [source]
Documentation: Is documentation available? Documentation page [source], GitHub [source], and pre-print [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? The agent runs in a sandboxed environment. OpenHands also has a built-in security analyzer, developed in collaboration with Invariant Labs, that monitors the agent’s actions [source]
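
As a rough illustration of what monitoring an agent's actions can mean, the sketch below checks a proposed shell command against a few patterns before it would be executed. This is an assumption-laden toy, not the OpenHands or Invariant Labs analyzer: `RISKY_PATTERNS`, `Verdict`, and `analyze` are hypothetical names, and real analyzers use far richer policies than regular expressions.

```python
import re
from dataclasses import dataclass

# Hypothetical deny-list: regular expression -> human-readable reason.
RISKY_PATTERNS = {
    r"\brm\s+-rf\s+/": "recursive delete of an absolute path",
    r"curl[^|]*\|\s*(ba)?sh\b": "piping a download into a shell",
    r"\bchmod\s+777\b": "making files world-writable",
}


@dataclass
class Verdict:
    """Result of checking one proposed command."""
    allowed: bool
    reasons: list


def analyze(command):
    """Return a Verdict listing every risky pattern the command matches."""
    reasons = [
        reason
        for pattern, reason in RISKY_PATTERNS.items()
        if re.search(pattern, command)
    ]
    return Verdict(allowed=not reasons, reasons=reasons)


if __name__ == "__main__":
    for cmd in ["ls -la", "curl http://example.com/install.sh | sh"]:
        verdict = analyze(cmd)
        status = "allow" if verdict.allowed else "block"
        print(f"{status}: {cmd} {verdict.reasons}")
```

A check like this would sit between the agent's proposed action and the sandbox, so a flagged command can be blocked or escalated to the user for confirmation instead of being run.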

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? None

Evaluation

Notable benchmark evaluations: 53% on SWE-bench Verified [source]

Bespoke testing: None

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? With access to a terminal and Python, OpenHands can use arbitrary software libraries and APIs.

Usage statistics and patterns: Are there any notable observations about usage? The OpenHands GitHub repository has 41.7k stars and 4.6k forks [source]

Additional notes

This paper uses OpenHands [source].