Magentic-One

Basic information

Website: https://www.microsoft.com/en-us/research/publication/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/

Short description: A generalist multi-agent system introduced by Microsoft for solving complex tasks.

Intended uses: What does the developer say it’s for? It is used for “ad-hoc, open-ended tasks such as browsing the web and interacting with web-based applications, handling files, and writing and executing Python code” [source]

Date(s) deployed: Announced November 4, 2024 [source]

Developer

Website: https://web.archive.org/web/20241231232226/https://www.microsoft.com/en-us/

Legal name: Microsoft Corporation [source]

Entity type: Corporation [source]

Country (location of developer or first author’s first affiliation): Incorporation: Washington, USA (MICROSOFT CORPORATION (2357303)) [source]. Registration: Delaware, USA. HQ: Washington, USA [source]

Safety policies: What safety and/or responsibility policies are in place? Model evaluations and red teaming; model reporting and information sharing; security controls [source]. Microsoft’s safety policies are described online [source]

System components

Backend model: What model(s) are used to power the system? The default model is gpt-4o-2024-05-13, but the developers also experiment with OpenAI o1 [source].

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? Available [source]

Reasoning, planning, and memory implementation: How does the system ‘think’? The system contains multiple subagents that work together to solve problems. A lead “Orchestrator” agent directs the task at a high level, and the “WebSurfer,” “FileSurfer,” “Coder,” and “ComputerTerminal” agents carry out the individual steps. [source]
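
The division of labor described above can be illustrated with a minimal sketch. This is not Magentic-One's actual code or the AutoGen API: the `Orchestrator` class, the `AGENTS` table, and the stub agent functions are hypothetical names invented for illustration, and the plan is hard-coded, whereas the real system relies on a language model to plan and delegate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


# Hypothetical stand-ins for the specialist agents; each just reports
# what it would have done instead of touching the web, files, or a shell.
def web_surfer(instruction: str) -> str:
    return f"browsed the web for: {instruction}"


def file_surfer(instruction: str) -> str:
    return f"read files for: {instruction}"


def coder(instruction: str) -> str:
    return f"wrote code for: {instruction}"


def computer_terminal(instruction: str) -> str:
    return f"executed: {instruction}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "WebSurfer": web_surfer,
    "FileSurfer": file_surfer,
    "Coder": coder,
    "ComputerTerminal": computer_terminal,
}


@dataclass
class Orchestrator:
    """Toy lead agent: walks a plan and delegates each step by agent name."""

    agents: Dict[str, Callable[[str], str]]
    log: List[str] = field(default_factory=list)

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        for agent_name, instruction in plan:
            handler = self.agents.get(agent_name)
            if handler is None:
                # Record the failure rather than crash, so later steps still run.
                self.log.append(f"{agent_name}: no such agent")
                continue
            self.log.append(f"{agent_name}: {handler(instruction)}")
        return self.log


if __name__ == "__main__":
    plan = [
        ("WebSurfer", "find the dataset download page"),
        ("Coder", "a script that parses the CSV"),
        ("ComputerTerminal", "python parse.py"),
    ]
    for entry in Orchestrator(AGENTS).run(plan):
        print(entry)
```

The sketch shows only the routing pattern: one component decides which specialist handles each step and keeps a record of the results.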

Observation space: What is the system able to observe while ‘thinking’? It has full access to a filesystem and web browser.

Action space/tools: What direct actions can the system take? It can browse the web (including posting content), execute filesystem commands, and write and execute code.

User interface: How do users interact with the system? Users can configure and experiment with it using the AutoGen package [source]

Development cost and compute: What is known about the development costs? Unknown

Guardrails and oversight

Accessibility of components:

Weights: Are model parameters available? N/A; the system can use various backend models
Data: Is data available? N/A; the system can use various backend models
Code: Is code available? Available on GitHub as part of Microsoft’s AutoGen project [source]
Scaffolding: Is system scaffolding available? Available [source]
Documentation: Is documentation available? Available on GitHub [source], see also the technical report [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? The developers recommend using containers, virtual environments, log monitoring, human oversight, access limitations, and data safeguards.
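
Three of the recommended measures (access limitations, human oversight, and log monitoring) can be combined in a small wrapper around command execution. The sketch below is an assumption-laden illustration, not part of Magentic-One or AutoGen: `guarded_execute`, `ALLOWED_PROGRAMS`, and the `approve` and `run` callbacks are hypothetical names, and a real deployment would also run commands inside a container or virtual environment.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent_guard")

# Hypothetical allowlist: only these programs may be launched by the agent.
ALLOWED_PROGRAMS = {"ls", "cat", "python"}


def guarded_execute(
    command: str,
    approve: Callable[[str], bool],
    run: Callable[[str], str],
) -> str:
    """Run `command` only if it is allowlisted and a human approves it."""
    parts = command.split()
    program = parts[0] if parts else ""

    # Access limitation: refuse anything outside the allowlist.
    if program not in ALLOWED_PROGRAMS:
        logger.warning("blocked (not allowlisted): %r", command)
        return f"blocked: '{program}' is not on the allowlist"

    # Human oversight: a reviewer must approve before execution.
    if not approve(command):
        logger.warning("blocked (declined by reviewer): %r", command)
        return "blocked: human reviewer declined"

    # Log monitoring: record every command that is actually executed.
    logger.info("executing: %r", command)
    return run(command)


if __name__ == "__main__":
    # Stub callbacks so the example runs without executing anything real.
    always_yes = lambda cmd: True
    pretend_run = lambda cmd: f"(pretend output of {cmd})"
    print(guarded_execute("ls /tmp", always_yes, pretend_run))
    print(guarded_execute("rm -rf /", always_yes, pretend_run))
```

Passing the executor in as a callback keeps the gate independent of how commands are run, so the same check could sit in front of a sandboxed runner.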

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Logs are kept while the system runs.

Evaluation

Notable benchmark evaluations: GAIA (38%), AssistantBench (27.7%), and WebArena (32.8%) [source]

Bespoke testing: None

Safety: Have safety evaluations been conducted by the developers? What were the results? They report on ad-hoc evaluations of failures and safety concerns in the technical report [source]. The developers claim: “We performed testing for Responsible AI harm e.g., cross-domain prompt injection and all tests returned the expected results with no signs of jailbreak” [source]

Publicly reported external red-teaming or comparable auditing:

Personnel: Who were the red-teamers/auditors? None
Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? It was not explicitly designed to interoperate with any particular systems other than the web browser and filesystem, but it could presumably integrate with others with little configuration.

Usage statistics and patterns: Are there any notable observations about usage? Microsoft AutoGen has 36.9k stars and 5.3k forks [source]

Additional notes

None