The AI Scientist
Basic Information
Short description: An open-source AI agent designed to automate scientific research [source].
Intended uses: What does the developer state that the system is intended for?: Conducting end-to-end scientific research projects, including idea generation, experimental design, experimental execution, result analysis, and paper writing [source].
Date(s) deployed: August 16, 2024
Developer
Website: https://sakana.ai/
Legal name: Unknown (seemingly "Sakana AI")
Entity type: Unknown
Country (location of developer or first author's first affiliation): Japan [source]
Safety policies: What safety and/or responsibility policies are in place?: Unknown
System Components
Backend model(s): What model(s) are used to power the system?: Variable, including GPT-4o, GPT-4o-mini, o1 models, Llama 3, and Claude 3.5 Sonnet [source]
Public model specification: Is there formal documentation on the system’s intend...: None
Description of reasoning, planning, and memory implementation: How does the syst...: The agent conducts research in four stages: (1) idea generation, (2) experiment iteration, (3) paper writing, and (4) paper review. To begin, the agent is given a code template that implements a baseline experiment, a LaTeX template, and a plotting-file template. Each stage of the pipeline is completed by prompting the agent and providing it with a text state that summarizes the previous stages [source].
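The staged flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `PipelineState`, `run_pipeline`, `echo_llm`, and the stage identifiers are hypothetical names, and the prompt format is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage identifiers mirroring the four stages named in the text.
STAGES = ["idea_generation", "experiment_iteration", "paper_writing", "paper_review"]


@dataclass
class PipelineState:
    # Starting materials: baseline code, LaTeX, and plotting templates.
    templates: Dict[str, str]
    # Text state carried forward; one summary is appended per completed stage.
    summaries: List[str] = field(default_factory=list)


def run_pipeline(llm: Callable[[str], str], templates: Dict[str, str]) -> PipelineState:
    """Prompt the model once per stage, passing summaries of earlier stages."""
    state = PipelineState(templates=templates)
    for stage in STAGES:
        context = "\n".join(state.summaries)
        prompt = f"Stage: {stage}\nPrevious work:\n{context}"  # assumed prompt layout
        output = llm(prompt)
        state.summaries.append(f"[{stage}] {output}")
    return state


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call; reports how many summaries it was shown."""
    return f"saw {prompt.count('[')} earlier summaries"
```

Calling `run_pipeline(echo_llm, {"code": "baseline experiment"})` returns a state with four summaries, one per stage, where each later stage has seen the summaries of all earlier ones.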
Observation space: What is the system able to observe while 'thinking'?: The AI Scientist operates in an environment where it can see the output of executed code and has access to an archive of ideas.
Action space/tools: What direct actions can the system take?: Writing and executing code and recording ideas in a scratchpad.
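The observe-and-act loop implied by the two entries above (execute code, read its output, record ideas) might look like the sketch below. The function names `execute_and_observe` and `record_idea` are hypothetical, and the timeout value is an assumption. A child process is not a sandbox, so this does not replace an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def execute_and_observe(code: str, timeout_s: float = 30.0) -> str:
    """Run generated Python code in a child process and return its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated_script.py"  # hypothetical file name
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return "TIMEOUT"
        # The combined stdout and stderr is what the agent gets to observe.
        return result.stdout + result.stderr


def record_idea(scratchpad: List[str], idea: str) -> None:
    """Append an idea to an in-memory scratchpad (a plain list in this sketch)."""
    scratchpad.append(idea)
```

The string returned by `execute_and_observe` would then be included in the next prompt, so the model can react to errors or results.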
User interface: How do users interact with the system?: The user simply runs scripts that produce and save output files.
Development cost and compute: What is known about the development costs?: Unknown. However, the developers report a cost of about $15 per generated paper.
Guardrails & Oversight
Accessibility of components
Weights: Are model parameters available?: N/A; can use various backend models
Data: Is data available?: N/A; can use various backend models
Code: Is code available?: Open source [source]
Documentation: Is documentation available?: Available [source]
Scaffolding: Is system scaffolding available?: Available [source]
Controls and guardrails: What notable methods are used to protect against harmfu...: Developers recommend running the system in a sandboxed virtual environment.
Monitoring and shutdown procedures: Are there any notable methods or protocols t...: Depends on the environment in which the agent is run.
Customer and usage restrictions: Are there know-your-customer measures or other ...: None
Evaluation
Notable benchmark evaluations (e.g., on SWE-Bench Verified): Unknown
Bespoke testing (e.g., demos): The authors evaluate the quality of papers written by The AI Scientist using its automated reviewer and manual inspection.
Safety: Have safety evaluations been conducted by the developers? What were the ...: The technical report contains a "Limitations and Ethical Considerations" section that discusses safety implications. The authors state, "it could be explicitly be deployed to conduct unethical research, or even lead to unintended harm if The AI Scientist conducts unsafe research. Concretely, if it were encouraged to find novel, interesting biological materials and given access to 'cloud labs' (Arnold, 2022) where robots perform wet lab biology experiments, it could (without its overseer’s intent) create new, dangerous viruses or poisons that harm people before we can intervene."
Publicly reported external red-teaming or comparable auditing
Personnel: Who were the red-teamers/auditors?: None
Scope, scale, access, and methods: What access did red-teamers/auditors have and...: None
Findings: What did the red-teamers/auditors conclude?: None
Ecosystem
Interoperability with other systems: What tools or integrations are available?: The AI Scientist uses Aider [source], an LLM-based coding assistant, to write experimental code.
Usage statistics and patterns: Are there any notable observations about usage?: The GitHub repository for The AI Scientist has 1.2k forks and 8.4k stars [source].
Other notes (if any): --