The AI Agent Index

Documenting the technical and safety features of deployed agentic AI systems

OpenAI o3


Basic information

Website: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=1s (announcement video)

Short description: A series of large language models representing OpenAI's “next frontier model” after o1 [source]

Intended uses: What does the developer say it’s for? o3 is intended for use in any domain that requires long-form reasoning. The developers draw particular attention to its ability to power software-development agents and to solve difficult general reasoning tasks such as competitive math and coding [source].

Date(s) deployed: Not yet externally deployed. First announced on December 20, 2024 [source]


Developer

Website: https://www.openai.com/

Legal name: OpenAI Inc. (parent company); OpenAI Global, LLC (for-profit subsidiary) [source]

Entity type: The structure of OpenAI is complex. It has a parent 501(c)(3) non-profit with a for-profit subsidiary [source]. OpenAI is restructuring its business into a for-profit public benefit corporation that will no longer be controlled by its non-profit board [source]

Country (location of developer or first author’s first affiliation): Incorporation: Delaware, USA (OPENAI GLOBAL, LLC (7208772)) [source]. HQ: California, USA

Safety policies: What safety and/or responsibility policies are in place? The charter of the OpenAI non-profit entity states: “We are committed to doing the research required to make AGI safe, and to driving the broad adoption of such research across the AI community” [source]. OpenAI also maintains a “Preparedness Framework” for tracking the dangerous capabilities of the models it develops, which states that “Only models with a post-mitigation score of ‘medium’ or below can be deployed” [source]


System components

Backend model: What model(s) are used to power the system? Unknown; however, if o3 is similar to o1, the backend model is specifically trained with “reinforcement learning to perform complex reasoning” [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? A public, general Model Spec applies to all OpenAI models [source]

Reasoning, planning, and memory implementation: How does the system ‘think’? Unknown

Observation space: What is the system able to observe while ‘thinking’? Textual inputs from users.

Action space/tools: What direct actions can the system take? Generating natural-language text (including code).

User interface: How do users interact with the system? A public user interface has not yet been released; it will most likely follow previous OpenAI models with a ChatGPT interface and an API.
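
If o3 follows the API pattern of earlier OpenAI reasoning models such as o1, usage might resemble the sketch below. This is a hypothetical illustration only: the model identifier "o3" is an assumption, and no public API for o3 exists at the time of writing.

```python
# Hypothetical sketch, not an official example: assumes o3 ships under the
# existing OpenAI Chat Completions API with an assumed model identifier "o3".
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3",  # assumed name, following the "o1" naming convention
    messages=[{"role": "user", "content": "Outline a test plan for a JSON parser."}],
)
print(response.choices[0].message.content)
```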

Development cost and compute: What is known about the development costs? Unknown


Guardrails and oversight

Accessibility of components:

  • Weights: Are model parameters available? Closed source
  • Data: Is data available? Closed source
  • Code: Is code available? Closed source
  • Scaffolding: Is system scaffolding available? Closed source
  • Documentation: Is documentation available? N/A (not yet released to public)

Controls and guardrails: What notable methods are used to protect against harmful actions? OpenAI released a new alignment method known as “deliberative alignment,” which uses a model’s reasoning ability to identify and refuse harmful queries. In the resulting paper, they test this method on o3-mini [source]
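
As a rough illustration of the pattern (not OpenAI's implementation, which trains the safety reasoning into the model itself rather than relying on prompting), a query could be wrapped so that the model first reasons over a written safety specification before answering. Everything in this sketch, including the toy specification, is an assumption for illustration.

```python
# Toy sketch of the deliberative-alignment *pattern*: the model is asked to
# reason explicitly over a safety spec before answering. OpenAI's actual
# method trains this behavior into the model; this prompt-only version is an
# illustrative assumption, not their implementation.
SAFETY_SPEC = """\
1. Refuse requests that meaningfully facilitate serious harm.
2. Comply with benign requests, even if unusually phrased.
3. When refusing, briefly state the reason."""

def deliberative_prompt(user_query: str) -> str:
    """Wrap a user query so the model reasons about the spec before answering."""
    return (
        "Before answering, reason step by step about whether the request "
        f"below complies with this safety specification:\n{SAFETY_SPEC}\n\n"
        f"Request: {user_query}\n"
        "Write your compliance reasoning first, then a final answer or refusal."
    )

print(deliberative_prompt("How do I choose a strong passphrase?"))
```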

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? N/A (model not yet released to the public)

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Unknown


Evaluation

Notable benchmark evaluations: Using high test-time compute settings, o3 achieved 25.2% on FrontierMath and 87.5% on the ARC-AGI semi-private evaluation set (75.7% at low test-time compute). It also achieved a Codeforces Elo rating of 2727 and 71.7% on SWE-bench Verified. For more benchmarks, and results for the related smaller o3-mini model, see [source].

Bespoke testing: The release video includes a demo in which o3-mini is asked to write the code for a website that can submit coding questions to o3-mini and then save and run the resulting code. The presenters then use this website to have o3-mini evaluate itself on GPQA [source].
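
In outline, the demo amounts to a simple self-evaluation harness: the model answers benchmark questions, and the answers are scored automatically. The sketch below is an assumed reconstruction of that pattern; the model identifier, the helper names, and the substring-match scoring rule are all hypothetical.

```python
# Assumed reconstruction of the self-evaluation pattern from the demo; the
# model identifier "o3-mini" and the scoring rule are hypothetical.
from openai import OpenAI

client = OpenAI()

def ask(model: str, question: str) -> str:
    """Send one benchmark question to the model and return its answer text."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content

def evaluate(model: str, items: list[tuple[str, str]]) -> float:
    """Return the fraction of questions whose answer contains the expected key."""
    hits = sum(expected.lower() in ask(model, q).lower() for q, expected in items)
    return hits / len(items)
```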

Safety: Have safety evaluations been conducted by the developers? What were the results? Some safety evaluations have been run on the related o3-mini model; see Table 1 of [source].

Publicly reported external red-teaming or comparable auditing:

  • Personnel: Who were the red-teamers/auditors? No publicly known audits have been conducted so far, although OpenAI is releasing the model early to safety and security researchers for additional testing [source].
  • Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? N/A; assessments are in progress.
  • Findings: What did the red-teamers/auditors conclude? Results not yet released.


Ecosystem information

Interoperability with other systems: What tools or integrations are available? None yet; the model has not been publicly released.

Usage statistics and patterns: Are there any notable observations about usage? Unknown


Additional notes

None