The AI Agent Index

Documenting the technical and safety features of deployed agentic AI systems

Weco


Basic information

Website: https://web.archive.org/web/20241127010507/https://www.weco.ai/

Short description: An AI data science agent: AIDE designs pipelines for data analysis by generating code and producing models to analyze data [source]

Intended uses: What does the developer say it’s for? Weco “generates code for data preprocessing as well as model training, inference, and evaluation…The current alpha version of AIDE primarily targets tabular data tasks that can be solved with CPUs.” [source]

Date(s) deployed: April 4, 2024 [source]


Developer

Website: https://web.archive.org/web/20241127010507/https://www.weco.ai/

Legal name: WECO AI LTD [source] [source]

Entity type: Private limited company (UK) [source]

Country (location of developer or first author’s first affiliation): Incorporation: UK [source]. HQ: London [source]

Safety policies: What safety and/or responsibility policies are in place? Unknown


System components

Backend model: What model(s) are used to power the system? Variable, including OpenAI or Anthropic models [source]

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? “Solution Space Tree Search”: the system (1) proposes new solutions or modifies existing ones, (2) evaluates each solution by running it and scoring the results, and (3) selects the most promising solution and begins another round of iteration/refinement [source]. It uses a ‘journal’ structure that stores the generated code samples, the tree structure linking them, the results of code execution, and the evaluation metrics [source].
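The propose/evaluate/select loop and the journal described above can be illustrated with a minimal Python sketch. This is not Weco's implementation: the names `Node`, `Journal`, `tree_search`, `toy_propose`, and `toy_evaluate` are hypothetical, the greedy "refine the best node so far" rule is an assumed simplification, and the LLM call and code execution are replaced by toy stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One journal entry: a code sample, its parent, and its results."""
    code: str
    parent: Optional[int]          # index of the parent node, None for a fresh draft
    metric: Optional[float] = None # evaluation score (higher is better here)
    output: str = ""               # captured result of running the code


@dataclass
class Journal:
    """Hypothetical journal: a flat list of nodes whose parent links form a tree."""
    nodes: List[Node] = field(default_factory=list)

    def add(self, node: Node) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def best(self) -> Optional[int]:
        scored = [(n.metric, i) for i, n in enumerate(self.nodes) if n.metric is not None]
        return max(scored)[1] if scored else None


def tree_search(
    propose: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[float, str]],
    steps: int,
) -> Journal:
    """Assumed greedy variant: always refine the best-scoring node so far."""
    journal = Journal()
    for _ in range(steps):
        parent = journal.best()                         # (3) select most promising
        parent_code = journal.nodes[parent].code if parent is not None else None
        code = propose(parent_code)                     # (1) propose or modify
        metric, output = evaluate(code)                 # (2) run and score
        journal.add(Node(code, parent, metric, output))
    return journal


# Toy stand-ins (assumptions): the "code" is just an integer written as a string,
# and the score peaks when that integer equals 5.
def toy_propose(parent_code: Optional[str]) -> str:
    return "0" if parent_code is None else str(int(parent_code) + 1)


def toy_evaluate(code: str) -> Tuple[float, str]:
    value = int(code)
    return -float((value - 5) ** 2), f"ran candidate {value}"


if __name__ == "__main__":
    journal = tree_search(toy_propose, toy_evaluate, steps=8)
    best = journal.nodes[journal.best()]
    print(best.code, best.metric)
```

In a real agent, `propose` would prompt a backend model with the parent solution and `evaluate` would execute the generated script in the workspace and parse a validation metric; both are left abstract here because the source does not specify them.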

Observation space: What is the system able to observe while ‘thinking’? Maintains a workspace with all of the files and data generated by the AI agent [source]

Action space/tools: What direct actions can the system take? Writes and executes code; has access to a Python interpreter and a directory for storing logs [source]

User interface: How do users interact with the system? The user can monitor the agent’s logs and the forming solution tree [source]

Development cost and compute: What is known about the development costs? Unknown


Guardrails and oversight

Accessibility of components:

  • Weights: Are model parameters available? N/A; the system can use various backend models
  • Data: Is data available? N/A; the system can use various backend models
  • Code: Is code available? Available [source]
  • Scaffolding: Is system scaffolding available? Open source [source]
  • Documentation: Is documentation available? Available [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? Depends on what guardrails are implemented in a specific configuration

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? Depends on what is implemented in a specific configuration


Evaluation

Notable benchmark evaluations: On MLE-Bench, “OpenAI’s o1-preview with AIDE scaffolding — achieves at least the level of a Kaggle bronze medal in 16.9% of competitions” [source], which was the best reported score. OpenAI used Weco AI’s open-source scaffolding for its benchmarking.

Bespoke testing: Several sample results are available [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

  • Personnel: Who were the red-teamers/auditors? None
  • Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
  • Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? None

Usage statistics and patterns: Are there any notable observations about usage? Unknown


Additional notes

None