The AI Agent Index

Documenting the technical and safety features of deployed agentic AI systems

data-to-paper


Basic information

Website: https://web.archive.org/web/20241119044950/https://github.com/Technion-Kishony-lab/data-to-paper?tab=readme-ov-file

Short description: A research automation platform that carries out a stepwise research process: it can design research plans, raise hypotheses, write and debug analysis code, and produce complete, information-traceable papers [source].

Intended uses: What does the developer say it’s for? Research automation, copiloting, and acceleration, while supporting transparency and traceability of decisions in the automated research process. Supports both autonomous goal-search and fixed-goal specification.

Date(s) deployed: Posted to arXiv on April 24, 2024 [source]


Developer

Website: https://web.archive.org/web/20241221170344/https://kishony.technion.ac.il/

Legal name: Technion – Israel Institute of Technology (et al.) [source]

Entity type: Academic Institution, Industry Organization

Country (location of developer or first author’s first affiliation): Israel [source]

Safety policies: What safety and/or responsibility policies are in place? None


System components

Backend model: What model(s) are used to power the system? Multiple LLMs were used and experimented with; the developers found that open-source models (the Llama 2 family and CodeLlama-34b) hallucinated far more than OpenAI models (GPT-3.5-turbo, GPT-4).

Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them? None

Reasoning, planning, and memory implementation: How does the system ‘think’? The system walks through the data-research process step by step, through a series of predefined research steps and tool-use calls. Steps are occasionally vetted with external LLM review, and the calls are broken up sufficiently to increase the modularity (and hence reliability) of the individual steps.
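The stepwise, reviewed pipeline described above can be sketched as follows. This is a minimal illustration under assumed names (`Step`, `run_pipeline`, and the stub steps are hypothetical, not data-to-paper's actual code): each research stage is a small, self-contained step whose output can be vetted by a reviewer before the next step runs.

```python
# Minimal sketch of a stepwise research pipeline with per-step review.
# All names are illustrative; real stages would call LLM backends.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], str]                        # produce this stage's output from prior context
    review: Callable[[str], bool] = lambda out: True  # stand-in for external LLM review

def run_pipeline(steps: list[Step], max_retries: int = 2) -> dict:
    """Execute steps in order; re-run a step if its reviewer rejects the output."""
    context: dict[str, str] = {}
    for step in steps:
        for _attempt in range(max_retries + 1):
            output = step.run(context)
            if step.review(output):
                context[step.name] = output   # later steps see only vetted outputs
                break
        else:
            raise RuntimeError(f"step {step.name!r} failed review")
    return context

# Stub steps standing in for LLM-backed stages:
steps = [
    Step("goal", run=lambda ctx: "test association between X and Y"),
    Step("analysis_code", run=lambda ctx: f"# code for: {ctx['goal']}",
         review=lambda out: out.startswith("#")),
    Step("paper_section", run=lambda ctx: f"Results based on {ctx['goal']}"),
]

result = run_pipeline(steps)
```

Breaking the process into small vetted steps is what makes each call easy to review in isolation, which is the modularity-for-reliability trade the description points to.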

Observation space: What is the system able to observe while ‘thinking’? Depending on the stage of the research-generation process, the system can observe facts about the data, the generated research goal, analysis code, and sections of the generated report. In general, data-to-paper observes the union of the data and provided descriptions, plus a subset of previously generated output. See Figure 1B [source]

Action space/tools: What direct actions can the system take? Writing and executing analysis code, exploring data and metadata, searching the literature through the Semantic Scholar API, and writing and compiling hyperlinked LaTeX papers section-by-section.
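The literature-search tool relies on the public Semantic Scholar Graph API, whose paper-search endpoint accepts `query`, `limit`, and `fields` parameters. A hedged sketch (the function names here are illustrative, not data-to-paper's own code; URL construction is separated out so it can be checked without network access):

```python
# Illustrative wrapper around the Semantic Scholar paper-search endpoint.
import json
import urllib.request
from urllib.parse import urlencode

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 5,
                     fields: str = "title,abstract,year") -> str:
    """Construct the search URL; kept separate so it is testable offline."""
    return f"{S2_SEARCH}?{urlencode({'query': query, 'limit': limit, 'fields': fields})}"

def search_papers(query: str, limit: int = 5) -> list[dict]:
    """Fetch matching papers (requires network access)."""
    with urllib.request.urlopen(build_search_url(query, limit)) as resp:
        return json.load(resp).get("data", [])

url = build_search_url("antibiotic resistance evolution", limit=3)
```

Because the tool is just an HTTP search call behind a small interface like this, it is straightforward to swap in another literature-search backend, as the Ecosystem section notes.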

User interface: How do users interact with the system? Users can provide data, a data description, and optionally a fixed analysis goal for the system to pursue. There is also a GUI app that allows users to run the system in a copilot mode [source]

Development cost and compute: What is known about the development costs? Unknown


Guardrails and oversight

Accessibility of components:

  • Weights: Are model parameters available? N/A; the system relies on external model(s) accessed via API
  • Data: Is data available? N/A; the system relies on external model(s) accessed via API
  • Code: Is code available? Available [source]
  • Scaffolding: Is system scaffolding available? Available [source]
  • Documentation: Is documentation available? Available [source]

Controls and guardrails: What notable methods are used to protect against harmful actions? The system is sufficiently scoped and modular that harm seems implausible, although no specific safety measures are taken.

Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers? None

Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully? None


Evaluation

Notable benchmark evaluations: None

Bespoke testing: Demo papers generated by the system are available at [source]

Safety: Have safety evaluations been conducted by the developers? What were the results? None

Publicly reported external red-teaming or comparable auditing:

  • Personnel: Who were the red-teamers/auditors? None
  • Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take? None
  • Findings: What did the red-teamers/auditors conclude? None

Ecosystem information

Interoperability with other systems: What tools or integrations are available? The system uses the Semantic Scholar API for literature search; this component is interchangeable with other literature-search-engine APIs.

Usage statistics and patterns: Are there any notable observations about usage? GitHub repo has 50 forks and 489 stars [source]


Additional notes

None