Paper & Citation
Status: The benchmark paper is in preparation.
This page will be updated with the arXiv preprint link and the full citation once the preprint is published.
Benchmark Overview
| Field | Details |
| --- | --- |
| Title | ASI-Bench: A Project-level Benchmark for Evaluating LLM Agents on AI for Science |
| Authors | (In preparation) |
| Venue | arXiv preprint (forthcoming) |
| Repository | github.com/zjw49246/Agent-AI4Sci-Bench |
| Website | zjw49246.github.io/Agent-AI4Sci-Bench |
Provisional Citation
If you use this benchmark before the paper is published, please cite the repository:
@misc{agentai4scibench2026,
  title        = {ASI-Bench: A Project-level Benchmark for
                  Evaluating LLM Agents on AI for Science},
  author       = {ASI-Bench Contributors},
  year         = {2026},
  howpublished = {\url{https://github.com/zjw49246/Agent-AI4Sci-Bench}},
  note         = {Paper in preparation. Check the repository for updates.}
}
Citation will be updated
Once the arXiv preprint is published, this block will be replaced with the full BibTeX entry including author list, eprint ID, and venue information.
Key Contributions
The paper presents:
- Project-level AI4Sci tasks — compact but realistic computational science workflows requiring multi-step reasoning, code generation, and artifact production
- B1-B4 autonomy ladder — a systematic way to measure how much scaffolding an agent needs to solve the same scientific problem
- Auditable evaluation — structured artifacts and provenance enabling reproducible, reviewed leaderboard entries
- Multi-domain coverage — 8 scientific domains with 42+ public tasks spanning math, physics, chemistry, astronomy, biology, materials, earth science, and engineering