AI for Science agent evaluation
ASI-Bench
A project-level benchmark for evaluating AI4Sci agents on realistic scientific workflows, scoreable artifacts, and auditable provenance.
Model progress
AI Progress on ASI-Bench
Overall score by model on the public test set. More model evaluations coming soon.
Verified review data
Leaderboard Snapshot
Until the official baseline snapshot is finalized, the rows below are derived from the repository's review data and show overall score, B1-B4 breakdowns, run metadata, and trust labels.
Scientific breadth
8 Domains, 42 Public Tasks
ASI-Bench covers a wide spectrum of computational science workflows.
Task registry
Featured Public Tasks
The cards below are selected from the generated public catalog and spread across domains when possible.
Get started in 60 seconds
Quick Start
Three steps from zero to your first benchmark score.
git clone https://github.com/zjw49246/Agent-AI4Sci-Bench.git
cd Agent-AI4Sci-Bench && uv sync && cp .env.example .env
Clone the repository, install dependencies, and copy the example environment file. Then add your API keys to .env.
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent direct_llm --agent-config '{"model":"claude-sonnet-4-20250514"}' \
--sandbox task --prompt-levels b1
Run one task with a direct LLM agent on the easiest prompt level.
ai4sci-bench report
View your score breakdown and result artifacts.
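The run command above embeds a JSON string in --agent-config, which is easy to mis-quote when scripted. The following is a minimal Python sketch that builds the same argument list using only the flags and values shown in the Quick Start. The helper name build_run_command is hypothetical and not part of the benchmark; the sketch only constructs and prints the command and does not run it.

```python
import json
import shlex


def build_run_command(task, model, agent="direct_llm", sandbox="task", prompt_level="b1"):
    """Return the argv list for the Quick Start `ai4sci-bench run` command.

    Hypothetical helper for illustration; only flags shown in the
    Quick Start are used.
    """
    # json.dumps produces the compact JSON string passed to --agent-config,
    # so no manual quote escaping is needed.
    agent_config = json.dumps({"model": model}, separators=(",", ":"))
    return [
        "ai4sci-bench", "run",
        "--tasks", task,
        "--agent", agent,
        "--agent-config", agent_config,
        "--sandbox", sandbox,
        "--prompt-levels", prompt_level,
    ]


argv = build_run_command("physics.sod_shock_tube", "claude-sonnet-4-20250514")

# shlex.join renders a single shell line with safe quoting around the JSON.
print(shlex.join(argv))
```

Once the benchmark is installed, the same list could be passed to subprocess.run(argv, check=True), which avoids shell quoting altogether.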
Evaluation pipeline
From Task to Auditable Result
ASI-Bench combines scientific tasks, B1-B4 prompt levels, agent harnesses, and reproducible scoring into one auditable evaluation stack.
Scientific workflow task
Parameterized AI4Sci task with data, expected artifacts, runtime requirements, and scoring rules.
B1-B4 prompt level
Same scientific goal under decreasing guidance, from execution support to autonomous problem solving.
Agent / harness + model
CLI agent, scaffold, or direct baseline paired with a specific model and run configuration.
Sandbox, scorer, and report
Structured artifacts become task scores, B-level breakdowns, and reviewed leaderboard entries.
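As a rough illustration of the final stage, the sketch below rolls per-task scores up into a B-level breakdown and an overall score. The record fields (task, level, score), the sample values, the second task name, and the unweighted mean are all assumptions made for illustration; this page does not specify ASI-Bench's actual scoring formula or report schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; field names and score values are illustrative only.
records = [
    {"task": "physics.sod_shock_tube", "level": "b1", "score": 0.5},
    {"task": "physics.sod_shock_tube", "level": "b2", "score": 0.25},
    {"task": "example.other_task", "level": "b1", "score": 1.0},
    {"task": "example.other_task", "level": "b2", "score": 0.75},
]


def breakdown_by_level(records):
    """Group task scores by prompt level and average each group."""
    by_level = defaultdict(list)
    for record in records:
        by_level[record["level"]].append(record["score"])
    return {level: mean(scores) for level, scores in sorted(by_level.items())}


def overall_score(level_means):
    """Average the per-level means (an assumed, unweighted aggregation)."""
    return mean(list(level_means.values()))


level_means = breakdown_by_level(records)
print(level_means)
print(overall_score(level_means))
```

Averaging per level before averaging across levels keeps each prompt level equally weighted even when levels have different numbers of completed tasks. A real scorer might weight tasks or domains differently.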
Why it matters
What Makes This Benchmark Different
Project-level workflows
Tasks require data inspection, method selection, implementation, and debugging, and must produce objectively scoreable artifacts rather than a single short answer.
AI for Science domains
The public catalog spans math, physics, chemistry, astronomy, materials, and engineering-style scientific workflows.
B1-B4 autonomy ladder
The same scientific goal is evaluated under decreasing guidance, revealing how much scaffolding each agent needs.
Auditable evaluation
Runs produce structured artifacts and provenance that can power reviewed leaderboard entries instead of self-reported scores.
Updates
Latest News
- 2026-04-25 Public website data snapshot generated with 22 public tasks across 6 domains.
- 2026-04-22 Homepage, catalog, and leaderboard became data-driven from generated benchmark JSON.
- 2026-04-22 Benchmark figures added: domain coverage, B1-B4 prompt ladder, and evaluation workflow.