
Overview

ASI-Bench evaluates LLM agents on compact but realistic AI for Science projects. This page explains what the benchmark measures, why the B1-B4 prompt ladder matters, and how results are intended to become auditable leaderboard entries.

Why This Benchmark

Many scientific benchmark tasks are either short-form question answering or isolated code-generation problems. ASI-Bench targets a different setting: project-level scientific workflows that require an agent to reason through a small but complete computational project.

The benchmark is designed to test whether an agent can:

  • understand a scientific objective
  • inspect provided data and decide what matters
  • choose an appropriate method
  • implement and debug the workflow
  • produce output artifacts that can be checked objectively

This makes the benchmark closer to the work pattern of a scientific assistant than to a single-prompt exam.

What Project-Level Means

In this benchmark, a task is not only a prompt. A task includes metadata, runtime requirements, data or instance generation, expected artifacts, scoring rules, and public-safe summaries for website rendering.
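As a rough illustration of a task being more than a prompt, the sketch below bundles the pieces listed above into one record. The class name `TaskSpec`, its field names, and the helper `missing_artifacts` are hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for one project-level task (not the official schema)."""

    task_id: str
    title: str
    domain: str
    runtime_requirements: dict[str, str]  # e.g. {"python": "3.11"}
    expected_artifacts: tuple[str, ...]   # file names a run is expected to produce
    scoring_rule: str                     # short human-readable description
    public_summary: str                   # text that is safe to render on the website


def missing_artifacts(spec: TaskSpec, produced: set[str]) -> list[str]:
    """Return expected artifacts the run did not produce, in declared order."""
    return [name for name in spec.expected_artifacts if name not in produced]


# Illustrative values only; none of these come from a real ASI-Bench task.
example = TaskSpec(
    task_id="demo-001",
    title="Fit a decay constant",
    domain="physics",
    runtime_requirements={"python": "3.11"},
    expected_artifacts=("answer.json", "fit.png", "solution.py"),
    scoring_rule="relative error of the fitted constant",
    public_summary="Estimate a decay constant from noisy measurements.",
)
still_missing = missing_artifacts(example, {"answer.json", "solution.py"})
print(still_missing)
```

Checking produced files against a declared artifact list is one simple way to make "a successful run leaves evidence" mechanically testable.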

A successful agent run should leave behind evidence:

  • generated data files or figures
  • implementation code
  • structured answer artifacts
  • scoreable outputs
  • run metadata and provenance

That evidence is important because the website is intended to support official leaderboard results, not just informal model comparisons.

| Benchmark family | Typical focus | ASI-Bench difference |
|---|---|---|
| QA-style science benchmarks | answer correctness | evaluates multi-step scientific workflows and artifacts |
| coding benchmarks | code generation or issue repair | adds scientific method selection and domain interpretation |
| ScienceAgentBench / AstaBench | broad science-agent evaluation | emphasizes project-level tasks with B1-B4 prompt-level control |
| SkillsBench-style agent benchmarks | agent/tool skill use | uses scientific tasks and reproducible scoring as the primary surface |

The benchmark therefore sits between narrow coding tests and unconstrained open-ended research.

B1-B4 Prompt Levels

ASI-Bench evaluates the same scientific objective under four prompt levels:

  • B1: strongest guidance, focused on execution.
  • B2: partial guidance, focused on domain understanding plus execution.
  • B3: minimal guidance, focused on autonomous scientific problem solving.
  • B4: B3 plus distractor content, focused on prioritization and robustness.

The goal is to measure how much scaffolding an agent needs before it can solve a scientific workflow reliably.
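One way to turn "how much scaffolding an agent needs" into a number is to find the least-guided level an agent clears without failing any more-guided level first. The sketch below assumes per-level scores in [0, 1] and a pass threshold of 0.8. Both are illustrative assumptions, and `PromptLevel` and `least_guided_level_passed` are hypothetical names rather than part of the benchmark.

```python
from enum import IntEnum
from typing import Optional


class PromptLevel(IntEnum):
    B1 = 1  # strongest guidance
    B2 = 2  # partial guidance
    B3 = 3  # minimal guidance
    B4 = 4  # B3 plus distractor content


def least_guided_level_passed(
    scores: dict[PromptLevel, float], threshold: float = 0.8
) -> Optional[PromptLevel]:
    """Walk from B1 to B4 and stop at the first level below the threshold.

    Returns the last level passed before that point, or None if B1 already fails.
    A missing level counts as a failure.
    """
    passed = None
    for level in sorted(PromptLevel):
        if scores.get(level, 0.0) < threshold:
            break
        passed = level
    return passed


# Illustrative scores only: B4 is above the threshold, but B3 fails first.
demo_scores = {
    PromptLevel.B1: 0.95,
    PromptLevel.B2: 0.85,
    PromptLevel.B3: 0.60,
    PromptLevel.B4: 0.90,
}
print(least_guided_level_passed(demo_scores))
```

Requiring every earlier level to pass keeps the summary conservative: a high B4 score does not count if the agent already failed at B3.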

B1-B4 Prompt Ladder

Same task goal, same data, same evaluation; guidance decreases as autonomy increases.

  • B1 (Execution-focused): strongest guidance, mostly specified method, reliable implementation.
  • B2 (Partial guidance): method hints remain, parameters loosen, domain understanding matters.
  • B3 (Minimal guidance): task goal stays fixed while method choice becomes the challenge.
  • B4 (Distractor-aware): B3 plus irrelevant context, testing focus and robustness.

Guidance decreases from B1 to B4 while autonomy and prioritization demands increase.

Task Lifecycle

The public website is generated from structured benchmark metadata. In the long term, each task page should expose:

  • task title and identifier
  • domain and subdomain
  • public summary
  • expected output types
  • high-level evaluation summary
  • runtime and sandbox notes
  • safe prompt excerpt

The current public catalog is intentionally scoped to tasks that are ready to be shown on the benchmark site.

Evaluation Workflow

Figure: Benchmark workflow from task definition to scoreable outputs and website-ready reporting artifacts.

At a high level, the evaluation loop is:

  1. Read the task prompt and inspect the provided data.
  2. Decide on an appropriate scientific or computational method.
  3. Implement the solution and generate required artifacts.
  4. Produce structured outputs such as data files, figures, and code.
  5. Score the result against benchmark evaluation rules.
  6. Summarize the run in reporting artifacts that can power the public website.
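The scoring step (step 5) can be sketched as a comparison between a structured answer artifact and ground truth. The version below is a minimal sketch that assumes numeric answers keyed by field name and a relative tolerance of 1e-3. The function name `score_answer` and the fraction-of-fields metric are assumptions, since real tasks define their own scoring rules.

```python
import math


def score_answer(
    answer: dict[str, float], truth: dict[str, float], rel_tol: float = 1e-3
) -> float:
    """Return the fraction of ground-truth fields matched within a relative tolerance.

    Fields missing from the answer count as misses; extra fields are ignored.
    """
    if not truth:
        raise ValueError("ground truth must contain at least one field")
    hits = sum(
        1
        for key, expected in truth.items()
        if key in answer and math.isclose(answer[key], expected, rel_tol=rel_tol)
    )
    return hits / len(truth)


# Illustrative values: the first field is within tolerance, the second is not.
demo_score = score_answer(
    answer={"slope": 1.0005, "intercept": 2.5},
    truth={"slope": 1.0, "intercept": 2.0},
)
print(demo_score)
```

Scoring against a fixed set of named fields is what lets the result be checked objectively, independent of how the agent arrived at it.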

Examples of result artifacts include:

  • run_metadata.json for run-level provenance
  • per-instance result JSON files
  • batch_overview.json
  • task_scoreboard.json
  • task_level_long.json
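To make run-level provenance concrete, the sketch below writes a small `run_metadata.json`. Only the file name comes from the list above. The field names, the prompt hash, and the function `write_run_metadata` are assumptions for illustration, not the benchmark's real schema.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_metadata(
    out_dir: Path, model: str, task_id: str, prompt_level: str, prompt_text: str
) -> Path:
    """Write a hypothetical provenance record and return the path of the file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "model": model,
        "task_id": task_id,
        "prompt_level": prompt_level,
        # Hashing the prompt lets a reviewer confirm which prompt text was used
        # without the record itself exposing the prompt.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    path = out_dir / "run_metadata.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Demo in a temporary directory; all values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    written = write_run_metadata(
        Path(tmp),
        model="example-model",
        task_id="demo-001",
        prompt_level="B3",
        prompt_text="Fit the data.",
    )
    written_name = written.name
    loaded = json.loads(written.read_text(encoding="utf-8"))
print(sorted(loaded))
```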

Reproducibility and Contamination Control

The benchmark design emphasizes:

  • sandboxed execution modes
  • explicit runtime requirements
  • structured output contracts
  • parameterized ground-truth generation
  • provenance attached to saved results

Parameterized generation helps reduce contamination risk and makes it easier to produce multiple scoreable instances without hand-authoring every case.
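A minimal sketch of parameterized generation, assuming a toy linear-fit task: a seed determines both the data shown to the agent and the hidden ground truth, so new instances can be produced on demand. The function `make_instance`, the parameter ranges, and the noise level are all illustrative assumptions.

```python
import random


def make_instance(
    seed: int, n_points: int = 20
) -> tuple[list[tuple[float, float]], dict[str, float]]:
    """Generate noisy (x, y) data and the ground-truth parameters from a seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    slope = rng.uniform(0.5, 3.0)
    intercept = rng.uniform(-1.0, 1.0)
    data = []
    for i in range(n_points):
        x = i / (n_points - 1)  # evenly spaced on [0, 1]
        y = slope * x + intercept + rng.gauss(0.0, 0.01)
        data.append((x, y))
    truth = {"slope": slope, "intercept": intercept}
    return data, truth


data_a, truth_a = make_instance(seed=7)
data_b, truth_b = make_instance(seed=7)
data_c, truth_c = make_instance(seed=8)
print(truth_a == truth_b, truth_a == truth_c)
```

Because the same seed reproduces the same instance, a scorer can regenerate ground truth instead of storing it, while fresh seeds give instances that are unlikely to have appeared in any training corpus.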

Leaderboard Readiness

The official leaderboard should eventually show reviewed runs accepted by benchmark maintainers. Each official result should include:

  • agent or harness name
  • model name
  • overall score
  • B1-B4 breakdown
  • task coverage
  • benchmark version
  • evaluation date
  • source or trace links when available
  • trust label such as official, reproduced, community, or unverified

The trust label matters because official baselines and community submissions should not be mixed without clear provenance.
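The fields above could be captured in a record that rejects unknown trust labels and incomplete B1-B4 breakdowns. This is a sketch under stated assumptions: the class `LeaderboardEntry`, its field names, and the unweighted mean used for the overall score are hypothetical. Only the four trust labels and the B1-B4 levels come from the text.

```python
from dataclasses import dataclass

TRUST_LABELS = ("official", "reproduced", "community", "unverified")
LEVELS = ("B1", "B2", "B3", "B4")


@dataclass(frozen=True)
class LeaderboardEntry:
    """Hypothetical leaderboard record (not the official schema)."""

    agent: str
    model: str
    level_scores: dict[str, float]  # one score per prompt level
    tasks_covered: int
    benchmark_version: str
    evaluation_date: str            # ISO date string, e.g. "2025-01-31"
    trust_label: str

    def __post_init__(self) -> None:
        if self.trust_label not in TRUST_LABELS:
            raise ValueError(f"unknown trust label: {self.trust_label!r}")
        missing = [lv for lv in LEVELS if lv not in self.level_scores]
        if missing:
            raise ValueError(f"missing level scores: {missing}")

    @property
    def overall_score(self) -> float:
        # Assumption: unweighted mean over B1-B4; the real aggregation may differ.
        return sum(self.level_scores[lv] for lv in LEVELS) / len(LEVELS)


def official_only(entries: list[LeaderboardEntry]) -> list[LeaderboardEntry]:
    """Keep only entries labeled official, so they are not mixed with other runs."""
    return [e for e in entries if e.trust_label == "official"]


# Placeholder values for illustration only.
entry = LeaderboardEntry(
    agent="example-harness",
    model="example-model",
    level_scores={"B1": 0.9, "B2": 0.8, "B3": 0.6, "B4": 0.5},
    tasks_covered=12,
    benchmark_version="0.1",
    evaluation_date="2025-01-31",
    trust_label="community",
)
print(round(entry.overall_score, 3), len(official_only([entry])))
```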