FAQ¶
What kind of benchmark is this?¶
ASI-Bench is a project-level benchmark for evaluating LLM agents on AI for Science tasks.
What does project-level mean here?¶
Tasks are designed to resemble compact scientific workflows rather than isolated coding problems.
Why does the benchmark use B1-B4 prompt levels?¶
The prompt ladder is intended to separate execution ability from more autonomous scientific problem solving.
Why does the benchmark emphasize structured outputs?¶
Structured outputs make evaluation more reproducible and easier to compare across agents and runs.
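As a rough illustration of why structured outputs are easier to check objectively, the sketch below compares two flat dictionaries of results key by key, using a relative tolerance for numeric values. This is not ASI-Bench's evaluator: the function name `compare_outputs`, the tolerance, and the example field names are all hypothetical.

```python
import math


def compare_outputs(produced, expected, rel_tol=1e-6):
    """Return the sorted list of keys on which two flat result dicts disagree.

    Hypothetical helper for illustration only; the benchmark's real
    evaluation logic and output schema may differ.
    """
    mismatches = []
    for key in sorted(set(produced) | set(expected)):
        # A key missing on either side counts as a mismatch.
        if key not in produced or key not in expected:
            mismatches.append(key)
            continue
        a, b = produced[key], expected[key]
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
        )
        if both_numeric:
            # Numeric values are compared with a relative tolerance.
            if not math.isclose(a, b, rel_tol=rel_tol):
                mismatches.append(key)
        elif a != b:
            # Everything else must match exactly.
            mismatches.append(key)
    return mismatches
```

Because the comparison works on named fields rather than free text, the same check can be repeated across agents and runs without manual interpretation.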
What does project-level mean in practice?¶
In practice, it means tasks are designed to look more like compact scientific workflows than isolated coding prompts. An agent may need to inspect data, decide on a method, implement a solution, and produce artifacts that can be checked objectively.
Why are there B1-B4 prompt levels?¶
The prompt ladder helps measure how much scaffolding an agent needs. B1 provides the most guidance, while B3 and B4 test more autonomous scientific problem solving and robustness under distraction.
Why does the public website only show test tasks right now?¶
The website currently shows only the public test subset so that it does not overstate what is ready for public presentation. The internal repository can contain a larger task tree that is still evolving.
Are all tasks on the website fully released benchmark tasks?¶
Not necessarily. The public site is intentionally conservative, and the benchmark is still evolving. The public task catalog should be read as the current public-facing subset, not as the final, complete benchmark release.
How are results stored?¶
The framework stores structured result artifacts such as:
- run_metadata.json
- per-instance result JSON files
- aggregated batch_records outputs such as batch_overview.json, task_scoreboard.json, and task_level_long.json
These artifacts are what make leaderboard rendering and result inspection reproducible.
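For readers who want to inspect these artifacts directly, here is a minimal sketch that looks for the file names listed above under a results directory and parses whichever ones exist. The directory layout is an assumption (the files are searched for recursively), the helper names are hypothetical, and no particular JSON schema is assumed.

```python
import json
from pathlib import Path

# File names taken from this FAQ; where they live on disk is an assumption.
KNOWN_ARTIFACTS = (
    "run_metadata.json",
    "batch_overview.json",
    "task_scoreboard.json",
    "task_level_long.json",
)


def find_and_load(results_dir, name):
    """Parse the first file called `name` found under results_dir, or return None."""
    matches = sorted(Path(results_dir).rglob(name))
    if not matches:
        return None
    with matches[0].open(encoding="utf-8") as handle:
        return json.load(handle)


def load_result_artifacts(results_dir):
    """Map each known artifact name to its parsed JSON content (None if absent)."""
    return {name: find_and_load(results_dir, name) for name in KNOWN_ARTIFACTS}
```

Calling `load_result_artifacts("path/to/results")` returns a dictionary keyed by file name, so a missing artifact shows up as `None` instead of raising an error.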
Why is the public leaderboard not final yet?¶
The website can already render leaderboard data, and the current page may show a conservative, repo-verified review snapshot derived from checked repository records. That snapshot is useful for website review and demonstration, but a final official public baseline snapshot has not been published yet.
What is the difference between official and community results?¶
The intended distinction is:
- official leaderboard: reviewed and accepted benchmark runs
- community submissions: a future public submission layer, likely handled through a separate Hugging Face Space workflow
The benchmark site is being built to keep those two categories clearly separated.
Will the site eventually support public submissions?¶
Yes, but not in the first website phase. The current direction is to keep the initial site focused on benchmark presentation, official results, and documentation, and add a community submission flow later.
Why are some task detail pages still sparse?¶
Some public task pages are generated from structured metadata and selected prompt excerpts. This makes the site easier to maintain, but it also means some pages are still more functional than polished. That will improve as the public-facing task metadata and rendering rules mature.
Will this site eventually include an open submission flow?¶
Yes, but the intended plan is to keep the first website release focused on official benchmark content and add community submission infrastructure later.