Leaderboard
This section is the public home for ASI-Bench leaderboard data.
Entries marked repo-verified are review snapshots, not final official benchmark results.
Scope
The official leaderboard is intended to show benchmark runs that have been reviewed and accepted by the benchmark maintainers. Community submissions, if added later, should remain clearly separated from the official table.
At the moment, the site shows a conservative repo-verified review snapshot. This keeps the visual leaderboard useful while making clear that the data is not yet a final official baseline run.
Data Source
The final leaderboard pages are designed to consume generated reporting artifacts derived from benchmark results, especially:
- batch_overview.json
- task_scoreboard.json
- task_level_long.json
Until those full artifacts are available, this page consumes a generated review snapshot derived from task_review/round1/final_review_record.md.
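The fallback described above could be sketched roughly as follows. This is a minimal illustration, not the site's actual loader: the three artifact names come from this page, while the snapshot file name `review_snapshot.json`, the function `load_leaderboard_data`, and the mode labels are hypothetical, and the JSON schemas are not assumed at all.

```python
import json
from pathlib import Path

# Artifact names are taken from this page; everything else is an assumption.
FULL_ARTIFACTS = (
    "batch_overview.json",
    "task_scoreboard.json",
    "task_level_long.json",
)
# Hypothetical file name for the generated review snapshot.
SNAPSHOT_NAME = "review_snapshot.json"


def load_leaderboard_data(data_dir):
    """Return (mode, payload), preferring the full artifacts over the snapshot."""
    data_dir = Path(data_dir)
    paths = [data_dir / name for name in FULL_ARTIFACTS]

    # Use the full reporting artifacts only when all three are present.
    if all(p.is_file() for p in paths):
        payload = {p.name: json.loads(p.read_text(encoding="utf-8")) for p in paths}
        return "full_artifacts", payload

    # Otherwise fall back to the generated review snapshot, if there is one.
    snapshot = data_dir / SNAPSHOT_NAME
    if snapshot.is_file():
        return "review_snapshot", json.loads(snapshot.read_text(encoding="utf-8"))

    return "empty", {}
```

Returning the mode alongside the payload lets the page label the data it displays, so a review snapshot is never presented as a full result set.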
What Will Appear Here
Once official benchmark results are published, this section should answer four questions quickly:
- which agents or model setups were evaluated
- how they score overall
- how performance changes across B1-B4 at a glance
- how many tasks and runs are covered by the displayed snapshot
Core Metrics and Metadata
The default leaderboard should prioritize:
- overall score
- B1-B4 scores
- number of tasks covered
- model / method group metadata
Additional metadata such as external tool usage, sandbox mode, and result version can be shown when available.
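As a rough sketch of how one leaderboard row might carry these fields, consider the record below. The class name `LeaderboardRow`, its field names, and the helper `sort_rows` are hypothetical; the score scale and the descending sort (higher overall score ranks first) are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LeaderboardRow:
    """One displayed row; field names are illustrative, not a published schema."""
    agent: str                     # agent or model setup that was evaluated
    overall_score: float
    b_scores: Dict[str, float]     # keyed by "B1", "B2", "B3", "B4"
    tasks_covered: int
    method_group: str              # model / method group metadata
    # Optional metadata, shown only when available.
    external_tools: Optional[bool] = None
    sandbox_mode: Optional[str] = None
    result_version: Optional[str] = None


def sort_rows(rows: List[LeaderboardRow]) -> List[LeaderboardRow]:
    """Order rows by overall score, assuming a higher score is better."""
    return sorted(rows, key=lambda row: row.overall_score, reverse=True)
```

Keeping the optional metadata as `None` by default means a row without that information can still be rendered, with the extra columns simply left blank.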
Official vs Community Results
This site is designed around a clear separation:
- Official leaderboard: reviewed runs accepted by the benchmark maintainers
- Community submissions: a future layer that may later live behind a separate Hugging Face Space workflow
That separation matters because the benchmark is intended to support reproducible, auditable comparisons rather than a mixed feed of self-reported results and official baselines.
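One way to enforce that separation in code is to partition entries by an explicit marker and reject anything unlabeled. The sketch below assumes each entry is a dict with a hypothetical `track` field whose values are `"official"` or `"community"`; neither the field nor the function `split_by_track` is part of the current site.

```python
def split_by_track(entries):
    """Partition entries into (official, community) lists by their 'track' field."""
    official, community = [], []
    for entry in entries:
        track = entry.get("track")
        if track == "official":
            official.append(entry)
        elif track == "community":
            community.append(entry)
        else:
            # Refuse unlabeled entries instead of silently mixing them into a table.
            raise ValueError(f"entry has no recognised track: {entry!r}")
    return official, community
```

Raising on an unknown or missing track is a deliberate choice in this sketch: a self-reported result can never land in the official table by default.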
Current State
The local web scaffold already supports:
- a single overall leaderboard landing page
- generated JSON data for future deeper breakdowns
The next steps for this section concern official policy and data:
- define what counts as an official run
- prepare one official baseline snapshot
- regenerate the public leaderboard JSON from that snapshot