Leaderboard

This section is the public home for ASI-Bench leaderboard data.

The table below is generated from repo-verified review records, so the website can display real, checked data before the first official baseline snapshot is published. Rows marked repo-verified are review snapshots, not final official benchmark results.

Scope

The official leaderboard is intended to show benchmark runs that have been reviewed and accepted by the benchmark maintainers. Community submissions, if added later, should remain clearly separated from the official table.

At the moment, the site shows a conservative repo-verified review snapshot. This keeps the visual leaderboard useful while making clear that the data is not yet a final official baseline run.

Data Source

The final leaderboard pages are designed to consume generated reporting artifacts derived from benchmark results, especially:

batch_overview.json
task_scoreboard.json
task_level_long.json

Until those full artifacts are available, this page consumes a generated review snapshot derived from task_review/round1/final_review_record.md.
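A minimal sketch of that fallback, assuming the three artifacts sit together in one directory and the generated review snapshot is a single file. Only the three artifact filenames come from this page; the function name `pick_leaderboard_source`, the directory layout, and the snapshot path are hypothetical.

```python
from pathlib import Path

# Artifact names listed on this page.
FULL_ARTIFACTS = (
    "batch_overview.json",
    "task_scoreboard.json",
    "task_level_long.json",
)


def pick_leaderboard_source(artifact_dir, snapshot_path):
    """Choose which inputs the leaderboard page should read.

    Returns ("full_artifacts", [paths]) when all three reporting artifacts
    exist, otherwise ("review_snapshot", [snapshot_path]).
    The directory layout and snapshot path are assumptions for illustration.
    """
    artifact_dir = Path(artifact_dir)
    paths = [artifact_dir / name for name in FULL_ARTIFACTS]
    if all(p.is_file() for p in paths):
        return "full_artifacts", paths
    return "review_snapshot", [Path(snapshot_path)]
```

Requiring all three files before switching avoids rendering a page from a partially generated set of artifacts.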

What Will Appear Here

Once official benchmark results are published, this section should answer four questions quickly:

which agents or model setups were evaluated
how they score overall
how performance changes across B1-B4 at a glance
how many tasks and runs are covered by the displayed snapshot

Core Metrics and Metadata

The default leaderboard should prioritize:

overall score
B1-B4 scores
number of tasks covered
model / method group metadata

Additional metadata such as external tool usage, sandbox mode, and result version can be shown when available.
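One possible shape for a leaderboard row that carries these fields is sketched below. The class name, every field name, and the ranking rule are hypothetical; the actual schema of the generated artifacts may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LeaderboardRow:
    # All field names are hypothetical; the real artifact schema may differ.
    agent: str                     # agent or model setup that was evaluated
    overall: float                 # overall score
    b_scores: Dict[str, float]     # per-level scores keyed "B1" .. "B4"
    tasks_covered: int             # number of tasks covered
    method_group: Optional[str] = None
    # Optional extras, e.g. external tool usage, sandbox mode, result version.
    extra: Dict[str, str] = field(default_factory=dict)


def rank_rows(rows: List[LeaderboardRow]) -> List[LeaderboardRow]:
    """Sort by overall score, highest first; ties are broken by agent name."""
    return sorted(rows, key=lambda r: (-r.overall, r.agent))
```

Keeping the optional metadata in a separate mapping lets the default table stay narrow while still exposing the extra fields when they are available.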

Official vs Community Results

This site is designed around a clear separation:

Official leaderboard: reviewed runs accepted by the benchmark maintainers
Community submissions: a possible future layer that may live behind a separate Hugging Face Space workflow

That separation matters because the benchmark is intended to support reproducible, auditable comparisons rather than a mixed feed of self-reported results and official baselines.
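As a sketch of how that separation could be enforced in the data rather than only in the page layout, the function below partitions rows into two lists and rejects anything unlabeled. It assumes each row is a dict with a hypothetical `source` key whose value is `"official"` or `"community"`; neither the key nor the values are defined by this page.

```python
def split_by_source(rows):
    """Partition row dicts into (official, community) lists.

    Assumes a hypothetical "source" key on each row. Rows with a missing or
    unknown value raise an error, so unlabeled results never reach either table.
    """
    official, community = [], []
    for row in rows:
        source = row.get("source")
        if source == "official":
            official.append(row)
        elif source == "community":
            community.append(row)
        else:
            raise ValueError(f"row has unknown source: {source!r}")
    return official, community
```

Failing loudly on an unlabeled row is a deliberate choice here: silently defaulting to either table would recreate the mixed feed this section aims to avoid.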

Current State

The local web scaffold already supports:

a single overall leaderboard landing page
generated JSON data for future deeper breakdowns

The next steps for this section concern official policy and data:

define what counts as an official run
prepare one official baseline snapshot
regenerate the public leaderboard JSON from that snapshot
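The regeneration step might look roughly like the sketch below, which wraps the rows in a payload that records whether they come from a repo-verified review snapshot or an official one. The payload keys, the allowed status values, and both function names are hypothetical and are not the site's actual generator.

```python
import json


def build_leaderboard_payload(rows, snapshot_status):
    """Wrap rows with a status label so the site can state what kind of data it shows.

    The payload keys and the status values are assumptions for illustration.
    """
    if snapshot_status not in ("repo-verified", "official"):
        raise ValueError(f"unexpected snapshot status: {snapshot_status!r}")
    return {
        "snapshot_status": snapshot_status,
        "row_count": len(rows),
        "rows": list(rows),
    }


def write_leaderboard_json(path, rows, snapshot_status):
    """Write the payload as stable, diff-friendly JSON."""
    payload = build_leaderboard_payload(rows, snapshot_status)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
        fh.write("\n")
```

Sorting keys and using a fixed indent keeps regenerated files easy to review in version control when a new snapshot replaces the old one.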