Leaderboard

This section is the public home for ASI-Bench leaderboard data.

The table below is generated from repo-verified review records, so the website can display real, checked data before the first official baseline snapshot is published. Rows marked repo-verified are review snapshots, not final official benchmark results.

Scope

The official leaderboard is intended to show benchmark runs that have been reviewed and accepted by the benchmark maintainers. Community submissions, if added later, should remain clearly separated from the official table.

At the moment, the site shows a conservative repo-verified review snapshot. This keeps the visual leaderboard useful while making clear that the data does not yet come from a final official baseline run.

Data Source

The final leaderboard pages are designed to consume generated reporting artifacts derived from benchmark results, especially:

  • batch_overview.json
  • task_scoreboard.json
  • task_level_long.json

Until those full artifacts are available, this page consumes a generated review snapshot derived from task_review/round1/final_review_record.md.
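The fallback described above can be sketched as a small selection step. This is only an illustration under stated assumptions: the function name `choose_data_source` and the snapshot file name `review_snapshot.json` are hypothetical, and the site's actual build scripts may locate their inputs differently.

```python
from pathlib import Path

# The three reporting artifacts named on this page.
FULL_ARTIFACTS = (
    "batch_overview.json",
    "task_scoreboard.json",
    "task_level_long.json",
)


def choose_data_source(report_dir, snapshot_name="review_snapshot.json"):
    """Return ("full", paths) if every artifact exists, else the review snapshot.

    `snapshot_name` is a hypothetical file name used for illustration only.
    """
    report_dir = Path(report_dir)
    missing = [name for name in FULL_ARTIFACTS if not (report_dir / name).is_file()]
    if not missing:
        return "full", [report_dir / name for name in FULL_ARTIFACTS]
    # At least one artifact is missing: fall back to the review snapshot.
    return "review_snapshot", [report_dir / snapshot_name]
```

Requiring all three artifacts before switching avoids rendering a page that mixes full results with review-snapshot data.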

What Will Appear Here

Once official benchmark results are published, this section should answer four questions quickly:

  1. which agents or model setups were evaluated
  2. how they score overall
  3. how performance changes across B1-B4 at a glance
  4. how many tasks and runs are covered by the displayed snapshot
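As a rough sketch, the four questions above could be answered from a flat list of per-run records. The record fields (`setup`, `task_id`, `level`, `score`) are assumptions made for this example, not the schema of any published artifact, and the unweighted mean is a placeholder because the benchmark's aggregation rule is not stated here.

```python
from collections import defaultdict
from statistics import fmean


def summarize_snapshot(runs):
    """Summarize hypothetical per-run records, grouped by evaluated setup."""
    by_setup = defaultdict(list)
    for run in runs:
        by_setup[run["setup"]].append(run)

    summary = {}
    for setup, setup_runs in by_setup.items():
        scores_by_level = defaultdict(list)
        for run in setup_runs:
            scores_by_level[run["level"]].append(run["score"])
        summary[setup] = {
            # Placeholder aggregation: unweighted mean over all runs.
            "overall": fmean([run["score"] for run in setup_runs]),
            # Mean score per level, e.g. "B1" through "B4".
            "by_level": {
                level: fmean(scores)
                for level, scores in sorted(scores_by_level.items())
            },
            "tasks": len({run["task_id"] for run in setup_runs}),
            "runs": len(setup_runs),
        }
    return summary
```

The keys of the returned dictionary list the evaluated setups, and each value carries the overall score, the per-level scores, and the task and run counts for the snapshot.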

Core Metrics and Metadata

The default leaderboard should prioritize:

  • overall score
  • B1-B4 scores
  • number of tasks covered
  • model / method group metadata

Additional metadata such as external tool usage, sandbox mode, and result version can be shown when available.
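One way to picture a leaderboard row with these fields is a small record type with required core metrics and optional metadata. The class and field names below are illustrative assumptions, not the generated JSON schema, and ranking by overall score is only a plausible default.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LeaderboardRow:
    # Core fields the default leaderboard prioritizes.
    model: str
    method_group: str
    overall_score: float
    level_scores: Dict[str, float]  # keys such as "B1" ... "B4"
    tasks_covered: int
    # Optional metadata, shown only when available.
    external_tools: Optional[bool] = None
    sandbox_mode: Optional[str] = None
    result_version: Optional[str] = None


def rank_rows(rows: List[LeaderboardRow]) -> List[LeaderboardRow]:
    """Sort by overall score, highest first; break ties by model name."""
    return sorted(rows, key=lambda row: (-row.overall_score, row.model))
```

Keeping the optional fields defaulted to `None` lets a renderer hide columns that no row in the snapshot provides.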

Official vs Community Results

This site is designed around a clear separation:

  • Official leaderboard: reviewed runs accepted by the benchmark maintainers
  • Community submissions: a possible future layer that may live behind a separate Hugging Face Space workflow

That separation matters because the benchmark is intended to support reproducible, auditable comparisons rather than a mixed feed of self-reported results and official baselines.
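A minimal sketch of how that separation might be enforced in code, assuming each record carries a hypothetical `track` field. Nothing here reflects an existing schema; the point is only that records without an explicit official label never reach the official table.

```python
def split_by_track(records):
    """Partition records into (official, other).

    `track` is a hypothetical field name. Records that lack it, or that
    carry any value other than "official", are kept out of the official list.
    """
    official, other = [], []
    for record in records:
        if record.get("track") == "official":
            official.append(record)
        else:
            other.append(record)
    return official, other
```

Defaulting unlabelled records to the non-official side is the conservative choice: a missing label cannot accidentally promote a self-reported result.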

Current State

The local web scaffold already supports:

  • a single overall leaderboard landing page
  • generated JSON data for future deeper breakdowns

The next steps for this section concern official policy and data:

  • define what counts as an official run
  • prepare one official baseline snapshot
  • regenerate the public leaderboard JSON from that snapshot