Official Review Flow

This page documents how benchmark maintainers evaluate whether a submitted run is eligible for the official leaderboard.

Review Criteria

Criterion       | Requirement
Completeness    | All tasks in the target set must be covered
Reproducibility | Fixed seed, declared sandbox mode, full provenance
Sandbox         | --sandbox task or --sandbox os (not none)
Tool mode       | Must declare if external tools were enabled
Prompt levels   | B1, B2, B3, B4 all required
No modification | Output directory must be unmodified from the run
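As a rough illustration, several of the criteria above can be checked mechanically before a human review. The sketch below is not the maintainers' tooling: the function name and the metadata keys (`seed`, `sandbox`, `tools_enabled`, `prompt_levels`) are assumptions made for this example and may not match the real run_metadata.json schema.

```python
# Hypothetical pre-review check. The metadata keys used here are assumed
# for illustration and are not taken from the actual run_metadata.json schema.
REQUIRED_PROMPT_LEVELS = {"B1", "B2", "B3", "B4"}
ALLOWED_SANDBOX_MODES = {"task", "os"}  # "none" is not eligible


def check_criteria(meta, covered_tasks, target_tasks):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Completeness: every task in the target set must be covered.
    missing_tasks = set(target_tasks) - set(covered_tasks)
    if missing_tasks:
        problems.append("missing tasks: " + ", ".join(sorted(missing_tasks)))

    # Reproducibility: a fixed seed must be declared (bool is rejected explicitly).
    seed = meta.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("seed is not declared as an integer")

    # Sandbox: must be "task" or "os".
    if meta.get("sandbox") not in ALLOWED_SANDBOX_MODES:
        problems.append("sandbox must be 'task' or 'os'")

    # Tool mode: must be declared either way.
    if not isinstance(meta.get("tools_enabled"), bool):
        problems.append("tool mode is not declared")

    # Prompt levels: B1 through B4 are all required.
    missing_levels = REQUIRED_PROMPT_LEVELS - set(meta.get("prompt_levels", []))
    if missing_levels:
        problems.append("missing prompt levels: " + ", ".join(sorted(missing_levels)))

    return problems
```

Returning a list of problems rather than a single boolean lets a submitter see every failed criterion in one pass. The "No modification" criterion is not covered here, since it needs access to the output directory itself.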

Review Process

  1. Submission received: the maintainer acknowledges the submission via a GitHub issue
  2. Provenance check: the fields in run_metadata.json are verified
  3. Spot validation: random instances may be re-run to verify reproducibility
  4. Gate consistency: hard gate pass/fail patterns are checked for anomalies
  5. Acceptance or revision: if issues are found, the submitter is asked to re-run or clarify
  6. Leaderboard entry: accepted results are added to the official leaderboard JSON
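The spot-validation step depends on choosing instances at random. One possible way to make that choice auditable is to draw the sample from a seeded generator, so the same selection can be reproduced later. This is a minimal sketch under that assumption; `pick_spot_check_instances` and `review_seed` are hypothetical names, and the actual selection procedure used by maintainers is not described on this page.

```python
import random


def pick_spot_check_instances(instance_ids, k, review_seed):
    """Pick up to k instance IDs for re-running, reproducibly for a given seed.

    Hypothetical helper: sorting first makes the result independent of the
    order in which the IDs were supplied.
    """
    ids = sorted(instance_ids)
    k = min(k, len(ids))
    # A dedicated Random instance avoids touching the global RNG state.
    return random.Random(review_seed).sample(ids, k)


# Example usage with made-up instance IDs.
chosen = pick_spot_check_instances(["inst-03", "inst-01", "inst-02"], 2, review_seed=2024)
```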

Reproducibility Requirements

  • The same seed must produce the same instance parameters
  • The same agent, model, and config should produce similar (not necessarily identical) scores
  • If scores differ significantly on a spot-check re-run, the submission may be flagged
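The two kinds of requirement above differ in strictness: instance parameters must match exactly, while scores only need to be close. The sketch below illustrates that distinction. The parameter names, the way the seed is combined with the instance index, and the 0.05 tolerance are all assumptions for this example; this page does not define what counts as a significant difference.

```python
import random


def instance_params(seed, instance_index):
    """Derive instance parameters deterministically from a seed (illustrative only)."""
    # Seeding a private generator with a string is deterministic across runs.
    rng = random.Random(f"{seed}:{instance_index}")
    return {
        "size": rng.randint(1, 100),                           # hypothetical parameter
        "difficulty": rng.choice(["easy", "medium", "hard"]),  # hypothetical parameter
    }


def scores_consistent(submitted, rerun, tolerance=0.05):
    """Return True if two scores are within an assumed absolute tolerance."""
    return abs(submitted - rerun) <= tolerance


# Exact equality is expected for parameters, approximate agreement for scores.
same_params = instance_params(7, 0) == instance_params(7, 0)
close_enough = scores_consistent(0.80, 0.82)
```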

Trust Labels

Label         | Meaning
official      | Fully reviewed, reproducibility verified
repo-verified | Run by maintainers during review process
community     | Submitted by external users (future)

Timeline

Reviews are handled on a best-effort basis. Expect 1-2 weeks for initial feedback after submission.