Official Review Flow
This page documents how benchmark maintainers evaluate whether a submitted run is eligible for the official leaderboard.
Review Criteria
| Criterion | Requirement |
|---|---|
| Completeness | All tasks in the target set must be covered |
| Reproducibility | Fixed seed, declared sandbox mode, full provenance |
| Sandbox | `--sandbox task` or `--sandbox os` (not `none`) |
| Tool mode | Must declare if external tools were enabled |
| Prompt levels | B1, B2, B3, B4 all required |
| No modification | Output directory must be unmodified from the run |
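As a rough illustration, the metadata-related criteria above could be checked before a run is submitted. This is a minimal sketch, not the maintainers' tooling: the function name `check_criteria` and the metadata keys `seed`, `sandbox`, `tools_enabled`, and `prompt_levels` are assumptions made for the example, and the real `run_metadata.json` schema may differ. The "No modification" criterion is not covered here.

```python
REQUIRED_PROMPT_LEVELS = {"B1", "B2", "B3", "B4"}
ALLOWED_SANDBOX_MODES = {"task", "os"}  # "none" is not eligible


def check_criteria(metadata, covered_tasks, target_tasks):
    """Return a list of problems; an empty list means every check passed.

    All metadata keys used here are hypothetical, not the benchmark's
    documented schema.
    """
    problems = []

    # Completeness: every task in the target set must be covered.
    missing = set(target_tasks) - set(covered_tasks)
    if missing:
        problems.append(f"completeness: missing tasks {sorted(missing)}")

    # Reproducibility: a fixed integer seed must be declared.
    seed = metadata.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("reproducibility: no fixed integer seed declared")

    # Sandbox: only the "task" and "os" modes are eligible.
    if metadata.get("sandbox") not in ALLOWED_SANDBOX_MODES:
        problems.append("sandbox: mode must be 'task' or 'os'")

    # Tool mode: must be declared explicitly, whether on or off.
    if not isinstance(metadata.get("tools_enabled"), bool):
        problems.append("tool mode: external tool usage not declared")

    # Prompt levels: B1 through B4 are all required.
    levels = set(metadata.get("prompt_levels", []))
    missing_levels = REQUIRED_PROMPT_LEVELS - levels
    if missing_levels:
        problems.append(f"prompt levels: missing {sorted(missing_levels)}")

    return problems
```

Returning a list of problems rather than a single boolean lets a submitter see every failing criterion in one pass.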
Review Process
- Submission received: maintainer acknowledges via GitHub issue
- Provenance check: `run_metadata.json` fields verified
- Spot validation: random instances may be re-run to verify reproducibility
- Gate consistency: hard gate pass/fail patterns checked for anomalies
- Acceptance or revision: if issues are found, the submitter is asked to re-run or clarify
- Leaderboard entry: accepted results are added to the official leaderboard JSON
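The spot-validation step can be pictured as drawing a small random sample of instances to re-run. The sketch below only shows the idea: `pick_spot_check_instances`, its parameters, and the instance IDs are hypothetical, and the maintainers' actual selection method is not documented here.

```python
import random


def pick_spot_check_instances(instance_ids, k, seed):
    """Choose up to k instance IDs to re-run (hypothetical helper).

    Sorting first makes the choice independent of input order, and a
    seeded generator makes the selection itself repeatable.
    """
    ordered = sorted(instance_ids)
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))


# Example with made-up instance IDs.
selected = pick_spot_check_instances(["inst-03", "inst-01", "inst-02"], k=2, seed=0)
```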
Reproducibility Requirements
- Same seed must produce same instance parameters
- Same agent + model + config should produce similar (not identical) scores
- If scores differ significantly on spot-check re-run, the submission may be flagged
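The two kinds of reproducibility above behave differently: instance parameters derived from a seed should match exactly, while scores are only expected to be close. The following sketch illustrates that distinction under stated assumptions. The parameter generator, the `difficulty` field, and the `tolerance` default of 0.05 are invented for the example; this page does not define what counts as a significant difference.

```python
import random


def generate_instance_params(seed, n_instances):
    """Derive instance parameters from a seed (illustrative stand-in).

    A dedicated random.Random(seed) keeps the result independent of any
    global random state, so the same seed always yields the same list.
    """
    rng = random.Random(seed)
    return [
        {"instance": i, "difficulty": rng.randint(1, 10)}
        for i in range(n_instances)
    ]


def scores_consistent(original, rerun, tolerance=0.05):
    """Return True if two scores agree within a hypothetical tolerance."""
    return abs(original - rerun) <= tolerance


# Exact match expected for parameters generated from the same seed.
same_params = generate_instance_params(42, 5) == generate_instance_params(42, 5)

# Scores are compared approximately, not for equality.
close_enough = scores_consistent(0.80, 0.83)
```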
Trust Labels
| Label | Meaning |
|---|---|
| `official` | Fully reviewed, reproducibility verified |
| `repo-verified` | Run by maintainers during review process |
| `community` | Submitted by external users (future) |
Timeline
Reviews are handled on a best-effort basis. Expect 1-2 weeks for initial feedback after submission.