Official Review Flow

This page documents how benchmark maintainers evaluate whether a submitted run is eligible for the official leaderboard.

Review Criteria

Criterion	Requirement
Completeness	All tasks in the target set must be covered
Reproducibility	Fixed seed, declared sandbox mode, full provenance
Sandbox	`--sandbox task` or `--sandbox os` (not `none`)
Tool mode	Must declare if external tools were enabled
Prompt levels	B1, B2, B3, B4 all required
No modification	Output directory must be unmodified from the run
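
The criteria above can be sketched as a pre-submission self-check. This is a minimal illustration, not the maintainers' tooling: the metadata keys (`tasks`, `seed`, `sandbox`, `tools_enabled`, `prompt_levels`) and the function name `check_criteria` are hypothetical, since this page does not specify the layout of `run_metadata.json`. The "No modification" criterion is not covered here.

```python
from typing import Any

# Values taken from the criteria table above.
REQUIRED_PROMPT_LEVELS = {"B1", "B2", "B3", "B4"}
ALLOWED_SANDBOX_MODES = {"task", "os"}


def check_criteria(meta: dict[str, Any], target_tasks: set[str]) -> list[str]:
    """Return a list of problems; an empty list means no issue was found.

    All key names read from `meta` are assumptions made for illustration.
    """
    problems: list[str] = []

    # Completeness: every task in the target set must be covered.
    missing_tasks = target_tasks - set(meta.get("tasks", []))
    if missing_tasks:
        problems.append(f"missing tasks: {sorted(missing_tasks)}")

    # Reproducibility: a fixed integer seed must be recorded.
    seed = meta.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("seed is missing or not an integer")

    # Sandbox: `task` or `os` is accepted, `none` is not.
    if meta.get("sandbox") not in ALLOWED_SANDBOX_MODES:
        problems.append(f"sandbox mode {meta.get('sandbox')!r} is not allowed")

    # Tool mode: must be declared explicitly, whether enabled or not.
    if not isinstance(meta.get("tools_enabled"), bool):
        problems.append("tool mode is not declared")

    # Prompt levels: B1 through B4 are all required.
    missing_levels = REQUIRED_PROMPT_LEVELS - set(meta.get("prompt_levels", []))
    if missing_levels:
        problems.append(f"missing prompt levels: {sorted(missing_levels)}")

    return problems
```

Returning a list of problems instead of a single boolean lets a submitter see every failing criterion in one pass.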

1. Submission received: the maintainer acknowledges the submission via a GitHub issue.
2. Provenance check: the fields in run_metadata.json are verified.
3. Spot validation: random instances may be re-run to verify reproducibility.
4. Gate consistency: hard gate pass/fail patterns are checked for anomalies.
5. Acceptance or revision: if issues are found, the submitter is asked to re-run or clarify.
6. Leaderboard entry: accepted results are added to the official leaderboard JSON.
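
The spot validation step could be implemented roughly as follows. This is only a sketch under stated assumptions: the function name `pick_spot_check`, the sample size `k`, and the use of a separate reviewer seed are hypothetical, and this page does not describe how maintainers actually choose instances.

```python
import random


def pick_spot_check(instance_ids: list[str], k: int, rng_seed: int) -> list[str]:
    """Choose up to k instances to re-run, reproducibly for a given rng_seed."""
    # A dedicated Random instance avoids touching the global RNG state.
    rng = random.Random(rng_seed)
    # Sorting first makes the selection independent of the input order.
    candidates = sorted(instance_ids)
    return rng.sample(candidates, min(k, len(candidates)))
```

Seeding the selection means a second reviewer can reproduce exactly which instances were re-run.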

The same seed must produce the same instance parameters.
The same agent, model, and config should produce similar (not necessarily identical) scores.
If scores differ significantly on a spot-check re-run, the submission may be flagged.
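
The two expectations above differ in strictness: instance parameters must match exactly, while scores only need to be close. A minimal sketch of that comparison follows. The record keys (`instance_params`, `score`), the function name `compare_rerun`, and the default tolerance of 0.05 are assumptions for illustration; this page does not define what counts as a significant difference.

```python
from typing import Any


def compare_rerun(
    original: dict[str, Any],
    rerun: dict[str, Any],
    score_tolerance: float = 0.05,  # hypothetical threshold, not an official value
) -> list[str]:
    """Compare a submitted result with a spot-check re-run of the same instance."""
    findings: list[str] = []

    # Same seed must produce the same instance parameters: exact equality.
    if original["instance_params"] != rerun["instance_params"]:
        findings.append("instance parameters differ for the same seed")

    # Scores are expected to be similar, not identical: compare within a tolerance.
    delta = abs(original["score"] - rerun["score"])
    if delta > score_tolerance:
        findings.append(f"score differs by {delta:.3f} (tolerance {score_tolerance})")

    return findings
```

An empty result means the re-run is consistent under these assumptions; any finding would be a reason to flag the submission for follow-up.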

Label	Meaning
`official`	Fully reviewed, reproducibility verified
`repo-verified`	Run by maintainers during review process
`community`	Submitted by external users (future)

Reviews are handled on a best-effort basis. Expect 1-2 weeks for initial feedback after submission.