A guided experience for turning a hard scientific problem into a rigorous, reproducible, reviewable benchmark task.
Describe the scientific question and why it's hard.
Write the four-band prompts and assemble the evaluation.
Declare local testing, submit, and track review feedback.