Tasks
The task catalog is the core of the benchmark. Each task is defined as a self-contained scientific evaluation unit with metadata, prompts, runtime requirements, and scoring rules.
Current Catalog Snapshot
The public website catalog is currently restricted to the test subset of tasks rather than the entire internal task tree.
The live public task count, domain count, and domain distribution are rendered from the generated public site data below rather than maintained by hand.
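As a rough illustration of how such figures could be derived rather than maintained by hand, the sketch below computes a task count, domain count, and domain distribution from a list of records. The record layout (the `id` and `domain` keys), the sample values, and the function name `summarize_catalog` are assumptions made for this example; the actual generated site data format is not described on this page.

```python
from collections import Counter

# Hypothetical records standing in for the generated public site data.
# The field names "id" and "domain" are assumptions for illustration.
SAMPLE_PUBLIC_TASKS = [
    {"id": "task-a", "domain": "chemistry"},
    {"id": "task-b", "domain": "physics"},
    {"id": "task-c", "domain": "chemistry"},
]


def summarize_catalog(tasks):
    """Derive the task count, domain count, and per-domain distribution."""
    domain_counts = Counter(task["domain"] for task in tasks)
    return {
        "task_count": len(tasks),
        "domain_count": len(domain_counts),
        # Sort by domain name so the rendered summary is stable across builds.
        "domain_distribution": dict(sorted(domain_counts.items())),
    }


if __name__ == "__main__":
    print(summarize_catalog(SAMPLE_PUBLIC_TASKS))
```

Because every number comes from the same list of records, the summary cannot drift out of sync with the task list it describes.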
Catalog Structure and Source
The long-term public task catalog will be generated from structured task metadata in:
That metadata is expected to power:
- task list pages
- task filters
- task detail pages
- homepage featured tasks
- domain and status summaries
What Each Task Represents
Each task is meant to represent a compact but meaningful scientific workflow. In most cases, that means an agent is expected to:
- interpret the scientific goal
- inspect the task data
- choose an appropriate method
- implement the solution
- generate artifacts that can be evaluated objectively
This is why task pages should communicate more than just a title and a prompt. They should help users understand what the task is testing and what a successful submission needs to produce.
What a Public Task Page Should Communicate
Each task detail page should eventually show:
- task title and identifier
- domain and subdomain
- public summary
- expected output types
- high-level evaluation summary
- runtime and sandbox notes
- a safe excerpt from the B1 prompt
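One way to picture the fields listed above is as a single record per task page. The sketch below is only an assumption about how that record might look: the class name `PublicTaskPage`, its attribute names, and the helper `missing_fields` are hypothetical and do not come from the benchmark's actual metadata schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PublicTaskPage:
    """Hypothetical record mirroring the items a task detail page should show."""

    task_id: str
    title: str
    domain: str
    subdomain: str
    public_summary: str
    expected_output_types: List[str] = field(default_factory=list)
    evaluation_summary: str = ""
    runtime_notes: str = ""
    b1_prompt_excerpt: str = ""


def missing_fields(page):
    """Return the names of fields that are still empty on a page record."""
    return [f.name for f in fields(page) if not getattr(page, f.name)]


if __name__ == "__main__":
    draft = PublicTaskPage(
        task_id="task-a",
        title="Example task",
        domain="chemistry",
        subdomain="spectroscopy",
        public_summary="Placeholder summary for illustration.",
    )
    print(missing_fields(draft))
```

A check like `missing_fields` could help flag pages that do not yet communicate everything in the list before they are published.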
Public Visibility Policy
The website should not automatically mirror every internal task state one-to-one. The public-facing catalog should represent tasks that are ready to be shown on the benchmark site.
The current policy is intentionally simple:
- show only tasks marked test
- do not expose in_development tasks in the public catalog
This keeps the website from presenting tasks that are still in development as public benchmark content.
The website may eventually distinguish between:
- internal task lifecycle state used by the repo and framework
- public visibility state used by the website
For now, however, the public website follows a strict subset rule: test tasks only.
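The strict subset rule can be sketched as a simple filter over task records. The `status` key, the sample records, and the names `PUBLIC_STATUSES` and `select_public_tasks` are assumptions for illustration; only the status values test and in_development come from this page.

```python
# Statuses allowed in the public catalog under the current policy.
PUBLIC_STATUSES = frozenset({"test"})

# Hypothetical task records; the "status" key is an assumed field name.
SAMPLE_TASKS = [
    {"id": "task-a", "status": "test"},
    {"id": "task-b", "status": "in_development"},
    {"id": "task-c", "status": "test"},
]


def select_public_tasks(tasks, public_statuses=PUBLIC_STATUSES):
    """Keep only tasks whose status is publicly visible.

    Tasks with a missing or unknown status are excluded, so nothing
    becomes public by accident.
    """
    return [task for task in tasks if task.get("status") in public_statuses]


if __name__ == "__main__":
    for task in select_public_tasks(SAMPLE_TASKS):
        print(task["id"])
```

Keeping the allowed statuses in one set means a later split between internal lifecycle state and public visibility state would only need to change the filter's input, not its logic.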
How to Use This Section
This section is intended to support three user flows:
- Browse by domain to understand benchmark coverage.
- Inspect individual tasks to see what kinds of workflows the benchmark includes.
- Understand output expectations before running or extending the benchmark.
Generated Catalog
The real task details live in the generated catalog section:
Tasks -> Catalog
Those pages are built from benchmark metadata and selected public-safe excerpts rather than written entirely by hand.
Current State of This Section
This page is the hand-authored entry point for the catalog. The generated task detail pages under the catalog section are now meant to track the same public test subset, so the public site presents a smaller but clearer task collection.