Tasks

The task catalog is the core of the benchmark. Each task is defined as a self-contained scientific evaluation unit with metadata, prompts, runtime requirements, and scoring rules.

The public task catalog is generated from benchmark metadata, but filtered through a public website policy. Right now, that means the website shows the `test` subset only.

Current Catalog Snapshot

The public website catalog is currently restricted to the `test` subset of tasks rather than the entire internal task tree.

The live public task count, domain count, and domain distribution are rendered from the generated public site data below rather than maintained by hand.
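As a rough illustration of how such summary figures could be derived from task records, here is a minimal sketch. The `domain` key, the function name `summarize_catalog`, and the example records are assumptions made for illustration; they are not taken from the benchmark's actual site generator.

```python
from collections import Counter


def summarize_catalog(tasks):
    """Return task count, domain count, and per-domain distribution."""
    # Assumption: every task record exposes its domain under a "domain" key.
    distribution = Counter(task["domain"] for task in tasks)
    return {
        "task_count": sum(distribution.values()),
        "domain_count": len(distribution),
        "domain_distribution": dict(sorted(distribution.items())),
    }


# Hypothetical records, not real benchmark tasks.
example_tasks = [
    {"id": "task-a", "domain": "chemistry"},
    {"id": "task-b", "domain": "chemistry"},
    {"id": "task-c", "domain": "physics"},
]
summary = summarize_catalog(example_tasks)
```

Deriving all three numbers from one pass over the records keeps them consistent with each other, which is the main reason to generate them rather than maintain them by hand.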

Catalog Structure and Source

The long-term public task catalog will be generated from structured task metadata in:

`tasks/*/*/task.yaml`

That metadata is expected to power:

- task list pages
- task filters
- task detail pages
- homepage featured tasks
- domain and status summaries
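The metadata path pattern above can be walked with a standard glob. The sketch below only locates the files and does not parse them. It assumes the two wildcard levels correspond to a domain directory and a task directory, which the pattern suggests but this page does not state; `discover_task_files` is a hypothetical helper name.

```python
from pathlib import Path


def discover_task_files(repo_root):
    """List task.yaml files matching tasks/*/*/task.yaml under repo_root."""
    root = Path(repo_root)
    entries = []
    for path in sorted(root.glob("tasks/*/*/task.yaml")):
        task_dir = path.parent
        entries.append({
            # Assumption: layout is tasks/<domain>/<task>/task.yaml.
            "domain": task_dir.parent.name,
            "task": task_dir.name,
            "path": path,
        })
    return entries
```

Calling `discover_task_files(".")` from the repository root would return one entry per metadata file. Parsing the YAML content would need a YAML library and is left out here.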

What Each Task Represents

Each task is meant to represent a compact but meaningful scientific workflow. In most cases, that means an agent is expected to:

- interpret the scientific goal
- inspect the task data
- choose an appropriate method
- implement the solution
- generate artifacts that can be evaluated objectively

This is why task pages should communicate more than just a title and a prompt. They should help users understand what the task is testing and what a successful submission needs to produce.

What a Public Task Page Should Communicate

Each task detail page should eventually show:

- task title and identifier
- domain and subdomain
- public summary
- expected output types
- high-level evaluation summary
- runtime and sandbox notes
- a safe excerpt from the B1 prompt
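One way to picture the fields listed above is as a single record per task page. The sketch below is illustrative only: the class name `PublicTaskPage`, its field names, and the `missing_fields` helper are hypothetical and do not describe the benchmark's actual schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PublicTaskPage:
    # All field names are assumptions that mirror the list above.
    task_id: str
    title: str
    domain: str
    subdomain: str
    public_summary: str = ""
    expected_output_types: list = field(default_factory=list)
    evaluation_summary: str = ""
    runtime_notes: str = ""
    b1_prompt_excerpt: str = ""


def missing_fields(page):
    """Return the names of fields that are still empty on a page record."""
    return [f.name for f in fields(page) if not getattr(page, f.name)]
```

A check like `missing_fields(page)` could flag task pages that do not yet carry everything the list calls for, since the page says these items should "eventually" be shown.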

Public Visibility Policy

The website should not automatically mirror every internal task state one-to-one. The public-facing catalog should represent tasks that are ready to be shown on the benchmark site.

The current policy is intentionally simple:

- show only tasks marked `test`
- do not expose `in_development` tasks in the public catalog

This keeps the website from overstating how much benchmark content is already public.
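The policy above amounts to an allow-list filter. Here is a minimal sketch, assuming each task record stores its state under a `status` key; that key name, the helper names, and the example records are assumptions for illustration.

```python
# Assumption: each task record carries its lifecycle state under a "status" key.
PUBLIC_STATUSES = frozenset({"test"})


def is_public(task):
    """True only for tasks whose status is in the public allow-list."""
    return task.get("status") in PUBLIC_STATUSES


def public_catalog(tasks):
    """Keep public tasks, preserving their original order."""
    return [task for task in tasks if is_public(task)]


# Hypothetical records, not real benchmark tasks.
example_tasks = [
    {"id": "task-a", "status": "test"},
    {"id": "task-b", "status": "in_development"},
    {"id": "task-c"},  # no status recorded: treated as not public
]
visible_ids = [task["id"] for task in public_catalog(example_tasks)]
```

Using an allow-list rather than excluding `in_development` explicitly means any state not named in `PUBLIC_STATUSES`, including a missing one, stays hidden by default.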

The website may eventually distinguish between:

- internal task lifecycle state used by the repo and framework
- public visibility state used by the website

For now, however, the public website follows a strict subset rule: `test` tasks only.

How to Use This Section

This section is intended to support three user flows:

1. Browse by domain to understand benchmark coverage.
2. Inspect individual tasks to see what kinds of workflows the benchmark includes.
3. Understand output expectations before running or extending the benchmark.

Generated Catalog

The full task details live in the generated catalog section:

Tasks -> Catalog

Those pages are built from benchmark metadata and selected public-safe excerpts rather than written entirely by hand.

Current State of This Section

This page is the hand-authored entry point for the catalog. The generated task detail pages under the catalog section are meant to track the same public `test` subset, so the public site presents a smaller but clearer task collection.