Tasks

The task catalog is the core of the benchmark. Each task is defined as a self-contained scientific evaluation unit with metadata, prompts, runtime requirements, and scoring rules.

The public task catalog is generated from benchmark metadata and then filtered by the website's public visibility policy. At present, the website shows only the `test` subset.

Current Catalog Snapshot

The public website catalog is currently restricted to the test subset of tasks rather than the entire internal task tree.

The live public task count, domain count, and domain distribution are rendered from the generated public site data below rather than maintained by hand.

Catalog Structure and Source

The long-term public task catalog will be generated from structured task metadata in:

tasks/*/*/task.yaml

That metadata is expected to power:

  • task list pages
  • task filters
  • task detail pages
  • homepage featured tasks
  • domain and status summaries
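
As a rough sketch of how that metadata could be discovered, the snippet below globs the `tasks/*/*/task.yaml` pattern and derives a domain and task name from each path. Reading the two wildcard levels as `<domain>/<task>` is an assumption (the pattern alone does not say so), and `discover_task_files` is a hypothetical helper, not part of the benchmark tooling. Parsing the YAML content is left out to keep the example dependency-free.

```python
from pathlib import Path
import tempfile


def discover_task_files(repo_root):
    """Return (domain, task, path) for every tasks/*/*/task.yaml under repo_root.

    Assumption: the two wildcard levels are <domain>/<task>.
    """
    root = Path(repo_root)
    found = []
    for path in sorted(root.glob("tasks/*/*/task.yaml")):
        task_dir = path.parent
        found.append((task_dir.parent.name, task_dir.name, path))
    return found


# Usage with a throwaway directory tree (domain and task names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("chemistry/task_a", "chemistry/task_b", "physics/task_c"):
        task_dir = Path(tmp) / "tasks" / rel
        task_dir.mkdir(parents=True)
        (task_dir / "task.yaml").write_text("")  # empty placeholder file
    discovered = [(d, t) for d, t, _ in discover_task_files(tmp)]
```

Deriving the domain from the directory name rather than from a field inside the file is only one option; a real generator might prefer whatever the metadata itself declares.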

What Each Task Represents

Each task is meant to represent a compact but meaningful scientific workflow. In most cases, that means an agent is expected to:

  1. interpret the scientific goal
  2. inspect the task data
  3. choose an appropriate method
  4. implement the solution
  5. generate artifacts that can be evaluated objectively

This is why task pages should communicate more than just a title and a prompt. They should help users understand what the task is testing and what a successful submission needs to produce.

What a Public Task Page Should Communicate

Each task detail page should eventually show:

  • task title and identifier
  • domain and subdomain
  • public summary
  • expected output types
  • high-level evaluation summary
  • runtime and sandbox notes
  • a safe excerpt from the B1 prompt
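
One way to keep these display fields together is a small record type. The sketch below is illustrative only: `PublicTaskPage` and all of its field names are hypothetical and are not taken from the benchmark's actual metadata schema. The `missing_fields` helper shows how a site generator could report which pages are still incomplete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PublicTaskPage:
    """Hypothetical record holding what a public task page displays."""
    task_id: str
    title: str
    domain: str
    subdomain: str
    public_summary: str
    expected_outputs: List[str] = field(default_factory=list)
    evaluation_summary: str = ""
    runtime_notes: str = ""   # runtime and sandbox notes
    prompt_excerpt: str = ""  # safe excerpt from the B1 prompt

    def missing_fields(self):
        """Names of display fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder values, not a real task.
page = PublicTaskPage(
    task_id="example-task",
    title="Example task",
    domain="example-domain",
    subdomain="example-subdomain",
    public_summary="Placeholder summary.",
    expected_outputs=["results.csv"],
)
```

Here `page.missing_fields()` lists the three fields left at their empty defaults, which a build step could use to flag pages that are not yet ready.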

Public Visibility Policy

The website should not automatically mirror every internal task state one-to-one. The public-facing catalog should represent tasks that are ready to be shown on the benchmark site.

The current policy is intentionally simple:

  • show only tasks marked `test`
  • do not expose `in_development` tasks in the public catalog

This keeps the website from overstating how much benchmark content is already public.
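
The policy can be sketched as a simple allowlist filter. This is a minimal illustration, assuming each task record carries its lifecycle state in a `status` field; that field name, the `public_catalog` helper, and the sample records are all hypothetical.

```python
PUBLIC_STATES = {"test"}  # the current policy: test tasks only


def public_catalog(tasks):
    """Keep only tasks whose lifecycle state is allowed on the public site."""
    return [t for t in tasks if t.get("status") in PUBLIC_STATES]


# Made-up records; "status" is an assumed field name.
tasks = [
    {"id": "task_a", "status": "test"},
    {"id": "task_b", "status": "in_development"},
    {"id": "task_c", "status": "test"},
    {"id": "task_d"},  # missing status: treated as not public
]
visible_ids = [t["id"] for t in public_catalog(tasks)]
```

Using an allowlist rather than excluding `in_development` explicitly means any state added later stays hidden until it is deliberately made public.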

The website may eventually distinguish between:

  • internal task lifecycle state used by the repo and framework
  • public visibility state used by the website

For now, however, the public website follows a strict subset rule: test tasks only.

How to Use This Section

This section is intended to support three user flows:

  1. Browse by domain to understand benchmark coverage.
  2. Inspect individual tasks to see what kinds of workflows the benchmark includes.
  3. Understand output expectations before running or extending the benchmark.

Generated Catalog

The real task details live in the generated catalog section:

  • Tasks -> Catalog

Those pages are built from benchmark metadata and selected public-safe excerpts rather than written entirely by hand.

Current State of This Section

This page is the hand-authored entry point for the catalog. The generated task detail pages under the catalog section are now meant to track the same public test subset, so the public site presents a smaller but clearer task collection.