Result Format

This page defines the required structure for benchmark result submissions.

Required Artifacts

A valid submission must include the complete, unmodified output directory produced by ai4sci-bench run or batch-run.

Top-level Files

File                                  Purpose
run_metadata.json                     Global run configuration, agent/model info, provenance
batch_records/batch_overview.json     Summary statistics across all tasks
batch_records/task_scoreboard.json    Per-task aggregated scores
batch_records/task_level_long.json    Per-task, per-prompt-level breakdown
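
Before submitting, it can help to confirm that the top-level files listed above are all present. The following is a minimal sketch, not part of the benchmark tooling: the function name missing_top_level_files and the constant REQUIRED_TOP_LEVEL are hypothetical, and the paths are taken directly from the list above.

```python
from pathlib import Path

# Top-level files named on this page, relative to the output directory.
REQUIRED_TOP_LEVEL = [
    "run_metadata.json",
    "batch_records/batch_overview.json",
    "batch_records/task_scoreboard.json",
    "batch_records/task_level_long.json",
]


def missing_top_level_files(output_dir):
    """Return the required top-level files that are absent from output_dir.

    Hypothetical helper for a pre-submission check; it only tests that each
    path exists as a regular file and does not inspect the contents.
    """
    root = Path(output_dir)
    return [rel for rel in REQUIRED_TOP_LEVEL if not (root / rel).is_file()]
```

An empty list means every file named above was found; any returned path should be investigated before the directory is submitted.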

Per-Instance Files

Under <task_id>/:

File                                  Purpose
<instance_id>.json                    Scored result with component scores, gates, metadata
<instance_id>.agent_stdout.jsonl      Agent execution log (recommended)
<instance_id>.agent_model_output.md   Raw LLM completions (optional)
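
The per-instance layout above can be inspected with a short script. This sketch assumes that every *.json file directly under a task directory is a scored result, which this page does not state explicitly; the function name instance_artifacts is hypothetical.

```python
from pathlib import Path


def instance_artifacts(task_dir):
    """Map each instance ID in task_dir to the companion files that exist.

    Assumption: each "*.json" file in task_dir is a scored result named
    <instance_id>.json. Files ending in ".jsonl" are not matched by the
    "*.json" pattern, so agent logs are not mistaken for results.
    """
    task_dir = Path(task_dir)
    report = {}
    for result in sorted(task_dir.glob("*.json")):
        instance_id = result.stem  # file name without the final ".json"
        report[instance_id] = {
            # Recommended: agent execution log.
            "agent_stdout": (task_dir / f"{instance_id}.agent_stdout.jsonl").is_file(),
            # Optional: raw LLM completions.
            "agent_model_output": (task_dir / f"{instance_id}.agent_model_output.md").is_file(),
        }
    return report
```

Instances whose agent_stdout entry is False are still valid, since the log is only recommended, but including it gives reviewers more to work with.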

Key Fields in run_metadata.json

{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}

Naming Convention

  • Instance IDs are auto-generated and include task parameters and seed
  • Do not rename or restructure the output directory
  • Keep the full directory tree intact when submitting

What Reviewers Check

  1. run_metadata.json provenance fields are complete
  2. Sandbox mode was task or os (not none)
  3. Fixed seed was used
  4. All four prompt levels (b1 through b4) were run
  5. No evidence of result tampering (gate pass patterns are consistent)
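
Several of the checks above can be approximated locally before submitting. The sketch below is illustrative only and is not the reviewers' actual procedure: check_run_metadata is a hypothetical helper, the list of expected keys is assumed from the example run_metadata.json on this page, and the tampering check (item 5) is not covered because this page does not define it precisely.

```python
# Keys shown in the example run_metadata.json on this page (assumed to be
# the fields reviewers expect; the authoritative list may differ).
EXPECTED_KEYS = [
    "agent", "agent_config", "sandbox", "seed", "prompt_levels",
    "benchmark_version", "instances_per_task", "framework_version",
    "result_schema_version",
]
ALLOWED_SANDBOXES = {"task", "os"}
REQUIRED_LEVELS = {"b1", "b2", "b3", "b4"}


def check_run_metadata(meta):
    """Return a list of human-readable problems found in a metadata dict."""
    problems = []

    # Item 1: fields are present and non-empty.
    for key in EXPECTED_KEYS:
        if meta.get(key) in (None, "", [], {}):
            problems.append(f"missing or empty field: {key}")

    # Item 2: sandbox mode must be task or os, not none.
    sandbox = meta.get("sandbox")
    if sandbox not in ALLOWED_SANDBOXES:
        problems.append(f"sandbox must be task or os, got {sandbox!r}")

    # Item 3: a fixed integer seed was recorded.
    seed = meta.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("seed is not a fixed integer")

    # Item 4: all four prompt levels were run (compared case-insensitively).
    levels = meta.get("prompt_levels")
    ran = {str(level).lower() for level in levels} if isinstance(levels, list) else set()
    missing = REQUIRED_LEVELS - ran
    if missing:
        problems.append("prompt levels not run: " + ", ".join(sorted(missing)))

    return problems
```

The function takes an already parsed dictionary, for example the result of json.load on run_metadata.json, and an empty return value means none of these particular checks failed.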