Read Results

Understand what the benchmark produces after a run — where files go, what they contain, and how to interpret scores.

Output directory structure

After a run completes, the output directory (default results/) contains:

results/
  run_metadata.json              # Global run configuration + provenance
  <task_id>/
    <instance_id>.json           # Per-instance scored result
    <instance_id>.agent_stdout.jsonl   # Raw agent output log
    <instance_id>.agent_model_output.md  # LLM completions (direct_llm)
  batch_records/                 # Present in batch-run mode
    batch_overview.json          # Summary across all tasks
    task_scoreboard.json         # Per-task aggregated scores
    task_level_long.json         # Per-task-per-level breakdown
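
The layout above can be walked with a few lines of Python. The following is a minimal sketch, not part of the benchmark itself: the helper name `iter_instance_results` is hypothetical, and it assumes every `*.json` file inside a task directory is a per-instance result.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_instance_results(results_dir: str = "results") -> Iterator[Tuple[str, str, Dict]]:
    """Yield (task_id, instance_id, result) for every per-instance JSON file.

    Hypothetical helper: assumes each *.json file in a task directory is a
    scored result, as in the layout shown above.
    """
    root = Path(results_dir)
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if task_dir.name == "batch_records":
            continue  # aggregated batch files, not per-instance results
        # "*.json" does not match the .jsonl and .md log files.
        for path in sorted(task_dir.glob("*.json")):
            with path.open(encoding="utf-8") as fh:
                yield task_dir.name, path.stem, json.load(fh)
```

Calling `list(iter_instance_results("results"))` would return one tuple per scored instance, skipping `run_metadata.json` and the raw agent logs.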

Key files

run_metadata.json

Global configuration for the entire run:

{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}
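
When comparing several result directories, it can help to print the run configuration first. This sketch reads only the keys shown in the sample above; `describe_run` is a hypothetical helper name, and missing keys are reported as `None` or `unknown` rather than raising.

```python
import json
from pathlib import Path


def describe_run(results_dir: str = "results") -> str:
    """Return a one-line summary of run_metadata.json (hypothetical helper)."""
    with (Path(results_dir) / "run_metadata.json").open(encoding="utf-8") as fh:
        meta = json.load(fh)
    # The model name is nested under agent_config in the sample above.
    model = meta.get("agent_config", {}).get("model", "unknown")
    levels = ",".join(meta.get("prompt_levels", []))
    return (
        f"agent={meta.get('agent')} model={model} "
        f"sandbox={meta.get('sandbox')} seed={meta.get('seed')} levels={levels}"
    )
```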

Per-instance result (<instance_id>.json)

Each instance produces a scored result with these key fields:

| Field | Description |
| --- | --- |
| final_score | Overall score (0-100) |
| component_scores | Breakdown by scorer (numerical accuracy, code quality, etc.) |
| gate_results | Hard/soft gate pass/fail status |
| requested_mode | What sandbox mode was requested |
| effective_mode | What actually ran |
| enforcement_status | Whether the sandbox was enforced |
| verification_status | Post-run verification outcome |
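
To see why an instance scored low, the component breakdown is usually the first place to look. The sketch below assumes `component_scores` is a flat mapping from scorer name to a number; the exact shape is not specified here, so treat both that assumption and the scorer names in the usage line as hypothetical.

```python
from typing import Dict, Optional, Tuple


def weakest_component(result: Dict) -> Optional[Tuple[str, float]]:
    """Return (scorer_name, score) for the lowest-scoring component.

    Assumption: component_scores maps scorer names to numeric scores.
    Returns None when the field is missing or empty.
    """
    scores = result.get("component_scores") or {}
    if not scores:
        return None
    name = min(scores, key=scores.get)
    return name, scores[name]
```

For example, `weakest_component({"component_scores": {"numerical_accuracy": 40.0, "code_quality": 90.0}})` picks out the `numerical_accuracy` entry.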

batch_records/task_scoreboard.json

When using batch-run, this file provides the leaderboard-friendly view:

[
  {
    "task_id": "physics.sod_shock_tube",
    "agent": "direct_llm",
    "model": "claude-sonnet-4-20250514",
    "b1": 100.0,
    "b2": 98.5,
    "b3": 53.2,
    "b4": 74.9,
    "mean_score": 81.6
  }
]
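
Because the scoreboard is a flat list of rows, it is easy to post-process. The following sketch uses only the keys shown in the sample above; the helper names `rank_tasks` and `weakest_level` are hypothetical, and the level names are taken from the sample rather than from a schema.

```python
from typing import Dict, List

# Prompt levels as they appear in the sample row above.
LEVELS = ("b1", "b2", "b3", "b4")


def rank_tasks(rows: List[Dict]) -> List[Dict]:
    """Sort scoreboard rows from highest to lowest mean_score."""
    return sorted(rows, key=lambda row: row["mean_score"], reverse=True)


def weakest_level(row: Dict) -> str:
    """Return the prompt level with the lowest score in one row.

    Levels missing from the row are ignored.
    """
    present = {lvl: row[lvl] for lvl in LEVELS if row.get(lvl) is not None}
    return min(present, key=present.get)
```

Applied to the sample row, `weakest_level` returns `"b3"`, the level with the 53.2 score.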

Using ai4sci-bench report

The report command renders a human-readable summary from the results directory:

ai4sci-bench report
ai4sci-bench report --output-dir path/to/results

It displays:

  • Per-task scores with B-level breakdown
  • Gate pass/fail summary
  • Low-scoring instances with diagnostic hints from scorer details

Understanding scores

  • 100: Perfect — all output artifacts match reference within tolerances
  • 70-99: Strong — most components correct, minor numerical or format issues
  • 30-69: Partial — core logic present but significant deviations
  • 0-29: Weak — fundamental approach or output issues
  • 0 (with gate fail): A hard gate blocked scoring entirely (e.g., required file missing)
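
The bands above can be expressed as a small lookup. This is a sketch rather than the benchmark's own logic: `score_band` is a hypothetical helper, and it assumes fractional scores fall into the band below the next threshold (so 99.5 counts as Strong, not Perfect).

```python
def score_band(final_score: float, hard_gate_failed: bool = False) -> str:
    """Map a 0-100 score to the interpretation bands listed above.

    Assumption: band edges are treated as lower thresholds, so any score
    below 100 but at least 70 is "strong", and so on.
    """
    if hard_gate_failed:
        return "gate fail"  # scoring was blocked; the score is 0
    if final_score >= 100:
        return "perfect"
    if final_score >= 70:
        return "strong"
    if final_score >= 30:
        return "partial"
    return "weak"
```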

Gate types

| Gate | Severity | Behavior |
| --- | --- | --- |
| file_match | hard | Required output files must exist |
| code_analysis | hard or soft | Pattern checks in agent code |
| Custom gates | configurable | Task-specific invariant checks |

A failed hard gate blocks all scoring: the final score is 0 regardless of the other components. A failed soft gate only produces a warning and leaves the score unchanged.
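
The difference between hard and soft gates can be illustrated as follows. The record shape used here (a list of dicts with `name`, `severity`, and `passed` keys) is purely an assumption for illustration; the actual structure of `gate_results` may differ, and `apply_gates` is a hypothetical function, not the framework's scorer.

```python
from typing import Dict, List, Tuple


def apply_gates(component_total: float, gate_results: List[Dict]) -> Tuple[float, List[str]]:
    """Return (final_score, warnings) under the gate rules described above.

    Assumed (hypothetical) gate record: {"name": str, "severity": "hard" or
    "soft", "passed": bool}.
    """
    failed = [g for g in gate_results if not g["passed"]]
    # Any failed hard gate zeroes the score, whatever the components earned.
    if any(g["severity"] == "hard" for g in failed):
        return 0.0, []
    # Failed soft gates are reported but do not change the score.
    warnings = [g["name"] for g in failed if g["severity"] == "soft"]
    return component_total, warnings
```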

Provenance fields

Every result records exactly how it was produced:

  • requested_mode / effective_mode — what sandbox was asked for vs. what ran
  • enforcement_status — whether isolation was actually enforced
  • verification_status — post-run checks (e.g., no network access observed)
  • Agent config, model name, CLI version, seed, and timeout

This enables reproducibility and supports the official review process.
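
One quick provenance check is to flag instances where the sandbox that ran differs from the one requested. The sketch below compares only `requested_mode` and `effective_mode`; the helper name is hypothetical, and the mode value `"none"` in the usage note is an illustrative placeholder, not a documented mode.

```python
from typing import Dict, Iterable, List, Tuple


def find_mode_mismatches(results: Iterable[Tuple[str, Dict]]) -> List[str]:
    """Return instance ids whose effective_mode differs from requested_mode.

    `results` is an iterable of (instance_id, result_dict) pairs.
    """
    return [
        instance_id
        for instance_id, result in results
        if result.get("requested_mode") != result.get("effective_mode")
    ]
```

For instance, a result with `requested_mode` set to `"task"` and `effective_mode` set to `"none"` would be listed, signalling that the run should be inspected before its score is trusted.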

Next steps