Anonymous Deployment Prediction Set Audit

  • Task ID: computer_science.deployment_prediction_sets
  • Domain: computer_science
  • Subdomain: machine_learning_reliability
  • Status: test
  • Tags: uncertainty_quantification, prediction_sets, conformal_prediction, deployment_shift, subgroup_reliability, model_audit, calibration

Public Summary

This page is generated from task metadata and selected public-safe excerpts.

Example B1 Prompt Excerpt

You are given a black-box multiclass classifier, a labeled calibration split, and an unlabeled deployment split. Build risk-controlled prediction sets for the deployment records.
Use a conformal-risk-control style procedure with proxy-stratum calibration:
1. Read `data/metadata.json` and `data/risk_spec.json`.
2. Load calibration and deployment probability tables, labels, and anonymous feature tables.
3. Discover recoverable proxy strata from the anonymous feature vectors. A good default is to standardize the joint calibration+deployment feature matrix and try 4 to 6 deterministic clusters, selecting the partition with stable calibration sizes and visible deployment shift.
4. For each proxy stratum, use only calibration records assigned to that stratum. A pooled stratum cutoff is a useful baseline, but a stronger policy should tune separate label-specific cutoffs within each proxy stratum.
5. The risk is label-weighted miss loss: the loss is zero when the true label is in the set and is the label cost from `risk_spec.json` otherwise.
6. Choose cutoffs so that, within each recovered stratum, the empirical weighted miss risk on the calibration records plus a finite-sample buffer stays below the target risk. A one-sided Hoeffding-style buffer for the bounded weighted miss loss is sufficient. When using label-specific cutoffs, sum the label-level risk contributions within the stratum.
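
Steps 5 and 6 can be sketched for a single stratum with one pooled cutoff, the baseline mentioned in step 4. This is a minimal illustration under stated assumptions, not the task's reference solution: the function names, the `delta` confidence parameter, the representation of `costs` as a list indexed by label, and the rule "include every label whose probability is at least the cutoff" are all hypothetical choices made here. The layout of `risk_spec.json` is not shown in the excerpt, so loading it is left out.

```python
import math


def hoeffding_buffer(n, max_loss, delta):
    """One-sided Hoeffding buffer for the mean of n losses bounded in [0, max_loss]."""
    return max_loss * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def prediction_set(probs_row, cutoff):
    """Labels whose predicted probability is at least the cutoff (assumed set rule)."""
    return [k for k, p in enumerate(probs_row) if p >= cutoff]


def weighted_miss_risk(probs, labels, costs, cutoff):
    """Mean label-weighted miss loss: costs[y] if the true label y is excluded, else 0."""
    total = 0.0
    for row, y in zip(probs, labels):
        if row[y] < cutoff:  # true label falls outside the prediction set
            total += costs[y]
    return total / len(labels)


def choose_cutoff(probs, labels, costs, target_risk, delta=0.1):
    """Largest candidate cutoff with empirical risk + buffer <= target_risk.

    Falls back to 0.0 (every label included) when no candidate qualifies,
    for example when the stratum is too small for the buffer to fit.
    """
    buffer = hoeffding_buffer(len(labels), max(costs), delta)
    # Candidates: 0.0 plus each calibration record's true-label probability.
    candidates = sorted({0.0} | {row[y] for row, y in zip(probs, labels)})
    best = 0.0
    for c in candidates:
        if weighted_miss_risk(probs, labels, costs, c) + buffer <= target_risk:
            best = c
        else:
            break  # risk is non-decreasing in the cutoff, so later ones fail too
    return best
```

In this sketch, `choose_cutoff` would be called once per proxy stratum on that stratum's calibration records only, and `prediction_set` would then be applied to each deployment record using its stratum's cutoff. The label-specific variant from step 4 would replace the single cutoff with one cutoff per label and sum the per-label risk contributions before adding the buffer.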

Notes

  • This page is a generated site artifact.
  • Higher-level prompt details and internal benchmark specifics may remain intentionally undisclosed.