Skip to content

Evaluation Decisions

Use this guide when RecSys evaluation output needs to become a ship, hold, or rollback decision.

Decision order

Run the gates in this order. Do not interpret KPI movement until data integrity is credible.

Step Gate Default decision when it fails
1 Schemas validate for exposures, outcomes, assignments, and reports. Hold. Fix instrumentation or dataset export.
2 Join integrity is acceptable for the surfaces and slices being evaluated. Hold. Metrics are not trustworthy.
3 Guardrails hold: errors, latency, empty recommendations, and warning rates do not regress materially. Roll back or hold with a time-boxed mitigation.
4 Primary KPI improves enough to matter and is stable across key slices. Hold if inconclusive; roll back if meaningfully negative.
5 Rollback path is ready before rollout. Hold. Do not ship a change that cannot be reversed.

Baseline thresholds

These are starting points, not universal targets. Tune them to the product, sample size, and business risk.

Signal Starting point
Join rate Aim for at least 95% on the primary analysis slices before trusting KPI movement.
Error rate No worse than 0.1-0.5 percentage points absolute unless explicitly accepted.
Latency p95 no worse than 10-20% relative unless capacity testing supports it.
Empty recommendations No worse than 0.2-1.0 percentage points absolute.
Primary KPI Ship only when the improvement clears the pre-agreed minimum effect and guardrails hold.

Common decisions

Finding Decision Next action
KPI improved, join rate is low Hold Fix request_id, tenant, surface, or assignment joins before interpreting results.
KPI improved, guardrails regressed Roll back by default Reduce blast radius only if a safe mitigation is already known.
KPI is neutral and guardrails hold Hold Continue the experiment or close as inconclusive.
KPI regressed and guardrails hold Roll back Check slices only to explain the regression, not to excuse it.
Offline gate fails in CI Hold Fix the regression or update the baseline only after review.

Evidence to keep

  • Input dataset names, schema versions, and generation time.
  • Report output path and report hash when available.
  • Primary KPI, guardrails, and join-rate summary.
  • Slices reviewed and any excluded slices.
  • Config, rules, algorithm, artifact, and manifest versions involved.
  • Rollback lever chosen, if a rollback happened.

Validation commands

Run the local proof path when checking the repository fixture:

make proof-kit-test

Expected result: the command prints commercial proof kit smoke passed.

For custom datasets, run the relevant recsys-eval schema validation and report commands from checked-in configs under recsys-eval/configs/eval/.