
Interpreting results: how to go from report to decision

This page explains how to go from an evaluation report to a ship / hold / rollback decision, and where that decision step fits into the RecSys suite.

Who this is for

Anyone making ship/hold decisions (engineers, PMs, analysts).

What you will get

  • How to read a report
  • How to decide "ship / hold / rollback" without fooling yourself
  • What to do when results are unclear

Step 0: Trust the data before trusting the metrics

Check:

  • data_quality: missing fields, duplicates, anomalies
  • join integrity: match rates, unexpected drops
  • warnings: especially for OPE

If any of these look bad, stop and fix logging first; metrics computed on broken data are noise. This pre-flight check can be automated, as in the sketch below.
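
A minimal pre-flight sketch, assuming the report loads as a dict with data_quality, joins, and warnings keys. The field names and thresholds here are illustrative assumptions, not the exact report schema:

```python
# Sketch: pre-flight data checks before reading any metrics.
# Field names and thresholds are illustrative; adapt to your schema.

def data_is_trustworthy(report: dict) -> bool:
    dq = report.get("data_quality", {})
    if dq.get("missing_field_rate", 0.0) > 0.01:   # illustrative threshold
        return False
    if dq.get("duplicate_rate", 0.0) > 0.001:
        return False
    joins = report.get("joins", {})
    if joins.get("match_rate", 1.0) < 0.95:        # unexpected drops
        return False
    # OPE warnings deserve a hard stop, not a shrug.
    if any("OPE" in w for w in report.get("warnings", [])):
        return False
    return True
```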

Step 1: Start with the summary

The report includes a summary for quick scanning:

  • mode
  • main deltas (baseline vs candidate or control vs candidate)
  • whether gates passed

If the summary says "inconclusive", treat it as a real outcome.
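
For orientation, a summary might be shaped roughly like the dict below. The field names and values are illustrative, not the exact schema:

```python
# Illustrative summary shape -- not the exact report schema.
summary = {
    "mode": "ab",                # e.g. "offline", "ab", "interleaving"
    "deltas": {"ctr": +0.012},   # control vs candidate
    "gates_passed": True,
    "verdict": "inconclusive",
}

# "inconclusive" is a terminal outcome for this run, not an error state.
if summary["verdict"] == "inconclusive":
    next_action = "HOLD"         # see the decision policy in Step 5
```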

Step 2: Check guardrails

Even if the primary metric improves, do not ship if:

  • empty rate regressed
  • latency regressed outside budget
  • error rate regressed
  • a critical segment cliff appears

Guardrails exist because "winning slowly" is losing.
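
A guardrail check can be mechanical. This is a minimal sketch; the metric names, budgets, and sign convention (positive delta = regression) are assumptions to adapt:

```python
# Sketch: mechanical guardrail veto. Metric names, budgets, and the
# sign convention (positive delta = regression) are assumptions.

GUARDRAIL_BUDGETS = {
    "empty_rate": 0.0,        # no regression tolerated
    "p99_latency_ms": 5.0,    # up to +5 ms allowed
    "error_rate": 0.0,
}

def guardrails_hold(deltas: dict) -> bool:
    """Return False if any guardrail regressed beyond its budget."""
    return all(
        deltas.get(metric, 0.0) <= budget
        for metric, budget in GUARDRAIL_BUDGETS.items()
    )
```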

Step 3: Look at segments as a radar, not a scoreboard

Segments answer:

  • Who did this help?
  • Who did this hurt?
  • Is the impact consistent?

Slicing into many segments also inflates false positives (the multiple-comparisons problem). Use segments as diagnostics unless you have the statistical power to claim a segment-level win, as in the sketch below.
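
One simple defense is a multiple-comparisons correction over the segment slices. The sketch below uses Bonferroni, which is conservative but easy to reason about; the per-segment p-values are assumed to come from whatever test you already run:

```python
# Sketch: Bonferroni correction over segment slices to guard against
# false positives from heavy slicing.

def significant_segments(p_values: dict, alpha: float = 0.05) -> list:
    """Return segments that remain significant after Bonferroni correction."""
    corrected_alpha = alpha / max(len(p_values), 1)
    return [seg for seg, p in p_values.items() if p < corrected_alpha]
```

With ten slices at alpha = 0.05, only segments with p < 0.005 survive, which is the point: most "segment wins" found by heavy slicing do not.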

Step 4: Interpret uncertainty

If you use confidence intervals or p-values:

  • wide intervals mean you do not know yet
  • small p-values can still happen by chance if you test many metrics or segments at once

"Inconclusive" is not failure. It is a request for more data or a better experiment design.

Step 5: A simple decision policy you can adopt

Suggested policy:

  • SHIP: primary metric improves, guardrails hold, and there are no major segment regressions.
  • HOLD: results are inconclusive, or diagnostics warn about data quality.
  • ROLLBACK: the primary metric regresses, a guardrail regresses, or a major segment cliff appears.

This maps well to a decision artifact (api/schemas/decision.v1.json).
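
The policy is simple enough to encode directly. A sketch follows; the output is shaped to resemble a decision artifact, but the exact fields of decision.v1.json are not reproduced here, so treat them as illustrative:

```python
# Sketch of the SHIP / HOLD / ROLLBACK policy above. Output fields are
# illustrative, not the exact decision.v1.json schema.

def decide(primary_improved: bool,
           primary_regressed: bool,
           guardrails_hold: bool,
           data_quality_ok: bool,
           major_segment_cliff: bool) -> dict:
    if primary_regressed or not guardrails_hold or major_segment_cliff:
        decision = "ROLLBACK"
    elif primary_improved and data_quality_ok:
        decision = "SHIP"
    else:
        decision = "HOLD"   # inconclusive results or data-quality warnings
    return {"decision": decision, "schema": "decision.v1"}
```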

What to do when it is unclear

Choose one:

  • run longer / collect more samples
  • reduce variance (CUPED / better covariates; see the sketch after this list)
  • narrow the change (smaller delta)
  • use interleaving for ranker comparison
  • do offline gating first, then A/B
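
For the variance-reduction option, CUPED is often the cheapest win. A minimal sketch, assuming x is a pre-experiment covariate (for example, each user's pre-period value of the same metric):

```python
# Sketch: CUPED variance reduction with a pre-experiment covariate.

import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return the CUPED-adjusted metric y' = y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x). The adjusted series keeps the same
    mean as y but has lower variance when x correlates with y."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())
```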

Common pitfalls

  • Confusing "statistically significant" with "practically important".
  • Shipping a win that is isolated to one surface while it quietly breaks another.
  • Ignoring SRM warnings in experiments (a quick check is sketched below).
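
For the SRM pitfall, the check itself is cheap. A sketch using a chi-square test (requires scipy); the 50/50 expected split is an assumption to match to your design:

```python
# Sketch: sample-ratio-mismatch (SRM) check via a chi-square test.

from scipy.stats import chisquare

def srm_pvalue(n_control: int, n_treatment: int,
               expected_ratio: float = 0.5) -> float:
    """p-value for observed assignment counts vs the designed split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    return chisquare([n_control, n_treatment], f_exp=expected).pvalue

# A very small p-value (e.g. < 0.001) means assignment is broken:
# do not interpret the metrics; fix the experiment first.
```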

Treat the report as a navigation tool, not a trophy.