Interpreting results: how to go from report to decision¶
This page explains how to read a report, turn it into a ship/hold/rollback decision, and how that step fits into the RecSys suite.
Who this is for¶
Anyone making ship/hold decisions (engineers, PMs, analysts).
What you will get¶
- How to read a report
- How to decide "ship / hold / rollback" without fooling yourself
- What to do when results are unclear
Step 0: Trust the data before trusting the metrics¶
Check:
- data_quality: missing fields, duplicates, anomalies
- join integrity: match rates, unexpected drops
- warnings: especially for off-policy evaluation (OPE)
If these look bad, stop. Fix logging.
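The check above can be sketched as a small pre-flight function. This is only an illustration: the report layout, the key names (`data_quality`, `join`, `warnings`, `missing_rate`, `duplicate_rate`, `match_rate`), and the thresholds are all assumptions, not the actual report schema.

```python
def data_trust_issues(report, min_match_rate=0.98, max_duplicate_rate=0.01,
                      max_missing_rate=0.01):
    """Return reasons not to trust the report (an empty list means proceed).

    Every key name and threshold here is a hypothetical placeholder.
    A missing key is treated as "no problem reported"; a stricter
    version could flag missing sections instead.
    """
    issues = []
    dq = report.get("data_quality", {})
    join = report.get("join", {})

    if dq.get("missing_rate", 0.0) > max_missing_rate:
        issues.append("missing fields above threshold")
    if dq.get("duplicate_rate", 0.0) > max_duplicate_rate:
        issues.append("duplicate rate above threshold")
    if join.get("match_rate", 1.0) < min_match_rate:
        issues.append("join match rate below threshold")
    # Surface every warning verbatim so OPE warnings are not overlooked.
    issues.extend(f"warning: {w}" for w in report.get("warnings", []))
    return issues


example = {
    "data_quality": {"missing_rate": 0.002, "duplicate_rate": 0.03},
    "join": {"match_rate": 0.91},
    "warnings": ["ope: low effective sample size"],
}
for issue in data_trust_issues(example):
    print(issue)
```

If the returned list is non-empty, the metrics in the rest of the report should not be read yet.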
Step 1: Start with the summary¶
The report includes a summary for quick scanning:
- mode
- main deltas (baseline vs candidate or control vs candidate)
- whether gates passed
If the summary says "inconclusive", treat it as a real outcome.
Step 2: Check guardrails¶
Even if the primary metric improves, do not ship if:
- empty rate regressed
- latency regressed outside budget
- error rate regressed
- a critical segment cliff appears
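Guardrails exist because "winning slowly" is losing: a gain on the primary metric that comes with worse latency, more errors, or more empty results is not a win.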
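A guardrail check can be as simple as comparing each regression against a budget. The sketch below assumes deltas are expressed so that a positive value means "worse"; the metric names and budget values are made up for illustration.

```python
def guardrail_failures(deltas, budgets):
    """Return the guardrails whose regression exceeds its budget.

    deltas:  candidate minus baseline, oriented so positive means "worse"
    budgets: largest tolerated regression per guardrail (default 0.0)
    """
    return [name for name, delta in deltas.items()
            if delta > budgets.get(name, 0.0)]


# Hypothetical values: latency regressed by 12 ms against a 5 ms budget.
deltas = {"empty_rate": 0.0005, "latency_p95_ms": 12.0, "error_rate": -0.0001}
budgets = {"empty_rate": 0.001, "latency_p95_ms": 5.0, "error_rate": 0.0}
failed = guardrail_failures(deltas, budgets)
print(failed)  # ['latency_p95_ms']
```

Any non-empty result blocks shipping, regardless of how the primary metric moved.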
Step 3: Look at segments as a radar, not a scoreboard¶
Segments answer:
- Who did this help?
- Who did this hurt?
- Is the impact consistent?
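Segments can also create false positives when you slice too much: the more segments you test, the more likely one of them looks significant by chance. Use segments as diagnostics unless the experiment has enough statistical power to support segment-level claims.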
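To see why heavy slicing is risky, the sketch below computes the chance of at least one false positive across independent tests and applies a Bonferroni correction, one common and conservative way to account for multiple comparisons. The segment names and p-values are invented for illustration.

```python
def familywise_false_positive_rate(num_segments, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_segments


def bonferroni_significant(p_values, alpha=0.05):
    """Segments whose p-value clears the Bonferroni-corrected threshold."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [segment for segment, p in p_values.items() if p < threshold]


# With 20 independent segment tests, a spurious "win" is more likely than not.
print(round(familywise_false_positive_rate(20), 2))  # 0.64

# Hypothetical segment p-values; only "mobile" survives the correction.
p_values = {"new_users": 0.03, "mobile": 0.001, "desktop": 0.40, "returning": 0.20}
print(bonferroni_significant(p_values))  # ['mobile']
```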
Step 4: Interpreting uncertainty¶
If you use confidence intervals or p-values:
- wide intervals mean you do not know yet
- small p-values can still happen by chance if you test too many things
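As a rough illustration of reading an interval, the sketch below builds a normal-approximation confidence interval for the difference between two means and labels it. It assumes a higher-is-better metric and that standard errors are already available; the numbers are invented.

```python
from statistics import NormalDist


def diff_interval(mean_control, se_control, mean_candidate, se_candidate,
                  confidence=0.95):
    """Normal-approximation interval for (candidate - control)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    delta = mean_candidate - mean_control
    se = (se_control ** 2 + se_candidate ** 2) ** 0.5
    return delta - z * se, delta + z * se


def classify_interval(lower, upper):
    """Label an interval for a delta where higher is better."""
    if lower > 0.0:
        return "improvement"
    if upper < 0.0:
        return "regression"
    return "inconclusive"  # the interval still contains zero


# Same observed lift, different amounts of data (hypothetical numbers).
wide = diff_interval(0.100, 0.002, 0.103, 0.002)
narrow = diff_interval(0.100, 0.0005, 0.103, 0.0005)
print(classify_interval(*wide))    # inconclusive
print(classify_interval(*narrow))  # improvement
```

The observed delta is identical in both cases; only the width of the interval changes the conclusion.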
"Inconclusive" is not failure. It is a request for more data or a better experiment design.
Step 5: A simple decision policy you can adopt¶
Suggested policy:
- SHIP: primary metric improves, guardrails hold, and there are no major segment regressions
- HOLD: results are inconclusive, or diagnostics warn about data quality
- ROLLBACK: primary metric regresses, a guardrail regresses, or a major segment cliff appears
This maps well to a decision artifact (api/schemas/decision.v1.json).
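The suggested policy can be written down as a small function. This is a sketch under assumptions: the `Evidence` fields are hypothetical and do not mirror the fields of `decision.v1.json`, and the precedence (data quality first, then rollback conditions, then ship) is one reasonable reading of the policy rather than a documented rule.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    HOLD = "hold"
    ROLLBACK = "rollback"


@dataclass
class Evidence:
    # Hypothetical inputs summarising the report; not the real schema.
    data_quality_ok: bool
    primary: str          # "improvement", "regression", or "inconclusive"
    guardrails_ok: bool
    segment_cliff: bool


def decide(e: Evidence) -> Decision:
    # Untrusted data cannot justify any conclusion, so hold first.
    if not e.data_quality_ok:
        return Decision.HOLD
    if e.primary == "regression" or not e.guardrails_ok or e.segment_cliff:
        return Decision.ROLLBACK
    if e.primary == "improvement":
        return Decision.SHIP
    # Inconclusive primary metric with nothing regressing.
    return Decision.HOLD


print(decide(Evidence(True, "improvement", True, False)).value)  # ship
```

Encoding the policy this way makes the precedence explicit, which is where most disagreements about a decision tend to hide.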
What to do when it is unclear¶
Choose one:
- run longer / collect more samples
- reduce variance (CUPED / better covariates)
- narrow the change (smaller delta)
- use interleaving for ranker comparison
- do offline gating first, then A/B
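For the variance-reduction option, a minimal CUPED sketch looks like this. It adjusts each user's in-experiment metric using a pre-experiment covariate, with the coefficient estimated from the data passed in (in practice this is usually the pooled data from both arms). The sample values are invented.

```python
from statistics import mean, variance


def cuped_adjust(metric, covariate):
    """CUPED adjustment: y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    if len(metric) != len(covariate):
        raise ValueError("metric and covariate must have the same length")
    n = len(metric)
    x_bar, y_bar = mean(covariate), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(covariate, metric)) / (n - 1)
    theta = cov_xy / variance(covariate)
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Hypothetical per-user values: pre-period activity predicts the in-experiment metric.
pre_period = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
in_experiment = [1.2, 2.1, 2.9, 4.3, 5.1, 5.8]
adjusted = cuped_adjust(in_experiment, pre_period)

# The mean is unchanged while the variance shrinks.
print(mean(in_experiment), mean(adjusted))
print(variance(in_experiment), variance(adjusted))
```

The stronger the correlation between the covariate and the metric, the larger the variance reduction, and the sooner an inconclusive result can resolve.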
Common pitfalls¶
- Confusing "statistically significant" with "practically important".
- Shipping a win that is isolated to a single surface and breaks another.
- Ignoring sample ratio mismatch (SRM) warnings in experiments: if traffic did not split as designed, the comparison itself is suspect.
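An SRM check is typically a chi-square goodness-of-fit test on the assignment counts. The sketch below uses only the standard library for the two-arm case; the counts are invented, and the 0.001 alert threshold is a commonly used convention rather than a value taken from this suite.

```python
import math


def srm_p_value(n_control, n_candidate, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_candidate
    expected_control = total * expected_control_share
    expected_candidate = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_candidate - expected_candidate) ** 2 / expected_candidate)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


SRM_ALERT_THRESHOLD = 0.001  # assumed convention, tune to your own tolerance

for counts in [(50_000, 50_100), (50_000, 51_200)]:
    p = srm_p_value(*counts)
    status = "SRM suspected" if p < SRM_ALERT_THRESHOLD else "ok"
    print(counts, status)
```

A suspected SRM points to a problem in assignment or logging, so it should be investigated before any metric delta is interpreted.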
Treat the report as a navigation tool, not a trophy.
Read next¶
- Interpretation cheat sheet (recsys-eval)
- Metrics: what we measure and why
- Runbooks: operating recsys-eval
- Workflow: Online A/B analysis in production