Interpreting metrics and reports¶
This page gives a practical mental model for turning an evaluation report into a ship/hold/rollback decision.
Canonical reading order
This page is an orientation layer; the detailed metric definitions live in the recsys-eval docs.
What a report is (and is not)¶
A RecSys evaluation report is:
- a decision artifact (shareable)
- a reproducible record (inputs + versions)
- a compact summary of multiple metrics and guardrails
It is not:
- a guarantee of online lift
- a substitute for instrumentation hygiene (joinability)
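Joinability is cheap to sanity-check before reading any metric. Below is a minimal sketch: the field name request_id comes from this page, but the record shape and the idea of expressing health as a single join rate are assumptions, not part of any report schema.

```python
def join_rate(exposures, outcomes):
    """Fraction of exposure request_ids that have at least one joined outcome."""
    exposed = {e["request_id"] for e in exposures}
    if not exposed:
        return 0.0
    return len({o["request_id"] for o in outcomes} & exposed) / len(exposed)

rate = join_rate(
    [{"request_id": "r1"}, {"request_id": "r2"}],
    [{"request_id": "r1"}],
)
# rate == 0.5
```

If the rate is far below what your pipeline normally produces, stop: the report is not interpretable until the join is fixed.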
A 5-minute interpretation flow¶
1. Verify the evaluation is valid
    - Is the population/window what you expected?
    - Are exposures and outcomes joined on a stable request_id?

    Start here:

    - Evaluation validity: Evaluation validity
    - Join logic: Join logic

2. Check guardrails first
    - Did any hard guardrail regress beyond tolerance?
    - If yes, decide "hold" even if the primary metric improves.

3. Read the primary metric in context
    - Compare relative deltas, not just absolute ones.
    - Look for segment-specific regressions (new users, cold-start surfaces, long-tail items).

4. Identify tradeoffs and risks
    - Are you trading diversity for short-term clicks?
    - Are you increasing concentration on a few items?

5. Write the decision and follow-ups
    - Ship / hold / rollback
    - 1–5 bullets explaining why
    - The next experiment or mitigation
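The guardrails-first ordering and the relative-delta reading (steps 2, 3, and 5 above) can be sketched as a small decision function. Everything here is illustrative: the report dict shape, the metric names, and the 1% tolerance are assumptions rather than the actual report format; the point is that guardrails are checked before the primary metric, and that deltas are read relative to control.

```python
def decide(report, tolerance=0.01):
    """Return a ("ship"|"hold", reasons) pair from a hypothetical report dict."""
    reasons = []

    # Step 2: any hard guardrail regressing beyond tolerance forces a hold,
    # even if the primary metric improves.
    for name, (control, treatment) in sorted(report["guardrails"].items()):
        rel = (treatment - control) / control
        if rel < -tolerance:
            reasons.append(f"guardrail {name} regressed {rel:+.1%}")
    if reasons:
        return "hold", reasons

    # Step 3: read the primary metric as a relative delta, not an absolute one.
    control, treatment = report["primary"]
    rel = (treatment - control) / control
    reasons.append(f"primary metric moved {rel:+.1%}")

    # Step 5: the decision travels with its reasons, ready to paste into the report.
    return ("ship" if rel > 0 else "hold"), reasons

report = {
    "guardrails": {"diversity@10": (0.42, 0.43)},
    "primary": (0.100, 0.104),
}
decision, why = decide(report)
# decision == "ship"; why == ["primary metric moved +4.0%"]
```

Note what this sketch deliberately leaves out: statistical significance, segment breakdowns, and the softer tradeoff questions in step 4 still need a human read of the full report.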
Where detailed metric definitions live¶
- Metric definitions and theory: Metrics
- Interpreting results (detailed): Interpreting results
- Interpretation cheat sheet (fast): Interpretation cheat sheet
- Decision playbook: Decision playbook
Read next¶
- Run eval and make ship decisions: Run eval and ship
- Evaluation reasoning and pitfalls: Evaluation reasoning and pitfalls
- Evidence (what outputs look like): Evidence