Interpreting metrics and reports

This page gives a practical mental model for turning an evaluation report into a ship/hold/rollback decision.

Canonical reading order

This page is an orientation layer. The detailed metric definitions live in the recsys-eval docs.

What a report is (and is not)

A RecSys evaluation report is:

  • a decision artifact (shareable)
  • a reproducible record (inputs + versions)
  • a compact summary of multiple metrics and guardrails

It is not:

  • a guarantee of online lift
  • a substitute for instrumentation hygiene (joinability)

A 5-minute interpretation flow

  1. Verify the evaluation is valid

     • Is the population/window what you expected?
     • Are exposures and outcomes joined by a stable request_id?

     Start here:

     • Evaluation validity
     • Join logic
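The join check above can be sketched in a few lines. This is a minimal illustration, not the recsys-eval implementation; the record shape (dicts with a `request_id` key) is a hypothetical stand-in for your exposure and outcome logs.

```python
# Sketch of the join-validity check: exposures and outcomes must share a
# stable request_id. A low join rate invalidates the rest of the report.

def join_rate(exposures, outcomes):
    """Fraction of exposure request_ids with at least one outcome row."""
    exposure_ids = {e["request_id"] for e in exposures}
    outcome_ids = {o["request_id"] for o in outcomes}
    if not exposure_ids:
        return 0.0
    return len(exposure_ids & outcome_ids) / len(exposure_ids)

exposures = [{"request_id": "r1"}, {"request_id": "r2"}, {"request_id": "r3"}]
outcomes = [{"request_id": "r1"}, {"request_id": "r3"}]
print(join_rate(exposures, outcomes))  # 2 of 3 exposures joined
```

In practice you would run this per day and per surface; a sudden drop in join rate usually means an instrumentation change, not a real behavior change.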

  2. Check guardrails first

     • Did any hard guardrail regress beyond tolerance?
     • If yes, decide "hold" even if the primary metric improves.

  3. Read the primary metric in context

     • Compare relative deltas, not just absolute ones.
     • Look for segment-specific regressions (new users, cold-start surfaces, long-tail items).
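Reading the primary metric as a relative delta, overall and per segment, can be sketched as below. The metric values and segment names are made-up illustrations.

```python
# Sketch of step 3: relative deltas per segment. An overall lift can
# hide a regression in a sensitive segment such as new users.

def relative_delta(control, treatment):
    return (treatment - control) / control

ndcg = {  # hypothetical primary metric: (control, treatment) per segment
    "overall":   (0.200, 0.206),
    "new_users": (0.120, 0.112),
    "long_tail": (0.080, 0.081),
}

for segment, (c, t) in ndcg.items():
    print(f"{segment}: {relative_delta(c, t):+.1%}")
```

Here the overall lift is positive while new users regress, which is exactly the pattern a segment breakdown exists to catch.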

  4. Identify tradeoffs and risks

     • Are you trading diversity for short-term clicks?
     • Are you increasing concentration on a few items?
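One simple way to quantify the concentration question is the share of impressions absorbed by the top-k items. This is a minimal sketch with made-up counts, not a recsys-eval metric definition.

```python
# Sketch of a concentration check: if a few items absorb most
# impressions after the change, short-term clicks may be trading
# against catalog diversity.

from collections import Counter

def top_k_share(item_impressions, k=3):
    """Share of all impressions going to the k most-shown items."""
    counts = Counter(item_impressions)
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

impressions = ["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 5 + ["e"] * 5
print(top_k_share(impressions))  # top-3 items take 90% of impressions
```

Comparing this value between control and treatment is more informative than reading it in isolation.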

  5. Write the decision and follow-ups

     • Ship / hold / rollback
     • 1–5 bullets explaining why
     • The next experiment or mitigation
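The whole flow can be condensed into a small decision rule. The `min_lift` threshold and the rollback condition below are illustrative assumptions; the point is the ordering, with guardrails gating first.

```python
# Sketch tying the flow together: guardrails veto first, then the
# primary metric's relative delta drives ship/hold/rollback.

def decide(primary_rel_delta, guardrail_violated, min_lift=0.01):
    if guardrail_violated:
        return "hold"      # a hard guardrail regression vetoes a ship
    if primary_rel_delta >= min_lift:
        return "ship"
    if primary_rel_delta < 0:
        return "rollback"  # worse than control, no guardrail excuse
    return "hold"          # neutral result: keep iterating

print(decide(0.03, guardrail_violated=False))  # ship
print(decide(0.03, guardrail_violated=True))   # hold
```

The code only outputs the verdict; the written record (the "why" bullets and the next experiment) is still what makes the report a decision artifact.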

Where detailed metric definitions live