Evaluation reasoning and pitfalls

This page explains how to interpret RecSys evaluation results without fooling yourself.

Who this is for

  • Recommendation engineers and analysts
  • Leads approving ship/hold/rollback decisions
  • Stakeholders who want to understand why a metric changed

What you will get

  • A checklist for sanity-checking offline reports
  • The most common pitfalls that make metrics lie
  • How to connect evaluation output to an operational decision

Mental model

Evaluation answers a narrow question:

"Given the data we logged, do we have evidence that this change helps the user and the business, without breaking safety/guardrail constraints?"

Your confidence depends on:

  • data quality (joins, missingness, bias)
  • methodological fit (offline vs online vs OPE)
  • statistical robustness (variance, multiple testing)
  • operational realism (does production match the evaluation assumptions?)

The minimum sanity checks

Before you trust any number, confirm these:

  1. Join rate is stable: exposures join to outcomes by request_id at a stable rate over time.
  2. Slice coverage is stable: tenant_id, surface, and your main segmentation keys exist for the same share of events.
  3. Traffic mix is comparable: user cohorts and surfaces are not silently changing between runs.
  4. Guardrails are present: at least one guardrail metric is reported (e.g., latency, churn proxy, diversity constraint).
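Check 1 can be sketched in a few lines. This is a minimal illustration, assuming exposure and outcome events are dicts carrying a request_id (and a day key on exposures); the field names are hypothetical, not a prescribed schema:

```python
from collections import Counter

def join_rate_by_day(exposures, outcomes):
    """Fraction of exposures whose request_id appears in outcomes, per day."""
    outcome_ids = {o["request_id"] for o in outcomes}
    total, joined = Counter(), Counter()
    for e in exposures:
        total[e["day"]] += 1
        joined[e["day"]] += e["request_id"] in outcome_ids
    return {day: joined[day] / total[day] for day in total}
```

A sudden drop in any day's rate usually means broken logging, not a real behavior change, so alert on the rate before trusting the metrics built on the join.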

See also: Event join logic (exposures ↔ outcomes ↔ assignments)

Common pitfalls

1) Clicks without exposures

If you log outcomes (clicks, conversions) but not exposures, you cannot attribute them to what was actually shown, and non-events (impressions with no click) vanish from the denominator.

Symptom: metrics swing wildly or look implausibly good.

Fix: implement exposure logging first.
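What "exposure logging first" means in practice is writing one event at serving time, keyed by the same request_id that outcome events will carry. A hypothetical minimal exposure event (field names are illustrative):

```python
# Logged at serving time, before any outcome exists.
exposure_event = {
    "request_id": "req-123",           # join key shared with outcome events
    "ts": "2024-05-01T12:00:00Z",
    "surface": "home",
    "item_ids": ["item-1", "item-2", "item-3"],  # what was actually shown
    "positions": [1, 2, 3],                      # where it was shown
}
```

With this in place, CTR-style metrics become (outcomes joined to exposures) / (exposures), instead of a count of outcomes divided by a guess.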

2) Simpson’s paradox (your aggregate lies)

Overall lift can hide regressions in key segments.

Fix: require a small slice set (surface, tenant, segment) in every report and treat big slices as first-class.
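The paradox is easy to reproduce with toy numbers. In the made-up data below, both surfaces regress under treatment, yet the aggregate CTR rises, purely because treatment shifted traffic toward the higher-CTR surface:

```python
def ctr(rows):
    """Pooled click-through rate over a list of {imps, clicks} rows."""
    return sum(r["clicks"] for r in rows) / sum(r["imps"] for r in rows)

control = [
    {"surface": "home",   "imps": 100,  "clicks": 10},  # CTR 0.10
    {"surface": "search", "imps": 1000, "clicks": 20},  # CTR 0.02
]
treatment = [
    {"surface": "home",   "imps": 1000, "clicks": 90},  # CTR 0.09 (worse)
    {"surface": "search", "imps": 100,  "clicks": 1},   # CTR 0.01 (worse)
]
# Aggregate: control ~0.027, treatment ~0.083. The "lift" is pure mix shift.
```

This is why per-slice deltas belong in every report: the aggregate alone cannot distinguish a real improvement from a traffic-mix change.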

3) Leakage and look-ahead

Offline evaluation can accidentally use information that was not available at serving time.

Fix: enforce time windows and event ordering in pipelines; document and test it.
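One concrete form of that enforcement is "as-of" feature lookup: only feature values timestamped strictly before the request may be used. A minimal sketch, assuming a feature log of {ts, value} dicts (a hypothetical schema):

```python
def features_as_of(feature_log, request_ts):
    """Latest feature value logged strictly before request_ts, else None.

    Using anything with ts >= request_ts would be look-ahead leakage:
    the model would see information unavailable at serving time.
    """
    eligible = [f for f in feature_log if f["ts"] < request_ts]
    if not eligible:
        return None
    return max(eligible, key=lambda f: f["ts"])["value"]
```

A good pipeline test asserts this property on every feature source, not just the ones you suspect.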

4) Non-stationarity

User intent, inventory, and seasonality shift.

Fix: keep rolling baselines; treat changes as local decisions; prefer online tests for high-impact changes.
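A rolling baseline can be as simple as a windowed mean that drifts with the metric. A sketch with a hypothetical 7-point window:

```python
from collections import deque

def rolling_baseline(values, window=7):
    """Rolling mean over the last `window` points; compare today's value
    to this drifting baseline instead of to a fixed historical number."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

Comparing against a drifting baseline separates "the world changed" from "the change changed something", which is exactly the distinction non-stationarity blurs.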

5) Metric gaming

A change can improve a proxy metric while harming the actual user outcome.

Fix: pair every primary KPI with at least one guardrail. Prefer metrics that measure user value directly.

Choosing evaluation mode

  • Offline regression: fastest and cheapest; good for "did we break anything" gates.
  • Online A/B: highest credibility for product outcomes; requires experiment controls.
  • OPE / counterfactual: useful when A/B is expensive; sensitive to modeling assumptions.
  • Interleaving: efficient comparisons for ranking changes; still needs careful logging.
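To make the OPE bullet concrete, here is the simplest counterfactual estimator, inverse propensity scoring (IPS). It is a sketch, assuming each logged record carries the reward, the logging policy's propensity for the action taken, and the candidate policy's probability for that same action (field names are illustrative):

```python
def ips_estimate(logs):
    """IPS estimate of a candidate policy's mean reward from logged data.

    Reweights each logged reward by p_new / p_log. Unbiased under full
    support, but high-variance when propensities are small -- this is the
    'sensitive to modeling assumptions' caveat in practice.
    """
    return sum(d["reward"] * d["p_new"] / d["p_log"] for d in logs) / len(logs)
```

Production OPE adds clipping, self-normalization, or doubly robust corrections on top of this skeleton; the raw ratio is where all of them start.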

See: Evaluation modes

Turning a report into a decision

A good decision record answers:

  • what changed
  • what improved and what worsened (including slices)
  • whether guardrails stayed within bounds
  • ship/hold/rollback decision and why
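The four answers above can be encoded as a small decision rule. The thresholds below are illustrative assumptions, not recommended values:

```python
def decide(primary_delta, slice_deltas, guardrails_ok,
           min_lift=0.0, max_slice_regression=-0.02):
    """Toy ship/hold/rollback rule mirroring the decision-record checklist.

    primary_delta: relative change in the primary KPI.
    slice_deltas:  per-slice deltas, e.g. {"home": 0.01, "search": -0.01}.
    guardrails_ok: True if every guardrail stayed within bounds.
    """
    if not guardrails_ok:
        return "rollback"
    if primary_delta <= min_lift:
        return "hold"
    if min(slice_deltas.values()) < max_slice_regression:
        return "hold"  # aggregate win, but a slice regressed too far
    return "ship"
```

Writing the rule down, even as a toy, forces the report to surface every input the decision needs, which is the real point of a decision record.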

Start here: