Evaluation reasoning and pitfalls

This page explains how to interpret RecSys evaluation results without fooling yourself.

Who this is for

  • Recommendation engineers and analysts
  • Leads approving ship/hold/rollback decisions
  • Stakeholders who want to understand why a metric changed

What you will get

  • A checklist for sanity-checking offline reports
  • The most common pitfalls that make metrics lie
  • How to connect evaluation output to an operational decision

Mental model

Evaluation answers a narrow question:

"Given the data we logged, do we have evidence that this change helps the user and the business, without breaking safety/guardrail constraints?"

Your confidence depends on:

  • data quality (joins, missingness, bias)
  • methodological fit (offline vs online vs OPE)
  • statistical robustness (variance, multiple testing)
  • operational realism (does production match the evaluation assumptions?)

The minimum sanity checks

Before you trust any number, confirm these:

  1. Join rate is stable: exposures join to outcomes by request_id at a stable rate over time.
  2. Slice coverage is stable: tenant_id, surface, and your main segmentation keys exist for the same share of events.
  3. Traffic mix is comparable: user cohorts and surfaces are not silently changing between runs.
  4. Guardrails are present: at least one guardrail metric is reported (e.g., latency, churn proxy, diversity constraint).
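Check 1 can be sketched in a few lines. This is a minimal illustration, assuming exposure and outcome events are dicts carrying a request_id (and a day key on exposures); the field names are hypothetical, not a prescribed schema:

```python
from collections import Counter

def join_rate_by_day(exposures, outcomes):
    """Fraction of exposures whose request_id appears in outcomes, per day."""
    outcome_ids = {o["request_id"] for o in outcomes}
    total, joined = Counter(), Counter()
    for e in exposures:
        total[e["day"]] += 1
        joined[e["day"]] += e["request_id"] in outcome_ids
    return {day: joined[day] / total[day] for day in total}
```

A sudden drop in any day's rate usually means broken logging, not a real behavior change, so alert on the rate before trusting the metrics built on the join.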

See also: Event join logic (exposures ↔ outcomes ↔ assignments)

Common pitfalls

1) Clicks without exposures

If you log outcomes (clicks, conversions) but not exposures, you cannot attribute them to what was actually shown, and non-events (impressions with no click) vanish from the denominator.

Symptom: metrics swing wildly or look implausibly good.

Fix: implement exposure logging first.
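What "exposure logging first" means in practice is writing one event at serving time, keyed by the same request_id that outcome events will carry. A hypothetical minimal exposure event (field names are illustrative):

```python
# Logged at serving time, before any outcome exists.
exposure_event = {
    "request_id": "req-123",           # join key shared with outcome events
    "ts": "2024-05-01T12:00:00Z",
    "surface": "home",
    "item_ids": ["item-1", "item-2", "item-3"],  # what was actually shown
    "positions": [1, 2, 3],                      # where it was shown
}
```

With this in place, CTR-style metrics become (outcomes joined to exposures) / (exposures), instead of a count of outcomes divided by a guess.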

2) Simpson’s paradox (your aggregate lies)

Overall lift can hide regressions in key segments.

Fix: require a small slice set (surface, tenant, segment) in every report and treat big slices as first-class.
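The paradox is easy to reproduce with toy numbers. In the made-up data below, both surfaces regress under treatment, yet the aggregate CTR rises, purely because treatment shifted traffic toward the higher-CTR surface:

```python
def ctr(rows):
    """Pooled click-through rate over a list of {imps, clicks} rows."""
    return sum(r["clicks"] for r in rows) / sum(r["imps"] for r in rows)

control = [
    {"surface": "home",   "imps": 100,  "clicks": 10},  # CTR 0.10
    {"surface": "search", "imps": 1000, "clicks": 20},  # CTR 0.02
]
treatment = [
    {"surface": "home",   "imps": 1000, "clicks": 90},  # CTR 0.09 (worse)
    {"surface": "search", "imps": 100,  "clicks": 1},   # CTR 0.01 (worse)
]
# Aggregate: control ~0.027, treatment ~0.083. The "lift" is pure mix shift.
```

This is why per-slice deltas belong in every report: the aggregate alone cannot distinguish a real improvement from a traffic-mix change.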

3) Leakage and look-ahead

Offline evaluation can accidentally use information that was not available at serving time.

Fix: enforce time windows and event ordering in pipelines; document and test it.
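One concrete form of that enforcement is "as-of" feature lookup: only feature values timestamped strictly before the request may be used. A minimal sketch, assuming a feature log of {ts, value} dicts (a hypothetical schema):

```python
def features_as_of(feature_log, request_ts):
    """Latest feature value logged strictly before request_ts, else None.

    Using anything with ts >= request_ts would be look-ahead leakage:
    the model would see information unavailable at serving time.
    """
    eligible = [f for f in feature_log if f["ts"] < request_ts]
    if not eligible:
        return None
    return max(eligible, key=lambda f: f["ts"])["value"]
```

A good pipeline test asserts this property on every feature source, not just the ones you suspect.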

4) Non-stationarity

User intent, inventory, and seasonality shift.

Fix: keep rolling baselines; treat changes as local decisions; prefer online tests for high-impact changes.
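A rolling baseline can be as simple as a windowed mean that drifts with the metric. A sketch with a hypothetical 7-point window:

```python
from collections import deque

def rolling_baseline(values, window=7):
    """Rolling mean over the last `window` points; compare today's value
    to this drifting baseline instead of to a fixed historical number."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

Comparing against a drifting baseline separates "the world changed" from "the change changed something", which is exactly the distinction non-stationarity blurs.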

5) Metric gaming

A change can improve a proxy metric while harming the actual user outcome.

Fix: pair every primary KPI with at least one guardrail. Prefer metrics that measure user value directly.

Choosing evaluation mode

  • Offline regression: fastest and cheapest; good for "did we break anything" gates.
  • Online A/B: highest credibility for product outcomes; requires experiment controls.
  • OPE / counterfactual: useful when A/B is expensive; sensitive to modeling assumptions.
  • Interleaving: efficient comparisons for ranking changes; still needs careful logging.
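To make the OPE bullet concrete, here is the simplest counterfactual estimator, inverse propensity scoring (IPS). It is a sketch, assuming each logged record carries the reward, the logging policy's propensity for the action taken, and the candidate policy's probability for that same action (field names are illustrative):

```python
def ips_estimate(logs):
    """IPS estimate of a candidate policy's mean reward from logged data.

    Reweights each logged reward by p_new / p_log. Unbiased under full
    support, but high-variance when propensities are small -- this is the
    'sensitive to modeling assumptions' caveat in practice.
    """
    return sum(d["reward"] * d["p_new"] / d["p_log"] for d in logs) / len(logs)
```

Production OPE adds clipping, self-normalization, or doubly robust corrections on top of this skeleton; the raw ratio is where all of them start.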

See: Evaluation modes

Turning a report into a decision

A good decision record answers:

  • what changed
  • what improved and what worsened (including slices)
  • whether guardrails stayed within bounds
  • ship/hold/rollback decision and why
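The four answers above can be encoded as a small decision rule. The thresholds below are illustrative assumptions, not recommended values:

```python
def decide(primary_delta, slice_deltas, guardrails_ok,
           min_lift=0.0, max_slice_regression=-0.02):
    """Toy ship/hold/rollback rule mirroring the decision-record checklist.

    primary_delta: relative change in the primary KPI.
    slice_deltas:  per-slice deltas, e.g. {"home": 0.01, "search": -0.01}.
    guardrails_ok: True if every guardrail stayed within bounds.
    """
    if not guardrails_ok:
        return "rollback"
    if primary_delta <= min_lift:
        return "hold"
    if min(slice_deltas.values()) < max_slice_regression:
        return "hold"  # aggregate win, but a slice regressed too far
    return "ship"
```

Writing the rule down, even as a toy, forces the report to surface every input the decision needs, which is the real point of a decision record.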

Start here: