Evaluation checklist (recommendation engineer)

This page is the evaluation checklist for recommendation engineers and explains where it fits in the RecSys suite.

Who this is for

  • Recommendation engineers validating changes to signals, candidates, or ranking behavior
  • Anyone responsible for a “ship/hold/rollback” decision backed by evidence

What this is

A practical checklist that catches the most common evaluation failures:

  • bad joins (request_id problems)
  • leakage (training/eval contamination)
  • non-comparable baselines
  • metrics that look good but don’t translate to user impact

This is not a textbook. It is a “don’t ship blind” list.

0) Define the decision and scope

  • What is the decision? ship / hold / rollback
  • What changed? config / rules / signal / ranking code
  • What is the target surface? surface = ...
  • What is the primary KPI and minimum effect size?
  • What guardrails must not regress?
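
It can help to make the scope machine-readable before any analysis starts. Below is a minimal sketch in Python; the `EvalPlan` name and its fields (`surface`, `primary_kpi`, `min_effect`, `guardrails`) are illustrative choices of ours, not a shared schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalPlan:
    decision: str              # "ship" | "hold" | "rollback"
    change: str                # what changed: config / rules / signal / ranking code
    surface: str               # target surface, e.g. "home_feed" (illustrative)
    primary_kpi: str           # e.g. "ctr_at_10" (illustrative)
    min_effect: float          # minimum effect size worth shipping, e.g. 0.01 = +1% rel.
    guardrails: dict = field(default_factory=dict)  # metric -> max tolerated regression

plan = EvalPlan(
    decision="ship",
    change="signal: add recency feature to ranker",
    surface="home_feed",
    primary_kpi="ctr_at_10",
    min_effect=0.01,
    guardrails={"p95_latency_ms": 0.05, "empty_recs_rate": 0.0},
)
```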

See: Success metrics (KPIs, guardrails, and exit criteria)

1) Data and join sanity (mandatory)

  • request_id is stable per render and present in all three streams:
      • serving response meta.request_id
      • exposure events
      • outcome events
  • Join-rate is measured and acceptable for the surface (see the sketch after this list)
  • Exposure ranks are recorded (position bias matters)
  • Identifiers are pseudonymous (avoid raw PII)
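
A minimal join-rate sketch, assuming the three streams land as files with a `request_id` column and a `rank` column on exposures; the paths and column names are illustrative, not a canonical schema.

```python
import pandas as pd

# Illustrative paths and column names -- adapt to your log schema.
serving = pd.read_parquet("serving_responses.parquet")
exposures = pd.read_parquet("exposure_events.parquet")
outcomes = pd.read_parquet("outcome_events.parquet")

served = set(serving["request_id"])
exposed = set(exposures["request_id"])
with_outcome = set(outcomes["request_id"])

# Join-rate: what fraction of served requests ties to an exposure event,
# and what fraction of exposures ties forward to an outcome event.
print(f"serving -> exposure join-rate: {len(served & exposed) / len(served):.1%}")
print(f"exposure -> outcome join-rate: {len(exposed & with_outcome) / len(exposed):.1%}")

# Position-bias analysis needs ranks: every exposure event should carry one.
assert exposures["rank"].notna().all(), "exposure events missing ranks"
```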

Canonical specs:

2) Baseline comparability (mandatory)

  • Baseline is clearly defined (what system/logic, what parameters)
  • Same population, same time window, same filters
  • You can reproduce baseline numbers
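
One way to enforce comparability is a single shared filter applied to both arms, plus a reproducibility check against the recorded anchor number. The sketch below assumes timestamped logs; all column names, file paths, and the 0.042 anchor value are illustrative.

```python
import pandas as pd

def eval_slice(df: pd.DataFrame, surface: str, start: str, end: str) -> pd.DataFrame:
    """One filter definition, reused for both arms, so populations match exactly."""
    mask = (
        (df["surface"] == surface)
        & (df["ts"] >= start) & (df["ts"] < end)
        & ~df["is_internal_traffic"]
    )
    return df.loc[mask]

baseline = eval_slice(pd.read_parquet("baseline_logs.parquet"),
                      "home_feed", "2024-05-01", "2024-05-08")
candidate = eval_slice(pd.read_parquet("candidate_logs.parquet"),
                       "home_feed", "2024-05-01", "2024-05-08")

# Can you reproduce the baseline's anchor number? If not, stop here.
ANCHOR_CTR = 0.042  # from your baseline-benchmarks doc (illustrative value)
assert abs(baseline["click"].mean() - ANCHOR_CTR) < 1e-3, "baseline not reproducible"
```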

See: Baseline benchmarks (anchor numbers)

3) Offline evaluation

  • Run offline evaluation with a deterministic snapshot (or clearly define sampling)
  • Ensure no leakage (training data leaking into evaluation window)
  • Inspect both aggregate metrics and slices (new users, cold-start, long-tail)
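
The two checks most worth automating are the leakage assertion and the slice breakdown. A sketch, assuming timestamped interaction logs; the column names (`ts`, `user_age_days`, `item_popularity_pct`, `hit_at_10`) and slice cutoffs are illustrative.

```python
import pandas as pd

# Illustrative paths and column names.
train = pd.read_parquet("train_interactions.parquet")
eval_df = pd.read_parquet("eval_interactions.parquet")

# Leakage check: no training interaction may fall inside the eval window.
assert train["ts"].max() < eval_df["ts"].min(), "training data leaks into eval window"

# Slice the metric: aggregates can hide cold-start and long-tail regressions.
slices = {
    "all": eval_df,
    "new_users": eval_df[eval_df["user_age_days"] < 7],
    "long_tail_items": eval_df[eval_df["item_popularity_pct"] < 20],
}
for name, s in slices.items():
    print(f"{name:>16}: hit@10 = {s['hit_at_10'].mean():.4f}")
```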

Start here:

4) Online validation (when you have traffic)

  • Choose test type: A/B, interleaving, or staged rollout
  • Confirm randomization and bucketing are stable
  • Track guardrails in near real-time (latency, empty-recs rate, errors)
  • Predefine stop conditions (kill switch thresholds)
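
A cheap way to confirm randomization and bucketing are stable is a sample-ratio-mismatch (SRM) check: compare observed arm sizes to the configured split with a chi-square test. The counts below are made up; a very small p-value means bucketing is broken and the experiment's results should not be trusted.

```python
from scipy.stats import chisquare

observed = [101_250, 98_410]     # users bucketed into control / treatment (made up)
expected_ratio = [0.5, 0.5]      # the split you configured
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"SRM detected (p={p_value:.2e}) -- halt analysis, fix bucketing first")
```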

See: Workflow: Online A/B analysis in production

5) Decision artifact (required)

  • Produce a shareable decision record (what changed, what the evidence says)
  • Link the report outputs and where raw logs live
  • Record rollback lever and confirm it works
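
A decision record can be as small as a JSON blob checked in next to the report. The schema below is our own suggestion, with placeholder links; adapt the fields to whatever your team's evidence template requires.

```python
import json
from datetime import date

decision_record = {
    "date": date.today().isoformat(),
    "decision": "hold",                      # ship / hold / rollback
    "change": "ranking code: freshness boost behind a flag",
    "surface": "home_feed",
    "primary_kpi": {"name": "ctr_at_10", "delta": "+0.4%", "ci_95": ["-0.1%", "+0.9%"]},
    "guardrails": {"p95_latency_ms": "+2ms, within budget"},
    "evidence": ["<link to report outputs>", "<path to raw logs>"],
    "rollback_lever": "feature flag freshness_boost_enabled (verified off in staging)",
}
print(json.dumps(decision_record, indent=2))
```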

Evidence template: Evidence (what “good outputs” look like)