Evaluation validity¶
Evaluation numbers are only useful when you can explain what they mean, what they do not mean, and what you would do if they move.
This page makes the evaluation claims in this documentation precise and conservative.
What RecSys guarantees about evaluation¶
RecSys can guarantee process quality (repeatability and auditability), not business outcomes.
- Reproducible offline runs: given the same inputs and the same artifact/manifest versions, you can re-run offline evaluation and obtain the same outputs.
- Auditable joins: exposures, outcomes, and assignments are joined by `request_id` with explicit join logic.
- Decision trail: reports are intended to be stored alongside the change that produced them, so “why did we ship?” is answerable.
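For illustration, here is a minimal sketch of such a join in Python. The dict-based record shapes and the tiny sample data are assumptions, not the actual log schema; the point is the explicit join key and join logic:

```python
# Sketch: join exposures to outcomes on request_id and report the join-rate.
# The record shapes are assumptions for illustration; real exposure and
# outcome logs carry richer schemas.

def join_on_request_id(exposures, outcomes):
    outcomes_by_id = {}
    for outcome in outcomes:
        outcomes_by_id.setdefault(outcome["request_id"], []).append(outcome)
    # Explicit join logic: each exposure matches zero or more outcomes.
    return [
        {**exposure, **outcome}
        for exposure in exposures
        for outcome in outcomes_by_id.get(exposure["request_id"], [])
    ]

exposures = [{"request_id": "r1", "item": "a"}, {"request_id": "r2", "item": "b"}]
outcomes = [{"request_id": "r1", "clicked": True}]

joined = join_on_request_id(exposures, outcomes)
joined_ids = {row["request_id"] for row in joined}
join_rate = len(joined_ids) / len({e["request_id"] for e in exposures})
print(f"join_rate={join_rate:.2f}")  # 0.50: one of two exposures has an outcome
```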
See also: Guarantees and non-goals.
What RecSys does not guarantee¶
- That offline uplift translates to online uplift.
- That an online experiment result generalizes beyond its traffic slice.
- That any particular metric is “the metric.”
RecSys helps you measure and decide; it does not remove the need for product judgment.
The 4 failure modes of evaluation¶
1) Unjoinable logs¶
Symptoms:
- Join-rate is near zero.
- Metrics look “too good” or “too empty.”
Fix:
- Implement the attribution contract: Integration spec (one surface)
- Verify joinability: Verify joinability (request IDs → outcomes); a minimal check is sketched below.
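A joinability check can be as small as the sketch below. How join-rate is defined (here, the fraction of outcomes attributable to a logged exposure) and the 0.95 threshold are assumptions to adapt per surface:

```python
# Sketch: a joinability check to run before trusting any metrics.
# Both the join-rate definition and the threshold are assumptions.

def check_joinability(exposure_ids, outcome_ids, min_rate=0.95):
    exposure_ids = set(exposure_ids)
    outcome_ids = list(outcome_ids)
    if not outcome_ids:
        raise ValueError('no outcomes to join; metrics would look "too empty"')
    rate = sum(oid in exposure_ids for oid in outcome_ids) / len(outcome_ids)
    if rate < min_rate:
        raise ValueError(
            f"join-rate {rate:.2%} is below {min_rate:.0%}; "
            "fix the attribution contract before reading metrics"
        )
    return rate

print(check_joinability({"r1", "r2", "r3"}, ["r1", "r2"]))  # 1.0
```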
2) Leaky or biased offline datasets¶
Common causes:
- Training/eval data includes signals that encode outcomes (data leakage).
- Offline dataset does not represent production traffic.
- The candidate set differs from what serving would produce.
Mitigations:
- Treat offline evaluation as a gate (reject obvious regressions), not as the final word.
- Keep an “offline validity checklist” in your evaluation runbook; one sample checklist item is sketched below.
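As one concrete checklist item, the sketch below tests for temporal leakage: no feature used offline should have been computed at or after the outcome it predicts. The `feature_ts` and `outcome_ts` field names are assumptions:

```python
# Sketch: a temporal leakage check for an offline validity checklist.
# Field names (feature_ts, outcome_ts) are assumptions for illustration.

from datetime import datetime

def assert_no_temporal_leakage(rows):
    leaky = [r for r in rows if r["feature_ts"] >= r["outcome_ts"]]
    if leaky:
        raise AssertionError(
            f"{len(leaky)} row(s) have features computed at or after the outcome; "
            "such features encode the label and inflate offline metrics"
        )

rows = [{"feature_ts": datetime(2024, 1, 1), "outcome_ts": datetime(2024, 1, 2)}]
assert_no_temporal_leakage(rows)  # passes; a leaky row would raise
```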
3) Metric mismatch¶
Common causes:
- Optimizing proxy metrics that do not align with business goals.
- Ignoring guardrails (diversity, latency, bounce, refunds, etc.).
Mitigations:
- Define one KPI + one guardrail per surface first. See: Success metrics (KPIs, guardrails, and exit criteria)
- Use a decision rubric (ship/hold/rollback). See: Decision playbook: ship / hold / rollback
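To make both mitigations concrete, the sketch below declares one KPI plus one guardrail for a hypothetical surface and applies a simple ship/hold/rollback rubric. Every metric name and threshold here is an assumption; choose yours per the Success metrics page:

```python
# Sketch: one KPI + one guardrail per surface, plus a ship/hold/rollback
# rubric. All metric names and thresholds are assumptions, not defaults.

from dataclasses import dataclass

@dataclass
class SurfaceMetrics:
    kpi: str                         # the one metric the surface should move
    kpi_min_uplift: float            # relative uplift required to ship
    guardrail: str                   # the one metric that must not regress
    guardrail_max_regression: float  # relative regression forcing a rollback

home_feed = SurfaceMetrics(
    kpi="click_through_rate", kpi_min_uplift=0.01,
    guardrail="p95_latency_ms", guardrail_max_regression=0.05,
)

def decide(kpi_uplift, guardrail_regression, metrics):
    if guardrail_regression > metrics.guardrail_max_regression:
        return "rollback"
    if kpi_uplift >= metrics.kpi_min_uplift:
        return "ship"
    return "hold"

print(decide(kpi_uplift=0.02, guardrail_regression=0.01, metrics=home_feed))  # ship
```

Keeping the declaration in code or config next to the surface makes the rubric reviewable in the same change that ships the model.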
4) Overconfidence from small samples¶
Common causes:
- Shipping based on tiny traffic slices.
- Multiple comparisons without discipline.
Mitigations:
- Prefer conservative thresholds; one formulation is sketched below.
- Use interleaving or OPE only when you can justify their assumptions.
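One way to formulate “conservative” is sketched below: refuse to read any delta from a sample below a minimum size, and require the delta to clear a strict noise band. The 10,000-exposure floor and the z = 2.58 (roughly 99% two-sided) multiplier are assumptions, not statistical advice:

```python
# Sketch: a deliberately conservative small-sample check for a binary
# metric. The sample-size floor and the z multiplier are assumptions.

import math

def conservative_delta(clicks_a, n_a, clicks_b, n_b, min_n=10_000, z=2.58):
    if min(n_a, n_b) < min_n:
        return "hold: sample too small to read"
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    delta = p_b - p_a
    # Normal-approximation standard error of the difference in proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if abs(delta) < z * se:
        return "hold: delta is within the noise band"
    return f"read: delta={delta:+.4f} (noise band ±{z * se:.4f})"

print(conservative_delta(clicks_a=520, n_a=12_000, clicks_b=610, n_b=12_000))
```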
References:
- OPE: Off-policy evaluation (OPE): powerful and easy to misuse
- Interleaving: Interleaving: fast ranker comparison on the same traffic
Recommended “trustable” evaluation ladder¶
Use this ladder to move from “safe to merge” to “safe to ship”:
1. Determinism + invariants (CI gate); a minimal determinism check is sketched after this list.
    - Verify determinism: Verify determinism
    - Pipelines invariants: Pipelines operational invariants (safety model)
2. Offline gate (reject obvious regressions)
    - Workflow: Workflow: Offline gate in CI
3. Online validation (confirm impact in production)
4. Decision + documentation
    - Suite how-to: How-to: run evaluation and make ship decisions
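For the first rung, a determinism check can be as small as the sketch below: run the offline evaluation twice on the same pinned manifest and assert byte-identical outputs. `run_offline_eval` is a hypothetical placeholder for your pipeline entry point:

```python
# Sketch: a CI determinism gate. run_offline_eval is a hypothetical
# placeholder; a real run would load artifacts pinned by the manifest.

import hashlib
import json

def run_offline_eval(manifest):
    return {"ndcg_at_10": 0.42, "manifest": manifest}  # placeholder result

def output_digest(result):
    blob = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

manifest = {"dataset": "v3", "model": "ranker-2024-01"}
first = output_digest(run_offline_eval(manifest))
second = output_digest(run_offline_eval(manifest))
assert first == second, "offline evaluation is not deterministic; fix before gating"
print("determinism check passed")
```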
Read next¶
- Guarantees and non-goals: Guarantees and non-goals
- Evaluation modes: Evaluation modes
- Exposure logging & attribution: Exposure logging and attribution