Runbooks: operating recsys-eval

This page describes how to operate recsys-eval, triage its most common failures, and where it fits into the RecSys suite.

Who this is for

Maintainers and on-call engineers.

What you will get

The top failure modes and how to debug them quickly
A repeatable "triage" flow

Triage flow

1. Identify the run:
   run_id
   mode
   dataset window
   binary version
2. Check data quality:
   schema validation
   duplicates
   missing required fields
3. Check joins:
   match rates
   timestamp anomalies
4. Check gates and warnings:
   which metric triggered the gate
   which segment drove the regression
5. Decide action:
   fix data
   rerun
   rollback config/model
   escalate
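
The data quality step of the triage flow (duplicates and missing required fields) can be approximated on a small sample of logged records before rerunning anything. The sketch below is illustrative only: `data_quality_summary`, the `REQUIRED_FIELDS` tuple, and the record layout are assumptions for this example, not part of recsys-eval.

```python
from collections import Counter

# Hypothetical required fields; substitute the fields your schema actually requires.
REQUIRED_FIELDS = ("request_id", "user_id", "ts")


def data_quality_summary(records, key="request_id", required=REQUIRED_FIELDS):
    """Report duplicate keys and records that lack a required field."""
    # Count how often each non-null key value appears.
    key_counts = Counter(r.get(key) for r in records if r.get(key) is not None)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)
    # Indices of records where any required field is absent or null.
    missing = [
        i for i, r in enumerate(records)
        if any(r.get(field) is None for field in required)
    ]
    return {
        "total": len(records),
        "duplicate_keys": duplicate_keys,
        "records_missing_fields": missing,
    }


sample = [
    {"request_id": "a", "user_id": "u1", "ts": 1},
    {"request_id": "a", "user_id": "u1", "ts": 2},   # duplicate request_id
    {"request_id": "b", "user_id": None, "ts": 3},   # missing user_id
]
summary = data_quality_summary(sample)
print(summary)
```

A non-empty `duplicate_keys` or `records_missing_fields` list usually points to "fix data" rather than "rerun" as the next action.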

Failure mode: schema validation fails

Symptoms:

the validate command reports missing fields or wrong types

Fix:

update logging to match schemas
if schema changed, bump schema version and update producers
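
To see which producer records are out of step with the schema, it can help to run a field-and-type check on a handful of raw records. This is a minimal sketch under stated assumptions: `EXPOSURE_SCHEMA` and `validate_record` are hypothetical names, and the field list is made up for illustration. It does not reproduce what the validate command does internally.

```python
# Hypothetical schema: field name -> expected Python type.
EXPOSURE_SCHEMA = {"request_id": str, "user_id": str, "ts": int, "items": list}


def validate_record(record, schema=EXPOSURE_SCHEMA):
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    return errors


# user_id is logged as an int and items is absent.
errors = validate_record({"request_id": "r1", "user_id": 42, "ts": 1700000000})
for message in errors:
    print(message)
```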

Failure mode: join match rate collapses

Symptoms:

offline metrics drop to near zero
report shows low join match

Likely causes:

request_id changed format
producers stopped logging outcomes with request_id
duplicate or missing request_id in exposures

Fix:

compare recent exposure and outcome samples
confirm request_id consistency end-to-end
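
Comparing samples is easier with a quick match-rate calculation that also surfaces the unmatched IDs, since a format change is often visible at a glance. The sketch below assumes match rate means "share of outcome request_ids that also appear in exposures"; recsys-eval may define it differently. `join_match_rate` is a hypothetical helper.

```python
def join_match_rate(exposures, outcomes, key="request_id"):
    """Return (match_rate, unmatched_outcome_ids) for two lists of dict records."""
    exposure_ids = {e[key] for e in exposures if e.get(key)}
    outcome_ids = {o[key] for o in outcomes if o.get(key)}
    if not outcome_ids:
        # No outcomes carry the key at all: treat as a zero match rate.
        return 0.0, []
    unmatched = sorted(outcome_ids - exposure_ids)
    rate = (len(outcome_ids) - len(unmatched)) / len(outcome_ids)
    return rate, unmatched


exposures = [{"request_id": "req-1"}, {"request_id": "req-2"}, {"request_id": "req-3"}]
outcomes = [{"request_id": "req-1"}, {"request_id": "REQ_2"}]  # second ID changed format

rate, unmatched = join_match_rate(exposures, outcomes)
print(f"match rate: {rate:.2f}, unmatched sample: {unmatched[:5]}")
```

Inspecting a few entries of `unmatched` next to raw exposure IDs is typically enough to tell a format change apart from outcomes that were never logged with a request_id.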

Failure mode: SRM warning (experiments)

Symptoms:

control and candidate sample sizes deviate from the intended split (sample ratio mismatch)

Likely causes:

bucket assignment bug
logging bug
rollout was not actually 50/50

Fix:

stop interpreting metrics
fix assignment and rerun
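
A sample ratio mismatch is commonly confirmed with a chi-square goodness-of-fit test on the two arm counts. The sketch below uses only the standard library and the closed form for one degree of freedom; `srm_p_value` is a hypothetical helper, and the 0.001 alert threshold is an assumed convention, not a documented recsys-eval default.

```python
import math


def srm_p_value(control_n, candidate_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_n + candidate_n
    expected_control = total * expected_control_share
    expected_candidate = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (candidate_n - expected_candidate) ** 2 / expected_candidate
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALPHA = 0.001  # assumed threshold; tune to your own alerting policy

p_balanced = srm_p_value(5000, 5000)
p_skewed = srm_p_value(5000, 5400)
print(f"balanced p={p_balanced:.4f}, skewed p={p_skewed:.6f}")
print("SRM suspected" if p_skewed < SRM_ALPHA else "split looks consistent")
```

If the rollout was intentionally uneven, pass the intended share as `expected_control_share` before concluding that assignment is broken.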

Failure mode: OPE high variance

Symptoms:

warnings about near-zero propensities
wildly unstable estimates

Fix:

do not ship based on OPE
improve propensity logging and overlap
prefer A/B or interleaving
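
To illustrate why near-zero propensities make estimates unstable, the sketch below computes a plain inverse propensity scoring (IPS) estimate together with the effective sample size of the importance weights. IPS is used here only as a familiar example estimator; the source does not state which estimators recsys-eval implements, and `ips_estimate` is a hypothetical helper.

```python
def ips_estimate(rewards, target_probs, logging_probs):
    """Return (IPS estimate, effective sample size) for logged bandit feedback."""
    if any(p <= 0 for p in logging_probs):
        raise ValueError("logging propensities must be strictly positive")
    # Importance weight: how much more likely the target policy was to take the logged action.
    weights = [t / l for t, l in zip(target_probs, logging_probs)]
    n = len(weights)
    estimate = sum(w * r for w, r in zip(weights, rewards)) / n
    sum_sq = sum(w * w for w in weights)
    # Kish effective sample size: (sum w)^2 / sum w^2; equals n when all weights are equal.
    ess = (sum(weights) ** 2 / sum_sq) if sum_sq > 0 else 0.0
    return estimate, ess


rewards = [1, 0, 1, 0]
target_probs = [0.5, 0.5, 0.5, 0.5]
logging_probs = [0.5, 0.5, 0.5, 0.001]  # one near-zero propensity

estimate, ess = ips_estimate(rewards, target_probs, logging_probs)
print(f"IPS estimate: {estimate:.3f}, effective sample size: {ess:.2f} of {len(rewards)}")
```

Here a single record with propensity 0.001 carries a weight of 500, so the effective sample size falls to roughly one record out of four. Had that record carried a reward of 1, it alone would have dominated the estimate.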

Read next

Troubleshooting: symptom -> cause -> fix
Workflow: Online A/B analysis in production
CI gates: using recsys-eval in automation
Metrics: what we measure and why