Runbooks: operating recsys-eval

This page covers how to operate recsys-eval as part of the RecSys suite.

Who this is for

Maintainers and on-call engineers.

What you will get

  • The top failure modes and how to debug them quickly
  • A repeatable "triage" flow

Triage flow

  1. Identify the run:

     • run_id
     • mode
     • dataset window
     • binary version

  2. Check data quality:

     • schema validation
     • duplicates
     • missing required fields

  3. Check joins:

     • match rates
     • timestamp anomalies

  4. Check gates and warnings:

     • which metric triggered the gate
     • which segment drove the regression

  5. Decide action:

     • fix data
     • rerun
     • rollback config/model
     • escalate
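
Two of the checks in the triage flow, duplicates and timestamp anomalies, can be sketched in a few lines of Python when you want to spot-check a sample by hand. This is a minimal sketch, not part of recsys-eval: it assumes records are dicts keyed by request_id, and the ts field (a numeric timestamp) and both function names are hypothetical.

```python
from collections import Counter


def duplicate_request_ids(records):
    """Return the request_id values that appear more than once, sorted."""
    counts = Counter(r["request_id"] for r in records)
    return sorted(rid for rid, n in counts.items() if n > 1)


def outcomes_before_exposure(exposures, outcomes):
    """Return request_ids whose outcome is timestamped before the exposure.

    Assumes a hypothetical numeric `ts` field on both record types.
    """
    # Keep the earliest exposure timestamp seen for each request_id.
    first_seen = {}
    for e in exposures:
        rid = e["request_id"]
        first_seen[rid] = min(e["ts"], first_seen.get(rid, e["ts"]))

    # An outcome that precedes its exposure usually points at clock skew
    # or a bad join key, so it is worth listing explicitly.
    return sorted({
        o["request_id"]
        for o in outcomes
        if o["request_id"] in first_seen and o["ts"] < first_seen[o["request_id"]]
    })
```

Running both functions on a small recent sample is usually enough to tell whether a problem is systematic or limited to a handful of records.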

Failure mode: schema validation fails

Symptoms:

  • validate command reports missing fields or wrong types

Fix:

  • update logging to match schemas
  • if schema changed, bump schema version and update producers
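
To see what "missing fields or wrong types" looks like on a single record, a check along these lines can help when reproducing a failure locally. It is an illustrative sketch only and does not reproduce the validate command: the schema dict, its field names other than request_id, and the function name are assumptions.

```python
def validate_record(record, schema):
    """Return a list of human-readable problems for one record.

    `schema` maps a field name to the Python type expected for it.
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems


# Hypothetical schema for illustration; the real schemas live with recsys-eval.
EXPOSURE_SCHEMA = {"request_id": str, "user_id": str, "ts": int}

print(validate_record({"request_id": "r1", "ts": "123"}, EXPOSURE_SCHEMA))
# ['missing field: user_id', 'wrong type for ts: expected int, got str']
```

Grouping the returned messages by field across a sample shows quickly whether one producer is responsible for the failures.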

Failure mode: join match rate collapses

Symptoms:

  • offline metrics drop to near zero
  • report shows a low join match rate

Likely causes:

  • request_id changed format
  • producers stopped logging outcomes with request_id
  • duplicate or missing request_id in exposures

Fix:

  • compare recent exposure and outcome samples
  • confirm request_id consistency end-to-end
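
Comparing samples is easier with two numbers in hand: the share of outcomes that find an exposure, and the "shape" of the IDs on each side. The sketch below assumes records are dicts with a request_id string. The match-rate definition (matched outcomes divided by all outcomes) is one plausible choice and may differ from what the report computes, and all function names are hypothetical.

```python
import re
from collections import Counter


def id_shape(request_id):
    """Collapse an ID to its shape: digits become 9, ASCII letters become a."""
    return re.sub(r"[A-Za-z]", "a", re.sub(r"[0-9]", "9", request_id))


def shape_counts(records):
    """Count how many records share each request_id shape."""
    return Counter(id_shape(r["request_id"]) for r in records)


def outcome_match_rate(exposures, outcomes):
    """Fraction of outcome records whose request_id appears in exposures."""
    if not outcomes:
        return 0.0
    exposure_ids = {e["request_id"] for e in exposures}
    matched = sum(1 for o in outcomes if o["request_id"] in exposure_ids)
    return matched / len(outcomes)


exposures = [{"request_id": "req-001"}, {"request_id": "req-002"}]
outcomes = [{"request_id": "req-001"}, {"request_id": "req_002"}]

print(outcome_match_rate(exposures, outcomes))  # 0.5
print(dict(shape_counts(outcomes)))  # {'aaa-999': 1, 'aaa_999': 1}
```

A shape that appears on only one side (here the underscore variant) is a strong hint that a producer changed the request_id format.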

Failure mode: SRM warning (experiments)

Symptoms:

  • control and candidate sample sizes deviate from the intended split (sample ratio mismatch, SRM)

Likely causes:

  • bucket assignment bug
  • logging bug
  • rollout was not actually 50/50

Fix:

  • do not interpret metrics from this run until the mismatch is explained
  • fix assignment and rerun
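
To judge by hand whether a split is "off" beyond normal noise, a chi-square goodness-of-fit test on the two arm counts is a common approach. The sketch below is an assumption about how such a check could be done, not a description of how recsys-eval raises its warning. The function name and the 0.001 alert threshold mentioned afterwards are illustrative.

```python
import math


def srm_p_value(control_n, candidate_n, expected_control_share=0.5):
    """P-value of a chi-square goodness-of-fit test on two arm counts.

    A small p-value means the observed split is unlikely under the
    intended allocation.
    """
    total = control_n + candidate_n
    if total <= 0:
        raise ValueError("need at least one observation")
    if not 0.0 < expected_control_share < 1.0:
        raise ValueError("expected_control_share must be strictly between 0 and 1")

    expected_control = total * expected_control_share
    expected_candidate = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (candidate_n - expected_candidate) ** 2 / expected_candidate
    )
    # With two arms there is one degree of freedom, and the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

For example, srm_p_value(5000, 5500) falls far below a strict threshold such as 0.001, while srm_p_value(5000, 5100) does not. A strict threshold keeps false alarms rare when the check runs on every experiment.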

Failure mode: high-variance OPE (off-policy evaluation)

Symptoms:

  • warnings about near-zero propensities
  • wildly unstable estimates

Fix:

  • do not ship based on OPE
  • improve propensity logging and overlap
  • prefer A/B or interleaving
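
When deciding how much to distrust an estimate, it helps to quantify the instability. One common diagnostic for importance-weighted estimators is the effective sample size of the weights (Kish's formula). The sketch below assumes you can export, per logged event, the logging propensity and the target policy's probability for the logged action. The function name, the returned keys, and the 0.01 cutoff are hypothetical.

```python
def ope_diagnostics(target_probs, logging_propensities, min_propensity=0.01):
    """Summarize importance weights for an off-policy estimate.

    Each weight is target probability divided by logging propensity.
    """
    if len(target_probs) != len(logging_propensities):
        raise ValueError("inputs must have the same length")
    if not target_probs:
        raise ValueError("need at least one logged event")

    weights = []
    near_zero = 0
    for p_target, p_log in zip(target_probs, logging_propensities):
        if p_log <= 0:
            raise ValueError("logging propensity must be positive")
        if p_log < min_propensity:
            near_zero += 1
        weights.append(p_target / p_log)

    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    # Kish effective sample size: (sum w)^2 / sum w^2.
    ess = (total * total) / total_sq if total_sq > 0 else 0.0
    return {"near_zero": near_zero, "max_weight": max(weights), "ess": ess}
```

If the effective sample size is a small fraction of the number of logged events, a handful of records dominate the estimate, which matches the "wildly unstable" symptom above.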