Failure modes & diagnostics (baseline)

This page lists common baseline failure modes in the RecSys suite and shows how to diagnose, fix, and prevent each one.

Who this is for

A set of common failure modes with: symptom → cause → diagnosis → fix → prevention
Links to the deeper runbooks and reference pages

Symptom
recsys-eval reports low exposure/outcome join rates
KPI swings look “too good / too bad” and vary wildly by slice
Likely causes
outcomes missing request_id
request_id generated twice (API call vs downstream logging)
the same request_id reused for multiple renders
surface/tenant keys in the logs do not match the slice keys
Diagnosis
run recsys-eval validate on exposures and outcomes (and on assignments, if you run experiments)
compute join-rate by surface (and platform) from raw logs
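The join-rate check above can be sketched in a few lines of Python. This is a minimal illustration, not the recsys-eval implementation: it assumes exposure and outcome logs are already loaded as lists of dicts, that each exposure carries `request_id` and `surface`, and that "join rate" means the fraction of exposures with at least one outcome sharing the same `request_id`. The `event` field and the sample records are hypothetical.

```python
from collections import defaultdict

def join_rate_by_surface(exposures, outcomes):
    """Return {surface: fraction of exposures whose request_id appears on an outcome}."""
    # Outcomes without a request_id can never join, so they are skipped here.
    outcome_ids = {o["request_id"] for o in outcomes if o.get("request_id")}
    total = defaultdict(int)
    joined = defaultdict(int)
    for e in exposures:
        surface = e.get("surface", "<missing>")
        total[surface] += 1
        if e.get("request_id") in outcome_ids:
            joined[surface] += 1
    return {s: joined[s] / total[s] for s in total}

# Hypothetical sample logs (assumed schema).
exposures = [
    {"request_id": "r1", "surface": "home"},
    {"request_id": "r2", "surface": "home"},
    {"request_id": "r3", "surface": "search"},
]
outcomes = [
    {"request_id": "r1", "event": "click"},
    {"event": "click"},  # outcome missing request_id: cannot be joined
]
rates = join_rate_by_surface(exposures, outcomes)
```

A surface whose rate is far below the others is usually the place to start looking for a dropped or regenerated `request_id`. Adding a platform key to the grouping works the same way.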
Fix
propagate request_id from recommend → render → outcome event
add an automated integration test that asserts “same request_id everywhere”
Prevention
enforce the invariants in: Minimum instrumentation spec (for credible evaluation)
keep request_id generation in one place (shared middleware/client)
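An integration test for "same request_id everywhere" might look roughly like the check below. It is a sketch under assumptions: the events of one recommend → render → outcome flow are collected into a list of dicts, and the `stage` field used to tell them apart is hypothetical, not part of any documented schema.

```python
from collections import Counter

def request_id_violations(events):
    """Return a list of invariant violations for the events of a single flow."""
    problems = []
    ids = [e.get("request_id") for e in events]
    # Cause: outcomes (or any event) missing request_id.
    if any(not rid for rid in ids):
        problems.append("event without request_id")
    # Cause: request_id generated twice (API call vs downstream logging).
    if len({rid for rid in ids if rid}) > 1:
        problems.append("more than one request_id in a single flow")
    # Cause: the same request_id reused for multiple renders.
    renders = Counter(e.get("request_id") for e in events if e.get("stage") == "render")
    if any(rid and n > 1 for rid, n in renders.items()):
        problems.append("request_id reused for multiple renders")
    return problems

good_flow = [
    {"stage": "recommend", "request_id": "r1"},
    {"stage": "render", "request_id": "r1"},
    {"stage": "outcome", "request_id": "r1"},
]
bad_flow = [
    {"stage": "recommend", "request_id": "r1"},
    {"stage": "render", "request_id": "r2"},  # regenerated downstream
    {"stage": "outcome"},                      # request_id dropped
]
good_result = request_id_violations(good_flow)
bad_result = request_id_violations(bad_flow)
```

In a real test suite the event list would come from captured logs of one end-to-end request, and the test would assert that the returned list is empty.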

Symptom
response items[] is empty (or much shorter than k)
Likely causes
no candidate data (empty popularity table in DB-only mode)
surface/namespace mismatch (for example, signals written to home but requests sent for home_feed)
constraints or allow-lists filtered everything
missing artifacts / stores (signal unavailable) in artifact mode
Diagnosis
check warnings[] (SIGNAL_UNAVAILABLE, CONSTRAINTS_FILTERED, CANDIDATES_INCLUDE_EMPTY)
confirm tenant + surface config exists (admin bootstrap)
if DB-only: verify seed tables contain data for the namespace
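The first diagnosis step, reading `warnings[]`, can be automated with a small helper. The sketch below assumes the response has already been parsed into a dict with `items` and `warnings` keys. The exact shape of a warning entry is not specified on this page, so the helper accepts either plain strings or objects with a `code` field; both shapes are assumptions.

```python
# Warning codes named in the diagnosis step above.
DOCUMENTED_CODES = {"SIGNAL_UNAVAILABLE", "CONSTRAINTS_FILTERED", "CANDIDATES_INCLUDE_EMPTY"}

def triage_empty_recs(response, k):
    """Summarize how short the response is and which warning codes it carries."""
    items = response.get("items") or []
    # Accept either {"code": "..."} objects or bare strings (assumed shapes).
    codes = [
        w.get("code") if isinstance(w, dict) else str(w)
        for w in response.get("warnings") or []
    ]
    return {
        "returned": len(items),
        "short": len(items) < k,
        "documented_warnings": [c for c in codes if c in DOCUMENTED_CODES],
        "other_warnings": [c for c in codes if c not in DOCUMENTED_CODES],
    }

# Hypothetical response for illustration.
response = {
    "items": [],
    "warnings": [{"code": "SIGNAL_UNAVAILABLE"}, "CONSTRAINTS_FILTERED", "SOMETHING_ELSE"],
}
summary = triage_empty_recs(response, k=10)
```

If `short` is true and no warnings are present, the remaining checks (tenant and surface config, seed tables) are the next place to look.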
Fix
follow the runbook: Runbook: Empty recs
Prevention
work through the checklist: How-to: Integration checklist (one surface)

Symptom
recommendations do not change after pipeline runs
results look “stuck” on an old model/version
Likely causes
pipelines did not publish a new manifest pointer
object store credentials/paths misconfigured
service caches not invalidated after shipping
Diagnosis
check manifest timestamp/version in the registry
check pipeline job logs for publish steps
confirm the service can read the manifest path and objects
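One way to make the manifest timestamp check mechanical is to compare the manifest's publish time with the time of the last pipeline run. The sketch below is illustrative only: the `published_at` and `version` field names, the ISO 8601 timestamp format, and the 30-minute tolerance are all assumptions, not the registry's actual schema.

```python
from datetime import datetime, timedelta

def manifest_is_stale(manifest, last_pipeline_run, max_lag=timedelta(minutes=30)):
    """Return True if the manifest was published too long before the last pipeline run."""
    # 'published_at' is a hypothetical field holding an ISO 8601 timestamp with offset.
    published = datetime.fromisoformat(manifest["published_at"])
    return last_pipeline_run - published > max_lag

# Hypothetical values for illustration.
last_run = datetime.fromisoformat("2024-01-01T02:00:00+00:00")
old_manifest = {"version": "v41", "published_at": "2024-01-01T00:00:00+00:00"}
fresh_manifest = {"version": "v42", "published_at": "2024-01-01T02:05:00+00:00"}

old_is_stale = manifest_is_stale(old_manifest, last_run)
fresh_is_stale = manifest_is_stale(fresh_manifest, last_run)
```

A stale result points at the publish step or at credentials and paths. A fresh manifest with unchanged recommendations points at service caches instead.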
Fix
follow the runbook: Runbook: Stale manifest (artifact mode)
Prevention
add a “ship verification” step: publish → invalidate cache → smoke test one request
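The ship verification step can be expressed as a small ordered check. The sketch below assumes each step is wrapped in a callable that returns a truthy value on success. The three callables are placeholders for whatever the pipeline, cache, and service actually expose; none of them is a real RecSys suite API.

```python
def ship_verification(publish, invalidate_cache, smoke_test):
    """Run publish -> invalidate cache -> smoke test in order; stop at the first failure."""
    steps = [
        ("publish", publish),
        ("invalidate cache", invalidate_cache),
        ("smoke test", smoke_test),
    ]
    for name, step in steps:
        if not step():
            raise RuntimeError(f"ship verification failed at step: {name}")
    return True

# Hypothetical stand-ins that record the order in which they were called.
calls = []

def fake_publish():
    calls.append("publish")
    return True

def fake_invalidate():
    calls.append("invalidate")
    return True

def fake_smoke():
    calls.append("smoke")
    return True

ok = ship_verification(fake_publish, fake_invalidate, fake_smoke)
```

Raising on the first failed step keeps a half-shipped state visible in the job logs instead of letting the pipeline report success.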