Operations
This section covers running RecSys in production: performance, readiness, and on-call runbooks.
Who this is for
- SREs and on-call engineers running RecSys in production
- Engineering teams sizing capacity and validating production readiness
What you will get
- A production readiness checklist and baseline benchmarks
- Failure-mode diagnosis and safe remediations (with runbook links)
- The first pages to open when the service is not ready or recommendations go empty
Quick paths
- Performance & capacity: Sizing guidance and performance expectations.
- Baseline benchmarks: Reproducible “anchor numbers” and a template to record your own runs.
- Production readiness checklist: Pre-flight checks before you go live.
- Failure modes & diagnostics: Common symptoms, likely causes, and safe fixes (with links to runbooks).
- Service not ready (runbook): Triage steps when the API fails readiness.
- Empty recs (runbook): Common causes and safe remediations.
- Pipelines runbooks: Day-2 operations for the offline layer.
Read next
- Production readiness checklist: Production readiness checklist (RecSys suite)
- Failure modes & diagnostics (with runbooks): Failure modes & diagnostics (baseline)
- Pipelines SLOs and freshness: SLOs and freshness