Operational reliability and rollback¶
This page explains how operational reliability and rollback work in the RecSys suite: what "healthy" means, which changes can be reversed, and where to start during an incident.
Who this is for¶
- Product owners and stakeholders who need confidence that “we can ship this safely”
- Engineering leads who want a shared mental model of what can go wrong and how we recover
- On-call/SRE who need the shortest path to the right runbook
What you will get¶
- A clear model for what “healthy” means across serving, pipelines, and evaluation
- The rollback levers that exist in each layer (and when to use which)
- A first-incident checklist with links to the right runbooks
Reliability model (plain language)¶
RecSys is designed so that:
- Serving stays available even when offline pipelines fail.
- Changes are reversible: you can roll back config/rules or artifact versions without redeploying everything.
- Decisions are auditable: logs and reports can explain what was shipped and why.
The most important invariants are:
- Pipelines publish artifacts first and update the “current” manifest pointer last.
- The service reads “current” and can fall back safely when data is missing.
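The two invariants above can be sketched as a publish step and a read step. This is a minimal illustration, not the suite's actual implementation: the on-disk layout (`versions/<version>/` plus a `current.json` pointer file) and the names `publish` and `load_current` are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(root: Path, version: str, artifacts: dict[str, bytes]) -> None:
    """Write every artifact for `version`, then flip the pointer as the last step."""
    version_dir = root / "versions" / version
    version_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in artifacts.items():
        (version_dir / name).write_bytes(payload)

    # Write the new pointer to a temp file, then rename it over the old one.
    # os.replace is atomic on a single filesystem, so a reader sees either
    # the old pointer or the new one, never a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"version": version, "artifacts": sorted(artifacts)}, tmp)
    os.replace(tmp_path, root / "current.json")


def load_current(root: Path, fallback: dict[str, bytes]) -> dict[str, bytes]:
    """Read what `current.json` points at; return `fallback` if anything is missing."""
    try:
        manifest = json.loads((root / "current.json").read_text())
        version_dir = root / "versions" / manifest["version"]
        return {
            name: (version_dir / name).read_bytes()
            for name in manifest["artifacts"]
        }
    except (OSError, KeyError, ValueError, TypeError):
        # Missing pointer, missing artifact file, or malformed manifest:
        # keep serving with the fallback instead of failing the request path.
        return fallback
```

Because the pointer is written last, a pipeline that crashes halfway through leaves the previous `current.json` untouched, which is why serving can stay available while a pipeline run fails.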
Rollback levers (what can we reverse?)¶
Use the smallest lever that fixes the user-facing issue.
1) Config/rules rollback (fast, common)¶
When to use:
- A bad rule or constraint caused empty or surprising recommendations.
- You need to revert request defaults or weights.
How:
- Use the config/rules rollback runbook: Runbook: Roll back config/rules
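To illustrate why a config/rules rollback is fast and auditable, here is a small in-memory sketch. It assumes an append-only version history in which a rollback re-applies an earlier version as a new entry; the class `ConfigHistory` and its fields are hypothetical and do not describe the suite's real storage or API. The linked runbook remains the authoritative procedure.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConfigHistory:
    """Hypothetical append-only store of config/rules versions."""

    versions: list[dict] = field(default_factory=list)
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, action: str, version: int, reason: str, **extra) -> None:
        self.audit_log.append({
            "action": action,
            "version": version,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
            **extra,
        })

    def apply(self, config: dict, reason: str) -> int:
        """Append a new version and return its index."""
        self.versions.append(copy.deepcopy(config))
        index = len(self.versions) - 1
        self._record("apply", index, reason)
        return index

    @property
    def current(self) -> dict:
        """The active config is always the most recent entry."""
        return self.versions[-1]

    def rollback(self, to_version: int, reason: str) -> int:
        """Re-apply an earlier version as a new entry; history is never rewritten."""
        if not 0 <= to_version < len(self.versions):
            raise IndexError(f"unknown config version {to_version}")
        self.versions.append(copy.deepcopy(self.versions[to_version]))
        index = len(self.versions) - 1
        self._record("rollback", index, reason, restored_from=to_version)
        return index
```

Recording the rollback as a new version, rather than deleting the bad one, keeps a trail of what was shipped and why it was reverted.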
2) Artifact/manifest rollback (pipelines layer)¶
When to use:
- A published artifact version is wrong (bad data, wrong window, bad computation).
- Freshness is OK, but relevance regressed immediately after a publish.
How:
- Roll back artifacts safely: How-to: Roll back artifacts safely
- Roll back the manifest pointer: How-to: Roll back to a previous artifact version
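As a rough picture of what a manifest pointer rollback involves, the sketch below repoints a `current.json` file at an earlier published version. The layout (`versions/<version>/` directories next to a `current.json` pointer) and the function name `rollback_pointer` are assumptions for illustration only; follow the linked how-to guides for the real procedure.

```python
import json
import os
import tempfile
from pathlib import Path


def rollback_pointer(root: Path, target_version: str) -> str:
    """Point `current.json` at `target_version`; return the version it replaced."""
    version_dir = root / "versions" / target_version
    if not version_dir.is_dir():
        # Refuse to point at a version that was never published.
        raise FileNotFoundError(f"no published artifacts for {target_version!r}")

    pointer = root / "current.json"
    previous = json.loads(pointer.read_text())["version"]

    # Only the pointer changes. Artifact files are not rewritten or deleted,
    # so the bad version stays on disk for later inspection.
    manifest = {
        "version": target_version,
        "artifacts": sorted(p.name for p in version_dir.iterdir()),
    }
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(manifest, tmp)
    os.replace(tmp_path, pointer)  # atomic swap on a single filesystem
    return previous
```

Returning the replaced version makes it easy to note in the incident log which artifact version was taken out of service.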
3) “Stop shipping” (hold changes)¶
When to use:
- Data quality is unreliable: the join rate is low, validation is failing, or a sample ratio mismatch (SRM) points to instrumentation issues.
- You need to stabilize observability before trying new algorithms.
How:
- Follow the evaluation workflow: How-to: run evaluation and make ship decisions
- Use recsys-eval troubleshooting when metrics don’t make sense: Troubleshooting: symptom -> cause -> fix
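The "stop shipping" conditions can be expressed as a simple gate. The sketch below is an assumption-laden example, not part of recsys-eval: the `EvalHealth` fields, the join rate definition (fraction of exposure records that join to outcome records), and the thresholds `min_join_rate=0.9` and `srm_alpha=0.001` are all hypothetical. The SRM check is a standard chi-square goodness-of-fit test with one degree of freedom.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalHealth:
    join_rate: float                # assumed: share of exposures joined to outcomes, 0..1
    validation_passed: bool         # did input validation succeed?
    control_users: int              # users observed in control
    treatment_users: int            # users observed in treatment
    expected_treatment_share: float = 0.5


def srm_p_value(h: EvalHealth) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = h.control_users + h.treatment_users
    expected_t = total * h.expected_treatment_share
    expected_c = total - expected_t
    if expected_t <= 0 or expected_c <= 0:
        raise ValueError("need traffic and a split strictly between 0 and 1")
    chi2 = ((h.treatment_users - expected_t) ** 2 / expected_t
            + (h.control_users - expected_c) ** 2 / expected_c)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def hold_reasons(h: EvalHealth,
                 min_join_rate: float = 0.9,
                 srm_alpha: float = 0.001) -> list[str]:
    """Return the reasons to hold changes; an empty list means no hold is indicated."""
    reasons = []
    if h.join_rate < min_join_rate:
        reasons.append(f"join rate {h.join_rate:.2f} is below {min_join_rate:.2f}")
    if not h.validation_passed:
        reasons.append("validation is failing")
    if srm_p_value(h) < srm_alpha:
        reasons.append("sample ratio mismatch (SRM) detected")
    return reasons
```

For example, `hold_reasons(EvalHealth(0.95, True, 5000, 5000))` returns an empty list, while a 5000 versus 6000 split on an expected 50/50 allocation trips the SRM check.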
First incident checklist (start here under pressure)¶
Pick the symptom that best matches what you see:
- Service is up, but recommendations are empty
  - Runbook: Runbook: Empty recs
- Service is up, but data looks stale
  - Runbook: Runbook: Stale manifest (artifact mode)
  - Pipelines runbook: Runbook: Stale artifacts
- Pipelines run failed
  - Runbook: Runbook: Pipeline failed
- Evaluation report looks wrong (joins low, SRM warning, impossible lift)
  - Start with: Interpreting results: how to go from report to decision
  - Then: Troubleshooting: symptom -> cause -> fix
Read next¶
- Production readiness checklist: Production readiness checklist (RecSys suite)
- Rollback config/rules runbook: Runbook: Roll back config/rules
- Pipelines rollback: How-to: Roll back artifacts safely
- Interpreting eval results: Interpreting results: how to go from report to decision