Operational reliability and rollback

This page explains how RecSys stays reliable in operation, which rollback levers exist, and how both fit into the RecSys suite.

Who this is for

Product owners and stakeholders who need confidence that “we can ship this safely”
Engineering leads who want a shared mental model of what can go wrong and how we recover
On-call/SRE who need the shortest path to the right runbook

What you will get

A clear model for what “healthy” means across serving, pipelines, and evaluation
The rollback levers that exist in each layer (and when to use which)
A first-incident checklist with links to the right runbooks

RecSys is designed so that:

Serving stays available even when offline pipelines fail.
Changes are reversible: you can roll back config/rules or artifact versions without redeploying everything.
Decisions are auditable: logs and reports can explain what was shipped and why.

The most important invariants are:

Pipelines publish artifacts first and update the “current” manifest pointer last.
The service reads “current” and can fall back safely when data is missing.
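The two invariants above can be sketched in a few lines. This is a minimal illustration, not the RecSys implementation: the on-disk layout (`versions/<version>/` plus a `current.json` pointer), the function names `publish` and `load_current`, and the use of a local filesystem are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(root: Path, version: str, artifacts: dict[str, bytes]) -> None:
    """Write every artifact for `version` first, then switch the pointer last."""
    version_dir = root / "versions" / version
    version_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in artifacts.items():
        (version_dir / name).write_bytes(payload)

    # Write the new manifest to a temp file, then atomically rename it over
    # "current.json" so a reader never observes a half-written pointer.
    manifest = {"version": version, "files": sorted(artifacts)}
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(manifest, handle)
    os.replace(tmp_path, root / "current.json")


def load_current(root: Path, fallback: dict[str, bytes]) -> dict[str, bytes]:
    """Read the artifacts named by the pointer; return `fallback` if anything is missing."""
    try:
        manifest = json.loads((root / "current.json").read_text())
        version_dir = root / "versions" / manifest["version"]
        return {name: (version_dir / name).read_bytes() for name in manifest["files"]}
    except (OSError, KeyError, ValueError):
        # Missing pointer, missing artifact, or unreadable manifest:
        # keep serving with the fallback instead of failing the request.
        return fallback
```

Because the pointer is updated only after all artifacts are written, a pipeline that crashes mid-publish leaves “current” pointing at the previous, complete version.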

Use the smallest lever that fixes the user-facing issue.

When to use:

How:

When to use:

A published artifact version is wrong (bad data, wrong window, bad computation).
Freshness is OK, but relevance regressed immediately after a publish.
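If “current” is a manifest pointer, rolling back an artifact version can amount to re-pointing it at a version that was already published. The sketch below shows that idea only; it assumes a hypothetical layout where each version lives in `versions/<version>/` and `current.json` names the active one, and the `rollback` function is illustrative. Follow the relevant runbook for the actual procedure.

```python
import json
import os
import tempfile
from pathlib import Path


def rollback(root: Path, target_version: str) -> None:
    """Re-point "current" at an already-published version (hypothetical layout)."""
    version_dir = root / "versions" / target_version
    if not version_dir.is_dir():
        raise FileNotFoundError(f"version {target_version!r} was never published")

    manifest = {
        "version": target_version,
        "files": sorted(p.name for p in version_dir.iterdir() if p.is_file()),
    }
    # Atomic swap: write to a temp file, then rename over the pointer.
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(manifest, handle)
    os.replace(tmp_path, root / "current.json")
```

No artifact is rewritten or deleted here, so the bad version remains available for later diagnosis.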

How:

When to use:

Data quality is unreliable (join-rate is bad, validation is failing, SRM indicates instrumentation issues).
You need to stabilize observability before trying new algorithms.
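As an illustration of one of these signals, the sketch below assumes SRM means sample ratio mismatch and checks a two-arm split with a chi-square goodness-of-fit test. The function name, the 50/50 default split, and the 0.001 alert threshold mentioned afterwards are assumptions for the example, not values taken from recsys-eval.

```python
import math


def srm_p_value(control: int, treatment: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = control + treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control - expected_control) ** 2 / expected_control
            + (treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no SciPy dependency is needed.
    return math.erfc(math.sqrt(chi2 / 2))
```

A very small p-value (for example below 0.001, a commonly used but here assumed threshold) suggests the observed split is unlikely under the intended allocation, which points at assignment or logging problems rather than a real treatment effect.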

How:

Follow the evaluation workflow: How-to: run evaluation and make ship decisions
Use recsys-eval troubleshooting when metrics don’t make sense: Troubleshooting: symptom -> cause -> fix

Pick the symptom that best matches what you see: