Production readiness checklist (RecSys suite)

This guide walks through a production readiness checklist for the RecSys suite so that you can verify a deployment in a reliable, repeatable way.

Who this is for

Developers, platform/SRE, and security reviewers preparing to run recsys-service in production.

What you will get

A practical checklist to catch the most common “we went live and it broke” gaps: auth/tenancy, data modes, privacy, observability, runbooks, backups, and safe rollout/rollback.

0) Decide your serving mode (DB-only vs artifact/manifest)

Pick a serving mode per tenant/environment:
  DB-only mode (fastest pilot; signals live in Postgres)
  Artifact/manifest mode (production-like; pipelines publish artifacts + manifest)
Read: Data modes
Document the choice for your deployment (values/env + runbooks).

1) Tenancy and authentication

See: Admin API + local bootstrap

2) Data contracts, logging, and privacy

Decide what identifiers are allowed (user_id / anonymous_id / session_id).
Confirm exposure/outcome logging design:
  Required IDs + correlation strategy (request_id, item_id, subject id)
  Retention policy and access control
If using exposure hashing, set and rotate EXPOSURE_HASH_SALT as a secret.
Document your PII stance (what fields are considered PII in your org).
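
To make the exposure hashing item concrete, here is a minimal sketch of salted identifier hashing. It assumes a keyed HMAC-SHA256 scheme; the algorithm recsys-service actually uses is not stated here, and the helper names `load_salt` and `hash_subject_id` are hypothetical.

```python
import hashlib
import hmac
import os


def load_salt(env=os.environ) -> bytes:
    """Read EXPOSURE_HASH_SALT from the environment (assumed to be injected as a secret)."""
    value = env.get("EXPOSURE_HASH_SALT", "")
    if not value:
        raise RuntimeError("EXPOSURE_HASH_SALT is not set")
    return value.encode("utf-8")


def hash_subject_id(subject_id: str, salt: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of an identifier, keyed by the salt.

    Assumption: HMAC-SHA256 stands in for whatever scheme the service uses.
    """
    if not salt:
        raise ValueError("salt must be non-empty")
    return hmac.new(salt, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

Under any keyed scheme like this, rotating the salt changes every hash, so exposure and outcome logs written before and after a rotation will not join on the hashed ID. It is worth deciding how the rotation schedule interacts with your retention policy.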


3) Pipelines readiness (artifact/manifest mode only)

Confirm artifact publishing is automated (scheduler) and has an owner/on-call.
Confirm you can roll back the manifest safely.
Define freshness SLOs and alerting.
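
A freshness SLO can be reduced to one comparison: the age of the currently served manifest against a maximum allowed age. The sketch below assumes you can obtain a timezone-aware publish timestamp for the manifest; where that timestamp comes from, and the function names, are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def manifest_age(published_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Age of the manifest; both datetimes must be timezone-aware."""
    now = now or datetime.now(timezone.utc)
    return now - published_at


def breaches_freshness_slo(
    published_at: datetime, max_age: timedelta, now: Optional[datetime] = None
) -> bool:
    """True when the manifest is older than the SLO allows and should alert."""
    return manifest_age(published_at, now) > max_age
```

A scheduled job could call `breaches_freshness_slo(published_at, timedelta(hours=24))` and page the pipeline owner when it returns `True`; the 24-hour threshold is only an example value.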


4) Database and migrations

Ensure Postgres is provisioned with backups and a tested restore procedure.
Confirm migrations are applied safely:
  preflight checks in CI
  an explicit migration job in production (not “hope MIGRATE_ON_START is fine”)
Document your rollback strategy for schema changes.
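
One possible CI preflight is a lint over the rendered production configuration that fails when MIGRATE_ON_START is switched on. This is a sketch only: the set of values the service treats as "enabled" and the `environment` label are assumptions, and `check_migration_config` is a hypothetical helper.

```python
# Assumption: these strings are what the service would interpret as "enabled".
TRUTHY = {"1", "true", "yes", "on"}


def check_migration_config(env: dict, environment: str) -> list:
    """Return human-readable violations; an empty list means the check passed."""
    problems = []
    value = str(env.get("MIGRATE_ON_START", "")).strip().lower()
    if environment == "production" and value in TRUTHY:
        problems.append(
            "MIGRATE_ON_START is enabled in production; "
            "run migrations as an explicit job instead"
        )
    return problems
```

A CI step could exit non-zero whenever the returned list is non-empty, which keeps the explicit-migration-job rule enforced rather than merely documented.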

See: Database migrations

5) Observability and runbooks

6) Performance and capacity

Run a load test against production-like data and record results.
Configure caching and backpressure based on observed behavior:
  RECSYS_CONFIG_CACHE_TTL, RECSYS_RULES_CACHE_TTL
  RECSYS_BACKPRESSURE_MAX_INFLIGHT, RECSYS_BACKPRESSURE_MAX_QUEUE
Decide if and how you will enable profiling endpoints (keep PPROF_ENABLED=false by default).
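
As a starting point for turning load-test results into a backpressure setting, Little's law relates mean concurrency to throughput and mean latency (concurrency = throughput × latency). The sketch below assumes RECSYS_BACKPRESSURE_MAX_INFLIGHT caps concurrent in-flight requests; the `headroom` multiplier and the function name are illustrative choices, not values recommended by the service.

```python
import math


def suggest_max_inflight(
    throughput_rps: float, mean_latency_s: float, headroom: float = 1.5
) -> int:
    """Estimate an in-flight cap from load-test observations.

    Little's law gives mean concurrency as throughput times mean latency;
    headroom leaves room for bursts above that mean.
    """
    if throughput_rps <= 0 or mean_latency_s <= 0:
        raise ValueError("throughput and latency must be positive")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    return math.ceil(throughput_rps * mean_latency_s * headroom)
```

For example, 200 requests per second at a mean latency of 0.25 s with a headroom of 2.0 suggests a cap of 100. Treat the result as an initial value to validate under load, not as a final setting.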

See: Performance and capacity

7) Safe rollout and rollback

Define “ship” and “rollback” procedures for:
  config and rules (admin API, audit log)
  artifact manifests (pipelines)
  algorithm version changes (deployments)
Confirm you can answer: “Which config/rules/algo version served this request?”
  meta.config_version, meta.rules_version, meta.algo_version in responses
Document gates and criteria for shipping.
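
To check the "which version served this request?" item, a client or smoke test can pull the three version fields out of a response and log them next to its request ID. The sketch assumes the response body is JSON with a top-level `meta` object holding those fields; the exact response shape and the helper name `extract_versions` are assumptions.

```python
import json

VERSION_KEYS = ("config_version", "rules_version", "algo_version")


def extract_versions(response_body: str) -> dict:
    """Return the version fields from a response's meta object.

    Raises KeyError if any field is absent, so a smoke test fails loudly
    when traceability is lost.
    """
    meta = json.loads(response_body).get("meta", {})
    missing = [key for key in VERSION_KEYS if key not in meta]
    if missing:
        raise KeyError("response meta is missing: " + ", ".join(missing))
    return {key: meta[key] for key in VERSION_KEYS}
```

Running a check like this as a post-deploy gate gives an early signal if a rollout or rollback left responses without version metadata.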

