Runbook: Service not ready¶
This guide shows how to runbook: Service not ready in a reliable, repeatable way.
Symptoms¶
/readyzreturns503 Service Unavailable- Kubernetes marks the pod
Ready=False(readiness probe failing) /healthzmay still return200 OK(process is running but not ready)
Quick triage (copy/paste)¶
Set:
BASE_URL=${BASE_URL:-http://localhost:8000}
Then run:
curl -fsS "$BASE_URL/healthz"
curl -fsS "$BASE_URL/readyz" || true
curl -fsS "$BASE_URL/health/detailed"
Interpretation:
healthzfailing means the process is not healthy (check container logs first).readyzfailing means at least one readiness dependency is unhealthy.health/detailedtells you which check(s) are unhealthy and why.
Warning
When sharing /health/detailed output (tickets, Slack), redact secrets and internal hostnames.
What readiness checks exist¶
In recsys-service, readiness includes:
basic(always healthy)database(Postgres ping) whenDATABASE_URLis configured- auth provider checks (for example: JWKS fetch) when JWT auth is enabled
Common causes and safe remediations¶
Check database is unhealthy¶
Likely causes:
- Postgres is down or not reachable from the service network
DATABASE_URLpoints to the wrong host/db or has invalid credentials- network policy / DNS / TLS issues
Checks:
- Local dev (docker compose):
cd api && make migrate-status - From a pod/container: run a simple query (for example
psql "$DATABASE_URL" -c 'select 1')
Safe remediations:
- Fix connectivity/credentials (secret wiring, DNS, firewall, DB availability)
- Apply migrations if they are behind:
- Local dev:
cd api && make migrate-up - Production: run your migration job (see Database migrations)
- If migrations fail: see Runbook: Database migration issues
Auth/JWKS check is unhealthy (JWT enabled)¶
Likely causes:
JWT_JWKS_URLis unreachable from the pod (egress/DNS)- the JWKS host is not allowlisted (
AUTH_JWKS_ALLOWED_HOSTS) - TLS/certificate problems
Checks:
- Verify configuration:
JWT_JWKS_URL,AUTH_JWKS_ALLOWED_HOSTS - From the same network as the service, check reachability:
curl -fsS "$JWT_JWKS_URL" >/dev/null
Safe remediations:
- Fix egress/DNS/TLS for the JWKS URL
- Update
AUTH_JWKS_ALLOWED_HOSTSto match the JWKS hostname
Warning
Do not use insecure JWKS settings (AUTH_ALLOW_INSECURE_JWKS=true) in production.
If /readyz is OK but traffic still fails¶
Readiness only covers baseline dependencies. If /readyz is 200 but:
/v1/recommendreturns empty → see Runbook: Empty recs- you are in artifact/manifest mode and requests fail → verify
RECSYS_ARTIFACT_*config and manifest publishing (see Data modes)
Read next¶
- Failure modes & diagnostics: Failure modes & diagnostics
- Database migration issues: Runbook: Database migration issues
- Operations index: Operations