Observability
RecSys exposes Prometheus metrics at /metrics when the service runs with the default toolkit middleware. The repository includes starter monitoring assets:
- observability/prometheus-alerts.yaml
- observability/grafana-dashboard.json
Treat these as templates. Tune thresholds to your catalog, traffic, latency budget, and rollback tolerance.
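To collect the metrics, Prometheus needs a scrape job that points at the service. The fragment below is a minimal sketch, not the project's shipped configuration: the job name, target host, port, and scrape interval are all placeholders to replace with your own values. Only the /metrics path comes from this page.

```yaml
# Sketch of a Prometheus scrape job for RecSys.
# job_name, the target address, and scrape_interval are hypothetical values.
scrape_configs:
  - job_name: recsys
    metrics_path: /metrics          # path exposed by the default toolkit middleware
    scrape_interval: 15s            # placeholder; align with your alert windows
    static_configs:
      - targets:
          - "recsys.example.internal:8080"   # placeholder host and port
```

A static target keeps the sketch short. In a cluster, service discovery would usually replace static_configs.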
Key Signals
| Signal | Metric |
|---|---|
| Request rate and outcomes | recsys_recommendation_requests_total |
| Latency | recsys_recommendation_latency_seconds |
| Returned item count | recsys_recommendation_returned_items |
| Warning count | recsys_recommendation_warnings |
| Artifact load failures | recsys_artifact_load_failures_total |
| Manifest freshness | recsys_artifact_manifest_age_seconds |
The built-in labels intentionally avoid tenant IDs, request IDs, user IDs, and artifact URIs. Use logs with request IDs for detailed incident reconstruction.
First Alerts
Start with alerts for:
- error or overload rate above the agreed guardrail,
- empty recommendation rate above the agreed guardrail,
- p95 latency regression,
- stale manifests,
- artifact load failures.
When an alert fires, use the operations runbooks for empty recommendations, stale manifests, and service readiness.
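Several of the alerts listed above could be sketched as Prometheus alerting rules along the following lines. This is an illustration under stated assumptions, not the contents of observability/prometheus-alerts.yaml. It assumes a hypothetical outcome label on the request counter, that the latency metric is a histogram with _bucket series, and that manifest age is a gauge. Every threshold, window, and severity value is a placeholder. The p95 rule uses a fixed ceiling as a stand-in for a true regression check against a baseline. The empty-recommendation rule is left out because it depends on how returned-item counts are recorded, which this page does not specify.

```yaml
# Sketch only: label names, thresholds, and durations are assumptions.
groups:
  - name: recsys-first-alerts
    rules:
      - alert: RecsysErrorOrOverloadRateHigh
        # Assumes a hypothetical "outcome" label with "error" and "overload" values.
        expr: |
          sum(rate(recsys_recommendation_requests_total{outcome=~"error|overload"}[5m]))
            /
          sum(rate(recsys_recommendation_requests_total[5m]))
            > 0.01
        for: 10m
        labels:
          severity: page

      - alert: RecsysLatencyP95High
        # Assumes the latency metric is a histogram; 0.25 s is a placeholder budget.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(recsys_recommendation_latency_seconds_bucket[5m]))
          ) > 0.25
        for: 10m
        labels:
          severity: ticket

      - alert: RecsysManifestStale
        # Assumes the age metric is a gauge in seconds; 21600 s (6 h) is a placeholder.
        expr: max(recsys_artifact_manifest_age_seconds) > 21600
        for: 15m
        labels:
          severity: ticket

      - alert: RecsysArtifactLoadFailures
        # Fires on any load failure counted in the last 15 minutes.
        expr: sum(increase(recsys_artifact_load_failures_total[15m])) > 0
        labels:
          severity: page
```

Check the label names against the actual /metrics output before adopting any of these expressions, since a selector on a label that does not exist matches nothing and the alert will never fire.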