Observability

RecSys exposes Prometheus metrics at /metrics when the service runs with the default toolkit middleware. The repository includes starter monitoring assets:

  • observability/prometheus-alerts.yaml
  • observability/grafana-dashboard.json

Treat these as templates. Tune thresholds to your catalog, traffic, latency budget, and rollback tolerance.

Key Signals

Signal                       Metric
Request rate and outcomes    recsys_recommendation_requests_total
Latency                      recsys_recommendation_latency_seconds
Returned item count          recsys_recommendation_returned_items
Warning count                recsys_recommendation_warnings
Artifact load failures       recsys_artifact_load_failures_total
Manifest freshness           recsys_artifact_manifest_age_seconds
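
As a quick smoke test, the sketch below checks that a scrape of /metrics contains every metric family listed above. It is an illustrative helper, not part of RecSys: the function names are hypothetical, and it assumes the endpoint returns the standard Prometheus text exposition format, where histogram and summary families appear as `_bucket`, `_sum`, and `_count` series rather than under the bare name.

```python
import re

# Metric families named in the Key Signals table.
EXPECTED_METRICS = (
    "recsys_recommendation_requests_total",
    "recsys_recommendation_latency_seconds",
    "recsys_recommendation_returned_items",
    "recsys_recommendation_warnings",
    "recsys_artifact_load_failures_total",
    "recsys_artifact_manifest_age_seconds",
)

# Series suffixes that histogram and summary families expose
# instead of (or alongside) the bare family name.
FAMILY_SUFFIXES = ("_bucket", "_sum", "_count")

# A sample line starts with the metric name.
SAMPLE_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*")


def sample_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected families that have no series in the scrape."""
    names = sample_names(exposition_text)
    missing = []
    for family in expected:
        series = {family} | {family + suffix for suffix in FAMILY_SUFFIXES}
        if names.isdisjoint(series):
            missing.append(family)
    return missing
```

Feed it the body of an HTTP GET against your service's /metrics URL. A family reported as missing is not necessarily a defect: with many Prometheus client libraries, a labeled metric only appears after its first observation, so send at least one recommendation request before checking.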

The built-in labels intentionally avoid tenant IDs, request IDs, user IDs, and artifact URIs. Use logs with request IDs for detailed incident reconstruction.
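
If you add custom labels on top of the built-in ones, a small audit can help preserve this property. The sketch below scans Prometheus text-format output for labels that tend to explode cardinality. The label names in the deny list (`tenant_id`, `request_id`, `user_id`, `artifact_uri`) are assumptions chosen for illustration; substitute whatever names your own instrumentation might introduce.

```python
import re

# Hypothetical label names standing in for the identifiers
# that the built-in labels intentionally avoid.
HIGH_CARDINALITY_LABELS = frozenset(
    {"tenant_id", "request_id", "user_id", "artifact_uri"}
)

# Matches one name="value" pair; the value may contain escaped quotes.
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="(?:[^"\\]|\\.)*"')


def risky_labels(exposition_text, prefix="recsys_", denied=HIGH_CARDINALITY_LABELS):
    """Map each sample name starting with `prefix` to the denied labels it carries."""
    findings = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Only labeled sample lines are of interest.
        if not line or line.startswith("#") or "{" not in line:
            continue
        name, _, rest = line.partition("{")
        if not name.startswith(prefix):
            continue
        used = {match.group(1) for match in LABEL_PAIR.finditer(rest)}
        hits = used & denied
        if hits:
            findings.setdefault(name, set()).update(hits)
    return findings
```

An empty result means none of the scanned series carry a denied label name. The check looks at label names only, so a high-cardinality value hidden under an innocuous name would still pass.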

First Alerts

Start with alerts for:

  • error or overload rate above the agreed guardrail,
  • empty recommendation rate above the agreed guardrail,
  • p95 latency regression,
  • stale manifests,
  • artifact load failures.
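
To reason about thresholds before committing them to alert rules, the checks in this list can be sketched offline. The helpers below are hypothetical and make two assumptions the text above does not confirm: that the latency metric is a Prometheus histogram with cumulative buckets, and that you can derive "bad" and "total" request counts over a window from the counters. The quantile estimate uses linear interpolation within a bucket, similar in spirit to PromQL's `histogram_quantile`.

```python
import math


def rate_exceeds(bad, total, guardrail):
    """True when bad / total is above the guardrail; no traffic never alerts."""
    if total <= 0:
        return False
    return bad / total > guardrail


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    bound, with math.inf as the last bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with an +Inf bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Quantile falls past the last finite bound, or the bucket is
            # empty: report the previous bound instead of interpolating.
            if bound == math.inf or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def manifest_is_stale(age_seconds, max_age_seconds):
    """True when the manifest age gauge is above the allowed maximum."""
    return age_seconds > max_age_seconds
```

For example, with a hypothetical 5% guardrail, `rate_exceeds(6, 100, 0.05)` is true while `rate_exceeds(5, 100, 0.05)` is not. The estimated p95 can be compared against a baseline from a known-good window to detect a regression. Every threshold here is a placeholder for the guardrails you agree on.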

When an alert fires, use the operations runbooks for empty recommendations, stale manifests, and service readiness.