Suite architecture¶
This page explains the architecture of the RecSys suite and how its components fit together.
Canonical architecture flow
The canonical end-to-end flow is described in How it works. This page focuses on component boundaries and operational levers.
Who this is for¶
- Developers and platform engineers integrating the suite
- Recommendation engineers who want a “whole system” mental model
- SRE / on-call engineers who need to know rollback levers and common failure modes
What you will get¶
- The end-to-end data flow from serving → logging → pipelines → evaluation
- Where state lives in each mode (DB-only vs artifact/manifest)
- The IDs that tie everything together (tenant_id, surface, request_id)
- The operational levers for safe shipping and rollback
One-screen mental model¶
flowchart LR
C[Client] -->|/v1/recommend| S[recsys-service]
S --> A[recsys-algo]
S --> E[(Exposure logs)]
C --> O[(Outcome logs)]
E --> P[recsys-pipelines]
O --> P
P --> M[(Manifest pointer)]
M --> S
E --> V[recsys-eval]
O --> V
V --> D[(Report + ship/rollback decision)]
See also: Suite Context
Components and responsibilities¶
recsys-service (online)¶
Responsibilities:
- Low-latency HTTP API (/v1/recommend, /v1/similar)
- Tenancy and scoping (tenant headers/JWT claims, surfaces, segments)
- Caching and backpressure
- Exposure logging (for evaluation and auditability)
Reads:
- Tenant config/rules (versioned) from Postgres (optional but recommended)
- Signals either:
- from Postgres tables (DB-only mode), or
- from artifacts referenced by a manifest pointer (artifact/manifest mode)
Writes:
- Exposure logs (file-based JSONL by default)
- Optional audit logs for admin writes
Start here:
- Admin/bootstrap: Admin API + local bootstrap (recsys-service)
- Config reference: recsys-service configuration
recsys-algo (ranking core)¶
Responsibilities:
- Deterministic ranking/scoring of candidate sets
- Constraints/rules/diversity with explainability support
- Ports-and-adapters design so storage backends can vary
Start here:
recsys-pipelines (offline artifact builder)¶
Responsibilities:
- Ingest and canonicalize raw events
- Compute versioned artifacts (e.g., popularity, co-occurrence)
- Publish artifacts and update the “current” pointer (manifest)
- Provide rollback by pointer swap (never point serving at missing blobs)
Start here:
- Start here
- Artifact lifecycle: Artifacts and versioning
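The publish-then-swap behaviour listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the recsys-pipelines implementation: a local directory stands in for object storage, and the names `publish`, `set_current`, `current_version`, the `current.json` pointer file, and the `artifacts/<version>/` layout are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def set_current(root: Path, version: str, names: list[str]) -> None:
    """Point "current" at a version, refusing if any listed blob is missing."""
    version_dir = root / "artifacts" / version
    missing = [n for n in names if not (version_dir / n).is_file()]
    if missing:
        # Never point serving at missing blobs: fail before touching the pointer.
        raise FileNotFoundError(f"version {version} is missing blobs: {missing}")
    manifest = {"version": version, "artifacts": names}
    # Write to a temp file, then rename, so readers never see a partial manifest.
    fd, tmp = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp, root / "current.json")


def publish(root: Path, version: str, blobs: dict[str, bytes]) -> None:
    """Write all versioned blobs first, then swap the pointer as the last step."""
    version_dir = root / "artifacts" / version
    version_dir.mkdir(parents=True, exist_ok=True)
    for name, data in blobs.items():
        (version_dir / name).write_bytes(data)
    set_current(root, version, sorted(blobs))


def current_version(root: Path) -> str:
    """Return the version the pointer currently references."""
    return json.loads((root / "current.json").read_text())["version"]
```

In this sketch a rollback is just another call to `set_current` with an older version whose blobs are still present; no artifact is rewritten or deleted.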
recsys-eval (evaluation + decision support)¶
Responsibilities:
- Validate logs against strict schemas
- Compute offline regression gates and online experiment analysis
- Produce reports and a decision trail (“ship / hold / rollback”)
Start here:
- Overview: recsys-eval
- Interpreting results: Interpreting results: how to go from report to decision
The key identifiers (how the system joins up)¶
- tenant_id: organization boundary for data isolation and config (see tenants.external_id)
- surface: where recommendations are shown (e.g., home, pdp); also a signal namespace by default
- segment: optional sub-slice within a surface (e.g., new_users)
- request_id: join key across exposure logs and outcomes; make it stable per request
- user_id: stable, pseudonymous identifier (avoid raw PII)
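To make the role of request_id concrete, here is a small sketch of joining exposure and outcome records. It assumes both logs are JSONL with `tenant_id` and `request_id` fields; the other fields (`items`, `item_id`, `event`) and the function name `join_logs` are hypothetical and do not describe the actual log schemas.

```python
import json
from collections import defaultdict


def join_logs(exposure_lines, outcome_lines):
    """Attach outcomes to exposures that share (tenant_id, request_id)."""
    outcomes = defaultdict(list)
    for line in outcome_lines:
        rec = json.loads(line)
        # Scope the join by tenant so identical request IDs never cross tenants.
        outcomes[(rec["tenant_id"], rec["request_id"])].append(rec)

    joined = []
    for line in exposure_lines:
        exp = json.loads(line)
        key = (exp["tenant_id"], exp["request_id"])
        # Exposures without outcomes are kept with an empty list.
        joined.append({**exp, "outcomes": outcomes.get(key, [])})
    return joined
```

If the client generated a new request_id when reporting an outcome, the lookup would find nothing and the exposure would appear to have no outcomes, which is why the ID should stay stable per request.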
Related:
- Namespacing: Surface namespaces
- Logging: Exposure logging and attribution
Data modes: DB-only vs artifact/manifest¶
There are two supported serving modes:
- DB-only mode: signals live in Postgres tables and are read directly by the service.
- Artifact/manifest mode: pipelines publish versioned blobs to object storage and the service reads the current versions via a manifest pointer.
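One way to picture the two modes is as two implementations of the same read interface. The sketch below is illustrative only: `SignalSource`, `DbSignals`, `ManifestSignals`, the `popularity` method, and the manifest and blob shapes are hypothetical, and plain dictionaries and callables stand in for Postgres and object storage.

```python
from typing import Callable, Protocol


class SignalSource(Protocol):
    def popularity(self, tenant_id: str, surface: str) -> list[str]: ...


class DbSignals:
    """DB-only mode: signals are read directly from table rows."""

    def __init__(self, rows: dict[tuple[str, str], list[str]]):
        self.rows = rows  # stand-in for a Postgres table

    def popularity(self, tenant_id: str, surface: str) -> list[str]:
        return self.rows.get((tenant_id, surface), [])


class ManifestSignals:
    """Artifact/manifest mode: resolve the manifest, then read the blob it names."""

    def __init__(self, read_manifest: Callable[[], dict],
                 read_blob: Callable[[str], dict]):
        self.read_manifest = read_manifest  # returns the current manifest
        self.read_blob = read_blob          # fetches a blob by its key

    def popularity(self, tenant_id: str, surface: str) -> list[str]:
        manifest = self.read_manifest()
        blob_key = manifest["artifacts"]["popularity"]
        blob = self.read_blob(blob_key)
        return blob.get(f"{tenant_id}/{surface}", [])
```

In this sketch the caller only sees `popularity(...)`; which mode is active changes where the data comes from, and in manifest mode a pointer swap changes the data without any change to the caller.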
This tradeoff is explained here:
Ship and rollback: what changes in production¶
Common production levers:
1) Config and rules
- Update via admin endpoints (versioned, optimistic concurrency).
- Invalidate service caches after updates.
See:
2) Artifacts / manifest pointer
- Pipelines publish new artifacts and swap the manifest pointer last.
- Rollback is a pointer swap to the last known-good manifest.
See:
- Pipelines rollback: How-to: Roll back to a previous artifact version
- Suite runbook (service): Runbook: Roll back config/rules
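For the config and rules lever, optimistic concurrency can be pictured as a compare-and-set on a version number, followed by a cache invalidation. The sketch below is an in-memory illustration only; `ConfigStore`, `VersionConflict`, and the `on_update` callback are hypothetical names and do not describe the admin API.

```python
from typing import Callable


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ConfigStore:
    def __init__(self, on_update: Callable[[str], None] = lambda tenant_id: None):
        self._configs: dict[str, tuple[int, dict]] = {}
        self._on_update = on_update  # e.g. a cache invalidation hook

    def get(self, tenant_id: str) -> tuple[int, dict]:
        """Return (version, config); version 0 means no config yet."""
        return self._configs.get(tenant_id, (0, {}))

    def update(self, tenant_id: str, expected_version: int, config: dict) -> int:
        current, _ = self.get(tenant_id)
        if current != expected_version:
            # Someone else wrote first: reject rather than silently overwrite.
            raise VersionConflict(
                f"expected version {expected_version}, found {current}"
            )
        new_version = current + 1
        self._configs[tenant_id] = (new_version, dict(config))
        self._on_update(tenant_id)  # invalidate caches only after a successful write
        return new_version
```

A writer that receives `VersionConflict` would re-read the current version, reapply its change, and retry, so concurrent edits are never lost without notice.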
Common failure modes (and where to look)¶
- Empty recs
- missing signals (DB-only) or missing/incorrect manifest (artifact mode)
- surface/namespace mismatch (home data queried under pdp)
- overly strict constraints/rules
See:
- Runbook: Empty recs
- Forbidden / tenant scope errors
- tenant headers missing or mismatched
See: