Concepts: how to understand recsys-eval¶
This page explains the core concepts behind recsys-eval and how it fits into the RecSys suite.
Who this is for¶
Anyone. This is the "map" of the system.
What you will get¶
- The core mental model in 5 minutes
- The four workflows and when to use each
- A small glossary so words like "exposure" stop being mysterious
The mental model in one picture¶
You log:
- what you showed (exposures)
- what users did (outcomes)
- who was in A vs B (assignments, for experiments)
recsys-eval reads those logs and produces a report and, optionally, a decision.
exposures (ranked list shown)
+ outcomes (clicks, purchases, etc.)
+ assignments (control vs candidate)
-------------------------------------------> recsys-eval
-------------------------------------------> report.json (+ optional decision.json)
Glossary¶
- request_id:
A single recommendation moment. One screen, one call, one "ranked list shown". In recsys-eval, request_id is the main join key.
- exposure:
What you showed to the user for a request_id: the ranked list of items plus context (tenant, surface, etc.).
- outcome:
What the user did after the exposure: click, conversion, revenue, etc.
- assignment:
Which experiment variant a request/user belongs to (control or candidate).
- segment:
A slice such as tenant + surface + device. Segments are where hidden problems show up. Global averages lie.
- guardrail:
A metric that must not regress even if a primary metric improves. Typical guardrails: latency, errors, empty recommendation rate.
- propensity (OPE only):
The probability that a policy (in practice, the one that produced the logs) would show a given item in a given position. If you do not have correct propensities, OPE can confidently produce nonsense.
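To make the glossary concrete, here is a minimal sketch of exposures and outcomes joined on request_id and then sliced by segment. The class names, field names other than request_id, and both helper functions are hypothetical illustrations, not the recsys-eval schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:  # hypothetical record shape, not the recsys-eval schema
    request_id: str
    tenant: str
    surface: str
    items: tuple  # ranked list of item ids, best position first


@dataclass(frozen=True)
class Outcome:  # hypothetical record shape
    request_id: str
    item_id: str
    event: str  # e.g. "click" or "conversion"


def join_by_request(exposures, outcomes):
    """Attach outcomes to exposures via request_id; return unmatched outcomes too.

    Assumes each request_id appears in at most one exposure.
    """
    by_request = defaultdict(list)
    for outcome in outcomes:
        by_request[outcome.request_id].append(outcome)
    joined = [(exp, by_request.pop(exp.request_id, [])) for exp in exposures]
    # Whatever is left had no exposure to join to.
    orphans = [o for group in by_request.values() for o in group]
    return joined, orphans


def click_rate_by_segment(joined):
    """Share of requests with at least one click, per (tenant, surface) segment."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for exp, outs in joined:
        segment = (exp.tenant, exp.surface)
        shown[segment] += 1
        if any(o.event == "click" for o in outs):
            clicked[segment] += 1
    return {seg: clicked[seg] / shown[seg] for seg in shown}


exposures = [
    Exposure("r1", "shop_a", "home", ("i1", "i2", "i3")),
    Exposure("r2", "shop_a", "home", ("i2", "i4", "i5")),
    Exposure("r3", "shop_b", "cart", ("i9", "i1", "i7")),
]
outcomes = [
    Outcome("r1", "i2", "click"),
    Outcome("r3", "i9", "click"),
    Outcome("r404", "i1", "click"),  # no matching exposure: a join problem
]

joined, orphans = join_by_request(exposures, outcomes)
rates = click_rate_by_segment(joined)
print(rates)
print("outcomes without an exposure:", len(orphans))
```

Counting the outcomes that fail to join is a cheap way to notice a broken join before trusting any metric computed from it.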
The four workflows (pick the right tool)¶
1) Offline evaluation¶
Question:
- "If we rank differently, does it better match what users later did?"
Inputs:
- exposures + outcomes
Outputs:
- ranking metrics (NDCG@K, Recall@K, MAP@K, etc.)
- segment breakdowns
- optional confidence intervals
When to use:
- before shipping changes
- regression gate in CI
Common pitfalls:
- your join from exposures to outcomes is broken
- your "ground truth" is too sparse or biased
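As an illustration of the ranking metrics listed above, the following sketch computes Recall@K and NDCG@K for a single request with binary relevance (an item is relevant if the user later interacted with it). The function names are hypothetical and this is the textbook formulation, not necessarily how recsys-eval implements it; it assumes the ranked list contains no duplicates.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k of the ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@K with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets 1 / log2(2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ranking puts every relevant item (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["i1", "i2", "i3", "i4"]   # what was shown, in order
relevant = {"i2", "i9"}             # what the user later interacted with

recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
print(f"Recall@3={recall:.3f} NDCG@3={ndcg:.3f}")
```

Note that "i9" was never shown, so it can never be a hit: sparse or biased ground truth caps these metrics regardless of ranker quality.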
2) Experiment analysis (A/B)¶
Question:
- "In production, did variant B outperform A, and did we stay within guardrails?"
Inputs:
- exposures + outcomes + assignments
Outputs:
- KPI deltas (CTR, conversion, etc.)
- confidence intervals or p-values (depending on config)
- guardrail checks
- optional decision artifact (ship/hold/rollback)
When to use:
- shipping decisions
Common pitfalls:
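- SRM (sample ratio mismatch): observed bucket sizes deviate from the configured split, which usually points to an assignment or logging bug
- too many segments: testing many slices without correction produces false positives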
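The SRM pitfall and the KPI delta above can be sketched with standard statistics. This is a minimal illustration using a chi-square goodness-of-fit test (1 degree of freedom) and a normal-approximation confidence interval; the function names are hypothetical, the interval treats requests as independent, and none of this is claimed to be recsys-eval's own method.

```python
import math


def srm_p_value(n_control, n_candidate, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed bucket sizes."""
    total = n_control + n_candidate
    exp_control = total * expected_control_share
    exp_candidate = total - exp_control
    chi2 = (
        (n_control - exp_control) ** 2 / exp_control
        + (n_candidate - exp_candidate) ** 2 / exp_candidate
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def ctr_delta_ci(clicks_a, n_a, clicks_b, n_b, z=1.96):
    """CTR difference (B minus A) with a normal-approximation interval.

    z=1.96 corresponds to roughly 95% coverage.
    """
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    delta = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return delta, delta - z * se, delta + z * se


balanced = srm_p_value(5000, 5000)
skewed = srm_p_value(5000, 5300)
print(f"SRM p-value, balanced buckets: {balanced:.4f}")
print(f"SRM p-value, skewed buckets:   {skewed:.4f}")

delta, low, high = ctr_delta_ci(clicks_a=500, n_a=5000, clicks_b=550, n_b=5000)
print(f"CTR delta={delta:.4f}, interval=[{low:.4f}, {high:.4f}]")
```

A very small SRM p-value suggests fixing assignment or logging before reading any KPI delta. In the CTR example the interval spans zero, so a one-point lift on this much traffic is not yet evidence of a win.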
3) Off-policy evaluation (OPE)¶
Question:
- "Can we estimate impact from logs without running an experiment?"
Inputs:
- exposures + outcomes + propensities
Outputs:
- IPS/SNIPS/DR estimates and diagnostics
- warnings about variance and missing propensities
When to use:
- directional iteration when A/B is expensive
Common pitfalls:
- missing overlap: the new policy behaves outside the support of the logged one
- near-zero propensities: variance explodes
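The IPS and SNIPS estimators named above can be sketched in a few lines. This simplified version treats each logged request as one action with a single logging propensity and a single target-policy propensity (real slate propensities are more involved); the function name, the optional weight cap, and the effective-sample-size diagnostic are illustrative assumptions, not the recsys-eval implementation.

```python
def ips_and_snips(logs, max_weight=None):
    """Estimate the target policy's mean reward from logged data.

    logs: iterable of (reward, logging_propensity, target_propensity).
    Returns (ips, snips, effective_sample_size).
    """
    weights = []
    weighted_rewards = []
    for reward, p_log, p_target in logs:
        if p_log <= 0.0:
            # The logged policy never takes this action: no overlap.
            raise ValueError("logging propensity must be > 0")
        w = p_target / p_log
        if max_weight is not None:
            w = min(w, max_weight)  # clipping trades variance for bias
        weights.append(w)
        weighted_rewards.append(w * reward)

    n = len(weights)
    weight_sum = sum(weights)
    if n == 0 or weight_sum == 0.0:
        raise ValueError("need at least one record with non-zero weight")

    ips = sum(weighted_rewards) / n
    snips = sum(weighted_rewards) / weight_sum
    ess = weight_sum ** 2 / sum(w * w for w in weights)
    return ips, snips, ess


logs = [
    (1, 0.5, 0.8),
    (0, 0.5, 0.2),
    (1, 0.1, 0.4),  # small logging propensity: this record gets weight 4
    (0, 0.9, 0.6),
]
ips, snips, ess = ips_and_snips(logs)
print(f"IPS={ips:.3f} SNIPS={snips:.3f} effective sample size={ess:.2f} of {len(logs)}")
```

With binary rewards the IPS estimate here exceeds 1, which is impossible for a true click rate; that is the variance problem in miniature. SNIPS stays within the reward range, and an effective sample size far below the record count is a warning that a few heavily weighted records dominate the estimate.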
4) Interleaving¶
Question:
- "Between ranker A and B, which one wins more often on the same traffic?"
Inputs:
- ranker A results + ranker B results + outcomes (often clicks)
Outputs:
- win rates, tie rate, p-value
When to use:
- comparing two rankers or weight sets quickly
- when A/B would be too slow or noisy
Common pitfalls:
- you treat interleaving as a full business KPI replacement (it is not)
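The outputs listed above (win rates, tie rate, p-value) can be sketched as follows, assuming each request has already been credited to ranker A, ranker B, or a tie (for example by counting clicks on the items each ranker contributed). The function name is hypothetical, rates are taken over all requests, and the p-value comes from a two-sided exact sign test on the non-tied requests, which is one common choice rather than a statement about what recsys-eval uses.

```python
import math


def interleaving_summary(wins_a, wins_b, ties):
    """Win/tie rates over all requests plus a two-sided sign-test p-value."""
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("no requests to summarize")
    decided = wins_a + wins_b
    if decided == 0:
        p_value = 1.0  # only ties: no evidence either way
    else:
        # Under the null hypothesis each decided request is a fair coin flip.
        k = min(wins_a, wins_b)
        tail = sum(math.comb(decided, i) for i in range(k + 1)) / 2 ** decided
        p_value = min(1.0, 2.0 * tail)
    return {
        "win_rate_a": wins_a / total,
        "win_rate_b": wins_b / total,
        "tie_rate": ties / total,
        "p_value": p_value,
    }


summary = interleaving_summary(wins_a=30, wins_b=50, ties=20)
print(summary)
```

A small p-value says B is preferred more often than chance would explain on this traffic. It says nothing about the size of the effect on conversion or revenue.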
Where this fits in the bigger system¶
Typical stack:
- recsys-service: serves recs and logs exposures and outcomes
- recsys-pipelines: builds artifacts (popularity, co-occurrence, embeddings)
- recsys-algo: ranks and applies rules
- recsys-eval: measures and decides
recsys-eval is the "truth serum": it turns change claims into evidence.
Read next¶
- Metrics: what we measure and why
- Interpreting results: how to go from report to decision
- Online A/B workflow: Online A/B analysis in production
- CI gates: using recsys-eval in automation