
Concepts: how to understand recsys-eval

This page explains the core concepts behind recsys-eval and how it fits into the RecSys suite.

Who this is for

Anyone. This is the "map" of the system.

What you will get

  • The core mental model in 5 minutes
  • The four workflows and when to use each
  • A small glossary so words like "exposure" stop being mysterious

The mental model in one picture

You log:

  • what you showed (exposures)
  • what users did (outcomes)
  • who was in A vs B (assignments, for experiments)

recsys-eval reads those logs and produces a report and optional decision.

exposures (ranked list shown)
  + outcomes (clicks, purchases, etc.)
  + assignments (control vs candidate)
        |
        v
   recsys-eval
        |
        v
report.json (+ optional decision.json)
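
To make the three inputs concrete, here is a sketch of what one record of each kind might look like. The field names are illustrative assumptions, not recsys-eval's actual schema; the point is that all three streams carry the same request_id.

  # Illustrative records; field names are assumptions, not recsys-eval's schema.
  exposure = {
      "request_id": "req-123",
      "tenant": "shop-eu",
      "surface": "home",
      "ranked_items": ["sku-9", "sku-4", "sku-1"],  # what was shown, in order
  }

  outcome = {
      "request_id": "req-123",
      "item": "sku-4",
      "event": "click",        # could also be "purchase", "add_to_cart", ...
      "value": 0.0,            # e.g. revenue for purchase events
  }

  assignment = {
      "request_id": "req-123",
      "variant": "candidate",  # "control" or "candidate"
  }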

Glossary

  • request_id:

A single recommendation moment. One screen, one call, one "ranked list shown". In recsys-eval, request_id is the main join key; a small join sketch follows this glossary.

  • exposure:

What you showed to the user for a request_id: the ranked list of items plus context (tenant, surface, etc.).

  • outcome:

What the user did after the exposure: click, conversion, revenue, etc.

  • assignment:

Which experiment variant a request/user belongs to (control or candidate).

  • segment:

A slice such as tenant + surface + device. Segments are where hidden problems show up. Global averages lie.

  • guardrail:

A metric that must not regress even if a primary metric improves. Typical guardrails: latency, errors, empty recommendation rate.

  • propensity (OPE only):

The probability that the logging policy would show a given item in a given position. If you do not have correct propensities, OPE can confidently produce nonsense.
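
Putting the glossary together, here is a small, illustrative join in plain Python (not recsys-eval's API): outcomes and assignments are indexed by request_id, and each exposure is tagged with a segment key.

  from collections import defaultdict

  # Toy in-memory logs; field names are illustrative assumptions.
  exposures = [
      {"request_id": "r1", "tenant": "shop-eu", "surface": "home",
       "ranked_items": ["a", "b", "c"]},
      {"request_id": "r2", "tenant": "shop-eu", "surface": "pdp",
       "ranked_items": ["c", "a"]},
  ]
  outcomes = [{"request_id": "r1", "item": "b", "event": "click"}]
  assignments = [
      {"request_id": "r1", "variant": "candidate"},
      {"request_id": "r2", "variant": "control"},
  ]

  # Index the smaller streams by the shared join key.
  clicks_by_request = defaultdict(set)
  for o in outcomes:
      if o["event"] == "click":
          clicks_by_request[o["request_id"]].add(o["item"])
  variant_by_request = {a["request_id"]: a["variant"] for a in assignments}

  # One joined row per exposure, tagged with a segment (here tenant + surface).
  for exp in exposures:
      rid = exp["request_id"]
      print({
          "request_id": rid,
          "segment": (exp["tenant"], exp["surface"]),
          "variant": variant_by_request.get(rid),
          "clicked_items": clicks_by_request.get(rid, set()),
      })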

The four workflows (pick the right tool)

1) Offline evaluation

Question:

  • "If we rank differently, does it better match what users later did?"

Inputs:

  • exposures + outcomes

Outputs:

  • ranking metrics (NDCG@K, Recall@K, MAP@K, etc.); see the sketch below
  • segment breakdowns
  • optional confidence intervals

When to use:

  • before shipping changes
  • regression gate in CI

Common pitfalls:

  • your join from exposures to outcomes is broken
  • your "ground truth" is too sparse or biased
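
For intuition about the ranking metrics above, here is a self-contained NDCG@K and Recall@K sketch over one exposure and the items the user later clicked. Plain Python for illustration, not recsys-eval's implementation.

  import math

  def dcg_at_k(ranked_items, relevant, k):
      # Binary relevance: 1 if the shown item was later acted on, else 0.
      return sum(
          1.0 / math.log2(i + 2)                  # i is 0-based, hence +2
          for i, item in enumerate(ranked_items[:k])
          if item in relevant
      )

  def ndcg_at_k(ranked_items, relevant, k):
      ideal = dcg_at_k(sorted(relevant), relevant, k)  # best possible ordering
      return dcg_at_k(ranked_items, relevant, k) / ideal if ideal > 0 else 0.0

  def recall_at_k(ranked_items, relevant, k):
      if not relevant:
          return 0.0
      hits = sum(1 for item in ranked_items[:k] if item in relevant)
      return hits / len(relevant)

  # The exposure showed [a, b, c, d]; the user later clicked b and d.
  shown = ["a", "b", "c", "d"]
  clicked = {"b", "d"}
  print(ndcg_at_k(shown, clicked, k=3))    # ~0.39: only one clicked item in the top 3
  print(recall_at_k(shown, clicked, k=3))  # 0.5: one of two clicked items in the top 3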

2) Experiment analysis (A/B)

Question:

  • "In production, did variant B outperform A, and did we stay within guardrails?"

Inputs:

  • exposures + outcomes + assignments

Outputs:

  • KPI deltas (CTR, conversion, etc.)
  • confidence intervals or p-values (depending on config)
  • guardrail checks
  • optional decision artifact (ship/hold/rollback)

When to use:

  • shipping decisions

Common pitfalls:

  • SRM (sample ratio mismatch): buckets are not balanced
  • too many segments: false positives
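
A rough sketch of two checks from this workflow, in plain Python with illustrative field names: the CTR delta between variants and a chi-square check for SRM. recsys-eval's actual statistics depend on configuration; this only shows the idea.

  # Joined per-request rows: which variant served it and whether it got a click.
  rows = [
      {"variant": "control", "clicked": False},
      {"variant": "control", "clicked": True},
      {"variant": "candidate", "clicked": True},
      {"variant": "candidate", "clicked": True},
  ]

  def ctr(rows, variant):
      n = sum(1 for r in rows if r["variant"] == variant)
      clicks = sum(1 for r in rows if r["variant"] == variant and r["clicked"])
      return clicks / n if n else 0.0

  delta = ctr(rows, "candidate") - ctr(rows, "control")
  print(f"CTR delta: {delta:+.3f}")

  # SRM check: under a 50/50 split the bucket counts should be roughly equal.
  # Chi-square with 1 degree of freedom; a statistic above ~3.84 means p < 0.05.
  n_control = sum(1 for r in rows if r["variant"] == "control")
  n_candidate = sum(1 for r in rows if r["variant"] == "candidate")
  expected = (n_control + n_candidate) / 2
  chi2 = ((n_control - expected) ** 2 + (n_candidate - expected) ** 2) / expected
  print(f"SRM chi-square: {chi2:.2f}")
  if chi2 > 3.84:
      print("possible SRM: do not trust the KPI deltas until this is explained")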

3) Off-policy evaluation (OPE)

Question:

  • "Can we estimate impact from logs without running an experiment?"

Inputs:

  • exposures + outcomes + propensities

Outputs:

  • IPS/SNIPS/DR estimates and diagnostics; see the sketch below
  • warnings about variance and missing propensities

When to use:

  • directional iteration when A/B is expensive

Common pitfalls:

  • missing overlap: the new policy behaves outside the support of the logged one
  • near-zero propensities: variance explodes
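
To show why near-zero propensities matter, here is a minimal IPS and SNIPS sketch with illustrative records (plain Python, not recsys-eval's estimators, which come with more diagnostics).

  # Each logged record: the action the logging policy took, the propensity it
  # had of taking it, and the observed reward.
  logged = [
      {"action": "a", "propensity": 0.50, "reward": 1.0},
      {"action": "b", "propensity": 0.40, "reward": 0.0},
      {"action": "c", "propensity": 0.02, "reward": 1.0},  # rare action under logging
  ]

  # Probability the *new* policy would assign to each logged action (illustrative).
  new_policy = {"a": 0.6, "b": 0.1, "c": 0.3}

  weights = [new_policy[r["action"]] / r["propensity"] for r in logged]

  # IPS: mean of weight * reward. The 0.02 propensity yields a weight of 15, so a
  # single record dominates the estimate; this is the variance explosion.
  ips = sum(w * r["reward"] for w, r in zip(weights, logged)) / len(logged)

  # SNIPS: normalise by the total weight, which tames (but does not remove) the
  # influence of extreme weights.
  snips = sum(w * r["reward"] for w, r in zip(weights, logged)) / sum(weights)

  print(f"IPS:   {ips:.3f}")    # 5.400: pulled far above any plausible reward
  print(f"SNIPS: {snips:.3f}")  # 0.985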

4) Interleaving

Question:

  • "Between ranker A and B, which one wins more often on the same traffic?"

Inputs:

  • ranker A results + ranker B results + outcomes (often clicks)

Outputs:

  • win rates, tie rate, p-value

When to use:

  • comparing two rankers or weight sets quickly
  • when A/B would be too slow or noisy

Common pitfalls:

  • you treat interleaving as a full business KPI replacement (it is not)
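
A minimal sketch of how interleaved traffic might be scored once clicks are credited to the ranker that contributed each item (credit assignment itself, e.g. team-draft interleaving, happens at logging time). Illustrative only; recsys-eval handles ties and significance more carefully.

  # Per-request credit counts: how many clicks went to items contributed by A vs B.
  per_request_credits = [
      {"A": 2, "B": 0},   # A's items got both clicks on this request, so A wins
      {"A": 0, "B": 1},
      {"A": 1, "B": 1},   # tie
      {"A": 0, "B": 2},
      {"A": 0, "B": 1},
  ]

  wins_a = sum(1 for c in per_request_credits if c["A"] > c["B"])
  wins_b = sum(1 for c in per_request_credits if c["B"] > c["A"])
  ties = len(per_request_credits) - wins_a - wins_b
  print(f"A wins: {wins_a}, B wins: {wins_b}, ties: {ties}")

  # Significance is typically a sign test on the non-tied requests: under the
  # null hypothesis, wins for B follow Binomial(wins_a + wins_b, 0.5). With only
  # a handful of requests, as here, nothing is significant; the example just
  # shows the shape of the computation.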

Where this fits in the bigger system

Typical stack:

  • recsys-service: serves recs and logs exposures and outcomes
  • recsys-pipelines: builds artifacts (popularity, co-occurrence, embeddings)
  • recsys-algo: ranks and applies rules
  • recsys-eval: measures and decides

recsys-eval is the "truth serum": it turns claims about changes into evidence.