Evaluation modes¶
This page explains the evaluation modes (offline and online) and how they fit into the RecSys suite.
Who this is for¶
- Stakeholders and engineers deciding how to validate improvements
- Teams setting up a “ship / hold / rollback” workflow
What you will get¶
- A map of offline vs online evaluation (what each can prove)
- How RecSys supports each mode (and what you must provide)
Offline evaluation (deterministic, repeatable)¶
Goal:
- Answer: “Is this change likely better, and did we break anything?”
Requires:
- Exposure logs (what was shown)
- Ideally, outcome logs (what happened)
- A defined evaluation dataset window (these inputs are sketched below)
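
The exact shape of these inputs depends on your logging pipeline. As a rough sketch (field names here are illustrative, not the recsys-eval input schema), they might look like this:

```python
from datetime import datetime, timezone

# Illustrative records only; consult the recsys-eval docs for the real schema.
exposure_log = [
    # What was shown: the request, the item, its position, and which ranker produced it.
    {"request_id": "r-1001", "user_id": "u-42", "item_id": "sku-9",
     "position": 1, "ranker": "candidate", "ts": "2024-05-01T12:00:00Z"},
]

outcome_log = [
    # What happened, joinable back to an exposure via request_id + item_id.
    {"request_id": "r-1001", "item_id": "sku-9", "event": "click",
     "ts": "2024-05-01T12:00:31Z"},
]

# A fixed evaluation window is what makes the offline run deterministic and repeatable.
eval_window = (
    datetime(2024, 5, 1, tzinfo=timezone.utc),
    datetime(2024, 5, 8, tzinfo=timezone.utc),
)
```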
Docs:
- Overview and workflows: recsys-eval docs
- CI gates: CI gates: using recsys-eval in automation
When to use:
- Every change that affects ranking behavior (rules, weights, signals, scoring)
- As a deterministic “quality gate” before production rollout (an example gate is sketched below)
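
As an illustration of what a deterministic quality gate can look like in CI, the sketch below compares hypothetical baseline and candidate metric reports and fails the job on a regression. The metric names, thresholds, and report format are placeholders; the actual recsys-eval interface is described in the CI gates doc linked above.

```python
import sys

# Hypothetical metric reports; in practice these come from your offline
# evaluation runs (baseline vs. candidate) over the same dataset window.
baseline = {"ndcg@10": 0.412, "coverage": 0.87}
candidate = {"ndcg@10": 0.419, "coverage": 0.84}

# Gate policy: block the rollout if any tracked metric regresses by more
# than its allowed tolerance. These thresholds are arbitrary examples.
tolerances = {"ndcg@10": 0.005, "coverage": 0.02}

failures = [
    name for name, tol in tolerances.items()
    if candidate[name] < baseline[name] - tol
]

if failures:
    print(f"Quality gate failed for: {', '.join(failures)}")
    sys.exit(1)  # a non-zero exit fails the CI job
print("Quality gate passed.")
```

Because the inputs and the window are fixed, the gate returns the same verdict on every run, which is what makes it usable as an automated pre-rollout check.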
Online evaluation (experiments, production validation)¶
Goal:
- Answer: “Does this change improve business metrics under real traffic?”
Requires:
- A way to assign traffic to variants
- Stable subject IDs (user/session) to avoid broken bucketing (see the assignment sketch after this list)
- Joinable logs (exposures + outcomes)
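
How traffic is assigned is up to your experimentation stack; a common approach (not necessarily what RecSys does internally) is to hash a stable subject ID together with the experiment ID so that the same subject always lands in the same variant. A minimal sketch, with hypothetical names:

```python
import hashlib

def assign_variant(experiment_id: str, subject_id: str, variants: list[str]) -> str:
    """Deterministically map a stable subject ID to a variant.

    The same subject always gets the same bucket for a given experiment,
    which prevents broken bucketing when users return across sessions.
    """
    digest = hashlib.sha256(f"{experiment_id}:{subject_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: the assignment is stable across requests as long as the subject ID is.
assert assign_variant("ranker-v2-test", "user-42", ["control", "treatment"]) == \
       assign_variant("ranker-v2-test", "user-42", ["control", "treatment"])
```

If the subject ID changes mid-experiment (for example, a session ID that rotates), users hop between variants and the exposure/outcome join becomes unreliable.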
Docs:
- Experimentation model: Experimentation model (A/B, interleaving, OPE)
- Online A/B workflow: Workflow: Online A/B analysis in production
When to use:
- After offline gates pass
- When you need a procurement-grade proof of impact (business KPIs)
Interleaving (faster online comparison)¶
Interleaving compares two rankers on the same traffic by merging their results into a single list shown to the user and crediting each interaction to the ranker that contributed the item.
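
One common scheme is team-draft interleaving, sketched below for two rankings of item IDs; this is a generic illustration, not necessarily the algorithm RecSys uses.

```python
import random

def team_draft_interleave(ranking_a: list[str], ranking_b: list[str],
                          k: int = 10, seed: int | None = None) -> list[tuple[str, str]]:
    """Merge two rankings into one list, tagging each slot with the ranker
    ("A" or "B") that contributed it."""
    rng = random.Random(seed)
    interleaved: list[tuple[str, str]] = []
    seen: set[str] = set()
    picks = {"A": 0, "B": 0}
    while len(interleaved) < k:
        # The ranker with fewer picks so far goes next; ties are broken randomly.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        source = ranking_a if team == "A" else ranking_b
        item = next((i for i in source if i not in seen), None)
        if item is None:
            break  # simplification: stop once a ranker has nothing new to add
        interleaved.append((item, team))
        seen.add(item)
        picks[team] += 1
    return interleaved
```

Clicks on the interleaved list are then credited to the contributing ranker, so both rankers are compared on the same traffic rather than on separate traffic buckets.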
Docs:
- Interleaving: Interleaving: fast ranker comparison on the same traffic
Read next¶
- Run eval and ship: How-to: run evaluation and make ship decisions
- Interpretation cheat sheet: Interpretation cheat sheet (recsys-eval)
- Experimentation model: Experimentation model (A/B, interleaving, OPE)