How-to: run evaluation and make ship decisions

This guide shows how to run evaluation and make ship decisions in a reliable, repeatable way.

Who this is for

  • Engineers shipping recommender changes and needing a quality gate
  • Analysts validating impact from logs
  • Operators who need an auditable “ship / hold / rollback” decision trail

What you will get

  • A runnable baseline workflow for validating logs and producing reports
  • A clear recommendation for when to use offline vs experiment analysis
  • Links to the deeper recsys-eval docs for interpretation and scaling

Goal

Turn exposure/outcome logs into a report you can use to decide ship / hold / rollback.

Prereqs

  • recsys-eval built (from this repo):
    cd recsys-eval
    make build
  • Logs in the v1 schemas (illustrative record shapes after this list):
      • exposures: exposure.v1
      • outcomes: outcome.v1
      • assignments: assignment.v1 (required for experiment mode)
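
For orientation, minimal records might look roughly like the lines below. These are illustrative only: apart from request_id, tenant_id, surface, segment, and the experiment id/variant mentioned elsewhere in this guide, every field name is an assumption. The v1 schema docs are authoritative, and validation is strict, so do not copy these verbatim.

{"request_id": "r-123", "tenant_id": "acme", "surface": "home", "segment": "new_user"}      ← exposure.v1 (illustrative)
{"request_id": "r-123", "item_id": "sku-42", "event": "click"}                              ← outcome.v1 (illustrative)
{"request_id": "r-123", "experiment_id": "exp-7", "variant": "treatment"}                   ← assignment.v1 (illustrative)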

0) Validate inputs (always)

Validation is strict (records with unexpected extra fields can fail). Run this before trusting any metric:

./bin/recsys-eval validate --schema exposure.v1 --input exposures.jsonl
./bin/recsys-eval validate --schema outcome.v1 --input outcomes.jsonl
./bin/recsys-eval validate --schema assignment.v1 --input assignments.jsonl
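
In CI it helps to gate all three files in one pass. A minimal bash sketch, assuming validate exits nonzero on an invalid file (verify against your build of recsys-eval):

#!/usr/bin/env bash
set -euo pipefail
# Validate every input before any metric is computed; stop at the first failure.
for pair in "exposure.v1:exposures.jsonl" "outcome.v1:outcomes.jsonl" "assignment.v1:assignments.jsonl"; do
  schema="${pair%%:*}"; input="${pair##*:}"
  ./bin/recsys-eval validate --schema "$schema" --input "$input"
done
echo "all inputs valid"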

Tip: if you want recsys-service to emit exposure.v1 directly, set:

  • EXPOSURE_LOG_ENABLED=true
  • EXPOSURE_LOG_FORMAT=eval_v1
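
For example, when launching the service locally (the launch path below is hypothetical; set the variables wherever your deployment reads its environment):

export EXPOSURE_LOG_ENABLED=true
export EXPOSURE_LOG_FORMAT=eval_v1
./bin/recsys-service   # hypothetical binary path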

See: Exposure logging and attribution

1) Run an offline regression gate in CI (always)

The gate should:

  • compare baseline vs candidate versions
  • fail if a primary metric regresses beyond a threshold

Example:

./bin/recsys-eval run \
  --mode offline \
  --dataset configs/examples/dataset.jsonl.yaml \
  --config configs/eval/offline.ci.yaml \
  --output /tmp/offline_report.md \
  --output-format markdown
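
One way to wire the gate is sketched below. The metric name (ndcg@10), the report line format, and the 1% threshold are all assumptions; adapt the parsing and threshold to your actual report and primary metric:

#!/usr/bin/env bash
set -euo pipefail
# Extract the first decimal number on the first report line mentioning the metric.
metric() { grep -m1 'ndcg@10' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n1; }

./bin/recsys-eval run --mode offline \
  --dataset configs/examples/dataset.jsonl.yaml \
  --config configs/eval/offline.ci.yaml \
  --output /tmp/candidate_report.md --output-format markdown

base=$(metric /tmp/baseline_report.md)   # produced by an earlier baseline run
cand=$(metric /tmp/candidate_report.md)
[ -n "$base" ] && [ -n "$cand" ] || { echo "could not extract metric" >&2; exit 1; }
# Fail the build on more than a 1% relative regression.
awk -v b="$base" -v c="$cand" 'BEGIN { exit (c < b * 0.99) ? 1 : 0 }'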

If your exposure logs come from recsys-service in eval_v1 format, the exposure context keys are named tenant_id, surface, and segment. Ensure your slice_keys use the same names.
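
A sketch of the matching slice configuration; slice_keys is the key named above, but the YAML shape shown here is an assumption, so check the files under configs/eval/ for the exact format:

slice_keys:
  - tenant_id
  - surface
  - segment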

2) Prefer online experiments when possible

Online A/B tests are the best way to measure real impact:

  • log exposures with experiment id/variant
  • log outcomes tied to the same request_id
  • check KPI + guardrails

Example:

./bin/recsys-eval run \
  --mode experiment \
  --dataset configs/examples/dataset.jsonl.yaml \
  --config configs/eval/experiment.default.yaml \
  --output /tmp/experiment_report.md \
  --output-format markdown
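
Before reading the report, a cheap sanity check on assignment balance can catch logging bugs early. A sketch assuming assignment.v1 records carry a variant field (verify the field name against the schema):

# Count assignments per variant; a gross imbalance usually indicates an
# assignment or logging bug rather than a real effect.
jq -r '.variant' assignments.jsonl | sort | uniq -c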

3) Ship / rollback mechanics

Ship if the KPI improves and guardrails hold. Hold if results are inconclusive. Roll back if the primary metric or any guardrail regresses.

Rollback levers (a sketch follows this list):

  • Artifacts/manifest: swap the manifest pointer (pipelines) and invalidate service caches
  • Config/rules: roll back config/rules versions and invalidate service caches
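
A hypothetical rollback script, only to show the shape of the two levers. The manifest paths and the cache-invalidation endpoint are inventions for illustration; substitute whatever your deployment actually exposes:

#!/usr/bin/env bash
set -euo pipefail
# Lever 1: repoint the manifest to the last known-good artifact set.
cp manifests/v41.json manifests/current.json                           # hypothetical pointer swap
# Lever 2: drop cached results derived from the bad version.
curl -fsS -X POST http://recsys-service:8080/admin/cache/invalidate    # assumed admin endpoint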

Verify

  • The report file exists and includes a summary table for your chosen mode.
  • Join integrity is sane; if the join rate is low, fix logging before trusting metrics (a quick check follows).
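
A rough join-rate check with jq, assuming exposures and outcomes share the request_id field described above:

#!/usr/bin/env bash
# Distinct outcome request_ids that also appear in the exposure log...
joined=$(comm -12 \
  <(jq -r '.request_id' exposures.jsonl | sort -u) \
  <(jq -r '.request_id' outcomes.jsonl | sort -u) | wc -l)
# ...versus all distinct outcome request_ids.
total=$(jq -r '.request_id' outcomes.jsonl | sort -u | wc -l)
echo "join rate: $joined / $total"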