
Evidence (what “good outputs” look like)

Concrete examples of the artifacts a credible pilot should produce (logs, reports, and audit records).

Who this is for

  • Buyers who want proof the loop is real (not just architecture)
  • Stakeholders who need to see what a decision artifact looks like

What you will get

  • The three concrete artifacts produced in a credible pilot
  • Where they come from in this repo (so you can reproduce them)

What you will receive (deliverables)

If you follow the recommended pilot path, you should be able to produce:

  • A shareable evaluation report (JSON + optional Markdown/HTML summary)
  • An evidence kit (sample recommendation response, exposure/outcome samples, join-rate checks)
  • A written ship/hold/rollback decision linked to the artifacts
  • A reproducibility record (commands, versions, and where artifacts are stored)

Evidence ladder (how to interpret)

This page shows example artifacts you can generate in a credible pilot.

What this evidence does prove:

  • You can serve non-empty recommendations (POST /v1/recommend).
  • You can log what was shown (exposures) and what happened (outcomes), and join them by request_id.
  • You can produce a shareable report and make a ship/hold/rollback decision with an audit trail.

What this evidence does not prove by itself:

  • KPI lift in your product (you still need your own data + experimentation discipline).
  • Production readiness (use the production checklist + runbooks): Production readiness checklist
  • Absolute performance/latency guarantees (use baseline anchor numbers as a starting point): Baseline benchmarks

The artifacts that matter

1) Serving output (what users see)

An API response includes ranked items plus metadata and warnings.

Example (response shape, abbreviated):

{
  "items": [{ "item_id": "item_3", "rank": 1, "score": 0.12 }],
  "meta": {
    "tenant_id": "demo",
    "surface": "home",
    "config_version": "W/\"...\"",
    "rules_version": "W/\"...\"",
    "request_id": "req-1"
  },
  "warnings": []
}
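As an illustration of exercising this endpoint, here is a minimal Python sketch. The `POST /v1/recommend` path and the response fields (`items`, `meta.request_id`) come from this page; the request payload fields and base URL are assumptions for illustration only.

```python
import json
import urllib.request


def fetch_recommendations(base_url: str, tenant_id: str, user_id: str, surface: str) -> dict:
    """POST a recommendation request and return the parsed JSON body.

    The payload fields below are assumed for illustration; check the API
    reference in this repo for the actual request schema.
    """
    payload = json.dumps({
        "tenant_id": tenant_id,  # assumed request field
        "user_id": user_id,      # assumed request field
        "surface": surface,      # matches meta.surface in the response
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_serving_proof(body: dict) -> bool:
    """Check the response shape shown above: non-empty ranked items plus a request_id."""
    items = body.get("items", [])
    meta = body.get("meta", {})
    return (
        bool(items)
        and all("item_id" in i and "rank" in i for i in items)
        and "request_id" in meta
    )
```

`is_serving_proof` is the piece worth keeping in a pilot: a captured response that passes it is exactly the "serving proof" artifact listed in the evidence kit below.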

2) Exposure/outcome logs (what we measure)

You need auditable logs to attribute outcomes to what was shown.

Examples (JSONL; shown pretty-printed for readability):

Exposure (exposure.v1):

{
  "request_id": "req-1",
  "user_id": "u_1",
  "ts": "2026-02-05T10:00:00Z",
  "items": [
    { "item_id": "item_1", "rank": 1 },
    { "item_id": "item_2", "rank": 2 }
  ],
  "context": { "tenant_id": "demo", "surface": "home" }
}

Outcome (outcome.v1):

{
  "request_id": "req-1",
  "user_id": "u_1",
  "item_id": "item_2",
  "event_type": "click",
  "ts": "2026-02-05T10:00:02Z"
}
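A quick way to validate these logs is a join-rate check: what fraction of outcome events can be attributed to a logged exposure via `request_id`. A minimal sketch, assuming one JSON record per line (JSONL) as shown above:

```python
import json


def join_rate(exposure_lines, outcome_lines) -> float:
    """Fraction of outcome events whose request_id matches a logged exposure.

    Operates on iterables of JSONL strings (exposure.v1 / outcome.v1 records).
    A low join rate usually means logging gaps or request_id mismatches.
    """
    exposed = {json.loads(line)["request_id"] for line in exposure_lines if line.strip()}
    outcomes = [json.loads(line) for line in outcome_lines if line.strip()]
    if not outcomes:
        return 0.0
    joined = sum(1 for o in outcomes if o["request_id"] in exposed)
    return joined / len(outcomes)
```

Running this over your pilot's log files gives you the "join-rate report" line item in the evidence kit below.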

3) Evaluation report (what you can share internally)

recsys-eval produces a machine-readable JSON report and optional Markdown/HTML summaries.

Example (executive summary shape, illustrative):

{
  "run_id": "2026-02-05T12:34:56Z-abc123",
  "mode": "offline",
  "created_at": "2026-02-05T12:34:56Z",
  "version": "recsys-eval/vX.Y.Z",
  "summary": {
    "cases_evaluated": 12345,
    "executive": {
      "decision": "pass",
      "highlights": ["No regressions on guardrails"],
      "key_deltas": [{ "name": "primary_metric", "delta": 0.012, "relative_delta": 0.03 }]
    }
  }
}
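Downstream tooling can gate on this shape. A sketch, assuming the `summary.executive` path shown above; the mapping from key deltas to a rollback decision is an illustrative policy, not part of the report format:

```python
def gate(report: dict, max_guardrail_drop: float = 0.0) -> str:
    """Map an evaluation report's executive summary to ship/hold/rollback.

    Reads the summary.executive path from the example shape above. The
    threshold logic here is an illustrative policy; tune it to your pilot.
    """
    executive = report.get("summary", {}).get("executive", {})
    if executive.get("decision") != "pass":
        return "hold"
    deltas = executive.get("key_deltas", [])
    if any(d.get("delta", 0.0) < max_guardrail_drop for d in deltas):
        return "rollback"  # illustrative: a negative key delta breaches a guardrail
    return "ship"
```

The string this returns is the same ship/hold/rollback vocabulary used in the decision section of the evidence kit, so the report and the written decision stay consistent.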

4) Audit record (what changed, and who changed it)

Control-plane changes (config/rules/cache invalidation) can be written to an audit log.

Example (abbreviated):

{
  "tenant_id": "demo",
  "entries": [
    {
      "id": 123,
      "occurred_at": "2026-02-05T10:00:00Z",
      "actor_sub": "user:demo-admin",
      "actor_type": "user",
      "action": "config.update",
      "entity_type": "tenant_config",
      "entity_id": "demo",
      "request_id": "req-1"
    }
  ]
}
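To tie a ship/hold/rollback decision to the control-plane state it depended on, you can filter the audit log by the same `request_id` that appears in the serving response. A small sketch against the entry shape above:

```python
def audit_trail_for(audit: dict, request_id: str) -> list:
    """Return audit entries tied to a given request_id, ordered by occurrence time.

    Entry shape follows the example above; this lets a decision record cite
    the exact config/rules changes associated with a serving request.
    """
    matches = [e for e in audit.get("entries", []) if e.get("request_id") == request_id]
    return sorted(matches, key=lambda e: e.get("occurred_at", ""))
```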

A reproducible demo path (under an hour)

Run the suite locally and produce a report you can share.

This gives you:

  • a working serving API
  • eval-compatible exposure logs
  • a minimal outcome log
  • a sample evaluation report

Evidence kit template (copy/paste)

Use this template as the “bundle” you share internally for decision-making and procurement.

1) Context

  • Product / domain:
  • Surface:
  • Population:
  • Time window:
  • Primary KPI:
  • Guardrails:

2) What changed

  • Candidate vs baseline description:
  • Config/rules/algo versions:
  • Data mode: DB-only / artifact-manifest

3) Evidence artifacts

  • Serving proof (sample response with request_id): …
  • Exposure log sample (JSONL line + schema version): …
  • Outcome log sample (JSONL line + schema version): …
  • Join-rate report (exposures ↔ outcomes by request_id): …
  • Evaluation report (JSON + executive summary): …

4) Decision

  • Decision: ship / hold / rollback
  • Reasoning (1–5 bullets):
  • Risks / follow-ups:

5) Reproducibility

  • Commit SHA / tag: …
  • Commands / pipeline run IDs: …
  • Where artifacts are stored (paths, retention): …
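The reproducibility record above can be generated mechanically. A hedged sketch: the manifest layout and output file name are assumptions for illustration, not a format this repo defines:

```python
import hashlib
import json
import pathlib
import subprocess


def evidence_manifest(artifact_paths, out_path="evidence_manifest.json") -> dict:
    """Write a reproducibility record: SHA-256 of each artifact plus the git commit.

    The manifest layout here is illustrative; the point is that every artifact
    in the evidence kit is content-addressed and tied to a commit.
    """
    entries = []
    for p in artifact_paths:
        data = pathlib.Path(p).read_bytes()
        entries.append({"path": str(p), "sha256": hashlib.sha256(data).hexdigest()})
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = None  # not running inside a git checkout
    manifest = {"commit": commit, "artifacts": entries}
    pathlib.Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Checking the manifest into the same place as the evaluation report means anyone reviewing the decision can verify the artifacts they received are the ones the decision was based on.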