
Evidence (what “good outputs” look like)

Concrete examples of the artifacts a credible pilot should produce (logs, reports, and audit records).

Who this is for

  • Buyers who want proof the loop is real (not just architecture)
  • Stakeholders who need to see what a decision artifact looks like

What you will get

  • The three concrete artifacts produced in a credible pilot
  • Where they come from in this repo (so you can reproduce them)

What you will receive (deliverables)

If you follow the recommended pilot path, you should be able to produce:

  • A shareable evaluation report (JSON + optional Markdown/HTML summary)
  • An evidence kit (sample recommendation response, exposure/outcome samples, join-rate checks)
  • A written ship/hold/rollback decision linked to the artifacts
  • A reproducibility record (commands, versions, and where artifacts are stored)

Evidence ladder (how to interpret)

This page shows example artifacts you can generate in a credible pilot.

What this evidence does prove:

  • You can serve non-empty recommendations (POST /v1/recommend).
  • You can log what was shown (exposures) and what happened (outcomes), and join them by request_id.
  • You can produce a shareable report and make a ship/hold/rollback decision with an audit trail.

What this evidence does not prove by itself:

  • KPI lift in your product (you still need your own data + experimentation discipline).
  • Production readiness (use the production checklist + runbooks): Production readiness checklist
  • Absolute performance/latency guarantees (use baseline anchor numbers as a starting point): Baseline benchmarks

The artifacts that matter

1) Serving output (what users see)

An API response includes ranked items plus metadata and warnings.

Example (response shape, abbreviated):

{
  "items": [{ "item_id": "item_3", "rank": 1, "score": 0.12 }],
  "meta": {
    "tenant_id": "demo",
    "surface": "home",
    "config_version": "W/\"...\"",
    "rules_version": "W/\"...\"",
    "request_id": "req-1"
  },
  "warnings": []
}
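As an illustration of exercising this endpoint, here is a minimal Python sketch. The `POST /v1/recommend` path and the response fields (`items`, `meta.request_id`) come from this page; the request payload fields and base URL are assumptions for illustration only.

```python
import json
import urllib.request


def fetch_recommendations(base_url: str, tenant_id: str, user_id: str, surface: str) -> dict:
    """POST a recommendation request and return the parsed JSON body.

    The payload fields below are assumed for illustration; check the API
    reference in this repo for the actual request schema.
    """
    payload = json.dumps({
        "tenant_id": tenant_id,  # assumed request field
        "user_id": user_id,      # assumed request field
        "surface": surface,      # matches meta.surface in the response
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/recommend",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_serving_proof(body: dict) -> bool:
    """Check the response shape shown above: non-empty ranked items plus a request_id."""
    items = body.get("items", [])
    meta = body.get("meta", {})
    return (
        bool(items)
        and all("item_id" in i and "rank" in i for i in items)
        and "request_id" in meta
    )
```

`is_serving_proof` is the piece worth keeping in a pilot: a captured response that passes it is exactly the "serving proof" artifact listed in the evidence kit below.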

2) Exposure/outcome logs (what we measure)

You need auditable logs to attribute outcomes to what was shown.

Examples (JSONL; shown pretty-printed for readability):

Exposure (exposure.v1):

{
  "request_id": "req-1",
  "user_id": "u_1",
  "ts": "2026-02-05T10:00:00Z",
  "items": [
    { "item_id": "item_1", "rank": 1 },
    { "item_id": "item_2", "rank": 2 }
  ],
  "context": { "tenant_id": "demo", "surface": "home" }
}

Outcome (outcome.v1):

{
  "request_id": "req-1",
  "user_id": "u_1",
  "item_id": "item_2",
  "event_type": "click",
  "ts": "2026-02-05T10:00:02Z"
}
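A quick way to validate these logs is a join-rate check: what fraction of outcome events can be attributed to a logged exposure via `request_id`. A minimal sketch, assuming one JSON record per line (JSONL) as shown above:

```python
import json


def join_rate(exposure_lines, outcome_lines) -> float:
    """Fraction of outcome events whose request_id matches a logged exposure.

    Operates on iterables of JSONL strings (exposure.v1 / outcome.v1 records).
    A low join rate usually means logging gaps or request_id mismatches.
    """
    exposed = {json.loads(line)["request_id"] for line in exposure_lines if line.strip()}
    outcomes = [json.loads(line) for line in outcome_lines if line.strip()]
    if not outcomes:
        return 0.0
    joined = sum(1 for o in outcomes if o["request_id"] in exposed)
    return joined / len(outcomes)
```

Running this over your pilot's log files gives you the "join-rate report" line item in the evidence kit below.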

3) Evaluation report (what you can share internally)

recsys-eval produces a machine-readable JSON report and optional Markdown/HTML summaries.

Example (executive summary shape, illustrative):

{
  "run_id": "2026-02-05T12:34:56Z-abc123",
  "mode": "offline",
  "created_at": "2026-02-05T12:34:56Z",
  "version": "recsys-eval/vX.Y.Z",
  "summary": {
    "cases_evaluated": 12345,
    "executive": {
      "decision": "pass",
      "highlights": ["No regressions on guardrails"],
      "key_deltas": [{ "name": "primary_metric", "delta": 0.012, "relative_delta": 0.03 }]
    }
  }
}
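Downstream tooling can gate on this shape. A sketch, assuming the `summary.executive` path shown above; the mapping from key deltas to a rollback decision is an illustrative policy, not part of the report format:

```python
def gate(report: dict, max_guardrail_drop: float = 0.0) -> str:
    """Map an evaluation report's executive summary to ship/hold/rollback.

    Reads the summary.executive path from the example shape above. The
    threshold logic here is an illustrative policy; tune it to your pilot.
    """
    executive = report.get("summary", {}).get("executive", {})
    if executive.get("decision") != "pass":
        return "hold"
    deltas = executive.get("key_deltas", [])
    if any(d.get("delta", 0.0) < max_guardrail_drop for d in deltas):
        return "rollback"  # illustrative: a negative key delta breaches a guardrail
    return "ship"
```

The string this returns is the same ship/hold/rollback vocabulary used in the decision section of the evidence kit, so the report and the written decision stay consistent.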

4) Audit record (what changed, and who changed it)

Control-plane changes (config/rules/cache invalidation) can be written to an audit log.

Example (abbreviated):

{
  "tenant_id": "demo",
  "entries": [
    {
      "id": 123,
      "occurred_at": "2026-02-05T10:00:00Z",
      "actor_sub": "user:demo-admin",
      "actor_type": "user",
      "action": "config.update",
      "entity_type": "tenant_config",
      "entity_id": "demo",
      "request_id": "req-1"
    }
  ]
}
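To tie a ship/hold/rollback decision to the control-plane state it depended on, you can filter the audit log by the same `request_id` that appears in the serving response. A small sketch against the entry shape above:

```python
def audit_trail_for(audit: dict, request_id: str) -> list:
    """Return audit entries tied to a given request_id, ordered by occurrence time.

    Entry shape follows the example above; this lets a decision record cite
    the exact config/rules changes associated with a serving request.
    """
    matches = [e for e in audit.get("entries", []) if e.get("request_id") == request_id]
    return sorted(matches, key=lambda e: e.get("occurred_at", ""))
```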

A reproducible demo path (under an hour)

Run the suite locally and produce a report you can share.

This gives you:

  • a working serving API
  • eval-compatible exposure logs
  • a minimal outcome log
  • a sample evaluation report

Evidence kit template (copy/paste)

Use this template as the “bundle” you share internally for decision-making and procurement.

1) Context

  • Product / domain:
  • Surface:
  • Population:
  • Time window:
  • Primary KPI:
  • Guardrails:

2) What changed

  • Candidate vs baseline description:
  • Config/rules/algo versions:
  • Data mode: DB-only / artifact-manifest

3) Evidence artifacts

  • Serving proof (sample response with request_id): …
  • Exposure log sample (JSONL line + schema version): …
  • Outcome log sample (JSONL line + schema version): …
  • Join-rate report (exposures ↔ outcomes by request_id): …
  • Evaluation report (JSON + executive summary): …

4) Decision

  • Decision: ship / hold / rollback
  • Reasoning (1–5 bullets):
  • Risks / follow-ups:

5) Reproducibility

  • Commit SHA / tag: …
  • Commands / pipeline run IDs: …
  • Where artifacts are stored (paths, retention): …
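The reproducibility record above can be generated mechanically. A hedged sketch: the manifest layout and output file name are assumptions for illustration, not a format this repo defines:

```python
import hashlib
import json
import pathlib
import subprocess


def evidence_manifest(artifact_paths, out_path="evidence_manifest.json") -> dict:
    """Write a reproducibility record: SHA-256 of each artifact plus the git commit.

    The manifest layout here is illustrative; the point is that every artifact
    in the evidence kit is content-addressed and tied to a commit.
    """
    entries = []
    for p in artifact_paths:
        data = pathlib.Path(p).read_bytes()
        entries.append({"path": str(p), "sha256": hashlib.sha256(data).hexdigest()})
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        commit = None  # not running inside a git checkout
    manifest = {"commit": commit, "artifacts": entries}
    pathlib.Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Checking the manifest into the same place as the evaluation report means anyone reviewing the decision can verify the artifacts they received are the ones the decision was based on.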