Benchmarks and methodology¶
Benchmarks are credibility tools, not marketing numbers.
This page explains what you can reasonably measure, how to measure it, and how to record results so they are comparable over time.
What we benchmark¶
We focus on three benchmark categories that matter during procurement:
- Serving performance (latency/throughput for `POST /v1/recommend`)
- Pipelines performance (how long artifacts and manifests take to build)
- Evaluation runtime (how long offline reports take to produce)
What you should not expect¶
- These numbers are not “your production numbers.”
- These numbers do not imply business lift.
- Cross-company comparisons are misleading unless the environments being compared (hardware, dataset size, workload) are themselves comparable.
Reproducible baseline benchmarks¶
The suite includes baseline runs and a template to record your own results:
- Baseline benchmarks (ops): Baseline benchmarks (anchor numbers)
- Performance and capacity: Performance and capacity guide
Minimal benchmark protocol (recommended)¶
Run this protocol in a clean local environment first (10–20 minutes):
- Run the local end-to-end tutorial:
- local end-to-end (service → logging → eval)
- Record:
- host specs (CPU / RAM / disk)
- docker versions
- dataset size (items/users)
- data mode (DB-only vs artifact/manifest)
- Run the included baseline scripts and write down results.
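The protocol above can be sketched as a small harness: time repeated calls to the recommend endpoint and summarize p50/p95/p99 rather than averages. This is a minimal illustration, not the included baseline script; the endpoint URL and request payload (`user_id`, `k`) are assumptions you should adapt to your actual `/v1/recommend` schema.

```python
import json
import time
import urllib.request


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    idx = max(0, int(round(pct / 100 * len(ranked))) - 1)
    return ranked[idx]


def run_benchmark(send_request, n=200):
    """Time n calls to send_request() and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(n):
        start = time.perf_counter()
        send_request()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "n": n,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


def recommend_once(base_url="http://localhost:8000"):
    # Hypothetical payload -- match this to your actual /v1/recommend schema.
    body = json.dumps({"user_id": "u1", "k": 10}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/recommend",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(run_benchmark(recommend_once))
```

Running the harness twice (once cold, once warm) gives you the cache-behavior comparison the validity checklist below asks for.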
How to share results internally¶
Paste your recorded results into your evaluation document and link to the exact git commit and manifest id.
Benchmark validity checklist¶
Use this checklist to keep your benchmarks meaningful:
- You know the workload (surface count, `k`, filters)
- You capture p50/p95/p99 latency (not only averages)
- You record the dataset size (items/users)
- You record artifact versions and config versions
- You record cache behavior (cold vs warm)
How benchmarks connect to procurement¶
Benchmarks should answer:
- “Will this fit inside our latency budget?”
- “What will it cost us to run?”
- “How hard is it to operate?”
See also:
- TCO and effort: TCO and effort model
- Procurement pack: Procurement pack (Security, Legal, IT, Finance)
- Evidence (example outputs): Evidence (what “good outputs” look like)
Read next¶
- TCO and effort: TCO and effort model
- Evidence: Evidence (what “good outputs” look like)
- Limitations: Known limitations and non-goals (current)