Data lifecycle

This page explains the data lifecycle and how it fits into the RecSys suite.

Stages

  1. Raw events

     • Input is JSON Lines (jsonl) files
     • Schema: schemas/events/exposure.v1.json

  2. Canonical events

     • Stored per day (UTC) per tenant/surface
     • Written idempotently (replace per partition)

  3. Validation

     • Canonical events are validated before any artifacts are computed or published

  4. Artifact compute

     • popularity: counts by item
     • cooc: session-level co-occurrence

  5. Staging (optional)

     • Compute jobs can stage artifacts to artifacts_dir

  6. Publish

     • Versioned blob written to the object store
     • Registry record written
     • Current manifest pointer updated last
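The ordering of the publish stage matters: the current-manifest pointer is flipped only after the blob and registry record exist, so readers never resolve the pointer to a missing artifact. A minimal in-memory sketch of that ordering, using an illustrative dict in place of the real object store (key layout and names here are assumptions, not the actual API):

```python
import json

store = {}  # stand-in for the object store (assumption, not the real API)

def publish(artifact_name: str, version: str, blob: dict, registry: list) -> None:
    # 1. Write the versioned blob first; it is immutable once written.
    store[f"{artifact_name}/{version}/blob.json"] = json.dumps(blob)
    # 2. Record the new version in the registry.
    registry.append({"artifact": artifact_name, "version": version})
    # 3. Update the "current" pointer last, so a reader following the
    #    pointer always finds a fully written blob and registry record.
    store[f"{artifact_name}/current"] = version

registry = []
publish("popularity", "v42", {"counts": {"item_1": 10}}, registry)
```

If the process crashes between steps 1 and 3, the worst case is an orphaned blob, not a dangling pointer.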

Why canonicalization exists

  • Raw data is messy (missing fields, inconsistent formatting)
  • Canonical events define a stable boundary
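A minimal sketch of what that boundary looks like in practice: raw events may have missing fields or inconsistently typed values, and canonicalization maps them onto one stable shape. The field names and defaults below are illustrative assumptions, not the actual exposure.v1 schema:

```python
def canonicalize(raw: dict) -> dict:
    """Map a messy raw event onto a stable canonical shape (hypothetical fields)."""
    return {
        "event_id": str(raw["event_id"]),          # force a consistent type
        "tenant": raw.get("tenant", "default"),    # fill a missing field
        "surface": raw.get("surface", "unknown"),
        "ts": int(raw["ts"]),                      # raw may carry str or int timestamps
        "item_id": str(raw["item_id"]).strip(),    # normalize formatting
    }
```

Downstream jobs then code against the canonical shape only, so raw-side quirks never leak past this point.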

Why validation gates exist

Publishing a bad artifact can degrade the user experience immediately. Validation gates prevent bad data from ever reaching serving.
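A minimal sketch of such a gate: canonical events are checked before any compute or publish step runs, and a failure aborts the pipeline rather than producing an artifact. The required-field rule here is an assumption for illustration, not the real schema check:

```python
# Hypothetical required fields; the real check would validate against the schema.
REQUIRED = {"event_id", "tenant", "surface", "ts", "item_id"}

def validate(events: list) -> bool:
    """Raise if any canonical event is missing a required field; else return True."""
    for i, ev in enumerate(events):
        missing = REQUIRED - ev.keys()
        if missing:
            raise ValueError(f"event {i} missing fields: {sorted(missing)}")
    return True
```

Because the gate raises instead of logging and continuing, no artifact is computed or published from data that failed the check.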