Data lifecycle¶
This page explains the data lifecycle and how it fits into the RecSys suite.
Stages¶
- Raw events
    - Input is JSON Lines (`.jsonl`) files
    - Schema: `schemas/events/exposure.v1.json`
- Canonical events
    - Stored per day (UTC) per tenant/surface
    - Written idempotently (replace per partition)
- Validation
    - Canonical data is validated before any artifacts are computed or published
- Artifact compute
    - `popularity`: counts by item
    - `cooc`: session-level co-occurrence
- Staging (optional)
    - Compute jobs can stage artifacts to `artifacts_dir`
- Publish
    - Versioned blob written to the object store
    - Registry record written
    - Current manifest pointer updated last
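The publish ordering above matters: the current pointer is updated last, so readers never see a pointer to a blob that does not exist yet. A minimal sketch of that ordering, using a local filesystem store (the layout and names `blobs/`, `registry.jsonl`, and `CURRENT` are illustrative assumptions, not the suite's actual layout):

```python
import json
from pathlib import Path

def publish(store: Path, tenant: str, surface: str, version: str, artifact: dict) -> None:
    # 1. Write the versioned blob first; it is immutable once written.
    blob = store / "blobs" / tenant / surface / f"{version}.json"
    blob.parent.mkdir(parents=True, exist_ok=True)
    blob.write_text(json.dumps(artifact))

    # 2. Append a registry record describing the new version.
    registry = store / "registry.jsonl"
    with registry.open("a") as f:
        f.write(json.dumps({"tenant": tenant, "surface": surface, "version": version}) + "\n")

    # 3. Update the "current" manifest pointer last, via atomic rename,
    #    so a crash between steps never leaves a dangling pointer.
    pointer = store / tenant / surface / "CURRENT"
    pointer.parent.mkdir(parents=True, exist_ok=True)
    tmp = pointer.with_suffix(".tmp")
    tmp.write_text(version)
    tmp.replace(pointer)  # atomic on POSIX filesystems
```

If the job fails before step 3, the previous pointer still references a complete, valid artifact, so serving is unaffected.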
Why canonicalization exists¶
- Raw data is messy (missing fields, inconsistent formatting)
- Canonical events define a stable boundary
Why validation gates exist¶
Publishing a bad artifact can degrade the user experience immediately. Validation stops bad data before it reaches serving.
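A validation gate that runs before any artifact is computed can be sketched as follows. The specific rules (required fields, duplicate event IDs) are illustrative assumptions, not the suite's actual checks:

```python
# Hypothetical required fields for a canonical event partition.
REQUIRED = {"event_id", "item_id", "ts"}

def validate_partition(events: list[dict]) -> list[str]:
    """Return a list of error strings; an empty list means the gate passes."""
    errors = []
    seen_ids = set()
    for i, e in enumerate(events):
        missing = REQUIRED - e.keys()
        if missing:
            errors.append(f"event {i}: missing {sorted(missing)}")
        eid = e.get("event_id")
        if eid in seen_ids:
            errors.append(f"event {i}: duplicate event_id {eid!r}")
        seen_ids.add(eid)
    return errors

def gate_or_raise(events: list[dict]) -> None:
    errs = validate_partition(events)
    if errs:
        # Fail the run before any artifact is computed or published.
        raise ValueError("validation failed: " + "; ".join(errs))
```

Because the gate raises before compute starts, a failed run leaves the previously published artifacts untouched.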
Read next¶
- Start here
- Validation and guardrails
- How-to: Run incremental pipelines
- Runbook: Validation failed
- Glossary