Data lifecycle¶
This page explains the data lifecycle and how it fits into the RecSys suite.
Stages¶
- Raw events
    - Input is JSON Lines (`.jsonl`) files
    - Schema: `schemas/events/exposure.v1.json`
- Canonical events
    - Stored per day (UTC) per tenant/surface
    - Written idempotently (replace per partition)
- Validation
    - Canonical data is validated before any artifacts are computed or published
- Artifact compute
    - `popularity`: counts by item
    - `cooc`: session-level co-occurrence
- Staging (optional)
    - Compute jobs can stage artifacts to `artifacts_dir`
- Publish
    - Versioned blob written to the object store
    - Registry record written
    - Current manifest pointer updated last
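The publish ordering above matters: the current pointer is updated last, so readers never see a pointer to a blob that does not exist yet. A minimal sketch of that ordering, using a local filesystem store (the layout and names `blobs/`, `registry.jsonl`, and `CURRENT` are illustrative assumptions, not the suite's actual layout):

```python
import json
from pathlib import Path

def publish(store: Path, tenant: str, surface: str, version: str, artifact: dict) -> None:
    # 1. Write the versioned blob first; it is immutable once written.
    blob = store / "blobs" / tenant / surface / f"{version}.json"
    blob.parent.mkdir(parents=True, exist_ok=True)
    blob.write_text(json.dumps(artifact))

    # 2. Append a registry record describing the new version.
    registry = store / "registry.jsonl"
    with registry.open("a") as f:
        f.write(json.dumps({"tenant": tenant, "surface": surface, "version": version}) + "\n")

    # 3. Update the "current" manifest pointer last, via atomic rename,
    #    so a crash between steps never leaves a dangling pointer.
    pointer = store / tenant / surface / "CURRENT"
    pointer.parent.mkdir(parents=True, exist_ok=True)
    tmp = pointer.with_suffix(".tmp")
    tmp.write_text(version)
    tmp.replace(pointer)  # atomic on POSIX filesystems
```

If the job fails before step 3, the previous pointer still references a complete, valid artifact, so serving is unaffected.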
Why canonicalization exists¶
- Raw data is messy (missing fields, inconsistent formatting)
- Canonical events define a stable boundary
Why validation gates exist¶
Publishing a bad artifact can degrade the user experience immediately. Validation stops bad data before it reaches serving.
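A validation gate that runs before any artifact is computed can be sketched as follows. The specific rules (required fields, duplicate event IDs) are illustrative assumptions, not the suite's actual checks:

```python
# Hypothetical required fields for a canonical event partition.
REQUIRED = {"event_id", "item_id", "ts"}

def validate_partition(events: list[dict]) -> list[str]:
    """Return a list of error strings; an empty list means the gate passes."""
    errors = []
    seen_ids = set()
    for i, e in enumerate(events):
        missing = REQUIRED - e.keys()
        if missing:
            errors.append(f"event {i}: missing {sorted(missing)}")
        eid = e.get("event_id")
        if eid in seen_ids:
            errors.append(f"event {i}: duplicate event_id {eid!r}")
        seen_ids.add(eid)
    return errors

def gate_or_raise(events: list[dict]) -> None:
    errs = validate_partition(events)
    if errs:
        # Fail the run before any artifact is computed or published.
        raise ValueError("validation failed: " + "; ".join(errs))
```

Because the gate raises before compute starts, a failed run leaves the previously published artifacts untouched.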
Read next¶
- Start here
- Validation and guardrails
- How-to: Run incremental pipelines
- Runbook: Validation failed
- Glossary