Skip to content

Off-policy evaluation (OPE): powerful and easy to misuse

This page explains Off-policy evaluation (OPE): powerful and easy to misuse and how it fits into the RecSys suite.

Who this is for

Advanced users. Read this before using --mode ope in anything serious.

What you will get

  • What OPE tries to estimate
  • What propensities are and why they matter
  • When OPE results are trustworthy and when they are not

The promise

OPE tries to answer:

  • "What would have happened if we used a different ranking policy?"

using logs collected from an old policy.

This can save you from running an online experiment.

The catch

OPE depends on assumptions that are easy to violate:

  • correct propensity logging
  • overlap between old and new policies (support)
  • stable user behavior model

If you violate these, OPE can confidently lie.

Propensities in plain language

A propensity is the probability that a policy shows an item in a position.

If an item never appears under the logging policy, you cannot reliably estimate how it would perform under a new policy. This is the "no overlap" problem.

Diagnostics you should take seriously

  • near-zero propensities:

your estimator variance explodes

  • missing target propensities:

you are not evaluating the policy you think you are

  • heavy clipping:

your result is dominated by a few samples

A practical "when to use" checklist

Use OPE when:

  • you log propensities correctly
  • your new policy is a mild change from the old
  • you mainly want directional signal

Do not use OPE when:

  • the new policy changes candidate generation dramatically
  • you have missing propensity fields
  • you care about strict ship/no-ship
  • Use offline evaluation first.
  • Use OPE as an early filter.
  • Use A/B or interleaving for the final decision.