Measured Lift

How Levered measures the conversion lift of the model against the holdout, and how to reproduce the exact numbers from your own warehouse.

Measured lift is the empirical conversion difference between users the model optimized and users held back in the holdout (a randomized control that always sees the baseline). It answers a single question: did the optimization actually move the metric?

The same number appears in the dashboard's Lift Measurement tab, the levered optimizations results CLI ("Measured" line), and the API. All three are computed by one shared estimator, so they always agree, and you can reproduce them exactly from your warehouse with the SQL and formula below.

The estimator

For a metric over a date range, Levered pools each arm and reports the relative lift with a delta-method confidence interval and a two-sided p-value.

Let the model (treatment) arm have rate p_model over n_model users, and the holdout (control) arm have rate p_hold over n_hold users. For a boolean metric the rate is conversions / users.

relative_lift         = (p_model − p_hold) / p_hold
relative_lift_percent = 100 · relative_lift           (the value reported)

The standard error of the relative lift comes from the delta method:

var(p_model) = p_model · (1 − p_model) / n_model      # boolean
var(p_hold)  = p_hold  · (1 − p_hold)  / n_hold        # boolean

se = 100 · sqrt( var(p_model) / p_hold²  +  p_model² · var(p_hold) / p_hold⁴ )

95% CI  = relative_lift_percent ± 1.96 · se
p_value = 2 · (1 − Φ(|relative_lift_percent / se|))

For a numeric metric, var per arm is stddev² / n instead of the binomial form; everything else is identical.

A CI and p-value are reported only when each arm has at least 100 users. At a zero holdout baseline the relative lift is undefined and reported as blank.
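The formula above can be sketched in a few lines of Python for a boolean metric. This is an illustrative reimplementation, not Levered's own code: the names `measured_lift`, `Lift`, `Z_95`, and `MIN_N` are hypothetical, and the handling of a zero standard error is an assumption of this sketch, since the text does not cover that case.

```python
import math
from typing import NamedTuple, Optional

Z_95 = 1.96    # critical value for the 95% CI, as in the formula above
MIN_N = 100    # minimum users per arm before a CI and p-value are reported


class Lift(NamedTuple):
    lift_pct: Optional[float]   # None when the holdout rate is zero (undefined)
    ci_low: Optional[float]     # None when no CI is reported
    ci_high: Optional[float]
    p_value: Optional[float]


def normal_cdf(x: float) -> float:
    """Standard normal CDF, Φ(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def measured_lift(p_model: float, n_model: int,
                  p_hold: float, n_hold: int) -> Lift:
    """Relative lift in percent with a delta-method CI (boolean metric)."""
    if p_hold == 0:
        # Zero holdout baseline: relative lift is undefined.
        return Lift(None, None, None, None)

    lift_pct = 100.0 * (p_model - p_hold) / p_hold
    if min(n_model, n_hold) < MIN_N:
        # Too few users in an arm: point estimate only.
        return Lift(lift_pct, None, None, None)

    # Binomial variance of each arm's rate.
    var_model = p_model * (1.0 - p_model) / n_model
    var_hold = p_hold * (1.0 - p_hold) / n_hold

    # Delta-method standard error of the ratio, in percent units.
    se = 100.0 * math.sqrt(var_model / p_hold**2
                           + p_model**2 * var_hold / p_hold**4)
    if se == 0:
        # Assumption of this sketch: skip the CI when both rates are degenerate.
        return Lift(lift_pct, None, None, None)

    half_width = Z_95 * se
    p_value = 2.0 * (1.0 - normal_cdf(abs(lift_pct / se)))
    return Lift(lift_pct, lift_pct - half_width, lift_pct + half_width, p_value)
```

Calling `measured_lift(0.5, 1000, 0.5, 1000)` gives a lift of 0.0 with a half-width of about 8.77 points and a p-value of 1.0. For a numeric metric, the two variance lines would use stddev² / n instead.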

Why pooling is valid. The estimator pools the whole date range rather than stratifying by day. This is correct because the holdout percentage is fixed for an optimization's lifetime — so the calendar day is not associated with which arm a user is in, and pooling is unbiased. If you ran an optimization under an older policy where the holdout percentage changed mid-flight, do not pool across the change: filter to a single-holdout-percentage window before comparing.
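To see why pooling across a holdout-percentage change is unsafe, here is a small arithmetic sketch with made-up numbers (none of them come from a real optimization). Both arms convert at the same rate inside each window, so the true lift is zero, yet the pooled comparison shows a large lift because the holdout is over-represented in the low-converting window.

```python
def pooled_rate(windows):
    """Pool (users, conversions) pairs from several windows into one rate."""
    users = sum(u for u, _ in windows)
    conversions = sum(c for _, c in windows)
    return conversions / users


def relative_lift_pct(p_model, p_hold):
    return 100.0 * (p_model - p_hold) / p_hold


# Hypothetical data, (users, conversions) per window:
#   window A: 50% holdout, everyone converts at 50%
#   window B: 10% holdout, everyone converts at 80%
model = [(1000, 500), (1800, 1440)]
holdout = [(1000, 500), (200, 160)]

# Within each window the arms are identical, so the lift is zero.
per_window = [
    relative_lift_pct(mc / mu, hc / hu)
    for (mu, mc), (hu, hc) in zip(model, holdout)
]

# Pooled across the change, the arms no longer share the same mix of days.
spurious = relative_lift_pct(pooled_rate(model), pooled_rate(holdout))

print(per_window)           # zero lift in both windows
print(round(spurious, 2))   # a spurious pooled lift of roughly 26%
```

Filtering to a single-holdout-percentage window removes this imbalance, which is why the comparison is unbiased there.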

Counts: the canonical definition

Lift parity depends on counting users the same way Levered does. The per-arm counts are participant-level, deduped to one row per user:

One row per anonymous_id. A user with multiple exposures counts once. Their arm and conversion window are anchored to their first exposure for the optimization.
Arm comes from is_holdout on that first exposure (true → holdout, false → model).
Conversion = the user has at least one reward event whose timestamp is within the optimization's conversion window of the first exposure (exposure_ts ≤ reward_ts ≤ exposure_ts + window), joined on anonymous_id.
Day boundaries for a date-range filter are UTC.
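The counting rules above can be sketched over in-memory events as follows. This is only an illustration of the definition, assuming timezone-aware UTC timestamps; the `Exposure` and `Reward` types and the `arm_counts` function are hypothetical names, not part of Levered's API.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple


class Exposure(NamedTuple):
    anonymous_id: str
    ts: datetime
    is_holdout: bool


class Reward(NamedTuple):
    anonymous_id: str
    ts: datetime


def arm_counts(exposures: Iterable[Exposure],
               rewards: Iterable[Reward],
               window: timedelta) -> Dict[str, Dict[str, int]]:
    """Per-arm users and conversions, one row per anonymous_id."""
    # Keep only each user's first exposure; it fixes both arm and anchor time.
    first: Dict[str, Exposure] = {}
    for e in exposures:
        current = first.get(e.anonymous_id)
        if current is None or e.ts < current.ts:
            first[e.anonymous_id] = e

    # Index reward timestamps by user for the window check.
    reward_ts: Dict[str, List[datetime]] = {}
    for r in rewards:
        reward_ts.setdefault(r.anonymous_id, []).append(r.ts)

    counts = {
        "model": {"users": 0, "conversions": 0},
        "holdout": {"users": 0, "conversions": 0},
    }
    for uid, e in first.items():
        arm = "holdout" if e.is_holdout else "model"
        counts[arm]["users"] += 1
        # Converted = at least one reward in [anchor, anchor + window].
        if any(e.ts <= ts <= e.ts + window for ts in reward_ts.get(uid, [])):
            counts[arm]["conversions"] += 1
    return counts
```

A user exposed twice is counted once, in the arm of the earlier exposure, and a reward that arrives before the first exposure or after the window closes does not count as a conversion.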

Reproduce it from your warehouse

This query reproduces the per-arm counts for an optimization and conversion window. Swap in your exposure and reward tables, the optimization_id, the window (here 24 hours), and an optional date range on the first-exposure day. The functions shown (such as DATEADD) follow Snowflake syntax; adapt them for other warehouses.

WITH first_exposure AS (
  SELECT
    anonymous_id,
    MIN(timestamp)                          AS anchor_ts,
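    -- arm = holdout status of the first exposure: MIN_BY returns is_holdout
    -- at the earliest timestamp (BOOLOR_AGG would flag a user as holdout
    -- if ANY of their exposures was holdout, not just the first)
    MIN_BY(is_holdout, timestamp)           AS is_holdout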
  FROM exposures
  WHERE optimization_id = '<OPTIMIZATION_ID>'
  GROUP BY anonymous_id
),
cohort AS (
  SELECT
    fe.anonymous_id,
    fe.is_holdout,
    fe.anchor_ts,
    -- converted = a reward within the window of the first exposure
    CASE WHEN EXISTS (
      SELECT 1 FROM rewards r
      WHERE r.anonymous_id = fe.anonymous_id
        AND r.timestamp >= fe.anchor_ts
        AND r.timestamp <= DATEADD(hour, 24, fe.anchor_ts)
    ) THEN 1 ELSE 0 END                     AS converted
  FROM first_exposure fe
  -- optional date range, on the first-exposure day, in UTC:
  -- WHERE fe.anchor_ts >= '2026-05-31'
)
SELECT
  CASE WHEN is_holdout THEN 'holdout' ELSE 'model' END AS arm,
  COUNT(*)                                             AS users,
  SUM(converted)                                       AS conversions,
  AVG(converted)                                       AS rate
FROM cohort
GROUP BY is_holdout;

Feed the two (rate, users) pairs into the formula above and you get the same lift, CI, and p-value as the dashboard and CLI.

Worked examples

These are the canonical golden values Levered's own tests assert against (z = 1.96, minN = 100). Use them to check your implementation:

p_model	n_model	p_hold	n_hold	Lift	95% CI	p-value
0.7239	50,689	0.7178	20,270	+0.85%	[−0.18%, +1.88%]	0.104
0.7265	33,672	0.7153	3,832	+1.57%	[−0.57%, +3.70%]	0.151
0.5000	1,000	0.5000	1,000	0.00%	[−8.77%, +8.77%]	1.000

(The first two rows are the Taxfix sign-up reward, measured over the full range and over the fixed-holdout window respectively. The second is the window you reach by filtering to one holdout percentage.)
