Segment Lift

The Segment Lift section on the Results page answers a question the overall lift number cannot: for whom did the variation help, and for whom did it hurt? It breaks the primary metric down across every available segment, reports the lift inside each segment value, and applies multiple-testing correction so you can trust the significance flags. It also surfaces a top-movers card that names the segments where the treatment helped most and hurt most.

What is Segment Lift?

A new checkout flow might lift overall conversion by +3% — but inside that average, mobile users could be lifting +15% while desktop users are dropping by −6%. The aggregate hides both stories. The product decision is no longer "ship or kill" — it is "ship to mobile, keep the control on desktop, and investigate why."

Statistically this is called heterogeneous treatment effects (HTE) — the recognition that a single average lift usually averages over real subgroup differences. Segment Lift makes those differences visible without forcing you to filter through every segment by hand. Every available segment is computed in one pass.

How A vs B computes per-segment lift

For each available segment on the experiment — built-in segments like device, country, browser, plus any custom segments you have sent — A vs B runs a single ClickHouse query that groups exposures and conversions by segment value and variation. Each (segment value × variation) cell is then handed to the same statistics engine that powers your primary metric card. The numbers in the Segment Lift table are computed exactly the same way as the headline lift, just scoped to a subset of visitors.

Per-segment ClickHouse query — one query per segment key, run in parallel.
Per-cell engine call — your selected engine (Bayesian, Frequentist, or Sequential) computes lift, CI, and significance for every cell.
Interaction test — a chi-square (or Welch's t for continuous metrics) checks whether the differences between segment values are real.
BH correction — all p-values are corrected together using the Benjamini-Hochberg false-discovery-rate method.
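
As a rough illustration of "the same calculation, scoped to a subset of visitors", the sketch below applies one lift function to all rows and then to each segment value's rows. It is not the A vs B engine: the row layout, the function names, and the traffic figures are assumptions chosen to reproduce the checkout illustration above, and it computes only a point estimate with no CI or significance.

```python
from collections import defaultdict

# Hypothetical aggregated rows: (segment_value, variation, exposures, conversions).
# The figures are invented to reproduce the +15% / -6% / +3% checkout illustration.
ROWS = [
    ("mobile", "control", 3000, 300),
    ("mobile", "treatment", 3000, 345),
    ("desktop", "control", 4000, 400),
    ("desktop", "treatment", 4000, 376),
]


def relative_lift(rows):
    """Relative lift of treatment over control for whatever rows are passed in."""
    totals = defaultdict(lambda: [0, 0])  # variation -> [exposures, conversions]
    for _segment, variation, exposures, conversions in rows:
        totals[variation][0] += exposures
        totals[variation][1] += conversions
    control_rate = totals["control"][1] / totals["control"][0]
    treatment_rate = totals["treatment"][1] / totals["treatment"][0]
    return treatment_rate / control_rate - 1.0


def segment_lifts(rows):
    """Apply the same lift calculation to each segment value's subset of rows."""
    by_segment = defaultdict(list)
    for row in rows:
        by_segment[row[0]].append(row)
    return {segment: relative_lift(subset) for segment, subset in by_segment.items()}


overall = relative_lift(ROWS)
per_segment = segment_lifts(ROWS)
print(f"overall: {overall:+.1%}")
for segment, lift in per_segment.items():
    print(f"{segment}: {lift:+.1%}")
```

With these made-up counts the overall lift comes out near +3% while mobile sits near +15% and desktop near −6%, which is the pattern the aggregate number hides.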

Why multiple-testing correction matters

If you run 20 segment-by-segment tests at α = 0.05, you expect one false positive on average by chance alone, even when no segment truly differs. This is the segment-shopping problem: filter through enough segments and you will always find a "significant" one. Without correction, the segment dropdown is a tool for fooling yourself.
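
The arithmetic behind that claim can be checked in a few lines. The sketch assumes the 20 tests are independent, which is a simplification: cells from different segment keys overlap in visitors, so real tests are correlated.

```python
ALPHA = 0.05
N_TESTS = 20

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = N_TESTS * ALPHA

# Chance of at least one false positive, ASSUMING independent tests.
p_at_least_one = 1.0 - (1.0 - ALPHA) ** N_TESTS

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```

Under that independence assumption, the chance of seeing at least one spurious "significant" segment is roughly 64%.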

A vs B applies the Benjamini-Hochberg (BH) procedure across every cell in the panel. BH controls the false-discovery rate: of the cells flagged significant, the expected fraction that are false positives is held at or below α. The Segment Lift table shows the BH-adjusted p-value (p_adj), and that is the number to read. A cell that looks significant pre-correction will often disappear after BH, and that is the correct behaviour, not a bug.
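
For readers who want to see the adjustment itself, here is a minimal sketch of the standard BH step-up adjustment in plain Python. The function name and the example p-values are illustrative assumptions, not A vs B internals.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five cells; four are below 0.05 before correction.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
p_adj = benjamini_hochberg(raw)
flagged = [p < 0.05 for p in p_adj]

for r, a, f in zip(raw, p_adj, flagged):
    print(f"raw={r:.3f}  p_adj={a:.5f}  flagged={f}")
```

In this example the cells with raw p-values of 0.039 and 0.041 both adjust to 0.05125 and lose their flag, which is the "disappears after BH" behaviour described above.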

Reading the interaction p-value

Each segment block carries an interaction badge — "Strong interaction", "Possible interaction", or "No interaction" — with its own p-value. This is a different question from per-cell significance:

Per-cell significance asks: "Did variation B beat control inside this segment value?"
Interaction significance asks: "Did the treatment effect differ across segment values? Did mobile really respond differently from desktop, or could the apparent gap be noise?"

Interaction is the headline number for "should I personalize on this segment". A "strong interaction" badge means the differences between segment values are unlikely to be chance; a "no interaction" badge means the segment slices look broadly similar even if one cell happened to clear the significance bar.
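
To make the distinction concrete, the sketch below tests whether the treatment effect differs between two segment values of a conversion metric. It is an assumption-laden simplification, not the exact test A vs B runs: it handles only two segment values, uses the absolute difference in conversion rate, and relies on a normal approximation (the squared z statistic is a chi-square with 1 degree of freedom). The counts are hypothetical.

```python
import math


def diff_and_variance(conv_c, n_c, conv_t, n_t):
    """Absolute rate difference (treatment - control) and its approximate variance."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    variance = p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t
    return p_t - p_c, variance


def interaction_p_value(cell_a, cell_b):
    """Two-sided p-value for 'the treatment effect differs between two segment values'."""
    d_a, v_a = diff_and_variance(*cell_a)
    d_b, v_b = diff_and_variance(*cell_b)
    z = (d_a - d_b) / math.sqrt(v_a + v_b)
    # Under the null, z**2 is chi-square with 1 degree of freedom;
    # erfc(|z| / sqrt(2)) is the matching two-sided tail probability.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical cells: (control conversions, control n, treatment conversions, treatment n)
mobile = (300, 3000, 345, 3000)
desktop = (400, 4000, 376, 4000)

p_interaction = interaction_p_value(mobile, desktop)
print(f"interaction p-value: {p_interaction:.4f}")
```

With these counts the interaction p-value lands around 0.04. The question it answers is whether the mobile and desktop effects differ from each other, not whether either one differs from zero.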

Sample-size policy

Tiny cells produce wild lift estimates. A vs B applies two thresholds to keep the panel honest:

Below 30 visitors per arm — the cell is dropped from analysis. Per-cell lift is reported as —.
Between 30 and 1,000 visitors per arm — the cell renders with a Low N badge. The estimate is shown, but you should treat it as directional, not conclusive.
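
The two thresholds can be read as a small classification rule. The sketch below is one interpretation, with two assumptions the text above does not settle: the smallest arm decides the outcome, and a cell with exactly 1,000 visitors in its smallest arm is treated as fully powered. The function and label names are hypothetical.

```python
MIN_VISITORS = 30      # below this per arm, the cell is dropped from analysis
LOW_N_VISITORS = 1000  # below this per arm, the cell renders with a Low N badge


def classify_cell(visitors_per_arm):
    """Classify a cell from the visitor counts of its arms.

    Assumption: the smallest arm governs, since the policy is stated per arm.
    """
    smallest = min(visitors_per_arm)
    if smallest < MIN_VISITORS:
        return "dropped"
    if smallest < LOW_N_VISITORS:
        return "low_n"
    return "ok"


print(classify_cell([25, 400]))     # smallest arm under 30
print(classify_cell([30, 5000]))    # smallest arm between the thresholds
print(classify_cell([1200, 1500]))  # both arms above the upper threshold
```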

Engine-specific behaviour

The Segment Lift table follows whichever engine you have selected for the main results page.

Frequentist and Sequential — the significance column reports BH-adjusted p-values (p_adj). A cell is flagged significant when p_adj < α.
Bayesian — the significance column reports the probability to beat control. A cell is flagged significant when that probability exceeds 1 − α. Bayesian inference does not formally need BH correction; the adjusted p-value column is still populated so the panel reads the same regardless of engine.
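
The two flagging rules can be written side by side. This is a sketch of the rules as stated above. The function name, the engine strings, and the default α of 0.05 are assumptions.

```python
def is_flagged(engine, value, alpha=0.05):
    """Apply the per-engine significance rule to one cell.

    For "frequentist" and "sequential", value is the BH-adjusted p-value (p_adj).
    For "bayesian", value is the probability to beat control.
    """
    if engine in ("frequentist", "sequential"):
        return value < alpha
    if engine == "bayesian":
        return value > 1.0 - alpha
    raise ValueError(f"unknown engine: {engine!r}")


print(is_flagged("frequentist", 0.02))  # p_adj below alpha
print(is_flagged("sequential", 0.08))   # p_adj above alpha
print(is_flagged("bayesian", 0.97))     # probability above 1 - alpha
```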

What this is — and isn't — for

Segment Lift is a discovery signal, not a deployment decision. A flagged cell is a hypothesis worth retesting — not a green light to ship a per-segment rollout from the panel alone. The right follow-up to an interesting Segment Lift finding is usually a new experiment targeted at that segment, or a follow-on rollout decided outside the panel with a human in the loop.

The existing segment filter is the right surface for drilling into a single slice once the panel has flagged it. Use Segment Lift to surface candidates, then filter to that slice to see the full breakdown — charts, secondary metrics, time series.

The 100-cell cap

Per-segment fan-out can grow large fast — a project with hundreds of country values could otherwise dominate the panel. To keep the panel readable and the query cost bounded, A vs B caps each experiment at 100 (segment value × variation) cells. When the cap kicks in, the panel footer says so, and segments are kept in order of total exposure count — the most-trafficked segment values survive.
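
One way to picture the cap is as a ranking by exposure followed by a cut-off. The sketch below assumes a segment value is kept or dropped as a whole, with all of its variations together. The function name and input shape are hypothetical, and the real selection logic may differ.

```python
MAX_CELLS = 100


def apply_cell_cap(exposures_by_value, n_variations, max_cells=MAX_CELLS):
    """Keep the most-trafficked segment values that fit under the cell cap.

    exposures_by_value maps a segment value to its total exposure count.
    Returns (kept_values, cap_was_hit).
    """
    ranked = sorted(exposures_by_value, key=exposures_by_value.get, reverse=True)
    max_values = max_cells // n_variations  # each value costs one cell per variation
    kept = ranked[:max_values]
    return kept, len(ranked) > len(kept)


# Hypothetical experiment: 60 country values, 2 variations, so 120 candidate cells.
exposures = {f"country_{i}": 10_000 - 100 * i for i in range(60)}
kept, capped = apply_cell_cap(exposures, n_variations=2)
print(len(kept), capped)
```

Here 50 values (100 cells) survive and the low-traffic tail is dropped, so the footer notice would apply.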