Quantile metrics

A quantile metric measures a specific point on the distribution of a continuous value rather than the average. The canonical examples are p90 page-load time, p95 latency, and p99 time-to-first-purchase. Quantiles are the right tool when averages mislead — long-tailed distributions like load time and revenue often have means dominated by outliers, while medians and upper percentiles tell the story of the typical and the worst-case user. In the experiment builder this is surfaced as the Percentile measure — pick it on a metric binding to analyse an existing project metric at the percentile of your choice.

What is it

In A vs B, you pick the Percentile measure in the experiment builder's Metrics step, not at metric creation time. On the binding's per-measure config slide you choose a value source (which event property to read: revenue, a built-in event property, or a custom-named property carried on the underlying metric) and a percentile from 0 to 100. Snap points at 50, 75, 90, 95, and 99 cover the common cases, and a free-form input accepts values such as p99.5 or p99.9.

Each variation's quantile is computed by ClickHouse's quantileTDigest aggregate against the raw event rows. Confidence intervals come from a bias-corrected bootstrap (1000 resamples by default) computed Node-side from the values returned by ClickHouse — the same approach Eppo uses in their open-source experimentation library.

When to use

Use a quantile metric whenever the average is misleading or the tail matters more than the centre:

p95 page-load time: the 95th percentile of how long your pages take to load. The metric of record for engineering and SRE buyers. A long tail of slow loads hurts retention even if the mean looks fine.
p50 time-to-first-purchase: half of new signups convert by this time. A leading indicator for activation experiments that doesn't get dragged around by stragglers.
p90 session duration: the 90th percentile of per-session time-on-site. Captures "engaged" users without letting bot traffic skew the mean.

Means are sensitive to outliers and pulled upward by long tails, yet they can still hide how bad the tail is. If just under 99% of visitors load the page in 1.2s and just over 1% wait 30s, the mean is about 1.5s, which looks fine. But the p99 is 30s, which is terrible. The mean hides the worst-case experience your users are actually having.
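The arithmetic behind that comparison can be checked with a short sketch. The data here is an assumption chosen to match the scenario (1,000 page loads, 989 at 1.2s and 11 at 30s), and the nearest-rank percentile rule is one common definition, not necessarily the one A vs B applies.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of the data
// at or below it. p is given in percent (0..100).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Illustrative data (an assumption, not real measurements):
// 989 loads at 1.2s and 11 loads at 30s.
const loadTimes = [...Array(989).fill(1.2), ...Array(11).fill(30)];

console.log(mean(loadTimes).toFixed(2)); // "1.52": the mean looks healthy
console.log(percentile(loadTimes, 50));  // 1.2: the typical visitor
console.log(percentile(loadTimes, 99));  // 30: the slow tail
```

The mean and the median both suggest a fast page, and only the p99 exposes the 30s experience.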

Example

You're testing a Cloudflare-edge-cached version of your homepage against the origin-served version. You want to know whether the cache actually makes the slow tail faster, not whether it nudges the median by a few milliseconds.

In your snippet integration, send a load-time measurement on each page load:

avsb.track.event('pageLoad', { value: performance.now() })
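Note that performance.now() returns the milliseconds elapsed since the page's time origin at the moment it is called, so the reported value depends on when the call runs. A minimal sketch of one way to report the time to the end of the load event instead, using the browser's Navigation Timing entry. The reportPageLoad helper and its wiring are hypothetical and not part of the A vs B snippet; only avsb.track.event comes from the docs.

```javascript
// Hypothetical helper (not part of the A vs B snippet): reports page-load time.
// `tracker` is expected to expose track.event like the avsb object does.
function reportPageLoad(tracker, perf) {
  const [nav] = perf.getEntriesByType('navigation');
  // loadEventEnd stays 0 until the load event has finished.
  const value = nav && nav.loadEventEnd > 0
    ? nav.loadEventEnd - nav.startTime
    : perf.now(); // fallback: ms elapsed since the page's time origin
  tracker.track.event('pageLoad', { value });
  return value;
}

// Browser wiring; skipped when run outside a browser.
if (typeof window !== 'undefined' && typeof avsb !== 'undefined') {
  window.addEventListener('load', () => {
    // Defer one task so loadEventEnd is populated before it is read.
    setTimeout(() => reportPageLoad(avsb, performance), 0);
  });
}
```

Passing the tracker and the performance object in as arguments keeps the helper easy to exercise with stubs.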

In the experiment builder's Metrics step, attach a new metric to the pageLoad event and pick the Percentile measure with percentile = 95. Mark the binding as the primary metric for the experiment.

After the experiment has data:

Control:   1820 ms  (p95 of 18,420 visitors)
Variant:    980 ms  (p95 of 18,512 visitors)
Lift:      −46.2%  (95% bootstrap CI: −54.1% to −38.0%)
Badge:     Quantile (bootstrap CI)

How A vs B computes it

Quantile estimates and their confidence intervals are computed in two layers:

Point estimate: ClickHouse's quantileTDigest(p)(value) builds a t-digest sketch in the query and returns the requested percentile. ClickHouse takes the level p as a fraction between 0 and 1, so p95 is quantileTDigest(0.95)(value). T-digest is a streaming quantile algorithm that's fast and accurate for percentiles up to about p95. For p99 and beyond, A vs B also runs quantileExact alongside and surfaces the gap between the two on the results row. If they disagree by more than a small tolerance, the row warns about t-digest approximation error.
Confidence interval: a bias-corrected percentile bootstrap with 1000 resamples by default. For very large experiments (N > 100k per variant), A vs B falls back to a t-digest-sampled subset to keep result computation fast, and the row notes that the CI is "approximate (sampled bootstrap)". From the results page you can request a high-precision recompute (5000 resamples) that runs asynchronously and updates the row when ready.
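To make the bootstrap step concrete, here is a minimal sketch of a plain percentile bootstrap for the relative lift in a quantile. It is deliberately simpler than what is described above: it omits the bias correction and the sampled fallback, and it is not A vs B's implementation. The function names, the nearest-rank percentile rule, the seeded generator, and the synthetic data are all assumptions made for illustration.

```javascript
// Seeded PRNG (mulberry32) so repeated runs give the same interval.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Nearest-rank percentile of an already-sorted array; p is in percent (0..100).
function percentileOfSorted(sorted, p) {
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Percentile of one resample drawn with replacement from `values`.
function resampledPercentile(values, p, rand) {
  const draw = new Float64Array(values.length);
  for (let i = 0; i < draw.length; i++) {
    draw[i] = values[Math.floor(rand() * values.length)];
  }
  draw.sort(); // typed arrays sort numerically by default
  return percentileOfSorted(draw, p);
}

// Percentile-bootstrap CI for the relative lift (variant - control) / control.
function bootstrapLiftCI(control, variant, p, { resamples = 1000, level = 95, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  const lifts = [];
  for (let b = 0; b < resamples; b++) {
    const c = resampledPercentile(control, p, rand);
    const v = resampledPercentile(variant, p, rand);
    lifts.push((v - c) / c);
  }
  lifts.sort((x, y) => x - y);
  const tail = (100 - level) / 2; // e.g. 2.5 for a 95% interval
  return {
    lower: percentileOfSorted(lifts, tail),
    upper: percentileOfSorted(lifts, 100 - tail),
  };
}

// Synthetic load times in ms, for illustration only.
const control = Array.from({ length: 500 }, (_, i) => 1000 + 2 * i);
const variant = Array.from({ length: 500 }, (_, i) => 500 + i);
const ci = bootstrapLiftCI(control, variant, 95);
console.log(ci); // { lower, upper } as fractions, e.g. -0.5 means a 50% reduction
```

Each resample is sorted in full, which is fine for a sketch but is the cost that motivates a sampled fallback at large N.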

Quantile rows include a Show both CIs toggle that displays the bootstrap CI alongside a normal-approximation CI computed from the same data. The gap between the two surfaces parametric / non-parametric agreement so you can judge whether the bootstrap is adding precision or just noise.

Sending values from the snippet

Quantile metrics on revenue need no snippet change — A vs B uses the existing revenue field. Quantile metrics on any other continuous value need the new value field on avsb.track.event:

// Existing revenue tracking — unchanged
avsb.track.event('purchase', { revenue: 49.99 })

// New: generic continuous-value tracking for quantile metrics
avsb.track.event('pageLoad', { value: performance.now() })
avsb.track.event('scrollDepth', { value: scrollPercent })

Both revenue and value can coexist on the same call. Existing revenue-only integrations keep working with no changes.

Sample experiment

A worked example: an edge-cache rollout experiment with ~18k visitors per arm. p50 load time moves from 380ms (control) to 360ms (variant) — basically flat at the median. But p95 moves from 1820ms to 980ms — a 46% improvement, statistically significant with a bootstrap CI of −54.1% to −38.0%. The mean would have shown a smaller, ambiguous change; the quantile metric makes the tail improvement obvious.

FAQ

Which percentile should I pick?

Pick the percentile that matches the experience you care about. p50 is the median (the typical user). p95 captures the worst 5% — often the right threshold for performance. p99 captures the worst 1% — useful for SLA work but noisier with smaller samples.

Why is the bootstrap CI different from the normal-approximation CI?

Bootstrap CIs make no assumption about the underlying distribution and are robust to skew and heavy tails — common in real load-time and revenue data. Normal-approximation CIs assume the sampling distribution of the quantile is roughly Gaussian, which is a fine approximation for medians but breaks down at extreme percentiles. The Show both CIs toggle lets you see the gap.
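The standard large-sample result for a sample quantile shows why the normal approximation degrades in the tail. This is a textbook result, not a description of the formula A vs B uses internally. Here $\hat q_p$ is the sample quantile at level $p$, $q_p$ the true quantile, $n$ the per-variant sample size, and $f$ the density of the underlying value.

```latex
\sqrt{n}\,\bigl(\hat q_p - q_p\bigr) \;\xrightarrow{d}\;
\mathcal{N}\!\left(0,\ \frac{p(1-p)}{f(q_p)^2}\right),
\qquad
\text{95\% CI: } \hat q_p \pm z_{0.975}\,\frac{\sqrt{p(1-p)}}{\sqrt{n}\,f(q_p)}
```

Near the median the density $f(q_p)$ is usually large and easy to estimate, so the interval is stable. Deep in a long tail $f(q_p)$ is small and poorly estimated, so the normal-approximation interval becomes wide and unreliable at percentiles like p99.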

Does CUPED apply to quantile metrics?

Not in the standard form. CUPED is a variance-reduction technique for means; the analogous adjustment for quantiles is an active area of research and not yet shipped. The quantile metric's CI relies on the bootstrap alone.

Ratio metrics: Track per-visitor ratios like revenue-per-visitor.

Composite metrics: Combine multiple metrics into a single weighted decision signal.

Winsorization: Cap extreme values before computing the test statistic.