Winsorization

Winsorization is an outlier-handling toggle that caps the most extreme values at a chosen percentile before any statistic is computed. It is the industry-standard treatment for long-tailed metrics like revenue, time-on-site, and session duration — where one enterprise deal or one unusually engaged visitor can swing the test's outcome out of proportion to their actual frequency. Winsorization keeps every visitor in the analysis (it doesn't throw rows away) but clamps the extremes so the mean and variance reflect the typical user instead of the headline outlier.

What is it

Winsorization clamps values above a chosen upper percentile (default p99) and optionally below a lower percentile (off by default; used for symmetric capping when a metric has a negative tail). Once enabled on a metric, every value above the cap is replaced with the cap value; every value below the lower cap is replaced with the lower cap. The visitor stays in the variant — only the value is changed.
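
As a rough illustration of the clamping described above, here is a minimal Python sketch. The names cap_at and winsorize are hypothetical, and the inclusive linear interpolation used for the percentile is an assumption; the product's own quantile computation may estimate the cap differently.

```python
from statistics import quantiles
from typing import List, Optional, Sequence


def cap_at(values: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) using inclusive linear interpolation."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def winsorize(values: Sequence[float], upper_pct: int = 99,
              lower_pct: Optional[int] = None) -> List[float]:
    """Clamp values to the upper cap and, if requested, to the lower cap.

    Every input value yields exactly one output value, so no rows are dropped.
    """
    upper_cap = cap_at(values, upper_pct)
    lower_cap = cap_at(values, lower_pct) if lower_pct is not None else None
    clamped = []
    for v in values:
        v = min(v, upper_cap)          # values above the cap become the cap
        if lower_cap is not None:
            v = max(v, lower_cap)      # values below the lower cap become the lower cap
        clamped.append(v)
    return clamped


# Upper cap only: 99 ordinary purchases and one extreme one (made-up data).
revenue = [10] * 99 + [5000]
capped = winsorize(revenue)
```

The output list has the same length as the input, which mirrors the point that the visitor stays in the variant and only the value changes.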

Available on every measure that operates on a continuous or numeric value: Total value per visitor, Total value, Rate, Percentile, and Composite (weighted). (Conversion-rate measures — Unique conversions per visitor, Total events, and Unique visitors who fired — are binary or count-based so winsorization doesn't apply.) For the Composite (weighted) measure, winsorization is applied per component binding before the weighted sum.
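
For the Composite (weighted) case, the per-component capping could look roughly like the sketch below. The function name, the dictionary layout, and the per-component percentile argument are all assumptions for illustration, not the product's data model.

```python
from statistics import quantiles
from typing import Dict, List, Sequence


def composite_values(components: Dict[str, Sequence[float]],
                     weights: Dict[str, float],
                     upper_pcts: Dict[str, int]) -> List[float]:
    """Winsorize each component on its own cap, then take the weighted sum per visitor.

    All component lists are assumed to be aligned: index i is the same visitor in each.
    """
    capped = {}
    for name, values in components.items():
        # Each component gets its own cap, computed from its own values.
        cap = quantiles(values, n=100, method="inclusive")[upper_pcts[name] - 1]
        capped[name] = [min(v, cap) for v in values]
    n_visitors = len(next(iter(capped.values())))
    # The weighted sum only ever sees capped component values.
    return [sum(weights[name] * capped[name][i] for name in capped)
            for i in range(n_visitors)]


# Made-up data: one extreme revenue value, unremarkable page counts.
scores = composite_values(
    components={"revenue": [10, 20, 30, 40, 1000], "pages": [1, 2, 3, 4, 5]},
    weights={"revenue": 0.5, "pages": 2.0},
    upper_pcts={"revenue": 75, "pages": 99},
)
```

Capping before the sum means an extreme value in one component cannot dominate the composite through its weight.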

Winsorization is configured on the binding's per-measure config slide in the experiment builder — not at metric creation time. The same project metric can be analysed winsorized on one experiment and unwinsorized on another.

When to use

Use winsorization whenever a metric has a long right tail that you suspect is dragging your variance higher than the typical user warrants:

Revenue per visitor (B2C) — most purchases are $20–$100, but the occasional $5,000 purchase happens. Without winsorization, the variance is dominated by those high-value events and confidence intervals are wider than they need to be.
Revenue per visitor (B2B) — even more extreme: most contracts are $10k–$50k, but a single $1M deal can move the mean by an order of magnitude. Winsorization at p95 or p99 keeps those outliers from dominating the test.
Session duration — visitors who leave a tab open overnight push the right tail into the hundreds of minutes. Winsorizing at p99 typically caps these at ~30 minutes, which is a much fairer representation of engagement.
Items in cart — most users add 1–3 items; a few add 47 (often automation or restock). Winsorization keeps those carts in the test but stops them from defining the average.

If your metric is binary (click rate, conversion rate, signup rate), winsorization doesn't apply — there's nothing extreme to cap. If your metric's extreme values are exactly what you care about (e.g., revenue from whale customers in a high-end product test), winsorizing them away defeats the purpose. The rule of thumb: winsorize when you suspect outliers are noise; don't winsorize when they're signal.

Example

You're running a checkout experiment on a B2C store. Revenue per visitor in the last 30 days has this shape:

Mean:    $42
Median:  $28
p95:     $180
p99:     $410
Max:     $5,820  (one purchase from a corporate restock)

Without winsorization, that single $5,820 purchase pulls the mean up and inflates the variance. If you enable winsorization at upper p99:

Mean:    $39  (reduced by $3 / 7%)
p99:     $410 (the cap)
Max:     $410 (was $5,820)
Variance: reduced ~22%

When you toggle winsorization on in the metric editor, A vs B shows a live preview against the last 30 days of data so you can see exactly how many visitors would be capped, how the mean shifts, and how much variance is reduced — before you commit the change.

How A vs B computes it

Winsorization runs as a two-CTE pattern inside the same SQL that computes the experiment statistics:

The first CTE computes the cap from the in-experiment data: quantile(upperPercentile)(value) per arm. Computing the cap from in-experiment data (not historical data) is the standard practice: it avoids leaking information across the experiment's start / stop boundary.
The main aggregation applies least(value, cap) before summing or averaging. For a lower cap, greatest(value, lowerCap) runs alongside.
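
The two steps above can be mirrored outside SQL. The following Python sketch is only an illustration of the pattern; the function name, the (arm, value) row shape, and the inclusive percentile interpolation are assumptions, and the real query may estimate the quantile differently.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, Tuple


def winsorized_mean_by_arm(rows: Iterable[Tuple[str, float]],
                           upper_pct: int = 99) -> Dict[str, float]:
    """Compute a per-arm cap, then average the capped values per arm."""
    by_arm = defaultdict(list)
    for arm, value in rows:
        by_arm[arm].append(value)

    # Step 1 (the cap CTE): one cap per arm, from in-experiment values only.
    caps = {arm: quantiles(vals, n=100, method="inclusive")[upper_pct - 1]
            for arm, vals in by_arm.items()}

    # Step 2 (the main aggregation): least(value, cap) before averaging.
    return {arm: sum(min(v, caps[arm]) for v in vals) / len(vals)
            for arm, vals in by_arm.items()}


# Made-up rows for two arms; arm A contains one extreme value.
rows = [("A", v) for v in [10, 20, 30, 40, 1000]] + \
       [("B", v) for v in [20, 30, 40, 50, 60]]
means = winsorized_mean_by_arm(rows, upper_pct=75)
```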

For metric types with continuous values (ratio numerators, composite components), the cap is computed and applied per component before any CUPED variance reduction. Order of operations:

raw values  →  winsorize  →  CUPED  →  variance / lift / CI

This ordering is industry-standard and matches how Eppo and Statsig sequence the same transforms.
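
The ordering above can be sketched in Python. This is a minimal illustration, not the product's implementation: the function names are hypothetical, the covariate x is assumed to be a pre-experiment value per visitor that is left uncapped, and CUPED is written in its common single-covariate form, y - theta * (x - mean(x)) with theta = cov(x, y) / var(x).

```python
from statistics import fmean, quantiles
from typing import List, Sequence


def cuped_adjust(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Single-covariate CUPED: remove the part of y explained by x."""
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # a constant covariate explains nothing
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def winsorize_then_cuped(y: Sequence[float], x: Sequence[float],
                         upper_pct: int = 99) -> List[float]:
    """Apply the upper cap to y first, then run CUPED on the capped values."""
    cap = quantiles(y, n=100, method="inclusive")[upper_pct - 1]
    y_capped = [min(v, cap) for v in y]
    return cuped_adjust(y_capped, x)


# Made-up data: the covariate tracks the capped metric exactly, so CUPED
# removes all remaining variance once the outlier has been capped.
y = [10, 20, 30, 40, 1000]
x = [1, 2, 3, 4, 4]
adjusted = winsorize_then_cuped(y, x, upper_pct=75)
```

Running CUPED on the raw values instead would let the single extreme value distort theta, which is one way to see why the cap comes first.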

Live preview against past data

When you toggle winsorization on in the metric editor, A vs B fetches a 30-day summary from the metric's historical event stream and shows:

The cap value that would have applied at the chosen percentile
The number of visitors whose values would have been capped
The reduction in mean and variance the cap would have produced

This makes the impact concrete before the metric ships. If the preview shows that p99 only caps 4 visitors and reduces variance by 0.2%, winsorization probably isn't worth enabling. If it shows that p95 caps 80 visitors and reduces variance by 18%, you're likely getting real precision out of it.

If the metric has fewer than 100 historical values, the preview panel warns that the cap may be unstable and recommends widening the training window or waiting for more data. The save isn't blocked, but you should expect the in-experiment cap to be noisier than the preview suggests.
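
The figures the preview reports can be approximated with a short Python sketch. Everything here is an assumption for illustration: the function and field names, one value per visitor, population variance, and inclusive percentile interpolation. Only the 100-value threshold comes from the description above.

```python
from statistics import fmean, pvariance, quantiles
from typing import Dict, Sequence, Union


def preview_winsorization(values: Sequence[float], upper_pct: int = 99,
                          min_values: int = 100) -> Dict[str, Union[float, int, bool]]:
    """Summarize what an upper cap would have done to historical values."""
    vals = [float(v) for v in values]
    cap = quantiles(vals, n=100, method="inclusive")[upper_pct - 1]
    capped = [min(v, cap) for v in vals]

    mean_before, mean_after = fmean(vals), fmean(capped)
    var_before, var_after = pvariance(vals), pvariance(capped)
    return {
        "cap": cap,
        "n_capped": sum(1 for v in vals if v > cap),
        # Percentage reductions; guarded against division by zero.
        "mean_reduction_pct":
            100 * (mean_before - mean_after) / mean_before if mean_before else 0.0,
        "variance_reduction_pct":
            100 * (var_before - var_after) / var_before if var_before else 0.0,
        # Mirrors the warning for small samples; it does not block anything.
        "cap_may_be_unstable": len(vals) < min_values,
    }


# Made-up history: 99 ordinary values and one extreme one.
summary = preview_winsorization([10] * 99 + [5000])
```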

Sealed-plan amendments

Toggling winsorization on or off, or changing the cap percentile, on a metric referenced by a sealed analysis plan is recognised as an amendment-worthy change. The amendment flow blocks the save until a rationale is written; the amendment is logged; results recompute using the new definition from the amendment date forward, with a visible boundary in the timeseries chart.

FAQ

What percentile should I use?

p99 is the most common default: it caps the highest 1% of values. p95 is more aggressive and is sometimes the right choice for very long-tailed metrics. Below p95, you're changing too much of the distribution and the resulting mean stops being a faithful summary of the data. The live preview is the best way to choose: pick the percentile that materially reduces variance without changing the mean by more than a few percent.

Does winsorization interact with CUPED?

Yes. Winsorization runs first, then CUPED applies to the winsorized values. This means CUPED's variance-reduction works on the cleaned distribution rather than the raw one, which is more effective in practice and is the industry-standard ordering.

Can I change winsorization on a running experiment?

Yes, but if the metric is in a sealed plan it triggers the amendment flow with a written rationale. The change applies prospectively — results recompute from the amendment date forward, with a visible boundary on the timeseries chart.

When should I use the lower cap?

Rarely. The lower cap is for metrics with a negative tail — e.g., a net-revenue metric where refunds push some visitor values below zero. Most experimentation metrics are non-negative and only need the upper cap.

Ratio metrics: Track per-visitor ratios with delta-method variance.

Quantile metrics: Measure percentiles of continuous values.

Composite metrics: Combine multiple metrics into a single decision signal.