Winsorization
Winsorization is an outlier-handling toggle that caps the most extreme values at a chosen percentile before any statistic is computed. It is the industry-standard treatment for long-tailed metrics like revenue, time-on-site, and session duration — where one enterprise deal or one unusually engaged visitor can swing the test's outcome out of proportion to their actual frequency. Winsorization keeps every visitor in the analysis (it doesn't throw rows away) but clamps the extremes so the mean and variance reflect the typical user instead of the headline outlier.
What is it
Winsorization clamps values above a chosen upper percentile (default p99) and optionally below a lower percentile (off by default; used for symmetric capping when a metric has a negative tail). Once enabled on a metric, every value above the cap is replaced with the cap value; every value below the lower cap is replaced with the lower cap. The visitor stays in the variant — only the value is changed.
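The clamping rule above can be sketched in a few lines of Python. This is a minimal illustration, not A vs B's implementation: the function names `percentile` and `winsorize` are hypothetical, and the linear-interpolation percentile is an assumption (the product's quantile estimator may differ).

```python
from typing import List, Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed estimator)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 100.0:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def winsorize(
    values: Sequence[float],
    upper_pct: float = 99.0,
    lower_pct: Optional[float] = None,
) -> List[float]:
    """Clamp values to the upper cap and, if requested, to the lower cap.

    Every input value produces exactly one output value, so no visitor
    is dropped; only the extreme values change.
    """
    upper_cap = percentile(values, upper_pct)
    lower_cap = percentile(values, lower_pct) if lower_pct is not None else None
    result = []
    for v in values:
        v = min(v, upper_cap)          # values above the cap become the cap
        if lower_cap is not None:
            v = max(v, lower_cap)      # values below the lower cap become it
        result.append(v)
    return result
```

The output always has the same length as the input, which is the property that separates winsorizing from trimming.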
Available on every measure that operates on a continuous or numeric value: Total value per visitor, Total value, Rate, Percentile, and Composite (weighted). (Conversion-rate measures — Unique conversions per visitor, Total events, and Unique visitors who fired — are binary or count-based so winsorization doesn't apply.) For the Composite (weighted) measure, winsorization is applied per component binding before the weighted sum.
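For the Composite (weighted) case, the ordering "cap each component, then take the weighted sum" might look like the sketch below. The data layout (one list of per-visitor values per component, aligned by index), the function name `composite_per_visitor`, and the percentile estimator are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed estimator)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def composite_per_visitor(
    components: Dict[str, List[float]],
    weights: Dict[str, float],
    upper_pct: Dict[str, Optional[float]],
) -> List[float]:
    """Winsorize each component on its own, then combine with weights."""
    capped: Dict[str, List[float]] = {}
    for name, vals in components.items():
        pct = upper_pct.get(name)
        if pct is None:
            capped[name] = list(vals)            # component left unwinsorized
        else:
            cap = percentile(vals, pct)
            capped[name] = [min(v, cap) for v in vals]
    n_visitors = len(next(iter(capped.values())))
    # Weighted sum is taken only after every component has been capped.
    return [
        sum(weights[name] * capped[name][i] for name in capped)
        for i in range(n_visitors)
    ]
```

Capping per component means an outlier in one component cannot leak into the composite through its weight.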
Winsorization is configured on the binding's per-measure config slide in the experiment builder — not at metric creation time. The same project metric can be analysed winsorized on one experiment and unwinsorized on another.
When to use
Use winsorization whenever a metric has a long right tail that you suspect is dragging your variance higher than the typical user warrants:
- Revenue per visitor (B2C) — most purchases are $20–$100, but the occasional $5,000 purchase happens. Without winsorization, the variance is dominated by those high-value events and confidence intervals are wider than they need to be.
- Revenue per visitor (B2B) — even more extreme: most contracts are $10k–$50k, but a single $1M deal can move the mean by an order of magnitude. Winsorization at p95 or p99 keeps those outliers from dominating the test.
- Session duration — visitors who leave a tab open overnight push the right tail into the hundreds of minutes. Winsorizing at p99 typically caps these at ~30 minutes, which is a much fairer representation of engagement.
- Items in cart — most users add 1–3 items; a few add 47 (often automation or restock). Winsorization keeps those carts in the test but stops them from defining the average.
Example
You're running a checkout experiment on a B2C store. Revenue per visitor in the last 30 days has this shape:
- Mean: $42
- Median: $28
- p95: $180
- p99: $410
- Max: $5,820 (one purchase from a corporate restock)
Without winsorization, that single $5,820 purchase pulls the mean up and inflates the variance. If you enable winsorization at upper p99:
- Mean: $39 (reduced by $3 / 7%)
- p99: $410 (the cap)
- Max: $410 (was $5,820)
- Variance: reduced ~22%
When you toggle winsorization on in the metric editor, A vs B shows a live preview against the last 30 days of data so you can see exactly how many visitors would be capped, how the mean shifts, and how much variance is reduced, all before you commit the change.
How A vs B computes it
Winsorization runs as a two-CTE pattern inside the same SQL that computes the experiment statistics:
- The first CTE computes the cap from the in-experiment data: quantile(upperPercentile)(value) per arm. Computing the cap from in-experiment data (not historical data) is the standard practice; it avoids leaking information across the experiment's start / stop boundary.
- The main aggregation applies least(value, cap) before summing or averaging. For a lower cap, greatest(value, lowerCap) runs alongside.
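The two steps can be mirrored outside SQL to make the logic easier to follow. The Python sketch below is an illustration only: the row shape `(arm, value)`, the function name `winsorized_means_by_arm`, and the exact percentile estimator are assumptions, and the real computation runs in SQL as described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed estimator)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def winsorized_means_by_arm(
    rows: Iterable[Tuple[str, float]],
    upper_pct: float = 99.0,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (cap per arm, winsorized mean per arm)."""
    by_arm: Dict[str, List[float]] = defaultdict(list)
    for arm, value in rows:
        by_arm[arm].append(value)

    # Step 1 (the first CTE): one cap per arm, from in-experiment data only.
    caps = {arm: percentile(vals, upper_pct) for arm, vals in by_arm.items()}

    # Step 2 (the main aggregation): the equivalent of least(value, cap),
    # applied to each value before averaging.
    means = {
        arm: sum(min(v, caps[arm]) for v in vals) / len(vals)
        for arm, vals in by_arm.items()
    }
    return caps, means
```

Keeping the cap computation separate from the aggregation is what lets the same capped values feed sums, means, and variances consistently.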
For metric types with continuous values (ratio numerators, composite components), the cap is computed and applied per component before any CUPED variance reduction. Order of operations:
raw values → winsorize → CUPED → variance / lift / CI
This ordering is industry-standard and matches how Eppo and Statsig sequence the same transforms.
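A rough sketch of that ordering follows, using the textbook CUPED adjustment (theta = cov(X, Y) / var(X), with X a pre-experiment covariate). The function names are hypothetical, and leaving the covariate uncapped is an assumption of this sketch; the source does not say how A vs B treats the covariate.

```python
from statistics import mean, variance
from typing import List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed estimator)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def winsorize_upper(values: Sequence[float], upper_pct: float) -> List[float]:
    """Clamp values above the chosen upper percentile."""
    cap = percentile(values, upper_pct)
    return [min(float(v), cap) for v in values]


def cuped_adjust(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Textbook CUPED: y_adj = y - theta * (x - mean(x))."""
    mx, my = mean(x), mean(y)
    var_x = sum((xi - mx) ** 2 for xi in x)
    if var_x == 0:
        return list(y)                 # constant covariate: nothing to remove
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    theta = cov_xy / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


def analyse(
    y_raw: Sequence[float],
    x_pre: Sequence[float],
    upper_pct: float = 99.0,
) -> Tuple[float, float]:
    """raw values -> winsorize -> CUPED -> (mean, sample variance)."""
    y_capped = winsorize_upper(y_raw, upper_pct)   # step 1: cap the outcome
    y_adj = cuped_adjust(y_capped, x_pre)          # step 2: CUPED on capped values
    return mean(y_adj), variance(y_adj)            # step 3: inputs to lift / CI
```

Because theta is estimated from a covariance, a single uncapped outlier can distort it; capping first is what keeps the adjustment stable.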
Live preview against past data
When you toggle winsorization on in the metric editor, A vs B fetches a 30-day summary from the metric's historical event stream and shows:
- The cap value that would have applied at the chosen percentile
- The number of visitors whose values would have been capped
- The reduction in mean and variance the cap would have produced
This makes the impact concrete before the metric ships. If the preview shows that p99 only caps 4 visitors and reduces variance by 0.2%, winsorization probably isn't worth enabling. If it shows that p95 caps 80 visitors and reduces variance by 18%, you're likely getting real precision out of it.
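The three preview figures are simple to reproduce on an export of your own data. The sketch below is an assumption-laden illustration, not the product's code: `preview` and its output keys are hypothetical names, and the percentile estimator and use of sample variance are choices made here.

```python
from statistics import mean, variance
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 100] (assumed estimator)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def preview(values: Sequence[float], upper_pct: float = 99.0) -> Dict[str, float]:
    """Summarise what an upper cap at upper_pct would change (needs >= 2 values)."""
    raw = [float(v) for v in values]
    cap = percentile(raw, upper_pct)
    capped = [min(v, cap) for v in raw]

    raw_mean, new_mean = mean(raw), mean(capped)
    raw_var, new_var = variance(raw), variance(capped)

    def pct_drop(before: float, after: float) -> float:
        # Guard against dividing by zero on degenerate data.
        return 0.0 if before == 0 else (before - after) / before * 100.0

    return {
        "cap": cap,
        "n_capped": sum(1 for v in raw if v > cap),   # visitors whose value changes
        "mean_before": raw_mean,
        "mean_after": new_mean,
        "mean_reduction_pct": pct_drop(raw_mean, new_mean),
        "variance_reduction_pct": pct_drop(raw_var, new_var),
    }
```

Running this at several percentiles (for example 95, 97.5, and 99) gives a quick view of the trade-off between variance reduction and mean shift.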
Sealed-plan amendments
Toggling winsorization on or off, or changing the cap percentile, on a metric referenced by a sealed analysis plan is recognised as an amendment-worthy change. The amendment flow blocks the save until a rationale is written; the amendment is logged; results recompute using the new definition from the amendment date forward, with a visible boundary in the timeseries chart.
FAQ
What percentile should I use?
p99 is the most common default: it caps the top 1% of values. p95 is more aggressive and is sometimes the right choice for very long-tailed metrics. Below p95, you're changing too much of the distribution and the resulting mean stops being a faithful summary of the data. The live preview is the best way to choose: pick the percentile that materially reduces variance without changing the mean by more than a few percent.
Does winsorization interact with CUPED?
Yes. Winsorization runs first, then CUPED applies to the winsorized values. This means CUPED's variance-reduction works on the cleaned distribution rather than the raw one, which is more effective in practice and is the industry-standard ordering.
Can I change winsorization on a running experiment?
Yes, but if the metric is in a sealed plan, the change triggers the amendment flow, which requires a written rationale. The change applies prospectively: results recompute from the amendment date forward, with a visible boundary on the timeseries chart.
When should I use the lower cap?
Rarely. The lower cap is for metrics with a negative tail — e.g., a net-revenue metric where refunds push some visitor values below zero. Most experimentation metrics are non-negative and only need the upper cap.