Statistical Methodology

A vs B uses a Bayesian statistical approach to analyze experiment results. You will never see a p-value or a significance threshold to wrestle with. Instead, you get a single, intuitive number: the probability that one variation is genuinely better than another. Here is what that means and why it matters.

The core idea

Traditional A/B testing uses something called frequentist statistics, which produces a p-value. A p-value answers the question: "If there were no real difference between the variations, how likely would it be to see results this extreme just by chance?" A p-value below 0.05 (5%) is typically declared "significant," but this framing is counterintuitive and frequently misunderstood.

Bayesian statistics answers a simpler, more useful question: "Given the data I have collected, how confident am I that Variation B is better than the Control?" That answer is a probability — for example, 92% — which is much easier to act on.

An analogy: flipping a coin

Imagine you suspect a coin is biased toward heads. You flip it 20 times and get 13 heads. Is the coin biased?

A frequentist would say: "There is a 13% chance of seeing 13 or more heads if the coin were fair, so we cannot reject the null hypothesis at p=0.05."

A Bayesian would say: "Based on your 20 flips, and starting with no assumption about the coin, there is about a 91% probability the coin favors heads."

The Bayesian answer is more useful because it directly answers the question you actually care about: is the coin biased? The A vs B Results page gives you the Bayesian answer.
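Both statements in the coin example can be checked with a few lines of Python. This is a minimal sketch, not A vs B's implementation: the function names are hypothetical, and the Bayesian figure assumes a flat Beta(1, 1) prior, so a different prior would give a different number.

```python
from math import comb


def tail_prob_fair_coin(flips: int, heads: int) -> float:
    """P(at least `heads` heads in `flips` tosses of a fair coin)."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


def prob_biased_to_heads(flips: int, heads: int) -> float:
    """P(p > 0.5 | data), assuming a flat Beta(1, 1) prior.

    The posterior is Beta(a, b) with a = heads + 1 and b = tails + 1.
    For integer a and b, its CDF at x equals P(Binomial(a + b - 1, x) >= a),
    which gives an exact answer without numerical integration.
    """
    a, b = heads + 1, flips - heads + 1
    n = a + b - 1
    cdf_at_half = sum(comb(n, j) for j in range(a, n + 1)) / 2 ** n
    return 1.0 - cdf_at_half


print(f"frequentist tail probability: {tail_prob_fair_coin(20, 13):.3f}")
print(f"posterior P(coin favors heads): {prob_biased_to_heads(20, 13):.3f}")
```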

Winning probability

The probability to beat control shown for each variation is the Bayesian posterior probability that this variation has a higher true conversion rate than the control. A vs B computes this using a Beta-Binomial model, which is a standard Bayesian approach for conversion rate experiments.

As more data arrives, this probability updates continuously. With very little data, it will hover near 50% (no evidence either way). As the experiment runs and one variation consistently converts more, the probability will move higher — toward 90%, 95%, 99%.
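One common way to compute a probability to beat control under a Beta-Binomial model is to sample from each variation's posterior and count how often the variation comes out ahead. The sketch below assumes a flat Beta(1, 1) prior and Monte Carlo sampling; the function name, the number of draws, and the seed are illustrative choices, and A vs B may compute the value differently.

```python
import random


def prob_to_beat_control(control_conv, control_n, variant_conv, variant_n,
                         draws=20_000, seed=0):
    """Estimate P(variant rate > control rate) from Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Flat prior: posterior is Beta(1 + conversions, 1 + non-conversions).
        p_control = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        p_variant = rng.betavariate(1 + variant_conv, 1 + variant_n - variant_conv)
        wins += p_variant > p_control
    return wins / draws


# 100/1000 conversions for the control vs 125/1000 for the variation.
print(prob_to_beat_control(100, 1000, 125, 1000))
```

With identical data in both arms the estimate sits near 0.5, and in the example above it comes out well above 0.9.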

Credible intervals

A vs B also shows a credible interval for each variation's conversion rate and improvement. A credible interval is the Bayesian equivalent of a confidence interval, but with a more intuitive interpretation.

A 95% credible interval of [+5%, +15%] means: "There is a 95% probability the true improvement is somewhere between 5% and 15%." You can read it literally. A frequentist confidence interval cannot be read this way (it is a statement about the procedure, not the result), but a Bayesian credible interval can.

Narrower credible intervals mean more certainty. If the interval is wide — say, [‑5%, +25%] — you do not have enough data to know the true effect size.
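A credible interval for the relative improvement can be read straight off posterior samples. The following is a sketch under assumptions: a flat Beta(1, 1) prior, an equal-tailed interval, and a hypothetical function name. It also assumes the control has enough conversions that its sampled rate stays away from zero.

```python
import random


def improvement_credible_interval(control_conv, control_n, variant_conv, variant_n,
                                  level=0.95, draws=20_000, seed=0):
    """Equal-tailed credible interval for (variant - control) / control."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        p_c = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        p_v = rng.betavariate(1 + variant_conv, 1 + variant_n - variant_conv)
        lifts.append((p_v - p_c) / p_c)
    lifts.sort()
    tail = (1 - level) / 2
    lower = lifts[int(tail * (draws - 1))]
    upper = lifts[int((1 - tail) * (draws - 1))]
    return lower, upper


small = improvement_credible_interval(100, 1_000, 125, 1_000)
large = improvement_credible_interval(1_000, 10_000, 1_250, 10_000)
print("1,000 visitors per arm: ", small)
print("10,000 visitors per arm:", large)
```

The second interval is narrower than the first even though the observed rates are the same, which is the "more data, more certainty" effect described above.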

The peeking problem — solved

Traditional frequentist A/B tests have a "peeking problem." If you check your p-value every day and stop the moment it drops below 0.05, you will declare false winners far more often than 5% of the time, because every extra look is another chance for random noise to cross the threshold. Fixed-horizon frequentist tests are only valid if you commit to a sample size up front and check once at the end.

Bayesian methods do not have this problem in the same way. The posterior probability is updated with each new data point, and you can look at it any time without invalidating the statistical model. This is one of the main practical advantages of Bayesian A/B testing — you get live results that you can act on, rather than a locked hypothesis you cannot evaluate until a fixed end date.

While Bayesian testing does not suffer from the strict peeking problem, acting on very early results with small sample sizes is still risky. With only 50 visitors, a 90% winning probability can easily flip as more data arrives. A vs B recommends waiting until you have at least a few hundred visitors per variation before making a final decision.
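The inflation caused by repeatedly checking a fixed-horizon test can be illustrated with an A/A simulation, where both arms share the same true rate and every declared winner is a false positive. This is a sketch, not A vs B's code: the function names, the pooled two-proportion z-test, and the parameters (10 looks, 200 visitors per arm per look, a 10% base rate) are assumptions chosen for illustration.

```python
import random
from math import sqrt


def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided pooled z-test for two proportions, n visitors per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa(experiments=300, looks=10, visitors_per_look=200,
                rate=0.10, seed=0):
    """Return (false-positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        any_look_significant = False
        last_look_significant = False
        for _ in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            last_look_significant = z_significant(conv_a, conv_b, n)
            any_look_significant = any_look_significant or last_look_significant
        peeking_hits += any_look_significant
        final_hits += last_look_significant
    return peeking_hits / experiments, final_hits / experiments


peeking_rate, fixed_rate = simulate_aa()
print(f"stop at first significant look: {peeking_rate:.2%}")
print(f"single check at the end:        {fixed_rate:.2%}")
```

The first rate can never be lower than the second, since the final check is itself one of the looks.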

What about the prior?

Bayesian statistics requires a starting assumption called a prior: an initial belief about the conversion rate before you collect any data. A vs B uses a non-informative prior (a flat Beta(1, 1) distribution), which means the starting assumption is "we know nothing." The results are driven entirely by the data you collect, with no thumb on the scale.

This makes the approach behave very similarly to frequentist testing in terms of where results end up, but with the more intuitive probability interpretation and without the peeking restriction.
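How little a flat prior contributes is easy to see from the posterior mean. In this sketch the function name and parameters are hypothetical; the update rule itself is the standard one for a Beta prior with binomial data.

```python
def posterior_mean(conversions: int, visitors: int,
                   prior_alpha: float = 1.0, prior_beta: float = 1.0) -> float:
    """Mean of the Beta posterior after observing the data.

    A Beta(alpha, beta) prior plus `conversions` successes out of `visitors`
    gives a Beta(alpha + conversions, beta + visitors - conversions) posterior.
    """
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + visitors)


# Same 10% observed rate at three sample sizes.
for visitors in (10, 100, 10_000):
    conversions = visitors // 10
    print(visitors, round(posterior_mean(conversions, visitors), 4))
```

The flat prior acts like one imaginary conversion and one imaginary non-conversion, so its pull toward 50% is visible at 10 visitors and negligible at 10,000.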

The delta method (ratio metrics)

Ratio metrics (revenue per visitor, items per order, refund rate) divide one random variable by another. Because the numerator and denominator both vary and are usually correlated, a standard t-test that treats the ratio as a simple mean produces incorrect confidence intervals. A vs B uses the delta method, a first-order Taylor expansion that approximates the variance of X/Y from the means, variances, and covariance of X and Y:

Var(X/Y) ≈ (μ_X / μ_Y)² · [ Var(X)/μ_X² − 2·Cov(X,Y)/(μ_X·μ_Y) + Var(Y)/μ_Y² ]

This is the standard variance estimator across the industry (Eppo, Statsig, GrowthBook, LaunchDarkly). For the frequentist engine the delta-method variance feeds a t-test on the ratio; for the Bayesian engine it feeds a normal-approximation posterior; for the sequential engine it feeds an always-valid bound that is wider than the fixed-horizon CI by design. See Ratio metrics for the user-facing explanation.
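The formula above translates term by term into code. The sketch below assumes X and Y are the sample means of per-visitor numerator and denominator values, so their variances and covariance are the per-visitor quantities divided by n, and it assumes both means are nonzero. The function name and input shape are illustrative, not A vs B's API.

```python
from math import sqrt
from statistics import fmean


def ratio_with_delta_se(numerators, denominators):
    """Return (ratio of means, delta-method standard error)."""
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    mean_x, mean_y = fmean(numerators), fmean(denominators)
    var_x = sum((x - mean_x) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    ratio = mean_x / mean_y
    # Bracketed term of the formula; dividing by n converts per-visitor
    # variances and covariance into those of the sample means.
    bracket = (var_x / mean_x ** 2
               - 2 * cov_xy / (mean_x * mean_y)
               + var_y / mean_y ** 2) / n
    variance = ratio ** 2 * bracket
    return ratio, sqrt(max(variance, 0.0))  # guard against tiny negative rounding


# Revenue and order counts for five hypothetical visitors.
print(ratio_with_delta_se([20.0, 0.0, 35.0, 50.0, 10.0], [1, 1, 2, 2, 1]))
```

The covariance term is what a naive calculation misses: when numerator and denominator move together, it shrinks the variance, and in the extreme case where one is an exact multiple of the other the ratio has no variance at all.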

Bootstrap CIs (quantile metrics)

Quantile metrics (p50, p90, p95, or p99 of a continuous value) do not have a closed-form sampling distribution. A vs B uses a bias-corrected percentile bootstrap: it resamples the visitor values with replacement (1000 resamples by default), recomputes the quantile on each resample, and reports percentiles of the resampled quantile distribution as the 95% CI. Those percentiles start at the 2.5th and 97.5th and are shifted by the bias-correction step, which adjusts for skew in the bootstrap distribution; the shift matters most at extreme percentiles and small samples.

For very large experiments (N > 100k per variant), the bootstrap falls back to a t-digest-sampled subset returned by ClickHouse rather than the raw rows; the results row notes "approximate (sampled bootstrap)". A high-precision recompute (5000 resamples) is available on demand from the results page. See Quantile metrics.
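The bootstrap procedure described in this section can be sketched in pure Python. This follows the textbook bias-corrected (BC) percentile method and is an assumption about the general technique, not a copy of A vs B's engine: the function names, the linear-interpolation quantile, and the half-weight handling of ties are illustrative choices.

```python
import random
from statistics import NormalDist


def quantile(values, q):
    """Linearly interpolated quantile of `values` at level q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def bc_bootstrap_ci(values, q, resamples=1000, level=0.95, seed=0):
    """Return (point estimate, CI lower bound, CI upper bound)."""
    rng = random.Random(seed)
    normal = NormalDist()
    estimate = quantile(values, q)
    n = len(values)
    boot = [quantile(rng.choices(values, k=n), q) for _ in range(resamples)]
    # Bias correction: locate the point estimate within the bootstrap
    # distribution (ties count half), then convert that share to a z-score.
    below = sum(b < estimate for b in boot) + 0.5 * sum(b == estimate for b in boot)
    share = min(max(below / resamples, 1 / (resamples + 1)),
                resamples / (resamples + 1))  # keep inv_cdf away from 0 and 1
    z0 = normal.inv_cdf(share)
    z = normal.inv_cdf(1 - (1 - level) / 2)
    # With z0 == 0 these levels reduce to the plain 2.5% and 97.5% percentiles.
    lower_level = normal.cdf(2 * z0 - z)
    upper_level = normal.cdf(2 * z0 + z)
    return estimate, quantile(boot, lower_level), quantile(boot, upper_level)


# Skewed synthetic data standing in for a per-visitor metric.
data_rng = random.Random(1)
sample = [data_rng.lognormvariate(0, 1) for _ in range(300)]
print(bc_bootstrap_ci(sample, 0.5))
```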

Weighted-sum statistics (composite metrics)

Composite metrics combine multiple component metrics into a single weighted decision signal. The point estimate is the weighted sum of the component point estimates; the variance combines the per-component variances with pairwise covariance so correlated components don't double-count their contribution:

Var(composite) = Σᵢ wᵢ² · Var(Xᵢ) + 2 · ΣᵢΣⱼ>ᵢ wᵢ · wⱼ · Cov(Xᵢ, Xⱼ)

Pairwise covariance between components is computed once per result run from co-observed visitor records and reused across the three engines: weighted t-test for frequentist, weighted Normal posterior for Bayesian, and weighted always-valid bound for sequential. When the co-observation count for a component pair falls below a threshold (default 30), the engine falls back to an upper-bound variance estimate and the row surfaces "conservative CI". It is better to widen the interval than to claim precision the data doesn't support. See Composite metrics.
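The composite variance formula can be written as a short function. This is a sketch with hypothetical names. The source does not say which upper bound the fallback uses, so the Cauchy-Schwarz bound below (|Cov(Xi, Xj)| is at most the product of the two standard deviations) is an assumption made for illustration.

```python
from math import sqrt


def composite_variance(weights, variances, covariances):
    """Variance of a weighted sum of component metrics.

    `covariances` maps an index pair (i, j) with i < j to Cov(X_i, X_j).
    A missing pair is treated as "not enough co-observed data" and replaced
    by the largest contribution it could possibly make (assumed fallback).
    """
    total = sum(w * w * v for w, v in zip(weights, variances))
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            pair_weight = weights[i] * weights[j]
            if (i, j) in covariances:
                total += 2 * pair_weight * covariances[(i, j)]
            else:
                # Cauchy-Schwarz bound, signed so the term can only add variance.
                total += 2 * abs(pair_weight) * sqrt(variances[i] * variances[j])
    return total


print(composite_variance([0.5, 0.5], [4.0, 4.0], {(0, 1): 2.0}))  # measured covariance
print(composite_variance([0.5, 0.5], [4.0, 4.0], {}))             # conservative fallback
```

The fallback result is never smaller than the result with a measured covariance, which is what makes the resulting interval conservative.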