
Frequentist Engine

The Frequentist engine is the classical approach to A/B test statistics: a p-value, a 95% confidence interval, and a plain yes-or-no answer on statistical significance. It is the format most stakeholders and compliance reviewers expect to see.

What Frequentist analysis actually does

Frequentist analysis asks a single question: if there were really no difference between the variations, how likely is it we'd see a result at least as extreme as this? That probability is the p-value. A small p-value says the observed difference would be surprising under the null hypothesis of no effect, so we reject the null and call the result significant.

In A vs B, the Frequentist engine uses:

  • A two-proportion z-test for binary metrics (conversion rates). The test statistic uses a standard error pooled with Control under the null; the reported intervals are unpooled Wald intervals.
  • A Welch two-sample z-test for continuous metrics (revenue, pageviews, time on page). It uses unpooled variance and normal-approximation p-values.
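
As a rough illustration of the two tests above, here is a minimal Python sketch. The function names and the two-sided alternative are assumptions made for this example; it is not the engine's actual code.

```python
import math


def normal_two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def two_proportion_z_test(conv_c: int, n_c: int, conv_v: int, n_v: int):
    """Pooled two-proportion z-test of a variation against Control."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    # Under the null both arms share one rate, so the standard error is pooled.
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    return z, normal_two_sided_p(z)


def welch_z_test(mean_c: float, var_c: float, n_c: int,
                 mean_v: float, var_v: float, n_v: int):
    """Welch-style z-test: unpooled variances, normal-approximation p-value."""
    se = math.sqrt(var_c / n_c + var_v / n_v)
    z = (mean_v - mean_c) / se
    return z, normal_two_sided_p(z)


# Hypothetical numbers: 500/10,000 conversions on Control, 560/10,000 on the variation.
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With large samples the normal approximation used here is usually close to the t-based version of Welch's test.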

Reading a p-value

By default, A vs B calls a result significant when p < 0.05. That is the significance level, also written as α = 0.05. You can change this default per project in the Analysis section of Project Settings, and override it per experiment in the experiment builder (both shipped in previous releases; see Analysis Defaults).

A few common values in plain English:

  • p = 0.03 → if there were really no difference, there would be a 3% chance of seeing a difference at least this large by pure luck. Significant at α = 0.05.
  • p = 0.12 → a 12% chance under the null. Not significant at α = 0.05.
  • p < 0.001 → extremely unlikely under the null. Very strong evidence against it.

Reading the 95% confidence interval

Alongside every p-value, A vs B reports a 95% confidence interval for the variation's observed rate (or mean). A 95% CI is the range of values the true underlying rate is likely to fall in, interpreted in the frequentist sense: if we repeated the experiment many times, roughly 95% of the intervals would contain the true value.

On the Results page, the Confidence Interval column shows the interval as a horizontal bar. The wider the bar, the less certain the estimate. As more visitors are exposed, the interval narrows.
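
To make the narrowing concrete, the sketch below computes a Wald interval for a conversion rate at three sample sizes. The function name and the 5% rate are hypothetical, chosen only for illustration.

```python
import math

Z_95 = 1.959964  # standard-normal quantile for a two-sided 95% interval


def wald_interval(conversions: int, visitors: int, z: float = Z_95):
    """Unpooled Wald interval for a single variation's conversion rate."""
    rate = conversions / visitors
    se = math.sqrt(rate * (1 - rate) / visitors)
    return rate - z * se, rate + z * se


# Same observed 5% rate, increasing traffic: the interval gets narrower.
for n in (1_000, 10_000, 100_000):
    low, high = wald_interval(n // 20, n)
    print(f"n = {n:>7,}: [{low:.4f}, {high:.4f}]  width = {high - low:.4f}")
```

The width shrinks with the square root of the sample size, so ten times the traffic narrows the bar by a factor of roughly 3.2.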

Multiple variations and corrections

When an experiment has more than two variations — say, Control plus Variant A and Variant B — we run more than one test. Each additional comparison raises the chance of a false positive. A vs B corrects for this automatically using the multiple-comparison correction (MCC) method set on the experiment or project:

  • Bonferroni (default) — simple, conservative. Multiplies each p-value by the number of tests.
  • Holm-Bonferroni — slightly less conservative than Bonferroni.
  • Benjamini-Hochberg — controls the false-discovery rate; best pick when you have many metrics or variations.
  • None — reports raw p-values without correction. Use only if you know what you're doing.

The p-values shown on the Results page are the corrected values; the significance badge reflects the correction.
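
For reference, the three corrections can be sketched as adjusted p-values in the textbook way. This is an illustrative Python sketch with assumed function names, not a description of how A vs B implements them; each function returns values in the same order as its input, capped at 1.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm-Bonferroni step-down adjustment."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment (false-discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)  # descending p
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical raw p-values for three comparisons
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
print("BH:        ", benjamini_hochberg(raw))
```

On the same inputs Bonferroni gives the largest adjusted values and Benjamini-Hochberg the smallest, which matches the ordering from most to least conservative in the list above.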

Why peeking matters

A Frequentist p-value is only valid once, at the pre-declared sample size. If you check the numbers every day and decide to stop when the result looks good, the real false-positive rate goes up — often from 5% to 15-20%. Picking the day that happens to cross the threshold is a form of p-hacking, even if it's unintentional.
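
A small simulation makes the inflation visible. The sketch below runs A/A tests (no true difference) and compares two rules: test once at the end, or stop on the first day the result crosses the threshold. Everything in it is an assumption for illustration: a normally distributed metric with unit variance, 14 daily looks, 500 visitors per arm per day, and a fixed random seed.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05


def simulate_false_positive_rates(n_sims=2000, days=14, visitors_per_day=500, seed=7):
    """Return (fixed-horizon rate, peek-every-day rate) for A/A experiments."""
    rng = random.Random(seed)
    # The sum of n independent unit-variance observations has sd sqrt(n),
    # so one draw per arm per day stands in for a full day of visitors.
    daily_sd = math.sqrt(visitors_per_day)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        sum_c = sum_v = 0.0
        crossed_any_day = False
        z = 0.0
        for day in range(1, days + 1):
            sum_c += rng.gauss(0.0, daily_sd)
            sum_v += rng.gauss(0.0, daily_sd)
            n = day * visitors_per_day  # cumulative visitors per arm
            z = (sum_v / n - sum_c / n) / math.sqrt(2.0 / n)
            if abs(z) > Z_CRIT:
                crossed_any_day = True
        fixed_hits += abs(z) > Z_CRIT   # only the final look counts
        peeking_hits += crossed_any_day  # any daily look counts
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_false_positive_rates()
print(f"test once at the end: {fixed_rate:.3f}")
print(f"stop at first 'significant' day: {peeking_rate:.3f}")
```

The single-look rate stays near the nominal 5%, while the stop-when-it-looks-good rule produces several times as many false positives.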

Stick to your pre-declared sample size

Use the sample-size calculator to decide how many visitors each variation needs before you launch, and don't call the experiment early just because the p-value dipped below α. A vs B has built-in peek protection that warns you when you try to conclude early; see Early stopping & peek protection.
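
For a sense of the numbers involved, the sketch below applies the standard normal-approximation formula for comparing two proportions. It is not the built-in calculator: the function name, the 80% power default, and the two-sided test are assumptions made for this example.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variation to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical plan: 5% baseline conversion rate, 10% relative lift.
print(visitors_per_variation(0.05, 0.10))
```

Because the required sample grows with the inverse square of the difference, halving the lift you want to detect roughly quadruples the visitors needed.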

If you need to peek safely, pick the Sequential engine instead — it is designed for exactly this.

Peek protection

While a Frequentist experiment is still accumulating sample, A vs B shows a persistent banner at the top of the results page: “Day 5 of 14 · 3,400 of 10,000 visitors · not yet valid for stopping decisions.” The banner disappears the moment you reach your target sample size or your scheduled end date.
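
The banner rule described above can be sketched as a small function. The function and parameter names are hypothetical, and the sketch assumes the start and end dates are both counted as experiment days.

```python
from datetime import date
from typing import Optional


def peek_banner(visitors: int, target_visitors: int,
                start: date, end: date, today: date) -> Optional[str]:
    """Return the banner text, or None once the experiment may be concluded."""
    # The banner goes away at the target sample size or the scheduled end date.
    if visitors >= target_visitors or today >= end:
        return None
    day = (today - start).days + 1
    total_days = (end - start).days + 1
    return (f"Day {day} of {total_days} · {visitors:,} of {target_visitors:,} visitors"
            " · not yet valid for stopping decisions.")


print(peek_banner(3_400, 10_000, date(2024, 1, 1), date(2024, 1, 14), date(2024, 1, 5)))
```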

If you try to pause, stop, or declare a winner on a Frequentist experiment before it has reached its target, a modal interrupts the action and offers three choices:

  • Let it run — dismiss the modal and keep collecting visitors.
  • Stop anyway and log — proceed with the action and stamp the experiment record with an audit-visible note: “Early-stopped under Frequentist · validity reduced.” The override is also written into the audit log and into any CSV export of the experiment.
  • Switch future experiments to Sequential — flips your project default. The current experiment continues unchanged; new experiments inherit the Sequential engine. (This option is disabled until Sequential ships.)

See Early stopping & peek protection for the full reasoning and what the audit stamp looks like in practice.

When to pick Frequentist over Bayesian

  • A stakeholder wants p-values. Many product, marketing, and research teams speak in p-values. If your org already reports results this way, Frequentist is the simplest fit.
  • Regulatory or compliance reporting. Pharma, finance, and healthcare workflows often require a fixed-α, pre-declared-sample-size trial design. Frequentist maps directly onto that.
  • You can commit to a sample size up front. If your traffic is predictable and you're willing to plan the experiment length before launch, Frequentist is efficient and interpretable.

Where to configure

Set the engine in one of three places:

  • Project level. Project Settings → Analysis sets the default for every new experiment and feature-flag A/B test rule in the project.
  • Per experiment. In the experiment builder's Analysis Overrides card, before launch. Once the experiment is running, the engine is locked, because mixing engines mid-flight would invalidate the result.
  • Per flag rule. In the feature-flag A/B Test rule editor's Analysis section, before the rule is enabled. Same lock-on-launch behaviour as experiments.