Stats Engines release — April 2026

A vs B's biggest statistical update yet. Three engines, full per-experiment configurability, retrospective and side-by-side comparison views, pre-registered analysis plans, layered peek protection, and a handful of polish features (ROPE, A/A validation, engine-aware health score). This page summarises what shipped and links into the deeper reference docs for each capability.

This release affects anyone running A/B tests in A vs B, whether through the Web Experimentation builder or feature-flag A/B Test rules. The new defaults preserve previous behaviour (Bayesian engine, Auto variance reduction), so no action is required to adopt it.

Three engines, fully configurable

A vs B used to ship a single Bayesian engine. This release adds two more; all three are selectable per project, per experiment, and per feature-flag rule:

Bayesian (default, unchanged): reports probability-to-beat-control and credible intervals. Forgiving of mid-experiment peeks.
Frequentist: classical p-values and confidence intervals. Declares significance at the alpha level you set, after the pre-declared sample size is reached.
Sequential: always-valid inference (asymptotic confidence sequences, the AsympCS method Netflix runs in production). Peek as often as you like, stop early when the evidence is in.
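
To make the difference between the first two engines concrete, here is a minimal Python sketch. It is not A vs B's implementation: the function names, the pooled two-proportion z-test, the Beta(1, 1) priors, and the conversion counts are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def frequentist_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def chance_to_beat_control(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate with a Beta(1, 1) prior is
        # Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: control 500/10,000, variation 560/10,000.
p_value = frequentist_p_value(500, 10_000, 560, 10_000)
chance = chance_to_beat_control(500, 10_000, 560, 10_000)
print(f"Frequentist p-value: {p_value:.3f}")
print(f"Bayesian chance-to-beat: {chance:.1%}")
```

With these made-up counts the p-value lands a little above 0.05 while the chance-to-beat is above 95%, so the same data can clear one engine's default threshold and miss another's.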

Per-experiment statistical configurability

Every analysis option that previously lived only at the project level can now be overridden per experiment (and per flag rule) before launch:

Custom alpha / confidence / Bayesian threshold: one numeric field that reads as 95% confidence, α = 0.05, or 95% chance-to-beat depending on the engine you picked.
Multiple-comparison correction (MCC) for Frequentist experiments with more than two variations: Bonferroni (default), Holm-Bonferroni, Benjamini-Hochberg (FDR), Tiered Bonferroni (no correction on the primary metric, Bonferroni on secondaries), or None.
ROPE (Region of Practical Equivalence) for Bayesian experiments: declare what counts as a meaningful difference, not just any difference. Optional per-experiment field.
Variance reduction (CUPED): unchanged, but now overridable per experiment alongside the engine.
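
For readers who want to see how the three standard MCC options differ, the sketch below computes adjusted p-values with the textbook Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg procedures. The function names and the raw p-values are hypothetical, and nothing here describes how A vs B implements the corrections internally; Tiered Bonferroni is left out because its exact rules are not spelled out on this page.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down Holm-Bonferroni adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is scaled by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(p_values):
    """Step-up Benjamini-Hochberg (FDR) adjusted p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per variation-vs-control comparison.
raw = [0.01, 0.04, 0.03]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
```

Comparing each adjusted value against alpha shows the usual ordering: Bonferroni is the most conservative, Holm is never more conservative than Bonferroni, and Benjamini-Hochberg trades family-wise error control for false-discovery-rate control.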

Sample-size calculator — three modes

The sample-size calculator now ships in three engine-aware modes:

Fixed-horizon: baseline, MDE, alpha, power, daily traffic → required sample size and estimated duration.
Power calculator: sample size, MDE → achieved power. Useful for sanity-checking how confident you can be in a null result.
Duration estimator: daily traffic, MDE → estimated calendar duration to hit the target.
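
The arithmetic behind the three modes can be sketched with the standard normal-approximation formula for comparing two conversion rates. This is an illustration under stated assumptions, not the calculator's actual code: the function names are hypothetical, the MDE is treated as a relative lift, and the test is assumed to be two-sided.

```python
from math import ceil, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.8):
    """Fixed-horizon mode: visitors needed in each variation."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _NORMAL.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def achieved_power(n_per_variation, baseline, relative_mde, alpha=0.05):
    """Power mode: chance of detecting the MDE with a given sample size."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_alpha = _NORMAL.inv_cdf(1 - alpha / 2)
    # Ignores the negligible probability of rejecting in the wrong direction.
    return _NORMAL.cdf(abs(p2 - p1) / se - z_alpha)


def duration_days(n_per_variation, variations, daily_traffic):
    """Duration mode: whole days needed to fill every variation."""
    return ceil(n_per_variation * variations / daily_traffic)


# Hypothetical inputs: 5% baseline, 10% relative MDE, 5,000 visitors a day.
n = sample_size_per_variation(baseline=0.05, relative_mde=0.10)
power = achieved_power(n, baseline=0.05, relative_mde=0.10)
days = duration_days(n, variations=2, daily_traffic=5_000)
print(n, round(power, 3), days)
```

Feeding the fixed-horizon result back into the power function returns roughly the power you asked for, which is a quick way to check that the two modes agree.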

Explore-under and Compare-engines views

The Results page has two new affordances for re-rendering the same experiment under a different engine:

Explore under: a dropdown that re-renders the analysis under Bayesian, Frequentist, or Sequential. Clearly labelled as exploratory; the official engine (locked at launch) is unchanged.
Compare engines: a side-by-side view of all three outputs in one pane. Makes disagreement between methods visible.

Both work for any past or present experiment, because A vs B already stores raw per-visitor exposures and events. Read more: Comparing engines.

Pre-registered analysis plans

Before launching an experiment, the creator can declare the primary metric, guardrail metrics, secondary metrics, engine, alpha, MCC method, target sample size, and target duration. Once the experiment launches, the plan is sealed: any mid-flight change creates an amendment record visible in the audit trail. The Results page shows which metrics were pre-registered vs added after launch.

Pre-registration is optional per project: turn on the Require pre-registration toggle in Project Settings to make every new experiment in the project go through the plan before launch. Read more: Analysis Plans.

Peek protection (five layers)

Frequentist p-values only hold their false-positive guarantee if you look once, at the pre-declared sample size. Peeking and stopping early can inflate the real false-positive rate from 5% to 15–20%. The release adds a layered system that intervenes at the decision point, not the viewing point:

Status banner on Frequentist results pages while accumulating: "numbers are still accumulating and not yet valid for stopping decisions."
Blocking modal on ship/conclude actions before the target sample size is reached. It shows the real false-positive rate at the current sample size and offers three options: let it run, stop anyway and log it, or switch future experiments to Sequential.
Override audit stamp: proceeding stamps the experiment record with "early-stopped under Frequentist, statistical validity reduced," visible in the experiment history and exports.
Sequential nudge banner: a dismissable banner on Frequentist results pages: "Want to peek safely? Sequential analysis is designed for it."
Honest-peek overlay: opt-in toggle that overlays the Sequential always-valid interval alongside the Frequentist CI on the primary metric, so the cost of peeking is visible without blocking.

Bayesian and Sequential experiments bypass this system — Bayesian is forgiving of peeks by design, and Sequential's whole point is that peeking is safe. Read more: Early Stopping & Peek Protection.
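
The inflation that peek protection guards against is easy to reproduce. The simulation below is a rough sketch, not part of the product: it runs A/A experiments with no true difference, applies a Frequentist z-test at several equally spaced looks, and counts how often any look crosses the 5% threshold. The function name, the unit-variance normal outcomes, and the look schedule are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05


def false_positive_rate(looks, per_look=200, simulations=4000, seed=1):
    """Share of A/A experiments that cross |z| > Z_CRIT at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(simulations):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(looks):
            # The sum of per_look independent N(0, 1) outcomes is
            # N(0, per_look), so one draw per arm covers a whole batch.
            sum_a += rng.gauss(0.0, sqrt(per_look))
            sum_b += rng.gauss(0.0, sqrt(per_look))
            n += per_look
            z = (sum_b / n - sum_a / n) / sqrt(2 / n)
            if abs(z) > Z_CRIT:
                hits += 1  # the experimenter stops and ships a false win
                break
    return hits / simulations


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"one look: {single_look:.1%}, ten looks: {ten_looks:.1%}")
```

In runs of this kind the single-look rate stays near the nominal 5%, while stopping at the first significant result across ten looks pushes it into the 15–20% range quoted above.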

A/A validation mode

A one-click toggle in the experiment builder splits traffic between two identical empty variations. Expected result: a null lift. The Results page shows an A/A diagnostic banner that surfaces anomalies (large implicit lift, sample-ratio mismatch, an engine signal crossing significance) so you can catch bucketing or attribution bugs before they pollute a real experiment.
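
One of the anomalies an A/A run can surface, sample-ratio mismatch, is commonly checked with a chi-squared goodness-of-fit test. The sketch below shows that check for two variations; the function name, the visitor counts, and the 50/50 expected split are hypothetical, and the alert threshold A vs B applies is not stated here.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-squared goodness-of-fit p-value for a two-variation split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For one degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no stats library is needed.
    return erfc(sqrt(chi2 / 2))


balanced = srm_p_value(5_000, 5_000)   # perfectly even split
skewed = srm_p_value(5_000, 5_300)     # 300 extra visitors in one arm
print(f"balanced: {balanced:.3f}, skewed: {skewed:.4f}")
```

A very small p-value on an A/A test suggests the bucketing itself is uneven, which is worth investigating before trusting any lift measured on the same setup.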

Engine-aware health score

The health score's significance signal now matches the engine the experiment is using: probability thresholds for Bayesian, p-value thresholds for Frequentist, always-valid crossings for Sequential. The data-quality and sample-ratio guardrails remain engine-agnostic (chi-squared, neutral across engines).

What's next (deliberately separate)

Several capabilities sit one logical step beyond the engine axis and ship as their own plans rather than getting bundled here:

Ratio metrics (e.g. clicks per session) — needs delta-method variance handling.
Additional variance-reduction methods beyond CUPED — stratification, ANCOVA, CUPED++.
Heterogeneous treatment effects (HTE) / uplift modelling — who benefits most.
Winsorization / outlier capping — per-metric configuration.

Each is its own plan. Watch this Changelog section for future entries.

Released April 2026.