Choosing a Stats Engine
A vs B ships three statistical engines. They answer the same underlying question — "is this variation better than Control?" — in different ways. This page is a plain-English comparison to help you pick the right one.
The three engines at a glance
- Bayesian (default) — reports the probability each variation beats Control, plus a credible interval. Forgiving of mid-experiment peeks, easy for non-statisticians to read. Answers: "how likely is it that Variant B is better?"
- Frequentist — reports a p-value and 95% confidence interval. Declares significance only at the pre-declared sample size. Requires commitment to an experiment duration. Answers: "is the difference statistically significant at α = 0.05?"
- Sequential — always-valid inference. Peek as often as you like, stop as soon as the evidence is in, without inflating the false-positive rate. Slightly wider intervals than Frequentist in exchange for the flexibility. Answers the same question as Frequentist, but safely at any point.
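To make the difference between the Bayesian and Frequentist outputs concrete, here is a minimal Python sketch. It is an illustration, not A vs B's actual implementation. It assumes a conversion-rate metric, a uniform Beta(1, 1) prior, and a pooled two-proportion z-test with a Wald interval; the function names `prob_to_beat_control` and `two_proportion_test` are hypothetical.

```python
import random
from statistics import NormalDist


def prob_to_beat_control(conv_c, n_c, conv_v, n_v, draws=20000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate).

    Assumes independent Beta(1, 1) priors on each conversion rate
    (an assumption for this sketch, not the product's documented prior).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible rate for each arm from its Beta posterior.
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_v = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        wins += p_v > p_c
    return wins / draws


def two_proportion_test(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-sided z-test p-value and Wald interval for the rate difference."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    # Pooled standard error for the test statistic under "no difference".
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = (pooled * (1 - pooled) * (1 / n_c + 1 / n_v)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled (Wald) standard error for the confidence interval.
    se_wald = (p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v) ** 0.5
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se_wald
    return p_value, (diff - half_width, diff + half_width)
```

Calling both functions on the same counts, for example `prob_to_beat_control(500, 5000, 560, 5000)` and `two_proportion_test(500, 5000, 560, 5000)`, shows the two framings side by side: one number that reads as a probability, and one p-value with an interval around the difference.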
When Bayesian is the right pick
- You want the result to read as "87% probability B beats Control" rather than "p = 0.018."
- You want the team to be able to peek at the results without worrying about invalidating them.
- Your stakeholders are non-technical and respond better to probability than to p-values.
- Your traffic varies, so pre-committing to a precise sample size is hard.
When Frequentist is the right pick
- A stakeholder or compliance reviewer expects p-values and confidence intervals.
- You're in a regulated industry (pharma, finance, healthcare) where a fixed-α, pre-declared-sample-size trial is required.
- You have predictable traffic and you're willing to plan the experiment up front.
- You want the strictest classical guarantees about false-positive rates.
Frequentist p-values only hold their false-positive guarantee if you look once, at the pre-declared sample size. Peeking every day and stopping as soon as the result looks significant inflates the real false-positive rate from the nominal 5% to roughly 15-20%, and the more often you look, the worse it gets. If you want to peek, pick Sequential.
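The inflation is easy to reproduce with a small simulation. The sketch below is a simplified model, not the product's engine: it assumes an A/A test (no true difference), equal daily batches, and a standardized daily difference that is normally distributed. The function name `false_positive_rates` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rates(looks=10, sims=4000, alpha=0.05, seed=1):
    """Simulate A/A tests and return two false-positive rates:

    (rate when testing once on the final day,
     rate when peeking every day and stopping at the first significant result)
    """
    if looks < 1:
        raise ValueError("looks must be at least 1")
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    final_hits = 0
    peeking_hits = 0
    for _ in range(sims):
        running = 0.0
        crossed = False
        z = 0.0
        for day in range(1, looks + 1):
            # Standardized daily difference under no true effect.
            running += rng.gauss(0.0, 1.0)
            # Cumulative z-statistic after `day` equal-sized batches.
            z = running / day ** 0.5
            if abs(z) > z_crit:
                crossed = True
        final_hits += abs(z) > z_crit  # single look at the planned end
        peeking_hits += crossed        # any daily look was "significant"
    return final_hits / sims, peeking_hits / sims
```

With the default ten daily looks, the first value should stay close to the nominal 5% while the second lands well above it. Raising `looks` pushes the peeking rate higher still.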
When Sequential is the right pick
- You want to be able to check the results every day and stop as soon as the signal is there.
- Traffic is unpredictable and pre-committing to a duration feels wrong.
- You want the classical p-value interpretation but you also want to move fast.
- You're running many experiments in parallel and need to make fast ship/kill decisions.
See the Sequential Engine page for the deeper reference: how always-valid inference works, why intervals are slightly wider than Frequentist, and what the "safe to stop" decision text means in practice.
Side-by-side comparison
| Question | Bayesian | Frequentist | Sequential |
|---|---|---|---|
| Safe to peek? | Yes | No | Yes |
| Requires pre-declared sample size? | Recommended, not required | Yes | Recommended, not required |
| Reports probability-to-beat-control? | Yes | No | No |
| Reports p-value? | No | Yes | Yes (always-valid) |
| Interval style | Credible interval | Confidence interval (Wald) | Confidence sequence |
| Forgives stopping early? | Yes | No | Yes |
Calculator modes
The sample-size calculator has three modes, all engine-aware (see Analysis Defaults for the tab link):
- Fixed-horizon — inputs: baseline rate, minimum detectable effect (MDE), α, power, daily traffic. Output: required sample size per variation and estimated duration in days. Use this to plan a new experiment.
- Power calculator — inputs: sample size, MDE. Output: achieved power at the end of the experiment. Use this to sanity-check how confident you can be in a null result.
- Duration estimator — inputs: daily traffic, MDE. Output: estimated calendar duration. Use this when you care most about the timeline, not the exact sample size.
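As a rough illustration of what the three modes compute for a conversion-rate metric under the Frequentist engine, here is a minimal Python sketch. It is not the calculator's actual code. It assumes a two-sided test, a relative MDE, the standard normal-approximation formula for two proportions, and daily traffic split evenly across variations; all function names are hypothetical. The power function also takes a baseline rate, which the formula needs in this sketch.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.8):
    """Fixed-horizon sample size per variation for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # assumes MDE is relative, not absolute
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


def achieved_power(baseline, relative_mde, n_per_variation, alpha=0.05):
    """Approximate power to detect the MDE at a given sample size."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


def estimated_days(n_per_variation, daily_traffic, variations=2):
    """Calendar days needed, assuming all daily traffic enters the
    experiment and is split evenly across variations."""
    return math.ceil(n_per_variation * variations / daily_traffic)
```

For example, `sample_size_per_variation(0.10, 0.10)` sizes an experiment for a 10% baseline and a 10% relative lift, and passing that result to `achieved_power` should return a value close to the 80% target, which is a useful consistency check on the two formulas.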
In Bayesian mode the calculator adds a prior input and reports duration to a target probability-to-beat-control. In Sequential mode it warns that the reported power is a lower bound (Sequential stopping can end the experiment earlier than the fixed-horizon calculation suggests).
At the top of the calculator you pick the measure: Conversion rate uses the classic sample-size formula; Rate uses the delta method; Percentile uses a bootstrap simulation against a historical sample; Composite uses a weighted-variance combination of components. Each measure renders the inputs it actually needs — the calculator never asks you for a baseline rate when you're sizing a percentile.
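For readers unfamiliar with the delta method, the sketch below shows the textbook version for a ratio of two means (for example, clicks per session measured per user). This is an assumption about what a Rate measure looks like, not a description of the calculator's internals, and the function name `ratio_with_delta_variance` is hypothetical.

```python
from statistics import mean


def ratio_with_delta_variance(numerators, denominators):
    """Return (ratio of means, delta-method variance of that ratio).

    Each position holds one unit's numerator and denominator,
    e.g. one user's clicks and sessions.
    """
    n = len(numerators)
    if n != len(denominators) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 units")
    mx, my = mean(numerators), mean(denominators)
    # Sample variances and covariance (n - 1 in the denominator).
    var_x = sum((x - mx) ** 2 for x in numerators) / (n - 1)
    var_y = sum((y - my) ** 2 for y in denominators) / (n - 1)
    cov_xy = sum(
        (x - mx) * (y - my) for x, y in zip(numerators, denominators)
    ) / (n - 1)
    ratio = mx / my
    # First-order Taylor expansion of mean(x) / mean(y).
    variance = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * my ** 2)
    return ratio, variance
```

The covariance term is what separates this from treating numerator and denominator as independent: when the two move together, the ratio is more stable than either component alone, so the required sample size shrinks.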
Can I change the engine mid-flight?
No. Once an experiment or flag rule is running, the engine is locked. Swapping engines during a live experiment is a classic p-hacking hazard — the engine that happens to look best becomes tempting. A vs B blocks it by design.
The Results page does include an Explore-under dropdown that lets you re-render a past or present experiment under a different engine for exploratory analysis. The official result (the one in reports, exports, and audit logs) stays locked to the engine you chose at launch. See Comparing engines for how Explore-under and Compare-engines work.