Results Look Wrong

A vs B uses Bayesian statistics to analyse your experiment data. If you are not familiar with Bayesian methods, some of the numbers on the results page can look surprising or counterintuitive. This page explains the most common concerns people have about their results.

"The conversion rate seems too low"

Conversion rates in A vs B are calculated per unique visitor, not per pageview. The formula is:

Conversion rate = Unique visitors who converted ÷ Total unique visitors in the variation

A visitor who lands on the page three times only counts once in the denominator. A visitor who converts on their second visit is still only counted once as converted, and a visitor who converts several times is also counted once. This means the rate can differ from what pageview-based or event-based analytics report. In particular, tools that count every conversion event will show more conversions than A vs B does.

This is the statistically sound way to measure conversion rates for A/B testing, because it ensures that frequent visitors do not have outsized influence on the results. If your site has visitors who return and convert many times, your per-visitor conversion rate will be lower than a rate that counts every conversion event.
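The per-visitor formula above can be sketched in a few lines of Python. This is an illustration only, not how A vs B stores or processes data: the (visitor_id, converted) event layout and the function name are assumptions made for the example.

```python
from typing import Iterable, Tuple


def conversion_rate(events: Iterable[Tuple[str, bool]]) -> float:
    """Unique visitors who converted / total unique visitors.

    `events` holds one (visitor_id, converted) pair per visit.
    This layout is hypothetical, chosen only for the example.
    """
    visitors = set()
    converters = set()
    for visitor_id, converted in events:
        visitors.add(visitor_id)
        if converted:
            converters.add(visitor_id)
    if not visitors:
        return 0.0
    return len(converters) / len(visitors)


events = [
    ("v1", False), ("v1", True), ("v1", True),  # three visits, two conversions
    ("v2", False),
    ("v3", False), ("v3", False),
    ("v4", True),
]

# Two unique converters (v1, v4) out of four unique visitors.
rate = conversion_rate(events)  # 0.5

# A tool that counts every conversion event would see three conversions
# for the same four visitors.
conversion_events = sum(1 for _, converted in events if converted)
events_per_visitor = conversion_events / 4  # 0.75
```

Visitor v1 appears three times and converts twice, yet contributes exactly one visitor and one conversion to the rate.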

"The results keep changing"

If you check the results every few hours and see the winning probability shifting back and forth, this is completely normal and expected. A vs B uses a Bayesian model that updates continuously as new data arrives. When you have a small number of visitors, a few conversions in one direction can swing the probability significantly. As more data accumulates, the estimate stabilises.

Think of it like an election: early returns from a small number of precincts can point strongly to one candidate, but as more votes are counted, the picture becomes clearer and more stable. The same applies to your experiment results.
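To see why small samples swing so much, here is a minimal sketch of a winning-probability calculation. It assumes a Beta-Binomial model with a uniform Beta(1, 1) prior, estimated by Monte Carlo sampling. The model A vs B actually uses is not described on this page, so treat the function and the visitor counts as illustrative assumptions.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate
    (an assumption for this sketch, not A vs B's documented model).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Small experiment: 50 visitors per variation.
small_before = prob_b_beats_a(3, 50, 5, 50)
small_after = prob_b_beats_a(5, 50, 5, 50)  # control gains two conversions

# Larger experiment: 5,000 visitors per variation.
large_before = prob_b_beats_a(400, 5000, 430, 5000)
large_after = prob_b_beats_a(402, 5000, 430, 5000)  # same two conversions

small_swing = abs(small_before - small_after)
large_swing = abs(large_before - large_after)
print(f"swing at 50 visitors:   {small_swing:.3f}")
print(f"swing at 5000 visitors: {large_swing:.3f}")
```

The same two extra conversions move the probability far more at 50 visitors per variation than at 5,000, which is the stabilising effect described above.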

The right approach is to set a target number of visitors before you run the experiment, run it until you hit that target, and then read the results — rather than checking every few hours and stopping when the results happen to look good.

Looking at results and stopping an experiment early when you see a positive result is called "peeking" and is one of the most common sources of false positives in A/B testing. At any given moment, a result might look good by chance. Only stop an experiment when you planned to stop it, or when you have clearly sufficient data.
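The effect of peeking can be demonstrated with a small simulation of A/A tests, where both variations share the same true conversion rate and any declared winner is a false positive. The sketch below rests on several assumptions that do not come from A vs B: a Beta(1, 1) prior, a normal approximation to the winning probability, a 95% decision threshold, and the traffic figures.

```python
import math
import random


def prob_b_beats_a_approx(conv_a, n_a, conv_b, n_b):
    """Normal approximation to P(rate_B > rate_A) under Beta(1, 1) priors."""
    def mean_var(conv, n):
        alpha, beta = 1 + conv, 1 + n - conv
        total = alpha + beta
        return alpha / total, alpha * beta / (total ** 2 * (total + 1))

    mean_a, var_a = mean_var(conv_a, n_a)
    mean_b, var_b = mean_var(conv_b, n_b)
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def simulate(experiments=300, checks=20, visitors_per_check=100,
             rate=0.10, threshold=0.95, seed=1):
    """Return (peeking false-positive rate, fixed-horizon false-positive rate)."""
    rng = random.Random(seed)
    peeking_wins = 0
    fixed_wins = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed_at_any_check = False
        for _ in range(checks):
            for _ in range(visitors_per_check):
                # Both variations convert at the same true rate.
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            n += visitors_per_check
            p = prob_b_beats_a_approx(conv_a, n, conv_b, n)
            if p > threshold:
                crossed_at_any_check = True
        peeking_wins += crossed_at_any_check  # stop at first good-looking check
        fixed_wins += p > threshold           # read only the final result
    return peeking_wins / experiments, fixed_wins / experiments


peeking_rate, fixed_rate = simulate()
print(f"false positives when peeking:       {peeking_rate:.1%}")
print(f"false positives with a fixed target: {fixed_rate:.1%}")
```

Because the final check is one of the checks a peeker sees, the peeking rate can never be lower than the fixed-target rate in this setup, and the extra checks only add more chances to stop on a fluke.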

"Control is winning — is that bad?"

No. A control winning is a completely valid and valuable result. It means your variant did not improve the metric compared to the original — and now you know that, rather than guessing. This is one of the most important outcomes in A/B testing: learning what does not work.

A control winning should lead you to ask:

Was the hypothesis wrong? (The change I thought would help actually does not.)
Was the implementation right? (Did the variant look and work as intended?)
Was I measuring the right thing? (Is this metric actually linked to the business goal?)

Answer these questions, form a new hypothesis, and design the next experiment. A control winning is learning — not failure.

"Confidence is stuck at 50%"

A probability of around 50% means A vs B has no evidence that either variation is better than the other — both are equally likely to be the winner given the data seen so far. This is the Bayesian starting point before any data is collected, and it persists when there is not enough data to distinguish between the two.

This is not a bug. It means you need more data. The probability will move away from 50% as more visitors are counted and the two conversion rates diverge (if they ever do). If after thousands of visitors the probability stays near 50%, it suggests the true effect size is very small — the two variations may be genuinely equivalent.
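One way to tell "not enough data yet" apart from "genuinely equivalent" is to look at how wide the uncertainty on the difference between the two rates still is. The sketch below is an assumption-laden illustration (Beta(1, 1) priors, normal approximation, hypothetical function name). A vs B may not expose such an interval.

```python
import math


def difference_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% interval for rate_B - rate_A under Beta(1, 1) priors."""
    def mean_var(conv, n):
        alpha, beta = 1 + conv, 1 + n - conv
        total = alpha + beta
        return alpha / total, alpha * beta / (total ** 2 * (total + 1))

    mean_a, var_a = mean_var(conv_a, n_a)
    mean_b, var_b = mean_var(conv_b, n_b)
    diff = mean_b - mean_a
    half_width = z * math.sqrt(var_a + var_b)
    return diff - half_width, diff + half_width


# Identical observed rates, so the winning probability sits at 50% in both cases.
early = difference_interval(5, 100, 5, 100)         # roughly +/- 6.4 percentage points
late = difference_interval(500, 10000, 500, 10000)  # roughly +/- 0.6 percentage points

early_width = early[1] - early[0]
late_width = late[1] - late[0]
print(f"early interval width: {early_width:.4f}")
print(f"late interval width:  {late_width:.4f}")
```

In the early case, differences of several percentage points in either direction are still plausible, so 50% simply means "unknown". In the late case, only very small differences remain plausible, which matches the "genuinely equivalent" reading.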

"The revenue impact shows a huge number"

The Revenue Impact figure is an annualised estimate extrapolated from current data. It multiplies the observed difference in revenue per visitor between your variations by the number of visitors per year at your current traffic rate.

Early in an experiment, this figure can look enormous because:

You have very few data points, so the observed difference per visitor is volatile — a couple of high-value conversions can dramatically skew the average.
Extrapolating a noisy early signal out to a full year amplifies the noise.
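The two points above can be made concrete with a short calculation. This sketch assumes the extrapolation is simply the per-visitor revenue difference multiplied by daily visitors and 365 days. The exact inputs A vs B uses are not documented here, and all figures are made up.

```python
def annualised_revenue_impact(rpv_control, rpv_variant, visitors_per_day,
                              days_per_year=365):
    """Difference in revenue per visitor, scaled to a year of traffic.

    The 365-day scaling and the daily-traffic input are assumptions
    for this example.
    """
    return (rpv_variant - rpv_control) * visitors_per_day * days_per_year


visitors_per_variation = 200
control_revenue = 400.0  # 2.00 per visitor
variant_revenue = 460.0  # 2.30 per visitor

baseline = annualised_revenue_impact(
    control_revenue / visitors_per_variation,
    variant_revenue / visitors_per_variation,
    visitors_per_day=1000,
)  # about 109,500

# One unusually large 500 order lands in the variant.
skewed = annualised_revenue_impact(
    control_revenue / visitors_per_variation,
    (variant_revenue + 500.0) / visitors_per_variation,
    visitors_per_day=1000,
)  # about 1,022,000

print(f"baseline estimate: {baseline:,.0f}")
print(f"after one large order: {skewed:,.0f}")
```

With only 200 visitors per variation, a single large order raises the annualised figure by almost a factor of ten.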

Take Revenue Impact directionally, not literally — especially early in an experiment. It becomes more meaningful once you have enough data for the conversion rate to stabilise. A large positive Revenue Impact alongside a high winning probability and sufficient visitor count is a meaningful signal.

Revenue Impact assumes your current traffic rate and average order value will continue indefinitely. Seasonal variations, product launches, pricing changes, and other factors will all affect the real annualised value. Use Revenue Impact to understand the magnitude of the opportunity, not as an accurate financial forecast.

"Results differ across segments"

Seeing different results for different segments is extremely common and often the most valuable insight from an experiment. Mobile and desktop users often behave differently. New visitors and returning visitors can respond to the same change differently. Premium users and free users have different motivations.

If your overall result is inconclusive but one segment shows a strong positive effect, consider running a follow-up experiment targeted specifically at that segment. Conversely, if one segment shows a negative effect while another shows positive, a one-size-fits-all variant may not be the right approach.

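Segment analysis is exploratory: use it to generate hypotheses, not to declare a winner within a subgroup. The experiment was not powered to detect effects within individual segments. Each segment contains only a fraction of the visitors, so segment-level estimates are noisier and less reliable than the overall result. The more segments you inspect, the more likely it is that one of them shows a large effect purely by chance.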

"How do I know when I have enough data?"

A rule of thumb for Bayesian A/B testing: you generally need at least 100 conversions per variation before the results become stable enough to act on. For low-conversion-rate experiments (below 1%), that means more than 10,000 visitors per variation.
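The rule of thumb translates directly into a visitor count once you have a rough idea of your conversion rate. The helper below is a hypothetical convenience function, not part of A vs B, and the 100-conversion target is the rule of thumb rather than a guarantee.

```python
import math


def visitors_needed(conversion_rate, target_conversions=100):
    """Visitors per variation needed to expect `target_conversions`."""
    if conversion_rate <= 0:
        raise ValueError("conversion_rate must be positive")
    # round() guards against floating-point noise before rounding up.
    return math.ceil(round(target_conversions / conversion_rate, 6))


for rate in (0.05, 0.01, 0.005):
    print(f"{rate:.1%} conversion rate -> {visitors_needed(rate):,} visitors per variation")
```

At a 5% conversion rate this gives 2,000 visitors per variation, at 1% it gives 10,000, and at 0.5% it gives 20,000.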

A vs B shows a minimum detectable effect calculation in the results panel to help you understand how many visitors are needed to reliably detect the effect size you are looking for. If you are well past this threshold and the result is still inconclusive, the true effect may be smaller than your minimum detectable effect, meaning the experiment ran long enough and the answer is genuinely "no meaningful difference."

Decide your stopping criterion before the experiment starts — either a target number of visitors, a target number of conversions, or a calendar date. Then stop on that criterion, not because the result happens to look good or bad on any given day.