
Bayesian vs Frequentist A/B Testing: A Practical Guide for Developers

statistics · a/b testing · bayesian · frequentist

Both camps claim the other is wrong. The truth is that they answer different questions. Once you understand that, you can use both correctly — and stop misinterpreting p-values.

Every A/B testing platform eventually forces you into a philosophical argument: Bayesian vs frequentist. Product teams debate it. Data scientists go to war over it. And most developers in the middle just want a number they can act on.

Here is the honest take: both approaches are valid, they answer different questions, and conflating them is where most errors in A/B testing practice come from.

The frequentist view

Frequentist statistics — the approach behind p-values and confidence intervals — asks: "If there were truly no difference between control and variant, how often would we see data this extreme or more extreme by chance?"

That probability is the p-value. If p < 0.05, we call the result statistically significant at the 5% significance level (often, and somewhat sloppily, described as "95% confidence").

What it does NOT say: that there is a 95% probability the variant is better. This is the single most common misinterpretation, and it is deeply baked into how product teams talk about A/B tests.

A p-value is not the probability that your hypothesis is true. It is the probability of observing your data (or more extreme data) assuming the null hypothesis is true. These are very different statements.
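The test itself is short. Here is a sketch of a two-proportion z-test of the kind described above, using only the standard library; the conversion counts are illustrative, not real data:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: how surprising is this difference if H0 (no effect) is true?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) under the null, using the normal CDF Phi(x) = (1 + erf(x/sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts: 200/4000 conversions in control, 250/4000 in the variant.
z, p = two_proportion_z_test(200, 4000, 250, 4000)
```

Note what the returned p-value means: the chance of seeing a gap this large in a world where the variants are identical. It says nothing directly about how likely the variant is to be better.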

The Bayesian view

Bayesian statistics asks a different question: given the data we have observed, what is the updated probability distribution over the true effect size?

This is more intuitive for decision-making. Instead of a binary "significant / not significant" gate, you get a posterior distribution — a full picture of how likely different effect sizes are given the evidence so far.

From that posterior you can compute:

  • Probability to beat baseline (PTB): the probability the variant's true conversion rate exceeds the control's
  • Expected loss: how much you lose, in expectation, if you ship the wrong variant
  • Credible intervals: a range where the true effect lies with some probability (unlike confidence intervals, these mean exactly what most people think confidence intervals mean)
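All three quantities fall out of the same posterior. A minimal Monte Carlo sketch, assuming uniform Beta(1, 1) priors and the same illustrative counts as before (a real system would vectorise this and might use informed priors):

```python
import random

random.seed(0)

def bayes_summary(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Posterior summaries for two arms with Beta(1 + conv, 1 + failures) posteriors."""
    a = [random.betavariate(1 + conv_a, 1 + n_a - conv_a) for _ in range(draws)]
    b = [random.betavariate(1 + conv_b, 1 + n_b - conv_b) for _ in range(draws)]
    diffs = sorted(bi - ai for ai, bi in zip(a, b))
    ptb = sum(d > 0 for d in diffs) / draws                  # probability to beat baseline
    expected_loss = sum(max(-d, 0) for d in diffs) / draws   # expected loss if we ship B
    ci = (diffs[int(0.025 * draws)], diffs[int(0.975 * draws)])  # 95% credible interval
    return ptb, expected_loss, ci

ptb, loss, ci = bayes_summary(200, 4000, 250, 4000)
```

Each number answers a decision question directly: "how likely is B better?", "how much do I expect to lose if I'm wrong?", and "where does the true lift plausibly sit?".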

When frequentist wins

Frequentist methods shine when you need formal decision gates with known error rates — particularly in regulated contexts, academic publishing, or when stakeholders need a hard "yes/no" with a quantifiable false positive rate.

They are also simpler to explain and audit. A z-test is two lines of code. The assumptions are transparent. There is no prior to argue about.

Use frequentist statistics when:

  • You need a clear threshold decision (ship / do not ship) with a defined alpha
  • Stakeholders are familiar with p-values and confidence intervals
  • You are running a single experiment with a pre-specified sample size (no peeking)
  • You want to control the false positive rate across many experiments

When Bayesian wins

Bayesian methods are superior when you are making sequential decisions under uncertainty — which is exactly what continuous optimisation systems do.

In a Bayesian Multi-Armed Bandit (MAB), traffic is allocated proportionally to each arm's probability of being the best option, updated continuously as data arrives. This means you exploit good variants earlier and explore bad ones less, reducing the opportunity cost of running an experiment.
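The core of Thompson Sampling fits in a few lines. This sketch uses made-up true conversion rates and Beta(1, 1) priors to show the allocation behaviour; it is a toy, not any platform's production implementation:

```python
import random

random.seed(1)

def thompson_assign(arms):
    """Pick an arm via Thompson Sampling.

    `arms` maps arm name -> [conversions, visitors]. Each arm gets a
    Beta(1 + conversions, 1 + failures) posterior; we take one draw per
    arm and route the visitor to the highest draw, so each arm receives
    traffic in proportion to its probability of being the best.
    """
    draws = {
        name: random.betavariate(1 + conv, 1 + n - conv)
        for name, (conv, n) in arms.items()
    }
    return max(draws, key=draws.get)

# Simulate 10,000 visitors; arm B truly converts at 8% vs A's 4%.
true_rate = {"A": 0.04, "B": 0.08}
stats = {"A": [0, 0], "B": [0, 0]}
for _ in range(10_000):
    arm = thompson_assign(stats)
    stats[arm][1] += 1
    stats[arm][0] += random.random() < true_rate[arm]
```

Running this, the better arm ends up with the large majority of traffic: the bandit shifts visitors toward B as its posterior separates, which is exactly the regret reduction described above.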

Use Bayesian methods when:

  • You are running multiple variants simultaneously and want to minimise regret (lost conversions during the experiment)
  • You have prior knowledge about effect sizes that can inform the posterior
  • You want to make sequential decisions without the 'peeking problem' inflating false positives
  • The cost of the experiment itself (in missed conversions) is high relative to the value of certainty

How ACO uses both

ACO runs frequentist statistics on every active experiment every 15 minutes. The z-test verdict (significant_win, significant_loss, borderline_win, flat) drives auto-conclude decisions. The SRM check is also frequentist — it is a chi-squared test against the expected traffic ratio.
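An SRM check of the kind mentioned here is a straightforward chi-squared goodness-of-fit test against the expected split. A sketch with one degree of freedom (two arms); the alpha of 0.001 is a commonly used SRM threshold, not necessarily ACO's:

```python
from math import erf, sqrt

def srm_check(observed, expected_share, alpha=0.001):
    """Sample Ratio Mismatch check via chi-squared goodness-of-fit (1 df).

    observed: (control_visitors, variant_visitors)
    expected_share: expected fraction of traffic in control, e.g. 0.5
    """
    total = sum(observed)
    expected = (total * expected_share, total * (1 - expected_share))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df, chi2 = Z^2, so the tail probability is 2 * (1 - Phi(sqrt(chi2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(sqrt(chi2) / sqrt(2))))
    return chi2, p_value, p_value < alpha   # True => SRM detected, experiment data suspect

# Expected a 50/50 split, but the variant received noticeably less traffic:
chi2, p, srm = srm_check((5200, 4800), 0.5)
```

A 5200/4800 split on 10,000 visitors looks small, but it is wildly unlikely under a true 50/50 assignment, which is why SRM checks use such strict thresholds.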

The Bayesian Multi-Armed Bandit operates at the campaign level, across experiment cycles. When ACO selects which hypothesis to test next, it uses a Thompson Sampling strategy: each hypothesis has a Beta distribution over its probability of improving conversion, updated based on outcomes from previous cycles. Hypotheses with wider uncertainty intervals get more exploration; consistently flat ideas get less.

This combination — frequentist gates for individual experiment conclusions, Bayesian exploration for the long-run hypothesis portfolio — gives you the best of both frameworks.

The peeking problem

One practical advantage of Bayesian approaches is resistance to peeking. With frequentist tests, looking at results before reaching your pre-specified sample size inflates the false positive rate significantly. A study by Johari et al. found that peeking at frequentist results every day and stopping when p < 0.05 produces a true false positive rate of around 26%, not 5%.

Sequential Bayesian tests and always-valid confidence sequences solve this — you can look whenever you want without inflating error rates. Most modern A/B testing platforms that advertise "stop anytime" results are using some form of this.
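You can see the inflation directly by simulating A/A tests (no true difference) and peeking daily. The parameters below are arbitrary, and the exact inflated rate depends on how often and how long you peek, but the qualitative result matches the finding cited above:

```python
import random
from math import erf, sqrt

random.seed(42)

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def aa_test_with_peeking(days=20, visitors_per_day=300, rate=0.05):
    """One A/A test: both arms convert at the same rate, peek after each day."""
    ca = na = cb = nb = 0
    for _ in range(days):
        for _ in range(visitors_per_day):
            na += 1; ca += random.random() < rate
            nb += 1; cb += random.random() < rate
        if z_test_p(ca, na, cb, nb) < 0.05:
            return True        # "significant" — a false positive by construction
    return False

runs = 400
fp_rate = sum(aa_test_with_peeking() for _ in range(runs)) / runs
```

Even though every test here compares two identical arms, stopping at the first p < 0.05 produces a false positive rate several times the nominal 5%. A single pre-specified look at the end would sit at 5% as advertised.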

What to remember

  • p-values answer: 'how surprising is this data if there is no effect?' Not 'how likely is the variant to be better?'
  • Bayesian posteriors answer: 'given this data, what is my updated belief about the effect size?'
  • Neither is universally superior — they solve different problems
  • For autonomous systems making sequential decisions, Bayesian methods reduce the cost of running experiments
  • For one-off, high-stakes decisions requiring formal error control, frequentist methods are cleaner
  • The peeking problem is real — do not stop a frequentist test early because it looks significant

Built by ACO

Run experiments like this automatically

ACO installs in one snippet. It generates hypotheses, runs A/B tests, checks for SRM and Twyman's anomalies, and rolls out winning variants — with a full Git audit trail.
