
Bayesian vs Frequentist A/B Testing: A Practical Guide for Developers

statistics · a/b testing · bayesian · frequentist

Both camps claim the other is wrong. The truth is that they answer different questions. Once you understand that, you can use both correctly — and stop misinterpreting p-values.

Every A/B testing platform eventually forces you into a philosophical argument: Bayesian vs frequentist. Product teams debate it. Data scientists go to war over it. And most developers in the middle just want a number they can act on.

Here is the honest take: both approaches are valid, they answer different questions, and conflating them is where most errors in A/B testing practice come from.

The frequentist view

Frequentist statistics — the approach behind p-values and confidence intervals — asks: "If there were truly no difference between control and variant, how often would we see data this extreme or more extreme by chance?"

That probability is the p-value. If p < 0.05, we call the result statistically significant at the 5% significance level (often, and somewhat sloppily, described as "95% confidence").

What it does NOT say: that there is a 95% probability the variant is better. This is the single most common misinterpretation, and it is deeply baked into how product teams talk about A/B tests.

A p-value is not the probability that your hypothesis is true. It is the probability of observing your data (or more extreme data) assuming the null hypothesis is true. These are very different statements.
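The test itself is short. Here is a sketch of a two-proportion z-test of the kind described above, using only the standard library; the conversion counts are illustrative, not real data:

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: how surprising is this difference if H0 (no effect) is true?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(|Z| >= |z|) under the null, using the normal CDF Phi(x) = (1 + erf(x/sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Illustrative counts: 200/4000 conversions in control, 250/4000 in the variant.
z, p = two_proportion_z_test(200, 4000, 250, 4000)
```

Note what the returned p-value means: the chance of seeing a gap this large in a world where the variants are identical. It says nothing directly about how likely the variant is to be better.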

The Bayesian view

Bayesian statistics asks a different question: given the data we have observed, what is the updated probability distribution over the true effect size?

This is more intuitive for decision-making. Instead of a binary "significant / not significant" gate, you get a posterior distribution — a full picture of how likely different effect sizes are given the evidence so far.

From that posterior you can compute:

  • Probability to beat baseline (PTB): the probability the variant's true conversion rate exceeds the control's
  • Expected loss: how much you lose, in expectation, if you ship the wrong variant
  • Credible intervals: a range where the true effect lies with some probability (unlike confidence intervals, these mean exactly what most people think confidence intervals mean)
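All three quantities fall out of the same posterior. A minimal Monte Carlo sketch, assuming uniform Beta(1, 1) priors and the same illustrative counts as before (a real system would vectorise this and might use informed priors):

```python
import random

random.seed(0)

def bayes_summary(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Posterior summaries for two arms with Beta(1 + conv, 1 + failures) posteriors."""
    a = [random.betavariate(1 + conv_a, 1 + n_a - conv_a) for _ in range(draws)]
    b = [random.betavariate(1 + conv_b, 1 + n_b - conv_b) for _ in range(draws)]
    diffs = sorted(bi - ai for ai, bi in zip(a, b))
    ptb = sum(d > 0 for d in diffs) / draws                  # probability to beat baseline
    expected_loss = sum(max(-d, 0) for d in diffs) / draws   # expected loss if we ship B
    ci = (diffs[int(0.025 * draws)], diffs[int(0.975 * draws)])  # 95% credible interval
    return ptb, expected_loss, ci

ptb, loss, ci = bayes_summary(200, 4000, 250, 4000)
```

Each number answers a decision question directly: "how likely is B better?", "how much do I expect to lose if I'm wrong?", and "where does the true lift plausibly sit?".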

When frequentist wins

Frequentist methods shine when you need formal decision gates with known error rates — particularly in regulated contexts, academic publishing, or when stakeholders need a hard "yes/no" with a quantifiable false positive rate.

They are also simpler to explain and audit. A z-test is two lines of code. The assumptions are transparent. There is no prior to argue about.

Use frequentist statistics when:

  • You need a clear threshold decision (ship / do not ship) with a defined alpha
  • Stakeholders are familiar with p-values and confidence intervals
  • You are running a single experiment with a pre-specified sample size (no peeking)
  • You want to control the false positive rate across many experiments

When Bayesian wins

Bayesian methods are superior when you are making sequential decisions under uncertainty — which is exactly what continuous optimisation systems do.

In a Bayesian Multi-Armed Bandit (MAB), traffic is allocated proportionally to each arm's probability of being the best option, updated continuously as data arrives. This means you exploit good variants earlier and explore bad ones less, reducing the opportunity cost of running an experiment.
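The core of Thompson Sampling fits in a few lines. This sketch uses made-up true conversion rates and Beta(1, 1) priors to show the allocation behaviour; it is a toy, not any platform's production implementation:

```python
import random

random.seed(1)

def thompson_assign(arms):
    """Pick an arm via Thompson Sampling.

    `arms` maps arm name -> [conversions, visitors]. Each arm gets a
    Beta(1 + conversions, 1 + failures) posterior; we take one draw per
    arm and route the visitor to the highest draw, so each arm receives
    traffic in proportion to its probability of being the best.
    """
    draws = {
        name: random.betavariate(1 + conv, 1 + n - conv)
        for name, (conv, n) in arms.items()
    }
    return max(draws, key=draws.get)

# Simulate 10,000 visitors; arm B truly converts at 8% vs A's 4%.
true_rate = {"A": 0.04, "B": 0.08}
stats = {"A": [0, 0], "B": [0, 0]}
for _ in range(10_000):
    arm = thompson_assign(stats)
    stats[arm][1] += 1
    stats[arm][0] += random.random() < true_rate[arm]
```

Running this, the better arm ends up with the large majority of traffic: the bandit shifts visitors toward B as its posterior separates, which is exactly the regret reduction described above.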

Use Bayesian methods when:

  • You are running multiple variants simultaneously and want to minimise regret (lost conversions during the experiment)
  • You have prior knowledge about effect sizes that can inform the posterior
  • You want to make sequential decisions without the 'peeking problem' inflating false positives
  • The cost of the experiment itself (in missed conversions) is high relative to the value of certainty

How ACO uses both

ACO runs frequentist statistics on every active experiment every 15 minutes. The z-test verdict (significant_win, significant_loss, borderline_win, flat) drives auto-conclude decisions. The SRM check is also frequentist — it is a chi-squared test against the expected traffic ratio.
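An SRM check of the kind mentioned here is a straightforward chi-squared goodness-of-fit test against the expected split. A sketch with one degree of freedom (two arms); the alpha of 0.001 is a commonly used SRM threshold, not necessarily ACO's:

```python
from math import erf, sqrt

def srm_check(observed, expected_share, alpha=0.001):
    """Sample Ratio Mismatch check via chi-squared goodness-of-fit (1 df).

    observed: (control_visitors, variant_visitors)
    expected_share: expected fraction of traffic in control, e.g. 0.5
    """
    total = sum(observed)
    expected = (total * expected_share, total * (1 - expected_share))
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df, chi2 = Z^2, so the tail probability is 2 * (1 - Phi(sqrt(chi2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(sqrt(chi2) / sqrt(2))))
    return chi2, p_value, p_value < alpha   # True => SRM detected, experiment data suspect

# Expected a 50/50 split, but the variant received noticeably less traffic:
chi2, p, srm = srm_check((5200, 4800), 0.5)
```

A 5200/4800 split on 10,000 visitors looks small, but it is wildly unlikely under a true 50/50 assignment, which is why SRM checks use such strict thresholds.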

The Bayesian Multi-Armed Bandit operates at the campaign level, across experiment cycles. When ACO selects which hypothesis to test next, it uses a Thompson Sampling strategy: each hypothesis has a Beta distribution over its probability of improving conversion, updated based on outcomes from previous cycles. Hypotheses with wider uncertainty intervals get more exploration; consistently flat ideas get less.

This combination — frequentist gates for individual experiment conclusions, Bayesian exploration for the long-run hypothesis portfolio — gives you the best of both frameworks.

The peeking problem

One practical advantage of Bayesian approaches is resistance to peeking. With frequentist tests, looking at results before reaching your pre-specified sample size inflates the false positive rate significantly. A study by Johari et al. found that peeking at frequentist results every day and stopping when p < 0.05 produces a true false positive rate of around 26%, not 5%.

Sequential Bayesian tests and always-valid confidence sequences solve this — you can look whenever you want without inflating error rates. Most modern A/B testing platforms that advertise "stop anytime" results are using some form of this.
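You can see the inflation directly by simulating A/A tests (no true difference) and peeking daily. The parameters below are arbitrary, and the exact inflated rate depends on how often and how long you peek, but the qualitative result matches the finding cited above:

```python
import random
from math import erf, sqrt

random.seed(42)

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def aa_test_with_peeking(days=20, visitors_per_day=300, rate=0.05):
    """One A/A test: both arms convert at the same rate, peek after each day."""
    ca = na = cb = nb = 0
    for _ in range(days):
        for _ in range(visitors_per_day):
            na += 1; ca += random.random() < rate
            nb += 1; cb += random.random() < rate
        if z_test_p(ca, na, cb, nb) < 0.05:
            return True        # "significant" — a false positive by construction
    return False

runs = 400
fp_rate = sum(aa_test_with_peeking() for _ in range(runs)) / runs
```

Even though every test here compares two identical arms, stopping at the first p < 0.05 produces a false positive rate several times the nominal 5%. A single pre-specified look at the end would sit at 5% as advertised.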

What to remember

  • p-values answer: 'how surprising is this data if there is no effect?' Not 'how likely is the variant to be better?'
  • Bayesian posteriors answer: 'given this data, what is my updated belief about the effect size?'
  • Neither is universally superior — they solve different problems
  • For autonomous systems making sequential decisions, Bayesian methods reduce the cost of running experiments
  • For one-off, high-stakes decisions requiring formal error control, frequentist methods are cleaner
  • The peeking problem is real — do not stop a frequentist test early because it looks significant

Built by ACO

Run experiments like this automatically

ACO installs in one snippet. It generates hypotheses, runs A/B tests, checks for SRM and Twyman's anomalies, and rolls out winning variants — with a full Git audit trail.
