One of the most common questions in A/B testing: "We have been running this for a week — can we call it?"
Usually, the answer is no. Not because a week is too short as an absolute rule, but because the right answer depends entirely on your traffic volume, your current conversion rate, and the minimum effect size you care about detecting.
The good news: you can calculate the required sample size before you start. The bad news: most teams skip this step, which is why most A/B tests produce unreliable results.
The three inputs you need
Sample size calculation for a two-sample proportions test requires three numbers:
- Baseline conversion rate (p₁): your current measured conversion rate on the control
- Minimum detectable effect (MDE): the smallest lift you actually care about — e.g., a 10% relative improvement on a 5% baseline rate means you want to detect moves from 5% to 5.5%
- Statistical power (1 − β): typically 80% or 90%. This means: if the true effect is exactly your MDE, what fraction of experiments would correctly detect it?
The MDE is the most important input and the most frequently miscalibrated one. Teams often set it too low (wanting to detect a 2% relative improvement), which requires hundreds of thousands of users per variant. Set your MDE based on what would actually change a business decision.
The sample size formula
For a two-sided test at α = 0.05 and 80% power, a reasonable approximation is:
```typescript
// n = required users per variant
// p1 = baseline conversion rate
// mde = minimum detectable effect (absolute, not relative)
// e.g. if baseline is 0.05 and you want a 10% relative lift, mde = 0.005
function sampleSizePerVariant(p1: number, mde: number): number {
  const p2 = p1 + mde;
  const p_bar = (p1 + p2) / 2;
  // z-scores for α=0.05 (two-sided) and β=0.20 (80% power)
  const z_alpha = 1.96;
  const z_beta = 0.842;
  return Math.ceil(
    Math.pow(z_alpha * Math.sqrt(2 * p_bar * (1 - p_bar)) +
             z_beta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2) /
    Math.pow(mde, 2)
  );
}

// Example: 5% baseline, 10% relative MDE → 0.5pp absolute MDE
sampleSizePerVariant(0.05, 0.005); // ≈ 31,000 per variant → ≈ 62,000 total
```
Translating to time
Once you have the required total sample, divide by your daily unique visitor count to get the minimum runtime in days.
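That division can be sketched in one line. The visitor and sample figures below are illustrative, not taken from any particular calculation:

```typescript
// Sketch: minimum runtime in days from required total sample and traffic.
// totalSample and dailyVisitors are inputs you supply; the example
// numbers are illustrative.
function minimumRuntimeDays(totalSample: number, dailyVisitors: number): number {
  return Math.ceil(totalSample / dailyVisitors);
}

// e.g. 60,000 total users at 4,000 unique visitors/day
minimumRuntimeDays(60000, 4000); // 15 days
```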
Some important caveats:
- Always run for at least one full week to capture day-of-week effects — weekday and weekend behaviour differ significantly for most products
- If your traffic is highly seasonal (a sale event, a product launch), wait until a 'normal' traffic period to run the experiment
- Two weeks is a common default because it captures two full weekday/weekend cycles — but it is a floor on runtime, not a guarantee that you will reach your required sample
- If your calculation says 6 months, your MDE is too small relative to your traffic — either accept that you cannot detect small effects, or increase the effect you care about
The peeking problem in practice
If you run a sample size calculation beforehand and then stop the experiment when it hits significance — even before reaching your target sample — you are peeking.
This is the most common source of inflated false positives in practitioner A/B testing. At a nominal 5% alpha, peeking daily and stopping at the first significant result can push the true false positive rate to 20-25% or more.
The fix: commit to your sample size before you start and do not interpret interim results as final. If your platform shows results in real time, use it for monitoring (detecting anomalies and sample ratio mismatch, SRM) — not for early stopping decisions.
Pre-register your sample size. Write it down before you launch. Do not let 'it looks significant after three days' override your pre-specification.
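The inflation is easy to demonstrate with a simulation. The sketch below is illustrative, not a measurement of any real platform: it runs A/A experiments (no true effect between arms), peeks daily with a pooled two-proportion z-test, and stops at the first "significant" result. All names and parameters here are assumptions for the demo.

```typescript
// Pooled two-proportion z-test at α = 0.05 (two-sided), equal n per arm.
function zTestSignificant(convA: number, convB: number, n: number): boolean {
  const pPooled = (convA + convB) / (2 * n);
  const se = Math.sqrt((2 * pPooled * (1 - pPooled)) / n);
  if (se === 0) return false;
  return Math.abs(convA / n - convB / n) / se > 1.96;
}

// Run A/A experiments (both arms share the same true rate), peek once
// per simulated day, and stop at the first peek that looks significant.
// Returns the observed false positive rate — nominally 5% without peeking.
function peekingFalsePositiveRate(
  days: number, usersPerDayPerArm: number, trueRate: number, trials: number
): number {
  let falsePositives = 0;
  for (let t = 0; t < trials; t++) {
    let convA = 0, convB = 0, n = 0;
    for (let d = 0; d < days; d++) {
      for (let u = 0; u < usersPerDayPerArm; u++) {
        if (Math.random() < trueRate) convA++;
        if (Math.random() < trueRate) convB++;
      }
      n += usersPerDayPerArm;
      if (zTestSignificant(convA, convB, n)) {
        falsePositives++; // stopped early on a fluke
        break;
      }
    }
  }
  return falsePositives / trials;
}

// 14 daily peeks at 1,000 users/day/arm, 5% true rate in both arms.
peekingFalsePositiveRate(14, 1000, 0.05, 2000);
```

With 14 looks the observed rate lands well above the nominal 5%, and it keeps climbing toward the 20-25% range quoted above as the number of peeks grows.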
What to do with low-traffic sites
If your calculation says you need 100,000 users per variant and you get 5,000 visitors a month, you have a low-traffic problem, not an A/B testing problem.
In this situation:
- Raise your MDE — focus on testing bold changes that are plausibly capable of producing large effects
- Increase conversion rate first through qualitative research (user interviews, session recordings) before quantitative testing
- Test higher in the funnel where volume is greater, not at the checkout step where you have 50 conversions per month
- Consider Bayesian approaches that can make useful decisions with less data by explicitly incorporating prior knowledge
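To make the first point concrete, here is a sketch of how quickly the required sample falls as the MDE rises. It reuses the same normal-approximation formula as the calculator earlier; the 5% baseline and the result figures are illustrative approximations:

```typescript
// Same two-proportion approximation as above (α = 0.05 two-sided, 80% power).
function sampleSize(p1: number, mde: number): number {
  const p2 = p1 + mde;
  const pBar = (p1 + p2) / 2;
  const numerator = Math.pow(
    1.96 * Math.sqrt(2 * pBar * (1 - pBar)) +
    0.842 * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2);
  return Math.ceil(numerator / (mde * mde));
}

// 5% baseline at increasing relative lifts: required users per variant
// drops from roughly 31,000 (10% lift) to ~5,300 (25%) to ~1,500 (50%).
for (const rel of [0.10, 0.25, 0.50]) {
  console.log(rel, sampleSize(0.05, 0.05 * rel));
}
```

Doubling the relative MDE cuts the required sample by roughly a factor of four, which is what makes bigger bets viable on smaller sites.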
ACO surfaces this directly as a traffic_guide in the dashboard — each plan tier shows the minimum monthly visitors needed to reliably detect improvements of a given size. There is no point running an experiment that is statistically underpowered from the start.
Quick reference
- Calculate sample size before you start — not after
- Use an MDE that matches what would actually change a decision
- Run for at least one full week regardless of sample count
- Do not stop early because results look significant
- Low traffic? Raise your MDE or test higher in the funnel
- SRM check before you interpret any results — if traffic split is off, the data is invalid