
Why Most A/B Tests Return Wrong Results: The Sample Ratio Mismatch Problem

a/b testing · statistics · experimentation

Your A/B test reached statistical significance. Champagne? Not yet. If the traffic split between control and variant drifted from what you configured, every number on that results screen is a lie.

Here is how that plays out: the test hits 95% confidence, the variant is up 12%, the team celebrates. You ship — and conversions stay flat.

This is Sample Ratio Mismatch (SRM), and it affects more experiments than anyone wants to admit. Ronny Kohavi's large-scale study at Microsoft found SRM in roughly 10% of controlled experiments on the Bing platform alone. In smaller organisations with less mature infrastructure, the rate is higher.

What is Sample Ratio Mismatch?

An experiment configured for a 50/50 split should send exactly half its traffic to control and half to variant. SRM occurs when the observed ratio diverges significantly from that expectation.

A 50/50 experiment that ends up 48/52 might not sound alarming, but at scale it is: with 100,000 users, that split produces a chi-squared statistic of 160, a p-value far below any conventional threshold. You cannot trust any metric from that experiment, because something in your assignment pipeline affected the two groups differently — and whatever caused the imbalance almost certainly also affected your conversion metric.

SRM is not a statistical quirk you can correct for. It indicates a flaw in the experiment itself. The only correct response is to investigate, fix, and re-run.

The most common causes

SRM arises wherever traffic touches your infrastructure between assignment and observation. The usual suspects:

  • Bot filtering applied only to one variant (bots get assigned, then scrubbed post-hoc from variant but not control, or vice versa)
  • Redirect chains adding latency to one variant, causing browsers to time out and drop the session entirely
  • Client-side assignment firing after page load — users who leave before the JS executes are counted in the denominator but not the numerator
  • Cache layers serving stale control pages to users already assigned to variant
  • A/A test contamination — users previously in an overlapping experiment inherit a biased assignment
  • Logging bugs where one variant's event stream is sampled at a different rate
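The client-side assignment failure mode is worth seeing concretely. A minimal sketch of the safer pattern, assuming a hypothetical `logExposure` analytics call (not any particular SDK's API): log the exposure at the moment of assignment, not after page load, so a user who bounces early is either in both the denominator and the numerator or in neither.

```javascript
// Anti-pattern: logging only after full page load means early-bouncing
// users are assigned (denominator) but never logged (numerator):
//   window.addEventListener("load", () => logExposure(...));

// Safer: deterministic assignment plus an immediate exposure log.
function assignVariant(userId, experimentId) {
  // Simple string hash for bucketing (illustrative only; production
  // systems typically use a seeded hash such as MurmurHash).
  const key = `${experimentId}:${userId}`;
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  const variant = hash % 2 === 0 ? "control" : "variant";
  logExposure(userId, experimentId, variant); // fires immediately, not on load
  return variant;
}

function logExposure(userId, experimentId, variant) {
  // In production this would enqueue a beacon (e.g. navigator.sendBeacon).
  console.log(JSON.stringify({ userId, experimentId, variant }));
}
```

Because the hash is deterministic, the same user always lands in the same bucket, and the exposure event is emitted in the same synchronous call that decides the assignment.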

How to detect it

Detection is a straightforward chi-squared goodness-of-fit test against the expected split. For a 50/50 experiment, with one degree of freedom:

SRM detection (JavaScript)
// Chi-squared goodness-of-fit test against the configured 50/50 split
function detectSRM(observedControl, observedVariant) {
  const totalUsers = observedControl + observedVariant;
  const expectedControl = totalUsers * 0.5;
  const expectedVariant = totalUsers * 0.5;

  const chiSq =
    Math.pow(observedControl - expectedControl, 2) / expectedControl +
    Math.pow(observedVariant - expectedVariant, 2) / expectedVariant;

  // chiSq > 3.84 → p < 0.05 at 1 degree of freedom → SRM detected
  return { chiSq, srmDetected: chiSq > 3.84 };
}

// Example: 48,000 vs 52,000 users
detectSRM(48000, 52000); // { chiSq: 160, srmDetected: true }

ACO runs this check automatically every 15 minutes during a live experiment and surfaces an `invalid_srm` verdict before any conclusion is drawn. (When a check repeats throughout a run, a threshold stricter than 0.05, commonly p < 0.001 in the SRM literature, keeps repeated peeking from producing false alarms.) The experiment is flagged, not concluded.

Fixing SRM in practice

Once you detect SRM, work backwards through your assignment pipeline:

  • Check your logs for drop-off between assignment events and the first page-view event for each variant
  • Run a simple ratio test per hour — if the drift is worse at specific times, look at deployment events, cache purges, or bot spikes
  • Compare bot / crawler sessions between groups — if one group has 3× the crawler rate, bot filtering is the culprit
  • Audit redirect chains — a 301 on variant and a 200 on control will cause differential abandonment
  • If you use server-side assignment, confirm the experiment definition was not deployed gradually (rolling deploys contaminate early data)
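The hourly ratio test from the list above can be sketched directly. Assuming assignment events carry a millisecond timestamp and a variant label (the field names here are illustrative, not any particular event schema):

```javascript
// Per-hour chi-squared scan: flags hours where the split drifts.
// Events are assumed to look like { timestamp: msEpoch, variant: "control" | "variant" }.
function hourlySRMScan(events, threshold = 3.84) {
  // Bucket assignment counts by hour.
  const buckets = new Map();
  for (const { timestamp, variant } of events) {
    const hour = Math.floor(timestamp / 3600000); // ms → hour bucket
    if (!buckets.has(hour)) buckets.set(hour, { control: 0, variant: 0 });
    buckets.get(hour)[variant]++;
  }

  // Run the same chi-squared test independently per bucket.
  const flagged = [];
  for (const [hour, { control, variant }] of buckets) {
    const expected = (control + variant) / 2;
    if (expected === 0) continue;
    const chiSq =
      Math.pow(control - expected, 2) / expected +
      Math.pow(variant - expected, 2) / expected;
    if (chiSq > threshold) flagged.push({ hour, control, variant, chiSq });
  }
  return flagged;
}
```

If the flagged hours cluster around a deploy, a cache purge, or a known crawler wave, you have found your culprit.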

Why tooling matters

Most visual A/B testing tools do not check for SRM by default. They show you a confidence interval and call it done. This is dangerous.

Proper experimentation infrastructure validates the experiment before it interprets the experiment. SRM detection should gate every result — if the ratio is off, no metric should be presented as conclusive.

This is why ACO runs the chi-squared SRM check as a first-class quality gate in every evaluation cycle, before computing lift, z-scores, or Bayesian posteriors. Bad data in, bad decision out, regardless of how sophisticated the statistics on top are.
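Gating in this style is simple to wire up. A hypothetical sketch (the verdict label mirrors the `invalid_srm` verdict mentioned earlier, but the function is illustrative, not ACO's actual implementation):

```javascript
// Validity gate: refuse to compute lift when the sample ratio is off.
function evaluateExperiment({ controlUsers, variantUsers, controlConversions, variantConversions }) {
  const expected = (controlUsers + variantUsers) / 2;
  const chiSq =
    Math.pow(controlUsers - expected, 2) / expected +
    Math.pow(variantUsers - expected, 2) / expected;

  // Gate first: with SRM present, every downstream metric is suspect.
  if (chiSq > 3.84) {
    return { verdict: "invalid_srm", chiSq };
  }

  // Only now is it safe to interpret the conversion metric.
  const controlRate = controlConversions / controlUsers;
  const variantRate = variantConversions / variantUsers;
  const lift = (variantRate - controlRate) / controlRate;
  return { verdict: "valid", lift };
}
```

The point of the structure is ordering: the lift calculation is unreachable until the validity check passes, so a dashboard built on this function cannot display a conclusive result for a broken experiment.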

Key takeaways

  • Roughly 10% of A/B tests on large platforms show SRM significant enough to invalidate results
  • SRM indicates a broken pipeline — you cannot correct for it after the fact
  • Chi-squared against your expected split ratio is sufficient for detection
  • Any good experimentation platform should surface SRM before showing a conclusion
  • The most common cause is differential data loss between control and variant, not a fluke in your random number generator

Built by ACO

Run experiments like this automatically

ACO installs in one snippet. It generates hypotheses, runs A/B tests, checks for SRM and Twyman's anomalies, and rolls out winning variants — with a full Git audit trail.

Start free trial →