Tony Twyman was a British media researcher who articulated a principle that should be tattooed on the wall of every analytics team:
"Any figure that looks interesting or different is usually wrong."
Twyman's Law is not a formal statistical theorem. It is an empirical observation about how data errors distribute. Errors are not random — they cluster in the places that make results look striking. When your A/B test shows a 35% improvement in conversion rate, the probability that something went wrong in your measurement pipeline is much higher than when it shows a flat result.
This is counterintuitive and uncomfortable, which is why most teams ignore it.
Twyman's Law: any figure that looks interesting or different is usually wrong. The more surprising the result, the more carefully you should investigate before acting on it.
Why does this happen?
Consider all the ways a measurement pipeline can fail: tracking bugs, sample ratio mismatch (SRM), bot contamination, logging mismatches, segment leakage, cookie persistence issues, attribution errors. Most of these failures produce noise — slightly higher or lower numbers that fall within your normal variance.
But some failures are directional. A tracking bug that only fires on the variant confirmation page artificially inflates variant conversion. A bot that preferentially visits the control doubles control sessions. These bugs look like wins.
The size of the apparent effect correlates with the severity of the error. A 40% improvement is far more likely to be a tracking bug than a genuine 40% improvement — because genuine 40% improvements are rare, and tracking bugs that look like 40% improvements happen regularly.
The base rate problem
Here is the formal version. Apply Bayes' theorem to your prior beliefs about effect sizes.
Before running any experiment, what is your prior probability that a copy or UI change produces a 30%+ relative improvement in conversion? In most mature products with reasonable existing copy, this is genuinely rare — call it a 2% prior.
Now: what is the probability that a tracking error or data quality issue produces an apparent 30% improvement? It is not uncommon — call it 5-10%.
When you observe a 30% result, weigh both explanations against each other. With a 2% prior on a genuine effect and a 5-10% prior on an error that merely mimics one, the error hypothesis starts out more probable, so the posterior probability that the result is real is well under half, which is lower than most people's intuition suggests.
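The arithmetic above can be made concrete with a back-of-the-envelope posterior. The priors below are the illustrative numbers from the text, not measured values, and the calculation ignores pure sampling noise for simplicity:

```python
# Back-of-the-envelope posterior for a surprising A/B result.
# Both priors are the illustrative figures from the text (assumptions).

p_genuine = 0.02   # prior: a copy/UI change truly yields a 30%+ lift
p_error = 0.07     # prior: a tracking/data bug produces an APPARENT 30%+ lift

# If we observe a 30% lift, attribute it to one of these two sources.
posterior_real = p_genuine / (p_genuine + p_error)

print(f"P(result is real | observed 30% lift) = {posterior_real:.0%}")
```

With these priors, fewer than one in four such "wins" is genuine, which is the quantitative content of "extraordinary results require extraordinary evidence."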
Extraordinary results require extraordinary evidence. Before shipping a 30%+ improvement, the evidence that your pipeline is clean needs to be stronger than for a 5% improvement.
The Twyman's Law checklist
When a result looks striking, run through these checks before concluding:
- SRM check: is the traffic ratio close to your configured split? (chi-squared test, p < 0.05 → investigate)
- Novelty effect: is the lift concentrated in returning users who already know your product, or in new users? Returning user lift often fades after a few weeks as novelty wears off
- Metric correlation: if conversions are up 30%, are clicks, signups, and revenue all up proportionally? If conversions are up 30% but revenue is flat, something is wrong with the attribution
- Guardrail metrics: are any of your counter-metrics (bounce rate, support tickets, churn) moving unusually?
- Segment sanity: does the lift hold across all major segments (mobile/desktop, new/returning, geography)? A result that only exists in one narrow segment is fragile
- A/A comparison: if you can run a quick A/A test (both arms serve the identical experience), do you see a flat result? Any apparent lift in an A/A means your pipeline is broken
- Timing: is the lift concentrated in a short window that coincides with a deployment, a campaign, or an external event?
How ACO handles this
ACO runs a Twyman's Law check on every experiment that reaches a significant verdict before auto-concluding. Specifically, it flags a `twymans_warning` verdict when:
- The relative lift exceeds 25% (configurable threshold) — large lifts get additional scrutiny
- The metric correlation between primary and secondary metrics is inconsistent (e.g., conversion up, revenue flat)
- The SRM chi-squared is elevated but not quite at the threshold — a near-miss is still a signal
A `twymans_warning` experiment does not auto-conclude. It surfaces in the dashboard requiring manual review. This is the correct response — a surprising result is an invitation to investigate, not a trigger to ship.
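A gate of this shape is straightforward to sketch. The function below illustrates the three flag conditions described above; the function name, parameters, and exact thresholds are assumptions for illustration, not ACO's actual API:

```python
# Illustrative sketch of a Twyman's-Law gate (names/thresholds are assumptions).

def twymans_check(relative_lift: float,
                  primary_lift: float,
                  secondary_lift: float,
                  srm_chi2: float,
                  lift_threshold: float = 0.25,   # configurable, per the text
                  srm_near_miss: float = 3.0) -> str:
    reasons = []
    # 1. Large lifts get additional scrutiny.
    if abs(relative_lift) > lift_threshold:
        reasons.append("large lift")
    # 2. Correlated metrics should move together; a big gap is suspicious
    #    (e.g. conversion up 30% while revenue is flat).
    if primary_lift > 0.1 and secondary_lift < primary_lift / 2:
        reasons.append("metric correlation inconsistent")
    # 3. An SRM chi-squared near-miss (elevated but under 3.841) is a signal.
    if srm_near_miss < srm_chi2 < 3.841:
        reasons.append("SRM near-miss")
    return "twymans_warning" if reasons else "auto_conclude"

print(twymans_check(relative_lift=0.30, primary_lift=0.30,
                    secondary_lift=0.01, srm_chi2=1.2))
# A 30% conversion lift with flat revenue trips the gate: twymans_warning
```

The design choice worth noting is that the gate never blocks shipping outright; it only withholds auto-conclusion and routes the experiment to a human.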
What to do when you have a big win
Big wins do happen. But they require stronger validation than small wins, not weaker.
Before shipping a result above your typical variance:
- Run a holdout after declaring significance — keep 5% of traffic on control for two weeks post-ship and confirm the lift persists
- Check the result in a different analysis tool independently — if your A/B platform shows 30% and your analytics tool shows flat, there is a discrepancy to resolve
- Have a sceptic on the team audit the tracking implementation before you ship
- Look for the mechanism — can you explain, from first principles, why this copy change would produce a 30% improvement? If you cannot tell a coherent story, that is a data quality signal
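The holdout step in the list above reduces to a standard comparison of two conversion rates. A minimal sketch using a two-proportion z-test, with hypothetical post-ship counts:

```python
# Minimal post-ship holdout check: does the lift persist?
# The counts in the example call are made-up illustrative numbers.
import math

def two_prop_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Holdout (old experience) vs shipped (new experience), two weeks post-ship
z = two_prop_z(conv_a=480, n_a=10_000, conv_b=640, n_b=10_000)
print(f"z = {z:.2f}  (|z| > 1.96 means the lift persists at the 5% level)")
```

If the holdout comparison is flat while the original experiment showed 30%, you have found the discrepancy before it became a shipped regression in your mental model.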
The right attitude toward surprising results
Twyman's Law does not say that exciting results are always wrong. It says they deserve more scrutiny, not less.
The correct team culture around striking A/B results is curiosity, not celebration. "Interesting — let's check the data" is the right response. "We're up 30%, ship it!" is how you end up with a production rollout that shows no real-world improvement and a team that has learned to distrust their own experimentation system.
Healthy experimentation culture treats large apparent wins as hypotheses to validate, not facts to act on. The tools that support this culture — SRM detection, metric correlation checks, Twyman's flags — are as important as the statistical tests themselves.