A/B Testing: The Mathematics of Making Better Decisions

In the late 2000s, Google tested 41 different shades of blue to find which one users were most likely to click on in search results. That sounds absurd: does shade of blue really matter? By Google's later account it did: the winning shade reportedly generated an additional $200 million in annual revenue. Google didn't guess or debate. They ran an experiment, collected data, and applied statistics. That process is called A/B testing, and it's how many technology companies settle design decisions.

The Setup: Two Versions, One Question

An A/B test is simple in concept. You have two versions of something — a webpage, an email subject line, a button color. You randomly split your users into two groups: Group A sees the original, Group B sees the new version. After a set period, you measure an outcome — click rate, purchase rate, sign-up rate — and ask: is the difference between the two groups real, or just random noise?
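The random split can be sketched in a few lines. This is only an illustration, not how any particular company does it: the function name `assign_group`, the experiment label, and the `user-N` IDs are all hypothetical. It hashes the user ID so that the same user always lands in the same group while the overall split stays close to 50/50.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "button_color") -> str:
    """Deterministically assign a user to group "A" or "B".

    Hypothetical helper: hashing the experiment label together with the
    user ID keeps each user's assignment stable across visits.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # An even hash value goes to A, an odd one to B.
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Assign 10,000 made-up users and count how many land in each group.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}")] += 1

print(counts)
```

Hashing rather than flipping a fresh coin on each visit is a design choice: a returning user never switches versions in the middle of the experiment.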

That last question is the tricky part. If Group B clicks 4.2% of the time and Group A clicks 4.0% of the time, is the difference of 0.2 percentage points meaningful? Or did it happen by chance, the way you might flip five heads in a row without the coin being unfair? Statistics gives you a way to answer.

The Hypothesis Test

You start by assuming the two versions perform identically. This assumption is called the null hypothesis. Then you ask: if they truly were identical, how likely is it that the data would show a difference at least as large as the one you observed, purely by chance?

Test statistic:

z = (p_B - p_A) / SE

SE = \sqrt{p(1-p) \times (1/n_A + 1/n_B)}

where:

p_A and p_B = click rates in each group
n_A and n_B = number of users in each group
p = pooled click rate, that is, total clicks in both groups divided by total users
SE = standard error, how much random variation you'd expect in the difference
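As a rough sketch, the test statistic can be computed directly from click counts. The function name `two_proportion_z` and the counts below are made up for illustration, and p is taken to be the pooled click rate across both groups, which is the usual choice but is an assumption here.

```python
import math


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z-score for the difference between two click rates (B minus A)."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled click rate: assumes both groups share one true rate under the null.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# Illustrative numbers: 4.0% vs 4.2% click rates with 10,000 users per group.
z = two_proportion_z(clicks_a=400, n_a=10_000, clicks_b=420, n_b=10_000)
print(f"z = {z:.2f}")
```

With these made-up counts the z-score comes out near 0.7, so a gap of 0.2 percentage points on 10,000 users per group is well within what chance alone produces.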

The standard error measures how noisy your data is. Larger sample sizes shrink it, in proportion to the square root of the sample size. With a million users per group and a click rate near 4%, a difference of roughly 0.1 percentage points is statistically detectable. With only 100 users per group, even a difference of several percentage points might just be noise.
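To get a feel for how sample size drives the noise level, the sketch below evaluates the standard error for a few group sizes. The 4% click rate and the group sizes are illustrative assumptions, and `standard_error` is a hypothetical helper for the case of two equally sized groups.

```python
import math


def standard_error(p: float, n_per_group: int) -> float:
    """Standard error of the difference for two equal-sized groups."""
    return math.sqrt(p * (1 - p) * (2 / n_per_group))


# Assumed click rate of 4%; group sizes chosen only for illustration.
for n in (100, 10_000, 1_000_000):
    se = standard_error(0.04, n)
    # 1.96 standard errors is the smallest gap that reaches significance.
    threshold = 1.96 * se
    print(f"n = {n:>9,}   SE = {se * 100:.4f} pp   1.96 x SE = {threshold * 100:.4f} pp")
```

Because the sample size sits under a square root, 100 times more users cuts the standard error by a factor of 10, not 100.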

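The z-score tells you how many standard errors separate your observed difference from zero. If the two versions truly performed identically, a z-score beyond 1.96 in either direction would turn up less than 5% of the time. That is the standard threshold for declaring a result "statistically significant." It does not mean there is a 95% chance that the new version is better; it means the data would be surprising if there were no difference at all.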
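The link between a z-score and the 5% threshold can be checked numerically. The sketch below assumes a two-sided test and a normal approximation. The helper names `two_sided_p_value` and `is_significant` are hypothetical, while `math.erfc` is the standard-library complementary error function.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Probability of a z-score at least this extreme if there is no real difference."""
    # For a standard normal variable Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def is_significant(z: float, alpha: float = 0.05) -> bool:
    """True when the p-value falls below the chosen threshold."""
    return two_sided_p_value(z) < alpha


for z in (0.5, 1.96, 3.0):
    print(f"z = {z:<4}  p = {two_sided_p_value(z):.4f}  significant: {is_significant(z)}")
```

A z-score of 1.96 lands almost exactly on a p-value of 0.05, which is where the familiar cutoff comes from.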

Back to Google's Blue

With millions of users clicking search results daily, Google could detect very small differences reliably. For each of the 41 shades, they computed the click rate and ran the hypothesis test against the current standard. Shades that passed the significance threshold advanced; others were eliminated. The winning shade wasn't the prettiest — it was the one that produced the highest click rate with statistical confidence that the result was real and not luck.

Sample Size Matters

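One of the most common mistakes in A/B testing is stopping too early. If you check the results after 50 users and see a big difference, it might be noise, because you haven't collected enough data yet. Checking repeatedly and stopping at the first significant result is just as risky: every extra look is another chance for noise to cross the threshold. The required sample size can be calculated before the experiment. It depends on the baseline rate, on how small a difference you want to be able to detect, and on how confident you want to be. Detecting a difference of 1 percentage point can take tens of thousands of users per group when the baseline rate is near 50%. For a difference of 5 percentage points, a few thousand suffice.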
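A back-of-the-envelope version of that calculation is sketched below, using a common normal-approximation formula for comparing two proportions. The function name `users_per_group`, the 5% significance level, and the 80% power are assumptions chosen for illustration, not values taken from the text.

```python
import math
from statistics import NormalDist


def users_per_group(p_base: float, min_diff: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to detect an absolute difference.

    p_base   : baseline rate in group A (e.g. 0.04 for 4%)
    min_diff : smallest absolute difference worth detecting (e.g. 0.01)
    """
    p_new = p_base + min_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold, about 1.96
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / min_diff ** 2)


# Illustrative scenarios: (baseline rate, absolute difference to detect).
for p_base, diff in ((0.04, 0.01), (0.50, 0.01), (0.50, 0.05)):
    n = users_per_group(p_base, diff)
    print(f"baseline {p_base:.0%}, difference {diff * 100:.0f} pp: {n:,} users per group")
```

Under these assumptions, halving the difference you want to detect roughly quadruples the required sample, because the difference appears squared in the denominator.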

Other Applications

A/B testing isn't limited to tech companies. Clinical trials use the same statistical framework to test whether a new drug outperforms a placebo — with the stakes being patient lives rather than click rates. Educators test whether a new teaching method improves test scores. Politicians test which campaign message drives more voter registration. The math is identical; only the stakes change.

Conclusion

A/B testing turns intuition into evidence. Instead of arguing about which shade of blue looks better, you measure which one works better — and statistics tells you whether the measured difference is real. The formula is straightforward: compute a z-score, compare it to a threshold, and declare a winner only when the data justifies it. It's how the modern world makes decisions at scale, one carefully measured experiment at a time.