The Low Base Rate Problem
By Evan Miller
June 2, 2014
If you’re running A/B tests on binary outcomes, and your conversion rate is in the single digits, there’s a good chance you’re wasting your time.
Let’s talk about power
Almost all testing frameworks report statistical significance, but very few talk about power. Significance is the probability of seeing an effect where no effect exists. Power is the flip side of the coin; it’s the probability of seeing an effect where an effect actually exists. Here’s a table summarizing the two concepts:
|Reality||Test says||Chance||Ratio|
|Effect exists||Effect exists||A||A / (A + B) is power|
|Effect exists||No effect||B|
|No effect||Effect exists||C||C / (C + D) is significance|
|No effect||No effect||D|
To ignore power is to ignore the top half of the table, that is, the world where an effect exists.
Power only exists in relation to the size of the effect that you wish to detect. The power level is typically set to 80%. That is, by setting power to 80% for some effect size (say, “10% increase in sales”), you ensure that if the experiment increases sales by exactly 10%, the test will detect the change 80% of the time. Power is important because without enough power, a statistical test will almost always come back with a null result, regardless of the significance level of your test.
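To see what low power looks like in practice, here is a small simulation sketch. The rates (10% versus 11%), the 1,000 subjects per branch, and the function names are illustrative assumptions, and the pooled two-proportion z-test is one common choice of test, not necessarily the one your framework uses.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test, n subjects per branch."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # identical all-or-nothing branches: no evidence of a difference
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(p_control, p_treatment, n, trials=400, alpha=0.05, seed=1):
    """Fraction of simulated experiments that reach significance when the effect is real."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        # Each subject converts independently with the branch's true rate.
        conv_a = sum(rng.random() < p_control for _ in range(n))
        conv_b = sum(rng.random() < p_treatment for _ in range(n))
        if two_sided_p_value(conv_a, conv_b, n) < alpha:
            detected += 1
    return detected / trials


# A real 10% relative lift (10% -> 11%) tested with 1,000 subjects per branch.
print(simulated_power(0.10, 0.11, n=1000))
```

Under these assumptions the test should flag the real improvement in only a small minority of runs, which is the sense in which an underpowered test mostly comes back null.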
You can find the handy power equations in How Not To Run an A/B Test. I won’t frighten the reader by reprinting the equations, but I want to discuss an aspect of the equations that is devastating to small conversion rates.
The “effect size” in the equations is in absolute terms, but most investigators are interested in knowing relative effects. Let’s see what this means for detecting a 10% relative change:
|Old rate||New rate||# Subjects needed|
That is, detecting a change off of a 1% baseline ends up requiring about a hundred times as many observations as detecting the same relative change off of a 50% baseline.
To put things in perspective, detecting a relative change of 10% off of a 1% base rate requires a total subject pool of over 300,000, larger than the population of Pittsburgh. Even with the minimum detectable effect dialed all the way up to 50%, an epoch-making figure to anyone in marketing, an experiment with a 1% base rate would still need about 6,000 observations in each branch.
The situation is even more depressing at lower conversion rates. Here’s what happens if you’re only converting one visitor in a thousand:
|Old rate||New rate||# Subjects needed|
In other words, each branch of the experiment would need a subject pool about the size of Gabon.
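The scaling described above can be reproduced with a short sketch. It assumes a two-sided test at 5% significance and 80% power, and it uses a standard textbook normal-approximation formula for comparing two proportions. That formula, the function name `subjects_per_branch`, and the chosen base rates are my assumptions here, so the output may differ somewhat from the figures quoted in this article or from any particular calculator.

```python
import math
from statistics import NormalDist


def subjects_per_branch(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate subjects needed in EACH branch to detect a relative lift."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    delta = p2 - p1  # the absolute effect size is what drives the sample size
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / delta ** 2)


# Same 10% relative lift, very different absolute lifts.
for base_rate in (0.50, 0.10, 0.01, 0.001):
    n = subjects_per_branch(base_rate, relative_lift=0.10)
    print(f"{base_rate:.1%} -> {base_rate * 1.10:.2%}: {n:,} per branch")
```

The relative lift is held fixed at 10%, so the absolute difference shrinks along with the base rate, and the required sample grows roughly in inverse proportion to the base rate.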
Calculate power in advance…
If you’ve ever wondered why so many of your A/B experiments are coming back negative, it could be that you simply don’t have enough subjects to conduct proper tests. Before concluding that a treatment had no effect, it’s essential to calculate how much power the statistical test actually had to begin with. Otherwise you run the risk of rejecting good changes without giving them a fair trial, or worse, concluding that the A/B methodology is somehow mystical or unsound.
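As a rough way to check how much power a finished test had, the sketch below approximates the power of a two-sided two-proportion z-test for a given number of subjects per branch. The normal approximation, the function name `approximate_power`, and the example inputs (a 1% base rate, a 10% relative lift, 10,000 subjects per branch) are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def approximate_power(base_rate, relative_lift, n_per_branch, alpha=0.05):
    """Approximate chance of detecting the given relative lift with n subjects per branch."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    # Standard error assuming no effect (sets the rejection threshold)...
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_branch)
    # ...and assuming the effect is real (governs where the estimate lands).
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_branch)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf((delta - z_alpha * se_null) / se_alt)


# A 1% base rate and a 10% relative lift, with 10,000 subjects per branch.
print(f"{approximate_power(0.01, 0.10, 10_000):.0%}")
```

If the number that comes out is far below 80%, a negative result says little about whether the treatment worked.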
To avoid a faulty conclusion, it is imperative to decide in advance the size of effect that you wish to detect. That number will determine the number of subjects needed before the test actually starts. Don’t try to guess what this number is and simply hope for the best. There’s a square root in the equation which makes intuition difficult. I recommend using my Sample Size Calculator, or working with the equations in my previous article directly.
If the required number of subjects is prohibitively large, you might try to redesign the experiment so that the baseline conversion rate is larger. I describe this issue in detail in Lazy Assignment and A/B Testing.
…or else step away from the vehicle
Because of the potential for erroneous conclusions and bad decision-making, it is not an exaggeration to say that anyone who lacks a firm understanding of statistical power should not be designing or interpreting A/B tests. This proscription may sound extreme, but designing a test without enough power is like designing a car without enough brakes.
So the next time you are presented with a negative test result, ask the experimenter: How much power did this test have? Did the test have enough subjects to reach a meaningful conclusion? Should the experiment have been run in the first place?