Simple Sequential A/B Testing
By Evan Miller
October 13, 2015
Stopping an A/B test early because the results are statistically significant is usually a bad idea. In this post, I will describe a simple procedure for analyzing data in a continuous fashion via sequential sampling. Sequential sampling allows the experimenter to stop the trial early if the treatment appears to be a winner; it therefore addresses the "peeking" problem associated with eager experimenters who use (abuse) traditional fixed-sample methods.
The sequential procedure works like this:
At the beginning of the experiment, choose a sample size \(N\).
Assign subjects randomly to the treatment and control, with 50% probability each.
Track the number of incoming successes from the treatment group. Call this number \(T\).
Track the number of incoming successes from the control group. Call this number \(C\).
If \(T-C\) reaches \(2\sqrt{N}\), stop the test. Declare the treatment to be the winner.
If \(T+C\) reaches \(N\), stop the test. Declare no winner.
The above procedure can, in some circumstances, reduce the number of observations required for a successful experiment by 50% or more. The procedure works extremely well with low conversion rates (that is, with small Bernoulli parameters). With high conversion rates, it works less well, but in those cases traditional fixed-sample methods should do you just fine.
There are other solutions to the sequential-testing problem out there, but they often involve complicated equations and questionable assumptions. The test presented here is simple to understand, easy to implement, and assumes nothing about the possible distribution of treatment effects. Notably, this test completely ignores the number of failures in each group, which makes it significantly easier to implement in low-conversion settings. Because \(T-C\) and \(T+C\) are trivial to compute and understand, the test can be carried out by practically anyone.
The procedure is inspired by a post by Ben Tilly called A/B Testing Rigorously (without losing your job). Tilly's article, while somewhat lacking in mathematical rigor, contains an important insight into the problem of sequential testing, which I'll use to formulate a slightly different, and much simpler, test of statistical significance.
A Random Walk
The key insight in Ben Tilly's article is that if users are randomly assigned to two groups, and the two groups have the same conversion rate, then the sequence of successes from the two groups is mathematically equivalent to a series of random coin flips.
At any point in time, we can construct a variable \(d\) that represents the number of "heads" (that is, successes from the treatment) minus the number of "tails" (that is, successes from the control). As observations accumulate, the random variable \(d\) is described by a simple, one-dimensional random walk. If the treatment has the same conversion rate as the control, the random walk will be symmetrical; because the next success is as likely to come from the treatment as from the control, after each success \(d\) is just as likely to increase as it is to decrease. But if the conversion rates are not equal, the random walk will tend to be biased in one direction or the other.
Random walks have been studied extensively. There is a famous problem in the theory of random walks called the gambler's ruin, which happens to map perfectly to the sequential A/B testing problem. In the next section, we'll see how the solution to the gambler's ruin problem will form the basis of a frequentist A/B hypothesis test.^{1}
The Gambler's Ruin
Imagine a gambler with \(d\) tokens flipping a coin against an adversary. If the gambler correctly guesses heads or tails, the gambler receives a token from the adversary; if the gambler is incorrect, he loses a token. When the gambler (or the adversary) runs out of tokens, the game is over. (That is, one of them is "ruined," and presumably is forced to spend the night in the gutter.)
Suppose first that the coin being flipped is fair. If the adversary has an infinite number of tokens to gamble, the probability that a protagonist starting with \(d\) tokens will be ruined after exactly \(n\) coin flips is given by^{2}:
\[ r_{n,d} = \frac{d}{n} {n \choose (n+d)/2}2^{-n} \]The probability that the gambler will be ruined in the first \(N\) coin flips is given by a summation over \(r_{n,d}\):
\[ R_{N,d} = \sum_{n=1}^N r_{n,d} = \sum_{n=1}^N \frac{d}{n} {n \choose (n+d)/2}2^{-n} \]What does our poor gambler have to do with A/B testing? Recall the quantity \(T-C\) in the introduction, and notice that it increases by one with each success from the treatment group, and decreases by one with each success from the control group. If that sounds a lot like a pile of poker chips to you, then let \(d=T-C\). Now, starting at \(d=0\), the probability that \(d\) will drift up to some value \(d^*\) is exactly identical to the probability that the gambler starting with \(d^*\) tokens will be reduced to zero tokens. Therefore the above equation — the cumulative probability of ruin over time — can be used to form a one-sided, sequential test of statistical significance for an A/B test. For a given \(N\) and significance level \(\alpha\), choose the smallest positive value of \(d^*\) that satisfies:
\[ \sum_{n=1}^N \frac{d^*}{n} {n \choose (n+d^*)/2}2^{-n} < \alpha \]If the value \(d\) attains the critical value of \(d^*\) in \(N\) or fewer successes (coin flips), then we reject the null hypothesis that the treatment is no better than the control. If the critical value has not been reached, then we accept the null hypothesis.
Critical values of \(d^*\) are well-approximated by^{3}:
\[ d^* \approx z_{\alpha/2}\sqrt{N} \]Where \(z_{\alpha/2}\) is the point on a normal distribution with a CDF of (\(1-\alpha/2\)). For \(\alpha=0.05\), \(z_{\alpha/2} = 1.96\). Rounding \(z\) up to 2, a conservative rule of thumb is then \(d^* \approx 2\sqrt{N}\).
If you're skeptical of the above equation, a simple simulation script will confirm it. Here's one in Julia:
N = 1000 # anything >200 works great stopped = 0 # null hypothesis rejections for i=1:10000 # number of simulations d = 0 for n=1:N d += rand([-1, 1]) if d > 1.96*sqrt(N) stopped += 1 break end end end println("Type I Error: ", stopped / 10000)
That should print a number in the ballpark of 0.05.
The procedure can be stopped early in the sense that once \(d^*\) is attained, the test is over. However, it should not be stopped prematurely — i.e., before \(d^*\) is attained and before \(N\) successes have been recorded. Similarly, the test should not be extended past \(N\) successes, or else the statistical inferences will be invalid. As we will see in the following sections, the procedure can be used to reduce the number of observations required to complete a test, but the responsible experimenter will still abide by the basic principles of hypothesis testing.
The power equation
The value of \(N\) should be chosen in advance in order to accommodate a desired amount of statistical power. Suppose we want to detect a given amount of "lift" \(\delta\), where a 10% increase in conversions corresponds to \(\delta = 0.10\). Under this alternative hypothesis, the fraction of incoming successes from the control will be equal to:
\[ p = \frac{p_c}{p_c+p_t} = \frac{1}{1+p_t/p_c} = \frac{1}{1+(1+\delta)} = \frac{1}{2+\delta} \]And the fraction of incoming successes from the treatment group will be equal to:
\[ 1 - p = 1 - \frac{1}{2+\delta} = \frac{1+\delta}{2+\delta} \]The equations for the gambler's ruin with an unfair coin will be of service here. If the coin lands in the gambler's favor \(p < \frac{1}{2}\) of the time, and the adversary's favor \(1 - p > \frac{1}{2}\) of the time, the probability of ruin after exactly \(n\) coin flips is given by^{4}:
\[ r_{n,d} = \frac{d}{n}{n \choose (n+d)/2}p^{(n-d)/2} (1-p)^{(n+d)/2} \]The probability of being ruined in the first \(N\) flips is then a sum over \(r_{n,d}\):
\[ R_{N,d} = \sum_{n=1}^N r_{n,d} = \sum_{n=1}^N \frac{d}{n}{n \choose (n+d)/2}p^{(n-d)/2} (1-p)^{(n+d)/2} \]Equating successes from the treatment group to gains made by the adversary, we are now equipped to find the optimal sample size for a given significance level \(\alpha\) and statistical power \(1-\beta\). We simply choose the smallest positive values of \(N\) and \(d^*\) such that the probability of hitting \(d^*\) with \(\delta = 0\) is less than \(\alpha\), and the probability of hitting \(d^*\) with \(\delta > 0\) is at least \(1-\beta\).
In mathematical terms, \(N\) and \(d^*\) must satisfy:
\[ \sum_{n=1}^N \frac{d^*}{n}{n \choose (n+d^*)/2}\left(\frac{1}{2+\delta}\right)^{(n-d^*)/2} \left(\frac{1+\delta}{2+\delta}\right)^{(n+d^*)/2} > 1 - \beta \\ \sum_{n=1}^N \frac{d^*}{n} {n \choose (n+d^*)/2}2^{-n} < \alpha \]If you want to solve these equations at home, an integer grid search starting with \(N=1\) will arrive at optimal values in \(O(N^2)\) time; a binary-search algorithm over \(d^*\) will converge in \(O(N\ln \sqrt{N})\).
Or, you can just use the online calculator, which solves the equations in JavaScript.
Sample Size Tables
Below is a table with a few calculated values of optimal \(N\) and \(d^*\) for given significance levels, power levels, and lifts. The column \(E[N|H_A]\) is the expected number of conversions required to complete the test under the alternative hypothesis — you can see that under the alternative, the procedure will terminate up to 17% earlier than under the null hypothesis using standard power and significance levels. The \(N_{Z}\) column is the comparable number of total conversions expected under a fixed-sample, one-sided \(Z\)-test with a baseline conversion rate of 1%. (If \(N_{Z}\) is the total number of successes and failures, \(N_{Z}/100\) is the expected number of successes under the null.)
Finally, the "Savings" column represents the percent reduction in sample size when comparing the sequential test under the alternative to the fixed-sample test. Algebraically, it is \(1-E[N|H_A]/(N_{Z}/100)\).
Significance \(\alpha\) | Power \(1-\beta\) | Lift | \(N\) | \(d^*\) | \(E[N|H_A]\) | \(N_{Z}/100\) | "Savings" |
---|---|---|---|---|---|---|---|
0.05 | 0.8 | 50% | 170 | 26 | 120 | 106 | −13.9% |
20% | 808 | 56 | 569 | 632 | 10.0% | ||
10% | 2,922 | 106 | 2,055 | 2,489 | 17.4% | ||
0.05 | 0.6 | 50% | 100 | 20 | 82 | 59 | −38.5% |
20% | 475 | 43 | 384 | 361 | −6.3% | ||
10% | 1,732 | 82 | 1,398 | 1,436 | 2.7% | ||
0.01 | 0.8 | 50% | 262 | 42 | 198 | 169 | −17.4% |
20% | 1,272 | 92 | 956 | 1,019 | 6.2% | ||
10% | 4,644 | 176 | 3,491 | 4,026 | 13.3% |
Explaining to a lay-person how to run a test using the above table is relatively straightforward. An interpretation of the first row might go as follows:
Suppose we wish to run a test at the 5% significance level. (That is, if the treatment is the same as the control, we will incorrectly choose the treatment 5% of the time.) Suppose that we wish to detect a 50% lift at least 80% of the time. According to the table, we should plan to collect at most 170 successes. If the number of successes from the treatment ever exceeds the number of successes from the control by 26, we immediately stop the test and declare the treatment the winner. Otherwise, we declare the control the winner.
If the true lift was 50%, we will on average record 120 total successes before stopping the test. In that case, the total number of observations needed to complete the sequential test will be about 13.9% larger than the number needed to complete the fixed-sample test.
The sequential-test figures compare favorably to the fixed-sample numbers in some circumstances (emphasized in green) and less favorably in other circumstances. In all of the above scenarios, the worst-case sequential sample size \(N\) is actually larger than the fixed sample size \(N_Z\); that is, in the majority of tests where no effect is present, the sequential test will arrive at conclusion more slowly than its fixed-sample counterpart.
But the sequential test performs much better when an effect is present. It gives its strongest performance when detecting small lifts. For example, when the sample size has been chosen to detect a 10% lift, the expected "savings" under the alternative hypothesis are as high as 17.4%. Even when detecting 20% lifts, the expected savings with \(\beta=0.20\) are still noticeable, on the order of 10%; but when detecting a very high (50%) lift, any savings are essentially wiped out, and it ends up requiring more observations than a fixed-sample test (up to 38% more, in the case of low power).
Blockbusters
The sequential test comes into its own when the true effect is larger than the minimum detectable effect — i.e., when the treatment is a "blockbuster". The below table presents the savings when the true effect is twice the value of the alternative-hypothesis \(\delta_{A}\) that was chosen during the sample-size calculation. The "Savings" column is again relative to the sample required for a fixed-length test.
Significance \(\alpha\) | Power \(1-\beta\) | Lift | \(N\) | \(E[N|\delta=2\delta_{A}]\) | \(N_Z/100\) | "Savings" |
---|---|---|---|---|---|---|
0.05 | 0.8 | 50% | 170 | 78 | 106 | 26.3% |
20% | 808 | 336 | 632 | 46.9% | ||
10% | 2,922 | 1,166 | 2,489 | 53.2% | ||
0.05 | 0.6 | 50% | 100 | 59 | 59 | 0.0% |
20% | 475 | 256 | 361 | 29.2% | ||
10% | 1,732 | 897 | 1,436 | 37.6% | ||
0.01 | 0.8 | 50% | 262 | 126 | 169 | 25.4% |
20% | 1,272 | 552 | 1,019 | 45.8% | ||
10% | 4,644 | 1,936 | 4,026 | 51.9% |
The savings with a blockbuster treatment are nearly universal, on the order of 25%–50% (the exception being a high-lift, low-power test, which probably isn't a well-designed experiment anyway). The value of the sequential test now comes into greater focus: although a sequential test generally requires more observations than the equivalent fixed-sample test, it provides a kind of "early alert" for high-performing treatments. In settings where identifying a high-performing treatment is more urgent than reducing experiment times for low-performing treatments, the sequential test represents a compelling option.
Varying \(p\)
The savings in the above tables are calculated assuming a 1% baseline conversion rate. The sequential test performs less efficiently at higher conversion rates, as demonstrated in the table below. The difference is due in part to the fact that the sequential test ignores the total number of failures that occurred in each group, which the traditional \(Z\) test makes use of.
The figures in the table below assume \(\alpha=0.05\), \(\beta=0.20\), and \(\delta=0.20\). For all rows, \(N=808\), and the expected number of conversions in the fixed-sample test is given by \(N_Z\times p\) (corresponding to \(N_Z/100\) in the previous tables). Savings are presented relative to the alternative and blockbuster hypotheses.
Conversion rate \(p\) | \(N_Z\times p\) | \(E[N|H_A]\) | Savings | \(E[N|\delta=2\delta_{A}]\) | Savings |
---|---|---|---|---|---|
1% | 632 | 569 | 10.0% | 336 | 46.9% |
2% | 626 | 569 | 9.0% | 336 | 46.3% |
5% | 606 | 569 | 6.0% | 336 | 44.5% |
10% | 573 | 569 | 0.6% | 336 | 41.3% |
20% | 506 | 569 | −12.4% | 336 | 33.6% |
30% | 440 | 569 | −29.4% | 336 | 23.6% |
40% | 373 | 569 | −52.4% | 336 | 10.1% |
50% | 307 | 569 | −85.4% | 336 | −9.4% |
The sequential test performs quite respectably under the alternative hypothesis, creating positive savings when the baseline conversion rates is 10% or less. Under the blockbuster hypothesis, it creates savings in excess of 40% for conversion rates of 10% or less, and still creates positive savings except when the conversion rate begins to approach 50%. (Note that above 50%, the sequential method can be inverted to count failures rather than successes; then the complement of the conversion rate applies.)
The above table naturally begs the question of whether a set of cutoff values exist for deciding between a sequential and fixed-sample test. A final table calculates the minimum detectable effect (or MDE, here being the lift under the alternative) that will result in the same number of expected conversions under both the sequential and fixed-sample tests. In terms of expected sample size, the advantage will tend to tip in favor of the sequential test for detecting lower lifts, and in favor of the fixed-sample test when detecting larger lifts. The figures are again calculated assuming \(\alpha=0.05\) and \(\beta=0.20\).
Conversion rate \(p\) | Largest MDE with positive savings | \(N_Z\times p\) = \(E[N|H_A]\) |
---|---|---|
1% | 35% | 211 |
2% | 33.8% | 224 |
5% | 29.7% | 279 |
10% | 21.5% | 495 |
15% | 14.0% | 1,083 |
20% | 6.7% | 4,443 |
The first line of the table may be read as follows: If the baseline conversion rate is 1%, an experiment designed to detect a lift of no less than 35% will on average require the same number of conversions (211) with a sequential design as with a fixed-sample design, assuming the true lift is exactly the MDE. An experiment designed to detect a smaller lift will require fewer observations using a sequential test; an experiment designed to detect a larger lift will on average require fewer observations using a fixed-sample test.
Based on the calculated figures, a good rule of thumb for choosing a sequential or fixed-sample test given a baseline conversion rate \(p\) and the MDE \(\delta\) is to compute the quantity \(1.5p + \delta\). If that number is less than 36%, the sequential test will tend to offer positive savings under the alternative hypothesis; if it is higher than 36%, the fixed-sample test will tend to perform better.
Two-Sided Testing
The above procedure is a one-sided test; it looks for a positive lift in the treatment, but it won't stop the test early if the lift is negative.
Technically, adapting the test to be two-sided is not quite so simple as cutting \(\alpha\) in half and comparing the absolute value of \(d\) to \(d^*\). A conceptual trouble arises. Suppose we make a test that is stopped when \(d\) reaches either \(d^*\) or \(-d^*\). Such a test would, by some kind of naive symmetry argument, state that the probability of arriving at \(d^*\) or \(-d^*\) at a given moment must be twice that of arriving at \(d^*\) alone.
That's almost true, but not quite. We need to condition the arrival at \(d^*\) on not having first arrived at \(-d^*\), which would have stopped the test. Likewise, we need to condition the arrival at \(-d^*\) on not having arrived at \(d^*\) first.
In practice, \(d^*\) will be large enough that the probability of visiting \(-d^*\) and then traversing the \(2d^*\) ravine back up to \(d^*\) in order to be one of the first \(\alpha\) arrivals at \(d^*\) will be quite low, somewhat akin to winning a race from New York to L.A. by first visiting London. (If the probability of traversing \(d^*\) in \(N\) steps is \(\alpha\), the probability of traversing \(3d^*\) in \(N\) steps is roughly \(\alpha^3\).)
So for purposes of hypothesis testing, we will in fact assume that the probability of reaching \(-d^*\) or \(d^*\) is twice the probability of reaching \(d^*\) alone. Replacing \(\alpha\) with \(\alpha/2\) in the equations, a good approximation for \(d^*\) in a two-sided test becomes:
\[ d^* \approx z_{\alpha/4} \sqrt{N} \]If \(|T-C|\) reaches this number before \(T+C\) reaches N, the null hypothesis (that treatment equals control) is rejected. At the standard significance level of \(\alpha = 0.05\), a good rule of thumb is then \(d^* \approx 2.25 \sqrt{N}\).
A modification of the previous simulation script will confirm the approximation:
N = 1000 # anything >200 works great stopped = 0 # null hypothesis rejections for i=1:10000 # number of simulations d = 0 for n=1:N d += rand([-1, 1]) if d > 2.24*sqrt(N) || d < -2.24*sqrt(N) stopped += 1 break end end end println("Type I Error: ", stopped / 10000)
The above code will consistently produce answers in the neighborhood of 0.05, as expected.
Likewise in the power calculations, a reasonable approximation will be to replace \(\alpha\) by \(\alpha/2\) in the equations. (If you're using the one-sided calculator, simply halve the desired significance level.)^{5}
From Ruin to Profit
The sequential procedure presented here excels at quickly identifying small lifts in low-conversion settings — and allows winning tests to be stopped early — without committing the statistical sin of repeated significance testing. It requires a tradeoff in the sense that a no-effect test will take longer to complete than it would under a fixed-sample regime, but this tradeoff may be justified if identifying a winning treatment quickly is more important than reducing the average duration of tests. For organizations that wish to learn from their successful treatments and iterate on them rapidly, the advantages of sequential methods should be obvious.
Perhaps the greatest virtue of the gambler's-ruin method described here is that it's easy to try out because it's easy to implement. To remind you how simple the test is, let me repeat it in its napkin-sized entirety:
At the begining of the experiment, choose a sample size \(N\).
Assign subjects randomly to the treatment and control, with 50% probability each.
Track the number of incoming successes from the treatment group. Call this number \(T\).
Track the number of incoming successes from the control group. Call this number \(C\).
If \(T-C\) reaches \(2\sqrt{N}\), stop the test. Declare the treatment to be the winner.
If \(T+C\) reaches \(N\), stop the test. Declare no winner.
The two-sided test is essentially the same, but with an alternate ending:
If \(T-C\) reaches \(2.25\sqrt{N}\), stop the test. Declare the treatment to be the winner.
If \(C-T\) reaches \(2.25\sqrt{N}\), stop the test. Declare the control to be the winner.
If \(T+C\) reaches \(N\), stop the test. Declare no winner.
You'll notice that there is no alchemy required to produce a test statistic, or fudge factors invoked to evaluate it. The test is completely defined by the values \(N\) and \(d^*\). With those two numbers in hand — received either from a table, or a calculator — if you can count to a hundred and back, you have the intellectual tools necessary to begin reaping the benefits of sequential sampling.
Notes
The gambler's ruin also happens to be the mathematical basis of the Sequential Probability Ratio Test (SPRT) originally developed by Abraham Wald. An important difference between SPRT and the test described here is that SPRT requires an explicit alternative hypothesis (or an explicit weighting function over a set of alternative hypotheses) in order to be calculated.
In the world of online A/B testing, Optimizely's "New Stats Engine" utilizes SPRT; as such, it has a free parameter \(\tau\) which needs to be estimated from previous experiments. The present test has no such free parameter, and can be implemented with just the equations.
Actually, \(r_{n,d}=0\) when \(n+d\) is an odd number. You can find the details in Chapter 14 of Feller's Introduction to Probability book.
The proof is in Chapter 3 Section 7 of Feller. (Equation 7.7 / Theorem 3)
To reiterate, \(r_{n,d}=0\) when \(n+d\) is an odd number. This is an important point if you're trying to understand the math, which is why I'm putting two footnotes about it.
An exact form of the two-sided gambler's ruin equation appears at the end of Chapter 14 Section 5 in Feller. If you start the gambler's adversary with the same number of tokens as the gambler (\(a=2z\)), you have the basis for a two-sided test. The equation as presented involves large exponents, and I had trouble implementing it, so I am not sure how close the given approximation is to the exact solution.
References
Abraham Wald, Sequential Analysis (1947) (Amazon)
William Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley, 3rd edition (1968) (Amazon)
Further reading
A/B Testing Rigorously (without losing your job) by Ben Tilly
How Not To Run An A/B Test and The Low Base Rate Problem by me :-)
The New Stats Engine (at Optimizely) by Leo Pekelis, David Walsh, and Ramesh Johari
Rapid A/B Testing with Sequential Analysis and Estimation in Sequential Analysis by Audun M. Ã˜ygard
Is Bayesian A/B Testing Immune to Peeking? Not Exactly by David Robinson
Bayesian A/B Testing at VWO by Chris Stucchio
You're reading evanmiller.org, a random collection of math, tech, and musings. If you enjoyed this, you might also like:
If you run A/B tests frequently, you should check out Evan's Awesome A/B Tools:Sample Size Calculator |
Chi-Squared Test |
Two-Sample T-Test |
Finally, if you own a Mac, my desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!
Back to Evan Miller's home page – Follow on Twitter – Subscribe to RSS