Lazy Assignment and A/B Testing
By Evan Miller
May 19, 2013 (changes)
If you're running an A/B test on one part of a sales funnel, here's a trick you might be able to use to reduce the number of subjects needed by half — or more.
Once Upon a Time in a Shopping Cart
Let's say you have a multi-step sales funnel. About 10% of users reach the checkout page, and of those, about half actually complete the sale. (So on average 5% of all users complete the sale.)
The design team has been hard at work on a new checkout button and wants to run an A/B test to determine its effect on sales. One half of subjects in the test will see the new button, and the other half will see the old button. Like a good student of A/B testing, you tell the experimenters to determine the number of subjects needed for the test in advance.
Here's a question: do you assign a new visitor to a branch of the experiment as soon as she walks in the door, or do you perform the assignment only when she reaches the checkout page?
Stop and think about it for a second. I'll wait.
Now you're probably thinking: Half of the visitors will see the new button, and half will see the old button. The time when they're assigned to a treatment can't possibly influence the number of subjects needed for the experiment.
Right?
The Numbers
Well, let's run the numbers, and assume we want to test for a 10% sales lift at standard power and significance levels. If we assign treatment as visitors walk in the door, we would want to test for a change from 5% to 5.5%. But if we assign treatment at the checkout page — which only 10% of users reach — we would instead want to test for a change from 50% to 55%.
Using a sample size calculator, we can determine the number of subjects required for each version of the experiment:
Experiment | Old rate | New rate | # test subjects per branch |
---|---|---|---|
Assign at beginning | 5% | 5.5% | 30,244 |
Assign at checkout | 50% | 55% | 1,567 |
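For readers who want to reproduce figures like these without a web calculator, here is a minimal sketch using a common textbook approximation for a two-sided, two-proportion test. The 5% significance level and 80% power are assumptions (the article only says "standard" levels), and the function name `subjects_per_branch` is made up for illustration; it is not necessarily the exact formula the calculator uses.

```python
from math import sqrt
from statistics import NormalDist


def subjects_per_branch(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate subjects per branch for a two-sided, two-proportion test.

    alpha and power are assumed defaults, not values stated in the article.
    """
    delta = base_rate * relative_lift      # absolute minimum detectable effect
    new_rate = base_rate + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Spread of the difference in rates under "no effect" and under "effect".
    sd_null = sqrt(2 * base_rate * (1 - base_rate))
    sd_alt = sqrt(base_rate * (1 - base_rate) + new_rate * (1 - new_rate))
    return round((z_alpha * sd_null + z_beta * sd_alt) ** 2 / delta ** 2)


at_beginning = subjects_per_branch(0.05, 0.10)   # 5% -> 5.5%
at_checkout = subjects_per_branch(0.50, 0.10)    # 50% -> 55%
print(at_beginning, at_checkout)
```

Under these assumptions the two results should land very close to the 30,244 and 1,567 shown in the table.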
Now let's work backwards to figure out the total number of visitors that need to walk in the door in order to complete each test. In the first case, every visitor becomes a test subject, so the number is two times the number needed in each branch: 60,488. In the second case, we only need 1,567 subjects per branch at checkout; since only 10% of visitors reach checkout, on average we need 31,340 (= 2 × 15,670) visitors to walk in the door. That is, by assigning treatment at checkout, we need only about half as many visitors as when assigning treatment at the beginning. Let's update that table:
Experiment | Old rate | New rate | # test subjects per branch | Total # visitors needed |
---|---|---|---|---|
Assign at beginning | 5% | 5.5% | 30,244 | 60,488 |
Assign at checkout | 50% | 55% | 1,567 | 31,340 |
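The "working backwards" arithmetic can be sketched in a few lines. The helper name `visitors_needed` and its parameters are illustrative, not from the article; the sketch assumes two branches and a fixed fraction of visitors reaching the point of assignment.

```python
def visitors_needed(subjects_per_branch, reach_rate, branches=2):
    """Expected visitors at the door so that enough of them become subjects."""
    return branches * subjects_per_branch / reach_rate


eager = visitors_needed(30_244, reach_rate=1.0)    # everyone is a subject
lazy = visitors_needed(1_567, reach_rate=0.10)     # only 10% reach checkout
print(round(eager), round(lazy), round(eager / lazy, 2))
```

The last printed number is the ratio of visitors required by the two designs.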
Huh? How can the number of visitors required for a successful experiment depend on when we assign treatment?
Dead Weight
There's a problem with our initial reasoning. (But the math is OK!) It's not true that half of all visitors see the new checkout button. Half of visitors who get to the checkout page will see the new checkout button. In other words, the first experiment is loaded up with "dead weight" in the form of visitors who are assigned to a group, never exposed to the experiment condition, and counted as a failure.
From the perspective of the experiment, these "failures" are indistinguishable from visitors who made it to the checkout page and then failed to convert. That is, the "signal" from users exposed to the experiment is muddied up by "noise" from users who are never exposed to the experiment. More test subjects are therefore required in order to extract meaningful information. In the example above, about twice as many (actually, about 1.9 times as many) are required.
Think of it another way: you wouldn't randomly assign the 7 billion people in the world who never visited your website to an experiment branch and count them all as "failed observations," would you? So why include the people who visited the website but never reached the experiment?
In practical terms, designers of A/B tests should not assign treatment for all experiments as soon as visitors show up to the store.
Instead, they should practice what I call lazy assignment: assigning visitors to experiments only when it is certain that they will actually be exposed to the experiment condition.
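As a rough illustration of what lazy assignment might look like in application code, here is a minimal sketch. Everything in it is hypothetical: the handler names, the in-memory `assignments` dictionary, the experiment label, and the hash-based coin flip are stand-ins for whatever your own stack uses.

```python
import hashlib

assignments = {}  # user_id -> branch; only exposed users ever appear here


def assign_branch(user_id, experiment="checkout-button"):
    """Deterministically pick a branch from a hash of experiment and user."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "new" if int(digest, 16) % 2 else "old"


def on_landing_page(user_id):
    """Hypothetical arrival handler: no experiment bookkeeping happens here."""
    return None


def on_checkout_page(user_id):
    """Hypothetical checkout handler: assignment happens on first exposure."""
    if user_id not in assignments:
        assignments[user_id] = assign_branch(user_id)
    return assignments[user_id]


for visitor in ["u1", "u2", "u3"]:
    on_landing_page(visitor)
on_checkout_page("u2")      # only u2 reaches checkout
print(assignments)          # u1 and u3 never enter the experiment
```

The point of the design is simply that visitors who never see the button never show up in the experiment's data.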
How Much More Efficient Is Lazy Assignment?
The amount of efficiency gained from lazy assignment depends on how well subjects can be isolated in an experiment prior to conversion. The efficiency gain — the factor by which the required number of subjects is reduced — has a simple formula:
\[ E = \frac{1-p_1}{1-p_2} \]
Here \(p_1\) is the initial probability, that is, the overall conversion rate from the moment a visitor walks in the door. The variable \(p_2\) is the conversion rate from the point at which the experiment actually applies. (See the Mathematical Appendix for a derivation of this formula.)
So instead of running through a complicated calculation as in The Numbers section above, we can compute the efficiency gain directly from the efficiency formula. Here is the result of plugging in numbers from the shopping cart example, where the initial conversion rate was 5% (\(p_1=0.05\)) and the final conversion rate was 50% (\(p_2=0.5\)):
\[ E = \frac{1-0.05}{1-0.5} = 1.9 \]
Below is a table with more example values of \(p_1\) and \(p_2\) and the corresponding efficiency gain. To read it, find the row corresponding to the initial probability \(p_1\), and the column corresponding to the isolated probability \(p_2\).
Efficiency gains from lazy assignment

\(p_1\) (rows) / \(p_2\) (columns) | 80% | 50% | 20% | 10% | 5% |
---|---|---|---|---|---|
50% | 2.5 | | | | |
20% | 4.0 | 1.6 | | | |
10% | 4.5 | 1.8 | 1.13 | | |
5% | 4.75 | 1.9 | 1.19 | 1.06 | |
1% | 4.95 | 1.98 | 1.24 | 1.1 | 1.04 |
Low | 5.0 | 2.0 | 1.25 | 1.11 | 1.05 |
You can verify these numbers by playing around with my Sample Size Calculator, or by setting up the efficiency formula in Excel.
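If a spreadsheet isn't handy, the same check can be sketched in Python. The function name `efficiency_gain` and the row and column lists are illustrative choices; the "Low" row is approximated here by setting \(p_1\) to zero.

```python
def efficiency_gain(p1, p2):
    """E = (1 - p1) / (1 - p2): the factor by which visitors are reduced."""
    return (1 - p1) / (1 - p2)


isolated = [0.80, 0.50, 0.20, 0.10, 0.05]          # p2 columns
initial = [("50%", 0.50), ("20%", 0.20), ("10%", 0.10),
           ("5%", 0.05), ("1%", 0.01), ("Low", 0.0)]  # p1 rows

for label, p1 in initial:
    # Only cells with p2 > p1 make sense: isolation should raise the rate.
    cells = [f"{efficiency_gain(p1, p2):.2f}" if p2 > p1 else ""
             for p2 in isolated]
    print(f"{label:>4} | " + " | ".join(cells))
```

Each printed row corresponds to one row of the table, with blank cells where \(p_2\) does not exceed \(p_1\).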
The table shows that major efficiency gains kick in when the event can be isolated to a 50/50 probability or better. In fact, if the checkout page itself has a baseline conversion rate of 80%, then we can easily obtain efficiency gains of 4 or more — that is, we can use less than a quarter the number of subjects as before!
Yet even if we can only isolate subjects to have a conversion rate of 20%, if the initial conversion rate was 10%, then we've reduced the required subject count by more than a tenth (\(E=1.13\)). If even that small change allows for ten percent more experiments, and hence ten percent more opportunities to learn, then I'd say it's worth being lazy about treatment assignments.
Appendix: Completely Optional Mathematics
Below is a derivation of the efficiency formula \(E = (1 - p_1) / (1 - p_2)\).
First we should formally define \(E\). It's the total number of visitors needed for experiment 1 (the "full" experiment with treatments assigned at the beginning) divided by the total number of visitors needed to complete experiment 2 (the "isolated" experiment with treatments assigned only at the checkout page).
Let \(n_1\) be the number of subjects in experiment 1, and \(n_2\) be the number of subjects in experiment 2. The number \(E\) is not the same as the ratio of subject counts \(n_1 / n_2\), since in experiment 2, only a fraction of visitors actually become subjects. Instead we can define it as the ratio of expected successes, since the expected number of successes will be proportional to the number of visitors in both experiments:
\[ E = \frac{n_1 p_1}{n_2 p_2} \]
Now we're ready to derive a formula equal to this expression.
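Before moving on, here is one way to see why the ratio of expected successes stands in for the ratio of visitor counts. The symbols \(r\) (the fraction of visitors who reach the checkout page), \(V_1\), and \(V_2\) (total visitors needed in each experiment) are introduced here for illustration only. The sketch assumes two branches per experiment and that visitors who never reach checkout never convert, so that \(p_1 = r\,p_2\):

\[ r = \frac{p_1}{p_2} \]
\[ V_1 = 2 n_1, \qquad V_2 = \frac{2 n_2}{r} = \frac{2 n_2 p_2}{p_1} \]
\[ E = \frac{V_1}{V_2} = \frac{n_1 p_1}{n_2 p_2} \]

In the shopping cart example, \(r = 0.05 / 0.5 = 0.1\), which matches the 10% of users who reach checkout.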
The sample size formula for a 2×2 chi-squared test (see reference 1) is:
\[ n = \theta_{\alpha,\beta}\left[ \frac{\delta^2}{p(1-p)} \right]^{-1} \]
Here \(\delta\) is the size of the minimum detectable effect in absolute terms, \(p\) is the baseline conversion rate, \(\theta_{\alpha,\beta}\) is a constant corresponding to the desired significance and power levels, and \(n\) is the number of subjects required for each branch.
We then have the equations:
\[ \delta_1 = \sqrt{\theta_{\alpha,\beta} p_1 (1-p_1)/n_1} \]
\[ \delta_2 = \sqrt{\theta_{\alpha,\beta} p_2 (1-p_2)/n_2} \]
We have a third equation by equating the effect sizes relative to the baseline conversion rates in both experiments:
\[ \frac{\delta_1}{p_1} = \frac{\delta_2}{p_2} \]
Plugging the \(\delta_1\) and \(\delta_2\) definitions into this equation and doing a little algebra we find:
\[ \frac{n_1 p_1}{n_2 p_2} = \frac{1 - p_1}{1 - p_2} \]
The left-hand side is our definition of \(E\), so finally we have:
\[ E = \frac{1-p_1}{1-p_2} \]
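For anyone who wants the "little algebra" spelled out, one possible route is sketched here, using only the equations already given. Substitute the two \(\delta\) expressions into the relative-effect equation:

\[ \frac{1}{p_1}\sqrt{\frac{\theta_{\alpha,\beta}\, p_1 (1-p_1)}{n_1}} = \frac{1}{p_2}\sqrt{\frac{\theta_{\alpha,\beta}\, p_2 (1-p_2)}{n_2}} \]

Square both sides and cancel \(\theta_{\alpha,\beta}\) along with one factor of \(p_1\) on the left and one factor of \(p_2\) on the right:

\[ \frac{1-p_1}{n_1 p_1} = \frac{1-p_2}{n_2 p_2} \]

Cross-multiplying and rearranging then gives:

\[ \frac{n_1 p_1}{n_2 p_2} = \frac{1-p_1}{1-p_2} \]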
Boom!
Reference
1. Sample Size Calculations in Clinical Research by Shein-Chung Chow, Hansheng Wang, p. 145
Changes
- September 7, 2015 — Corrected the example efficiency calculation to match the introduction