Formulas for Bootstrapping Sample Medians

September 9, 2022

A recent paper from researchers at Spotify describes a new method for sampling differences in quantiles using a bootstrap method. To speed up computation compared to existing methods, the authors approximate the distribution of bootstrapped quantiles using a binomial distribution, and conjecture that bootstrapped distribution of quantiles converges to a binomial in the large-sample limit.

This note derives an exact closed-form solution for the distribution of interest in the special case of the median, valid in both small and large samples. This closed-form solution is somewhat lighter in the tails than a binomial; whether it converges to a binomial, that is, whether the conjecture is true, is left as an open problem. The note concludes with an improved approximation that will yield less conservative confidence intervals than the binomial when measuring differences in the median across two samples.

Preliminaries

The bootstrap method in the paper is a Poisson bootstrap. Each point in the original sample is included in each bootstrap sample according to a Poisson distribution with parameter (mean) 1.

Suppose that the original sample consists of ordered points \((y_1, \ldots, y_n)\); we are interested in the probability that each point in the sample will be the quantile of interest in a bootstrapped sample. Denote this probability \({\rm Pr}(y_i=\hat{\tau}_q)\).

Unlike a traditional bootstrap, the total number of points in a Poisson-bootstrapped sample is a random variable. Poisson parameters are additive, and so the total number of points is Poisson-distributed with parameter \(n\).

As a small curiosity,

\[ \sum_{i=1}^n {\rm Pr}(y_i=\hat{\tau}_q) = 1 - e^{-n} < 1 \]

That is, it is possible that no points in the original sample are included in a given bootstrap sample, let alone included as the quantile of interest. But this probability vanishes with large \(n\).

The rest of the analysis will be restricted to the median quantile \(\hat{\tau}_{1/2}\), as its symmetric properties make it amenable to exact modeling.

Conditional model

As a first step, let \(m_i\) represent the number of times that \(y_i\) is included in a bootstrap sample. Suppose \(m_i=1\) for some point \(y_i\), \(1<i<n\), that is, \(y_i\) is included exactly once. What is the probability that this \(y_i\) will represent the median in the bootstrap sample?

Momentarily restricting the analysis to odd-numbered bootstrap sample sizes, \(y_i\) will be the bootstrapped median if the number of points drawn below \(y_i\) exactly equals the number of points drawn above \(y_i\).

The additive nature of the Poisson parameter lets us quantify this probability. The number of points drawn below \(y_i\) will be Poisson-distributed with mean \(i-1\), and the number of points drawn above \(y_i\) will be Poisson-distributed with mean \(n-i\). Denoting the odd-numbered sample size restriction with \({\rm Pr}_{\rm odd}\), the probability that these two numbers are equal is

\[ {\rm Pr}_{\rm odd}(y_i=\hat{\tau}_{1/2}|m_i=1) = \sum_{k=0}^\infty \frac{e^{-(i-1)}(i-1)^k}{k!} \frac{e^{-(n-i)}(n-i)^k}{k!} = e^{-(n-1)} \sum_{k=0}^\infty \frac{((i-1)(n-i))^k}{(k!)^2} \]

Noting that the modified Bessel equation of the first kind with \(\nu=0\) is

\[ I_0(z) = \sum_{k=0}^\infty \frac{(\frac{1}{4}z^2)^k}{(k!)^2} \]

we can set

\[ \begin{array}{c} \frac{1}{4}z^2 = (i-1)(n-i) \\ z = 2\sqrt{(i-1)(n-i)} \end{array} \]

and write

\[ {\rm Pr}_{\rm odd}(y_i=\hat{\tau}_{1/2}|m_i=1) = e^{-(n-1)}I_0(2\sqrt{(i-1)(n-i)}) \]

If you implement the above equation as written, you may notice that

\[ \sum_{i=1}^n{\rm Pr}_{\rm odd}(y_i=\hat{\tau}_{1/2}|m_i=1) < 0.5 \]

which will seem to indicate a missing factor of two somewhere. This factor of two is attributable to the fact that odd-numbered sample sizes will be chosen approximately half the time.

To rectify the deficiency, consider even-numbered sample sizes, where either of the middle two points in the sorted sample could be considered the median. Retaining the \(m_i=1\) condition, and following the paper’s convention of designating either middle point to be the median with 50% probability, then \(y_i\) will be the bootstrapped median (with probability one-half) if the difference between the number of points drawn less than \(y_i\) and the number of points drawn greater than \(y_i\) is exactly equal to one. Mathematically,

\[ {\rm Pr}_{\rm even}(y_i=\hat{\tau}_{1/2}|m_i=1)= \frac{1}{2}\left(\sum_{k=0}^\infty \frac{e^{-(i-1)}(i-1)^k}{k!} \frac{e^{-(n-i)}(n-i)^{k+1}}{(k+1)!} + \sum_{k=0}^\infty \frac{e^{-(i-1)}(i-1)^{k+1}}{(k+1)!} \frac{e^{-(n-i)}(n-i)^{k}}{(k)!} \right) \]

The first-order modified Bessel function of the first kind can be written

\[ I_1(z) = \frac{1}{2}z \sum_{k=0}^\infty \frac{(\frac{1}{4}z^2)^k}{k!(k+1)!} \]

and so setting again

\[ z = 2\sqrt{(i-1)(n-i)} \]

we have

\[ \begin{array}{c} {\rm Pr}_{\rm even}(y_i=\hat{\tau}_{1/2}|m_i=1)= \frac{1}{2}e^{-(n-1)}\left[\sqrt{\frac{i-1}{n-i}}I_1(2\sqrt{(i-1)(n-i)})+\sqrt{\frac{n-i}{i-1}}I_1(2\sqrt{(i-1)(n-i)})\right] \\ =\frac{1}{2}e^{-(n-1)}\left[\sqrt{\frac{i-1}{n-i}}+\sqrt{\frac{n-i}{i-1}}\right]I_1(2\sqrt{(i-1)(n-i)}) \end{array} \]

For notational simplicity, let

\[ I_{\nu}=I_\nu(2\sqrt{(i-1)(n-i)}) \\ I_{\nu}^*=\left[\left(\frac{i-1}{n-i}\right)^{\nu/2}+\left(\frac{n-i}{i-1}\right)^{\nu/2}\right]I_\nu \]

Then

\[ \begin{array}{c} {\rm Pr}_{\rm odd}(y_i=\hat{\tau}_{1/2}|m_i=1)= e^{-(n-1)}I_0 \\ {\rm Pr}_{\rm even}(y_i=\hat{\tau}_{1/2}|m_i=1)= \frac{1}{2}e^{-(n-1)}I_1^* \\ {\rm Pr}(y_i=\hat{\tau}_{1/2}|m_i=1)= e^{-(n-1)}\left[I_0+\frac{1}{2}I_1^*\right] \end{array} \]

We proceed in like manner for the case when \(m_i=2\), that is, \(y_i\) is drawn exactly twice, finding

\[ \begin{array}{c} {\rm Pr}_{\rm odd}(y_i=\hat{\tau}_{1/2}|m_i=2)= e^{-(n-1)}I_1^* \\ {\rm Pr}_{\rm even}(y_i=\hat{\tau}_{1/2}|m_i=2)= e^{-(n-1)}\left[I_0+\frac{1}{2}I_2^*\right] \\ {\rm Pr}(y_i=\hat{\tau}_{1/2}|m_i=2)= e^{-(n-1)}\left[I_0+I_1^*+\frac{1}{2}I_2^*\right] \end{array} \]

Generically,

\[ {\rm Pr}(y_i=\hat{\tau}_{1/2}|m_i>0)= e^{-(n-1)}\left[I_0+\sum_{s=1}^{m_i-1}I_s^*+\frac{1}{2}I_{m_i}^*\right] \]

Unconditional model

Recall that in the Poisson bootstrap, \(m_i\) will be Poisson-distributed with mean 1. We can then write an unconditional expression for the probability that \(y_i\) will be the bootstrapped median as

\[ {\rm Pr}(y_i=\hat{\tau}_{1/2})= \sum_{k=1}^\infty \frac{e^{-1}}{k!} {\rm Pr}(y_i=\hat{\tau}_{1/2}|m_i=k)= \sum_{k=1}^\infty \frac{e^{-1}}{k!} e^{-(n-1)}\left[I_0+\sum_{s=1}^{k-1}I_s^*+\frac{1}{2}I_{k}^*\right] \]

A bit of algebra yields

\[ {\rm Pr}(y_i=\hat{\tau}_{1/2})=e^{-n}\left[c_0 I_0 + \sum_{k=1}^\infty c_k I_k^*\right]\\ \]

where

\[ \begin{array}{l} c_0&=&e-1\\ c_k &=& \frac{1}{2}\frac{1}{k!}+e-\sum_{s=0}^k\frac{1}{s!} \end{array} \]

An implementation reveals that this equation is well-approximated by the suggested binomial distribution, though it is somewhat lighter in the tails.¹ We may conclude that for difference-in-quantile inference, the use of the binomial approximation for index selection will yield slightly conservative confidence intervals.

Approximation

As an alternative to the binomial, I suggest the approximation

\[ {\rm Pr}(y_i=\hat{\tau}_{1/2})\approx 2e^{-(n-1)}I_0(2\sqrt{(n+1-i)(i)}) \]

This expression can be derived as twice the probability that \((y_1,\ldots,y_i)\) and \((y_i,\ldots,y_n)\) will contribute equal numbers of points to the bootstrapped sample (note that \(y_i\) appears in both sets), conditioned on an odd-numbered bootstrapped sample size. The approximation adheres more closely to the exact expression than the binomial, though remains heavier in the tails than the exact expression. Thus when used for bootstrapping confidence intervals, it will err very slightly on the conservative side.

The foregoing treatment has omitted exact probability expressions for the edge cases \({\rm Pr}(y_1=\hat{\tau}_{1/2})\) and \({\rm Pr}(y_n=\hat{\tau}_{1/2})\); these are left as an exercise for the highly motivated reader. For all practical applications, they may be approximated as zero.

Conclusion

The Poisson bootstrap has gained a measure of popularity in recent years due to its ease of implementation as well as its adaptability to parallel-data settings. The derivations above demonstrate another benefit: amenability to exact analysis, which may shed some future light on the original authors’ conjecture that the distribution of sample quantile indexes converges to a binomial in the large-sample limit. In the meantime, implementers may use the above equations to tighten their bootstrapped confidence intervals for any statistics involving sample medians.

Notes

Implementers are advised to make use of scaled Bessel functions, such as SpecialFunctions.besselix in Julia or scipy.special.jve in Python.

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Formulas for Bayesian A/B Testing
Simple Sequential A/B Testing
The Amex EveryDay Bonus: A Stochastic Valuation Model
Evaluating Splatoon’s Ranking System

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter