Statistical Shortcomings in Standard Math Libraries (And How To Fix Them)
By Evan Miller
June 8, 2014
Summary: Basic statistical analysis requires special mathematical functions which almost no standard math libraries implement. These functions are not particular to statistics, but they form building blocks for a number of important statistical applications. If standard math libraries were to provide just a few more special functions, including incomplete beta, incomplete gamma, inverse normal, and better Bessel functions, developers would be able to integrate more statistical functions into their applications without having to license specialized libraries, or copy-paste code of dubious origin into their projects.
Jump straight to the list of functions, or read on to learn why I think they’re so important, and yet so neglected.
Ever wonder why statisticians flock to specialized languages? It’s not hard to figure out. They’re not just flocking; in part, they’re fleeing. The C standard math library — and by extension, nearly every standard math library out there — lacks even the most basic functionality for doing statistical analysis. I’m not talking about the latest and greatest algorithms in machine learning; I mean the absolute most fundamental, Chapter 1 things you can do in statistics, such perform a t test or chi-squared. (Both of these tests, incidentally, were considered state-of-the-art around the same time as electric lighting.)
Apparently no one believes that C’s statistical handicap is a problem — or at least no one seems to be talking about it — I think for three reasons:
The designers of the C standard have historically been more knowledgeable about traditional engineering problems (that require, say,
log1por hyperbolic cosines) than other kinds of math problems;
Most programmers haven’t yet realized that they need statistical analysis; and
Statisticians haven’t been asked for their opinion on the matter.
It may be tempting to dismiss statistics as a kind of “advanced application” that does not belong in a basic math library. For the most part, I agree. I believe that standard math libraries should contain only a bare-bones set of the most needed general-purpose mathematical functions. But my argument is not that the C standard library is missing statistical functions, such as drawing from probability distributions and what not; rather, it is missing basic mathematical functions, which happen to be necessary for doing statistics.
A central element of my thesis is that programmers need these functions even though they’re not yet asking for them. In my view, statistics is not a specialized application, but rather a foundational body of knowledge, akin to calculus or trigonometry. Its fruits should be permeating most software — from smartphones to space stations, telling us which patterns are real and which can be ascribed to chance. But that is not going to happen without first making a few changes to
What Makes Statistics So Special
If you ever sat through a statistics course in high school or college, you may have vague memories of textbook problems involving colored balls emerging from urns, hypothetical measurements of people’s heights and weights, and looking up numbers in vast small-print tables. The connection between these tasks and what a programmer does from day to day is unclear at best.
But I encourage you not to let past experiences put you off to statistics. There is a world of statistical insight waiting to be applied to software development — and to software itself.
If, in your life as a programmer, you have ever computed an average or a percentage, presented a daily count, or reported the results of a test, you needed statistics, whether or not you realized it at the time.
As a simple example, let’s talk about one of the first tasks in any programming book: compute the average value of an array of numbers. Let’s say your array looks something like:
[ 24.0, 29.0, 31.1, 26.4, 23.4, 24.1, 28.2 ]
You dutifully write an loop and report the result: 26.6 (or maybe 26.599999999999998).
Now step back for a moment: What does 26.6 — the number — really mean? Does that “.6” signify something real? Or should we just round up to 27? Or to 30? Would it be better to report the average as “26.600” or as “somewhere between 25 and 28”?
The mechanical part of computing an arithmetic average — the part emphasized by the programming book — is easy. Statistics forces you to at least consider the more difficult question, What can we infer about the world from the number we just computed? If we received a new set of numbers, would we still compute 26.6, or something near it? What’s the real average of the thing that produced the numbers we were given? In other words, where did that array come from?
Statistics is fundamentally about reality — understanding the numbers behind the numbers, the parameters that drive the observations. It’s about getting to know the man behind the curtain by studying the bumps and waves on the velvet.
Even if you don’t write programs that process observed data, your software will benefit from statistics. Any software engineer worth his beans should be analyzing profiling and performance data for ways to improve the program. Does a code change have a statistically significant effect on performance? What’s the effect on the 99th percentile? It’s not enough to calculate a standard deviation and call it a day. You need statistical inference to understand what’s going on below the surface.
I would love to show you sample code to (say) compute a confidence interval around the mean, and start answering those questions yourself. But of course there’s a small problem with that proposition. For some reason, your math library seems to be missing a few things…
Why <math.h> Should Support Statistics
<math.h> header file occupies a curious position of influence in the computing world. In some ways this C mathematical library is a relic from an era when computers were for largely for people who needed to compute things; it’s a small library, far from comprehensive, but contains just enough functionality for someone with mathematical know-how to do something interesting, such as take a logarithm or compute a cosine.
Because most systems these days are written in C, and because most programming language designers aren’t particularly interested in numerical methods, the functions in
<math.h> have formed a kind of blueprint for almost every standard math library in the modern computing landscape. For the most part, language developers simply write wrapper functions around whatever happens to be present in
<math.h>; purloin the descriptions of the wrapped C functions; and then move on to other things. Look at the documentation for the math libraries of Python, .NET, Java, and Erlang to see what I mean.
And so for purposes of making mathematical functions ubiquitous,
<math.h> is the head of the snake. Or the source of the river, if you prefer an aquatic metaphor. In any event, functions added to
<math.h> are bound to propagate to nearly every other language’s standard library, not to mention all the databases, spreadsheets, and software systems written in C or its derivatives. I believe that basic statistical analysis should be universally available, on every platform, in every language, without special libraries, and the only way to make that happen is by modifying
To reiterate, I am not advocating the addition of statistics functions to
<math.h>. The functions that I am calling for are math functions, similar to the sine and cosine and logarithms that already appear in
<math.h>. They will fit right in and play well with others, I promise.
The truth is that high-quality numerical functions are hard to come by. Without rock-solid library implementations, programmers who want to do any kind of statistics are forced to go out and find usable numerical code on the web — or worse, attempt to write their own. Given the amount of specialized knowledge required for accurate numerical programming, both tasks are fraught with peril. It would be far better to leave the task of verifying numerical code to the experts, that is, to the existing maintainers of the various standard C mathematical libraries.
What code, you may ask?
The Elusive Eight
The following mathematical functions are necessary for implementing any rudimentary statistics application; and yet they are general enough to have many applications beyond statistics. I hereby propose adding them to the standard C math library and to the libraries which inherit from it. For purposes of future discussion, I will refer to these functions as the Elusive Eight.
1-2. The regularized incomplete beta function, Ix(a, b), and its inverse
This is the big one. I would give up all functions below, implement my own floor and ceiling functions, and truncate my left leg at the knee in exchange for an incomplete beta function and its inverse.
Ix(a, b) shows up seemingly everywhere in statistics. You need it to know if there’s a difference in the mean across two groups (or several). You need it to know if the coefficients in your linear regression are statistically significant. You need it to calculate sample sizes for experiments, and to construct confidence intervals around the mean and mean differences. It shows up in a host of more specific applications, such as ranking things that were voted on.
Why is Ix so important? Because it plays a crucial role in the second-most important distribution in the history of statistics, Student’s t distribution. Without Ix, there is no t. Without t, there is essentially no hypothesis testing of continuously distributed data. (The first-most important distribution, of course, is the Gaussian — more on that below.)
The incomplete beta shows up frequently in Bayesian statistics, too. Beliefs about Bernoulli parameters are often modeled with a beta distribution, and if you want to integrate that distribution up to some number, well, you’re going to need Ix. For a statistician, to do without Ix and its inverse is like an electrical engineer foregoing his calculator. Its absence brings work to a screeching halt.
3-5. The regularized incomplete gamma function, P(s, x), its complement, and its inverse
The regularized incomplete gamma function is crucial for performing any kind of chi-squared test, including the single most common test in statistics — the test that spawned the modern study of statistics, in fact — Pearson’s goodness-of-fit test, first formulated in the year 1900. If you want to compare data to a discrete distribution, test a hypothesis about category counts, test the equality of medians across more than two groups, test the independence of row and column counts in a matrix larger than 2×2, or determine the combined statistical significance of two or more regression coefficients in a non-linear regression model, you’ll need a chi-squared test. And hence, you’ll need P(s, x).
P(s, x) shows up in places besides the chi-squared distribution. In the analysis of count data, for example, it appears in the formula for constructing a confidence interval around a Poisson parameter, and is useful for evaluating a Bayesian belief concerning the same. It’s another basic building block in statistics.
6-7. The cumulative normal distribution function and its inverse
I lied when I said I didn’t care for any probability distributions in my
<math.h>. The normal distribution is such a foundational probability distribution in science, engineering, and statistics (via the Central Limit Theorem) that I believe it deserves an exception. In fact, it already appears in
<math.h> in disguise; with a little manipulation, the error function (
erf) can be used to compute a cumulative distribution of a Gaussian.
But it would be nice to have first class support for the cumulative normal distribution; constructing it from the error function requires multiplying by a constant1 that is not present in the
<math.h> defines, and hence is susceptible to programmer error. Almost as important as a proper normal distribution function is an inverse normal function. The inverse function is necessary for constructing confidence intervals, creating Q-Q plots, and calculating critical values for statistics that are described by a normal distribution. It’s a genuine workhorse function, and engineers in many fields would appreciate it being made available more widely.
8. Bessel functions of non-integer order
I am sure I am not the first person to point this out, but the Bessel functions in
<math.h> contain a basic design error.
j1 are fine, but
jn makes a crucial mistake — it requires that the order of a Bessel function be an integer.2
That requirement is foolish, in my opinion. The order of a Bessel function can be any rational number, not just an integer.3 In statistics, non-integer orders of Bessel functions show up when testing whether an observed distribution is equal across multiple groups. The function’s order is related to half the number of groups, and so handling non-integer orders is essential for analyzing an odd number of groups. So my final request to the keepers of
<math.h> is to provide a Bessel function of non-integer order. Non-integer orders show up in differential equations, too, so it’s not only statisticians who would benefit.
Where To Go, and How To Get There
I realize that implementing all of these specialized numerical functions sounds like a lot of work, but in fact the work is essentially finished: the free Cephes library contains a host of mathematical functions with names and argument orders nearly identical to their
<math.h> counterparts, as well as all of the functions I have mentioned above. So my proposal boils down to “graduating” the following eight double-precision functions, along with their single-precision and extended precision counterparts, from Cephes into the standard C library:
/* Regularized incomplete beta function */ double incbet(double a, double b, double x); /* Inverse of incomplete beta integral */ double incbi(double a, double b, double y); /* Regularized incomplete gamma integral */ double igam(double a, double x); /* Complemented incomplete gamma integral */ double igamc(double a, double x); /* Inverse of complemented incomplete gamma integral double igami(double a, double p); /* Normal distribution function */ double ndtr(double x); /* Inverse of Normal distribution function */ double ndtri(double y); /* Bessel function of non-integer order */ double jv(double v, double x);
The Cephes folks seem to be stingy when it comes to doling out letters in function names, so the C committee may want to add a few characters to the above for clarity. But other than that minor taxonomic concern, these functions are essentially ready to go.
So if you know someone who is involved in maintaining a standard math library of any kind, I encourage you to send them the “Elusive Eight” functions above and ask whether they’re doing anything to implement them. Without the functions, regular programmers can’t begin to use statistics; and without statistics, they can’t distinguish chance from patterns, either for themselves on behalf of their users.
I sincerely believe we will one day look back on the current era with a sense of wonderment as to why it took so long to implement rather basic functionality in standard math libraries. I remain hopeful that adding these key mathematical functions will help us stamp out the current dark age of unadorned averages, and let us enter into a brave and sunlit century of statistically enlightened software.4
1. 1 divided by the square root of 2 times pi.
2. I realize that Bessel functions are technically not part of the C standard. They do, however, appear in the implementations of all the major operating systems, with the same integer limitation.
3. Actually, the order can be an imaginary number, too, but let’s not get carried away.
4. Or at least a period of 99.99999999998 years.