Big Data and the Price-Precision Curve
By Evan Miller
November 6, 2012
Suppose you figure out how to estimate the value of pi, the famous constant, by running a regression on a megabyte of data. The regression takes a full second to run and returns an estimate of 3, with a standard error of 0.1. You tell your boss with confidence that the answer is 3.
Your boss says that he’s going to need several more digits of precision, and asks you to please analyze a larger data set to arrive at a more precise answer. So you do.
This is a fool’s errand.
To get an answer of 3.1, you’ll need a hundred megabytes of data and two minutes to crunch the numbers. So you take a bathroom break while the computer does its work. You come back to the computer and think: well, that was easy. How about one more digit?
To get an answer of 3.14, you’ll need ten gigabytes of data and nearly three hours for the regression to complete. So you slip out the back and go to the movies.
To get an answer of 3.141, you’ll need a terabyte of data and 12 days to run the numbers. So you put a sticky note on your computer and go on vacation.
To get an answer of 3.1415, you’ll need 100 terabytes of data and three years of computation time. So you sign up for AWS and go on sabbatical.
To get an answer of 3.14159, you’ll need ten petabytes of data and three hundred years. So you make a wiki page and leave a few instructions for your successors.
To get an answer of 3.141592, you’ll need an exabyte of data and thirty thousand years. Now would be a good time to quit smoking.
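Precision in regression analysis is more expensive than most people think. Each significant digit in a regression estimate costs 100 times as much as the previous digit in terms of computation time and required data size. This is a consequence of the regression formula and a hard fact of life: the standard error of an estimate shrinks only with the square root of the number of observations, so cutting the error by a factor of 10 takes 100 times as much data.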
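For readers who want to check the arithmetic, here is a minimal Python sketch of the curve. It assumes the baseline from the story above (one digit costs a megabyte and a second), that run time grows linearly with data size, and that the standard error equals sigma divided by the square root of n. The function names `cost_for_digits` and `std_err` are made up for illustration.

```python
import math

def cost_for_digits(digits, base_bytes=10**6, base_seconds=1):
    """Data size and run time needed for `digits` significant digits.

    Assumption: the first digit costs base_bytes and base_seconds,
    and every further digit multiplies both by 100.
    """
    factor = 100 ** (digits - 1)
    return base_bytes * factor, base_seconds * factor

def std_err(sigma, n):
    """Standard error of a mean-like estimate: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Walk the curve from one digit (3) to seven digits (3.141592).
for digits in range(1, 8):
    nbytes, seconds = cost_for_digits(digits)
    print(f"{digits} digit(s): {nbytes:.0e} bytes, {seconds:,} seconds")
```

Dividing the seconds column by 86,400 gives days, which is roughly where the 12 days and the three years in the story come from. The `std_err` helper shows the same thing from the other side: multiplying n by 100 divides the result by only 10.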
It’s easy to miss this fact because the primary practitioners of regression analysis — social scientists — almost never talk about significant digits. Instead they talk about standard errors, probably to avoid embarrassment. Most published regression estimates, even those that are statistically significant, have zero significant digits. (“Statistically significant” simply means “probably not equal to zero.” A significant digit, on the other hand, means “probably is what it says.”)
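To make the distinction concrete, here is a rough Python sketch. It rests on a working definition of my own choosing, not a standard one: a digit counts as significant if both ends of the 95% interval (estimate plus or minus 1.96 standard errors) round to the same value at that decimal place. The function names are hypothetical.

```python
import math

def is_statistically_significant(estimate, std_err, z=1.96):
    """True when the z-interval around the estimate excludes zero."""
    return abs(estimate) > z * std_err

def supported_digits(estimate, std_err, z=1.96, max_digits=15):
    """Count leading digits on which both ends of the interval agree.

    Assumed definition: a digit is supported if the lower and upper
    bounds round to the same value at that decimal place.
    """
    lo, hi = estimate - z * std_err, estimate + z * std_err
    if estimate == 0 or lo <= 0 <= hi:
        return 0  # the interval does not even pin down the sign
    # Decimal position of the estimate's leading digit.
    exponent = math.floor(math.log10(abs(estimate)))
    digits = 0
    for k in range(max_digits):
        places = k - exponent  # decimal places to keep for digit k
        if round(lo, places) != round(hi, places):
            break
        digits += 1
    return digits
```

With a made-up estimate of 0.37 and a standard error of 0.15, `is_statistically_significant` returns `True` while `supported_digits` returns 0. An estimate of 3 with a standard error of 0.1 earns exactly one digit. The rounding rule gets jumpy when an estimate sits near a rounding boundary such as 3.5, so treat the count as a rough guide.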
There is a lot of talk about “Big Data” being used for business insight, and it is being sold to people who don’t understand the price-precision curve. If the sellers were more honest, the product line would look something like:
- One SigFig™ / 1 server / $1,000
- Two SigFigs™ / 100 servers / $100,000
- Three SigFigs™ / 10,000 servers / $10,000,000
- Four SigFigs™ / 1,000,000 servers / $1,000,000,000
- Five SigFigs™ / 100,000,000 servers / $100,000,000,000 (worst value!)
Big Data certainly has its uses, but often the best way to increase the precision of one’s understanding is to improve the quality of existing data, approach it in a new way, run experiments, or perform quick analyses on random samples (for more discussion, see my previous post, In Praise of Small Data). If someone tells you they “just need more data”, and they’re already pushing the limits of what a single computer can handle, their data is probably not very good. Adding more of it might help, but only up to a point. When it comes to producing insights, Big Data is no substitute for good data.