Bayesian Average Ratings

November 6, 2012

Bayesian statistics, like quantum theory, sounds completely nuts on its face, but it turns out to have a wide panoply of practical applications. In this post I’m going to revisit the problem that I posed in How Not To Sort By Average Rating, but instead of approaching the problem with a frequentist frame of mind, I’ll show the reader how to think about it like a true disciple of the Reverend Thomas Bayes. In the process we’ll arrive at a solution that offers much more flexibility than my previous, frequentist suggestion.

The problem to be addressed is: what is the best way to sort items by average rating if many of the items have only a handful of ratings?

The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we’d like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.

Bayesian statistics lets us formalize this intuition. There are two main drawbacks to the Bayesian approach. First, it can be computationally expensive. For one-dimensional applications like this one, the expense usually isn’t prohibitive. Second, it introduces a number of “fudge factors” which will affect the analysis. But this second drawback is also a blessing, because these fudge factors give us a lot of flexibility to account for things that we might intuitively think belong in the model.

In particular, a Bayesian model can explicitly account for:

Time. Perhaps old ratings should count for less than new ratings.
Prior beliefs. Perhaps an unrated item from a bestselling author should appear above an unrated item from an unknown author.
Attitudes toward risk. Exactly how risk-averse should the algorithm be about promoting items with a small number ratings?

All of these considerations fit rather nicely into a Bayesian framework. Of course, the framework’s flexibility will leave a lot of room for error and expose a wide flank for criticism. The reader is advised that the methods below are not “true” in any sense, but they are useful. And usefulness, I think, should be the deciding factor for anyone pondering whether to take the Bayesian plunge.

The Setup

For the sake of simplicity, I’m only going to consider a binary (up/down) rating system. If you have a more complex rating system, you might try reducing it to a binary system (e.g. 4 or 5 stars is “up”, and 1-3 stars is “down”), or if you are feeling brave, you might try to work out the math for a five-star system. I haven’t.

Bayesianism is built around beliefs. A “belief” describes the probability that we think some parameter takes each of a number of possible values. For our application, we’ll have a belief for each item; a belief is a mathematical construct, but a typical belief might be verbalized as “There is a 50% chance that 10% of people or fewer like item A, and a 50% chance that more than 10% of people like item A.” Note that this belief is quite distinct from “There is a 100% chance that 10% of people like item A”.

The basic strategy for sorting rated items can be summarized:

Step 1: Start with a made-up belief about each item’s average rating
Step 2: Update the belief as new data arrives
Step 3: Use the newest belief to construct a sorting criterion for each item

The hope is that even if we start with a wildly inaccurate belief, it will become more precise as more ratings are received. And when it’s time to decide how imprecise beliefs should be ranked in comparison to precise beliefs, we’ll have some leeway in striking a balance that seems appropriate.

Step 1: The Prior Belief

The initial belief about an item’s possible average rating — called the prior belief — is completely made-up. Because we can make up any prior belief that we like, we’ll choose a prior that is easy to update in light of new information.

The prior belief will describe a Bernoulli parameter, so the most convenient mathematical form happens to be a beta distribution. A beta distribution takes two parameters, which happen to correspond nicely to the number of up-votes and the number of down-votes (plus one) that an item has received.

We don’t have to start both parameters at one if we don’t want to. We can use additional information to “prime” the parameters as we see fit. For example, if we think having a photo attached should be worth three up-votes, we can start the up-vote parameter at three for items with a picture. Or if we think an item’s average rating can be accurately predicted from one of its attributes (say, the identity of its creator), we can prime the pump that way. There are many possibilities.

At this point, it doesn’t really matter what the beta distribution looks like. The important bit is that we have two parameters: an initial number of up-votes, and an inital number of down-votes.

Step 2: Update the belief

This is so easy as to be banal. When a rating is received, update the number of up-votes or down-votes. The posterior belief is just a beta distribution with the new numbers.

Step 3: Construct a sorting criterion

The sorting criterion should answer the question: Where does this item belong in a list of items whose average rating is known with 100% certainty?

There are many ways to turn a belief into a sorting criterion. Technically speaking, we are supposed to formulate a loss function which will then dictate the decision (sorting criterion). The choice of loss function is somewhat arbitrary, but the most convenient one for our purposes is the multi-linear loss function. It lets us state exactly how much worse it is to over-rank an item than to under-rank it.

Call this “loss multiple” \(L\). Five seems like a reasonable number to me, but you can play around with your own choice. A loss multiple of five says that it’s five times worse to rank an item too high on a list than to rank it too low. In this way, we’ll avoid placing an item very high on list unless we have a relatively strong belief that its average rating is high.

With the loss multiple \(L\) and belief parameters \(U\) (up-votes) and \(D\) (down-votes) in hand, the sorting criterion \(X\) can be solved from a simple formula:

\[ I_X(U+1, D+1) = \frac{1}{1+L} \]

The function \(I_X\) is known as the incomplete beta function; to invert it and arrive at \(X\), you’ll need to use a mathematical routine such as ASA 109 or betaincinv from scipy. Unfortunately, the inverse of the incomplete beta function is not available from the typical database console, so the computation will have to occur in a separate script or program. If you’re the mathematical type and wondering how I arrived at that formula, see the Mathematical Appendix.

The result is that precise beliefs will be ranked very close to the average value of the belief, and imprecise beliefs will be ranked somewhat lower than the average value of the belief — which is just what we want. Note that the “pessimism” in the model — that is, the reason unrated items appear low on the list rather than high on the list — is a purely consequence of the loss function. We did not stuff the prior belief with down-votes. The loss function lets us penalize imprecise beliefs without forcing us to say bad things about new items.

It may take some tuning of \(L\) to arrive at a value that feels right for a particular application, but the Bayesian framework described here provides a consistent way to think about the problem.

Extension: Rating Decay

As I have emphasized, the great virtue (and vice) of Bayesian statistics is its flexibility. Next I want to describe a way to let ratings “cool off”, for example, if we have reason to believe that older ratings are not as informative as newer ratings. What follows should look more than a little familiar to anyone who has read my previous post, Rank Hotness With Newton’s Law of Cooling.

The simplest approach is to use a technique called exponential decay. The decay process takes a single parameter called the half-life, which corresponds to the amount of time it takes for a particular rating’s influence to diminish by half.

To implement exponential decay, simply keep track of the number of up-votes and down-votes as before, and alongside them keep track of the time of the most recent up-vote and most recent down-vote. Then instead of the usual update process, you’ll use the formula:

\[ new = old \times 2^{-t/H} + 1 \]

Where H is the half-life and t is the amount of time that has elapsed since the most recent update. Note that because this decay is a continuous process, you will need to store the rating counts as floating-point numbers instead of round integers.

In addition, when calculating the current sorting criterion, you’ll need to take the rating decay into account. The revised sorting criterion formula to be solved will look like:

\[ I_X(U \times 2^{-t_U/H}, D \times 2^{-t_D/H}) = \frac{1}{1+L} \]

Where \(t_U\) is the time since the most recent up-vote and \(t_D\) is the time since the most recent down-vote.

The statistics remain exactly the same as before; we’re just altering the belief structure according to whim, and nothing says that the beta distribution parameters must be integers. Such is the flexibility of Bayesian methods. Use the power with care.

Conclusion

Bayesian average ratings are an excellent way to sort items with up-votes and down-votes, and lets us incorporate a desired level of caution directly into the model. It is eminently “hackable” in the sense of affording opportunities for adjusting the model to accommodate new features (prior beliefs, belief decay) without violating the model’s consistency. As long as we make a judicious choice of belief structure (namely a beta distribution), it is feasible to compute.

As with other hackable systems, it is possible to take the Bayesian framework and produce something utterly useless. But as long as you set aside time for a little experimentation, I think you’ll eventually arrive at a sorting method that helps people find the good stuff quickly.

Mathematical Appendix

Below is a derivation of the formula \(I_X(U+1,D+1) = 1/(1+L)\).

Let the loss function be multilinear with a scalar of \(k\) below and unity above:

\[ {\rm L}_k(x,X) = \left\{ \begin{array}{lr} k(X-x) & : x < X \\ x-X & : x ≥ X \end{array} \right. \]

The expected loss is:

\[ E[{\rm L}_k(x,X)|U,D] = k \int_0^X {(X-x) f(x;U+1,D+1)dx} + \int_X^1 {(x-X)f(x;U+1,D+1) dx} \]

Where \(f\) is the beta distribution.

Integrating by parts this formula becomes:

\[ E[{\rm L}_k(x,X)|U,D] = k \int_0^X I_y(U+1, D+1) dy + \int_X^1 (1-I_y(U+1, D+1)) dy \]

Minimizing this expression with respect to \(X\) we get:

\[ I_X(U+1, D+1) = \frac{1}{1+k} \]

Changes

May 19, 2013: Added mathematical appendix
May 16, 2013: Corrected formula and text to start the counters at 1, not 0

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Ranking Items With Star Ratings: An Approximate Bayesian Approach
How To Read an Unlabeled Sales Chart
How Not To Run an A/B Test

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter