Ranking Items With Star Ratings

September 16, 2014 (Changes)

Ranking items rated on a five (or ten, or \(K\)) star scale is slightly more involved than the up-or-down case I wrote about previously. Here I’ll present a Bayesian approximation to the problem which can be used to answer two questions:

How should items be “sorted by average rating” when they are rated on a star scale?
When should an average star rating be displayed?

The main problem addressed here is what to do with items that have a small number of ratings, since an item with a single perfect rating probably should not sit atop a list of top-rated items. Unlike my previous Bayesian work, the method here is not exact, but it should work pretty well in most cases, and has the virtue of being (relatively) easy to compute.

Jump to:

Setup and Results
Derivation
Example Distributions

Uniformly distribution ratings
Consensus ratings
Polarized ratings
Sample size table

Discussion

Setup and Results

Assume you have \(K\) possible ratings, indexed by \(k\), each worth \(s_k\) points. For “star” rating systems, \(s_k = k\). (That is, 1 point, 2 points, ….) Assume a given item has received \(N\) total ratings, with \(n_k\) ratings for \(k\). Then the items can be effectively sorted with the criterion:

\[ S(n_1, \ldots, n_k) = \sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}} - z_{\alpha/2} \sqrt{\left(\left(\sum_{k=1}^K s_k^2 \frac{n_k + 1}{N + K}\right) - \left(\sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}}\right)^2\right)/(N + K + 1)} \]

Where \(z_{\alpha/2}\) is the \(1-\alpha/2\) quantile of a normal distribution. The above expression is the lower bound of a normal approximation to a Bayesian credible interval for the average rating. Setting \(\alpha=0.10\) (\(z = 1.65)\), a sort criterion of \(X\) means that 95% of the time, the item will have an average rating greater than \(X\), at least according to the belief structure.

If you display average ratings to the nearest half-star, you probably don’t want to display an average rating unless the credible interval is a half-star wide or less, that is:

\[ 2 \times z_{\alpha/2} \sqrt{\left(\left(\sum_{k=1}^K s_k^2 \frac{n_k + 1}{N + K}\right) - \left(\sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}}\right)^2\right)/(N + K + 1)} \lt \frac{1}{2} \]

You can adjust the number on the right to reflect the resolution of your system for displaying stars, e.g. use 1 if you show ratings to the nearest star instead of half-star.

Derivation

Assume a Dirichlet prior distribution on the probability vector \(p = (p_1, \ldots, p_K)\), where \(p_k\) is the fraction of the population that would give a \(k\) rating to a particular item. The Dirichlet prior has parameters \((\alpha_1 = 1, \ldots, \alpha_K = 1)\). After observing \((n_1, \ldots, n_k)\) events (ratings), the posterior is given by a Dirichlet distribution with parameters \((\alpha_1 = n_1+1, \ldots, \alpha_K = n_k+1)\). We’ll use that fact later but for now do the derivation in terms of \(\alpha\). Let \(\alpha_0 = \sum_k \alpha_k\).

We wish to discover the mean and variance of the average rating function \(f(p) = s_1 p_1 + \ldots + s_k p_k\). We will then use these values to form a normal approximation to the credible interval.

The mean of \(f(p)\) is given by:

\[ E[f(p)] = E[s_1 p_1 + \ldots + s_K p_K] = E[s_1 p_1] + \ldots + E[s_K p_K] = \sum_{k=1}^K s_k E[p_k] \]

By property of the Dirichlet distribution, \(E[p_k] = \alpha_k / \alpha_0\), so

\[ E[f(p)] = \sum_{k=1}^K s_k \frac{\alpha_k}{\alpha_0} \]

Since \(f(p)\) is linear in \(p\) we can express it as a linear operator, \(f(p) = F'p\), where \(F = (s_1, \ldots, s_k)\). Then the variance of \(f(p)\) is given by:

\[ Var[f(p)] = Var[F'p] = F' Var[p] F \]

The variance-covariance matrix of \(p\) is given by:

\[ \begin{array}{rl} Var[p_i] &=& \frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)} \\ Cov[p_i, p_j] &=& \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0+1)} \end{array} \]

Doing the vector-matrix multiplication, we find:

\[ \begin{equation} \label{variance} Var[f(p)] = \frac{1}{\alpha_0+1} \left(\left(\sum_{k=1}^K s_k^2 \frac{\alpha_k}{\alpha_0}\right) - \left(\sum_{k=1}^K s_k\frac{\alpha_k}{\alpha_0}\right)^2\right) \end{equation} \]

Using a normal approximation, we can construct a \((1-\alpha)\) credible interval for \(f(p)\):

\[ f(p) \in E[f(p)] \pm z_{\alpha/2} \sqrt{Var[f(p)]} \]

Substituting \(\alpha_k = n_k + 1\) and \(\alpha_0 = N + K\), we then get the original expression for the lower bound:

\[ \sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}} - z_{\alpha/2} \sqrt{\left(\left(\sum_{k=1}^K s_k^2 \frac{n_k + 1}{N + K}\right) - \left(\sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}}\right)^2\right)/(N + K + 1)} \]

As well as an expression for the width of the interval:

\[ w = 2 \times z_{\alpha/2} \sqrt{\left(\left(\sum_{k=1}^K s_k^2 \frac{n_k + 1}{N + K}\right) - \left(\sum_{k=1}^K{s_k \frac{n_k + 1}{N + K}}\right)^2\right)/(N + K + 1)} \]

Example Distributions

To see how the formulas above work out in practice, let’s consider a five-star rating system and two kinds of items: one that receives only the highest rating from users; one that receives an equal number of ratings at each level; and one that receives extremely polarized ratings (half with the highest rating, half with the lowest rating). For each item, we’ll work out an expression for the variance of the mean, and calculate the number of ratings needed to achieve specific credible intervals. In all three examples, let \(s_k = k\) for \(k \in \{1, \ldots, 5\}\).

Uniformly distributed ratings

First consider the item that receives an equal number of ratings at each star level. That is, let \(n = n_1 = \ldots = n_K\). Also let \(\alpha = \alpha_1 = \ldots = \alpha_K = n + 1\).

To specialize the variance formula for an equal number of ratings at each level, we substitute \(\alpha_k = \alpha\) and \(\alpha_0 = K\alpha\) into \(\eqref{variance}\):

\[ Var[f(p)] = \frac{1}{K\alpha+1} \left(\left(\sum_{k=1}^K s_k^2 \frac{\alpha}{K\alpha}\right) - \left(\sum_{k=1}^K s_k\frac{\alpha}{K\alpha}\right)^2\right) \]

Then simplify:

\[ Var[f(p)] = \frac{1}{K(K\alpha+1)} \left(\left(\sum_{k=1}^K s_k^2 \right) - \frac{1}{K} \left(\sum_{k=1}^K s_k\right)^2\right) \]

The above formula applies to any rating system (that is, arbitrary values of \(s_k\)). For a star-rating system, substitue \(s_k = k\):

\[ Var[f(p)] = \frac{1}{K(K\alpha+1)} \left(\left(\sum_{k=1}^K k^2 \right) - \frac{1}{K} \left(\sum_{k=1}^K k\right)^2\right) \]

Then get rid of the second summation with an identity and simplify:

\[ Var[f(p)] = \frac{1}{K(K\alpha+1)} \left(\left(\sum_{k=1}^K k^2 \right) - \frac{1}{K} \left(\frac{K(K+1)}{2}\right)^2\right) \] \[ = \frac{1}{K(K\alpha+1)} \left(\left(\sum_{k=1}^K k^2 \right) - \frac{1}{4} K (K+1)^2\right) \]

In terms of the total number of ratings \(N\):

\[ Var[f(p)] = \frac{1}{K(N+K+1)} \left(\left(\sum_{k=1}^K k^2 \right) - \frac{1}{4} K (K+1)^2\right) \]

Incidentally, the formula is easily inverted to determine the number of ratings required to produce a given amount of variance:

\[ N = \frac{1}{Var[f(p)]K} \left(\left(\sum_{k=1}^K k^2 \right) - \frac{1}{4} K (K+1)^2\right) - K - 1 \]

Now we can specialize the formula to a five-star system with \(K = 5\) and \(s_k = k\):

\[ Var[f(p)] = \frac{1}{5(N+5+1)} \left(\left(\sum_{k=1}^5 k^2 \right) - \frac{1}{4} 5 (5+1)^2\right) = \frac{1}{5(N+6)} \left(55 - 45\right) \] \[ \begin{equation} \label{uniform_var} Var[f(p)] = \frac{2}{N+6} \end{equation} \]

Using the normal approximation above, the width of a credible interval for the mean becomes simply:

\[ w = 2 \times z_{\alpha/2} \sqrt{2/(N+6)} \]

Rewriting in terms of \(N\):

\[ \begin{equation} \label{uniform_n} N = \frac{8 z_{\alpha/2}^2}{w^2} - 6 \end{equation} \]

For example, a 90% credible interval (\(z = 1.65\)) with a width of a half-star (\(w = 0.5\)) requires:

\[ N = \frac{8 z_{\alpha/2}^2}{w^2} - 6 = \frac{8(1.65)^2}{0.5^2} - 6 \approx 81 \]

or about 16 ratings at each of the five star levels.

A 95% credible interval (\(z = 1.96\)) spanning a full star (\(w = 1\)) requires:

\[ N = \frac{8 z_{\alpha/2}^2}{w^2} - 6 = \frac{8(1.96)^2}{1^2} - 6 \approx 25 \]

or about 5 ratings at each of the five star levels.

Consensus ratings

Next we consider an item that is receiving only the highest star rating from users. That is, let \(n_1 = \ldots = n_{K-1} = 0\) and \(n_K = N\), and correspondingly, let \(\alpha_1 = \ldots = \alpha_{K-1} = 1\) and \(\alpha_K = N+1\).

Plugging these values into the variance formula \(\eqref{variance}\):

\[ Var[f(p)] = \frac{1}{\alpha_0+1} \left(\left(\sum_{k=1}^{K-1} \frac{s_k^2}{\alpha_0} + s_K^2 \frac{N+1}{\alpha_0}\right) - \left(\sum_{k=1}^{K-1} \frac{s_k}{\alpha_0} + s_K\frac{N+1}{\alpha_0}\right)^2\right) \]

Factoring out the shared \(\alpha_0\):

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\sum_{k=1}^{K-1} s_k^2 + s_K^2 (N+1)\right) - \frac{1}{\alpha_0}\left(\sum_{k=1}^{K-1} s_k + s_K(N+1)\right)^2\right) \]

Plugging in the star formula \(s_k = k\) and using the sequence summation identity:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\sum_{k=1}^{K-1} k^2 + K^2 (N+1)\right) - \frac{1}{\alpha_0}\left(\frac{K(K-1)}{2} + K(N+1)\right)^2\right) \]

Multiplying out the square in the second term:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\sum_{k=1}^{K-1} k^2 + K^2 (N+1) - \frac{K^2}{\alpha_0}\left(\frac{(K-1)^2}{4} + (K-1)(N+1) + (N+1)^2\right)\right) \]

Using the fact that \(\alpha_0 = N+K\), the equation simplifies to:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\sum_{k=1}^{K-1} k^2 - \frac{K^2(K-1)^2}{4\alpha_0}\right) \]

The above equation contains both quadratic and cubic terms for \(\alpha_0\), but we can use it to arrive at a convenient upper bound on the variance:

\[ Var[f(p)] \lt \frac{\sum_{k=1}^{K-1} k^2}{\alpha_0(\alpha_0+1)} = \frac{\sum_{k=1}^{K-1} k^2}{(N+K)(N+K+1)} \]

For the five-star case, the formula specializes to:

\[ Var[f(p)] \lt \frac{30}{(N+5)(N+6)} \]

The width of the credible interval is then approximately bounded by:

\[ w \lt 2 \times z_{\alpha/2} \sqrt{30/(N+5)(N+6)} \]

Which can be used to produce a conservative approximation of \(N\):

\[ N \approx \frac{11 z_{\alpha/2}}{w} - 5.5 \]

Repeating the credible interval calculations from the first example, a 90% credible interval with a width of a half-star requires at most:

\[ N \approx \frac{11 z_{\alpha/2}}{w} - 5.5 = \frac{11(1.65)}{0.5} - 5.5 \approx 31 \]

Likewise, a 95% credible interval spanning a full star requires no more than:

\[ N \approx \frac{11 z_{\alpha/2}}{w} - 5.5 = \frac{11(1.96)}{1} - 5.5 \approx 16 \]

Note that \(N\) and \(w\) have an inverse linear relationship in the consensus case, whereas they had an inverse quadratic relationship in the uniform ratings example. There is less noise when everyone agrees on the rating, and so tightening the credible interval requires fewer observations.

Polarized ratings

Finally, let’s consider an item that receives an even mix of the highest and lowest ratings. That is, let \(n_1 = n_K = N/2\) and \(n_2 = \ldots = n_{K-1} = 0\), and likewise let \(\alpha_1 = \alpha_K = N/2+1\) and let \(\alpha_2 = \ldots = \alpha_{K-1} = 1\).

Once again, we plug these values into the variance formula \(\eqref{variance}\) and pull out the shared \(\alpha_0\):

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\left(\frac{N}{2}+1\right)+\sum_{k=2}^{K-1} s_k^2 + s_K^2\left(\frac{N}{2}+1\right) \right) \\ - \frac{1}{\alpha_0}\left(\left(\frac{N}{2}+1\right) + \sum_{k=2}^{K-1} s_k + s_K\left(\frac{N}{2}+1\right)\right)^2\right) \]

Simplifying:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\left(\frac{N}{2}+1\right)\left(s_K^2+1\right)+\sum_{k=2}^{K-1} s_k^2 \right) - \frac{1}{\alpha_0}\left(\left(\frac{N}{2}+1\right)\left(s_K+1\right) + \sum_{k=2}^{K-1} s_k \right)^2\right) \]

Plugging in \(s_k = k\) and eliminating the second summation:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\left(\frac{N}{2}+1\right)(K^2+1)+\sum_{k=2}^{K-1} k^2 \right) \\ - \frac{1}{\alpha_0}\left(\left(\frac{N}{2}+1\right)\left(K+1\right) + \frac{K(K-1)}{2} - 1 \right)^2\right) \]

Consolidating the terms inside the square, we have:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\left(\frac{N}{2}+1\right)(K^2+1)+\sum_{k=2}^{K-1} k^2 \right) - \frac{1}{4\alpha_0}\left((K+1)(N+K)\right)^2\right) \]

Since \(\alpha_0 = N+K\), we can write this as:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\left(\frac{N}{2}+1\right)(K^2+1)+\sum_{k=2}^{K-1} k^2 - \frac{1}{4}(K+1)^2(N+K)\right) \]

A bit of algebra reveals:

\[ Var[f(p)] = \frac{1}{\alpha_0(\alpha_0+1)} \left(\sum_{k=1}^{K-1} k^2 + \frac{1}{4}(N-K)(K-1)^2\right) \]

Assuming \(N \gt K\), we can construct a lower bound for the variance, which happens to be the same as the upper bound in the consensus ratings case:

\[ Var[f(p)] \gt \frac{\sum_{k=1}^{K-1} k^2}{\alpha_0(\alpha_0+1)} = \frac{\sum_{k=1}^{K-1} k^2}{(N+K)(N+K+1)} \]

This bound confirms that, to reduce the variance of the average to a certain level, the polarized ratings will require more observations than consensus ratings, which makes sense as polarized ratings exhibit more variance than consensus ratings.

We can also plug in numbers to come up with a specialized version of the variance for the five-star case:

\[ Var[f(p)] = \frac{1}{(N+5)(N+6)} \left(30 + \frac{1}{4}(N-5)(4)^2\right) = \frac{4N+10}{(N+5)(N+6)} \]

Which can also be written:

\[ Var[f(p)] = \frac{4(N+2.5)}{(N+5)(N+6)} \]

We can see that the variance is roughly twice that of the uniform ratings case \(\eqref{uniform_var}\), and thus for a given \(N\), the credible intervals will be inflated by a factor of roughly \(\sqrt{2}\). Likewise, to achieve a credible interval of a certain width, polarized ratings will require roughly twice as many observations as uniform ratings. Since polarized ratings represent the maximum possible variance of observed ratings, a good rule of thumb for calculating the worst-case \(N\) required to achieve a given credible interval is to use the approximate formula, following \(\eqref{uniform_n}\):

\[ N \approx \frac{16 z_{\alpha/2}^2}{w^2} - 6 \]

For the 90% / half-star interval, the formula produces:

\[ N \approx \frac{16 z_{\alpha/2}^2}{w^2} - 6 = \frac{16(1.65)^2}{0.5^2} - 6 \approx 168 \]

For the 95% / full-star interval, the formula gives us:

\[ N \approx \frac{16 z_{\alpha/2}^2}{w^2} - 6 = \frac{16(1.96)^2}{1^2} - 6 \approx 55 \]

Sample size table

The following table summarizes the formulas for the \(N\) to achieve various credible intervals in a five-star rating system for the three distribution assumptions.

Uniform \(N\)	Consensus \(N\)	Polarized \(N\)
\(N = 8 z_{\alpha/2}^2/w^2 - 6\)	\(N = 11 z_{\alpha/2}/w - 5.5\)	\(N = 16z_{\alpha/2}^2/w^2-6\)

Below is a summary of example values computed from the above (approximate) formulas.

Width (Stars)	Credibility Level	Uniform \(N\)	Consensus \(N\)	Polarized \(N\)
1.0	80%	7	9	20
1.0	90%	16	13	38
1.0	95%	25	16	55
1.0	99%	47	23	100
0.5	80%	46	23	99
0.5	90%	81	31	168
0.5	95%	117	38	240
0.5	99%	206	51	419

The figures for Consensus and Polarized can be seen as representing the lowest and highest theoretical values of N, since they have the lowest and highest theoretical amount of variance for five-star system. (Note that because Consensus calculations use an approximation, which breaks down for low \(N\), the Consensus \(N\) in the first row is higher than the \(N\) for Uniform.) Uniform might be seen as a kind of middle ground for a sample size calculation, and perhaps a conservative one for data that exhibits at least some amount of consensus on rated items.

In most cases it will be preferable to calculate the credible interval on a per-item basis, but when that is not possible, the formulas and table above should provide some useful guidance.

Discussion

Bayesian methods can be applied to user ratings both to find the good stuff (sorting by average rating) and to determine when to display an average rating in a user interface. The main advantage of the above Bayesian method over frequentist methods is that it is difficult to arrive at an estimate of the variance for small \(N\) using e.g. a frequentist Wald interval. (If all ratings are the same, the estimated variance is zero.) The Bayesian prior lets us provide our own initial estimate of the variance, essentially adding a single vote for each star rating in order to get the party started.

A more rigorous approach to the sorting problem would minimize a loss function, perhaps a multilinear loss function, over the Dirichlet distribution. However, given the success of sorting on the lower bound of a confidence interval in the binomial case, an approximate lower bound of a credible interval should offer plenty of bang for the buck without resorting to hairy integrals and special functions.

Changes

July 13, 2015: New Example Distributions section and table of contents
September 14, 2014: Initial version

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Formulas for Bayesian A/B Testing
Bayesian Average Ratings
How Not To Run an A/B Test

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter

Width (Stars)	Credibility Level	Uniform \(N\)	Consensus \(N\)	Polarized \(N\)
1.0	80%	7	9	20
1.0	90%	16	13	38
1.0	95%	25	16	55
1.0	99%	47	23	100
0.5	80%	46	23	99
0.5	90%	81	31	168
0.5	95%	117	38	240
0.5	99%	206	51	419

Width (Stars)	Credibility Level	Uniform \(N\)	Consensus \(N\)	Polarized \(N\)
1.0	80%	7	9	20
1.0	90%	16	13	38
1.0	95%	25	16	55
1.0	99%	47	23	100
0.5	80%	46	23	99
0.5	90%	81	31	168
0.5	95%	117	38	240
0.5	99%	206	51	419

Width (Stars)	Credibility Level	Uniform \(N\)	Consensus \(N\)	Polarized \(N\)
1.0	80%	7	9	20
1.0	90%	16	13	38
1.0	95%	25	16	55
1.0	99%	47	23	100
0.5	80%	46	23	99
0.5	90%	81	31	168
0.5	95%	117	38	240
0.5	99%	206	51	419