Ranking News Items With Upvotes

By Evan Miller

July 14, 2015

As a follow-up to Deriving the Reddit Formula, let's consider a social news website that lets users upvote items, but that doesn't permit them to downvote items. How can we infer the probability that a random user will upvote a story, given only the age of the story \(t\) and the total number of upvotes \(U\)?

To calculate that probability, we also need to know the total number of users visiting the site, call it \(N\). Of course, not all \(N\) users have seen a given story, so we first need an expression for the fraction of users who have seen a given story.

Following the model we used for Reddit users, if you think of a user's reload behavior as a Poisson process with reload rate \(\lambda\), then the fraction of users who have seen a given story of age \(t\) is given by the formula:

\[ q = 1 - e^{-\lambda t} \]

(This is one minus the \(q\) from the previous model, which asked which fraction of users had not seen a given story.)

So the total number of users who have seen a given story is given by:

\[ n(t) = Nq = N(1-e^{-\lambda t}) \]

Therefore the fraction of users who upvoted a story — and hence the expected probability of liking a story[1] — is given by:

\[ p = \frac{U}{n(t)} = \frac{U}{N(1-e^{-\lambda t})} \]
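This estimate is easy to compute directly. Here's a minimal sketch assuming a hypothetical site with 1,000 users and a reload rate of once per 4 hours; the function and parameter names are illustrative, not part of any real API:

```python
import math

def upvote_probability(upvotes, age_hours, n_users, lam=0.25):
    """Estimate p = U / (N * (1 - e^(-lambda*t))).

    -math.expm1(-x) computes 1 - e^(-x) without losing precision
    for very young stories (small lambda*t).
    """
    seen = n_users * -math.expm1(-lam * age_hours)  # n(t): users who have seen the story
    return upvotes / seen

# A 2-hour-old story with 50 upvotes on a 1,000-user site:
p = upvote_probability(50, 2.0, 1000)  # about 0.127
```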

Assuming a user gets utility of +1 from a story that she would upvote, the expected utility from a story of age \(t\) is just:

\[ u(t, U) = \frac{U}{N(1-e^{-\lambda t})} \]

To form an expectation, we need to multiply this quantity by the probability that the story hasn't been seen before:

\[ E[u(t, U)] = e^{-\lambda t}\frac{U}{N(1-e^{-\lambda t})} \]

Now for a neat trick. Suppose we want to use the expected utility to rank news items. Because utility is relative, we can multiply or divide it by positive numbers without affecting the rankings. So let's divide the above expression by the number of site users \(N\), which is a constant:

\[ E[u] = e^{-\lambda t}\frac{U}{1-e^{-\lambda t}} \]

That is, we can form a ranking using only \(U\) and \(t\) — it turns out there's no need to know \(N\), the total number of users on the site!
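Since the ranking needs only \(U\) and \(t\), the whole scorer fits in a few lines. A minimal sketch, assuming a reload rate of \(\lambda = 1/4\) per hour; `rank_score` is a made-up name, not anyone's production code:

```python
import math

def rank_score(upvotes, age_hours, lam=0.25):
    # E[u] = e^(-lambda*t) * U / (1 - e^(-lambda*t)) = U / (e^(lambda*t) - 1);
    # math.expm1 computes e^x - 1 accurately for small x.
    return upvotes / math.expm1(lam * age_hours)

# (name, upvotes, age in hours)
stories = [("old-but-popular", 40, 12.0), ("young-upstart", 8, 1.0)]
ranked = sorted(stories, key=lambda s: rank_score(s[1], s[2]), reverse=True)
```

Under these assumed numbers the young story with 8 upvotes outranks the 12-hour-old story with 40.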

We can also take a logarithm to make the formula comparable to the Reddit version:

\[ \ln E[u] = \ln(U) -\lambda t - \ln{(1-e^{-\lambda t})} \]

(To make the first logarithm work, you'll need to give each story an automatic upvote to start life with — but then you'll also want to set a minimum \(t\) before ranking so that the second logarithm doesn't blow up. The C functions expm1 and log1p might come in handy here.)
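Putting those guards together, here is one way the log version might look, with Python's `math.expm1` standing in for the C function; the one-minute age floor and the names are arbitrary choices for the sketch:

```python
import math

MIN_AGE_HOURS = 1.0 / 60.0  # assumed floor so ln(1 - e^(-lambda*t)) stays finite

def log_rank_score(upvotes, age_hours, lam=0.25):
    t = max(age_hours, MIN_AGE_HOURS)
    u = upvotes + 1  # every story starts life with one automatic upvote
    # -expm1(-x) = 1 - e^(-x), computed without cancellation for small x
    return math.log(u) - lam * t - math.log(-math.expm1(-lam * t))
```

Because the logarithm is monotonic, scores computed this way produce the same ordering as \(E[u]\) itself.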

The difference between this formula (which uses only upvotes) and the formula derived for Reddit (which includes both upvotes and downvotes) boils down to how we want to model the probability of upvoting something. Previously, we used the total number of votes, but here, we use the total number of visitors, using only what we know about the reload rate. As a consequence, the formulas are identical except that \(\ln{(U+D+2)}\) (the total number of votes, including two "automatic" votes) is replaced with \(\ln{(1-e^{-\lambda t})}\) (the fraction of users who have seen an item).

Example

To get a feel for the algorithm, let's calculate the substitution rate between points and time required to maintain a constant ranking:

\[ \frac{dE[u]/dt}{dE[u]/dU} = \frac{-\lambda U}{e^{\lambda t}-1} = \frac{-\Delta U}{\Delta t} \]

This can be rewritten in terms of a fractional change in \(U\):

\[ \frac{\Delta U}{U} = \frac{\lambda \Delta t}{(e^{\lambda t} - 1)} \]

Let's plug in some numbers to see how it works. Consider a reload rate of once per 4 hours (\(\lambda = 1/4\)) and a story that is 2 hours old. The formula says the required fractional change in \(U\) is:

\[ \frac{\Delta U}{U} = \frac{\frac{1}{4} \Delta t}{(e^{1/2} - 1)} = 0.385 \Delta t \]

That is, to maintain its rank value in the next hour (\(\Delta t = 1\)), the story will need to accumulate 38.5% more upvotes. (Why is that figure less than 50%? In this model, later votes are more difficult to acquire than earlier votes. As time progresses, more and more users have already had the chance to vote on a given item, which means that later votes are in a sense more "valuable".)
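The arithmetic above is easy to check by hand, with \(\lambda\) and \(t\) as in the example; the helper name here is made up:

```python
import math

def required_growth_rate(lam, t):
    # Fractional upvote growth (Delta-U/U per unit time) needed to
    # hold rank constant: lambda / (e^(lambda*t) - 1)
    return lam / math.expm1(lam * t)

rate = required_growth_rate(0.25, 2.0)  # about 0.385, i.e. 38.5% per hour
```

Consistent with the discussion above, the required fractional growth falls as the story ages: `required_growth_rate(0.25, 4.0)` is smaller than the 2-hour figure.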

Conclusion

If you run a social news site, consider ranking your stories using a formal model rather than a hand-me-down formula whose origins were lost in the mists of time. The present model isn't all that sophisticated, but it should provide a good starting point as you hone and tweak your site to provide just the right mix of things that are new and things that are good.


Footnote

  1. We can model this probability as a Bayesian belief about a count parameter with a gamma prior. The mean of that distribution is just the number of events divided by the exposure.

