SlowerLogLog

February 6, 2020

HyperLogLog is a probabilistic data structure for counting unique elements in a set. Here I describe a variation of HyperLogLog, which I call SlowerLogLog.

SlowerLogLog is not novel. As the name implies, it’s more computationally expensive than HyperLogLog. (I won a couple scratch-off tickets last spring, so it’s fine.) SlowerLogLog can be derived in a few lines of math, and implemented in a small amount of code, so it might be interesting if you had trouble following the complex analysis in the original HyperLogLog paper, as I did.

SlowerLogLog has the following advantages over HyperLogLog:

The cardinality, or number of unique elements, is easily estimated via Maximum Likelihood Estimation (MLE)
The standard error of the cardinality estimate is easily estimated via MLE
It works with small cardinalities (i.e., does not need to resort to a separate algorithm for small cardinalities, as does HyperLogLog)
The number of registers need not be a power of 2

What’s the appeal of Maximum Likelihood Estimation? MLE is a statistical framework with well-known properties, that for small problems requires little more than high-school calculus to arrive at a consistent estimator. (I’ll walk you through a derivation below.) MLE also provides guarantees on the variance of the estimator, provided a few regularity conditions are met.

SlowerLogLog maintains the following property of HyperLogLog:

Independent sketches (data structures) are trivially merged

However, updating the SlowerLogLog data structure is much more computationally expensive than HyperLogLog. It requires that each inserted element be hashed not once, but rather \(M\) times, where \(M\) is the number of registers, which for typical applications, will range in the thousands. The name is intended to deter implementers from using SlowerLogLog in production, but ease of implementation (less than 100 lines of code in most languages) may make it a compelling time sink for bored programmers.

Motivation

A HyperLogLog data structure consists of \(M\) registers. Incoming elements are hashed, and the first few bits of the hash are used to assigned the element to a register. The remaining bits of the hash are used to update the register: each register contains the maximum number of leading zeros of all the hash values that it was assigned.

In the original paper [1], producing an estimate from the registers requires:

Deciding whether to use Linear Counting instead, and if not:
Computing a harmonic mean of the registers, multiplying by \(M^2\), and then:
Multiplying by an appropriate bias-correction constant based on the number of registers

The bias-correction constant in step 3 is provided by the HLL authors as a set of “magic numbers”, ultimately derived from an infinite integral that will appear talismanic to the casual observer:

\[ \alpha_m = \left( m\int_0^\infty \left(\log_2\left(\frac{2+x}{1+x}\right)\right)^m dx \right)^{-1} \]

A conceptually more elegant estimator, based on Maximum Likelihood Estimation, is derived in [2]. However, numerical estimation of HyperLogLog via MLE is not straightforward, as HyperLogLog’s registers are not independent. (If an element is assigned to register \(i\), then we know it’s not assigned to register \(j\).)

The purpose of SlowerLogLog is to provide a data structure similar to HyperLogLog, but that is more amenable to Maximum Likelihood Estimation. Registers in SlowerLogLog have the same function as those in HyperLogLog – counting leading zeros in hash values – but in SlowerLogLog, every element is assigned to every register, using iterative hash values. Thus unlike HyperLogLog, SlowerLogLog's registers can be treated as statistically independent observations.

The algorithms described below are essentially variants of the “Geometric random variables” model studied in Section 3.2.2 of [3], setting \(q=0.5\).

Update Algorithm

The SlowerLogLog algorithm does not “bucketize” the input. Rather, each element is assigned to the first register, and updated as in HyperLogLog. Then the element is re-hashed (taking a hash value of the first hash value), and the resulting hash value is used to update the second register. This process is repeated, until every register \(m\) has been updated with the \(m\)^th hash value of the original input.

Here is the update algorithm in Julia. It assumes that the elements are stored in values, and the number of registers is stored in m:


    registers = zeros(m)

for v in values
    h = hash(v)
    for i = 1:length(registers)
        k = leading_zeros(h)
        if k > registers[i]
            registers[i] = k
        end
        h = hash(h)
    end
end

Iterative hashing assures that identical inputs present the same stream of hash values to the registers. However, the iterative hash itself is not strictly necessary to the algorithm, so long as each register receives a different, deterministic, and uniformly distributed hash value for each input value. An alternative would be to add the register index to the input value before hashing; such a change would then make it possible to parallelize the register update logic.

Merge Algorithm

Merging two sketches (sets of registers) is the same as in HyperLogLog. Each register in the merged sketch takes the maximum value of the corresponding registers in the input sketches.

Estimation Algorithm

The values in the SlowerLogLog registers can be described by a simple likelihood function. Let \(g(n, k)\) be the probability that a register which has seen \(n\) distinct elements will contain the value \(k\). As in Proposition 1 of [1],

\[ \begin{equation} \label{g} g(n, k) = \left(1 - \frac{1}{ 2^{k+1} }\right)^n - \left(1 - \frac{1}{2^k}\right)^n \end{equation} \]

You can think of the above expression as the probability that none of the \(n\) registers have seen a hash value with at least \(k+1\) leading zeros (else the register would have a value of \(k+1\) or higher), minus the probability that none of the \(n\) registers have seen a hash value with at least \(k\) leading zeros (i.e, subtract the probability that the register has a value of \(k-1\) or lower).

The likelihood function of a single register having value \(k\) is:

\[ f(n|k) = g(n, k) \]

Because the registers are statistically independent, the likelihood function for \(M\) registers, each having a value \(k_m\), is the product:

\[ L(n|K) = \prod_{m=1}^M g(n, k_m) \]

Estimation proceeds via the usual Maximum Likelihood methods. First take the log-likelihood:

\[ \ln L(n|K) = \sum_{m=1}^M \ln g(n, k) \]

Maximize this expression by differentiating and setting to zero:

\[ \begin{equation} \label{jacobian} \frac{d \ln L(n|K)}{d n} = \sum_{m=1}^M g'(n, k_m) / g(n, k_m) = 0 \end{equation} \]

Where \(g'(n, k)\) is the derivative of \(\eqref{g}\) with respect to \(n\), or in explicit terms:

\[ g'(n, k) = \left( 1 - \frac{1}{ 2^{k+1} } \right)^n \ln \left(1 - \frac{1}{ 2^{k+1} }\right) - \left( 1 - \frac{1}{2^{k}} \right)^n \ln \left(1 - \frac{1}{2^{k}}\right) \]

The equation \(\eqref{jacobian}\) can be solved numerically for \(n\) in a few iterations using Newton's method. To do this, first compute the derivative of \(\eqref{jacobian}\):

\[ \frac{d^2 L(n|K)}{dn^2} = \sum_{m=1}^M \frac{ g(n, k) g''(n, k) - (g'(n, k)^2 ) }{ g(n, k)^2 } \]

Where \(g''(n, k)\) is the second derivative of \(g(n, k)\) with respect to \(n\), or:

\[ g''(n, k) = \left(1-\frac{1}{2^{k+1}}\right)^n \ln^2 \left(1-\frac{1}{2^{k+1}}\right) - \left(1-\frac{1}{2^{k}}\right)^n \ln^2 \left(1-\frac{1}{2^{k}}\right) \]

Form an initial guess, perhaps using a harmonic mean of the register values, then iterate:

\[ n_{i+1} = n_i - \frac{dL(n_i|K)/dn}{d^2 L(n_i|K)/dn^2} \]

A fixed number of iterations (say, 8) will do the job, or test a convergence criterion to arrive at an estimate \(\hat{n}\).

Here’s the outermost loop of my Julia implementation:


    function estimate(registers)
    n = harmonic_mean(registers)

    for j = 1:8
        n -= jacobian(registers, n) / hessian(registers, n)
    end

    return n
end

(For the uninitiated, Jacobian and Hessian are fancy names for first and second derivative matrices, but the “matrix” in this application is 1 × 1.)

Variance of the Estimate

The variance of an MLE estimator is equal to the expected negative inverse of its information matrix. That quantity can be estimated numerically using the second derivative at the estimate itself:

\[ {\rm Var}[\hat{n}] = -\left( \frac{d^2 L(\hat{n}|K)}{dn^2} \right)^{-1} = -\left( \sum_{m=1}^M \frac{ g(\hat{n}, k) g''(\hat{n}, k) - (g'(\hat{n}, k))^2 ) }{ g(\hat{n}, k)^2 } \right)^{-1} \]

If you used Newton's Method to solve for \(\hat{n}\), you will already have code implementing the expression in parentheses, and so reporting the variance will be trivial. Here’s mine:


    function variance(estimate, registers)
    return -1/hessian(registers, estimate)
end

If you’re not satisfied with that variance estimator, see [3] for an analytic expression for the asymptotic variance, although its infinite summation may raise the hairs on the back of your neck.

Implementation

I’ve implemented the above algorithms as a UNIX-style (unique) line-counting tool in both Julia and C. The repository is here. You may notice that the output of the two programs do not agree for the same input; this discrepancy is due to the use of different string-hashing algorithms. Try piping in the output of find, like this:


    $ find /tmp/ | ./slowcount
18.09 ± 0.42

By way of comparison,

$ find /tmp/ | sort | uniq | wc -l
    18

The programs as written are not especially accurate for \(n \lt 5\), but I think you’ll manage. Most of the compute time goes into updating the registers, as expected. I am very proud of myself for using the compiler intrinsic __builtin_ctz to count zeros, but this optimization had more to do with simplifying the code than reducing the electric bill.

A good challenge would be to see if you could get the computation time of SlowerLogLog competitive with HyperLogLog, perhaps by recycling unused hash bits and parallelizing portions. But I could count at least a dozen distinct better uses of your time. (Give or take a couple.)

References

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Simple Sequential A/B Testing
Evaluating Splatoon’s Ranking System
Deriving the Reddit Formula
Bayesian Average Ratings
Ranking Items With Star Ratings: An Approximate Bayesian Approach

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter