Membership Testing False Positive Rate: HyperLogLog

Here's a plot of the false positive rate of HyperLogLog on simulated data (in blue), with the expected false positive rate overlaid (in black):

False Positive Rate

Let b be the number of binary address bits used to map elements to buckets, so that B = 2^b is the number of buckets, and let n be the number of distinct elements inserted into the HyperLogLog so far. Given a new element x, we compute a hash h(x), use the first b bits of h(x) to map x into a bucket, and let rho(x) be the position of the first "1" in the binary representation of the remaining bits of h(x).

If the value stored in x's bucket is already greater than or equal to rho(x), we say that the HyperLogLog already contains x (and are wrong with some false positive probability).
Otherwise, we replace the value in the bucket with rho(x), and say that this is our first encounter with x (and we are always correct).

The false positive rate for checking set containment of an element x is:

P(false positive)
  = P(some element e, with rho(e) >= rho(x), was hashed into the same bucket as x)
  = SUM_k P[rho(x) = k] * P(some element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)
  = SUM_k P[rho(x) = k] * [1 - P(no element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)]
  = SUM_k P[rho(x) = k] * [1 - {1 - P(e has rho(e) >= k and e was hashed into the same bucket as x | rho(x) = k)}^n]

Now

P[rho(z) = k] = 1 / 2^k
P[rho(z) >= k] = 1 / 2^(k-1)

So

P(false positive)
  = SUM_k (1 / 2^k) * [1 - {1 - 1 / (B*2^(k-1))}^n]
  ~ SUM_k (1 / 2^k) * [1 - e^{-n / (B * 2^(k-1))}]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Membership Testing False Positive Rate: HyperLogLog

False Positive Rate

Clone this wiki locally