Skip to content

Membership Testing False Positive Rate: HyperLogLog

Edwin Chen edited this page Dec 17, 2017 · 7 revisions

Here's a plot of the false positive rate of HyperLogLog on simulated data (in blue), with the expected false positive rate overlaid (in black):

False Positive Rate

False Positive Rate

Let b be the number of binary address bits used to map elements to buckets, so that B = 2^b is the number of buckets, and let n be the number of distinct elements inserted into the HyperLogLog so far. Given a new element x, we compute a hash h(x), use the first b bits of h(x) to map x into a bucket, and let rho(x) be the position of the first "1" in the binary representation of the remaining bits of h(x).

  • If the value stored in x's bucket is already greater than or equal to rho(x), we say that the HyperLogLog already contains x (and are wrong with some false positive probability).
  • Otherwise, we replace the value in the bucket with rho(x), and say that this is our first encounter with x (and we are always correct).

The false positive rate for checking set containment of an element x is:

P(false positive)
  = P(some element e, with rho(e) >= rho(x), was hashed into the same bucket as x)
  = SUM_k P[rho(x) = k] * P(some element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)
  = SUM_k P[rho(x) = k] * [1 - P(no element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)]
  = SUM_k P[rho(x) = k] * [1 - {1 - P(e has rho(e) >= k and e was hashed into the same bucket as x | rho(x) = k)}^n]

Now

  • P[rho(z) = k] = 1 / 2^k
  • P[rho(z) >= k] = 1 / 2^(k-1)

So

P(false positive)
  = SUM_k (1 / 2^k) * [1 - {1 - 1 / (B*2^(k-1))}^n]
  ~ SUM_k (1 / 2^k) * [1 - e^{-n / (B * 2^(k-1))}]