Membership Testing False Positive Rate: HyperLogLog
Edwin Chen edited this page Dec 17, 2017
Here's a plot of the false positive rate of HyperLogLog on simulated data (in blue), with the expected false positive rate overlaid (in black):
Let `b` be the number of binary address bits used to map elements to buckets, so that `B = 2^b` is the number of buckets, and let `n` be the number of distinct elements inserted into the HyperLogLog so far. Given a new element `x`, we compute a hash `h(x)`, use the first `b` bits of `h(x)` to map `x` into a bucket, and let `rho(x)` be the position of the first "1" in the binary representation of the remaining bits of `h(x)`.
- If the value stored in `x`'s bucket is already greater than or equal to `rho(x)`, we say that the HyperLogLog already contains `x` (and are wrong with some false positive probability).
- Otherwise, we replace the value in the bucket with `rho(x)`, and say that this is our first encounter with `x` (and we are always correct).
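
The membership test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the plot: the 64-bit hash width, the `hash64` helper (a truncated SHA-256), the `MembershipHLL` class name, and the convention that an all-zero remainder gets `rho = (number of remaining bits) + 1` are all assumptions introduced here.

```python
import hashlib
from typing import Tuple

HASH_BITS = 64  # assumption: a 64-bit hash; the page does not fix a width


def hash64(item: str) -> int:
    """Hypothetical hash: the first 8 bytes of SHA-256 as an unsigned integer."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_and_rho(h: int, b: int) -> Tuple[int, int]:
    """Split a HASH_BITS-bit hash into (bucket index, rho)."""
    rest_bits = HASH_BITS - b
    bucket = h >> rest_bits              # the first b bits select the bucket
    rest = h & ((1 << rest_bits) - 1)    # the remaining bits
    # 1-based position of the first 1 bit in `rest`, read from the left.
    # An all-zero remainder yields rest_bits + 1 (assumed convention).
    rho = rest_bits - rest.bit_length() + 1
    return bucket, rho


class MembershipHLL:
    """HyperLogLog registers used as an approximate membership test."""

    def __init__(self, b: int):
        self.b = b
        self.registers = [0] * (1 << b)  # B = 2^b buckets, all initially 0

    def check_and_add(self, item: str) -> bool:
        """Return True if the sketch claims `item` was seen before.

        Otherwise record the item and return False.
        """
        bucket, rho = bucket_and_rho(hash64(item), self.b)
        if self.registers[bucket] >= rho:
            return True                  # may be a false positive
        self.registers[bucket] = rho
        return False                     # a first encounter, always correct
```

Calling `check_and_add` twice with the same item returns `False` and then `True`, since the first call raises the bucket's value to at least `rho(x)`.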
The false positive rate for checking set containment of an element `x` is:
P(false positive)
= P(some element e, with rho(e) >= rho(x), was hashed into the same bucket as x)
= SUM_k P[rho(x) = k] * P(some element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)
= SUM_k P[rho(x) = k] * [1 - P(no element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)]
= SUM_k P[rho(x) = k] * [1 - {1 - P(e has rho(e) >= k and e was hashed into the same bucket as x | rho(x) = k)}^n]
Now, assuming the hash bits are uniform and independent, for any element z:
P[rho(z) = k] = 1 / 2^k
P[rho(z) >= k] = 1 / 2^(k-1)
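
As a sanity check, these two probabilities can be verified by exhaustive enumeration over all bit strings of a fixed length. The sketch below assumes `m` remaining bits and, as an added convention, assigns `rho = m + 1` to the all-zero string. The names `rho` and `rho_distribution` are illustrative only.

```python
from fractions import Fraction


def rho(bits: int, m: int) -> int:
    """1-based position of the first 1 in an m-bit string (m + 1 if all zero)."""
    return m - bits.bit_length() + 1


def rho_distribution(m: int) -> dict:
    """Exact distribution of rho over all 2^m equally likely bit strings."""
    counts = {}
    for bits in range(1 << m):
        k = rho(bits, m)
        counts[k] = counts.get(k, 0) + 1
    total = 1 << m
    return {k: Fraction(c, total) for k, c in counts.items()}


m = 10
dist = rho_distribution(m)
for k in range(1, m + 1):
    # P[rho = k] = 1 / 2^k
    assert dist[k] == Fraction(1, 2 ** k)
    # P[rho >= k] = 1 / 2^(k-1)
    assert sum(p for j, p in dist.items() if j >= k) == Fraction(1, 2 ** (k - 1))
```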
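So, since each previously inserted element lands in x's bucket with probability 1/B, independently of its rho: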
P(false positive)
= SUM_k (1 / 2^k) * [1 - {1 - 1 / (B*2^(k-1))}^n]
~ SUM_k (1 / 2^k) * [1 - e^{-n / (B * 2^(k-1))}]
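
Both expressions are easy to evaluate numerically. The following sketch computes the exact sum and its exponential approximation. The function names and the truncation of the sum at `max_k = 64` are assumptions made here, chosen to match a 64-bit hash. Terms beyond that are negligible in floating point.

```python
import math


def fp_rate_exact(n: int, b: int, max_k: int = 64) -> float:
    """SUM_k (1 / 2^k) * [1 - {1 - 1 / (B * 2^(k-1))}^n], truncated at max_k."""
    B = 2 ** b
    return sum(
        (1 / 2 ** k) * (1 - (1 - 1 / (B * 2 ** (k - 1))) ** n)
        for k in range(1, max_k + 1)
    )


def fp_rate_approx(n: int, b: int, max_k: int = 64) -> float:
    """SUM_k (1 / 2^k) * [1 - e^{-n / (B * 2^(k-1))}], truncated at max_k."""
    B = 2 ** b
    return sum(
        (1 / 2 ** k) * (1 - math.exp(-n / (B * 2 ** (k - 1))))
        for k in range(1, max_k + 1)
    )


if __name__ == "__main__":
    b = 10  # B = 1024 buckets, an arbitrary example value
    for n in (100, 1_000, 10_000, 100_000):
        print(n, fp_rate_exact(n, b), fp_rate_approx(n, b))
```

The rate is 0 for an empty sketch and climbs toward 1 as `n` grows relative to `B`, which is why the membership test is only useful while the sketch is sparsely filled.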