Add intersection and similarity #30

erp12 · 2019-06-04T17:50:17Z

First of all, this library is great. Thank you!

I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.

Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash

It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.

hll1.intersection(hll2)
hll1.similarity(hll2)

ascv · 2019-06-04T18:32:32Z

These features seem fine though I will need to more research on implementations for computing intersection cardinality.

kuk · 2021-12-10T07:16:54Z

Seems like there is no need for special implementations for computing intersection cardinality

cardinality(A | B) = cardinality(A) + cardinality(B) - cardinality(A & B)
-> cardinality(A & B) = cardinality(A) + cardinality(B) - cardinality(A | B)

For cardinality(A | B) you have already implemented .merge method.

ascv · 2021-12-10T09:04:37Z

@kuk

HyperLogLogs do not store the set, which is why they are memory efficient, so you cannot compute expressions like cardinality(A ∪ B) or cardinality(A ∩ B) with the data available.

Substituting the registers of the HyperLogLog for these expressions (e.g. cardinality(A ∪ B) --> cardinality(registers(A) ∪ registers(B))) doesn't work because the registers only store the maximum leading zero count observed. There is no way to know what values were previously observed which would be needed to compute an intersection or union of the sets.

For example:

Let H = a HyperLogLog 
Let A = {1, 2, ..., 2^32}
Let R = registers of H

Suppose k=4. Then max(cardinality(R)) = 2^4 since there are 2^4 registers which can at most store 2^4 distinct values. However cardinality(A) = 2^32 so it follows that cardinality(A) > max(cardinality(R)) and therefore cardinality(A) != cardinality(R).

kuk · 2021-12-10T09:11:29Z

... cannot compute expressions like cardinality(A ∪ B) ...

Wait, then what .merge is for? I thought is it for computing cardinality(A ∪ B)

ascv · 2021-12-10T09:33:24Z

Yes but this isn't done using a union operation. Instead there is a register by register comparison and the maximum value is the winner. Then the cardinality estimation of the merged HyperLogLog approximates the true cardinality of the union of the sets.

ascv · 2021-12-10T09:39:29Z

I should add that because you take the maximum value estimating the cardinality of a union is straightforward since the algorithm only needs the maximum values. However for intersections the maximum value doesn't give you sufficient information to accurately and reliably estimate the intersection cardinality. Of course, this is an area of active research so that could change in the future.

kuk · 2021-12-10T09:47:07Z

Knowing cardinality of sets union you can find cardinality of sets intersection. There is a formula in math:

cardinality(A ∩ B) = cardinality(A) + cardinality(B) - cardinality(A ∪ B)

For cardinality(A ∪ B) we have .merge, for cardinality(X) we have .cardinality.

ascv · 2021-12-10T11:34:59Z

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get. Now extending this example, what happens when you use this approach on 1000 sets? Even if the error is low it is easy to see how the error quickly compounds to produce inaccurate results.

This was why my original preference was to "wait and see" and let people implement intersection estimates themselves. In the case of two sets, using inclusion-exclusion principle, it is trivial to do:

card_a = a.cardinality()
card_b = b.cardinality()
a.merge(b)
card_ab = card_a + card_b - a.cardinality()

All this being said, you're not the first person to ask for an intersection feature. Maybe I should re-consider my original stance since "perfect is the enemy of good." If we used inclusion-exclusion, I would need to put a disclaimer in the README regarding the accuracy of this method.

kuk · 2021-12-10T11:40:06Z

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get

Oh, yes good point, did not consider nonoptimal error estimate

jianshu93 · 2023-12-06T23:05:36Z

Hello All,

I think the answer is not to use that math equation above (also in the Dashing paper) for Jaccard index, which is know to have much larger variation than MinHash. the joint MLE estimator for cardinality in the Ertl 2017 paper is the most accurate, a well written paper with example code in c++, a C version can be derived from my perspective. FYI the Ultraloglog (https://arxiv.org/abs/2308.16862) is out with new FGRA estimator, even more space efficient while more accurate than the improved estimator and raw estimator, the closest one to the Camera-Rao lower bound (MLE method meets the lower bound but very slow, see paper for details). Should be very useful if we have a C-python version to use. The best for set similarity is MinHash (more recent is the optimal densification MinHash), the much smaller variance than whatever HLL based Jaccard, limited by the sqrt(1.079/m) RMSE bound.

Thanks,

Jianshu

ascv · 2024-04-17T22:55:52Z

@jianshu93 Thanks for bringing to my attention this paper. I gave it a quick skim and it looks promising. I think it would be best implemented as a separate class (e.g. UltraLogLog). This project is basically a side hobby of mine that I work on very infrequently so I can't make any promises. The current plan is to fix #46 and then I'll look into the UltraLogLog implementation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add intersection and similarity #30

Add intersection and similarity #30

erp12 commented Jun 4, 2019

ascv commented Jun 4, 2019 •

edited

Loading

kuk commented Dec 10, 2021

ascv commented Dec 10, 2021 •

edited

Loading

kuk commented Dec 10, 2021

ascv commented Dec 10, 2021

ascv commented Dec 10, 2021 •

edited

Loading

kuk commented Dec 10, 2021 •

edited

Loading

ascv commented Dec 10, 2021 •

edited

Loading

kuk commented Dec 10, 2021

jianshu93 commented Dec 6, 2023 •

edited

Loading

ascv commented Apr 17, 2024 •

edited

Loading

Add intersection and similarity #30

Add intersection and similarity #30

Comments

erp12 commented Jun 4, 2019

ascv commented Jun 4, 2019 • edited Loading

kuk commented Dec 10, 2021

ascv commented Dec 10, 2021 • edited Loading

kuk commented Dec 10, 2021

ascv commented Dec 10, 2021

ascv commented Dec 10, 2021 • edited Loading

kuk commented Dec 10, 2021 • edited Loading

ascv commented Dec 10, 2021 • edited Loading

kuk commented Dec 10, 2021

jianshu93 commented Dec 6, 2023 • edited Loading

ascv commented Apr 17, 2024 • edited Loading

ascv commented Jun 4, 2019 •

edited

Loading

ascv commented Dec 10, 2021 •

edited

Loading

ascv commented Dec 10, 2021 •

edited

Loading

kuk commented Dec 10, 2021 •

edited

Loading

ascv commented Dec 10, 2021 •

edited

Loading

jianshu93 commented Dec 6, 2023 •

edited

Loading

ascv commented Apr 17, 2024 •

edited

Loading