SetSketch implementations #73
Hi - The SetSketch is a great option. I think it's the best way to sketch and compare unweighted sets. In this library, we provide two implementations:
The simplest approach is just to use If you want the additional speed that can come from using SetSketch by packing more data into fewer bytes, then you have two options.
For 1, we have some defaults for different sizes (ByteSetS, ShortSetS, UintSetS) which work well in many applications. You could just run with that. For 2, you would use 2 takes more space a priori but yields higher-accuracy comparisons. From Python you have fewer options, at least at construction. I suggest the It would be helpful if more were exposed to the Python interface; when I get time, I will expose the And lastly for relative error, it would be helpful to add it. The formula is:
For CSetSketch (b = 1), it's simply 1/sqrt(m). Then it increases slightly with higher b values. I'll add this functionality to a future edition as well. Thanks for the question, and don't hesitate to ask any more you might have! Best, Daniel
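The formula itself is not reproduced in this thread. As a minimal sketch, the snippet below assumes it is the relative standard deviation of the cardinality estimate from the SetSketch paper, sqrt(((b + 1)/(b - 1)) ln(b) - 1) / sqrt(m), which agrees with the 1/sqrt(m) limit at b = 1 mentioned above. The function name `setsketch_relative_error` is hypothetical and is not part of the library.

```python
import math


def setsketch_relative_error(m: int, b: float) -> float:
    """Approximate relative standard error of a SetSketch cardinality estimate.

    Assumption: uses sqrt(((b + 1) / (b - 1)) * ln(b) - 1) / sqrt(m),
    the expression given in the SetSketch paper. This helper is
    illustrative only and is not a function of the library.
    """
    if m <= 0:
        raise ValueError("m (number of registers) must be positive")
    if b < 1.0:
        raise ValueError("b must be >= 1")
    if math.isclose(b, 1.0):
        # Limit b -> 1: the factor under the root tends to 1.
        return 1.0 / math.sqrt(m)
    # log1p(b - 1) == ln(b), but is more accurate when b is close to 1.
    factor = (b + 1.0) / (b - 1.0) * math.log1p(b - 1.0) - 1.0
    return math.sqrt(factor / m)


if __name__ == "__main__":
    for base in (1.0, 1.001, 1.2, 2.0):
        print(base, setsketch_relative_error(4096, base))
```

Under this assumption, b = 2 gives roughly 1.04/sqrt(m), the familiar HyperLogLog figure, so moving from b = 1 to larger bases costs only a few percent of accuracy per register.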
Thanks a lot for your recommendations. I will use CSetSketch because, for now, space does not seem to be an issue, while precision is important. I'm still wondering about the following questions:
The CSetSketch is the un-truncated SetSketch. If you set b = 1, you'll recover it. It's simpler to compute because it doesn't need to compute as many thresholds. I found that lazy truncation gave me more flexibility in our genomic applications in Dashing2. CSetSketch is a MinHash with independent registers that can be computed efficiently thanks to early stopping, so you can just use the MinHash bounds directly. Its advantages are faster computation than standard k-mins MinHash and faster comparison than bottom-k hashing. The independent registers are also more powerful than bottom-k for LSH index construction, because you can group them into stronger hash functions.
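To illustrate what "MinHash with independent registers" buys you, here is a small pure-Python sketch. It is not the library's implementation: it uses a naive k-mins MinHash (one salted hash per register, no early stopping), and every function name in it is hypothetical. It shows the Jaccard estimate as the fraction of matching registers, the usual binomial standard error for that estimate, and how registers can be grouped into bands to form stronger LSH keys.

```python
import hashlib
import math


def kmins_sketch(items, m=256):
    """Naive k-mins MinHash: register i keeps the minimum of hash_i over all items.

    Illustrative stand-in only; CSetSketch reaches an equivalent kind of
    sketch far more efficiently.
    """
    registers = [2**64] * m  # larger than any 64-bit hash value
    for item in items:
        data = item.encode("utf-8")
        for i in range(m):
            # A distinct salt per register gives independent hash functions.
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            value = int.from_bytes(digest, "little")
            if value < registers[i]:
                registers[i] = value
    return registers


def jaccard_estimate(regs_a, regs_b):
    """Fraction of registers that agree between two sketches of equal size."""
    matches = sum(x == y for x, y in zip(regs_a, regs_b))
    return matches / len(regs_a)


def jaccard_std_error(j, m):
    """Binomial standard error of the MinHash Jaccard estimate with m registers."""
    return math.sqrt(j * (1.0 - j) / m)


def lsh_band_keys(registers, band_size):
    """Group consecutive registers into tuples; each tuple is one LSH key."""
    stop = len(registers) - band_size + 1
    return [tuple(registers[i:i + band_size]) for i in range(0, stop, band_size)]


if __name__ == "__main__":
    a = kmins_sketch(f"url{i}" for i in range(0, 300))
    b = kmins_sketch(f"url{i}" for i in range(100, 400))
    j = jaccard_estimate(a, b)  # true Jaccard is 200 / 400 = 0.5
    print(j, jaccard_std_error(j, len(a)))
    print(len(lsh_band_keys(a, 4)), "band keys per sketch")
```

Two sketches collide on a band only if all registers in that band match, so larger bands trade recall for precision when building an LSH index.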
Hi! Thank you for this wonderful library. I am working on estimating the overlap between different web crawls. This basically requires estimating the number of unique URLs in lists of 10-100 billion URLs and the cardinalities of their intersections. After some literature search, it seems that SetSketch is the right method for this case, as it allows both cardinality and Jaccard index estimation.
I found several implementations of SetSketch in your library (ByteSetSketch, CSetSketch, FSetSketch, ShortSetSketch). Could you please advise on how to select the appropriate one? Also, I could not find how to change the hyperparameters a, b from Python. Is it possible and reasonable to try selecting them, or is it better to rely on the default values?
The final question is about calculating confidence intervals for the estimates. The HLL implementation has the method relative_error() to get those. Is there a way to get similar estimates for SetSketch?
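For the crawl-overlap use case described in this question, the intersection size follows from the two cardinality estimates and the Jaccard estimate: since J = |A ∩ B| / (|A| + |B| - |A ∩ B|), solving for the intersection gives |A ∩ B| = J (|A| + |B|) / (1 + J). The helper names below are hypothetical, and the inputs are assumed to come from whatever sketch you use.

```python
def intersection_size(card_a: float, card_b: float, jaccard: float) -> float:
    """Estimate |A ∩ B| from |A|, |B| and the Jaccard index J.

    Derived from J = I / (|A| + |B| - I)  =>  I = J * (|A| + |B|) / (1 + J).
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    return jaccard * (card_a + card_b) / (1.0 + jaccard)


def union_size(card_a: float, card_b: float, jaccard: float) -> float:
    """Estimate |A ∪ B| = (|A| + |B|) / (1 + J)."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    return (card_a + card_b) / (1.0 + jaccard)


if __name__ == "__main__":
    # Two hypothetical crawls of 300 URLs each with Jaccard 0.5.
    print(intersection_size(300, 300, 0.5), union_size(300, 300, 0.5))
```

Note that errors in the Jaccard estimate carry over to the intersection estimate, so small overlaps between very large crawls need more registers to resolve.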