SetSketches saved from different processes have jaccard estimation of 0 #74

Open
nvanva opened this issue Sep 22, 2023 · 2 comments

Comments

@nvanva

nvanva commented Sep 22, 2023

Hi! I'm using CSetSketch from Python. When I create, fill, and save sketches for 2 sets in the same process, loading them from disk and computing the Jaccard estimate works well. But when the 2 sets are processed in different processes, the estimate is 0. Cardinality estimation works well in both cases.
Here is a minimal example that reproduces this:

import os
import sketch

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))

for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))
    
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

Run this code twice, in 2 different processes. Then run:

from pathlib import Path
import sketch

# Compare every saved tmp1_* sketch with every saved tmp2_* sketch
for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)

In my case it prints the following (the file name suffix is the PID of the process that wrote the sketch):
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625
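
For reference, the exact Jaccard index of these two sets is 100/600 ≈ 0.1667 (500 multiples of 2, 200 multiples of 5, 100 multiples of 10), so the same-process estimates are accurate. The pattern above is what one would expect if each process hashed items differently. The following is only a toy illustration of that effect, not CSetSketch's actual algorithm: `hash64`, `toy_sketch`, and `toy_jaccard` are hypothetical helpers, and the seeded BLAKE2b hash stands in for whatever hash a process uses. Each sketch is internally consistent on its own, which would explain why cardinality estimates stay fine, but registers built with different hash functions never match.

```python
import hashlib

def hash64(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of a string; `seed` selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def toy_sketch(items, m, seed):
    """Keep the minimum hash value seen in each of m buckets (None = empty)."""
    registers = [None] * m
    for item in items:
        h = hash64(item, seed)
        bucket, value = h % m, h // m
        if registers[bucket] is None or value < registers[bucket]:
            registers[bucket] = value
    return registers

def toy_jaccard(a, b):
    """Fraction of matching registers among buckets used by either sketch."""
    used = [(x, y) for x, y in zip(a, b) if x is not None or y is not None]
    matches = sum(1 for x, y in used if x == y)
    return matches / len(used)

evens = [str(i) for i in range(2, 1001, 2)]
fives = [str(i) for i in range(5, 1001, 5)]
m = 256

# Both sketches built with the same hash function: estimate near 1/6.
same = toy_jaccard(toy_sketch(evens, m, seed=1), toy_sketch(fives, m, seed=1))
# Sketches built with different hash functions: registers do not line up.
mixed = toy_jaccard(toy_sketch(evens, m, seed=1), toy_sketch(fives, m, seed=2))
print(same, mixed)
```

Here the two seeds play the role of the two processes: comparing sketches built with the same seed gives a usable estimate, while mixing seeds drives the estimate to (essentially) zero.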

@nvanva
Author

nvanva commented Sep 24, 2023

The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing the strings explicitly with a fixed seed and adding the resulting hashes with CSetSketch.add() solves the problem:

%%timeit
import os
import sketch
import mmh3

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000
for i in range(step1, maxval1+1, step1):
    # Hash the string with a fixed seed instead of passing str(i) to addh()
    o, _ = mmh3.hash64(str(i), seed=0, signed=False)
    hll.add(o)

for i in range(step2, maxval2+1, step2):
    # Same fixed-seed hash, so both sketches use identical hash values
    o, _ = mmh3.hash64(str(i), seed=0, signed=False)
    hll2.add(o)
    
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

The speed remains almost the same. Maybe CSetSketch.addh() could be fixed by using MurmurHash3 with a fixed seed?
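
One possible explanation, stated here as an assumption rather than something confirmed in this thread: if addh() relies on Python's built-in `hash()`, the behavior above would follow from CPython's hash randomization, which salts `str` hashes with a per-process random value unless `PYTHONHASHSEED` is set. The minimal sketch below only demonstrates that documented interpreter behavior; `str_hash_in_new_process` is a hypothetical helper and does not touch the sketch library.

```python
import os
import subprocess
import sys

def str_hash_in_new_process(text, hash_seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # default: random per-process salt
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# Two separate processes with default settings: the values normally differ.
print(str_hash_in_new_process("10"), str_hash_in_new_process("10"))

# Two separate processes with the same PYTHONHASHSEED: the values agree.
print(str_hash_in_new_process("10", 0), str_hash_in_new_process("10", 0))
```

If that assumption holds, running both writer processes with the same `PYTHONHASHSEED` would be another workaround, though hashing explicitly with a fixed seed as above does not depend on the environment.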

@dnbaker
Owner

dnbaker commented Sep 25, 2023 via email
