SetSketches saved from different processes have jaccard estimation of 0 #74

nvanva · 2023-09-22T00:02:10Z

Hi! I'm using CSetSketch from python. I noticed that when I create, fill and then save this structure on disk for 2 sets in the same process, then loading it from disk and calculating jaccard estimation works well. But when 2 sets are processed in different processes, then the estimate is 0. For cardinality estimation everything works well in both cases.
Here is a minimal example showing this:

import os
import sketch

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))

for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))
    
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

Run this code twice in 2 different process. Then run:

from pathlib import Path
for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)

It will print this in my case:
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625

The text was updated successfully, but these errors were encountered:

nvanva · 2023-09-24T20:30:16Z

The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.

%%timeit
import os
import sketch
import mmh3

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000
for i in range(step1, maxval1+1, step1):
#     o = str(i)
    o,_ = mmh3.hash64(str(i), seed=0, signed=False,)
    hll.add(o)

for i in range(step2, maxval2+1, step2):
#     o = str(i)
    o, _ = mmh3.hash64(str(i), seed=0, signed=False,)
    hll2.add(o)
    
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?

dnbaker · 2023-09-25T18:05:15Z

Hi! That's right, the default addh uses Python's hash function which is seeded at each invocation. I should modify this to be consistent. In the short term, you can convert to a numpy array and hash it (from_np and from_shs should provide this). from_shs expects already hashed data (no hashing), and from_np hashes. I just used python's hash so it could work on any python object. I'll let you know when it's updates. Thanks for the issue! Best, Daniel

…

On Sunday, September 24, 2023, Nikolay Arefyev ***@***.***> wrote: The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem. %%timeit import os import sketch import mmh3 m = 2**18 hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m) step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000 for i in range(step1, maxval1+1, step1): # o = str(i) o,_ = mmh3.hash64(str(i), seed=0, signed=False,) hll.add(o) for i in range(step2, maxval2+1, step2): # o = str(i) o, _ = mmh3.hash64(str(i), seed=0, signed=False,) hll2.add(o) hll.write(f'tmp1_{os.getpid()}') hll2.write(f'tmp2_{os.getpid()}') The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed? — Reply to this email directly, view it on GitHub <#74 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABQ5UVJVSEOYAH2N7UX2TY3X4CJ6HANCNFSM6AAAAAA5CJSOLM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SetSketches saved from different processes have jaccard estimation of 0 #74

SetSketches saved from different processes have jaccard estimation of 0 #74

nvanva commented Sep 22, 2023 •

edited

Loading

nvanva commented Sep 24, 2023

dnbaker commented Sep 25, 2023 via email

SetSketches saved from different processes have jaccard estimation of 0 #74

SetSketches saved from different processes have jaccard estimation of 0 #74

Comments

nvanva commented Sep 22, 2023 • edited Loading

nvanva commented Sep 24, 2023

dnbaker commented Sep 25, 2023 via email

nvanva commented Sep 22, 2023 •

edited

Loading