Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better distances using counts #41

Open
bovee opened this issue Dec 3, 2019 · 1 comment
Open

Better distances using counts #41

bovee opened this issue Dec 3, 2019 · 1 comment

Comments

@bovee
Copy link
Contributor

bovee commented Dec 3, 2019

Right now we're doing purely set-based calculations, but we also have count information for each item in the set that we could use to improve the overall distance estimate. I wrote the following (not working anymore) code out as test, but something like this may work decently (as compared to e.g. cosine distances that may have different sets of issues).

    // if the original max_min parameter is larger than the max hash in either
    // table, then this will panic b/c i or j will point outside; note that i and j at the
    // end are slightly off from what they should be too (see the current distance
    // function for better accounting)
    while (sketch1[i].hash <= max_hash_value) && (sketch2[j].hash <= max_hash_value) {
        if sketch1[i].hash < sketch2[j].hash {
            s1_complement += u64::from(sketch1[i].count);
        } else if sketch2[j].hash < sketch1[i].hash {
            s2_complement += u64::from(sketch2[j].count);
        } else {
            s1_intersect += u64::from(sketch1[i].count);
            s2_intersect += u64::from(sketch2[j].count);
            i += 1;
            j += 1;
        }
    }
    // the normalization here is kind of made-up, but we do need some way to
    // make sure that sequencing depth differences don't impact the comparison
    let containment: f64 = s1_intersect as f64 / (s1_intersect as f64 + s1_complement as f64);
    let common = s1_intersect;
    // we're scaling the s2 counts to the range of s1 that they share here
    let total = s1_complement + s1_intersect + (s2_intersect + s2_complement) * s1_intersect / s2_intersect;
    let jaccard: f64 = common as f64 / total as f64;

Ioffe also has a paper on weighted MinHash that might be useful to understand here too: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36928.pdf

@jianshu93
Copy link

Hello,

Have you implement this weighted idea and merged into the main branch? except the one you point out, baminhash and probminhash are also weighted set method.

Jianshu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants