Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and oxidize downsampling code in sbtmh.py #856

Merged
merged 15 commits into from
Jan 26, 2020
Merged

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jan 22, 2020

The downsampling code we use in MinHash comparisons is ugly, so refactoring it makes sense. This does a fairly broad cleanup based on adding a downsample bool option to the similarity functions.

More specifically, this PR:

  • adds 'downsample' boolean parameters to rust compare, similarity, and count_common functions;
  • adds explicit tests for a few non-trivial angular distance calculations
  • refactors 'sourmash watch' to use new Index.search function instead of SBT-specific interface
  • simplifies and cleans up sbt and sbtmh functionality around downsampling in searches, and removes some unused functionality that is confusing.
  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Jan 22, 2020

Codecov Report

Merging #856 into master will decrease coverage by 0.54%.
The diff coverage is 53.54%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
- Coverage   78.43%   77.89%   -0.55%     
==========================================
  Files          94       94              
  Lines        7299     7306       +7     
==========================================
- Hits         5725     5691      -34     
- Misses       1574     1615      +41
Flag Coverage Δ
#rusttests 50% <41.93%> (-1.24%) ⬇️
Impacted Files Coverage Δ
src/core/src/errors.rs 0% <0%> (ø) ⬆️
src/core/src/ffi/minhash.rs 0% <0%> (ø) ⬆️
sourmash/_minhash.py 96.2% <100%> (-0.89%) ⬇️
sourmash/sbtmh.py 92.79% <100%> (+8.5%) ⬆️
sourmash/sbt.py 86.81% <100%> (ø) ⬆️
sourmash/commands.py 84.06% <100%> (-0.03%) ⬇️
sourmash/compare.py 100% <100%> (+10.29%) ⬆️
sourmash/signature.py 86.77% <100%> (-0.42%) ⬇️
src/core/src/sketch/minhash.rs 64.25% <35%> (-9.08%) ⬇️
src/core/src/index/mod.rs 60.91% <60%> (ø) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9de09b...6ebd166. Read the comment docs.

sourmash/sbtmh.py Outdated Show resolved Hide resolved
sourmash/sbtmh.py Outdated Show resolved Hide resolved
@luizirber
Copy link
Member

luizirber commented Jan 26, 2020

codecov/patch is failing because coverage can't track the Rust code when used inside Python (as an extension), so it looks like it was not covered (because the Rust tests don't test it). I don't think this is a blocker (since the Python code is probably covering it), so I don't think that's a blocker for merging.

Any other changes you want to do, @ctb? (other than the similarity tests you already mentioned)

@luizirber luizirber added this to the 3.2 milestone Jan 26, 2020
@luizirber
Copy link
Member

I also bumped the core version to 0.4.0, since some methods' signatures changed

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

@luizirber what do you think about 2e16a30? This switches the error message to reference scaled, instead of max hash. While it may (temporarily L) confuse developers to have 'mismatch in scaled' reported for a mismatch in max_hash, I think the improvement in end-user UX is worth it. Although of course in some theoretical world we would never report something like this to an end user... :)

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

another question - right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

...and is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

I think we should merge this ASAP, and maybe cut a new release; and then I'll deal with the MismatchNum / downsample num code in another PR.

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

Well, and so. I was wrong above.

in #69, I claimed that we "switched to using cosine similarity for abundance comparisons". But we did not; the comment in the similarity function says, "If the sketches are abundance weighted, calculate a distance metric based on the cosine similarity."

It turns out that cosine similarity is not a distance metric, since it doesn't satisfy the triangle inequality. So instead we used "angular distance", per https://en.wikipedia.org/wiki/Cosine_similarity, section "Angular distance and similarity." This is defined as the inverse cosine of the cosine distance, divided by pi.

So I will revert the previous changes :)

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

(and then we need to update the docs)

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

update docs -> punted to #866

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

OK, ready for review!

Here is a brief summary of the remaining items for discussion / refactoring - these can be punted to issues if/when this is merged:

  • in src/core/src/sketch/minhash.rs, the if ignore_abundance code in similarity seems duplicative with the compare function; this should be resolved in some way, or at least commented.
  • in the same file, containment_ignore_maxhash is now identical in behavior to contained_by(..., downsample=True). However, contained_by has a lot more checks and is presumably slower. Benchmark before deciding which to go with in the medium term.
  • right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)
  • is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

@ctb ctb changed the title [WIP] refactor downsampling code in sbtmh.py [MRG] refactor and oxidize downsampling code in sbtmh.py Jan 26, 2020
@luizirber
Copy link
Member

OK, ready for review!

Here is a brief summary of the remaining items for discussion / refactoring - these can be punted to issues if/when this is merged:

* in the same file, `containment_ignore_maxhash` is now identical in behavior to `contained_by(..., downsample=True)`. However, `contained_by` has a lot more checks and is presumably slower. Benchmark before deciding which to go with in the medium term.

Nice to have: asv (Python) and criterion (Rust) benchmarks for both, because they are not covered now.

Anedoctically: Tests are running ~20% faster now 😺

right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)

Probably, but not as urgent as in the scaled case (since we have important use cases for the later)

in src/core/src/sketch/minhash.rs, the if ignore_abundance code in similarity seems duplicative with the compare function; this should be resolved in some way, or at least commented.
is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

These two are related: when I made #808 I had SourmashError::MismatchNum in check_compatible and merged similarity and compare, but that broke the code because check_compatible was not used in any of the scaled/downsampled parts of the code (which obviously don't have the same num.
Is this an excuse not to fix it? No. But it would be a way larger PR, so maybe we should do one for that soon =]

@ctb
Copy link
Contributor Author

ctb commented Jan 26, 2020

Sounds good to me.

(20% faster! wow.)

ctb and others added 12 commits January 26, 2020 11:18
* added downsample bool option to count_common and compare in rust code

* implement downsample functionality in rust

* refactoring sbtmh, step 1

* refactoring sbtmh, step 2

* update 'sourmash watch' to use SBT search fn

* refactoring out unnecessary functionality

* tackle downsampling in similarity with abundance; fix various rust issues

* simplify containment functions
@luizirber luizirber changed the title [MRG] refactor and oxidize downsampling code in sbtmh.py Refactor and oxidize downsampling code in sbtmh.py Jan 26, 2020
@ctb ctb merged commit cbd0c47 into master Jan 26, 2020
@luizirber luizirber deleted the refactor/sbtmh branch January 26, 2020 19:56
other.abunds.is_some(),
);
new_mh.add_many(&other.mins)?;
self.count_common(&new_mh, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if clause and the else below are very similar. If you wanted to, you could try to DRY up the code. Something along this might work:

let cmp = self.max_hash < other.max_hash;
let a = if cmp { self } else { other };
let b = if !cmp { self } else { other };

let mut new_mh = KmerMinHash::new(
    b.num,
    b.ksize,
    b.hash_function,
    b.seed,
    a.max_hash,
    b.abunds.is_some(),
);
new_mh.add_many(&b.mins)?;
new_mh.count_common(a, false)

Maybe you have to drop some mutable references in there. Just an idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants