Refactor and oxidize downsampling code in sbtmh.py #856

ctb · 2020-01-22T14:26:08Z

The downsampling code we use in MinHash comparisons is ugly, so refactoring it makes sense. This does a fairly broad cleanup based on adding a downsample bool option to the similarity functions.

More specifically, this PR:

adds 'downsample' boolean parameters to rust compare, similarity, and count_common functions;
adds explicit tests for a few non-trivial angular distance calculations
refactors 'sourmash watch' to use new Index.search function instead of SBT-specific interface
simplifies and cleans up sbt and sbtmh functionality around downsampling in searches, and removes some unused functionality that is confusing.

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-01-22T14:37:36Z

Codecov Report

Merging #856 into master will decrease coverage by 0.54%.
The diff coverage is 53.54%.

@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
- Coverage   78.43%   77.89%   -0.55%     
==========================================
  Files          94       94              
  Lines        7299     7306       +7     
==========================================
- Hits         5725     5691      -34     
- Misses       1574     1615      +41

Flag	Coverage Δ
#rusttests	`50% <41.93%> (-1.24%)`	⬇️

Impacted Files	Coverage Δ
src/core/src/errors.rs	`0% <0%> (ø)`	⬆️
src/core/src/ffi/minhash.rs	`0% <0%> (ø)`	⬆️
sourmash/_minhash.py	`96.2% <100%> (-0.89%)`	⬇️
sourmash/sbtmh.py	`92.79% <100%> (+8.5%)`	⬆️
sourmash/sbt.py	`86.81% <100%> (ø)`	⬆️
sourmash/commands.py	`84.06% <100%> (-0.03%)`	⬇️
sourmash/compare.py	`100% <100%> (+10.29%)`	⬆️
sourmash/signature.py	`86.77% <100%> (-0.42%)`	⬇️
src/core/src/sketch/minhash.rs	`64.25% <35%> (-9.08%)`	⬇️
src/core/src/index/mod.rs	`60.91% <60%> (ø)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9de09b...6ebd166. Read the comment docs.

sourmash/sbtmh.py

luizirber · 2020-01-26T00:30:47Z

codecov/patch is failing because coverage can't track the Rust code when used inside Python (as an extension), so it looks like it was not covered (because the Rust tests don't test it). I don't think this is a blocker (since the Python code is probably covering it), so I don't think that's a blocker for merging.

Any other changes you want to do, @ctb? (other than the similarity tests you already mentioned)

luizirber · 2020-01-26T02:16:01Z

I also bumped the core version to 0.4.0, since some methods' signatures changed

ctb · 2020-01-26T15:31:57Z

@luizirber what do you think about 2e16a30? This switches the error message to reference scaled, instead of max hash. While it may (temporarily L) confuse developers to have 'mismatch in scaled' reported for a mismatch in max_hash, I think the improvement in end-user UX is worth it. Although of course in some theoretical world we would never report something like this to an end user... :)

ctb · 2020-01-26T15:48:44Z

another question - right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)

ctb · 2020-01-26T15:50:20Z

...and is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

ctb · 2020-01-26T17:02:11Z

I think we should merge this ASAP, and maybe cut a new release; and then I'll deal with the MismatchNum / downsample num code in another PR.

ctb · 2020-01-26T17:20:32Z

Well, and so. I was wrong above.

in #69, I claimed that we "switched to using cosine similarity for abundance comparisons". But we did not; the comment in the similarity function says, "If the sketches are abundance weighted, calculate a distance metric based on the cosine similarity."

It turns out that cosine similarity is not a distance metric, since it doesn't satisfy the triangle inequality. So instead we used "angular distance", per https://en.wikipedia.org/wiki/Cosine_similarity, section "Angular distance and similarity." This is defined as the inverse cosine of the cosine distance, divided by pi.

So I will revert the previous changes :)

ctb · 2020-01-26T17:20:44Z

(and then we need to update the docs)

ctb · 2020-01-26T17:43:13Z

update docs -> punted to #866

ctb · 2020-01-26T17:51:47Z

OK, ready for review!

Here is a brief summary of the remaining items for discussion / refactoring - these can be punted to issues if/when this is merged:

in src/core/src/sketch/minhash.rs, the if ignore_abundance code in similarity seems duplicative with the compare function; this should be resolved in some way, or at least commented.
in the same file, containment_ignore_maxhash is now identical in behavior to contained_by(..., downsample=True). However, contained_by has a lot more checks and is presumably slower. Benchmark before deciding which to go with in the medium term.
right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)
is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

luizirber · 2020-01-26T19:01:22Z

OK, ready for review!

Here is a brief summary of the remaining items for discussion / refactoring - these can be punted to issues if/when this is merged:
* in the same file, `containment_ignore_maxhash` is now identical in behavior to `contained_by(..., downsample=True)`. However, `contained_by` has a lot more checks and is presumably slower. Benchmark before deciding which to go with in the medium term.

Nice to have: asv (Python) and criterion (Rust) benchmarks for both, because they are not covered now.

Anedoctically: Tests are running ~20% faster now 😺

right now, downsample in MinHash.similarity only applies to scaled signatures. Do we want it to downsample num signatures also? (see sourmash/_minhash.py::MinHash.compare)

Probably, but not as urgent as in the scaled case (since we have important use cases for the later)

in src/core/src/sketch/minhash.rs, the if ignore_abundance code in similarity seems duplicative with the compare function; this should be resolved in some way, or at least commented.
is there a reason not to have SourmashError::MismatchNum be checked in check_compatible (used by lib.kmerminhash_compare)?

These two are related: when I made #808 I had SourmashError::MismatchNum in check_compatible and merged similarity and compare, but that broke the code because check_compatible was not used in any of the scaled/downsampled parts of the code (which obviously don't have the same num.
Is this an excuse not to fix it? No. But it would be a way larger PR, so maybe we should do one for that soon =]

ctb · 2020-01-26T19:11:38Z

Sounds good to me.

(20% faster! wow.)

* added downsample bool option to count_common and compare in rust code * implement downsample functionality in rust * refactoring sbtmh, step 1 * refactoring sbtmh, step 2 * update 'sourmash watch' to use SBT search fn * refactoring out unnecessary functionality * tackle downsampling in similarity with abundance; fix various rust issues * simplify containment functions

…ng code for compare

kloetzl · 2020-01-26T21:09:16Z

src/core/src/sketch/minhash.rs

+                    other.abunds.is_some(),
+                );
+                new_mh.add_many(&other.mins)?;
+                self.count_common(&new_mh, false)


This if clause and the else below are very similar. If you wanted to, you could try to DRY up the code. Something along this might work:

let cmp = self.max_hash < other.max_hash; let a = if cmp { self } else { other }; let b = if !cmp { self } else { other }; let mut new_mh = KmerMinHash::new( b.num, b.ksize, b.hash_function, b.seed, a.max_hash, b.abunds.is_some(), ); new_mh.add_many(&b.mins)?; new_mh.count_common(a, false)

Maybe you have to drop some mutable references in there. Just an idea.

luizirber reviewed Jan 22, 2020

View reviewed changes

sourmash/sbtmh.py Outdated Show resolved Hide resolved

luizirber reviewed Jan 22, 2020

View reviewed changes

sourmash/sbtmh.py Outdated Show resolved Hide resolved

ctb mentioned this pull request Jan 25, 2020

[WIP] oxidize downsampling in various similarity functions #863

Merged

luizirber added this to the 3.2 milestone Jan 26, 2020

ctb changed the title ~~[WIP] refactor downsampling code in sbtmh.py~~ [MRG] refactor and oxidize downsampling code in sbtmh.py Jan 26, 2020

ctb and others added 12 commits January 26, 2020 11:18

refactor downsampling code to no longer introspect exception text

ba84e18

clean up _similarity_downsample

483a6fe

add some tests for the basic bulk behavior

08eb42b

another basic test

487ffd7

fix rust tests

6b3ed2d

fix clippy lints

bacc4df

bump core version

a6ee21f

switch mismatch error to reference scaled, not max_hash

1c36ead

add docstrings

8cec004

slightly more understandble if statement :)

7056649

fix cosine similarity calculation, AND fix bug in new rust downsampli…

952e90a

…ng code for compare

ctb and others added 3 commits January 26, 2020 11:18

checks that downsampling is doing the right thing

901867a

update to make it clear we're calculating the angular similarity

fb53dbb

fmt and clippy fixes for #865

6ebd166

luizirber force-pushed the refactor/sbtmh branch from 5aaa56a to 6ebd166 Compare January 26, 2020 19:24

luizirber approved these changes Jan 26, 2020

View reviewed changes

luizirber changed the title ~~[MRG] refactor and oxidize downsampling code in sbtmh.py~~ Refactor and oxidize downsampling code in sbtmh.py Jan 26, 2020

luizirber added the ready for merge! label Jan 26, 2020

luizirber mentioned this pull request Jan 26, 2020

Improve sketching performance for DNA #865

Merged

5 tasks

ctb merged commit cbd0c47 into master Jan 26, 2020

luizirber deleted the refactor/sbtmh branch January 26, 2020 19:56

kloetzl reviewed Jan 26, 2020

View reviewed changes

This was referenced Jan 27, 2020

refactor compare and similarity in _minhash.py and sketch/minhash.rs #868

Closed

consider removing containment_ignore_max_hash function altogether #869

Closed

update 'num' behavior, in terms of downsampling and comparison checking #870

Open

ctb mentioned this pull request Dec 18, 2021

explore using unweighted cosine similarity with scaled sketches #1770

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor and oxidize downsampling code in sbtmh.py #856

Refactor and oxidize downsampling code in sbtmh.py #856

ctb commented Jan 22, 2020 •

edited

Loading

codecov bot commented Jan 22, 2020 •

edited

Loading

luizirber commented Jan 26, 2020 •

edited

Loading

luizirber commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020 •

edited

Loading

luizirber commented Jan 26, 2020

ctb commented Jan 26, 2020

kloetzl Jan 26, 2020

Refactor and oxidize downsampling code in sbtmh.py #856

Refactor and oxidize downsampling code in sbtmh.py #856

Conversation

ctb commented Jan 22, 2020 • edited Loading

codecov bot commented Jan 22, 2020 • edited Loading

Codecov Report

luizirber commented Jan 26, 2020 • edited Loading

luizirber commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020

ctb commented Jan 26, 2020 • edited Loading

luizirber commented Jan 26, 2020

ctb commented Jan 26, 2020

kloetzl Jan 26, 2020

Choose a reason for hiding this comment

ctb commented Jan 22, 2020 •

edited

Loading

codecov bot commented Jan 22, 2020 •

edited

Loading

luizirber commented Jan 26, 2020 •

edited

Loading

ctb commented Jan 26, 2020 •

edited

Loading