Refactor and oxidize downsampling code in sbtmh.py #856
Codecov Report
@@ Coverage Diff @@
## master #856 +/- ##
==========================================
- Coverage 78.43% 77.89% -0.55%
==========================================
Files 94 94
Lines 7299 7306 +7
==========================================
- Hits 5725 5691 -34
- Misses 1574 1615 +41
codecov/patch is failing because coverage can't track the Rust code when it is used inside Python (as an extension), so that code looks uncovered (the Rust tests don't exercise it). Since the Python tests are probably covering it, I don't think this is a blocker for merging. Any other changes you want to make, @ctb? (other than the similarity tests you already mentioned)
I also bumped the core version to
@luizirber what do you think about 2e16a30? This switches the error message to reference scaled instead of max hash. While it may (temporarily) confuse developers to have 'mismatch in scaled' reported for a mismatch in max_hash, I think the improvement in end-user UX is worth it. Although of course in some theoretical world we would never report something like this to an end user... :)
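To illustrate why a max_hash mismatch can be reported in terms of scaled, here is a minimal sketch. It assumes the convention that max_hash is roughly 2^64 divided by scaled, with 0 meaning "not set". The function names and the exact message text are hypothetical and are not taken from the sourmash code.

```rust
/// Hypothetical helper: the largest hash value kept for a given scaled factor.
/// Assumes max_hash ~= 2^64 / scaled; 0 means "no scaled value set".
fn max_hash_for_scaled(scaled: u64) -> u64 {
    match scaled {
        0 => 0,
        1 => u64::MAX,
        _ => (u64::MAX as f64 / scaled as f64).round() as u64,
    }
}

/// Hypothetical inverse: recover the user-facing scaled value from max_hash.
fn scaled_for_max_hash(max_hash: u64) -> u64 {
    if max_hash == 0 {
        0
    } else {
        (u64::MAX as f64 / max_hash as f64).round() as u64
    }
}

/// Build an end-user message in terms of scaled, even though the
/// comparison itself is done on max_hash. Returns None when they match.
fn mismatch_message(a_max_hash: u64, b_max_hash: u64) -> Option<String> {
    if a_max_hash == b_max_hash {
        return None;
    }
    Some(format!(
        "mismatch in scaled: {} vs {}",
        scaled_for_max_hash(a_max_hash),
        scaled_for_max_hash(b_max_hash)
    ))
}
```

The point of the sketch is only that the two quantities carry the same information, so the error can name whichever one the end user actually typed.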
another question - right now,
...and is there a reason not to have
I think we should merge this ASAP, and maybe cut a new release; and then I'll deal with the MismatchNum / downsample num code in another PR.
Well, and so. I was wrong above. In #69, I claimed that we "switched to using cosine similarity for abundance comparisons". But we did not; the comment in the similarity function says, "If the sketches are abundance weighted, calculate a distance metric based on the cosine similarity." It turns out that cosine similarity is not a distance metric, since it doesn't satisfy the triangle inequality. So instead we used "angular distance", per https://en.wikipedia.org/wiki/Cosine_similarity, section "Angular distance and similarity." This is defined as the inverse cosine of the cosine similarity, divided by pi. So I will revert the previous changes :)
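For reference, the definition quoted above can be sketched as follows. This is a minimal illustration on plain abundance vectors, not the sourmash implementation. The function names are hypothetical, and treating angular similarity as one minus the angular distance is an assumption based on the Wikipedia section cited above.

```rust
use std::f64::consts::PI;

/// Cosine similarity of two equal-length abundance vectors.
/// Returns None if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    // Clamp so rounding error cannot push the value outside acos's domain.
    Some((dot / (norm_a * norm_b)).clamp(-1.0, 1.0))
}

/// Angular distance: arccos(cosine similarity) / pi.
fn angular_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    cosine_similarity(a, b).map(|c| c.acos() / PI)
}

/// Angular similarity, assumed here to be 1 - angular distance.
fn angular_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    angular_distance(a, b).map(|d| 1.0 - d)
}
```

With this definition, orthogonal vectors are at distance 0.5 and identical vectors at distance 0.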
(and then we need to update the docs)
update docs -> punted to #866
OK, ready for review! Here is a brief summary of the remaining items for discussion / refactoring - these can be punted to issues if/when this is merged:
Nice to have: Anecdotally, tests are running ~20% faster now 😺
Probably, but not as urgent as in the scaled case (since we have important use cases for the latter)
These two are related: when I made #808 I had
Sounds good to me. (20% faster! wow.)
* added downsample bool option to count_common and compare in rust code
* implement downsample functionality in rust
* refactoring sbtmh, step 1
* refactoring sbtmh, step 2
* update 'sourmash watch' to use SBT search fn
* refactoring out unnecessary functionality
* tackle downsampling in similarity with abundance; fix various rust issues
* simplify containment functions
…ng code for compare
5aaa56a to 6ebd166
    other.abunds.is_some(),
);
new_mh.add_many(&other.mins)?;
self.count_common(&new_mh, false)
This if clause and the else below are very similar. If you wanted to, you could try to DRY up the code. Something along these lines might work:
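// a is the sketch with the smaller max_hash; b is the one to downsample.
let cmp = self.max_hash < other.max_hash;
let a = if cmp { self } else { other };
let b = if !cmp { self } else { other };
// Rebuild b with a's max_hash so both sketches cover the same hash range.
let mut new_mh = KmerMinHash::new(
    b.num,
    b.ksize,
    b.hash_function,
    b.seed,
    a.max_hash,
    b.abunds.is_some(),
);
new_mh.add_many(&b.mins)?;
new_mh.count_common(a, false)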
Maybe you have to drop some mutable references in there. Just an idea.
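To make the idea concrete, here is a self-contained sketch of the same downsample-then-compare pattern. It uses a hypothetical, heavily simplified `Sketch` type (only `max_hash` and a set of hashes) rather than the real `KmerMinHash`, and its `downsample_max_hash` helper and error text are assumptions made for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for a scaled MinHash sketch:
/// only hashes <= max_hash are retained.
#[derive(Debug, Clone)]
struct Sketch {
    max_hash: u64,
    mins: BTreeSet<u64>,
}

impl Sketch {
    fn new(max_hash: u64) -> Self {
        Sketch { max_hash, mins: BTreeSet::new() }
    }

    /// Add hashes, silently dropping any above max_hash.
    fn add_many(&mut self, hashes: &[u64]) {
        for &h in hashes {
            if h <= self.max_hash {
                self.mins.insert(h);
            }
        }
    }

    /// Return a copy restricted to hashes <= the given (smaller) max_hash.
    fn downsample_max_hash(&self, max_hash: u64) -> Sketch {
        Sketch {
            max_hash,
            mins: self.mins.range(..=max_hash).copied().collect(),
        }
    }

    /// Count shared hashes. If max_hash differs, either downsample the
    /// sketch with the larger max_hash or report an error.
    fn count_common(&self, other: &Sketch, downsample: bool) -> Result<usize, String> {
        if self.max_hash == other.max_hash {
            return Ok(self.mins.intersection(&other.mins).count());
        }
        if !downsample {
            return Err("mismatch in max_hash".to_string());
        }
        // a has the smaller max_hash; b is downsampled to match it.
        let (a, b) = if self.max_hash < other.max_hash {
            (self, other)
        } else {
            (other, self)
        };
        let b_down = b.downsample_max_hash(a.max_hash);
        Ok(a.mins.intersection(&b_down.mins).count())
    }
}
```

Picking `a` and `b` once up front is what removes the duplicated if/else branches: the rest of the function no longer cares which argument had the larger max_hash.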
The downsampling code we use in MinHash comparisons is ugly, so refactoring it makes sense. This does a fairly broad cleanup based on adding a downsample bool option to the similarity functions.
More specifically, this PR:
* … compare, similarity, and count_common functions;
* make test: Did it pass the tests?
* make coverage: Is the new code covered?
* … without a major version increment. Changing file formats also requires a major version number increment.
* … changes were made?