-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] distance utility functions to support ANI estimation #1934
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #1934 +/- ##
==========================================
+ Coverage 82.94% 91.14% +8.19%
==========================================
Files 125 95 -30
Lines 13763 9674 -4089
Branches 1877 1910 +33
==========================================
- Hits 11416 8817 -2599
+ Misses 2075 584 -1491
- Partials 272 273 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Ready for review @ctb |
Overall looks really good - only a few test-this-please requests, and some suggested changes to function parameters! |
Co-authored-by: C. Titus Brown <titus@idyll.org>
…rmash into add-distance-utils-only
…rmash into add-distance-utils-only
@ctb I think this is ready for re-review, assuming tests pass |
tiny test data to trigger the following: | ||
WARNING: Cannot estimate ANI confidence intervals from containment. Do your sketches contain enough hashes? | ||
Error: varN <0.0! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ctb is there a way to check this here that I'm missing? I ended up adding a direct test in test_distance_utils.py
, so I could rm this test if we want. But it's kinda nice to know that this situation triggers it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! just one suggestion, and one comment! merge as you will ;)
Co-authored-by: C. Titus Brown <titus@idyll.org>
@ctb py tests were passing, but now they're not and I'm not really sure why... |
you merged my quote adjustment but didn't fix the tests, looks like -
|
🎉 |
This PR adds distance utility functions, ultimately to support #1788
A few notes:
For ANI from either jaccard and containment, we can assess the probability that the sketches have nothing in common based on chance alone (== false negative). Ideally, we would keep count of how many times we encounter a too-high probability, and recommend that the user retains more hashes (decrease scaled).
For containment, we can optionally output confidence intervals, which actually take into account the impact of the
scaled
factor on ANI estimation (point estimate does not). I would like to retain the ability to output these, and output them only when desired (python API or maybesourmash sig overlap
).For jaccard, we only calculate the point estimate, but there is a small error associated with this estimation (it was suggested that > 10^-4 should be handled with caution). We can warn the user when this situation occurs, but I'm also currently returning this, as it might be useful to have as output, so users can eliminate untrustworthy comparisons.
Note that containment --> ANI will always be more accurate than jaccard --> ANI, but it seems useful to support jaccard, as 1. in my experience, the point estimate is actually quite similar. and 2. there are still occasions where users may prefer jaccard. Both formulas do require
scaled
sketches, sonum
sketches will not be compatible withANI
estimationget_exp_probability_nothing_common
distance_utils.py
minhash
fnsSince the
signature
functions are really similar to theminhash
functions, I'm going to save those for a separate PR, so this one is smaller/easier to review.