[MRG] distance utility functions to support ANI estimation #1934

Merged
merged 26 commits into from
Apr 15, 2022

bluegenes
Contributor

@bluegenes bluegenes commented Apr 6, 2022

This PR adds distance utility functions, ultimately to support #1788

A few notes:
For ANI from either jaccard or containment, we can assess the probability that the sketches have nothing in common by chance alone (i.e., a false negative). Ideally, we would keep count of how many times we encounter a too-high probability and recommend that the user retain more hashes (i.e., decrease scaled).
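
As a rough illustration of the false-negative probability described above, here is a minimal sketch. It is not the implementation of `get_exp_probability_nothing_common` from this PR: the function name `prob_nothing_in_common`, its parameters, and the simple independence model (each k-mer is unmutated with probability `(1 - mutation_rate) ** ksize` and kept in the sketch with probability `1 / scaled`) are assumptions made for the example.

```python
import math


def prob_nothing_in_common(mutation_rate, ksize, scaled, n_unique_kmers):
    """Chance that two scaled sketches share no hashes (hypothetical helper).

    Assumed model: each of the n_unique_kmers k-mers independently survives
    unmutated with probability (1 - mutation_rate) ** ksize and is retained
    in the sketch with probability 1 / scaled.
    """
    if not 0.0 <= mutation_rate <= 1.0:
        raise ValueError("mutation_rate must be between 0 and 1")
    if ksize < 1 or scaled < 1 or n_unique_kmers < 1:
        raise ValueError("ksize, scaled and n_unique_kmers must be >= 1")

    # Probability that a single k-mer is both unmutated and sketched.
    p_shared = (1.0 - mutation_rate) ** ksize / scaled
    if p_shared >= 1.0:
        # Every k-mer is shared and kept, so an empty overlap is impossible.
        return 0.0

    # (1 - p_shared) ** n, computed in log space for numerical stability.
    return math.exp(n_unique_kmers * math.log1p(-p_shared))


if __name__ == "__main__":
    # Retaining fewer hashes (larger scaled) raises the false-negative risk.
    for scaled in (100, 1000, 10000):
        p = prob_nothing_in_common(0.1, ksize=21, scaled=scaled, n_unique_kmers=10000)
        print(f"scaled={scaled}: P(nothing in common) = {p:.3g}")
```

Under this model, decreasing scaled lowers the probability, which is the motivation for recommending that the user retain more hashes.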

For containment, we can optionally output confidence intervals, which take into account the impact of the scaled factor on ANI estimation (the point estimate does not). I would like to retain the ability to output these, but only when desired (via the Python API or maybe sourmash sig overlap).

For jaccard, we only calculate the point estimate, but there is a small error associated with this estimation (it was suggested that errors > 10^-4 should be handled with caution). We can warn the user when this situation occurs, but I'm also currently returning this error value, as it might be useful as output so users can eliminate untrustworthy comparisons.

Note that containment --> ANI will always be more accurate than jaccard --> ANI, but it seems useful to support jaccard, as (1) in my experience, the point estimates are actually quite similar, and (2) there are still occasions where users may prefer jaccard. Both formulas require scaled sketches, so num sketches will not be compatible with ANI estimation.
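
To make the two conversions concrete, here is a minimal sketch of the point estimates under a simple mutation model. The function names `containment_to_ani` and `jaccard_to_ani` are hypothetical and are not taken from `distance_utils.py`; the formulas (ANI = C^(1/k) for containment, and ANI = (2J / (1 + J))^(1/k) for jaccard) are the commonly used point estimates and are assumed here rather than copied from this PR. No confidence intervals or error terms are computed.

```python
def containment_to_ani(containment, ksize):
    """Point estimate of ANI from containment (hypothetical helper)."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    # Under the assumed model, a k-mer is shared only if all ksize bases are
    # unmutated, so containment ~= ANI ** ksize.
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Point estimate of ANI from jaccard (hypothetical helper)."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    # For two sets of equal size, containment = 2J / (1 + J); this equal-size
    # assumption is one reason the jaccard route is less accurate.
    return (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / ksize)


if __name__ == "__main__":
    ksize = 21
    print(containment_to_ani(0.5, ksize))
    print(jaccard_to_ani(1.0 / 3.0, ksize))  # J = 1/3 corresponds to C = 0.5
```

When the two sketches come from sequences of similar size, the two estimates agree closely, which matches the observation above that the point estimates tend to be quite similar.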

  • finish updating all tests to reflect changes in function return values
    • by default, do not return CI values
    • no CI values from jaccard dist
  • add test for get_exp_probability_nothing_common
  • run pyflakes; black formatting for distance_utils.py
  • add minhash fns

Since the signature functions are really similar to the minhash functions, I'm going to save those for a separate PR, so this one is smaller/easier to review.

@codecov

codecov bot commented Apr 6, 2022

Codecov Report

Merging #1934 (fb54506) into latest (0d04cca) will increase coverage by 8.19%.
The diff coverage is 99.47%.

@@            Coverage Diff             @@
##           latest    #1934      +/-   ##
==========================================
+ Coverage   82.94%   91.14%   +8.19%     
==========================================
  Files         125       95      -30     
  Lines       13763     9674    -4089     
  Branches     1877     1910      +33     
==========================================
- Hits        11416     8817    -2599     
+ Misses       2075      584    -1491     
- Partials      272      273       +1     
Flag Coverage Δ
python 91.14% <99.47%> (+0.16%) ⬆️
rust ?


Impacted Files Coverage Δ
src/sourmash/distance_utils.py 99.32% <99.32%> (ø)
src/sourmash/minhash.py 93.20% <100.00%> (+0.64%) ⬆️
src/core/src/ffi/signature.rs
src/core/src/index/bigsi.rs
src/core/src/index/search.rs
src/core/src/cmd.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/encodings.rs
src/core/src/index/linear.rs
src/core/tests/test.rs
... and 23 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bluegenes bluegenes changed the title [WIP] distance utility functions to support ANI estimation [MRG] distance utility functions to support ANI estimation Apr 8, 2022
@bluegenes
Contributor Author

Ready for review @ctb

@ctb
Contributor

ctb commented Apr 10, 2022

Overall looks really good - only a few test-this-please requests, and some suggested changes to function parameters!

@bluegenes
Contributor Author

bluegenes commented Apr 15, 2022

@ctb I think this is ready for re-review, assuming tests pass

Comment on lines +2777 to +2779
tiny test data to trigger the following:
WARNING: Cannot estimate ANI confidence intervals from containment. Do your sketches contain enough hashes?
Error: varN <0.0!
Contributor Author

@bluegenes bluegenes Apr 15, 2022

@ctb is there a way to check this here that I'm missing? I ended up adding a direct test in test_distance_utils.py, so I could rm this test if we want. But it's kinda nice to know that this situation triggers it...

Contributor

@ctb ctb left a comment

looks good! just one suggestion, and one comment! merge as you will ;)

Co-authored-by: C. Titus Brown <titus@idyll.org>
@bluegenes
Contributor Author

@ctb py tests were passing, but now they're not and I'm not really sure why...

@ctb
Contributor

ctb commented Apr 15, 2022

> @ctb py tests were passing, but now they're not and I'm not really sure why...

you merged my quote adjustment but didn't fix the tests, looks like -

assert("Error: distance estimation requires input of either `sequence_len_bp` or `n_unique_kmers`") in str(exc)
  E       assert 'Error: distance estimation requires input of either `sequence_len_bp` or `n_unique_kmers`' in '<ExceptionInfo ValueError("Error: distance estimation requires input of either \'sequence_len_bp\' or \'n_unique_kmers\'") tblen=2>'
  E        +  where '<ExceptionInfo ValueError("Error: distance estimation requires input of either \'sequence_len_bp\' or \'n_unique_kmers\'") tblen=2>' = str(<ExceptionInfo ValueError("Error: distance estimation requires input of either 'sequence_len_bp' or 'n_unique_kmers'") tblen=2>)
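
The mismatch in the output above can be reproduced in isolation. This is only a sketch: `require_size_info` is a hypothetical stand-in for whichever function raises the error, and only the message text is taken from the traceback. The test string still uses backticks, while the raised message now uses single quotes, so the substring check fails until the expected string is updated.

```python
def require_size_info(sequence_len_bp=None, n_unique_kmers=None):
    """Hypothetical stand-in that raises the error shown in the traceback."""
    if sequence_len_bp is None and n_unique_kmers is None:
        raise ValueError(
            "Error: distance estimation requires input of either "
            "'sequence_len_bp' or 'n_unique_kmers'"
        )
    return sequence_len_bp if sequence_len_bp is not None else n_unique_kmers


def error_message():
    """Return the text of the error raised when no size info is given."""
    try:
        require_size_info()
    except ValueError as exc:
        return str(exc)
    return None


msg = error_message()

# Expected string before the quote adjustment (backticks): no longer matches.
old_expected = (
    "Error: distance estimation requires input of either "
    "`sequence_len_bp` or `n_unique_kmers`"
)
# Expected string after the quote adjustment (single quotes): matches.
new_expected = (
    "Error: distance estimation requires input of either "
    "'sequence_len_bp' or 'n_unique_kmers'"
)

print("old expected string matches:", old_expected in msg)
print("new expected string matches:", new_expected in msg)
```

Updating the expected string in the test to use single quotes is enough to make the check pass again.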

@bluegenes bluegenes merged commit 6f7eb06 into latest Apr 15, 2022
@bluegenes bluegenes deleted the add-distance-utils-only branch April 15, 2022 17:18
@ctb
Contributor

ctb commented Apr 15, 2022

🎉
