[MRG] distance utility functions to support ANI estimation #1934

bluegenes · 2022-04-06T21:45:14Z

This PR adds distance utility functions, ultimately to support #1788

A few notes:
For ANI from either jaccard or containment, we can assess the probability that the sketches have nothing in common by chance alone (i.e., a false negative). Ideally, we would keep count of how many times we encounter a too-high probability and recommend that the user retain more hashes (i.e., decrease scaled).
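To make the false-negative idea concrete, here is a minimal sketch of one way to estimate that probability. It assumes a simple mutation model (each base mutates independently, and each k-mer is retained in the sketch with probability 1/scaled). The function name `prob_nothing_in_common` is made up for illustration and is not necessarily how `get_exp_probability_nothing_common` is written.

```python
import math


def prob_nothing_in_common(mutation_rate, ksize, scaled, n_unique_kmers):
    """Hypothetical helper: chance that two scaled sketches share no hashes.

    Assumes each base mutates independently at `mutation_rate` and each
    k-mer is kept in a sketch with probability 1/scaled.
    """
    if scaled <= 1:
        # every k-mer is kept, so any shared k-mer shows up in both sketches
        scaled = 1
    # probability that a k-mer contains at least one mutated base
    p_kmer_mutated = 1.0 - (1.0 - mutation_rate) ** ksize
    # expected number of k-mers still shared by the two sequences
    n_shared = n_unique_kmers * (1.0 - p_kmer_mutated)
    if n_shared == 0:
        return 1.0
    if scaled == 1:
        return 0.0
    # each shared k-mer is absent from the sketch with probability 1 - 1/scaled
    return math.exp(n_shared * math.log1p(-1.0 / scaled))


# a 10% mutation rate at k=21, scaled=1000, 10,000 unique k-mers
p = prob_nothing_in_common(0.1, 21, 1000, 10_000)
```

Under these assumptions, lowering `scaled` (retaining more hashes) shrinks the probability quickly, which is the motivation for the recommendation above.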

For containment, we can optionally output confidence intervals, which take into account the impact of the scaled factor on ANI estimation (the point estimate does not). I would like to retain the ability to output these, but only when desired (Python API, or maybe sourmash sig overlap).

For jaccard, we only calculate the point estimate, but there is a small error associated with this estimation (it was suggested that errors > 10^-4 should be handled with caution). We can warn the user when this occurs, but I'm also currently returning the error value, as it might be useful as output so users can eliminate untrustworthy comparisons.

Note that containment --> ANI will always be more accurate than jaccard --> ANI, but it seems useful to support jaccard, as (1) in my experience, the point estimates are quite similar, and (2) there are still occasions where users may prefer jaccard. Both formulas require scaled sketches, so num sketches will not be compatible with ANI estimation.
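For reference, a minimal sketch of the two point estimates, assuming the simple mutation model in which a k-mer survives with probability ANI^k. The function names here are illustrative only and are not the names used in `distance_utils.py`; the jaccard version additionally assumes the two genomes are of similar size.

```python
def containment_to_ani(containment, ksize):
    """Illustrative point estimate: ANI = containment^(1/k)."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard, ksize):
    """Illustrative point estimate: ANI = (2J / (1 + J))^(1/k).

    Assumes both sequences have a similar number of unique k-mers.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be between 0 and 1")
    return (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / ksize)


# For two equal-size k-mer sets with containment c, jaccard is c / (2 - c),
# so both estimates agree in that idealized case.
c = 0.95 ** 21
j = c / (2.0 - c)
ani_from_containment = containment_to_ani(c, 21)
ani_from_jaccard = jaccard_to_ani(j, 21)
```

Neither function reflects the effect of `scaled`; that is what the containment confidence intervals are meant to capture.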

- finish updating all tests to reflect changes in function return values
  - by default, do not return CI values
  - no CI values from jaccard dist
- add test for get_exp_probability_nothing_common
- run pyflakes; black formatting for distance_utils.py
- add minhash fns

Since the signature functions are really similar to the minhash functions, I'm going to save those for a separate PR, so this one is smaller/easier to review.

codecov · 2022-04-06T21:52:16Z

Codecov Report

Merging #1934 (fb54506) into latest (0d04cca) will increase coverage by 8.19%.
The diff coverage is 99.47%.

@@            Coverage Diff             @@
##           latest    #1934      +/-   ##
==========================================
+ Coverage   82.94%   91.14%   +8.19%     
==========================================
  Files         125       95      -30     
  Lines       13763     9674    -4089     
  Branches     1877     1910      +33     
==========================================
- Hits        11416     8817    -2599     
+ Misses       2075      584    -1491     
- Partials      272      273       +1

| Flag | Coverage Δ | |
|---|---|---|
| python | `91.14% <99.47%> (+0.16%)` | ⬆️ |
| rust | `?` | |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/sourmash/distance_utils.py | `99.32% <99.32%> (ø)` | |
| src/sourmash/minhash.py | `93.20% <100.00%> (+0.64%)` | ⬆️ |
| src/core/src/ffi/signature.rs | | |
| src/core/src/index/bigsi.rs | | |
| src/core/src/index/search.rs | | |
| src/core/src/cmd.rs | | |
| src/core/src/ffi/hyperloglog.rs | | |
| src/core/src/encodings.rs | | |
| src/core/src/index/linear.rs | | |
| src/core/tests/test.rs | | |

... and 23 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0d04cca...fb54506.

bluegenes · 2022-04-09T17:52:58Z

Ready for review @ctb

ctb · 2022-04-10T12:58:51Z

Overall looks really good - only a few test-this-please requests, and some suggested changes to function parameters!

bluegenes · 2022-04-15T00:17:04Z

@ctb I think this is ready for re-review, assuming tests pass

bluegenes · 2022-04-15T00:27:21Z

tests/test_minhash.py

+    tiny test data to trigger the following:
+    WARNING: Cannot estimate ANI confidence intervals from containment. Do your sketches contain enough hashes?
+    Error: varN <0.0!


@ctb is there a way to check this here that I'm missing? I ended up adding a direct test in test_distance_utils.py, so I could rm this test if we want. But it's kinda nice to know that this situation triggers it...
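One possible way to check for this kind of message in a test, sketched under the assumption that the warning is emitted through Python's standard `warnings` module (the switch floated in #1954) rather than through `notify`. Both `containment_ci_or_none` and `warning_messages` are hypothetical helpers written for this illustration, not sourmash functions.

```python
import warnings


def containment_ci_or_none(var_n_mutated):
    """Hypothetical stand-in for a CI calculation that can fail on tiny sketches."""
    if var_n_mutated < 0.0:
        warnings.warn(
            "Cannot estimate ANI confidence intervals from containment. "
            "Do your sketches contain enough hashes?"
        )
        return None
    return (0.0, 1.0)  # placeholder bounds, not a real confidence interval


def warning_messages(func, *args):
    """Run func and return (result, list of warning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # make sure no warning is suppressed by an existing filter
        warnings.simplefilter("always")
        result = func(*args)
    return result, [str(w.message) for w in caught]


result, messages = warning_messages(containment_ci_or_none, -1.0)
```

With `notify`-style messages written to stderr, the same idea would need stderr capture instead.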

ctb

looks good! just one suggestion, and one comment! merge as you will ;)

bluegenes · 2022-04-15T16:32:01Z

@ctb py tests were passing, but now they're not and I'm not really sure why...

ctb · 2022-04-15T16:34:38Z

> @ctb py tests were passing, but now they're not and I'm not really sure why...

you merged my quote adjustment but didn't fix the tests, looks like -

assert("Error: distance estimation requires input of either `sequence_len_bp` or `n_unique_kmers`") in str(exc)
  E       assert 'Error: distance estimation requires input of either `sequence_len_bp` or `n_unique_kmers`' in '<ExceptionInfo ValueError("Error: distance estimation requires input of either \'sequence_len_bp\' or \'n_unique_kmers\'") tblen=2>'
  E        +  where '<ExceptionInfo ValueError("Error: distance estimation requires input of either \'sequence_len_bp\' or \'n_unique_kmers\'") tblen=2>' = str(<ExceptionInfo ValueError("Error: distance estimation requires input of either 'sequence_len_bp' or 'n_unique_kmers'") tblen=2>)
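A minimal sketch of a test that follows the message text, assuming the error now uses single quotes as shown in the traceback. The body of `handle_seqlen_nkmers` below is a stand-in written for this example, and `error_message` is a hypothetical helper. In pytest itself, comparing against `str(exc.value)` rather than `str(exc)` avoids the escaped quotes from the `ExceptionInfo` repr, and `pytest.raises(ValueError, match=...)` is another option (the pattern is a regular expression).

```python
def handle_seqlen_nkmers(ksize, *, sequence_len_bp=None, n_unique_kmers=None):
    """Stand-in that mimics the error quoted above; the body is an assumption."""
    if n_unique_kmers is not None:
        return n_unique_kmers
    if sequence_len_bp is None:
        raise ValueError(
            "Error: distance estimation requires input of either "
            "'sequence_len_bp' or 'n_unique_kmers'"
        )
    # number of k-mers in a sequence of this length
    return sequence_len_bp - (ksize - 1)


def error_message(func, *args, **kwargs):
    """Return the message of the ValueError raised by func, or None."""
    try:
        func(*args, **kwargs)
    except ValueError as exc:
        return str(exc)
    return None


msg = error_message(handle_seqlen_nkmers, 21)
# match on the single-quoted names, not the old backtick-quoted ones
assert "'sequence_len_bp' or 'n_unique_kmers'" in msg
```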

ctb · 2022-04-15T17:18:58Z

🎉

split distance utils and start updating tests for function return cha…

2e9910f

…nges

bluegenes added 6 commits April 6, 2022 14:57

upd containment tests

f1cd7db

upd jaccard tests

7ec4e49

add p_nothing_in_common test; cleanup unused jaccard formulas

28eb968

pyflakes, formatting

3ebae36

raise err with bad input to distance_to_identity

09dedd2

add minhash fns and tests

e9c055f

bluegenes changed the title ~~[WIP] distance utility functions to support ANI estimation~~ [MRG] distance utility functions to support ANI estimation Apr 8, 2022

ctb requested changes Apr 10, 2022

bluegenes and others added 16 commits April 11, 2022 12:55

upd

e0719da

Apply suggestions from code review

8d7b0d3

Co-authored-by: C. Titus Brown <titus@idyll.org>

Merge branch 'add-distance-utils-only' of github.com:sourmash-bio/sou…

3fbe684

…rmash into add-distance-utils-only

try a dataclass

e92589a

better dataclasses; init revamp tests for aniresult classes

853ea85

mod all tests to for aniresult dataclass

454abf1

init minhash changes for aniresult

9a49224

Merge branch 'latest' into add-distance-utils-only

f1b8d72

upd tests

a6c660a

tiny testdata tests for var0.0

90671fa

Merge branch 'add-distance-utils-only' of github.com:sourmash-bio/sou…

9344158

…rmash into add-distance-utils-only

test var_n_mutated directly

01d4ae7

test test_handle_seqlen_nkmers directly

017bd32

clarify comment

83a050d

dont populate fake CI for dist=0,dist=1 cases

2f46649

fix newline

e88e454

bluegenes commented Apr 15, 2022

ctb approved these changes Apr 15, 2022

Update src/sourmash/distance_utils.py

731d286

Co-authored-by: C. Titus Brown <titus@idyll.org>

bluegenes mentioned this pull request Apr 15, 2022

potentially switch to python warnings instead of notify warnings #1954

Open

Merge branch 'latest' into add-distance-utils-only

9f2349c

fix corresponding test

fb54506

bluegenes merged commit 6f7eb06 into latest Apr 15, 2022

bluegenes deleted the add-distance-utils-only branch April 15, 2022 17:18

This was referenced Apr 21, 2022

Draft release notes for sourmash v4.4.0 #1968

Closed

Collect Result functionality into a common class #416

Open

bluegenes mentioned this pull request Apr 27, 2022

[EXP] add sourmash distance estimation #1788

Closed

12 tasks
