Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] prevent ANI estimation when sketch size estimate may be inaccurate #2032

Merged
merged 7 commits into from
May 13, 2022

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented May 4, 2022

Using equations from @dkoslicki's #2031

  • integrate eqn's /create minhash method
  • when doing ANI comparisons, check set size accuracy, return "" when sizes are insufficient

Notes and concerns:

  • Adding the bias term for changes the containment for small sketches, which we probably only want to do for major releases. If HLL cardinality estimation will be added soon, I'm not sure it's a good idea to do this, only to change it back for HLL. Containment values will change then too, but at least it will just be once? ref add input number of k-mers (before sketching) to signature format #2030.

@codecov
Copy link

codecov bot commented May 4, 2022

Codecov Report

Merging #2032 (a249b95) into latest (6c7b3a8) will increase coverage by 7.50%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2032      +/-   ##
==========================================
+ Coverage   84.15%   91.65%   +7.50%     
==========================================
  Files         129       98      -31     
  Lines       15087    10854    -4233     
  Branches     2119     2133      +14     
==========================================
- Hits        12696     9948    -2748     
+ Misses       2095      607    -1488     
- Partials      296      299       +3     
Flag Coverage Δ
python 91.65% <100.00%> (-0.02%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/compare.py 100.00% <100.00%> (ø)
src/sourmash/distance_utils.py 99.39% <100.00%> (+0.03%) ⬆️
src/sourmash/minhash.py 94.17% <100.00%> (+0.18%) ⬆️
src/sourmash/signature.py 91.81% <100.00%> (+0.15%) ⬆️
src/sourmash/sketchcomparison.py 95.23% <100.00%> (-4.77%) ⬇️
src/core/src/lib.rs
src/core/src/ffi/nodegraph.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/encodings.rs
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c7b3a8...a249b95. Read the comment docs.

@bluegenes
Copy link
Contributor Author

@ctb here's where/how I'm thinking of integrating the equations from #2031

@bluegenes
Copy link
Contributor Author

@ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?

@ctb
Copy link
Contributor

ctb commented May 12, 2022 via email

@bluegenes
Copy link
Contributor Author

bluegenes commented May 12, 2022

On Thu, May 12, 2022 at 01:35:23PM -0700, Tessa Pierce Ward wrote: @ctb - Now that we zero/null out ANI when the size estimation may be inaccurate, should self x self be a special case where we return 1? Or still avoid returning ANI?
I think avoid, is simplest.

agreed! Bit of an issue though -- compare automatically populates diagonal with ones.

e.g. -- https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/compare.py#L34

similarities = np.ones((n, n))

@ctb
Copy link
Contributor

ctb commented May 12, 2022 via email

@bluegenes
Copy link
Contributor Author

bluegenes commented May 12, 2022

Just to put the answer somewhere on here..

compare using jaccard keeps the original matrix 1's because it uses iterator = itertools.combinations(range(n), 2) to get unique pairs to compare, since comparisons are not directional. compare with --containment instead loops through i/j in matrix, so it calculates containment/ani for every pair, including self vs self.

To fix, I am just always letting self vs self be 1. I'm also using the itertools.combinations for max_containment to avoid recalculating identical vals.

@bluegenes bluegenes changed the title [WIP] prevent ANI estimation when sketch size estimate is likely to be inaccurate [MRG] prevent ANI estimation when sketch size estimate is likely to be inaccurate May 12, 2022
Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bluegenes bluegenes changed the title [MRG] prevent ANI estimation when sketch size estimate is likely to be inaccurate [MRG] prevent ANI estimation when sketch size estimate may be inaccurate May 13, 2022
@bluegenes bluegenes merged commit 06989d7 into latest May 13, 2022
@bluegenes bluegenes deleted the card-estimate-scaled branch May 13, 2022 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants