Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add max_containment to MinHash class. #1346

Merged
merged 41 commits into from
Mar 12, 2021
Merged

[MRG] add max_containment to MinHash class. #1346

merged 41 commits into from
Mar 12, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Feb 23, 2021

Add max_containment() and --max-containment per #1343.

Fixes #1247
Fixes #1343

Main features:

  • adds MinHash.max_containment(other, downsample=False)
  • adds SourmashSignature.max_containment(other, downsample=False)
  • adds --max-containment flag to sourmash search
  • adds do_max_containment flag to search for Index classes (LinearIndex, SBT, LCA_Database)

In addition, this PR indulges in misc cleanup:

  • refactors some of the Index code to use named arguments, now that legacy support for Python 2.7 was removed.
  • removes untested result caching in sbtmh.py.
  • fixes namespace collision and associated broken test in test_signature.py - there were two test_str functions.
  • fixes namespace collision and associated broken test in test__minhash.py -- there were two test_mh_len functions.
  • test that SBT.search(...) properly fails when threshold=None (this code did not have test coverage)
  • remove duplicate test_compare_containment_abund_flatten test function in test_sourmash.py.

TODO:

  • add feature for compare --max-containment along with tests
  • check that max_containment does not allow non-scaled signatures
  • add test for if do_containment and do_max_containment conditions
  • add test for if threshold is None: in sbt.py
  • add command line tests for max containment
  • check SBT especially

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Feb 23, 2021

Codecov Report

Merging #1346 (108748a) into latest (55741dc) will increase coverage by 0.27%.
The diff coverage is 97.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1346      +/-   ##
==========================================
+ Coverage   88.88%   89.15%   +0.27%     
==========================================
  Files         123      123              
  Lines       18321    18593     +272     
  Branches     1410     1432      +22     
==========================================
+ Hits        16284    16577     +293     
+ Misses       1800     1780      -20     
+ Partials      237      236       -1     
Flag Coverage Δ
python 94.41% <97.82%> (+0.24%) ⬆️
rust 67.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/index.py 93.10% <70.00%> (-2.20%) ⬇️
src/sourmash/sbt.py 84.14% <84.61%> (-0.16%) ⬇️
src/sourmash/lca/lca_db.py 92.42% <85.71%> (+0.02%) ⬆️
src/sourmash/sbtmh.py 89.83% <89.47%> (+6.65%) ⬆️
src/sourmash/cli/compare.py 100.00% <100.00%> (ø)
src/sourmash/cli/search.py 100.00% <100.00%> (ø)
src/sourmash/commands.py 82.63% <100.00%> (+0.41%) ⬆️
src/sourmash/compare.py 89.18% <100.00%> (+1.31%) ⬆️
src/sourmash/minhash.py 93.51% <100.00%> (+0.15%) ⬆️
src/sourmash/search.py 91.22% <100.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55741dc...108748a. Read the comment docs.

ignore_abundance = kwargs.get('ignore_abundance', False)

# configure search - containment? ignore abundance?
if do_containment:
query_match = lambda x: query.contained_by(x, downsample=True)
elif do_max_containment:
query_match = lambda x: query.max_containmenty(x, downsample=True)

This comment was marked as outdated.

@ctb
Copy link
Contributor Author

ctb commented Feb 23, 2021

witness!

search

% sourmash search genome-s10.fa.gz.sig all.sbt.zip  

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10.fa.gz
 51.6%       ../genome-s10+s11.fa.gz
 16.7%       ../genome-s10-small.fa.gz

search --containment

sourmash search genome-s10.fa.gz.sig all.sbt.zip  --containment

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10.fa.gz
100.0%       ../genome-s10+s11.fa.gz
 16.7%       ../genome-s10-small.fa.gz

search --max-containment

% sourmash search genome-s10.fa.gz.sig all.sbt.zip  --max-containment

== This is sourmash version 4.0.0rc2.dev1+gce950caa. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ../genome-s10.fa.gz... (k=31, DNA)
loaded 1 databases.

3 matches:
similarity   match
----------   -----
100.0%       ../genome-s10-small.fa.gz
100.0%       ../genome-s10.fa.gz
100.0%       ../genome-s10+s11.fa.gz

@ctb
Copy link
Contributor Author

ctb commented Mar 5, 2021

@bluegenes this might be ready to try out. I wouldn't trust the SBT code just yet, but everything else should work I think.

@bluegenes
Copy link
Contributor

Thanks @ctb!! Will take it for a spin

@bluegenes
Copy link
Contributor

sourmash search --max-containment seems to be running on my testing! YAY.

I somewhat forgot that the search csv output is less informative than the gather csv output -- ultimately, it would be really nice to have add'l info, particularly signature name + matched bp, etc, as you suggest in #1366 for 5.0.

some mini results:

sourmash search --max-containment output.protein-pigeon/prodigal/signatures/pigeon1.0-GOV_18503.prodigal.sig output.pigeon1.0-cluster/pigeon1.0.protein-k10.mc0.05.founders.sbt.zip -o output.pigeon1.0-cluster/cluster_info/GOV_18503_x_pigeon1.0.protein-k10.mc0.05.best-founder.txt --threshold 0.01 --ksize 10 --protein

--max-containment:

4 matches; showing first 3:
similarity   match
----------   -----
 10.0%       EarthsVirome_13607
  8.7%       GOV_66421
  4.5%       EarthsVirome_15562

** note, none of these show up with the default threshold (0.08). I needed --threshold 0.05 for the first match to show up, 0.03 for the next and 0.02 for the 3rd match. I guess I thought search used a percentage threshold -- will go look at it, but wanted to record here in the mean time.

--containment:

4 matches; showing first 3:
similarity   match
----------   -----
  5.3%       EarthsVirome_13607
  3.1%       GOV_66421
  2.3%       EarthsVirome_15562

jaccard:

3 matches:
similarity   match
----------   -----
  3.8%       EarthsVirome_13607
  2.7%       GOV_66421
  1.7%       EarthsVirome_15562

@bluegenes
Copy link
Contributor

would it be straightforward to add compare --max-containment here, or would that be better in a separate PR?

@ctb
Copy link
Contributor Author

ctb commented Mar 11, 2021

would it be straightforward to add compare --max-containment here, or would that be better in a separate PR?

Totes. Great idea.

@ctb
Copy link
Contributor Author

ctb commented Mar 11, 2021

sourmash search --max-containment seems to be running on my testing! YAY.

I somewhat forgot that the search csv output is less informative than the gather csv output -- ultimately, it would be really nice to have add'l info, particularly signature name + matched bp, etc, as you suggest in #1366 for 5.0.

Yeah, search CSV output is terrible! As you will see in #1370 😉 I am planning to output the following in the prefetch command,

  • intersect bp
  • match bp
  • query bp
  • jaccard similarity
  • query containment
  • match containment
  • max_containment
    and the only hesitation I have in adding all of these to search output is that they would be subject to different thresholds depending on the command line arguments, and my hot take is that I don't like that.

Specifically, imagine that you do a search with --max-containment vs --similarity - for the same query, the results will (of course) be different, and the order and thresholding in the output CSV will be different.

Of course, this is a problem for the current output, too, in that the actual meaning of the similarity number in the output file changes depending on the search parameters. So that's bad. Grr.

🤔

Hmm, I wonder if we could include an extra column that is something like "search key" to indicate what the search was? It'll be redundant (since it would have to be in every row for this columnar data output) but at least it'd be there.

Might also be a good opportunity to support JSON or YAML output from search, gather, and prefetch - then those files could contain far more information, including full command-line parameters, threshold, etc. etc. See #448.

We could also turn off CSV output from search entirely, and tell people to use prefetch for programmatic foo...

** note, none of these show up with the default threshold (0.08). I needed --threshold 0.05 for the first match to show up, 0.03 for the next and 0.02 for the 3rd match. I guess I thought search used a percentage threshold -- will go look at it, but wanted to record here in the mean time.

Curious what you mean by "percentage threshold"? It just takes a floating point number that is the lowest similarity etc to report.

OH! You mean the thresholding is actually just wrong for --max-containment. I, umm, should fix that! I appreciate the implication that this was actually something intelligent, and not just a dumb bug...

@ctb ctb changed the title [WIP] add max_containment to MinHash class. [MRG] add max_containment to MinHash class. Mar 12, 2021
@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

hi @bluegenes this PR is ready for review (and merge?)!

Other than code review (sorry for all the mess...) the two remaining things are --

  • do you want me to add more stuff to the CSV output in this PR, or should I start a new PR for that? I think this one is getting kind of big.
  • could do a test on a large-ish SBT, by doing sourmash search --max-containment on both the SBT and on a directory of contents created with sourmash sig split run on the SBT? I'd like a double-check on the SBT max containment code - no reason it shouldn't work, just slightly paranoid about computers...

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

I added the query information into the search CSV output in cbd2503. I couldn't add all of the info that @bluegenes might want in, however, because search works on regular MinHashes, too, which don't support containment or max_containment.

Perhaps another reason to consider getting rid of "regular" MinHash per #1354

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

(I suppose I could put zeros in for all those numbers when running on regular minhashes...)

@bluegenes
Copy link
Contributor

(I suppose I could put zeros in for all those numbers when running on regular minhashes...)

NA's?

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021 via email

src/sourmash/commands.py Outdated Show resolved Hide resolved
Comment on lines +500 to +501
fieldnames = ['similarity', 'name', 'filename', 'md5',
'query_filename', 'query_name', 'query_md5']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to modify similarity to containment / max_containment for csv output. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantic versioning prevents us from removing the similarity header before 5.0. we could add new columns, I 'spose. I don't like the idea that column headers change depending on command line arguments, though. Not sure how to think about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(suggest we punt this to a new issue and discuss it there.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

punted to #1390

Co-authored-by: Tessa Pierce <bluegenes@users.noreply.github.com>
src/sourmash/minhash.py Outdated Show resolved Hide resolved
Co-authored-by: Tessa Pierce <bluegenes@users.noreply.github.com>
query=query,
query_filename=query.filename,
query_name=query.name,
query_md5=query.md5sum()[:8]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to truncate one md5 and not the other?

...and now that I think about it, it's for semantic versioning, eh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err, nope, didn't notice I wasn't truncating the md5sum for the match. Hrm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

punted to #1390

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rest lgtm!

@ctb
Copy link
Contributor Author

ctb commented Mar 12, 2021

OK, I think I addressed everything. Will wait for tests to pass before merging.

@ctb ctb merged commit 77467de into latest Mar 12, 2021
@ctb ctb deleted the add/max_containment branch March 12, 2021 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants