[MRG] add lca DBs as inputs to 'sourmash search' and 'gather' #533

Merged
ctb merged 21 commits into master from add/search_gather_lca_db
Dec 19, 2018

ctb
Contributor

@ctb ctb commented Aug 18, 2018

This lets 'search' and 'gather' take LCA DBs as well as SBTs and signatures as input; adds 'multigather' command.

  • multigather is tested
  • search and gather on LCA DBs are tested
  • old style LCA DBs fail with good error messages
  • performance check of gather on genbank databases
  • rebuilt LCA databases are available ;)
  • documentation updated with latest LCA databases

Note that results on SBT and LCA databases for the shakya unassigned contigs are nearly but not completely identical. Trying to decide if this needs to be tracked down...
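One way to track down the difference would be to diff the match names reported by the two runs. The following is a minimal sketch, not part of this PR: it assumes each run's results were saved as CSV with a `name` column, and both the column name and the toy rows below are assumptions made for illustration.

```python
import csv
import io


def names_in(csv_text, column="name"):
    """Collect the set of match names from CSV text that has a header row."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


# Toy stand-ins for the two result files; these rows are invented.
sbt_csv = "name,f_match\ngenome A,0.9\ngenome B,0.4\n"
lca_csv = "name,f_match\ngenome A,0.9\ngenome C,0.4\n"

sbt_names = names_in(sbt_csv)
lca_names = names_in(lca_csv)

# Set differences show which matches appear in only one of the two runs.
only_sbt = sorted(sbt_names - lca_names)
only_lca = sorted(lca_names - sbt_names)
print("only in SBT results:", only_sbt)
print("only in LCA results:", only_lca)
```

With real output, the CSV text would be read from the two result files instead of the inline strings.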

For merge purposes:

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

Command used to construct LCA DBs from SBT:

sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv genbank-k31.lca.json.gz -k 31 --scaled=10000 -f --traverse-directory .sbt.genbank-k31 --split-identifiers

New LCA DB size:

-rw-r--r-- 1 ctb ged-lab 109M Dec 17 03:51 genbank-k21.lca.json.gz
-rw-r--r-- 1 ctb ged-lab 120M Dec 17 03:56 genbank-k31.lca.json.gz
-rw-r--r-- 1 ctb ged-lab 125M Dec 17 04:00 genbank-k51.lca.json.gz

Benchmarking:

        Command being timed: "sourmash gather shakya-unaligned-contigs.sig genbank-k31.lca.json.gz"
        User time (seconds): 111.97
        System time (seconds): 5.46
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57.87
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5118232
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 4105295
        Voluntary context switches: 1373
        Involuntary context switches: 133
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
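To put the report above in more familiar units, the sketch below parses a few of its lines and converts them. It is an illustration only: `parse_time_report` is a hypothetical helper, and it assumes the "kbytes" figure is in units of 1024 bytes.

```python
# A few lines copied from the timing report above.
report = """\
User time (seconds): 111.97
System time (seconds): 5.46
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57.87
Maximum resident set size (kbytes): 5118232
"""


def parse_time_report(text):
    """Return {label: raw string value} for each 'label: value' line."""
    fields = {}
    for line in text.splitlines():
        # Split on the last ': ' so labels that contain colons stay intact.
        label, sep, value = line.strip().rpartition(": ")
        if sep:
            fields[label] = value
    return fields


fields = parse_time_report(report)

# CPU time is user time plus system time.
cpu_seconds = float(fields["User time (seconds)"]) + float(fields["System time (seconds)"])

# Peak memory: kbytes -> GiB (assumes 1 kbyte = 1024 bytes).
peak_rss_gib = int(fields["Maximum resident set size (kbytes)"]) / 1024**2

# This run's wall-clock value is in m:ss form.
minutes, seconds = fields["Elapsed (wall clock) time (h:mm:ss or m:ss)"].split(":")
wall_seconds = int(minutes) * 60 + float(seconds)

print(f"CPU {cpu_seconds:.2f}s, wall {wall_seconds:.2f}s, peak RSS {peak_rss_gib:.2f} GiB")
```

On the numbers above this gives roughly 117.4 s of CPU time against 117.9 s of wall time (consistent with the reported 99% CPU) and a peak resident set size of about 4.9 GiB.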

@codecov-io

codecov-io commented Aug 18, 2018

Codecov Report

Merging #533 into master will decrease coverage by 0.02%.
The diff coverage is 91.35%.

@@            Coverage Diff             @@
##           master     #533      +/-   ##
==========================================
- Coverage   88.56%   88.54%   -0.03%     
==========================================
  Files          25       25              
  Lines        3543     3754     +211     
  Branches       37       37              
==========================================
+ Hits         3138     3324     +186     
- Misses        405      430      +25

Impacted Files                     Coverage Δ
sourmash/__main__.py               95.83% <ø> (ø) ⬆️
sourmash/search.py                 93.61% <100%> (+0.7%) ⬆️
sourmash/sourmash_args.py          95.63% <100%> (+0.31%) ⬆️
sourmash/lca/command_rankinfo.py   89.79% <100%> (+0.43%) ⬆️
sourmash/lca/command_gather.py     83.53% <100%> (-0.4%) ⬇️
sourmash/commands.py               89.22% <79.59%> (-1.3%) ⬇️
sourmash/lca/command_index.py      90.26% <92.22%> (+0.82%) ⬆️
sourmash/lca/lca_utils.py          94.67% <96.66%> (+0.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 21b400d...e708d3a.

luizirber and others added 7 commits October 15, 2018 23:06
* a trial refactoring of the lca db

* save and load seem to basically work

* got search & gather on LCA working in new framework

* allow key error

* keep identities with no lineage assignment

* add multigather

* Hyperlink DOIs to preferred resolver (#562)

* fix divide by zero issue in MinHash.contained_by (#572)

* fix divide by zero issue in contained_by

* remove unused lineages and identifiers

* update report output

* majority of lca tests passing now!

* fix gather?

* ...all tests pass?
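
The divide-by-zero fix mentioned in the commits above is not shown on this page. As a rough sketch of the general idea only (this is not sourmash's `MinHash.contained_by`, and the function below is hypothetical), a containment calculation can guard against an empty query before dividing:

```python
def contained_by(query_hashes, other_hashes):
    """Fraction of query hashes also present in other; 0.0 for an empty query."""
    query = set(query_hashes)
    if not query:
        # Without this guard, the division below would raise ZeroDivisionError.
        return 0.0
    return len(query & set(other_hashes)) / len(query)


print(contained_by([1, 2, 3, 4], [2, 4, 6]))
print(contained_by([], [1, 2]))
```

Returning 0.0 for an empty query is a design choice made for this sketch; raising a clearer error would be another reasonable option.
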
@luizirber
Member

about the test coverage: we can add a .codecov.yml file to configure a bit better how codecov measures coverage.

assert status != 0


@utils.in_tempdir

Member

Contributor Author

I figured they would ;). But I'm happy with the functionality for now and not inclined to fix it; can be adjusted in the future, yah?

Member

yup, the tmpdir example I linked was a trial on using it instead of our current approach, good to see another option (and eventually refactor it out =P)

@ctb
Contributor Author

ctb commented Dec 15, 2018

I'm working on regenerating databases. Once that's done and I do some performance benchmarking, I think I'd like to suggest merging this (since it's becoming big).

query.minhash = query.minhash.downsample_scaled(args.scaled)

# empty?
if not query.minhash.get_mins():

Member

Suggested change
- if not query.minhash.get_mins():
+ if not len(query.minhash):
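
The suggestion swaps one emptiness test for another. The sketch below uses a hypothetical `ToySketch` class (a stand-in, not sourmash's MinHash) to show that the two checks agree, on the assumption that `get_mins()` returns a list of hashes and `len()` reports how many there are:

```python
class ToySketch:
    """Hypothetical stand-in for a MinHash-like object; not sourmash's class."""

    def __init__(self, hashes):
        self._hashes = set(hashes)

    def get_mins(self):
        # In this toy class, a new sorted list is built on every call.
        return sorted(self._hashes)

    def __len__(self):
        # Reports the count without copying the hashes.
        return len(self._hashes)


empty = ToySketch([])
full = ToySketch([3, 1, 2])

for sketch in (empty, full):
    # Both forms of the emptiness check give the same answer.
    assert (not sketch.get_mins()) == (not len(sketch))
```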

@luizirber
Member

I'll review in detail on Tue, but agree on merging after perf bench

@ctb
Contributor Author

ctb commented Dec 17, 2018

Ready for review & merge! @luizirber the LCA DBs are here on the HPC:

[ctb@dev-intel14-phi sourmash-lca]$ pwd
/mnt/home/ctb/research/sourmash-lca
[ctb@dev-intel14-phi sourmash-lca]$ ls *.lca.json.gz
genbank-k21.lca.json.gz  genbank-k31.lca.json.gz  genbank-k51.lca.json.gz

@ctb ctb mentioned this pull request Dec 17, 2018
11 tasks
@ctb
Contributor Author

ctb commented Dec 17, 2018

Latest databases available here for download: https://osf.io/vk4fa/files/. Will update documentation appropriately.

try:
    record_remnants.remove(ident)
except KeyError:
    # @CTB

Member

Hmm, is this a new pattern? If you pass an exception, you need to sign it to be made responsible at some point in the future? =]
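
If `record_remnants` is a set (an assumption; the diff excerpt does not show its type, though raising `KeyError` from `remove` is how sets behave), the try/except could be avoided with `set.discard`. A small sketch with made-up identifiers:

```python
record_remnants = {"id1", "id2"}  # hypothetical identifiers

# Pattern from the diff: remove() raises KeyError when the item is absent.
try:
    record_remnants.remove("id3")
except KeyError:
    pass

# discard() removes the item if present and does nothing otherwise.
record_remnants.discard("id3")
record_remnants.discard("id1")

print(record_remnants)
```

The trade-off is that `discard` hides a missing identifier silently, so the explicit `except KeyError` is still the better fit if that case should be logged or counted.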

Member

@luizirber luizirber left a comment

Some minor comments, but merge at will!

@luizirber luizirber changed the title from "[WIP] add lca DBs as inputs to 'sourmash search' and 'gather'" to "add lca DBs as inputs to 'sourmash search' and 'gather'" Dec 19, 2018
@ctb ctb changed the title from "add lca DBs as inputs to 'sourmash search' and 'gather'" to "[MRG] add lca DBs as inputs to 'sourmash search' and 'gather'" Dec 19, 2018
@ctb ctb merged commit 954d729 into master Dec 19, 2018
@ctb ctb deleted the add/search_gather_lca_db branch December 19, 2018 04:23
@luizirber luizirber mentioned this pull request Jan 4, 2019
5 tasks