
explore building database minimum set covers #1852

Open
ctb opened this issue Feb 25, 2022 · 14 comments

ctb commented Feb 25, 2022

Gave a talk at USC yesterday, and Mark Chaisson suggested that we consider preprocessing the highly redundant databases to remove some of the redundancy, and/or adjusting the approach in light of known redundancy. One particular idea (IIRC) was to reduce memory for human microbiome data sets by pre-selecting sketches and species that tend to show up in them.

@ctb changed the title from "explore apply min-set-cov/gather to databases as part of preprocessing?" to "explore applying min-set-cov/gather to databases as part of preprocessing?" on Feb 25, 2022

ctb commented Feb 26, 2022

making a database minimum cover

implemented some stuff over in https://github.com/ctb/2022-database-covers.

In brief,

  • walk through a pile of sketches, tracking all previously seen hashes
  • for each new sketch, subtract all previously seen hashes from the sketch
  • save the downsized sketch to the new database, and add its hashes to the set of previously seen hashes

At the end of the process, you should have a new 'cover' database whose hash content is identical to the original database's, but with no redundancy between sketches.
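A minimal sketch of this greedy loop in plain Python, using sets of integers as stand-ins for sourmash sketches (the function and variable names here are illustrative, not taken from the actual make-db-cover.py script):

```python
def build_cover(sketches):
    """Greedily build a 'cover': same union of hashes as the input,
    but no hash appears in more than one output sketch.

    `sketches` is an iterable of (name, set_of_hashes) pairs; plain
    int sets stand in for real sourmash sketches here."""
    seen = set()     # every hash already saved to the cover
    cover = []
    for name, hashes in sketches:
        new_hashes = hashes - seen       # subtract previously seen hashes
        if new_hashes:                   # drop fully redundant sketches
            cover.append((name, new_hashes))
            seen |= new_hashes
    return cover
```

Because the union of hashes is preserved, containment-style queries against the cover can find the same hashes as against the original, even though individual sketches shrink or disappear.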

A simpler version (that would not be nearly as effective) would be to simply remove any sketches that are exact duplicates of other sketches.

Either way, you should be able to identify the same number of hashes for a given query. Initial query results are promising but we may need to adjust thresholds for reporting.

summary

| database    | num sketches | total hashes | file size |
|-------------|--------------|--------------|-----------|
| gtdb entire | 258,406      | 98,501,934   | 901 MB    |
| gtdb cover  | 154,949      | 17,237,504   | 214 MB    |

So by removing redundant portions of sketches we eliminate 40% of the sketches and 83% of the hashes in the database, and decrease the file size by 76%.
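These percentages follow directly from the table; as a quick sanity check (numbers copied from the table above):

```python
# numbers from the summary table above
orig_sketches, cover_sketches = 258_406, 154_949
orig_hashes, cover_hashes = 98_501_934, 17_237_504
orig_mb, cover_mb = 901, 214

sketch_reduction = 1 - cover_sketches / orig_sketches   # ~0.40
hash_reduction = 1 - cover_hashes / orig_hashes         # ~0.825
size_reduction = 1 - cover_mb / orig_mb                 # ~0.76

print(f"{sketch_reduction:.0%} {hash_reduction:.0%} {size_reduction:.0%}")
# prints: 40% 83% 76%
```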

downsampled GTDB rs202 to 10k:

GTDB entire, at a scaled of 10000:

sourmash sig fileinfo gtdb-rs202.genomic.k31.scaled\=10000.zip 

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.scaled=10000.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.scaled=10000.zip
is database? yes
has manifest? yes
num signatures: 258406
** examining manifest...
total hashes: 98501934
summary of sketches:
   258406 sketches with DNA, k=31, scaled=10000, abund 98501934 total hashes

building a minimum set cover for the database:

then ran:

% ./2022-database-covers/make-db-cover.py gtdb-rs202.genomic.k31.scaled\=10000.zip -o gtdb-rs202.genomic.k31.cover.zip

which progressively discarded already seen hashes.

% sourmash sig fileinfo gtdb-rs202.genomic.k31.cover.zip

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.cover.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.cover.zip
is database? yes
has manifest? yes
num signatures: 154949
** examining manifest...
total hashes: 17237504
summary of sketches:
   154949 sketches with DNA, k=31, scaled=10000, abund 17237504 total hashes


ctb commented Mar 1, 2022

Did some validation on the covers code.

# grab the k-31 signatures from podar-ref database
sourmash sig cat ../sourmash/podar-ref.zip -k 31 -o podar-ref.zip

# make a database cover
./make-db-cover.py podar-ref.zip -o podar-ref.cover.zip --scaled=1000

# build a single merged signature from the database cover
sourmash sig merge podar-ref.cover.zip -o podar-ref.cover.merge.zip

# build a single merged signature from the original database
sourmash sig merge podar-ref.zip -o podar-ref.merge.zip

# compare!
sourmash sig overlap podar-ref.merge.zip podar-ref.cover.merge.zip 

and at the end you do indeed see that the cover contains all the same k-mers as the original database:

== This is sourmash version 4.2.5.dev19+g3a6028fb.d20220217. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded one signature each from podar-ref.merge.zip and podar-ref.cover.merge.zip
first signature:
  signature filename: podar-ref.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

second signature:
  signature filename: podar-ref.cover.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

similarity:                  1.00000
first contained in second:   1.00000
second contained in first:   1.00000

number of hashes in first:   200209
number of hashes in second:  200209

number of hashes in common:  200209
only in first:               0
only in second:              0
total (union):               200209

@ctb changed the title from "explore applying min-set-cov/gather to databases as part of preprocessing?" to "explore building database minimum set covers" on Mar 1, 2022

ctb commented Mar 1, 2022

I built a GTDB cover at 100k, as well as a GTDB+genbank cover. I was nervous about how slow things would be with a large .zip, so I made an SBT for the gtdb+genbank cover.

Running it against SRR11669366 shows that the SBT of ~600,000 sketches is actually faster than the .zip, although it does use more memory 🎉

| database          | time | memory  |
|-------------------|------|---------|
| gtdb cover (.zip) | 67s  | 488 MB  |
| gtdb+genbank SBT  | 60s  | 1727 MB |


ctb commented Mar 1, 2022

The main remaining thing I need to evaluate before actually using these database covers is: how do the results of running gather against a cover compare to running gather against the full database? I'm getting somewhat different results. My intuition tells me it's probably about the thresholds I'm providing, but I need to dig into this.


ctb commented Mar 2, 2022

@bluegenes made the excellent point that these database covers will no longer work for strain-level identification, although they will probably work great for species and genus level identification.

taylorreiter commented:

I think these databases could be a really great way for increasing the speed of charcoal -- I'm testing it now. prefetch is failing however. I'm using it with sourmash 4.2.3, is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

taylorreiter commented:

Another note -- it would be really great if one could select specific genomes to seed from -- like it would be wonderful to be able to tell the cover to include first all of the GTDB reps, and then add in hashes from other signatures secondarily. That way, well-loved organisms will be most likely to be the result that is returned (e.g. E. coli K12, P. aeruginosa PA14, etc), which I think is helpful when trying to integrate with downstream analysis resources.


ctb commented Mar 25, 2022

> I think these databases could be a really great way for increasing the speed of charcoal -- I'm testing it now. prefetch is failing however. I'm using it with sourmash 4.2.3, is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

yes, I think you want to update to sourmash 4.3.0 so that you get #1866 and #1870. I vaguely recall that those two fixes were connected to this functionality :)


ctb commented Mar 25, 2022

> Another note -- it would be really great if one could select specific genomes to seed from -- like it would be wonderful to be able to tell the cover to include first all of the GTDB reps, and then add in hashes from other signatures secondarily. That way, well-loved organisms will be most likely to be the result that is returned (e.g. E. coli K12, P. aeruginosa PA14, etc), which I think is helpful when trying to integrate with downstream analysis resources.

should work as is - just specify the genomes or collections with priority first. It should be ok if the databases are redundant (e.g. GTDB genomic-reps first, all GTDB next) because the redundant signatures will be removed.
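A toy illustration of why input order gives this seeding behavior for free: the cover is built greedily, so whichever sketch is processed first claims the shared hashes. (Names and hash values here are made up; plain int sets stand in for real sourmash sketches.)

```python
def build_cover(sketches):
    # greedy subtract-and-save loop: each sketch keeps only the hashes
    # not already claimed by an earlier sketch in the input order
    seen, cover = set(), []
    for name, hashes in sketches:
        new = hashes - seen
        if new:
            cover.append((name, new))
            seen |= new
    return cover

priority = [("E_coli_K12", {1, 2, 3})]         # well-loved reference first
extra = [("E_coli_strain_X", {1, 2, 3, 4})]    # mostly redundant strain
cover = build_cover(priority + extra)
# K12 keeps all of its hashes; the redundant strain keeps only hash 4
```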


ctb commented Mar 25, 2022

note that using this for charcoal when searching gtdb against self might be dangerous - contamination is one of the things that will present weirdly with database covers.


ctb commented Mar 25, 2022

aaaaand a note on the note -

it could be really interesting to store just the differences between sketches in some complicated data structure... I think this is similar to #545 / AllSome or Split SBTs.


ctb commented Apr 8, 2022

see also sourmash-bio/database-examples#1 for a species-level merged GTDB db.


ctb commented Apr 15, 2022

note, one problem with species-level merges is that they require adhering to a taxonomy, which the database min set cov does not.

curious if we could do something with agglomerative clustering on overlaps of (say) 100kb that would get us there without a taxonomy 🤔

(requiring a taxonomy changes the tenor of sourmash databases, I think, so it's a step to be taken with care)


ctb commented Sep 6, 2023

updates happening to the codebase! some small cleanup; running gtdb reps first; outputting to a directory for use with fastmultigather, per run.sh

sooner or later I will make this into a plugin, too.
