
explore building database minimum set covers #1852

Open
ctb opened this issue Feb 25, 2022 · 14 comments

ctb commented Feb 25, 2022

Gave a talk at USC yesterday, and Mark Chaisson suggested that we consider preprocessing the highly redundant databases to remove some of the redundancy, and/or adjusting the approach in light of known redundancy. One particular idea (IIRC) was to reduce memory for human microbiome data sets by pre-selecting sketches and species that tend to show up in them.

@ctb changed the title from "explore apply min-set-cov/gather to databases as part of preprocessing?" to "explore applying min-set-cov/gather to databases as part of preprocessing?" on Feb 25, 2022

ctb commented Feb 26, 2022

making a database minimum cover

implemented some stuff over in https://github.com/ctb/2022-database-covers.

In brief,

  • walk through a pile of sketches, tracking all previously seen hashes
  • for each new sketch, subtract all previously seen hashes from the sketch
  • save the downsized sketch to the new database, and add its hashes to the set of previously seen hashes

At the end of the process, you should have a new 'cover' database whose hash content is identical to the original database's, but with no redundancy between sketches.
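A minimal sketch of this greedy loop in plain Python, using sets of integers as stand-ins for sourmash sketches (the function and variable names here are illustrative, not taken from the actual make-db-cover.py script):

```python
def build_cover(sketches):
    """Greedily build a 'cover': same union of hashes as the input,
    but no hash appears in more than one output sketch.

    `sketches` is an iterable of (name, set_of_hashes) pairs; plain
    int sets stand in for real sourmash sketches here."""
    seen = set()     # every hash already saved to the cover
    cover = []
    for name, hashes in sketches:
        new_hashes = hashes - seen       # subtract previously seen hashes
        if new_hashes:                   # drop fully redundant sketches
            cover.append((name, new_hashes))
            seen |= new_hashes
    return cover
```

Because the union of hashes is preserved, containment-style queries against the cover can find the same hashes as against the original, even though individual sketches shrink or disappear.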

A simpler version (that would not be nearly as effective) would be to simply remove any sketches that are exact duplicates of other sketches.

Either way, you should be able to identify the same number of hashes for a given query. Initial query results are promising but we may need to adjust thresholds for reporting.

summary

| database    | num sketches | total hashes | file size |
|-------------|--------------|--------------|-----------|
| gtdb entire | 258,406      | 98,501,934   | 901 MB    |
| gtdb cover  | 154,949      | 17,237,504   | 214 MB    |

So by removing redundant portions of sketches we eliminate 40% of the sketches and 83% of the hashes in the database, and decrease the file size by 76%.
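These percentages follow directly from the table; as a quick sanity check (numbers copied from the table above):

```python
# numbers from the summary table above
orig_sketches, cover_sketches = 258_406, 154_949
orig_hashes, cover_hashes = 98_501_934, 17_237_504
orig_mb, cover_mb = 901, 214

sketch_reduction = 1 - cover_sketches / orig_sketches   # ~0.40
hash_reduction = 1 - cover_hashes / orig_hashes         # ~0.825
size_reduction = 1 - cover_mb / orig_mb                 # ~0.76

print(f"{sketch_reduction:.0%} {hash_reduction:.0%} {size_reduction:.0%}")
# prints: 40% 83% 76%
```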

downsampled GTDB rs202 to 10k:

GTDB entire, at a scaled of 10000:

sourmash sig fileinfo gtdb-rs202.genomic.k31.scaled\=10000.zip 

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.scaled=10000.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.scaled=10000.zip
is database? yes
has manifest? yes
num signatures: 258406
** examining manifest...
total hashes: 98501934
summary of sketches:
   258406 sketches with DNA, k=31, scaled=10000, abund 98501934 total hashes

building a minimum set cover for the database:

then ran:

% ./2022-database-covers/make-db-cover.py gtdb-rs202.genomic.k31.scaled\=10000.zip -o gtdb-rs202.genomic.k31.cover.zip

which progressively discarded already seen hashes.

% sourmash sig fileinfo gtdb-rs202.genomic.k31.cover.zip

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.cover.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.cover.zip
is database? yes
has manifest? yes
num signatures: 154949
** examining manifest...
total hashes: 17237504
summary of sketches:
   154949 sketches with DNA, k=31, scaled=10000, abund 17237504 total hashes


ctb commented Mar 1, 2022

Did some validation on the covers code.

# grab the k-31 signatures from podar-ref database
sourmash sig cat ../sourmash/podar-ref.zip -k 31 -o podar-ref.zip

# make a database cover
./make-db-cover.py podar-ref.zip -o podar-ref.cover.zip --scaled=1000

# build a single merged signature from the database cover
sourmash sig merge podar-ref.cover.zip -o podar-ref.cover.merge.zip

# build a single merged signature from the original database
sourmash sig merge podar-ref.zip -o podar-ref.merge.zip

# compare!
sourmash sig overlap podar-ref.merge.zip podar-ref.cover.merge.zip 

and at the end you do indeed see that the cover contains all the same k-mers as the original database:

== This is sourmash version 4.2.5.dev19+g3a6028fb.d20220217. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded one signature each from podar-ref.merge.zip and podar-ref.cover.merge.zip
first signature:
  signature filename: podar-ref.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

second signature:
  signature filename: podar-ref.cover.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

similarity:                  1.00000
first contained in second:   1.00000
second contained in first:   1.00000

number of hashes in first:   200209
number of hashes in second:  200209

number of hashes in common:  200209
only in first:               0
only in second:              0
total (union):               200209

@ctb changed the title from "explore applying min-set-cov/gather to databases as part of preprocessing?" to "explore building database minimum set covers" on Mar 1, 2022

ctb commented Mar 1, 2022

I built a GTDB cover at 100k, as well as a GTDB+genbank cover. I was nervous about how slow things would be with a large .zip, so I made an SBT for the gtdb+genbank cover.

Running it against SRR11669366 shows that the SBT of ~600,000 sketches is actually faster than the .zip, although it does use more memory 🎉

| database          | time | memory  |
|-------------------|------|---------|
| gtdb cover (.zip) | 67s  | 488 MB  |
| gtdb+genbank SBT  | 60s  | 1727 MB |


ctb commented Mar 1, 2022

The main remaining thing I need to evaluate before actually using these database covers is: how do the results of running gather against a cover compare to running gather against the full database? I'm getting somewhat different results. My intuition tells me it's probably about the thresholds I'm providing, but I need to dig into this.


ctb commented Mar 2, 2022

@bluegenes made the excellent point that these database covers will no longer work for strain-level identification, although they will probably work great for species and genus level identification.

taylorreiter commented:

I think these databases could be a really great way for increasing the speed of charcoal -- I'm testing it now. prefetch is failing however. I'm using it with sourmash 4.2.3, is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

taylorreiter commented:

Another note -- it would be really great if one could select specific genomes to seed from -- like it would be wonderful to be able to tell the cover to include first all of the GTDB reps, and then add in hashes from other signatures secondarily. That way, well-loved organisms will be most likely to be the result that is returned (e.g. E. coli K12, P. aeruginosa PA14, etc), which I think is helpful when trying to integrate with downstream analysis resources.


ctb commented Mar 25, 2022

> I think these databases could be a really great way for increasing the speed of charcoal -- I'm testing it now. prefetch is failing however. I'm using it with sourmash 4.2.3, is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

yes, I think you want to update to sourmash 4.3.0 so that you get #1866 and #1870. I vaguely recall that those two fixes were connected to this functionality :)


ctb commented Mar 25, 2022

> Another note -- it would be really great if one could select specific genomes to seed from -- like it would be wonderful to be able to tell the cover to include first all of the GTDB reps, and then add in hashes from other signatures secondarily. That way, well-loved organisms will be most likely to be the result that is returned (e.g. E. coli K12, P. aeruginosa PA14, etc), which I think is helpful when trying to integrate with downstream analysis resources.

should work as is - just specify the genomes or collections with priority first. It should be ok if the databases are redundant (e.g. GTDB genomic-reps first, all GTDB next) because the redundant signatures will be removed.
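A toy illustration of why input order gives this seeding behavior for free: the cover is built greedily, so whichever sketch is processed first claims the shared hashes. (Names and hash values here are made up; plain int sets stand in for real sourmash sketches.)

```python
def build_cover(sketches):
    # greedy subtract-and-save loop: each sketch keeps only the hashes
    # not already claimed by an earlier sketch in the input order
    seen, cover = set(), []
    for name, hashes in sketches:
        new = hashes - seen
        if new:
            cover.append((name, new))
            seen |= new
    return cover

priority = [("E_coli_K12", {1, 2, 3})]         # well-loved reference first
extra = [("E_coli_strain_X", {1, 2, 3, 4})]    # mostly redundant strain
cover = build_cover(priority + extra)
# K12 keeps all of its hashes; the redundant strain keeps only hash 4
```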


ctb commented Mar 25, 2022

note that using this for charcoal when searching gtdb against self might be dangerous - contamination is one of the things that will present weirdly with database covers.


ctb commented Mar 25, 2022

aaaaand a note on the note -

it could be really interesting to store just the differences between sketches in some complicated data structure... I think this is similar to #545 / AllSome or Split SBTs.


ctb commented Apr 8, 2022

see also sourmash-bio/database-examples#1 for a species-level merged GTDB db.


ctb commented Apr 15, 2022

note, one problem with species-level merges is that they require adhering to a taxonomy, which the database min set cov does not.

curious if we could do something with agglomerative clustering on overlaps of (say) 100kb that would get us there without a taxonomy 🤔

(requiring a taxonomy changes the tenor of sourmash databases, I think, so it's a step to be taken with care)


ctb commented Sep 6, 2023

updates happening to the codebase! some small cleanup; running gtdb reps first; outputting to a directory for use with fastmultigather, per run.sh

sooner or later I will make this into a plugin, too.
