explore building database minimum set covers #1852
making a database minimum cover

I implemented some stuff over in https://github.com/ctb/2022-database-covers. In brief:
At the end of the process, you should have a new 'cover' database whose hash content is the same as the original's, but with no redundancy between the sketches. A simpler version (that would not be nearly as effective) would be to simply remove any sketches that are exact duplicates of another. Either way, you should be able to identify the same number of hashes for a given query. Initial query results are promising, but we may need to adjust the thresholds for reporting.
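A minimal sketch of that simpler, dedup-only variant with the sourmash Python API (this is not the code in the linked repo, and the filenames are placeholders):

```python
import sourmash

# Hypothetical input/output filenames, for illustration only.
seen_md5s = set()
unique_sigs = []
for ss in sourmash.load_file_as_signatures("db.zip"):
    # md5sum() is computed over the sketch content, so exact duplicate
    # sketches collapse to a single entry; near-duplicates are untouched.
    md5 = ss.md5sum()
    if md5 not in seen_md5s:
        seen_md5s.add(md5)
        unique_sigs.append(ss)

with open("db-dedup.sig", "wt") as fp:
    sourmash.save_signatures(unique_sigs, fp)
```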
summary

So by removing redundant portions of sketches, we eliminate 40% of the sketches and 83% of the hashes in the database, and decrease the file size by 76%.

downsampled GTDB rs202 to 10k:

GTDB entire, at a scaled of 10000:
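(Downsampling to scaled=10000 can be done with `sourmash sig downsample` on the command line; a rough Python-API equivalent, with placeholder filenames rather than the original commands, would be something like the following.)

```python
import sourmash

# Downsample every sketch in the collection to scaled=10000.
downsampled = []
for ss in sourmash.load_file_as_signatures("gtdb-rs202.zip"):
    mh = ss.minhash.downsample(scaled=10000)
    downsampled.append(sourmash.SourmashSignature(mh, name=ss.name))

with open("gtdb-rs202.scaled10k.sig", "wt") as fp:
    sourmash.save_signatures(downsampled, fp)
```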
building a minimum set cover for the database:

then ran:
which progressively discarded already seen hashes.
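The actual script is in the linked repo; as a rough sketch of the idea (not the real code), the core greedy loop might look like this, with placeholder filenames and a largest-sketch-first ordering as one plausible heuristic:

```python
import sourmash

seen = set()   # union of all hashes already placed in the cover
cover = []

sigs = list(sourmash.load_file_as_signatures("gtdb-rs202.scaled10k.sig"))
# Process the biggest sketches first so they anchor the cover (a heuristic choice).
sigs.sort(key=lambda ss: len(ss.minhash), reverse=True)

for ss in sigs:
    novel = set(ss.minhash.hashes) - seen   # hashes not yet covered
    if not novel:
        continue                            # sketch is fully redundant; drop it
    seen.update(novel)

    # Keep only the novel hashes for this sketch in the cover database.
    mh = ss.minhash.copy_and_clear()
    mh.add_many(novel)
    cover.append(sourmash.SourmashSignature(mh, name=ss.name))

with open("gtdb-cover.sig", "wt") as fp:
    sourmash.save_signatures(cover, fp)
```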
Did some validation on the covers code.
and at the end you do indeed see that the cover contains all the same k-mers as the original database:
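One way to spot-check that property with the Python API (filenames are placeholders matching the sketch above):

```python
import sourmash

def all_hashes(path):
    "Collect the union of hashes across every sketch in a collection."
    hashes = set()
    for ss in sourmash.load_file_as_signatures(path):
        hashes.update(ss.minhash.hashes)
    return hashes

orig = all_hashes("gtdb-rs202.scaled10k.sig")
cover = all_hashes("gtdb-cover.sig")
assert orig == cover, "cover and original differ in hash content!"
print(len(orig), "hashes present in both the original and the cover")
```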
I built a GTDB cover at 100k, as well as a GTDB+genbank cover. I was nervous about how slow things would be with a large .zip, so I made an SBT for the gtdb+genbank cover. Running it against SRR11669366 shows that the SBT of ~600,000 sketches is actually faster than the .zip, although it does use more memory 🎉
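For reference, building the SBT over a cover looks roughly like this with the Python API (`sourmash index` is the CLI route; filenames are placeholders):

```python
import sourmash

# Index the cover with a Sequence Bloom Tree so search/gather can prune subtrees.
tree = sourmash.create_sbt_index()
for ss in sourmash.load_file_as_signatures("gtdb-genbank-cover.zip"):
    tree.insert(ss)
tree.save("gtdb-genbank-cover.sbt.zip")
```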
The main remaining thing I need to evaluate for actually using these database covers is: how do the results compare when running gather against a cover vs running gather against the full database? I'm getting somewhat different results. My intuition tells me it's probably about the thresholds I'm providing, but I need to dig into this.
@bluegenes made the excellent point that these database covers will no longer work for strain-level identification, although they will probably work great for species- and genus-level identification.
I think these databases could be a really great way to increase the speed of charcoal -- I'm testing it now. prefetch is failing, however. I'm using it with sourmash 4.2.3; is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.
Another note -- it would be really great if one could select specific genomes to seed from -- like it would be wonderful to be able to tell the cover to first include all of the GTDB reps, and then add in hashes from other signatures secondarily. That way, well-loved organisms will be most likely to be the result that is returned (e.g. E. coli K12, P. aeruginosa PA14, etc.), which I think is helpful when trying to integrate with downstream analysis resources.
yes, I think you want to update to sourmash 4.3.0 so that you get #1866 and #1870. I vaguely recall that those two fixes were connected to this functionality :)
should work as-is - just specify the genomes or collections with priority first. It should be ok if the databases are redundant (e.g. GTDB genomic-reps first, all GTDB next) because the redundant signatures will be removed.
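In other words, the same greedy loop as in the earlier sketch, just fed collections in priority order; a hypothetical illustration with placeholder filenames:

```python
import sourmash

seen = set()
cover = []

# Collections listed in priority order: GTDB reps first, then everything else.
for path in ["gtdb-reps.zip", "gtdb-all.zip", "genbank-extra.zip"]:
    for ss in sourmash.load_file_as_signatures(path):
        novel = set(ss.minhash.hashes) - seen
        if not novel:
            continue   # fully redundant with higher-priority sketches; dropped
        seen.update(novel)
        mh = ss.minhash.copy_and_clear()
        mh.add_many(novel)
        cover.append(sourmash.SourmashSignature(mh, name=ss.name))
```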
note that using this for charcoal when searching gtdb against self might be dangerous - contamination is one of the things that will present weirdly with database covers.
aaaaand a note on the note - it could be really interesting to store just the differences between sketches in some complicated data structure... I think this is similar to #545 / AllSome or Split SBTs.
see also sourmash-bio/database-examples#1 for a species-level merged GTDB db.
note, one problem with species-level merges is that they require adhering to a taxonomy, which the database min set cov does not. curious if we could do something with agglomerative clustering on overlaps of (say) 100kb that would get us there without a taxonomy 🤔 (requiring a taxonomy changes the tenor of sourmash databases, I think, so it's a step to be taken with care)
updates happening to the codebase! some small cleanup; running the gtdb reps first; outputting to a directory for use with fastmultigather, per run.sh. sooner or later I will make it into a plugin, too.
gave a talk at USC yesterday, and Mark Chaisson suggested that we consider preprocessing the highly redundant databases to remove some of the redundancy, and/or adjusting the approach in light of known redundancy. one particular idea was (IIRC) to reduce memory for human microbiome data sets by pre-selecting some sketches and species that tend to show up in them.