Running sourmash gather in parallel/on many queries #2066
Comments
hi @krastegar, the final part of the gather algorithm itself is not directly parallelizable, or at least not easily so. But there are things you can do. Read on...

As of #1370, the default implementation of gather consists of two parts: first, a sweep across the provided databases that finds all genomes that overlap with the query, and then a greedy minimum set cover algorithm that chooses a (much) smaller set of non-redundant matches.

The second part can't easily be parallelized. However, the first part (the prefetch sweep) can be done in parallel. This is not yet implemented natively in sourmash (see #1752 for details on this effort), but you can hack it together at the command line or use a workflow system to do it - see #1664 if you're interested in a snakemake workflow. There are better/faster ways of doing this now, but I haven't updated #1664 with those; let me know if you're interested.

The only real problem with this solution is that it's not much faster if you have a lot of matches - that is, if the prefetch sweep returns more than 30% of the database, as can happen with human microbiomes, I wouldn't expect a big speed increase.
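To illustrate why the second part is hard to parallelize, here is a minimal Python sketch of the two-part structure described above. It is not sourmash's actual code: the function name `greedy_gather`, the plain-set representation of sketches, and the `threshold` parameter are all assumptions made for illustration.

```python
def greedy_gather(query, candidates, threshold=1):
    """Toy two-part gather over plain Python sets of hashes.

    query:      set of hash values for the query (hypothetical representation)
    candidates: dict mapping a genome name to its set of hash values
    threshold:  minimum number of newly covered hashes needed to keep a match
    """
    remaining = set(query)

    # Part 1: the sweep. Each candidate is checked independently of the
    # others, so this loop could be split across workers.
    overlapping = {}
    for name, hashes in candidates.items():
        overlap = hashes & remaining
        if overlap:
            overlapping[name] = overlap

    # Part 2: greedy minimum set cover. Each pick depends on what the
    # previous picks already covered, so the rounds must run in order.
    results = []
    while remaining and overlapping:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(overlapping), key=lambda n: len(overlapping[n] & remaining))
        gained = overlapping[best] & remaining
        if len(gained) < threshold:
            break
        results.append((best, len(gained)))
        remaining -= gained
        del overlapping[best]
    return results
```

In this sketch, the sweep touches every candidate once and needs no shared state, while every round of the second loop reads the `remaining` set left behind by the previous round.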
Thanks for getting back to me @ctb, and sorry for not being specific with my question. I have a relatively large number of k-mer signatures, and I was tasked with running gather on every single one. I was able to speed up the process by using GNU parallel.
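For anyone who prefers to script the same idea rather than use GNU parallel, here is a hedged Python sketch that runs one independent `sourmash gather` process per query signature. The helper names `gather_command` and `run_all`, the output file naming, and the `jobs` default are assumptions for illustration; the only sourmash usage assumed is `sourmash gather <query> <database> -o <csv>`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gather_command(query, database, outdir):
    """Build the command line for one query (hypothetical helper)."""
    out = Path(outdir) / (Path(query).stem + ".gather.csv")
    return ["sourmash", "gather", str(query), str(database), "-o", str(out)]


def run_all(queries, database, outdir, jobs=4, runner=subprocess.run):
    """Run one gather process per query, at most `jobs` at a time.

    Threads are enough here because each worker only waits on a separate
    sourmash process. `runner` is injectable so the function can be tried
    without sourmash installed.
    """
    commands = [gather_command(q, database, outdir) for q in queries]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map returns results in the same order as the queries.
        return list(pool.map(runner, commands))
```

Each query is still processed by a single-threaded gather; the speedup comes only from running several queries at once, so memory use grows with `jobs`.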
excellent, glad to hear it!
https://github.com/sourmash-bio/pyo3_branchwater is a plugin with a fast (multithreaded) implementation of multigather.
as of sourmash_plugin_branchwater v0.9.5, I'll close this once I update the FAQ and documentation here.
Hi,
I have been struggling to find an option to run the sourmash gather command using multiple threads. Is this not available?