Running sourmash gather in parallel/on many queries #2066

Open
krastegar opened this issue May 30, 2022 · 5 comments
Labels
faq: things to add to an FAQ or docs

Comments

@krastegar

Hi,

I have been struggling to find an option for running the sourmash gather command with multiple threads. Is this not available?

@ctb
Contributor

ctb commented May 31, 2022

hi @krastegar, the final part of the gather algorithm itself is not directly parallelizable, or at least not easily so. But there are things you can do. Read on...

As of #1370, the default implementation of gather consists of two parts: first, a sweep across the provided databases that finds all genomes that have an overlap with the query, and then an implementation of a greedy minimum set cover algorithm that chooses a (much) smaller set of non-redundant matches. The second part can't easily be parallelized.

However, the first part can be done in parallel using the prefetch command. The idea is that you can do the sweep across different subsets or shards of the search databases in parallel, and then combine the results and feed them back into gather.

This is not yet implemented natively in sourmash (see #1752 for details on this effort), but you can hack it together at the command line or use a workflow system to do it - see #1664 if you're interested in a snakemake workflow. There are better/faster ways of doing this now, but I haven't updated #1664 with those; let me know if you're interested.
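For anyone who wants to try the command-line route, here is a minimal sketch of the idea, not the workflow from #1664. It assumes the database has already been split into shard files (the shard names and output directory are hypothetical) and relies on the `--save-matches` option of `sourmash prefetch` to keep the matching signatures from each shard. The `SOURMASH` variable is my own addition so the commands can be previewed with `SOURMASH=echo`.

```shell
# Sketch: parallel prefetch over database shards, then a single gather pass.
# Assumption: shards are separate database files; names are illustrative.
SOURMASH="${SOURMASH:-sourmash}"

# Run one prefetch per shard in the background; each saves its own matches.
# Usage: prefetch_shards QUERY.sig OUTDIR SHARD1.zip [SHARD2.zip ...]
prefetch_shards() {
    local query="$1" outdir="$2"
    local shard name pid pids="" status=0
    shift 2
    mkdir -p "$outdir"
    for shard in "$@"; do
        name="$(basename "$shard")"
        "$SOURMASH" prefetch "$query" "$shard" \
            --save-matches "$outdir/${name%.*}.matches.zip" &
        pids="$pids $!"
    done
    # Wait for every shard and remember whether any of them failed.
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Feed the combined prefetch matches back into gather (the serial step).
# Usage: gather_from_matches QUERY.sig OUTDIR RESULTS.csv
gather_from_matches() {
    local query="$1" outdir="$2" out="$3"
    "$SOURMASH" gather "$query" "$outdir"/*.matches.zip -o "$out"
}
```

Something like `prefetch_shards query.sig matches shard-*.zip && gather_from_matches query.sig matches results.csv` would then run the sweep in parallel and the set cover step once. This starts one process per shard with no job limit, so it only suits a modest number of shards.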

The only real problem with this solution is that it's not that much faster if you have a lot of matches - that is, if the prefetch sweep returns more than 30% of the database, as can happen with human microbiomes, I wouldn't expect a big speed increase.

@krastegar
Author

Thanks for getting back to me @ctb,

Sorry for not being specific with my question. I have a relatively large number of k-mer signatures and I was tasked with running gather on every single one. I was able to speed up the process by using GNU parallel:
ls *.sig | parallel -j33 --verbose --max-args=1 'sourmash gather {} gtdb-rs207.genomic.k31.lca.json.gz -o {}.csv'
Since this would otherwise have been a sequential process, parallel worked great (and it's a nice one-liner). Hopefully this can help anyone else who runs into the issue.
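For machines without GNU parallel, the same one-job-per-signature pattern can be sketched with `xargs -P`. This is a variant I am suggesting, not what was run above; the function name `gather_all` and the `SOURMASH` override are hypothetical helpers, and the output naming (`{}.csv`) mirrors the one-liner.

```shell
# Sketch: run one `sourmash gather` per signature, several at a time, via xargs.
# SOURMASH can be overridden (e.g. SOURMASH=echo) to preview the commands.
SOURMASH="${SOURMASH:-sourmash}"

# Usage: gather_all DATABASE JOBS QUERY1.sig [QUERY2.sig ...]
gather_all() {
    local db="$1" jobs="$2"
    shift 2
    # NUL-separate the file names so spaces in paths are handled safely.
    printf '%s\0' "$@" |
        xargs -0 -P "$jobs" -I{} "$SOURMASH" gather {} "$db" -o {}.csv
}
```

For example, `gather_all gtdb-rs207.genomic.k31.lca.json.gz 33 *.sig` would write `NAME.sig.csv` next to each input. Each job is a separate process and presumably loads its own copy of the database, so memory use may grow with the job count.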

@ctb
Contributor

ctb commented Jun 1, 2022

excellent, glad to hear it!

ctb added the "faq" label (things to add to an FAQ or docs) on Jul 23, 2022
ctb changed the title from "Running Sourmash in Parallel" to "Running sourmash gather in parallel/on many queries" on Jul 23, 2022
@ctb
Contributor

ctb commented Sep 4, 2023

https://github.com/sourmash-bio/pyo3_branchwater is a plugin with a fast (multithreaded) implementation of multigather.

@ctb
Contributor

ctb commented Jun 29, 2024

as of sourmash_plugin_branchwater v0.9.5, sourmash scripts fastmultigather is a feature-complete multithreaded multi-query gather, and sourmash scripts fastgather is a feature-complete multithreaded single-query gather 🎉
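A rough sketch of how the two plugin commands might be invoked, assuming the plugin is installed (for example with `pip install sourmash_plugin_branchwater`). The file names are hypothetical, and the flags shown (`-o`, `--cores`) reflect my understanding of the plugin's command line rather than anything stated in this thread, so please confirm them with `sourmash scripts fastgather --help` before relying on them.

```shell
# Sketch: thin wrappers around the branchwater plugin commands.
# Assumptions: flag names and file names are illustrative; check --help.
SOURMASH="${SOURMASH:-sourmash}"
CORES="${CORES:-4}"

# Single query against a database, using multiple threads.
# Usage: fast_gather_one QUERY.sig DATABASE RESULTS.csv
fast_gather_one() {
    "$SOURMASH" scripts fastgather "$1" "$2" -o "$3" --cores "$CORES"
}

# Many queries against a database in one multithreaded run.
# Usage: fast_gather_many QUERIES DATABASE
fast_gather_many() {
    "$SOURMASH" scripts fastmultigather "$1" "$2" --cores "$CORES"
}
```

How fastmultigather names and writes its per-query results has changed between plugin versions, so the output handling here is left to the defaults.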

I'll close this once I update the FAQ and documentation here.
