Running `sourmash gather` in parallel/on many queries #2066

krastegar · 2022-05-30T00:37:36Z

Hi,

I have been struggling to find an option for running the `sourmash gather` command with multiple threads. Is this not available?

ctb · 2022-05-31T13:11:19Z

hi @krastegar, the final part of the gather algorithm itself is not directly parallelizable, or at least not easily so. But there are things you can do. Read on...

As of #1370, the default implementation of gather consists of two parts: first, a sweep across the provided databases that finds all genomes that have an overlap with the query, and then an implementation of a greedy minimum set cover algorithm that chooses a (much) smaller set of non-redundant matches. The second part can't easily be parallelized.
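To make the second step concrete, here is a minimal toy sketch of a greedy minimum set cover over plain Python sets of integers. It is not sourmash's implementation: the function name `greedy_gather`, the `threshold` parameter, and the example "genomes" are all made up for illustration, and real sketches carry far more hashes. In this sketch each pick depends on what the previous picks removed from the query, which is one way to see why the loop does not split cleanly across threads.

```python
def greedy_gather(query, candidates, threshold=1):
    """Toy greedy set cover (hypothetical helper, not sourmash code).

    query:      set of hashes to explain
    candidates: dict mapping a name to its set of hashes
    Returns the chosen (name, new_overlap) pairs and the unexplained hashes.
    """
    remaining = set(query)
    results = []
    while remaining:
        # Pick the candidate that covers the most still-unexplained hashes.
        # Ties are broken by name so the result is deterministic.
        name, overlap = max(
            ((n, len(remaining & h)) for n, h in candidates.items()),
            key=lambda item: (item[1], item[0]),
            default=(None, 0),
        )
        # Stop when nothing useful is left (also guards against looping forever).
        if overlap < max(threshold, 1):
            break
        results.append((name, overlap))
        # Later picks only get credit for hashes not already claimed.
        remaining -= candidates[name]
    return results, remaining


# Illustrative data only.
query = set(range(1, 11))
candidates = {
    "genomeA": {1, 2, 3, 4, 5, 6},
    "genomeB": {4, 5, 6, 7, 8},
    "genomeC": {5, 6},  # fully redundant with genomeA
}
results, remaining = greedy_gather(query, candidates)
print(results)    # [('genomeA', 6), ('genomeB', 2)]
print(remaining)  # {9, 10}
```

Note that `genomeC` overlaps the query but is never reported, because everything it shares with the query was already assigned to `genomeA`.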

However, the first part can be done in parallel using the prefetch command. The idea is that you can do the sweep across different subsets or shards of the search databases in parallel, and then combine the results and feed them back into gather.

This is not yet implemented natively in sourmash (see #1752 for details on this effort), but you can hack it together at the command line or use a workflow system to do it - see #1664 if you're interested in a snakemake workflow. There are better/faster ways of doing this now, but I haven't updated #1664 with those; let me know if you're interested.

The only real problem with this solution is that it's not that much faster if you have a lot of matches - that is, if the prefetch sweep returns more than 30% of the database, as can happen with human microbiomes, I wouldn't expect a big speed increase.
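The shape of the sharded sweep can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the sourmash API: shards are plain dicts of name to hash set, and `sweep_shard` and `parallel_sweep` are hypothetical helper names. Threads are used only to show the structure; a real setup would more likely run one `sourmash prefetch` process per shard and combine their saved matches.

```python
from concurrent.futures import ThreadPoolExecutor


def sweep_shard(query, shard, threshold=1):
    """Return every entry of one shard sharing at least `threshold` hashes with the query."""
    return {
        name: hashes
        for name, hashes in shard.items()
        if len(query & hashes) >= threshold
    }


def parallel_sweep(query, shards, max_workers=4):
    """Sweep each shard independently, then merge the per-shard matches."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Shards do not depend on each other, so the order of completion is irrelevant.
        for matches in pool.map(lambda shard: sweep_shard(query, shard), shards):
            merged.update(matches)
    return merged


# Illustrative data only: two shards of a made-up database.
shards = [
    {"genomeA": {1, 2, 3}, "genomeX": {100}},
    {"genomeB": {3, 4}, "genomeY": {200, 300}},
]
matches = parallel_sweep({1, 2, 3, 4}, shards)
print(sorted(matches))  # ['genomeA', 'genomeB']
```

The merged matches are the (hopefully much smaller) collection that then goes through the sequential selection step. If most of the database matches, that collection is not much smaller than the database itself, which is the limitation described above.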

krastegar · 2022-06-01T01:47:29Z

Thanks for getting back to me @ctb,

Sorry for not being specific in my question. I have a relatively large number of k-mer signatures, and I was tasked with running gather on every one of them. I was able to speed up the process by using GNU parallel:

`ls *.sig | parallel -j33 --verbose --max-args=1 'sourmash gather {} gtdb-rs207.genomic.k31.lca.json.gz -o {}.csv'`

Since this would otherwise have been a sequential process, parallel worked great (and it's a nice one-liner). Hopefully this can help anyone else who runs into the issue.
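For anyone without GNU parallel, roughly the same fan-out can be sketched in Python. The `sourmash gather` arguments are copied from the one-liner above; the helper names `gather_command` and `run_all` and the injectable `runner` argument are assumptions made for this example, not part of sourmash.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Database name taken from the GNU parallel command above.
DATABASE = "gtdb-rs207.genomic.k31.lca.json.gz"


def gather_command(sig, database=DATABASE):
    """Build the command line that the one-liner runs for a single signature file."""
    return ["sourmash", "gather", str(sig), database, "-o", f"{sig}.csv"]


def run_all(sigs, jobs=33, runner=subprocess.run):
    """Run one gather per signature, at most `jobs` at a time.

    `runner` defaults to subprocess.run, which does not check exit codes;
    pass a different callable to add error handling or to do a dry run.
    Returns the list of commands that were submitted.
    """
    commands = [gather_command(sig) for sig in sigs]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Threads suffice here: each worker only waits on an external process.
        list(pool.map(runner, commands))
    return commands
```

A dry run such as `run_all(["a.sig", "b.sig"], runner=print)` prints the commands without executing them; calling `run_all(sorted(glob.glob("*.sig")))` (after `import glob`) would launch the real jobs.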

ctb · 2022-06-01T12:47:18Z

excellent, glad to hear it!

ctb · 2023-09-04T13:25:35Z

https://github.com/sourmash-bio/pyo3_branchwater is a plugin with a fast (multithreaded) implementation of multigather.

ctb · 2024-06-29T19:41:05Z

as of sourmash_plugin_branchwater v0.9.5, `sourmash scripts fastmultigather` is a feature-complete multithreaded multi-query gather, and `sourmash scripts fastgather` is a feature-complete multithreaded single-query gather 🎉

I'll close this once I update the FAQ and documentation here.

ctb mentioned this issue Jun 1, 2022

Search a database with multiple genomes #2069

Open

mr-eyes mentioned this issue Jun 1, 2022

Parallelization of sourmash search #2071

Open

ctb added the `faq` label (things to add to an FAQ or docs) Jul 23, 2022

ctb changed the title ~~Running Sourmash in Parallel~~ Running sourmash gather in parallel/on many queries Jul 23, 2022
