Multithread support for bwa index #104
Comments
No, there is no pull request on multi-threaded indexing. Implementing one may take quite some time but might not dramatically improve the performance, especially when you try to build the index within limited space. Generally, to build a large index, you may consider using a larger block size (option "-b"). This option defaults to 10,000,000. You may increase it to 100,000,000 or even larger, depending on your input. This may save you some time. |
@lh3 Thanks, increasing However I don't quite understand the impact of changing this option. At least during indexing, I don't see any significant memory increase even with values as large as 10,000,000,000. What's the trade-off or otherwise, why isn't the default value larger? |
-b specifies how many bases to process in a batch. The memory used by one batch is 8*{-b}. If you have a "reference genome" larger than 200Gb, you won't observe obvious memory increase with -b set to 10G. For a 3Gb human genome, setting -b to 10G will make the peak RAM 8 times as high at the BWT construction phase. |
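The 8*{-b} rule above can be turned into a quick estimate. This is only a sketch: it assumes the figure is in bytes (8 bytes per base in a batch), and the function name `batch_memory_bytes` is made up for illustration, not part of bwa.

```python
def batch_memory_bytes(block_size: int, bytes_per_base: int = 8) -> int:
    """Estimate memory for one batch as 8 * {-b}.

    Assumption: the 8*{-b} figure quoted in the thread is in bytes.
    """
    return block_size * bytes_per_base


if __name__ == "__main__":
    # The default, the suggested larger value, and the 10G value tried in the thread.
    for b in (10_000_000, 100_000_000, 10_000_000_000):
        gib = batch_memory_bytes(b) / 2**30
        print(f"-b {b:>14,d} -> roughly {gib:.2f} GiB per batch")
```

This covers only the batch buffer, not the total footprint of bwa index.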
So if I understand correctly, the ideal On the other hand, if finding the ideal
|
-b is only used when bwa generates "ref.fa.bwt". At that step, bwa index already knows the total length of the reference. -b was added when I wanted to index nt. I have only done that once, so I did not bother to explore the optimal -b in general. Yes, it should be possible to adjust -b automatically, but before that I need to run some experiments to see how speed is affected by -b. Thanks for the suggestion anyway. |
From the tests I've been running, changing the |
Thanks for the data. 6 times is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa. |
BWA's default indexing parameters are quite conservative. This leads to a small memory footprint at the cost of more CPU hours. With large databases (~100GB) default settings require over 2 weeks of CPU time. Increasing the default blocksize will increase the memory footprint but will reduce indexing time 3 to 6 fold. This patch increases the blocksize to roughly 1/10th of the filesize. The memory footprint should be about the size of the database. As per lh3/bwa#104 this patch may become obsolete once this functionality is built into bwa.
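The heuristic this patch describes (block size of roughly one tenth of the file size) could be sketched as follows. This is not the patch itself: the helper names, the choice to never go below the 10,000,000 default, and the use of the file size in bytes as a stand-in for the number of bases are all assumptions made for illustration. The command is only assembled, not run.

```python
import os

DEFAULT_BLOCK_SIZE = 10_000_000  # bwa index default for -b, as stated in the thread


def suggest_block_size(file_size_bytes: int) -> int:
    """Return roughly 1/10th of the file size, never below the default.

    Assumption: the lower bound is our own choice, not taken from the patch.
    """
    return max(DEFAULT_BLOCK_SIZE, file_size_bytes // 10)


def build_index_command(fasta_path: str) -> list[str]:
    """Assemble (but do not execute) a bwa index command line with a scaled -b."""
    block_size = suggest_block_size(os.path.getsize(fasta_path))
    return ["bwa", "index", "-b", str(block_size), fasta_path]


if __name__ == "__main__":
    # A ~100 GB database, the size mentioned in the patch description.
    print(suggest_block_size(100_000_000_000))
```

The resulting list can be passed to `subprocess.run` once the memory footprint (about the size of the database, per the description above) is known to fit on the machine.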
Hello! Any news in this stream? |
Hi, I hope everyone is OK in this thread. I am working with large fasta files and I am wondering if this feature is implemented in the current version? Or will it be implemented any time soon? Or should I continue optimising it? |
Hi! I am also curious to know if anything changed since this thread was started. Cheers |
Hi! I would be very happy to see any news in this thread! I am just dreaming about the threads option! It would be great! Cheers! |
any news on threads?! |
There won't be multi-threaded indexing. I have explained the rationale above. |
Hi all,
Current databases are becoming increasingly large. Recently I've found myself indexing a large FASTA file, which took over 200 CPU hours (single thread).
Searching for multithreaded support for bwa index, I've landed on a 5-year-old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to this patch, though.
Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?