nt index incomplete build #275
How much memory do you have on your machine? I guess the job got killed due to an out-of-memory issue.
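As an aside, one quick way to confirm an OOM kill, assuming you can read the kernel log on the build node (on a shared cluster, the scheduler's job accounting report is the place to look instead):

```sh
# Search the kernel log for OOM-killer activity around the time the build died.
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"
```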
I used 150G. What would be a reasonable amount? It did not give any kind of memory error.
150G is not enough for the nt database. I don't remember the total size of the current nt, but I think you may need a machine with more than 1TB of memory to build the nt index.
You may need about 3TB of memory for that, then.
If I don't have that much, are there any alternatives?
How much memory do you have?
Theoretically 1.5TB, but it's a shared environment and unclear how long I would need to wait to actually get that.
If you have the files ready, you may try Centrifuger (https://github.com/mourisl/centrifuger). It has the option "--build-mem" in "centrifuger-build". Perhaps you can try something like "--build-mem 1200G" or slightly more, which will try to find appropriate parameters so that memory usage stays roughly within the given range. If you want to try Centrifuger, please use "git clone" to get the package; I recently improved the speed of index building, and the updated code will be in the next formal release. You can use "-t 16" for parallelization too.
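For concreteness, a minimal sketch of what such a build command could look like. The file paths and output prefix below are placeholders, and the option names other than "--build-mem" and "-t" are assumptions based on centrifuger-build's usage; check its help output for the exact names:

```sh
# Hypothetical invocation of centrifuger-build; all file paths are placeholders.
# nt_fasta_list.txt             : list of nt FASTA files, one file per line
# nt_seqid2taxid.map            : sequence ID -> taxonomy ID mapping
# taxonomy/nodes.dmp, names.dmp : NCBI taxonomy dump files
centrifuger-build -t 16 --build-mem 1200G \
    -l nt_fasta_list.txt \
    --conversion-table nt_seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp \
    --name-table taxonomy/names.dmp \
    -o nt_centrifuger
```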
Is the Centrifuger index compatible with Centrifuge?
No, the underlying data structure is quite different, so the index is not compatible.
It was not able to finish in 5 days with 500G. Does that seem reasonable?
500G might not be enough in the end. For 1.5T of sequence, storing the raw sequences takes about 400G of space, and representing the BWT can take another 400G, which is well over the memory allocation. I think much of the time will be spent on memory page swapping. I would still recommend allocating as much memory as possible.
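As a rough back-of-the-envelope check of those numbers, assuming a 2-bit packed representation of the bases (the actual data structures carry additional overhead):

```sh
# ~1.5 Tbp of sequence at 2 bits per base:
#   1,500 Gbases * 2 bits / 8 bits-per-byte ≈ 375 GB for the packed text,
# and a plain 2-bit BWT of the same length costs roughly the same again,
# so text + BWT alone already approach ~750 GB before any construction buffers.
echo $(( 1500 * 2 / 8 ))   # ≈ 375 (GB)
```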
I tried Centrifuger 1.0.2 with 1400G mem and 16 threads. In 15 days, it was able to extract 368/86745 chunks, so still far from complete. Probably a lot more memory is needed for this to be feasible in a reasonable amount of time.
Thank you for the updates. Indeed, this is too slow; speeding this up is one of the next goals. How long does it take to process one batch (16 chunks)? What are the inferred "Estimated block size" and "dcv"?
This is what the last batch looks like:
FYI, our pre-print accompanying the release of a new Centrifuge nt database is online now: Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification. Any feedback will be welcome!
Hi @khyox, I tried to download the db using best nic
Hi @nicolo-tellini, you should have them all then! We added the following line to the Data availability section of the manuscript to clarify that:
Let me know if you aren't seeing that line in the version of the pre-print that you're working with. The idea of splitting into 4 GiB files is that it should be easy to recover from a failure without losing much: you keep all the fully downloaded files, discard the last partially downloaded one, and resume downloading from that one onward.
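For example, a resumable download of the split volumes could look roughly like this (a sketch only: the base URL, file names, and part count are placeholders, not the actual locations listed in the pre-print):

```sh
# Hypothetical: substitute the real base URL and part range from the manuscript.
BASE_URL="https://example.org/centrifuge-nt-db"
for part in $(seq -w 1 999); do
    # -c keeps already-completed parts and resumes the partially downloaded one
    wget -c "${BASE_URL}/nt.7z.${part}" || break
done
# Pointing 7z at the first volume extracts across all split parts.
7z x nt.7z.001
```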
Hi @khyox, I see, thanks. Sorry, my bad, I am not familiar with 7z.
No problem, @nicolo-tellini, thanks for asking! :) |
I was trying to build my own nt index with "make THREADS=16 nt". It looked like it completed without errors, but there are only two nt.*.cf files and nt.2.cf is empty. This is centrifuge-build-nt.log:

What went wrong?
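As a hedged sketch of how one might sanity-check the build output and log (the expected set of nt.*.cf files depends on the Centrifuge version, so treat the file count as illustrative):

```sh
# An empty or missing index file means the build did not finish cleanly.
ls -lh nt.*.cf
# Look for signs of a crash or an OOM kill in the build log.
grep -i -E "error|killed|out of memory" centrifuge-build-nt.log
tail -n 20 centrifuge-build-nt.log
```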