
nt index incomplete build #275

Open
igordot opened this issue Apr 2, 2024 · 21 comments

igordot commented Apr 2, 2024

I was trying to build my own nt index with make THREADS=16 nt. It looked like it completed without errors, but there are only two nt.*.cf files and nt.2.cf is empty.
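
(A quick way to sanity-check the build output, assuming the usual *.1.cf/*.2.cf/*.3.cf three-file layout that the released Centrifuge indexes use:)

# List the index files the build produced; an empty or missing .cf file
# suggests the build stopped partway through, even if make appeared to
# exit cleanly.
ls -lh tmp_nt/nt.*.cf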

This is centrifuge-build-nt.log:

Settings:
  Output files: "tmp_nt/nt.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 14
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  nt-dusted.fna
Reading reference sizes
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
...
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
  Time reading reference sizes: 01:37:54
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences

What went wrong?

mourisl commented Apr 2, 2024

How much memory do you have on your machine? I guess the job was killed due to an out-of-memory issue.
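
One way to confirm an OOM kill (a sketch: dmesg needs access to the kernel log on the build node, and the sacct line only applies if the job ran under SLURM; JOBID is a placeholder):

# Look for OOM-killer messages in the kernel log:
dmesg -T | grep -i 'out of memory\|killed process'
# Under SLURM, job accounting records OOM kills too:
sacct -j JOBID -o JobID,State,MaxRSS,ReqMem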

igordot commented Apr 2, 2024

I used 150G. What would be a reasonable amount?

It did not give any kind of memory error.

mourisl commented Apr 2, 2024

150G is not enough for the nt database. I don't remember the total size of the current nt, but I think you may need a machine with more than 1TB of memory to build the nt index.

igordot commented Apr 2, 2024

nt.fna is 1.5TB. Should I provide more memory than that?

mourisl commented Apr 2, 2024

You may need about 3TB of memory for that, then.

igordot commented Apr 2, 2024

If I don't have that much, are there any alternatives?

mourisl commented Apr 2, 2024

How much memory do you have?

igordot commented Apr 4, 2024

Theoretically 1.5TB, but it's a shared environment, and it's unclear how long I would have to wait to actually get that much.

mourisl commented Apr 4, 2024

If you have the files ready, you may try Centrifuger (https://github.com/mourisl/centrifuger). Its centrifuger-build program has a "--build-mem" option; something like "--build-mem 1200G" (or slightly more) will try to find parameters that keep memory usage roughly within the given budget. If you want to try Centrifuger, please get the package with "git clone": I recently sped up index building, and the updated code will be in the next formal release. You can also use "-t 16" for parallelization.
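
Roughly like this (a sketch only; the taxonomy and mapping file names are placeholders, and the exact options should be checked against the centrifuger README):

git clone https://github.com/mourisl/centrifuger
cd centrifuger && make
# nodes.dmp, names.dmp and seqid_to_taxid.map stand in for your NCBI
# taxonomy dump and sequence-to-taxid mapping:
./centrifuger-build -r nt.fna \
    --conversion-table seqid_to_taxid.map \
    --taxonomy-tree nodes.dmp --name-table names.dmp \
    -o nt -t 16 --build-mem 1200G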

igordot commented Apr 4, 2024

Is the Centrifuger index compatible with Centrifuge?

mourisl commented Apr 5, 2024

No, the underlying data structure is quite different, so the index is not compatible.

igordot commented Apr 9, 2024

It was not able to finish in 5 days with 500G. Does that seem reasonable?

mourisl commented Apr 9, 2024

500G might not be enough in the end. For 1.5T of sequence, storing the raw sequences takes about 400G of space, and representing the BWT can take another 400G, which is well over the memory allocation. I think much of the time is being spent on memory page swapping. I would still recommend allocating as much memory as possible.
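
As a back-of-the-envelope check of those figures (assuming the builder packs DNA at 2 bits per base):

# 1.5T bases at 2 bits per base, expressed in GB:
echo $(( 1500 * 2 / 8 ))   # => 375, roughly the ~400G quoted above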

igordot commented May 8, 2024

I tried Centrifuger 1.0.2 with 1400G of memory and 16 threads. In 15 days, it was able to extract 368/86745 chunks, so it is still far from complete. Probably a lot more memory is needed for this to be feasible in a reasonable amount of time.

mourisl commented May 8, 2024

Thank you for the update. Indeed, this is too slow; speeding this up is one of the next goals. How long does it take to process one batch (16 chunks)? What are the inferred "Estimated block size" and "dcv"?

igordot commented May 8, 2024

This is what the last batch looks like:

[Tue May  7 20:21:06 2024] Postprocess 16 chunks.
[Tue May  7 20:26:58 2024] Extract 16 chunks. (352/86745 chunks finished)                                                                  
[Tue May  7 20:26:58 2024] Wait for the chunk extraction to finish.
[Tue May  7 22:39:48 2024] Submit 16 chunks.
[Tue May  7 22:39:48 2024] Chunk 0 elements: 16800433
[Tue May  7 22:39:48 2024] Chunk 1 elements: 16819322
[Tue May  7 22:39:48 2024] Chunk 2 elements: 16771855
[Tue May  7 22:39:48 2024] Chunk 3 elements: 16793011
[Tue May  7 22:39:48 2024] Chunk 4 elements: 16777181
[Tue May  7 22:39:48 2024] Chunk 5 elements: 16728464
[Tue May  7 22:39:48 2024] Chunk 6 elements: 16810964
[Tue May  7 22:39:48 2024] Chunk 7 elements: 16769117
[Tue May  7 22:39:48 2024] Chunk 8 elements: 16782750
[Tue May  7 22:39:48 2024] Chunk 9 elements: 16755439
[Tue May  7 22:39:48 2024] Chunk 10 elements: 16778579
[Tue May  7 22:39:48 2024] Chunk 11 elements: 16777549
[Tue May  7 22:39:48 2024] Chunk 12 elements: 16811242
[Tue May  7 22:39:48 2024] Chunk 13 elements: 16760250
[Tue May  7 22:39:48 2024] Chunk 14 elements: 16764790
[Tue May  7 22:39:48 2024] Chunk 15 elements: 16777083
[Tue May  7 22:39:48 2024] Wait for the chunk sort to finish.
[Tue May  7 23:47:24 2024] Postprocess 16 chunks.
[Tue May  7 23:55:30 2024] Extract 16 chunks. (368/86745 chunks finished)                                                                  
[Tue May  7 23:55:30 2024] Wait for the chunk extraction to finish.
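
For scale, a rough extrapolation from these timestamps (assuming the ~3.5 hours per 16-chunk batch seen here holds throughout):

# One batch of 16 chunks took ~3.5 h (20:26:58 -> 23:55:30); there are
# 86745 chunks in total:
echo $(( 86745 / 16 ))              # ~5421 batches
echo $(( 86745 / 16 * 7 / 2 / 24 )) # ~790 days at 3.5 h per batch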

khyox commented Jun 15, 2024

FYI, our pre-print accompanying the release of a new Centrifuge nt database is online now: Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification. Any feedback will be welcome!

nicolo-tellini commented Jun 16, 2024

Hi @khyox,

I tried to download the db using wget, but it stopped at 071 compressed files. Into how many files is the db divided? Is there a more appropriate way to download it, so that it can recover from where it stopped?

Best,

nic

khyox commented Jun 16, 2024

Hi @nicolo-tellini,

You should have them all then! We added the following line to the Data availability section of the manuscript to clarify that:

To ease the download process, the database is split in 71 ultra-compressed 7z files of 4 GiB or less with name format nt_wntr23_filt.cf.7z.*

Let me know if you aren't seeing that line in the version of the pre-print that you're working with.

The idea of splitting into 4 GiB files is that it should be easy to recover from a failure without losing much: you keep all the fully downloaded files, discard the last partially downloaded one, and resume the download from that file onward.
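
In practice the recovery can be even simpler (a sketch; the URL is a placeholder for wherever the parts are hosted):

# wget -c resumes a partially downloaded part instead of restarting it:
wget -c https://example.org/nt/nt_wntr23_filt.cf.7z.042
# To unpack, point 7z at the first volume; it finds the rest automatically:
7z x nt_wntr23_filt.cf.7z.001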

nicolo-tellini commented Jun 16, 2024

Hi @khyox,

I see, thanks. Sorry, my bad: I am not familiar with 7z.

khyox commented Jun 16, 2024

No problem, @nicolo-tellini, thanks for asking! :)
