GenomicsDBImport extremely inefficient output #6487
Comments
The Hail team noticed this as well. I think Cotton told me that GDB appeared to be writing 3X more data than he expected. @nalinigans how hard would this be to optimize?
Just to provide some concrete logs of exactly what is happening: I'm using systemtap on an NFS server to track exactly what is going on during these runs. For one second (the first column), for one run, for one file:
Note the following about the above I/O:
The solution I propose is just to have two 64KB buffers and only write 64KB when the first buffer is full, with the overflow going into the second buffer. This would not only eliminate the seeks but also reduce the IOP rate by a factor of 600-700 and change this random workload into a mostly sequential one. I can write example code that outputs 100 bytes at a time into a buffer and then writes 64KB at a time, if that's helpful.
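For illustration, here is a minimal sketch of the double-buffer scheme described above, written in Java (GenomicsDB's actual write path is native code; the class and record sizes here are illustrative assumptions, not part of GATK or GenomicsDB). Small records are appended to a 64KB buffer, a full 64KB chunk is issued to the underlying stream only when the active buffer fills, and any overflow is carried into the second buffer:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the proposed double-buffer scheme; not GATK/GenomicsDB code.
public class ChunkedWriter implements AutoCloseable {
    private static final int CHUNK = 64 * 1024;   // write size: 64KB
    private final OutputStream out;
    private final byte[][] buffers = new byte[2][CHUNK];
    private int active = 0;   // index of the buffer currently being filled
    private int filled = 0;   // bytes used in the active buffer

    public ChunkedWriter(OutputStream out) {
        this.out = out;
    }

    /** Append a small record (e.g. ~100 bytes); flush one full 64KB chunk when the buffer fills. */
    public void append(byte[] record) throws IOException {
        int offset = 0;
        while (offset < record.length) {
            int n = Math.min(CHUNK - filled, record.length - offset);
            System.arraycopy(record, offset, buffers[active], filled, n);
            filled += n;
            offset += n;
            if (filled == CHUNK) {
                out.write(buffers[active], 0, CHUNK);  // one large sequential write
                active = 1 - active;                   // overflow continues in the other buffer
                filled = 0;
            }
        }
    }

    /** Write out whatever remains in the active buffer and flush. */
    @Override
    public void close() throws IOException {
        if (filled > 0) {
            out.write(buffers[active], 0, filled);
            filled = 0;
        }
        out.flush();
    }
}
```

In this single-threaded sketch the second buffer is only the overflow target; the point of keeping two is that a real implementation could hand the full buffer to a background writer while new records continue to land in the other one.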
@mishaploid, I am assuming the 295 to be single-sample VCFs. What do the intervals look like?
@spikebike, thanks for the systemtap output. We do have large internal buffers to help with this type of usage, but will revisit the code to figure out the behavior you are seeing. We also have some experimental optimizations, not yet rolled out, for writing minimally to shared filesystems. Would you be able to run your tests if we create a gatk branch next week with those changes?
@ldgauthier, I think the Hail team used multi-sample VCFs as well. We do have some optimizations (work in progress) for importing multi-sample VCFs that will get rolled out in the next GenomicsDB release.
Thanks for the quick reply @nalinigans! The 295 does refer to individual sample VCFs. The intervals are small, at about 2.6 Mb each (example file below). I was originally running by chromosome, but the time to complete GenomicsDBImport was intractable. Example interval file:
@nalinigans the particular Hail experiment I'm thinking of was importing three single-sample VCFs, although it was many versions ago (just after the fix for the absolute path requirement).
@nalinigans Sure, we would be happy to run any code you suggest. Maybe your internal buffers are only for reads? We are definitely seeing not only tiny writes, but tiny writes that overwrite previous writes; only 1/40th or so of the writes end up in the final files. SSDs don't particularly care about sequential vs. random writes, but our spinning disks do. The user's workload was resulting in 12,000 I/O operations per second for hours.
@spikebike, just started looking at this issue again. We are benchmarking operations with NFS and will put out an optimized library soon. Note that GenomicsDB does use filesystem locking to allow for simultaneous reads/writes.
(related to Zendesk ticket #5153)
Hi @nalinigans, sorry I missed your recent post. Just adding that I haven't tried
@mishaploid and @spikebike, 4.1.8.0 has a new option -
Bug Report
Affected tool(s) or class(es)
GenomicsDBImport
Affected version(s)
Description
We are running GenomicsDBImport on an HPC cluster using SLURM, and the admin mentioned that the jobs are writing inefficiently to shared storage (@spikebike will follow up with HPC specifics and logs).
Steps to reproduce
Expected behavior
My understanding is that it would be more efficient to accumulate output in a small buffer and write the final database contents only once.
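As a rough illustration of what I mean (a hypothetical Java example using the standard BufferedOutputStream, not the tool's actual write path), wrapping the output in a fixed 64KB buffer coalesces many ~100-byte appends into a small number of large writes, each issued exactly once:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: many small appends buffered into few 64KB writes.
public class BufferedAppendDemo {
    public static void main(String[] args) throws IOException {
        byte[] record = "hypothetical ~100-byte record .............".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("example_fragment.bin"), 64 * 1024)) {
            for (int i = 0; i < 10_000; i++) {
                out.write(record);  // stays in memory until the 64KB buffer fills
            }
        }  // close() flushes the final partial buffer once
    }
}
```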
Actual behavior
Again, my (limited) understanding is that the tool is writing output multiple times and throwing out all but the last write. Here is an example of a log for a 2.6 Mb region and 295 samples:
And the SLURM log: