GenotypeGVCFs/GenomicsDB and Cannot allocate memory error #7674
@bbimber, did you use …?
@nalinigans We almost certainly used the prior version for GenomicsDBImport (4.2.3.0 or 4.2.4.0). Is that a problem? That file is ~66 MB. I did a quick 'zcat | wc -l' and there was no error (no gzip corruption or anything like that). If you want the file, I can arrange to share it.
There was an issue with the compressed write of book-keeping files. The import would seem to have succeeded, but the issue would sometimes show up during queries. This was fixed in a later release. Yes please, can you arrange to share the book-keeping file?
Another question: did …?
@nalinigans We have a very large WGS dataset (<2K subjects) that we incrementally add to over time. After each new batch of data is added, we typically run GenotypeGVCFs. We have run GenotypeGVCFs on prior iterations of this GenomicsDB workspace; however, we have never run it on this particular workspace after the addition of the new samples. You can get those files here: https://prime-seq.ohsu.edu/_webdav/Labs/Bimber/Collaborations/GATK/%40files/Issue7674/

I think I was mistaken above. According to the job logs, we did run GATK GenomicsDBImport v4.2.5.0 when we did our last append; however, prior append operations would have used earlier GATK versions. I believe we have 79 fragments. We rarely use --consolidate, primarily because those jobs essentially never finish.

We've had a lot of issues getting GATK/GenomicsDB to run effectively on this sample set, and we have settled on doing the GenomicsDBImport/append operation with a moderate batch size. I realize newer GATK/GenomicsDB versions have been addressing performance, and it is possible we should re-evaluate this.
Thanks @bbimber. I just tested with the files you shared and the book-keeping file uncompresses and loads fine, so no problem on that count. As you have noticed, consolidation is very resource-intensive in GenomicsDBImport. When fragments are not consolidated on disk during import, they get consolidated in memory during queries. In this case, it is possible that with 79 fragments, with 10 MB (currently hardcoded) used per fragment per attribute just for consolidation, and with memory fragmentation and other internal buffers on top, we ran out of memory. Possible solutions:
@mlathara, any other ideas?
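For a sense of scale, here is a rough back-of-envelope estimate of that in-memory consolidation cost; the fragment count and the 10 MB per-fragment figure come from the comment above, while the attribute count is only an illustrative assumption.

```bash
# Rough estimate of the memory needed just for in-memory consolidation buffers.
# FRAGMENTS and MB_PER_FRAGMENT come from the discussion above; ATTRIBUTES is an
# assumed, illustrative count (the real number depends on the annotations stored).
FRAGMENTS=79
ATTRIBUTES=10
MB_PER_FRAGMENT=10
echo "~$(( FRAGMENTS * ATTRIBUTES * MB_PER_FRAGMENT )) MB for consolidation buffers alone"
```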
@nalinigans We actually already leave a 60G buffer between what we request for the cluster job and what is given to GATK's -Xmx.
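For reference, the memory split being described looks roughly like the sketch below; the 178g heap and 60G head-room figures appear elsewhere in this thread, while the sbatch syntax, reference, interval list, and paths are placeholders.

```bash
# Illustrative only: cluster request = JVM heap + head-room for GenomicsDB's
# native (C/C++) layer. The 178g / 60G figures are the ones quoted in this thread;
# everything else is a placeholder.
sbatch --mem=238G --wrap='
  gatk --java-options "-Xmx178g" GenotypeGVCFs \
    -R reference.fasta \
    -V gendb:///path/to/genomicsdb_workspace \
    -L some_interval.list \
    -O output.vcf.gz
'
```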
Yes, 60G should be sufficient. I will run a native consolidation with some data we have for profiling and see what I come up with. Also, what OS and version are you running on?
@nalinigans Can we run consolidate on an existing workspace without appending new samples? Our OS is CentOS 7, and this is on a Lustre filesystem.
@nalinigans We are getting another type of error now. I should add that most of these jobs (we run them per-chromosome) import 4/5 batches and then die with no error message. Once in a while one prints an error, and this is one example:
Does that suggest any troubleshooting steps? The full command is:
This is GATK v4.2.5.0. Thanks in advance for any ideas.
@bbimber, this is another manifestation of running out of memory …
@bbimber, thanks to @mlathara and @kgururaj, here is a suggestion. With the genomicsdb-segment-size argument …
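The exact command from this suggestion was not preserved. Based on the surrounding comments (a batched append, --consolidate, and the genomicsdb-segment-size setting mentioned below), an invocation along these lines seems to be what was intended; the paths, batch size, and segment-size value are illustrative assumptions.

```bash
# Sketch only -- paths and values are placeholders inferred from this thread.
# Intervals are omitted: an incremental import appends to the intervals already
# defined in the workspace (the jobs in this thread were run per chromosome).
gatk --java-options "-Xmx178g" GenomicsDBImport \
  --genomicsdb-update-workspace-path /path/to/existing_workspace \
  --sample-name-map new_samples.sample_map \
  --batch-size 50 \
  --genomicsdb-segment-size 32768 \
  --consolidate
```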
Thanks, we'll try this.
An update on this: I restarted with genomicsdb-segment-size as you suggested. Here's the timing thus far:
So it progressed through the first 4 batches, which is what it did originally. Previously, it would always die after logging this line. It has been at this point for ~24 hours, but at least it appears to still be running. Do you expect it to log 'done importing batch 5/5'? Any idea what GenomicsDB is doing in this phase?
Yes, we do expect a 'done importing batch 5/5' message. Is the import finished at this point? If not, would it be possible to run a utility like top on the node to check memory usage while it is running?
@nalinigans No, not finished but also not dead. This is all running on a Slurm cluster. I will see about connecting to the node. I've never done this, but I know it's possible.
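Monitoring memory on the node can be done with standard Slurm/Linux tooling; the job ID and node name below are placeholders.

```bash
# Placeholder job ID / node name; these are standard Slurm/Linux tools.
sstat -j 1234567 --format=JobID,MaxRSS,AveRSS   # current memory footprint of the running job

squeue -j 1234567 -o "%N"   # find the node the job is running on
ssh node123                 # connect to that node, then watch the process directly:
top -u "$USER"              # or: free -h
```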
@nalinigans I'm afraid we're back to failing. You can see from the timestamps the time between batch 4 and the failure:
@bbimber, sorry that the import with consolidate did not complete. If you are amenable to using a native tool, please download the tool from here for consolidation. This executable will consolidate a given array in a GenomicsDB workspace; it has been instrumented to output memory stats to help tune the segment size. Note that the executable is built for CentOS 7; if you find any unresolved shared library dependencies during usage, please let me know and I will work on getting another one to you. For usage from a bash shell:
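The usage example itself did not survive extraction. Checking for unresolved shared libraries, as mentioned above, can be done with ldd; the invocation below is only a hypothetical sketch, since the tool's workspace/array argument names are not recorded in this thread (only --segment-size is referenced later).

```bash
# Check for unresolved shared library dependencies (as mentioned above):
ldd ./consolidate_genomicsdb_array

# Hypothetical invocation -- the workspace/array argument names are assumptions;
# --segment-size is a flag referenced later in this thread.
./consolidate_genomicsdb_array \
  --workspace /path/to/genomicsdb_workspace \
  --array '1$1$223616942' \
  --segment-size 32768
```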
@nalinigans We really appreciate your help and suggestions on this. I'll try consolidate_genomicsdb_array. One question: does this modify the workspace in-place? If so, I assume we should clone the workspace, similar to the recommendation for GenomicsDBImport/Append?
Yes, the consolidation is done in-place, and you can clone the workspace before trying the consolidate.
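Cloning the workspace ahead of an in-place consolidation is just a recursive copy; the paths below are placeholders.

```bash
# Placeholder paths; a recursive copy (or rsync) preserves the workspace layout.
cp -a /path/to/workspace /path/to/workspace_backup
# or, resumable on a workspace of this size (~754G):
rsync -a /path/to/workspace/ /path/to/workspace_backup/
```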
@nalinigans OK, I just added it to our code to optionally run this consolidate tool prior to running GenomicsDBImport. I'll start a trial later this morning.
@nalinigans OK, so most of these jobs are still going (we run per contig); however, one just died as follows:
Is there any information from this, or information I could gather, that would be helpful here?
@nalinigans Another update. I've been running the standalone consolidate tool, per chromosome. Below is chr 9. As you can see, it seems to take nearly a full day per attribute, and chr 9 is among the smaller contigs. In contrast, chr 1 has been stuck on the first attribute (END) for ~4 days at this point. I'm not sure if this was the right choice, but you will see this run included "--segment-size 32768", based on the conversation above.
And then chr 1 (like most of the larger chromosomes) is still stuck on the first attribute:
@bbimber, we are investigating some scalable solutions for you. Meanwhile, can you provide the following information?
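The list of questions was not preserved, but from the reply below it covered at least the memory requested for these jobs and the on-disk size of the workspace, both of which can be gathered with standard tools:

```bash
# Placeholder workspace path and job ID.
du -sh /path/to/genomicsdb_workspace                  # on-disk size of the workspace
sacct -j 1234567 --format=JobID,ReqMem,MaxRSS,State   # memory requested vs. used for a finished job
```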
@nalinigans Thanks for following up. We requested 256 GB of RAM on these jobs. The size of that folder is 754G.
@nalinigans I don't know if this adds a lot of new information, but below is another example consolidate run. This is chr 7, and you can see the duration per step from the timestamps. This still died after progressing through several annotations:
@nalinigans With the GenomicsDB workspace we have, in its current state, pretty much all attempts to run consolidate ultimately die. Since we have the computational capacity, we're trying to iteratively remake it. I am wondering if you can comment on or confirm this theory:
I am assuming that because we consolidate constantly, even as the workspace grows, this ongoing cleanup reduces the total work for each subsequent consolidate. Is that true? Because the standalone GenomicsDB consolidate tool seems better able to run than GATK with the --consolidate argument (which tends to essentially stall in our hands), we've begun to remake the ~2000-WGS-sample workspace, consolidating after each iteration of new samples.
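The iterative remake described above amounts to alternating append and consolidate steps. A rough sketch, using placeholder paths and the same assumed consolidate-tool arguments as earlier, might look like this:

```bash
# Sketch of the iterative workflow described above (placeholder paths/values).
# For each new batch of samples: append with GenomicsDBImport, then run the
# standalone consolidate tool so the fragment count stays small.
for BATCH in batch_*.sample_map; do
  gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path /path/to/workspace \
    --sample-name-map "$BATCH" \
    --batch-size 50

  # Assumed argument names, as earlier; repeat for each per-chromosome array.
  ./consolidate_genomicsdb_array \
    --workspace /path/to/workspace \
    --array '1$1$223616942' \
    --segment-size 32768
done
```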
@bbimber, your approach should mostly work; this is exactly what I am going to allow with the standalone tool, with a new arg for batch-wise consolidation.
@nalinigans This would be much appreciated. We're perfectly happy to try out draft versions as well.
@bbimber, I have placed another version of consolidate_genomicsdb_array here. This allows for batch-wise consolidation with the --batch-size argument. Please do let me know the total size of all the …
@nalinigans Thank you very much! Do you have any guidance on what a reasonable batch size might be?
You can start by consolidating all of them at once (with no batch size), as your fragments are more or less the same size and it should perform OK with the new tool. You can even use the default buffer size, as it should not be a factor with the new tool. If that does not work, you can probably use …
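For reference, the two knobs discussed here (--batch-size and --buffer-size, both named later in the thread) would be passed to the new tool roughly as below; the other arguments remain the assumed placeholders used earlier, and the values shown are examples only.

```bash
# First attempt: all fragments at once, default buffer size (assumed argument names).
./consolidate_genomicsdb_array \
  --workspace /path/to/workspace \
  --array '1$1$223616942'

# Fallback if that runs out of memory: consolidate in smaller batches and/or use a
# smaller buffer (example values only).
./consolidate_genomicsdb_array \
  --workspace /path/to/workspace \
  --array '1$1$223616942' \
  --batch-size 8 \
  --buffer-size 10240
```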
@nalinigans Mixed results so far. I'm running the new consolidate tool, per chromosome as before. I ran it with defaults (no custom arguments). It is running longer than previously, but chr 1, the largest, died after consolidating 2 attributes. This job had 248G of RAM allocated. Are there optimizations you'd suggest?
@bbimber, what is the average size of your …?
@nalinigans This iteration is on a smaller input (~600 WGS samples). Based on the info below, do you suggest changing --buffer-size or --batch-size?

To your questions: in chromosome 1's folder (1$1$223616942), there are 26 fragments (the GUID-named folders). The sizes of the book_keeping files are: 168M. The last job failed with an OOM error (the job requested 256 GB and the Slurm controller killed it). This is the command and output (with timestamps):
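The command and its output did not survive extraction. For completeness, the fragment count and book_keeping sizes quoted above could be gathered along these lines; the array path and the book_keeping file-name pattern are assumptions.

```bash
# The array path and the book_keeping file-name pattern are assumptions.
ARRAY='/path/to/workspace/1$1$223616942'

# Count the GUID-named fragment folders:
find "$ARRAY" -mindepth 1 -maxdepth 1 -type d | wc -l

# Report the sizes of the book_keeping files:
find "$ARRAY" -name 'book_keeping*' -exec du -h {} +
```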
@nalinigans Thanks for the help on this. I wanted to let you know that even with the standalone tool and improvements, we've basically given up on trying to aggregate whole-genome GenomicsDB workspaces with more than ~1000 samples. We can get virtually all the chromosomes to consolidate, but the largest (chr 1) can run for 10 days and eventually dies. It's just too unwieldy to be practical.

My current idea is to stage our data into more reasonable workspaces of ~500 samples. These have the benefit of allowing existing data to stay static, and we just keep appending to/making new workspaces as we get more samples. Since GenotypeGVCFs only allows one input, the plan is:
The latter is currently being tested.
Hello,
We're trying to run GenotypeGVCFs on a large GenomicsDB workspace. The command is below, with the output. We run these jobs scatter/gather, with each job getting a defined, small interval set. Despite being given a huge amount of RAM (testing >250 GB), virtually all of the jobs die without any messages right after the 'Starting traversal' message. A few gave error messages like the one below.
In this example, you'll see it's running with -Xmx178g. We added 60G to the cluster memory request to leave a buffer for the C layer. We're on v4.2.5.0.
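The command and output referenced above were not carried over into this text. A representative GenotypeGVCFs invocation matching the description (a gendb:// workspace, a small scattered interval set, -Xmx178g) would look roughly like this, with all paths as placeholders:

```bash
# Illustrative reconstruction -- reference, interval list, and paths are placeholders.
gatk --java-options "-Xmx178g" GenotypeGVCFs \
  -R reference.fasta \
  -V gendb:///path/to/genomicsdb_workspace \
  -L scatter_0001.interval_list \
  -O scatter_0001.genotyped.vcf.gz
```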
Does this error look familiar, and/or do you have any troubleshooting suggestions? Thanks in advance.