Extremely high (>250GB) memory usage from GenotypeGVCFs w/ moderate-sized GenomicsDb input #7968
Joint genotyping runs on the JVM and does require sufficient RAM to complete, unlike …
@nalinigans Yes, it's been surprising me quite a bit too. When you say 'can you run SelectVariants', do you mean simply trying to select from the source GenomicsDB workspace as a test to see whether Java has enough resources? I can try this.
Does your pipeline reblock the gVCFs before merging into GenomicsDB? I found this helped quite a bit with memory issues. Reblocking decreases the size of the gVCF quite a bit, along with the memory required for downstream processing.
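A minimal sketch of what "reblock" refers to here: running GATK's ReblockGVCF on each input gVCF before import. The file names and the job-list approach are hypothetical; only the tool name and `-R`/`-V`/`-O` flags are from the GATK toolset.

```shell
# Generate one ReblockGVCF command per input gVCF (sample file names are
# placeholders; in practice this list would come from your pipeline).
REF=ref.fasta
JOBS=$(mktemp)
for g in sample1.g.vcf.gz sample2.g.vcf.gz; do
  # Strip the .g.vcf.gz suffix to build the output name.
  echo "gatk ReblockGVCF -R $REF -V $g -O ${g%.g.vcf.gz}.reblocked.g.vcf.gz"
done > "$JOBS"
cat "$JOBS"
```

Each generated line would then be submitted as its own cluster job; the reblocked outputs replace the originals as GenomicsDBImport inputs.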
That's an interesting idea. The process of aggregating 2,000 per-sample gVCFs into workspaces (we can't get these to successfully combine into one workspace) is a lot of computation just by itself. Is there a reblock that can be executed on the already-combined GenomicsDB workspace?
@nalinigans, to your question about SelectVariants: it was better than GenotypeGVCFs. It worked once, but died with memory errors (killed by our slurm scheduler) the second time. It is also painfully slow. I ran a basic SelectVariants using the workspace with 500 samples. This workspace was processed with the standalone consolidate tool. It's running on an interval set of only ~2M sites. The output looks like this:
So you'll see it's progressing, but at ~38 variants/min if I read this right. A few other things to note:
@bbimber, not sure what is happening here - the total of the GenomicsDB timers is …
@nalinigans I've used jprofiler locally for profiling, but in this instance I'd need to execute it remotely on the cluster and save the result to a file for local inspection. Is there a tool you'd recommend for this? Note: it seems like I might be able to use the built-in Java one? Something like:
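For reference, this is roughly what the built-in JDK option looks like: Java Flight Recorder can be started at JVM launch and dumped to a file for later local inspection. Whether the GATK wrapper passes `JAVA_TOOL_OPTIONS` through is an assumption here; the flag itself is standard JDK.

```shell
# Start a JFR recording at JVM startup; the recording file can be copied
# back from the cluster and opened locally (e.g. in JDK Mission Control).
export JAVA_TOOL_OPTIONS="-XX:StartFlightRecording=duration=30m,filename=gvcf.jfr"
# ...then run gatk GenotypeGVCFs as usual on the cluster node.
# Alternatively, attach to an already-running job by its pid:
#   jcmd <pid> JFR.start duration=30m filename=gvcf.jfr
echo "$JAVA_TOOL_OPTIONS"
```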
@droazen, @lbergelson, any pointers on remote Java profiling for @bbimber?
@nalinigans on a related perf question: there are posts about workspaces with lots of small contigs being a problem, and recommendations about creating multiple workspaces where each has one contig or a subset of contigs. Can you say any more about where that overhead comes from? Given that we have an existing multi-contig workspace, and aggregating this many samples into a workspace is a pretty big task, are there any ways to separate the existing workspace into a bunch of single-contig workspaces? The only metadata I see referring to contigs is vidmap.json. For example, subsetting a workspace could be something simple like this:
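A hypothetical sketch of the subsetting I have in mind, based on the on-disk layout a workspace shows (the shared metadata files plus one array folder per contig, as listed later in this thread). The toy paths stand in for the real aggregated workspace, and vidmap.json is copied unmodified, which is an untested assumption.

```shell
# Toy source/destination workspaces for illustration.
SRC=$(mktemp -d); DST=$(mktemp -d)
mkdir "$SRC"/'20$1$77137495'
touch "$SRC"/__tiledb_workspace.tdb "$SRC"/callset.json \
      "$SRC"/vidmap.json "$SRC"/vcfheader.vcf \
      "$SRC"/'20$1$77137495'/PL_var.tdb
# Copy the shared metadata plus only the chr20 array folder:
cp "$SRC"/__tiledb_workspace.tdb "$SRC"/callset.json \
   "$SRC"/vidmap.json "$SRC"/vcfheader.vcf "$DST"/
cp -r "$SRC"/'20$1$77137495' "$DST"/
ls "$DST"
```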
Using this subset workspace seems to execute just fine as an input for SelectVariants.
@nalinigans I was looking over the contents of this workspace and thought I'd pass along a couple observations. This is focusing on chromosome 20. This workspace has 573 WGS samples. When I inspect the contents of 20$1$77137495, there is one sub-folder with a GUID-based name. This makes sense because we previously ran the standalone consolidate tool on it. Within this folder, a lot of the .tdb files are 10GB or larger. The biggest is PL_var.tdb (34GB); END is 15GB, GQ is 15GB, etc. I don't really know how GenomicsDB/GATK handles reads, but do those sizes stand out to you?
I've never used these, but https://github.com/jvm-profiling-tools could potentially be a source for Java profilers. From reading a bit about hprof, it seems to add a lot of overhead and has questionable accuracy. About workspaces with lots of contigs/smaller contigs -- the performance issue there is mostly during import. In your experiment above to subset the workspace, did the subsetted workspace return faster for SelectVariants? Or use less memory? I'd be a bit surprised if so, since your query is restricted to just that single array anyway. Regarding @jjfarrell's suggestion of ReblockGVCFs -- I can't speak to any loss of precision there, but I would be curious to see if you could run some of your input gVCFs through it, just to see how much smaller they get.
@mlathara I'm not able to get this workspace to work at all for any of the permutations. GATK more or less dies with any operation that tries to use it. Again, one difference is that this is the first time I've used the standalone consolidate tool from @nalinigans. I wonder if that has actually backfired from the perspective of reading the workspace for GenotypeGVCFs or SelectVariants?
OK -- got it. I wasn't sure if the below comment was implying any better performance.
Sounds like you were just saying it worked, which is expected. As I said, I wouldn't expect the query to work any faster with just a single contig. I think some sort of profiling run, and trying ReblockGVCFs on a few example inputs, are probably the best next steps.
@mlathara Apologies, my comment was confusing. See replies below:
I gotta be honest, I'm pretty close to abandoning GenomicsDB and looking at other solutions.
To be clear: I really appreciate the help from the GATK and GenomicsDB teams. There have just been a lot of problems and issues trying to make GenomicsDB work for this dataset.
Would it still be possible to try the profiler with this workspace for some insight?
@nalinigans I will see how feasible that is on our cluster. Another question: I'm still baffled at the sort of issues we keep having if GenomicsDB is really used that widely. I have been viewing the aggregated workspace as a semi-permanent store (more like a database). Rather than that, do most users just make the workspace on the fly, use it immediately, and then discard it?
I was thinking overnight about this, and I'm wondering if we should simply drop the idea of even trying to make workspaces with whole chromosomes. I think we could scatter 1000 jobs for the genome, give each a coordinate set, import the 2000 gVCFs into a workspace of only ~2M sites or so, do GenotypeGVCFs, discard that workspace, and then merge all those VCFs. I thought I read guidance in the past that the GenomicsDB workspace needed to import intact contigs. However, if the only downstream application is to run GenotypeGVCFs on the same targeted region, is there any reason that wouldn't work? I would hope that running GenomicsDBImport with -L would import any gVCF variant overlapping that interval, and therefore I don't think subsetting to a partial chromosome would matter. Any comments on this would be appreciated.
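The scatter idea above can be sketched as follows. Only the interval-list generation actually runs here; the per-chunk GATK invocations are shown as comments, with the chromosome name and length taken from this thread's chr20 example and an arbitrary 2 Mbp chunk size.

```shell
# Split one contig into fixed-size scatter intervals.
CHROM=20; LEN=77137495; CHUNK=2000000
LIST=$(mktemp)
i=1
while [ "$i" -le "$LEN" ]; do
  end=$((i + CHUNK - 1))
  if [ "$end" -gt "$LEN" ]; then end=$LEN; fi
  echo "${CHROM}:${i}-${end}"
  i=$((end + 1))
done > "$LIST"
wc -l < "$LIST"
# Each interval then becomes one throwaway import+genotype job, e.g.:
#   gatk GenomicsDBImport --genomicsdb-workspace-path ws_$n -L $interval -V ...
#   gatk GenotypeGVCFs -R ref.fasta -V gendb://ws_$n -L $interval \
#       --only-output-calls-starting-in-intervals -O chunk_$n.vcf.gz
# followed by a merge (e.g. Picard GatherVcfs) over the per-chunk VCFs.
```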
I'm not sure what proportion of users leverage the incremental import functionality... it wasn't available when GenomicsDBImport was first released, but has been around for ~3 years now. As for workspaces with whole chromosomes -- there is no requirement for, or performance benefit to, using whole chromosomes. As you say, subsetting a chromosome into smaller regions will work and makes the import and query parallelizable. (If you remember where the advice about whole chromosomes came from, let us know; that might be something that needs to be updated/clarified.) Many small contigs do add overhead to import though and, until recently, multiple contigs couldn't be imported together (i.e., each contig would have its own folder under the GenomicsDB workspace, which gets inefficient with many small contigs).
For WGS, probably the best way to create the GenomicsDBImport interval list is to split based on where there are consecutive N's in the reference genome (maybe using Picard) and/or regions that you are blacklisting. I think you suggested that some of the blacklisted regions were especially gnarly - maybe ploidy or high alternate allele count? - depending on the frequency of those, we may save a bit on space/memory requirements. That may address your concern about overlap between variants and import intervals. In general, any variant that starts in a specified import interval will show up in a query to that workspace. I'm not sure if the blacklist regions contain any variants that start within but extend beyond the blacklist -- those may not show up if the regions are split up in this way.
@mlathara One additional observation. I made a single-sample GenomicsDB workspace: I imported one gVCF, using a single interval of ~7M bp. I then tried to run a basic SelectVariants on it. This is the log (note the timestamps):
You'll see it took nearly 10 minutes between when it first logs 'starting traversal' and when it actually begins to traverse and report progress. Is there some massive initialization step required by GenomicsDB? Just to reiterate, the input workspace in this case is tiny: one gVCF sample, importing only ~7M bp on one contig.
That is interesting -- for comparison, @nalinigans had done some SelectVariants runs on our servers with ~500 samples, for an interval with 51M base pairs (these were WES though, fwiw), and the period between starting traversal and the first ProgressMeter output was ~1 min. I'm not sure why your example would take so much longer. How large was this workspace? Can you share it? And would it be any easier to run the profiler with this smaller case... maybe you don't need to submit it as a remote job?
I don't have much/any experience with …
@mlathara and @nalinigans A couple quick updates:
So some open questions:
Anyway, thanks for your continued help on this.
Glad to hear you were able to make progress. We're open to suggestions around improving the tooling for this. For instance, you mentioned wanting to redo samples -- we already have support in GenomicsDB for querying by sample, and we should be able to expose that at the GATK level. As long as you're okay with renaming the sample when you re-generate the gVCFs, that should work. Technically we could expose support to modify existing samples, but that gets a bit hairy because of the way data is retrieved. I'm not sure why the queries for intact chromosomes take so much longer. Since you were able to replicate with a single sample and a ~7M interval, is there any chance you can share just that bit (the workspace, or even better that portion of the gVCF) so we can take a deeper look? To your question about whether GenomicsDBImport includes variants that span the specified import interval: it will definitely include variants that start in those intervals, but it won't always store variants that start before the import interval. For deletions, we have some special handling for variants that start before the interval - they should show up represented by the star allele - but I don't think this is the case for insertions starting before the import interval.
@mlathara Thanks for the reply. To add to that: The use case around adding/removing/replacing samples can include any of:
I'll look into sharing the workspace, but it's quite large. As far as spanning the genotyping interval: my thinking is that the gVCF can potentially have large blocks, where the start might be far upstream of the genotyping interval but the end is within the interval. When one does SelectVariants with -L, I am pretty sure (but need to double-check) that any spanning variant would get included in the output. A less optimal but probably effective approach might be to run SelectVariants on the input gVCFs with the interval(s), and then import these subset gVCFs (which ought to contain any spanning variants, not just variants starting within the interval). Do you have thoughts on how this should operate from the GenomicsDBImport perspective? Again, while I appreciate the rationale around special-casing your intervals, it seems like including those overlapping records also gets at this problem?
Regarding GenomicsDBImport with intervals, here is a quick test. Again, the thing I'm trying to evaluate is whether it matters how I chunk the genome for GenomicsDBImport->GenotypeGVCFs. Downstream of this, I would pass the workspace to GenotypeGVCFs with --only-output-calls-starting-in-intervals. The concern is whether we have variants spanning the intervals of two jobs, and whether separating the jobs would impact calls. In this example, GenotypeGVCFs would run over 1:1050-1150. For example, if we had a multi-NT variant that spanned 1148-1152, we'd want that called correctly no matter what intervals were used for the jobs.
I tried running GenomicsDBImport with -L over a small region, or alternatively running SelectVariants on the gVCF first (which behaves a little differently) and then using that subset gVCF as input to GenomicsDBImport, giving GenomicsDBImport the entire contig as the interval. The resulting workspaces will be slightly different, with the latter containing information over a wider region (GenomicsDBImport truncates the start/end of the input records to just the target interval). So if either of these workspaces is passed to GenotypeGVCFs, using --only-output-calls-starting-in-intervals and -L 1:1050-1150:
I think any upstream padding doesn't matter. If you have a multi-nucleotide polymorphism that starts upstream of 1050 but spans 1050, this job wouldn't be responsible for calling it. The prior job, whose interval set is upstream of this one, should call it. I think GenomicsDBImport's behavior is fine here. If you have a multi-NT variant that starts within 1050-1150 but extends outside (i.e., a deletion or insertion starting at 1148), this could be a problem. The GenomicsDB workspace created with the interval 1:1050-1150 lacks the information to score that, right?
The workspace created using the more permissive SelectVariants->GenomicsDBImport approach contains that downstream information and presumably would make the same call as if GenotypeGVCFs was given the intact chromosome as input, right? However, it seems that if I simply create the workspace with a reasonably padded interval (adding 1kb should be more than enough for Illumina, right?), and then run GenotypeGVCFs with the original, unpadded interval, the resulting workspace should contain all available information and GenotypeGVCFs should be able to make the same call as if it were given a whole-chromosome workspace as input. Does that logic seem right?
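The pad-then-genotype arithmetic above can be checked concretely: import with a padded interval, genotype with the unpadded one. The 1 kb pad follows the Illumina read-length reasoning in the text; the gatk commands in comments show the assumed usage (GATK also has an `--interval-padding`/`-ip` option that achieves similar padding).

```shell
# Compute padded import interval and unpadded genotyping interval.
CONTIG=1; START=1050; END=1150; PAD=1000
IMPORT_START=$((START - PAD))
if [ "$IMPORT_START" -lt 1 ]; then IMPORT_START=1; fi   # clamp at contig start
IMPORT_INTERVAL="${CONTIG}:${IMPORT_START}-$((END + PAD))"
CALL_INTERVAL="${CONTIG}:${START}-${END}"
echo "import with -L $IMPORT_INTERVAL, genotype with -L $CALL_INTERVAL"
# gatk GenomicsDBImport ... -L "$IMPORT_INTERVAL"
# gatk GenotypeGVCFs ... -L "$CALL_INTERVAL" --only-output-calls-starting-in-intervals
```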
GenomicsDB does store the END as a separate attribute for the interval, so the information is present even if the GenomicsDB array region does not span that far. The other questions I will leave to @droazen and/or @mlathara to answer. Hopefully you are able to make progress.
Here is a script I ran to import 3,500 un-reblocked gVCFs. The script imports one chromosome per workspace. As the chromosomes get larger, more and more memory is needed. chr4 through chr22 ran fine. The chr3 run (see log below) ends without an error, BUT the callset.json is NOT written out. I could split chr1-3 at the centromere and try again. Any other suggestions? Would increasing -Xmx150g to 240g help? For chromosome 1, which is still running, top indicates it is using about 240g (after importing the 65 batches).
End of the log for chr3:
It never indicates that it imported batch 65/65. There is no error, and the callset.json is missing; for chr4 through chr22 the workspace contents were: __tiledb_workspace.tdb, callset.json, chr4$1$190214555, vcfheader.vcf, vidmap.json.
@bbimber @mlathara Here is a pretty good article on optimizing GenomicsDBImport: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines. There is some advice about handling many small contigs that may be useful. To troubleshoot the high memory usage of my GenomicsDBImport script, I reran it on chr1 to narrow down the source of the problem. These runs are on reblocked gVCFs.
Test 2 ran the fastest with the lowest memory requirements (wall clock 76 hours). The --consolidate option was the culprit. So rerunning chr1-3 with just the --bypass-feature-reader option (test 2) ran fine without using lots of memory. Below is the time output from chr1. The output shows:
Maximum resident set size (kbytes): 2630440
Using GATK jar /share/pkg.7/gatk/4.2.6.1/install/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar defined in environment variable GATK_LOCAL_JAR
So importing reblocked gVCFs with --bypass-feature-reader was the fastest way to import our 3,500 gVCFs and minimize memory.
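To put the winning run's peak-memory number in context: the `time -v` figure above is reported in kbytes, so converting it shows the --bypass-feature-reader configuration peaked at roughly 2.5 GB, versus the ~240 GB observed earlier with --consolidate on chr1.

```shell
# Convert the "Maximum resident set size" from the log above into MB.
RSS_KB=2630440
RSS_MB=$((RSS_KB / 1024))
echo "peak RSS: ${RSS_MB} MB"
# The winning invocation pattern (flags as reported in this thread):
#   gatk GenomicsDBImport ... --bypass-feature-reader \
#       --genomicsdb-workspace-path ws_chr1 -L chr1 ...
```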
@jjfarrell Glad you found that article useful! In general, the only other thing that would help scale here would be to break up your intervals so that larger contigs are split into multiple regions: less memory required, and you can throw more cores at it (if you have them). What sort of performance did you see on …
@jjfarrell and @mlathara: thanks for running these tests, and sorry I haven't been able to do anything yet with our data. I'm under grant deadlines until into October, but I do hope to add to this. A couple comments:
What we're seeing in option 3 is consistent with some kind of problem in GenotypeGVCFs/SelectVariants when reading from GenomicsDB workspaces that have large chromosomes with highly consolidated data. In those workspaces, I was seeing single files larger than 30GB (like PL_var.tdb). I don't know the read pattern of GATK/GenomicsDB, but maybe over-consolidating is deleterious?
I do agree that if the source gVCFs are being remade often, there isn't much use in keeping GenomicsDB as a permanent store. If it is just a few samples here and there, we could add some tooling to ignore and/or rename samples, which should save you a lot of compute. But as you say, with something like reblocking, the whole store effectively needs to be remade.
@bbimber @mlathara @nalinigans See [The sequences of 150,119 genomes in the UK Biobank](https://pubmed.ncbi.nlm.nih.gov/35859178/). On page 69+ of this pdf, they describe the problem and how they cleverly worked around it. It should be noted that running GATK out of the box will cause every job to read the entire index. This explains why chr1 requires more memory than chr22 despite running on the same number of samples: the larger chr1 portion of the .tbi index is the source of the memory problem. The deCODE solution is to limit the reading of the .tbi index to the part that indexes the scattered region.
There is a long pause at the beginning of a GenotypeGVCFs run which I never understood; GATK must be reading all of the samples' gVCF .tbi indexes into memory during that pause. So the reblocking of the gVCFs above reduced the memory footprint by decreasing the .tbi size. deCODE reduced it by chopping up the index so that, for each scattered region, GATK reads only the small subset of the index needed for that region. The combination of reblocking and chopping up the .tbi would help with the memory requirements even more. However, it is clear that GATK's present reading of the full .tbi is not scalable given the memory requirements.
Hi @jjfarrell, thanks for the great explanation. Do you know how to chop the index into scattered regions? I searched the manuals for tabix and bcftools but cannot find a way to do that.
I am running GATK GenotypeGVCFs, v4.2.6.1. I am trying to call genotypes on a GenomicsDB workspace with about 500 WGS samples. Note, this is the macaque MMul10 genome, so it has 2,939 contigs (including unplaced). We've run commands like this quite a lot before, though we periodically do have issues like this. We ran consolidate on this workspace prior to running this (using a standalone tool @nalinigans provided on #7674). As you can see, we ran Java with relatively low RAM but left ~150G for the C++ layer. I'm surprised this isn't good enough.
I'm going to try to interactively inspect this, but the error is from slurm killing my job, not a Java memory error, which I believe means the -Xmx 92G isn't getting exceeded. I could be mistaken though.
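One way to check this: wrap the job in `/usr/bin/time -v` and compare peak RSS against -Xmx. If peak RSS far exceeds the heap cap, the growth is in native (C++) memory, which slurm counts but the JVM limit never sees. The log line below is fabricated for illustration; only the parsing is exercised here.

```shell
XMX_GB=92
# Illustrative line of the kind /usr/bin/time -v prints (not a real run):
line='Maximum resident set size (kbytes): 157286400'
rss_kb=${line##*: }            # strip everything up to the last ": "
rss_gb=$((rss_kb / 1024 / 1024))
echo "peak RSS ${rss_gb} GB vs -Xmx ${XMX_GB} GB"
# After the fact, slurm's accounting shows the same:
#   sacct -j <jobid> --format=MaxRSS,State
```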
You'll also see: 1) I'm using --force-output-intervals, 2) I'm giving it -XL to exclude repetitive regions (and therefore also skip some of the more gnarly and memory-intensive sites), and 3) I'm giving it a fairly small -L interval list (this is split into 750 jobs/genome).
Each job gets about 250K to 800K variants into the data, and then they pretty consistently start to exceed memory and get killed.
Does anyone have suggestions on debugging or troubleshooting steps? Thanks in advance for any help.