Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenotypeGVCFs/GenomicsDB and Cannot allocate memory error #7674

Open
bbimber opened this issue Feb 15, 2022 · 36 comments
Open

GenotypeGVCFs/GenomicsDB and Cannot allocate memory error #7674

bbimber opened this issue Feb 15, 2022 · 36 comments

Comments

@bbimber
Copy link
Contributor

bbimber commented Feb 15, 2022

Hello,

We're trying to run GenotypeGVCFs on a large genomicsDB workspace. The command is below, with the output. We run these jobs scatter/gather, with each job getting a define, small interval set. Despite being given a huge amount of RAM (testing >250GB), virtually all of the jobs die without any messages right after the "Starting traversal' message. A few gave error messages like the one below.

In this example, you'll see it's running with Xmx178g. We added 60G to the cluster memory request to leave buffer for the C layer. We're on v4.2.5.0.

Does this error look familiar, and/or do you have any troubleshooting suggestions? Thanks in advance.

/home/exacloud/gscratch/prime-seq/java/java8/bin/java \
	-Djava.io.tmpdir=<path> \
	-Xmx178g -Xms178g \
	-Xss2m \
	-jar GenomeAnalysisTK4.jar \
	GenotypeGVCFs \
	-R 128_Mmul_10.fasta \
	--variant gendb:///home/exacloud/gscratch/prime-seq/cachedData/16b9ede7-6db8-103a-9262-f8f3fc86a851/WGS_Feb22_1852.gdb \
	-O /home/exacloud/gscratch/prime-seq/workDir/1bb5295c-6ec5-103a-8692-f8f3fc86cd3f/Job1.work/WGS_pre-mGAPv2.3_1852.vcf.gz \
	--annotate-with-num-discovered-alleles \
	-stand-call-conf 30 \
	--max-alternate-alleles 6 \
	--force-output-intervals mmul10.WGS-WXS.whitelist.v2.3.sort.merge.bed \
	-L 1:1-3714165 \
	--only-output-calls-starting-in-intervals \
	--genomicsdb-shared-posixfs-optimizations

12:31:14.647 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/gscratch/prime-seq/bin/GenomeAnalysisTK4.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 15, 2022 12:31:14 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:31:14.783 INFO  GenotypeGVCFs - ------------------------------------------------------------
12:31:14.783 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.5.0
12:31:14.783 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
12:31:14.784 INFO  GenotypeGVCFs - Executing as labkey_submit@exanode-6-25 on Linux v3.10.0-1062.18.1.el7.x86_64 amd64
12:31:14.784 INFO  GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_60-b27
12:31:14.784 INFO  GenotypeGVCFs - Start Date/Time: February 15, 2022 12:31:14 PM PST
12:31:14.784 INFO  GenotypeGVCFs - ------------------------------------------------------------
12:31:14.784 INFO  GenotypeGVCFs - ------------------------------------------------------------
12:31:14.784 INFO  GenotypeGVCFs - HTSJDK Version: 2.24.1
12:31:14.784 INFO  GenotypeGVCFs - Picard Version: 2.25.4
12:31:14.784 INFO  GenotypeGVCFs - Built for Spark Version: 2.4.5
12:31:14.784 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:31:14.784 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:31:14.785 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:31:14.785 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:31:14.785 INFO  GenotypeGVCFs - Deflater: IntelDeflater
12:31:14.785 INFO  GenotypeGVCFs - Inflater: IntelInflater
12:31:14.785 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
12:31:14.785 INFO  GenotypeGVCFs - Requester pays: disabled
12:31:14.785 INFO  GenotypeGVCFs - Initializing engine
12:31:15.766 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
12:31:17.675 info  NativeGenomicsDB - pid=3151 tid=3153 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
12:31:17.675 info  NativeGenomicsDB - pid=3151 tid=3153 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
12:31:17.675 info  NativeGenomicsDB - pid=3151 tid=3153 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
12:31:19.634 INFO  IntervalArgumentCollection - Processing 3714165 bp from intervals
12:31:19.665 INFO  GenotypeGVCFs - Done initializing engine
12:31:19.700 INFO  FeatureManager - Using codec BEDCodec to read file file:///home/groups/pgpdata/ColonyData/207/@files/sequenceOutputs/mmul10.WGS-WXS.whitelist.v2.3.sort.merge.bed
12:34:40.705 INFO  ProgressMeter - Starting traversal
12:34:40.705 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute

and then at: 15 Feb 2022 12:37:38,923:
[TileDB::StorageBuffer] Error: (gzip_read_buffer) Cannot read to buffer; Mem allocation error errno=12(Cannot allocate memory)
[TileDB::StorageBuffer] Error: (read_buffer) Cannot decompress and/or read bytes path=/home/exacloud/gscratch/prime-seq/cachedData/16b9ede7-6db8-103a-9262-f8f3fc86a851/WGS_Feb22_1852.gdb/1$1$223616942/__9b9a9e96-139c-4105-81ec-ab1455d1f01d140490873108224_1597099702436/__book_keeping.tdb.gz errno=12(Cannot allocate memory)
[TileDB::BookKeeping] Error: Cannot load book-keeping; Reading domain size failed.
@nalinigans
Copy link
Collaborator

@bbimber, did you use GenomicsDBImport from v4.2.5.0 as well? How large is your /home/exacloud/gscratch/prime-seq/cachedData/16b9ede7-6db8-103a-9262-f8f3fc86a851/WGS_Feb22_1852.gdb/1$1$223616942/__9b9a9e96-139c-4105-81ec-ab1455d1f01d140490873108224_1597099702436/__book_keeping.tdb.gz file? Anyway, you can share it with us?

@bbimber
Copy link
Contributor Author

bbimber commented Feb 16, 2022

@nalinigans We almost certainly used the prior version for GenomicsDBImport (4.2.3.0 or 4.2.4.0). Is that a problem?

That file is ~66mb. I did a quick 'zcat | wc -l' and there was no error (like a gzip corruption or something). If you want the file I can arrange to share it

@nalinigans
Copy link
Collaborator

nalinigans commented Feb 16, 2022

There was an issue with the compressed write of book-keeping files. The import would seem to have succeeded, but the issue would show up during queries sometimes. This was fixed in 4.2.3.0 - see #7520. If you did use 4.2.3.0, it should not have been a problem.

Yes please, can you arrange to share the _book_keeping.tdb.gz file? Can you also share the __array_schema.tdb file if possible. Also, how many fragments(folders starting with __) do you see in /home/exacloud/gscratch/prime-seq/cachedData/16b9ede7-6db8-103a-9262-f8f3fc86a851/WGS_Feb22_1852.gdb/1$1$223616942, there may be multiple if you did not use the --consolidate option during import?

@nalinigans
Copy link
Collaborator

Another question, did GenotypesGVCFs succeed before with the same workspace?

@bbimber
Copy link
Contributor Author

bbimber commented Feb 16, 2022

@nalinigans We have a very large WGS dataset (<2K subjects). We incrementally add to it over time. After each new batch of data is added, we typically run GenotypesGVCFs. We have run GenotypesGVCFs on prior iterations of this GenomicsDB workspace; however, we have never run it on this particular workspace, after the addition of new samples.

You can get those files here:

https://prime-seq.ohsu.edu/_webdav/Labs/Bimber/Collaborations/GATK/%40files/Issue7674/

I think I was mistaken above. According to the job logs, we did run GATK GenomicsDBImport v4.2.5.0 when we did our last append. However, prior append operations would have used earlier GATK versions.

I believe we have 79 fragments. We rarely do --consolidate, primarily because those jobs essentially never finish. We've had a lot of issues getting GATK/GenomicsDB to run effectively on this sample set. We have settled on doing the GenomicsDBImport/append operation with a moderate batch size. I realize newer GATK/GenomicsDB versions have been addressing performance, and it is possible we should re-evaluate this.

@nalinigans
Copy link
Collaborator

Thanks @bbimber. Just tested with the files you shared and the book keeping file uncompresses and loads fine. So, no problem on that count.

As you have noticed, consolidation is very resource-intensive in GenomicsDBImport. When fragments are not consolidated on-disk during import, they get consolidated in-memory during queries. In this case, it is possible that with 79 fragments and with 10MB(hardcoded currently) being used per fragment per attribute just for consolidation, with memory fragmentation and other internal buffers, we may have run out of memory.

Possible solutions -

  1. Can you run with something less that 178 with java heap options -Xmx178g -Xms178g, so the native process gets a little more?
  2. There is no way to perform consolidate only from gatk currently. You could try importing another set with the --consolidate and possibly the --bypass-feature-reader option.
  3. I am working on a fix for consolidation for Cloud Storage mainly, it may perform a little better generally and can point to a gatk branch when created if you are interested.

@mlathara, any other ideas?

@bbimber
Copy link
Contributor Author

bbimber commented Feb 16, 2022

@nalinigans We actually already leave a 60G buffer between what we request for the cluster job and what is given to GATK's -Xmx.

@nalinigans
Copy link
Collaborator

nalinigans commented Feb 16, 2022

Yes, 60G should be sufficient. I will run a native consolidation with some data we have for profiling and see what I come with. Also, what OS and version are you running on?

@bbimber
Copy link
Contributor Author

bbimber commented Feb 18, 2022

@nalinigans Can we run consolidate on an existing workspace without appending new samples?

Our OS is CentOS 7, and this is on a lustre filesystem.

@bbimber
Copy link
Contributor Author

bbimber commented Feb 25, 2022

@nalinigans We are getting another type of error now. I should add that most of these jobs (we run them per-chromosome) import 4/5 batches, and then die with no errors. Once in a while it prints an error, and this is one example:

23 Feb 2022 15:33:58,590 DEBUG: 	15:33:58.590 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
23 Feb 2022 15:33:59,328 DEBUG: 	15:33:59.328 INFO  GenomicsDBImport - Done initializing engine
23 Feb 2022 15:33:59,334 DEBUG: 	15:33:59.329 INFO  GenomicsDBImport - Callset Map JSON file will be re-written to /home/exacloud/gscratch/prime-seq/workDir/0950f56b-7565-103a-a738-f8f3fc8675d2/Job2.work/WGS_1852_consolidated.gdb/callset.json
23 Feb 2022 15:33:59,339 DEBUG: 	15:33:59.330 INFO  GenomicsDBImport - Incrementally importing to workspace - /home/exacloud/gscratch/prime-seq/workDir/0950f56b-7565-103a-a738-f8f3fc8675d2/Job2.work/WGS_1852_consolidated.gdb
23 Feb 2022 16:28:45,048 DEBUG: 	16:28:45.047 INFO  GenomicsDBImport - Done importing batch 1/5
23 Feb 2022 17:29:49,240 DEBUG: 	17:29:49.226 INFO  GenomicsDBImport - Done importing batch 2/5
23 Feb 2022 18:26:19,107 DEBUG: 	18:26:19.105 INFO  GenomicsDBImport - Done importing batch 3/5
23 Feb 2022 19:20:18,500 DEBUG: 	19:20:18.478 INFO  GenomicsDBImport - Done importing batch 4/5
24 Feb 2022 16:51:19,017 DEBUG: 	[TileDB::utils] Error: (gzip_handle_error) Cannot decompress with GZIP: inflateInit error: Z_MEM_ERROR
24 Feb 2022 16:51:19,048 DEBUG: 	[TileDB::Codec] Error: Could not decompress with GZIP.
24 Feb 2022 16:51:19,056 DEBUG: 	[TileDB::ReadState] Error: Cannot decompress tile for /home/exacloud/gscratch/prime-seq/workDir/0950f56b-7565-103a-a738-f8f3fc8675d2/Job2.work/WGS_1852_consolidated.gdb/2$1$196197964/__e7217c9e-767d-4295-b75a-9162c22c6996139785909643008_1613563029631/END.tdb.
24 Feb 2022 16:51:51,388 DEBUG: 	16:51:51.388 erro  NativeGenomicsDB - pid=225263 tid=225739 VariantStorageManagerException exception : Error while consolidating TileDB array 2$1$196197964
24 Feb 2022 16:51:51,405 DEBUG: 	TileDB error message : 
24 Feb 2022 16:51:51,412 DEBUG: 	terminate called after throwing an instance of 'VariantStorageManagerException'
24 Feb 2022 16:51:51,419 DEBUG: 	  what():  VariantStorageManagerException exception : Error while consolidating TileDB array 2$1$196197964
24 Feb 2022 16:51:51,427 DEBUG: 	TileDB error message : 
24 Feb 2022 16:52:27,478 WARN : 	process exited with non-zero value: 134

Does that give anything to suggest troubleshooting steps?

The full command is:

/home/exacloud/gscratch/prime-seq/java/java8/bin/java \
	Xmx497g -Xms497g -Xss2m \
	-jar /home/exacloud/gscratch/prime-seq/bin/GenomeAnalysisTK4.jar \
	GenomicsDBImport \
	-V 25780.g.vcf.gz \
	-V <total of 92 gVCFs> \
	--genomicsdb-update-workspace-path WGS_1852_consolidated.gdb \
	--batch-size 10 \
	--reader-threads 12 \
	--consolidate \
	--genomicsdb-shared-posixfs-optimizations \
	--bypass-feature-reader \
	-R 128_Mmul_10.fasta

this is GATK v4.2.5.0. Thanks i advance for any ideas.

@nalinigans
Copy link
Collaborator

@bbimber, this is another manifestation of running out of memory - Error: (gzip_handle_error) Cannot decompress with GZIP: inflateInit error: Z_MEM_ERROR. We have been looking into these types of issues with heap profilers. Will share a libgenomicsdb.so to try out with gatk when we have something.

@nalinigans
Copy link
Collaborator

nalinigans commented Feb 25, 2022

@bbimber, thanks to @mlathara and @kgururaj, here is a suggestion.

With ~36 attributes+offsets and ~80 fragments in GenomicsDB parlance getting stored, with --genomicsdb-segment-size=1048576(1M default), we are looking at memory requirements in the range of 10G of memory for reading during consolidation. Just wondering, if you could try GenomicsDBImport with the following options to see if it helps.

      java heap options of say -Xmx100g -Xms 100g
       --genomicsdb-update-workspace-path WGS_1852_consolidated.gdb \
       --batch-size 10 \
       --consolidate \
       --genomicsdb-shared-posixfs-optimizations \
       --bypass-feature-reader \
       --genomicsdb-segment-size 32768 \

@bbimber
Copy link
Contributor Author

bbimber commented Feb 25, 2022

Thanks, we'll try this.

@bbimber
Copy link
Contributor Author

bbimber commented Feb 27, 2022

An update on this, I restarted with genomicsdb-segment-size as you suggested. here's the timing thus far:

26 Feb 2022 02:19:16,306 DEBUG: 	02:19:16.301 INFO  GenomicsDBImport - Callset Map JSON file will be re-written to /home/exacloud/gscratch/prime-seq/workDir/0950f56d-7565-103a-a738-f8f3fc8675d2/Job4.work/WGS_1852_consolidated.gdb/callset.json
26 Feb 2022 02:19:16,312 DEBUG: 	02:19:16.301 INFO  GenomicsDBImport - Incrementally importing to workspace - /home/exacloud/gscratch/prime-seq/workDir/0950f56d-7565-103a-a738-f8f3fc8675d2/Job4.work/WGS_1852_consolidated.gdb
26 Feb 2022 03:45:50,581 DEBUG: 	03:45:50.445 INFO  GenomicsDBImport - Done importing batch 1/5
26 Feb 2022 05:24:59,873 DEBUG: 	05:24:59.588 INFO  GenomicsDBImport - Done importing batch 2/5
26 Feb 2022 06:45:40,497 DEBUG: 	06:45:40.103 INFO  GenomicsDBImport - Done importing batch 3/5
26 Feb 2022 08:00:46,636 DEBUG: 	08:00:46.328 INFO  GenomicsDBImport - Done importing batch 4/5

So it progressed through the first 4 batches, which is what it did originally. Previously, it would always die after logging this line. It's been at this point for ~24H, but at least it appears to still be running. Do you expect it to log 'done importing batch 5/5'? Any idea what GenomicsDB is doing in this phase?

@nalinigans
Copy link
Collaborator

Yes, we do expect a done importing batch 5/5. The reason it is taking long is because it is doing the consolidation in this phase too. Should probably disassociate the import with the consolidate phases, so we can print out consolidating batches or something like that.

Is the import finished at this point? If not, would it be possible to run a utility like top to print out cpu and memory usage of gatk GenomicsDBImport?

@bbimber
Copy link
Contributor Author

bbimber commented Feb 28, 2022

@nalinigans No, not finished but also not dead. This is all running on a slurm cluster. I will see about connecting to the node. I've never done this, but I know it's possible.

@bbimber
Copy link
Contributor Author

bbimber commented Mar 1, 2022

@nalinigans I'm afraid we're back to failing. You can see from the timestamp the time between Batch 4 and failure:

26 Feb 2022 02:19:16,312 DEBUG: 	02:19:16.301 INFO  GenomicsDBImport - Incrementally importing to workspace - /home/exacloud/gscratch/prime-seq/workDir/0950f56d-7565-103a-a738-f8f3fc8675d2/Job4.work/WGS_1852_consolidated.gdb
26 Feb 2022 03:45:50,581 DEBUG: 	03:45:50.445 INFO  GenomicsDBImport - Done importing batch 1/5
26 Feb 2022 05:24:59,873 DEBUG: 	05:24:59.588 INFO  GenomicsDBImport - Done importing batch 2/5
26 Feb 2022 06:45:40,497 DEBUG: 	06:45:40.103 INFO  GenomicsDBImport - Done importing batch 3/5
26 Feb 2022 08:00:46,636 DEBUG: 	08:00:46.328 INFO  GenomicsDBImport - Done importing batch 4/5
28 Feb 2022 08:30:01,908 DEBUG: 	terminate called after throwing an instance of 'std::bad_alloc'
28 Feb 2022 08:30:01,984 DEBUG: 	  what():  std::bad_alloc
28 Feb 2022 08:30:59,827 WARN : 	process exited with non-zero value: 134

@nalinigans
Copy link
Collaborator

nalinigans commented Mar 3, 2022

@bbimber, sorry that the import with consolidate did not complete. If you are amenable to using a native tool, please download the tool from here for consolidation. This executable will consolidate a given array in a GenomicsDB workspace, it has been instrumented to output memory stats to help tune the segment size. Note that the executable is for Centos 7, if you find any unresolved shared library dependencies during usage, please let me know and I will work on getting another one to you.

For usage from a bash shell:

~/GenomicsDB: ./consolidate_genomicsdb_array
Usage: consolidate_genomicsdb_array [options]
where options include:
	 --help, -h Print a usage message summarizing options available and exit
	 --workspace=<GenomicsDB workspace URI>, -w <GenomicsDB workspace URI>
		 Specify path to GenomicsDB workspace
	 --array-name=<Array Name>, -a <Array Name>
		 Specify name of array that requires consolidation
	 --segment-size=<Segment Size>, -z <Segment Size>
		 Optional, default is 10M. Specify a buffer size for consolidation
	 --shared-posixfs-optimizations, -p
		 Optional, default is false. If specified, the array folder is not locked for read/write and file handles are kept open until a final close for write
	 --version Print version and exit
~/GenomicsDB.: ./consolidate_genomicsdb_array -w /Users/xxx/WGS.gdb/ -a "1\$1\$249250621" -z 1048576 -p
21:09:47.100 info  consolidate_genomicsdb_array - pid=30881 tid=30881 Starting consolidation of 1$1$249250621 in ws
Using buffer_size=1048576 for consolidation
21:9:47 Memory stats(pages) beginning consolidation size=45821 resident=18998 share=1824 text=3530 lib=0 data=16810 dt=0
21:9:47 Memory stats(pages) after alloc for attribute=END size=45821 resident=19009 share=1835 text=3530 lib=0 data=16810 dt=0
21:9:48 Memory stats(pages) after alloc for attribute=REF size=46788 resident=19743 share=1889 text=3530 lib=0 data=17773 dt=0
21:9:48 Memory stats(pages) after alloc for attribute=ALT size=47143 resident=20124 share=1889 text=3530 lib=0 data=18129 dt=0
21:9:48 Memory stats(pages) after alloc for attribute=QUAL size=47143 resident=20151 share=1889 text=3530 lib=0 data=18129 dt=0
21:9:48 Memory stats(pages) after alloc for attribute=FILTER size=47143 resident=20155 share=1889 text=3530 lib=0 data=18129 dt=0
...
21:9:50 Memory stats(pages) after alloc for attribute=GT size=47415 resident=20598 share=1890 text=3530 lib=0 data=18401 dt=0
21:9:50 Memory stats(pages) after alloc for attribute=__coords size=47415 resident=20606 share=1891 text=3530 lib=0 data=18401 dt=0
21:9:51 Memory stats(pages) after final consolidation size=47416 resident=20606 share=1891 text=3530 lib=0 data=18401 dt=0
21:09:51.364 info  consolidate_genomicsdb_array - pid=30881 tid=30881 Consolidation of 1$1$249250621 in ws completed

@bbimber
Copy link
Contributor Author

bbimber commented Mar 3, 2022

@nalinigans We really appreciate you help and suggestions on this. I'll try consolidate_genomicsdb_array. One question: does this modify the workspace in-place? If so, I assume we should clone the workspace, similar to the recommendation for GenomicsDBImport/Append?

@nalinigans
Copy link
Collaborator

Yes, the consolidation will be in-place and you can clone the workspace before trying the consolidate.

@bbimber
Copy link
Contributor Author

bbimber commented Mar 3, 2022

@nalinigans OK, I just added it to our code to optionally run this consolidate tool prior to running GenomicsDBImport. I'll start a trial later this morning.

@bbimber
Copy link
Contributor Author

bbimber commented Mar 5, 2022

@nalinigans OK, so most of these jobs are still going (we run per contig); however, one just died as follows:

03 Mar 2022 12:49:28,390 INFO : 	/home/exacloud/gscratch/prime-seq/bin/consolidate_genomicsdb_array -w /home/exacloud/gscratch/prime-seq/workDir/0950f56c-7565-103a-a738-f8f3fc8675d2/Job3.work/WGS_1852_consolidated.gdb --shared-posixfs-optimizations --segment-size 32768 -a 3$1$185288947
03 Mar 2022 12:49:28,444 DEBUG: 	12:49:28.444 info  consolidate_genomicsdb_array - pid=147768 tid=147768 Starting consolidation of 3$1$185288947 in /home/exacloud/gscratch/prime-seq/workDir/0950f56c-7565-103a-a738-f8f3fc8675d2/Job3.work/WGS_1852_consolidated.gdb
03 Mar 2022 12:58:29,209 DEBUG: 	Using buffer_size=32768 for consolidation
03 Mar 2022 12:58:29,222 DEBUG: 	12:58:29 Memory stats(pages) beginning consolidation size=16404102 resident=16377455 share=1814 text=3530 lib=0 data=16375282 dt=0
03 Mar 2022 12:58:29,228 DEBUG: 	12:58:29 Memory stats(pages) after alloc for attribute=END size=16404138 resident=16377465 share=1821 text=3530 lib=0 data=16375318 dt=0
04 Mar 2022 16:03:57,025 WARN : 	process exited with non-zero value: 137

Is there any information from this, or information i could gather, that's helpful here?

@bbimber
Copy link
Contributor Author

bbimber commented Mar 7, 2022

@nalinigans another update. i've been running the standalone consolidate tool, per chromosome. Below is chr 9. As you can see, it seems to take nearly a full day per attribute. Chr 9 is among the smaller contigs. In contrast, chr 1 has been stuck on the first attribute (END) for ~4 days at this point. I'm not sure if this was the right choice, but you will see this included "--segment-size 32768", based on the conversation above.

03 Mar 2022 12:51:23,371 INFO : Consolidating contig folder: /home/exacloud/gscratch/prime-seq/workDir/0950f572-7565-103a-a738-f8f3fc8675d2/Job9.work/WGS_1852_consolidated.gdb/9$1$134124166
03 Mar 2022 12:51:23,389 INFO : 	/home/exacloud/gscratch/prime-seq/bin/consolidate_genomicsdb_array -w /home/exacloud/gscratch/prime-seq/workDir/0950f572-7565-103a-a738-f8f3fc8675d2/Job9.work/WGS_1852_consolidated.gdb --shared-posixfs-optimizations --segment-size 32768 -a 9$1$134124166
03 Mar 2022 12:51:23,423 DEBUG: using path: /home/exacloud/gscratch/prime-seq/bin:/home/exacloud/gscratch/prime-seq/bin/:/home/exacloud/gscratch/prime-seq/java/current/bin/:/home/exacloud/gscratch/prime-seq/bin/:/usr/local/bin:/usr/bin
03 Mar 2022 12:51:23,510 DEBUG: 	12:51:23.510 info  consolidate_genomicsdb_array - pid=233371 tid=233371 Starting consolidation of 9$1$134124166 in /home/exacloud/gscratch/prime-seq/workDir/0950f572-7565-103a-a738-f8f3fc8675d2/Job9.work/WGS_1852_consolidated.gdb
03 Mar 2022 12:55:30,641 DEBUG: 	Using buffer_size=32768 for consolidation
03 Mar 2022 12:55:30,656 DEBUG: 	12:55:30 Memory stats(pages) beginning consolidation size=9350950 resident=9324278 share=1814 text=3530 lib=0 data=9322130 dt=0
03 Mar 2022 12:55:30,662 DEBUG: 	12:55:30 Memory stats(pages) after alloc for attribute=END size=9350984 resident=9324313 share=1821 text=3530 lib=0 data=9322164 dt=0
05 Mar 2022 08:43:03,491 DEBUG: 	8:43:3 Memory stats(pages) after alloc for attribute=REF size=109159142 resident=108901310 share=1425 text=3530 lib=0 data=109130249 dt=0
06 Mar 2022 06:28:14,322 DEBUG: 	6:28:14 Memory stats(pages) after alloc for attribute=ALT size=109953827 resident=109360557 share=1402 text=3530 lib=0 data=109924938 dt=0
06 Mar 2022 23:14:46,703 DEBUG: 	23:14:46 Memory stats(pages) after alloc for attribute=ID size=110740628 resident=109743238 share=1399 text=3530 lib=0 data=110711742 dt=0

and then chr 1 (like most larger chromosomes) is still stuck on the first one:

03 Mar 2022 14:23:09,051 DEBUG: 	14:23:09.050 info  consolidate_genomicsdb_array - pid=53855 tid=53855 Starting consolidation of 1$1$223616942 in /home/exacloud/gscratch/prime-seq/workDir/0950f56a-7565-103a-a738-f8f3fc8675d2/Job1.work/WGS_1852_consolidated.gdb
03 Mar 2022 14:30:34,976 DEBUG: 	Using buffer_size=32768 for consolidation
03 Mar 2022 14:30:34,985 DEBUG: 	14:30:34 Memory stats(pages) beginning consolidation size=15861025 resident=15834343 share=1813 text=3530 lib=0 data=15832205 dt=0
03 Mar 2022 14:30:34,989 DEBUG: 	14:30:34 Memory stats(pages) after alloc for attribute=END size=15861025 resident=15834363 share=1821 text=3530 lib=0 data=15832205 dt=0

@nalinigans
Copy link
Collaborator

@bbimber, we are investigating some scalable solutions for you. Meanwhile, can you provide the following information?

  1. What is the total and free amount of memory available to say consolidate_genomicsdb_array on your individual nodes?
  2. Sizes of all files under any one fragment say in 9$1$134124166?

@bbimber
Copy link
Contributor Author

bbimber commented Mar 8, 2022

@nalinigans Thanks for following up. We requested 256GB of RAM on these jobs. The size of that folder is 754G

@bbimber
Copy link
Contributor Author

bbimber commented Mar 15, 2022

@nalinigans I dont know if this adds a lot of new information, but below is another example consolidate. This is chr 7, and you can see the duration per step from the timestamps. This still died after progressing through several annotations:

03 Mar 2022 12:57:08,442 DEBUG: 	Using buffer_size=32768 for consolidation
03 Mar 2022 12:57:08,452 DEBUG: 	12:57:8 Memory stats(pages) beginning consolidation size=11996466 resident=11969802 share=1814 text=3530 lib=0 data=11967646 dt=0
03 Mar 2022 12:57:08,456 DEBUG: 	12:57:8 Memory stats(pages) after alloc for attribute=END size=11996500 resident=11969828 share=1821 text=3530 lib=0 data=11967680 dt=0
06 Mar 2022 05:37:03,068 DEBUG: 	5:37:3 Memory stats(pages) after alloc for attribute=REF size=139595627 resident=139423360 share=836 text=3530 lib=0 data=139566732 dt=0
07 Mar 2022 11:47:53,134 DEBUG: 	11:47:53 Memory stats(pages) after alloc for attribute=ALT size=140390278 resident=137278480 share=813 text=3530 lib=0 data=140361391 dt=0
08 Mar 2022 10:59:59,126 DEBUG: 	10:59:59 Memory stats(pages) after alloc for attribute=ID size=141177738 resident=136956246 share=813 text=3530 lib=0 data=141148851 dt=0
09 Mar 2022 00:49:40,244 DEBUG: 	0:49:40 Memory stats(pages) after alloc for attribute=QUAL size=141964174 resident=136233729 share=815 text=3530 lib=0 data=141935286 dt=0
09 Mar 2022 13:23:10,673 DEBUG: 	13:23:10 Memory stats(pages) after alloc for attribute=FILTER size=142226317 resident=135364172 share=808 text=3530 lib=0 data=142197431 dt=0
10 Mar 2022 03:18:26,727 DEBUG: 	3:18:26 Memory stats(pages) after alloc for attribute=DP size=143012754 resident=135994842 share=815 text=3530 lib=0 data=142983866 dt=0
10 Mar 2022 14:38:05,393 DEBUG: 	14:38:5 Memory stats(pages) after alloc for attribute=BaseQRankSum size=143275409 resident=135331044 share=808 text=3530 lib=0 data=143246523 dt=0
11 Mar 2022 02:34:19,708 DEBUG: 	2:34:19 Memory stats(pages) after alloc for attribute=ExcessHet size=143537555 resident=134804435 share=809 text=3530 lib=0 data=143508668 dt=0
11 Mar 2022 13:13:54,835 DEBUG: 	13:13:54 Memory stats(pages) after alloc for attribute=InbreedingCoeff size=143799699 resident=134672219 share=808 text=3530 lib=0 data=143770813 dt=0
11 Mar 2022 23:22:21,918 DEBUG: 	23:22:21 Memory stats(pages) after alloc for attribute=MLEAC size=144061846 resident=134864673 share=810 text=3530 lib=0 data=144032958 dt=0
12 Mar 2022 18:38:09,972 DEBUG: 	18:38:9 Memory stats(pages) after alloc for attribute=MLEAF size=144848279 resident=135442824 share=812 text=3530 lib=0 data=144819393 dt=0
13 Mar 2022 08:01:22,139 WARN : 	process exited with non-zero value: 137

@bbimber
Copy link
Contributor Author

bbimber commented Mar 19, 2022

@nalinigans With the GenomicsDB workspace we have, in its current state, pretty much all attempts to run consolidate ultimately die. Since we have computational capacity, we're trying to iteratively remake it. I am wondering if you can comment or confirm this theory:

  • Let's assume we initialize a workspace by importing X samples. We run consolidate.
  • We append X new samples. We run consolidate.
  • Keep repeating

I am assuming that since we're constantly consolidating, even as the workspace grows, the fact that we have been doing this cleanup reduces the total work for each subsequent consolidate. Is this true?

Because the standalone genomicsDB consolidate tool seems better able to run than GATK with the --consolidate argument (which tends to essentially stall in our hands), we've begun to remake the ~2000 WGS sample workspace, consolidating after each iteration of new samples.

@nalinigans
Copy link
Collaborator

nalinigans commented Mar 19, 2022

@bbimber, your approach should mostly work, this is exactly what I am going to allow with the standalone tool, a new arg for batch-wise consolidation. The tool is also better optimized with memory allocations and you can specify the batch size to the tool for batch-wise consolidation which should clamp down the memory use. Still testing out the tool, hopefully I can get some version to you over the weekend or on Monday.

@bbimber
Copy link
Contributor Author

bbimber commented Mar 22, 2022

@nalinigans This would be much appreciated. We're perfectly happy to try out draft versions as well.

@nalinigans
Copy link
Collaborator

nalinigans commented Mar 28, 2022

@bbimber, I have placed another version of consolidate_genomicsdb_array here. This allows for batch-wise consolidation with the --batch-size or -b option. The tool also does better reuse of the consolidation buffers between reads, so might work a little better. Be aware that the total time to consolidate increases with the batch option and the final batches require almost as much memory as consolidating all the fragments at once. Please try consolidating any one GenomicsDB array to see how it functions as we are still working at making this scalable in a better way. This is just a draft version and if you can share some of the resulting logs, that will be very helpful.

Please do let me know the total size of all the __book_keeping.tdb.gz files in your fragments. Just a back-of-envelope calculation, you probably need about 40 times that total size of memory to successfully consolidate.

@bbimber
Copy link
Contributor Author

bbimber commented Mar 28, 2022

@nalinigans thank you very much! do you have any guidance on what a reasonable batch size might be?

@nalinigans
Copy link
Collaborator

You can start with consolidating all of them at once (with no batch size) as your fragments are more or less the same size and it should perform OK with the new tool. You can even use the default buffer size as it should not be a factor with the new tool. If that does not work you can probably use number-of-fragments/2 or number-of-fragments/4 for a batch size.

@bbimber
Copy link
Contributor Author

bbimber commented Apr 4, 2022

@nalinigans Mixed results so far. I'm running the new consolidate tool, per chromosome as before. I ran it with defaults (no custom arguments). It is running longer than previously, but chr 1, the largest, died after consolidating 2 attributes. This job had 248G of RAM allocated. Are there optimizations you'd suggest?

02 Apr 2022 16:34:31,433 DEBUG: 	[April 2, 2022 4:34:31 PM PDT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 5,993.06 minutes.
02 Apr 2022 16:34:31,438 DEBUG: 	Runtime.totalMemory()=178017796096
02 Apr 2022 16:34:31,443 DEBUG: 	Tool returned:
02 Apr 2022 16:34:31,448 DEBUG: 	true
02 Apr 2022 16:34:34,663 INFO : Will consolidate the workspace using consolidate_genomicsdb_array
02 Apr 2022 16:34:34,723 INFO : Consolidating contig folder: /home/exacloud/gscratch/prime-seq/workDir/344c6137-8a85-103a-821a-f8f3fc86deba/Job1.work/WGS_v2_713_2ndMerge.gdb/1$1$223616942
02 Apr 2022 16:34:34,748 INFO : 	/home/exacloud/gscratch/prime-seq/bin/consolidate_genomicsdb_array -w /home/exacloud/gscratch/prime-seq/workDir/344c6137-8a85-103a-821a-f8f3fc86deba/Job1.work/WGS_v2_713_2ndMerge.gdb --shared-posixfs-optimizations -a 1$1$223616942
02 Apr 2022 16:34:34,754 DEBUG: using path: /home/exacloud/gscratch/prime-seq/bin:/home/exacloud/gscratch/prime-seq/bin/:/home/exacloud/gscratch/prime-seq/java/current/bin/:/home/exacloud/gscratch/prime-seq/bin/:/usr/local/bin:/usr/bin
02 Apr 2022 16:34:35,059 DEBUG: 	16:34:35.059 info  consolidate_genomicsdb_array - pid=34848 tid=34848 Starting consolidation of 1$1$223616942 in /home/exacloud/gscratch/prime-seq/workDir/344c6137-8a85-103a-821a-f8f3fc86deba/Job1.work/WGS_v2_713_2ndMerge.gdb
02 Apr 2022 16:34:36,091 DEBUG: 	Using buffer_size=10485760 for consolidation
02 Apr 2022 16:34:36,097 DEBUG: 	Number of fragments to consolidate=26
02 Apr 2022 16:34:36,101 DEBUG: 	Sat Apr  2 16:34:36 2022 Memory stats beginning consolidation size=483MB resident=379MB share=6MB text=13MB lib=0 data=371MB dt=0
02 Apr 2022 16:34:36,105 DEBUG: 	Sat Apr  2 16:34:36 2022 Memory stats Start: batch 1/1 size=503MB resident=379MB share=6MB text=13MB lib=0 data=391MB dt=0
02 Apr 2022 16:38:15,825 DEBUG: 	Sat Apr  2 16:38:15 2022 Memory stats after alloc for attribute=END size=25GB resident=25GB share=7MB text=13MB lib=0 data=25GB dt=0

@nalinigans
Copy link
Collaborator

@bbimber, what is the average size of your __book_keeping.tdb.gz? And you mentioned that you had about 80 fragments, right?

@bbimber
Copy link
Contributor Author

bbimber commented Apr 11, 2022

@nalinigans This iteration is on a smaller input (~600 WGS samples). Based on the info below, do you suggest changing --buffer-size or --batch-size?

To your questions:

In chromosome 1's folder (1$1$223616942), there are 26 fragments (the GUID-named folders). The sizes of book_keeping files are:

168M
131M
155M
149M
136M
142M
216M
147M
150M
134M
127M
75M
172M
122M
207M
122M
581M
149M
150M
141M
149M
143M
143M
165M
160M
163M

The last job failed an OOM error (the job requested 256GB and the slurm controller killed it). This is the command and output (with timestamps):

10 Apr 2022 01:56:21,275 INFO : 	/home/exacloud/gscratch/prime-seq/bin/consolidate_genomicsdb_array -w /home/exacloud/gscratch/prime-seq/workDir/344c6137-8a85-103a-821a-f8f3fc86deba/Job1.work/WGS_v2_713_2ndMerge.gdb --shared-posixfs-optimizations -a 1$1$223616942

10 Apr 2022 01:56:21,556 DEBUG: 	01:56:21.556 info  consolidate_genomicsdb_array - pid=146087 tid=146087 Starting consolidation of 1$1$223616942 in /home/exacloud/gscratch/prime-seq/workDir/344c6137-8a85-103a-821a-f8f3fc86deba/Job1.work/WGS_v2_713_2ndMerge.gdb
10 Apr 2022 01:56:22,385 DEBUG: 	Using buffer_size=10485760 for consolidation
10 Apr 2022 01:56:22,391 DEBUG: 	Number of fragments to consolidate=26
10 Apr 2022 01:56:22,396 DEBUG: 	Sun Apr 10 01:56:22 2022 Memory stats beginning consolidation size=483MB resident=379MB share=6MB text=13MB lib=0 data=371MB dt=0
10 Apr 2022 01:56:22,400 DEBUG: 	Sun Apr 10 01:56:22 2022 Memory stats Start: batch 1/1 size=503MB resident=379MB share=6MB text=13MB lib=0 data=391MB dt=0
10 Apr 2022 01:59:22,970 DEBUG: 	Sun Apr 10 01:59:22 2022 Memory stats after alloc for attribute=END size=25GB resident=25GB share=7MB text=13MB lib=0 data=25GB dt=0
11 Apr 2022 03:36:27,065 WARN : 	process exited with non-zero value: 137

@bbimber
Copy link
Contributor Author

bbimber commented Jun 8, 2022

@nalinigans Thanks for the help on this. I wanted to let you know that even with the standalone tool and improvements, we've basically given up on trying to aggregate whole genome GenomicsDB workspaces with more than ~1000 samples. We can get virtually all the chromosomes to consolidate, but the largest (chr 1) can go for 10 days and eventually dies. It's just too unwieldy to be practical.

My current idea is to stage our data into more reasonable workspaces of ~500 samples. These have the benefit of allowing existing data to be static, and we just keep appending/making new workspaces as we get more samples. Since GenotypeGVCFs only allows one input, the plan is:

  • Run GenotypeGVCFs once per workspace. This is scattered into a huge number of jobs, so each job is reasonable.
  • Take the VCFs from each of these and take the union of all sites.
  • Repeat GenotypeGVCFs per workspace, giving it those sites as the whitelist of sites to force genotyping
  • Merge the resulting VCFs into one

The latter is currently being tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants