
GenomicsDB Import/Select fails on record containing a spanning deletion allele #4716

Closed
cwhelan opened this issue Apr 30, 2018 · 3 comments
cwhelan commented Apr 30, 2018

Reading from GenomicsDB fails when some records containing spanning deletion alleles are imported into a workspace. Not all such records trigger the failure; I haven't been able to determine which specific properties of a record cause the error. Here are the contents (minus header) of a VCF file that triggers it:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA12878
20	10097436	.	CTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTT	C,<NON_REF>	1054.73	.	BaseQRankSum=1.820;ClippingRankSum=0.000;DP=89;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-6.464;RAW_MQ=262143.00;ReadPosRankSum=-3.231	GT:AD:DP:GQ:PL:SB	0/1:57,32,0:89:99:1092,0,2241,1263,2338,3601:23,34,11,21
20	10097437	.	TTTTC	*,T,<NON_REF>	2089.73	.	DP=76;ExcessHet=3.0103;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;RAW_MQ=217330.00	GT:AD:DP:GQ:PL:SB	1/2:0,32,23,0:55:99:2127,940,1799,1195,0,1125,2201,1453,1262,2642:0,0,16,39
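For orientation (not part of the original report): the second record carries the spanning-deletion allele `*` because its position falls inside the deletion called in the first record. A minimal sketch of that overlap check, with hypothetical helper names:

```python
def spans(del_pos, ref_len, site_pos):
    """Return True if a deletion starting at del_pos with a REF allele of
    ref_len bases overlaps site_pos (1-based coordinates, as in VCF).
    The anchor base at del_pos itself is not deleted."""
    return del_pos < site_pos <= del_pos + ref_len - 1

# The deletion at 20:10097436 has a 37 bp REF and a 1 bp ALT, so it removes
# positions 10097437..10097472 -- which is why 20:10097437 shows a "*" allele.
print(spans(10097436, 37, 10097437))  # True
```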

Steps to reproduce:

./gatk GenomicsDBImport -R src/test/resources/large/human_g1k_v37.20.21.fasta -L 20 -V test_gdb_import.vcf.gz -genomicsdb-workspace-path spanDelWorkspace
./gatk SelectVariants -V gendb://spanDelWorkspace -R src/test/resources/large/human_g1k_v37.20.21.fasta -O test.vcf -L 20

Error:

java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
    at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1490)
    at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:380)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:494)
    at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:488)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:132)
    at htsjdk.variant.bcf2.BCF2Codec.decode(BCF2Codec.java:58)
    at com.intel.genomicsdb.GenomicsDBFeatureReader$GenomicsDBFeatureIterator.next(GenomicsDBFeatureReader.java:357)
    at com.intel.genomicsdb.GenomicsDBFeatureReader$GenomicsDBFeatureIterator.next(GenomicsDBFeatureReader.java:291)
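For readers unfamiliar with the htsjdk check that throws here: `VariantContext.makeAlleles` rejects an allele list containing the same bases twice, comparing alleles while ignoring their REF/ALT status. A rough Python sketch of that behavior (not the htsjdk code itself; exactly how the `*` allele gets mistranslated during the GenomicsDB round trip is not stated in this issue, so the duplicate-`T` call below is illustrative):

```python
def make_alleles(alleles):
    """Mimic the duplicate check in htsjdk's VariantContext.makeAlleles:
    alleles is a list of (bases, is_ref) pairs; raise if the same bases
    appear twice, regardless of REF/ALT status."""
    seen = []
    for bases, is_ref in alleles:
        if any(bases == prev for prev, _ in seen):
            raise ValueError(f"Duplicate allele added to VariantContext: {bases}")
        seen.append((bases, is_ref))
    return seen

# If decoding the record yields "T" twice (e.g. the "*" allele comes back
# as a duplicate of an existing ALT), construction fails as in the trace:
try:
    make_alleles([("T", True), ("T", False), ("<NON_REF>", False)])
except ValueError as e:
    print(e)  # Duplicate allele added to VariantContext: T
```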

This issue was discovered while adding spanning deletion genotyping support to HaplotypeCaller for #2960; resolving it appears to be a prerequisite for the fix for that issue.

droazen commented Apr 30, 2018

@kgururaj Could you please make this ticket your top priority?

droazen commented Apr 30, 2018

@cwhelan @ldgauthier @kgururaj informs me that he's already fixed this in GenomicsDB, and will do a release soon. Once it's out, there will be a PR later this week to update GATK to the latest version.

droazen pushed a commit that referenced this issue Jul 6, 2018
… sites-only query support, and bug fixes (#4645)

This PR makes the changes required to use the latest version of GenomicsDB, which exposes new functionality such as:

* Multi interval import and query support:
    * We create multiple arrays (directories) in a single workspace - one per interval. So, if you wish to import intervals ("chr1", [ 1, 100M ]) and ("chr2", [ 1, 100M ]), you end up with 2 directories/arrays in the workspace with names chr1$1$100M and chr2$1$100M. The array names depend on the partition bounds.
    * During the read phase, the user only supplies the workspace. The array names are obtained by scanning the entries in the workspace and reading the right arrays. For example, if you wish to read ("chr2", [ 50, 50M] ), then only the second array is queried.
    In the previous version of the tool, the array name was the constant genomicsdb_array. The new version is backward compatible for reads: if a directory named genomicsdb_array is found in the workspace, it is passed as the array to the GenomicsDBFeatureReader; otherwise the array names are generated from the directory entry names.
    * Parallel import based on chromosome intervals. The number of threads to use can be specified as an integer argument to the executeImport call. If no argument is specified, the number of threads is determined by Java's ForkJoinPool (typically equal to the number of cores on the system).
    * The max number of intervals to import in parallel can be controlled by the command line argument --max-num-intervals-to-import-in-parallel (default 1)
    Note that increasing parallelism increases the number of FeatureReaders opened to feed data to the importer: with N import threads and a batch size of B, up to N*B feature readers are open at once.
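The array-naming and read-side fallback described above could be sketched as follows (hypothetical function names; the real logic lives in GenomicsDB/GATK):

```python
import os

LEGACY_ARRAY = "genomicsdb_array"

def array_name(contig, begin, end):
    # one array (directory) per partition interval, named from its bounds,
    # e.g. ("chr1", 1, 100000000) -> "chr1$1$100000000"
    return f"{contig}${begin}${end}"

def arrays_for_query(workspace, contig):
    """Pick arrays to read: prefer the legacy single-array layout if present,
    otherwise derive candidate arrays from the workspace directory entries."""
    entries = os.listdir(workspace)
    if LEGACY_ARRAY in entries:
        return [LEGACY_ARRAY]
    return sorted(e for e in entries if e.split("$")[0] == contig)
```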

* Protobuf-based API for import and read #3688 #2687
    * Option to produce GT field
    * Option to produce GT for spanning deletion based on min PL value
    * Doesn't support #4541 or #3689 yet - next version

* Bug fixes
    * Fix for #4716
    * More informative error messages
droazen commented Jul 6, 2018

Fixed in #4645

@droazen droazen closed this as completed Jul 6, 2018