Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

vruano · 2018-11-19T06:17:08Z

Instructions

Initially reported by a user on the forum... Aparently some variants with non-zero quals have 0 AD and DPs. Other annotations are also missing from the INFO columns.

After some debugging it turns out that the criteria to determine whether a read should be considered for a variant in terms of alignment overlap are different for taking part of PL calculation and AD/DP calculation.

Where is not totally clear what is the best way to go in practice. It seems to me that we should be consistent here and both PL and AD/DP should use the same criterion. The offending code lines:

HaplotypeCallerGenotypingEngine.java ln171:

ReadLikelihoods<Allele> readAlleleLikelihoods = readLikelihoods.marginalize(alleleMapper, 
         new SimpleInterval(mergedVC).expandWithinContig(ALLELE_EXTENSION, header.getSequenceDictionary()));
            if (configuration.isSampleContaminationPresent()) {
                readAlleleLikelihoods.contaminationDownsampling(configuration.getSampleContamination());
            }

The code above decides the involvement in PL calculations. Notice that ALLELE_EXTENSION is set to 2.

For the AD/DP and so on the code responsible is in AssemblyBasedCallerGenotypingEngine.java ln366:

    // Otherwise (else part) we need to do it again.
        if (configuration.useFilteredReadMapForAnnotations || !configuration.isSampleContaminationPresent()) {
            readAlleleLikelihoodsForAnnotations = readAlleleLikelihoodsForGenotyping;
            readAlleleLikelihoodsForAnnotations.filterToOnlyOverlappingReads(loc);
        } else {
            readAlleleLikelihoodsForAnnotations = readHaplotypeLikelihoods.marginalize(alleleMapper, loc);
            if (emitReferenceConfidence) {
                readAlleleLikelihoodsForAnnotations.addNonReferenceAllele(Allele.NON_REF_ALLELE);
            }
        }

The filterToOnlyOverlappingReads(loc) is called then the overlap criterion is strict. (e.g. 0bp padding). This is also the case for the marginalize call if the conditional is false as the loc passed has not been padded.

It seems to me that setting the ALLELE_EXTENSION == 2 is a very deliberative action (so it was done for a reason) and perhaps this is the way to go... but in deed if the read really does not overlap the variant should be considered at all.

This come from a more complex discussion whether the in cases whether variants are totally linked in the assembly graph we should consider reads supporting another variant alleles as supporting this other variant linked allele or not. I think that user found it a bit strange that this would be the case and perhaps this is the reason why we are doing this read filtering in the first place.

The text was updated successfully, but these errors were encountered:

vruano · 2018-11-19T06:22:50Z

Please whatever the course of action, report it back on the forum thread using the link above.

vruano · 2018-11-19T17:01:24Z

@ldgauthier what do you think would be the implications of fixing this by either keeping the 2bp ALLELE_EXTENSION overlap or remove it. I guess that most of the time the variant is supported by a healthy number of reads and the AD/DP is perhaps a couple of reads lower that is supposed based on the PL if anything.

It is more parsimonious to simply don't consider reads that don't overlap the variant but it seems to me that the 2bp was put there for a reason (increase sensitivity?)

vruano · 2018-11-19T17:06:00Z

A third option is allow for the 2bp padding but conditional that there are at least a read or a few reads that actually overlap the variant so that no variant is supported exclusively by "ghost" reads (i.e. those not counted in AD/DP).

ldgauthier · 2018-11-29T14:06:17Z

I would rather revert the 2bp padding by default. I suspect it was put in to increase sensitivity as you suggest, in cases where there are phased variants in close proximity. If two reads supported the event during assembly, but then neither of them pass filterPoorlyMappingReads then that's pretty suspicious. As you suggested yesterday, I'm in favor of making it an option so that people with low coverage data don't lose sensitivity.

ldgauthier · 2018-12-07T14:35:27Z

So another inconsistency that arises from the support of non-overlapping reads is that the variants don't get annotated properly. For example:

chr4    69841856        .       AAGTTAAAC       A,<NON_REF>     11.37   .       DP=1;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;RAW_MQandDP=3600,1 GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,1,0:1:3:0|1:69841856_AAGTTAAAC_A:21,3,0,21,3,21:0,0,1,0
chr4    69841865        .       A       <NON_REF>       .       .       END=69841867    GT:DP:GQ:MIN_DP:PL      0/0:1:0:0:0,0,0
chr4    69841868        .       A       G,<NON_REF>     11.37   .       ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00    GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,0,0:0:3:0|1:69841856_AAGTTAAAC_A:21,3,0,21,3,21:0,0,0,0

where the phased variant with no overlapping support is missing DP and the information for MQ because it has no reads from which to calculate those annotations.

vruano · 2018-12-13T21:22:05Z

I guess that should be fixed as well. Do you have the command and data to reproduce... I can add the integration test for that.

ldgauthier · 2018-12-14T15:02:08Z

This is a different sample/locus, but same issue:
java -jar $GATKjar HaplotypeCaller --emit-ref-confidence GVCF --gvcf-gq-bands 10 --gvcf-gq-bands 20 --gvcf-gq-bands 30 --gvcf-gq-bands 40 --gvcf-gq-bands 50 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --contamination-fraction-to-filter 0.011583226666666667 -L gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/873ad750-62ac-41a1-857a-a9388ad23392/call-BamToGvcf/BamToGvcf/c6c55d1d-7f26-4816-b268-10acfe409083/call-ScatterIntervalList/glob-cb4648beeaff920acb03de7603c06f98/41scattered.interval_list -I gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/873ad750-62ac-41a1-857a-a9388ad23392/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/3945800d-4668-41b0-9eb0-602c2bf9b209/call-GatherBamFiles/NA12878.bam -G StandardAnnotation -G AS_StandardAnnotaiton -new-qual
where GATKjar is 4.0.10.1 in my case

My problematic output is:
chr6 149638730 . G GTTT,<NON_REF> 0.13 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|0.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|0,0;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00 GT:AD:DP:GQ:PL:SB 1/1:0,0,0:0:3:19,3,0,19,3,19:0,0,0,0
which is missing DP, QD, and RAW_MQandDP

Let me know if you don't have access to the gotc-dev bucket.

… DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 0).

…d DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 0).

…d DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 2). We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone. However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation is in order. Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD. When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure). This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0 AD and DPs which some user may find more appropriate. In anycase the question still remains on whether a radius of 2 is ideal.

vruano · 2019-11-23T20:06:42Z

Doing some additional refactoring and rebasing.

vruano · 2019-11-23T20:37:37Z

Tried to add that example/bug above but the bam is no longer available. A quick try with another NA12878 bam didn't reproduce the bug. So won't do it for now.

…d DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 2). We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone. However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation is in order. Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD. When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure). This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0 AD and DPs which some user may find more appropriate. Also (silent) bug fixes in AlleleLikelihoods and separation of marginzalization and filtering based on, amongst other things, read overlap on a particular location (e.g. a variant site)

…nd DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 2). We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone. However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation might be in order. Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD. When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure). This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0 AD and DPs which some user may find more appropriate.

…nd DP calculations. It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 2). We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone. However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation might be in order. Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD. When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure). This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0 AD and DPs which some user may find more appropriate. Also fixes (silent) bugs in AlleleLikelihoods including some refactoring including splitting marginalize and evidence filtering.

…nd DP calculations. (#6055) It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for both and give the option to change it using an additional argument in HC and Mutec: --allele-informative-reads-overlap-radius INT (default == 2). We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone. However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation might be in order. Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD. When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure). This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0 AD and DPs which some user may find more appropriate. Also fixes (silent) bugs in AlleleLikelihoods including some refactoring including splitting marginalize and evidence filtering.

davidbenjamin · 2020-02-07T15:12:13Z

Closed by #6055.

vruano added bug HaplotypeCaller labels Nov 19, 2018

byoo mentioned this issue Jan 15, 2019

fix edge case for Mutect2 germline resource #5563

Merged

davidbenjamin closed this as completed Feb 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

vruano commented Nov 19, 2018

vruano commented Nov 19, 2018

vruano commented Nov 19, 2018

vruano commented Nov 19, 2018 •

edited

Loading

ldgauthier commented Nov 29, 2018

ldgauthier commented Dec 7, 2018

vruano commented Dec 13, 2018

ldgauthier commented Dec 14, 2018

vruano commented Nov 23, 2019

vruano commented Nov 23, 2019

davidbenjamin commented Feb 7, 2020

Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

Comments

vruano commented Nov 19, 2018

Instructions

vruano commented Nov 19, 2018

vruano commented Nov 19, 2018

vruano commented Nov 19, 2018 • edited Loading

ldgauthier commented Nov 29, 2018

ldgauthier commented Dec 7, 2018

vruano commented Dec 13, 2018

ldgauthier commented Dec 14, 2018

vruano commented Nov 23, 2019

vruano commented Nov 23, 2019

davidbenjamin commented Feb 7, 2020

vruano commented Nov 19, 2018 •

edited

Loading