Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434

Closed
vruano opened this issue Nov 19, 2018 · 10 comments

Comments

@vruano
Copy link
Contributor

vruano commented Nov 19, 2018

Instructions

Initially reported by a user on the forum... Aparently some variants with non-zero quals have 0 AD and DPs. Other annotations are also missing from the INFO columns.

After some debugging it turns out that the criteria to determine whether a read should be considered for a variant in terms of alignment overlap are different for taking part of PL calculation and AD/DP calculation.

Where is not totally clear what is the best way to go in practice. It seems to me that we should be consistent here and both PL and AD/DP should use the same criterion. The offending code lines:

HaplotypeCallerGenotypingEngine.java ln171:

ReadLikelihoods<Allele> readAlleleLikelihoods = readLikelihoods.marginalize(alleleMapper, 
         new SimpleInterval(mergedVC).expandWithinContig(ALLELE_EXTENSION, header.getSequenceDictionary()));
            if (configuration.isSampleContaminationPresent()) {
                readAlleleLikelihoods.contaminationDownsampling(configuration.getSampleContamination());
            }

The code above decides the involvement in PL calculations. Notice that ALLELE_EXTENSION is set to 2.

For the AD/DP and so on the code responsible is in AssemblyBasedCallerGenotypingEngine.java ln366:

    // Otherwise (else part) we need to do it again.
        if (configuration.useFilteredReadMapForAnnotations || !configuration.isSampleContaminationPresent()) {
            readAlleleLikelihoodsForAnnotations = readAlleleLikelihoodsForGenotyping;
            readAlleleLikelihoodsForAnnotations.filterToOnlyOverlappingReads(loc);
        } else {
            readAlleleLikelihoodsForAnnotations = readHaplotypeLikelihoods.marginalize(alleleMapper, loc);
            if (emitReferenceConfidence) {
                readAlleleLikelihoodsForAnnotations.addNonReferenceAllele(Allele.NON_REF_ALLELE);
            }
        }

The filterToOnlyOverlappingReads(loc) is called then the overlap criterion is strict. (e.g. 0bp padding). This is also the case for the marginalize call if the conditional is false as the loc passed has not been padded.

It seems to me that setting the ALLELE_EXTENSION == 2 is a very deliberative action (so it was done for a reason) and perhaps this is the way to go... but in deed if the read really does not overlap the variant should be considered at all.

This come from a more complex discussion whether the in cases whether variants are totally linked in the assembly graph we should consider reads supporting another variant alleles as supporting this other variant linked allele or not. I think that user found it a bit strange that this would be the case and perhaps this is the reason why we are doing this read filtering in the first place.

@vruano
Copy link
Contributor Author

vruano commented Nov 19, 2018

Please whatever the course of action, report it back on the forum thread using the link above.

@vruano
Copy link
Contributor Author

vruano commented Nov 19, 2018

@ldgauthier what do you think would be the implications of fixing this by either keeping the 2bp ALLELE_EXTENSION overlap or remove it. I guess that most of the time the variant is supported by a healthy number of reads and the AD/DP is perhaps a couple of reads lower that is supposed based on the PL if anything.

It is more parsimonious to simply don't consider reads that don't overlap the variant but it seems to me that the 2bp was put there for a reason (increase sensitivity?)

@vruano
Copy link
Contributor Author

vruano commented Nov 19, 2018

A third option is allow for the 2bp padding but conditional that there are at least a read or a few reads that actually overlap the variant so that no variant is supported exclusively by "ghost" reads (i.e. those not counted in AD/DP).

@vruano vruano changed the title Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to read to be considered for PL and AD/DP annotations. Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. Nov 21, 2018
@ldgauthier
Copy link
Contributor

I would rather revert the 2bp padding by default. I suspect it was put in to increase sensitivity as you suggest, in cases where there are phased variants in close proximity. If two reads supported the event during assembly, but then neither of them pass filterPoorlyMappingReads then that's pretty suspicious. As you suggested yesterday, I'm in favor of making it an option so that people with low coverage data don't lose sensitivity.

@ldgauthier
Copy link
Contributor

So another inconsistency that arises from the support of non-overlapping reads is that the variants don't get annotated properly. For example:

chr4    69841856        .       AAGTTAAAC       A,<NON_REF>     11.37   .       DP=1;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;RAW_MQandDP=3600,1 GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,1,0:1:3:0|1:69841856_AAGTTAAAC_A:21,3,0,21,3,21:0,0,1,0
chr4    69841865        .       A       <NON_REF>       .       .       END=69841867    GT:DP:GQ:MIN_DP:PL      0/0:1:0:0:0,0,0
chr4    69841868        .       A       G,<NON_REF>     11.37   .       ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00    GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,0,0:0:3:0|1:69841856_AAGTTAAAC_A:21,3,0,21,3,21:0,0,0,0

where the phased variant with no overlapping support is missing DP and the information for MQ because it has no reads from which to calculate those annotations.

@vruano
Copy link
Contributor Author

vruano commented Dec 13, 2018

I guess that should be fixed as well. Do you have the command and data to reproduce... I can add the integration test for that.

@ldgauthier
Copy link
Contributor

This is a different sample/locus, but same issue:
java -jar $GATKjar HaplotypeCaller --emit-ref-confidence GVCF --gvcf-gq-bands 10 --gvcf-gq-bands 20 --gvcf-gq-bands 30 --gvcf-gq-bands 40 --gvcf-gq-bands 50 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --contamination-fraction-to-filter 0.011583226666666667 -L gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/873ad750-62ac-41a1-857a-a9388ad23392/call-BamToGvcf/BamToGvcf/c6c55d1d-7f26-4816-b268-10acfe409083/call-ScatterIntervalList/glob-cb4648beeaff920acb03de7603c06f98/41scattered.interval_list -I gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/873ad750-62ac-41a1-857a-a9388ad23392/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/3945800d-4668-41b0-9eb0-602c2bf9b209/call-GatherBamFiles/NA12878.bam -G StandardAnnotation -G AS_StandardAnnotaiton -new-qual
where GATKjar is 4.0.10.1 in my case

My problematic output is:
chr6 149638730 . G GTTT,<NON_REF> 0.13 . AS_RAW_BaseQRankSum=||;AS_RAW_MQ=0.00|0.00|0.00;AS_RAW_MQRankSum=||;AS_RAW_ReadPosRankSum=||;AS_SB_TABLE=0,0|0,0|0,0;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00 GT:AD:DP:GQ:PL:SB 1/1:0,0,0:0:3:19,3,0,19,3,19:0,0,0,0
which is missing DP, QD, and RAW_MQandDP

Let me know if you don't have access to the gotc-dev bucket.

vruano added a commit that referenced this issue Jul 24, 2019
… DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 24, 2019
… DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 24, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 24, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 26, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 31, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 31, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 31, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:
   --allele-informative-reads-overlap-radius INT (default == 0).
vruano added a commit that referenced this issue Jul 31, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:

--allele-informative-reads-overlap-radius INT (default == 2).

We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone.
However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation is in order.

Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice
DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads
that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD.

When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that
the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing
evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure).

This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
AD and DPs which some user may find more appropriate.

In anycase the question still remains on whether a radius of 2 is ideal.
vruano added a commit that referenced this issue Nov 22, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:

--allele-informative-reads-overlap-radius INT (default == 2).

We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone.
However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation is in order.

Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice
DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads
that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD.

When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that
the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing
evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure).

This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
AD and DPs which some user may find more appropriate.

In anycase the question still remains on whether a radius of 2 is ideal.
@vruano
Copy link
Contributor Author

vruano commented Nov 23, 2019

Doing some additional refactoring and rebasing.

@vruano
Copy link
Contributor Author

vruano commented Nov 23, 2019

Tried to add that example/bug above but the bam is no longer available. A quick try with another NA12878 bam didn't reproduce the bug. So won't do it for now.

vruano added a commit that referenced this issue Nov 26, 2019
…d DP calculations.

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
both and give the option to change it using an additional argument in HC and Mutec:

--allele-informative-reads-overlap-radius INT (default == 2).

We keep the wired value for the PL calculation since that affects sensitivity and we can observed a fall in QUAL values in integration test alone.
However, this is not compeling evidence that indeed this is the most appropriate value and a full evaluation is in order.

Now the AD and DP annotations would take in consideration this new radius (was 0 before). This will result in large values for AD and DP. In practice
DP nearly every time goes up by a few reads whereas AD does it more rarely. This is due to the fact that the radius increase often just add reads
that are actually not informative for PL and so the have the same likelihood for each variants and they don't contribute to AD.

When AD increases though one may thing that that is a good thing as we are adding additional suportive reads.... however you need to notice that
the `why we simply don't use all the reads in the active region in the first place (radius = +Infinity) this could often apply here... these reads are providing
evidence thru linakge-desecilibrium (arguably a good thing) or perhaps a defficient assembly with missing key haplotypes (bad for sure).

This makes is more likely that sum(AD) diverges from DP which some user may not be happy (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
AD and DPs which some user may find more appropriate.

Also (silent) bug fixes in AlleleLikelihoods and separation of marginzalization and filtering based on, amongst other things, read overlap on a particular location (e.g. a variant site)
vruano added a commit that referenced this issue Nov 26, 2019
…nd DP calculations.

    It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
    with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
    both and give the option to change it using an additional argument in HC and Mutec:

    --allele-informative-reads-overlap-radius INT (default == 2).

    We keep the wired value for the PL calculation since that affects sensitivity and we
    can observed a fall in QUAL values in integration test alone.

    However, this is not compeling evidence that indeed this is the most appropriate value
    and a full evaluation might be in order.

    Now the AD and DP annotations would take in consideration this new radius (was 0 before).
    This will result in large values for AD and DP. In practice DP nearly every time goes up by
    a few reads whereas AD does it more rarely. This is due to the fact that the radius increase
    often just add reads that are actually not informative for PL and so the have the same likelihood
    for each variants and they don't contribute to AD.

    When AD increases though one may thing that that is a good thing as we are adding additional
    suportive reads.... however you need to notice that the `why we simply don't use all the reads
    in the active region in the first place (radius = +Infinity) this could often apply here...
    these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps
    a defficient assembly with missing key haplotypes (bad for sure).

    This makes is more likely that sum(AD) diverges from DP which some user may not be happy
    (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
    AD and DPs which some user may find more appropriate.
vruano added a commit that referenced this issue Nov 27, 2019
…nd DP calculations.

    It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
    with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
    both and give the option to change it using an additional argument in HC and Mutec:

    --allele-informative-reads-overlap-radius INT (default == 2).

    We keep the wired value for the PL calculation since that affects sensitivity and we
    can observed a fall in QUAL values in integration test alone.

    However, this is not compeling evidence that indeed this is the most appropriate value
    and a full evaluation might be in order.

    Now the AD and DP annotations would take in consideration this new radius (was 0 before).
    This will result in large values for AD and DP. In practice DP nearly every time goes up by
    a few reads whereas AD does it more rarely. This is due to the fact that the radius increase
    often just add reads that are actually not informative for PL and so the have the same likelihood
    for each variants and they don't contribute to AD.

    When AD increases though one may thing that that is a good thing as we are adding additional
    suportive reads.... however you need to notice that the `why we simply don't use all the reads
    in the active region in the first place (radius = +Infinity) this could often apply here...
    these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps
    a defficient assembly with missing key haplotypes (bad for sure).

    This makes is more likely that sum(AD) diverges from DP which some user may not be happy
    (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
    AD and DPs which some user may find more appropriate.
vruano added a commit that referenced this issue Dec 6, 2019
…nd DP calculations.

    It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
    with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
    both and give the option to change it using an additional argument in HC and Mutec:

    --allele-informative-reads-overlap-radius INT (default == 2).

    We keep the wired value for the PL calculation since that affects sensitivity and we
    can observed a fall in QUAL values in integration test alone.

    However, this is not compeling evidence that indeed this is the most appropriate value
    and a full evaluation might be in order.

    Now the AD and DP annotations would take in consideration this new radius (was 0 before).
    This will result in large values for AD and DP. In practice DP nearly every time goes up by
    a few reads whereas AD does it more rarely. This is due to the fact that the radius increase
    often just add reads that are actually not informative for PL and so the have the same likelihood
    for each variants and they don't contribute to AD.

    When AD increases though one may thing that that is a good thing as we are adding additional
    suportive reads.... however you need to notice that the `why we simply don't use all the reads
    in the active region in the first place (radius = +Infinity) this could often apply here...
    these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps
    a defficient assembly with missing key haplotypes (bad for sure).

    This makes is more likely that sum(AD) diverges from DP which some user may not be happy
    (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
    AD and DPs which some user may find more appropriate.

    Also fixes (silent) bugs in AlleleLikelihoods
    including some refactoring including splitting marginalize and evidence filtering.
vruano added a commit that referenced this issue Dec 7, 2019
…nd DP calculations. (#6055)

It turns out that we use a wired 2bp overlap values for PL but 0 for AD. As a result some variants
    with convinicing PL have 0 AD to support alternative. This change make sure that the same radius is used for
    both and give the option to change it using an additional argument in HC and Mutec:

    --allele-informative-reads-overlap-radius INT (default == 2).

    We keep the wired value for the PL calculation since that affects sensitivity and we
    can observed a fall in QUAL values in integration test alone.

    However, this is not compeling evidence that indeed this is the most appropriate value
    and a full evaluation might be in order.

    Now the AD and DP annotations would take in consideration this new radius (was 0 before).
    This will result in large values for AD and DP. In practice DP nearly every time goes up by
    a few reads whereas AD does it more rarely. This is due to the fact that the radius increase
    often just add reads that are actually not informative for PL and so the have the same likelihood
    for each variants and they don't contribute to AD.

    When AD increases though one may thing that that is a good thing as we are adding additional
    suportive reads.... however you need to notice that the `why we simply don't use all the reads
    in the active region in the first place (radius = +Infinity) this could often apply here...
    these reads are providing evidence thru linakge-desecilibrium (arguably a good thing) or perhaps
    a defficient assembly with missing key haplotypes (bad for sure).

    This makes is more likely that sum(AD) diverges from DP which some user may not be happy
    (although is not incorrect) but at least non 0,0,0 PL will now in hand with non 0
    AD and DPs which some user may find more appropriate.

    Also fixes (silent) bugs in AlleleLikelihoods
    including some refactoring including splitting marginalize and evidence filtering.
@davidbenjamin
Copy link
Contributor

Closed by #6055.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants