Bug in HaplotypeCaller: lack of consistency in the overlap criteria applied to reads to be considered for PL and AD/DP annotations. #5434
Comments
Whatever the course of action, please report it back on the forum thread using the link above.
@ldgauthier, what do you think would be the implications of fixing this by either keeping the 2bp ALLELE_EXTENSION overlap or removing it? I guess that most of the time the variant is supported by a healthy number of reads, and AD/DP is, if anything, a couple of reads lower than expected based on the PL. It is more parsimonious to simply not consider reads that don't overlap the variant, but it seems to me that the 2bp padding was put there for a reason (to increase sensitivity?).
A third option is to allow the 2bp padding on the condition that at least one read, or a few reads, actually overlap the variant, so that no variant is supported exclusively by "ghost" reads (i.e. those not counted in AD/DP).
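A minimal Java sketch of this third option, for illustration only. The class, the method names and the `minStrictOverlaps` parameter are hypothetical and are not part of GATK; reads are reduced to closed, 1-based `{start, end}` spans on one contig.

```java
import java.util.List;

// Hypothetical sketch; none of these names exist in GATK.
final class GhostReadGuardSketch {

    /** True if the read span {start, end} overlaps the variant span padded by radius on both sides. */
    static boolean overlaps(int[] read, int variantStart, int variantEnd, int radius) {
        return read[0] <= variantEnd + radius && variantStart - radius <= read[1];
    }

    /**
     * Returns the padded radius only when at least minStrictOverlaps reads
     * overlap the variant with no padding at all; otherwise falls back to 0.
     */
    static int effectiveRadius(List<int[]> reads, int variantStart, int variantEnd,
                               int paddedRadius, int minStrictOverlaps) {
        long strict = reads.stream()
                .filter(r -> overlaps(r, variantStart, variantEnd, 0))
                .count();
        return strict >= minStrictOverlaps ? paddedRadius : 0;
    }

    public static void main(String[] args) {
        // Both reads end just short of a SNP at position 100: "ghost" support only.
        List<int[]> ghostsOnly = List.of(new int[]{50, 98}, new int[]{60, 99});
        // One read truly spans the SNP, so the padding is allowed.
        List<int[]> withRealSupport = List.of(new int[]{50, 98}, new int[]{90, 120});
        System.out.println(effectiveRadius(ghostsOnly, 100, 100, 2, 1));
        System.out.println(effectiveRadius(withRealSupport, 100, 100, 2, 1));
    }
}
```

With this guard, a site backed only by reads that stop short of the variant is evaluated with radius 0, while a site with at least one truly overlapping read keeps the 2bp padding.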
I would rather revert the 2bp padding by default. I suspect it was put in to increase sensitivity, as you suggest, in cases where there are phased variants in close proximity. If two reads supported the event during assembly but neither of them then passes filterPoorlyMappingReads, that's pretty suspicious. As you suggested yesterday, I'm in favor of making it an option so that people with low-coverage data don't lose sensitivity.
Another inconsistency that arises from supporting non-overlapping reads is that the variants don't get annotated properly. For example:
where the phased variant with no overlapping support is missing DP and the MQ information because it has no reads from which to calculate those annotations.
I guess that should be fixed as well. Do you have the command and data to reproduce it? I can add an integration test for that.
This is a different sample/locus, but the same issue: My problematic output is: Let me know if you don't have access to the gotc-dev bucket.
… DP calculations. It turns out that we use a hard-wired 2bp overlap value for PL but 0 for AD. As a result, some variants with convincing PLs have 0 AD supporting the alternative allele. This change makes sure that the same radius is used for both and gives the option to change it using an additional argument in HC and Mutect: --allele-informative-reads-overlap-radius INT (default == 0).
…d DP calculations. It turns out that we use a hard-wired 2bp overlap value for PL but 0 for AD. As a result, some variants with convincing PLs have 0 AD supporting the alternative allele. This change makes sure that the same radius is used for both and gives the option to change it using an additional argument in HC and Mutect: --allele-informative-reads-overlap-radius INT (default == 2). We keep the hard-wired value for the PL calculation since changing it affects sensitivity: we can observe a fall in QUAL values in the integration tests alone. However, this is not compelling evidence that this is indeed the most appropriate value, and a full evaluation is in order. The AD and DP annotations now take this new radius into consideration (it was 0 before). This will result in larger values for AD and DP. In practice DP goes up by a few reads nearly every time, whereas AD does so more rarely. This is because the radius increase often just adds reads that are not actually informative for PL: they have the same likelihood for each allele and so don't contribute to AD. When AD does increase, one may think that is a good thing, as we are adding additional supportive reads. However, the question of why we don't simply use all the reads in the active region in the first place (radius = +Infinity) could often apply here: these reads are providing evidence through linkage disequilibrium (arguably a good thing) or perhaps through a deficient assembly with missing key haplotypes (bad for sure). This makes it more likely that sum(AD) diverges from DP, which some users may not be happy about (although it is not incorrect), but at least non-0,0,0 PLs will now go hand in hand with non-zero AD and DP, which some users may find more appropriate. In any case, the question remains whether a radius of 2 is ideal.
Doing some additional refactoring and rebasing.
I tried to add the example/bug above as a test, but the bam is no longer available. A quick try with another NA12878 bam didn't reproduce the bug, so I won't do it for now.
…d DP calculations. It turns out that we use a hard-wired 2bp overlap value for PL but 0 for AD. As a result, some variants with convincing PLs have 0 AD supporting the alternative allele. This change makes sure that the same radius is used for both and gives the option to change it using an additional argument in HC and Mutect: --allele-informative-reads-overlap-radius INT (default == 2). We keep the hard-wired value for the PL calculation since changing it affects sensitivity: we can observe a fall in QUAL values in the integration tests alone. However, this is not compelling evidence that this is indeed the most appropriate value, and a full evaluation is in order. The AD and DP annotations now take this new radius into consideration (it was 0 before). This will result in larger values for AD and DP. In practice DP goes up by a few reads nearly every time, whereas AD does so more rarely. This is because the radius increase often just adds reads that are not actually informative for PL: they have the same likelihood for each allele and so don't contribute to AD. When AD does increase, one may think that is a good thing, as we are adding additional supportive reads. However, the question of why we don't simply use all the reads in the active region in the first place (radius = +Infinity) could often apply here: these reads are providing evidence through linkage disequilibrium (arguably a good thing) or perhaps through a deficient assembly with missing key haplotypes (bad for sure). This makes it more likely that sum(AD) diverges from DP, which some users may not be happy about (although it is not incorrect), but at least non-0,0,0 PLs will now go hand in hand with non-zero AD and DP, which some users may find more appropriate. Also includes (silent) bug fixes in AlleleLikelihoods and the separation of marginalization from filtering based on, amongst other things, read overlap with a particular location (e.g. a variant site).
…nd DP calculations. (#6055) It turns out that we use a hard-wired 2bp overlap value for PL but 0 for AD. As a result, some variants with convincing PLs have 0 AD supporting the alternative allele. This change makes sure that the same radius is used for both and gives the option to change it using an additional argument in HC and Mutect: --allele-informative-reads-overlap-radius INT (default == 2). We keep the hard-wired value for the PL calculation since changing it affects sensitivity: we can observe a fall in QUAL values in the integration tests alone. However, this is not compelling evidence that this is indeed the most appropriate value, and a full evaluation might be in order. The AD and DP annotations now take this new radius into consideration (it was 0 before). This will result in larger values for AD and DP. In practice DP goes up by a few reads nearly every time, whereas AD does so more rarely. This is because the radius increase often just adds reads that are not actually informative for PL: they have the same likelihood for each allele and so don't contribute to AD. When AD does increase, one may think that is a good thing, as we are adding additional supportive reads. However, the question of why we don't simply use all the reads in the active region in the first place (radius = +Infinity) could often apply here: these reads are providing evidence through linkage disequilibrium (arguably a good thing) or perhaps through a deficient assembly with missing key haplotypes (bad for sure). This makes it more likely that sum(AD) diverges from DP, which some users may not be happy about (although it is not incorrect), but at least non-0,0,0 PLs will now go hand in hand with non-zero AD and DP, which some users may find more appropriate. Also fixes (silent) bugs in AlleleLikelihoods and includes some refactoring, including splitting marginalization from evidence filtering.
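The point about sum(AD) diverging from DP can be illustrated with a small Java sketch. It assumes a simplified model in which every read within the radius counts toward DP, but a read counts toward AD only when its best allele beats the runner-up by more than a gap. The class name, method names and the gap value are hypothetical and are not taken from GATK.

```java
// Hypothetical sketch; not GATK code. Rows are reads, columns are alleles (log10 likelihoods).
final class AdVersusDpSketch {

    /** Counts, per allele, the reads whose best allele wins by more than minGap. */
    static int[] alleleDepths(double[][] log10Lk, int alleleCount, double minGap) {
        int[] ad = new int[alleleCount];
        for (double[] row : log10Lk) {
            int best = 0;
            for (int a = 1; a < alleleCount; a++) {
                if (row[a] > row[best]) best = a;
            }
            double second = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < alleleCount; a++) {
                if (a != best && row[a] > second) second = row[a];
            }
            // Reads with (nearly) equal likelihoods for all alleles are uninformative.
            if (row[best] - second > minGap) ad[best]++;
        }
        return ad;
    }

    /** In this simplified model every read within the radius counts toward depth. */
    static int depth(double[][] log10Lk) {
        return log10Lk.length;
    }

    public static void main(String[] args) {
        double[][] reads = {
            {-0.1, -5.0},  // clearly supports the reference allele
            {-6.0, -0.2},  // clearly supports the alternative allele
            {-1.0, -1.0},  // added by the padding; same likelihood for both alleles
        };
        int[] ad = alleleDepths(reads, 2, 0.2);  // 0.2 is an illustrative gap
        System.out.println("AD=" + ad[0] + "," + ad[1] + " DP=" + depth(reads));
    }
}
```

Here the third read raises DP to 3 while AD stays at 1,1, which is the kind of divergence described in the commit message.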
Closed by #6055.
Instructions
Initially reported by a user on the forum... Apparently some variants with non-zero QUALs have 0 AD and DP. Other annotations are also missing from the INFO column.
After some debugging, it turns out that the criteria used to determine whether a read should be considered for a variant, in terms of alignment overlap, differ between the PL calculation and the AD/DP calculation.
It is not totally clear what the best way to go is in practice. It seems to me that we should be consistent here and that both PL and AD/DP should use the same criterion. The offending code lines:
HaplotypeCallerGenotypingEngine.java ln171:
The code above decides the involvement in PL calculations. Notice that ALLELE_EXTENSION is set to 2.
For AD/DP and so on, the code responsible is in AssemblyBasedCallerGenotypingEngine.java ln366:
When filterToOnlyOverlappingReads(loc) is called, the overlap criterion is strict (i.e. 0bp padding). This is also the case for the marginalize call if the conditional is false, as the loc passed has not been padded.
It seems to me that setting ALLELE_EXTENSION == 2 was a very deliberate action (so it was done for a reason), and perhaps this is the way to go... but indeed, if the read really does not overlap the variant, it should not be considered at all.
This comes from a more complex discussion about whether, in cases where variants are totally linked in the assembly graph, we should consider reads supporting one variant's allele as also supporting the linked allele of the other variant. I think that users found it a bit strange that this would be the case, and perhaps this is the reason why we are doing this read filtering in the first place.
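To make the inconsistency concrete, here is a minimal Java sketch of the two overlap criteria. It is not the GATK code referenced above: the class and method names are hypothetical, and coordinates are assumed to be closed, 1-based positions on a single contig.

```java
// Hypothetical sketch; these names are illustrative and not GATK classes or methods.
final class OverlapRadiusSketch {

    /** True if the read alignment overlaps the variant span padded by radius bases on each side. */
    static boolean overlapsWithRadius(int readStart, int readEnd,
                                      int variantStart, int variantEnd, int radius) {
        return readStart <= variantEnd + radius && variantStart - radius <= readEnd;
    }

    public static void main(String[] args) {
        int readStart = 50, readEnd = 98;   // alignment ends 2bp before the SNP
        int snp = 100;                      // a SNP at position 100

        // Padded criterion (as for PL, radius 2) versus strict criterion (as for AD/DP, radius 0).
        System.out.println("radius 2: " + overlapsWithRadius(readStart, readEnd, snp, snp, 2));
        System.out.println("radius 0: " + overlapsWithRadius(readStart, readEnd, snp, snp, 0));
    }
}
```

A read like this one would take part in the PL calculation under the 2bp padding but would be left out of AD/DP under the strict criterion, which is how a variant can end up with a non-zero QUAL and 0 AD.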