-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix to quality score calculation for sites with spanning deletions #6859
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR changes the HaplotypeCaller
integration tests, I suggest we wait to merge it until after #6634 is merged.
d71d669
to
fca9609
Compare
…obability assigned to homozygous star allele genotypes
fca9609
to
2c15d8e
Compare
@cwhelan Yes, I think we can move forward on this one now -- I've asked @jamesemery to have an initial look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand why there are so many differences in some of these GenotypeGVCFs outputs. It looks like your code should only affect spanning deletions...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 | ||
20 10087820 . C CAG 294.64 . AC=1;AF=0.500;AN=2;AS_BaseQRankSum=-2.100;AS_FS=2.811;AS_MQ=60.00;AS_MQRankSum=0.600;AS_QD=6.15;AS_ReadPosRankSum=-0.400;AS_SOR=1.147;BaseQRankSum=-2.273e+00;DP=93;ExcessHet=3.0103;FS=4.432;MLEAC=1;MLEAF=0.500;MQ=59.55;MQRankSum=0.678;QD=6.14;RAW_MQ=329810.00;ReadPosRankSum=-4.920e-01;SOR=1.302 GT:AD:GQ:PL 0/1:35,13:99:302,0,1079 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why the HaplotypeCaller outputs changed at SpanningDeletion sites. What about these files? This one looks like it has neither spanning deletions to genotype nor has its input file changed. Can you explain why there was so much jitter with the old outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I have no idea. The behavior of HaplotypeCaller must have changed slightly over time to create this jitter without a corresponding update to the integration test files. These changes of e.g. 0.03 or 0.04 to qual score didn't cause the test to fail, I think given the DEFAULT_FLOAT_TOLERANCE
in BaseTest
. Since I added automatic update of expected test results to the GenotypeGVCFsIntegrationTest
, the jitter was passed along to the expected files in this PR. If there's a historical reason that we need to preserve these test files exactly as is, I can revert the automatic-update feature in the test and manually update only the files that have spanning deletion in them -- otherwise I think updating them to account for current jitter is the price of implementing an auto-update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh that makes a lot of sense. In that case 👍
@jamesemery Is that a 👍 for the PR? In that case @droazen could you unblock this by changing your review status from "changes requested"? |
Since the Dragen-HaplotypeCaller PR is already in this is no longer blocked
…6859) * fix off-by one error that was causing the qual calculation to miss probability assigned to homozygous star allele genotypes * make sure we output all sites if forceOutput it true, even if low-qual * updates to handle the case where the only alleles are ref and star * add update expected results option to GenotypeGVCFsIntegrationTest * update integration test files
This fixes a bug in the
AlleleFrequencyCalculator
that was causing quality to be overestimated for sites with*
alleles representing spanning deletions.The bug was causing the calculator to not include homozygous
*
genotypes in the sum of non-site specific variant allele probabilities that is the basis for the qual score. The bug was caused by an off-by-one index error:IndexRange(0,2)
returns[0,1]
, not[0,1,2]
as intended. Not including this genotype inflated the quality score for these sites.Due to interactions with QUAL-based variant and allele trimming, this causes slightly different behavior when HaplotyeCaller is run in modes where it is forced to emit variants for every locus, as can be seen in the
expected/gvcf.basepairResolution.includeNonVariantSites.vcf
test file forGenotypeGVCFsIntegrationTest
: 1) Sites spanned by a deletion are now reported with a*
alt allele and have QUAL 0 and a LowQual filter.Also added a mechanism to
GenotypeGCVFsIntegrationTest
to automatically update the expected result files, similar to what already exists inHaplotypeCallerIntegrationTest
andCombineGVCFsIntegrationTest
.