Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to quality score calculation for sites with spanning deletions #6859

Merged
merged 7 commits into from
Oct 29, 2020

Conversation

cwhelan
Copy link
Member

@cwhelan cwhelan commented Oct 1, 2020

This fixes a bug in the AlleleFrequencyCalculator that was causing quality to be overestimated for sites with * alleles representing spanning deletions.

The bug was causing the calculator to not include homozygous * genotypes in the sum of non-site specific variant allele probabilities that is the basis for the qual score. The bug was caused by an off-by-one index error: IndexRange(0,2) returns [0,1], not [0,1,2] as intended. Not including this genotype inflated the quality score for these sites.

Due to interactions with QUAL-based variant and allele trimming, this causes slightly different behavior when HaplotyeCaller is run in modes where it is forced to emit variants for every locus, as can be seen in the expected/gvcf.basepairResolution.includeNonVariantSites.vcf test file for GenotypeGVCFsIntegrationTest: 1) Sites spanned by a deletion are now reported with a * alt allele and have QUAL 0 and a LowQual filter.

Also added a mechanism to GenotypeGCVFsIntegrationTest to automatically update the expected result files, similar to what already exists in HaplotypeCallerIntegrationTest and CombineGVCFsIntegrationTest.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good.

droazen
droazen previously requested changes Oct 2, 2020
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR changes the HaplotypeCaller integration tests, I suggest we wait to merge it until after #6634 is merged.

@cwhelan cwhelan force-pushed the cw_newqual_spandel_fix branch from fca9609 to 2c15d8e Compare October 23, 2020 13:20
@cwhelan
Copy link
Member Author

cwhelan commented Oct 23, 2020

@droazen Now that #6634 is merged, can we move forward with this PR? If you haven't had time to check them out already, do you want to take a look at the changes I made to allow GenotypeGCVFsIntegrationTest update its own test files?

@droazen
Copy link
Contributor

droazen commented Oct 26, 2020

@cwhelan Yes, I think we can move forward on this one now -- I've asked @jamesemery to have an initial look

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why there are so many differences in some of these GenotypeGVCFs outputs. It looks like your code should only affect spanning deletions...

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 10087820 . C CAG 294.64 . AC=1;AF=0.500;AN=2;AS_BaseQRankSum=-2.100;AS_FS=2.811;AS_MQ=60.00;AS_MQRankSum=0.600;AS_QD=6.15;AS_ReadPosRankSum=-0.400;AS_SOR=1.147;BaseQRankSum=-2.273e+00;DP=93;ExcessHet=3.0103;FS=4.432;MLEAC=1;MLEAF=0.500;MQ=59.55;MQRankSum=0.678;QD=6.14;RAW_MQ=329810.00;ReadPosRankSum=-4.920e-01;SOR=1.302 GT:AD:GQ:PL 0/1:35,13:99:302,0,1079
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why the HaplotypeCaller outputs changed at SpanningDeletion sites. What about these files? This one looks like it has neither spanning deletions to genotype nor has its input file changed. Can you explain why there was so much jitter with the old outputs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I have no idea. The behavior of HaplotypeCaller must have changed slightly over time to create this jitter without a corresponding update to the integration test files. These changes of e.g. 0.03 or 0.04 to qual score didn't cause the test to fail, I think given the DEFAULT_FLOAT_TOLERANCE in BaseTest. Since I added automatic update of expected test results to the GenotypeGVCFsIntegrationTest, the jitter was passed along to the expected files in this PR. If there's a historical reason that we need to preserve these test files exactly as is, I can revert the automatic-update feature in the test and manually update only the files that have spanning deletion in them -- otherwise I think updating them to account for current jitter is the price of implementing an auto-update.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh that makes a lot of sense. In that case 👍

@cwhelan
Copy link
Member Author

cwhelan commented Oct 28, 2020

@jamesemery Is that a 👍 for the PR? In that case @droazen could you unblock this by changing your review status from "changes requested"?

@jamesemery jamesemery dismissed droazen’s stale review October 28, 2020 19:43

Since the Dragen-HaplotypeCaller PR is already in this is no longer blocked

@cwhelan cwhelan merged commit 61b992f into master Oct 29, 2020
@cwhelan cwhelan deleted the cw_newqual_spandel_fix branch October 29, 2020 13:31
mwalker174 pushed a commit that referenced this pull request Nov 3, 2020
…6859)

* fix off-by one error that was causing the qual calculation to miss probability assigned to homozygous star allele genotypes
* make sure we output all sites if forceOutput it true, even if low-qual 
* updates to handle the case where the only alleles are ref and star
* add update expected results option to GenotypeGVCFsIntegrationTest
* update integration test files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants