Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenotypeGVCFs not outputting some variants it should with -stand-call-conf #5793

Open
2 tasks
tfenne opened this issue Mar 13, 2019 · 11 comments
Open
2 tasks
Assignees

Comments

@tfenne
Copy link
Contributor

tfenne commented Mar 13, 2019

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs 4.0.0.12

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of [date of test?]

Description

I've run into a weird case where GenotypeGVCFs is doing something unexpected. I have a gVCF with the following entry in it:

chr11   6637739 .       ATTTTT  A,AT,ATT,ATTT,ATTTT,ATTTTTT,<NON_REF>   565.73  .       BaseQRankSum=-0.014;ClippingRankSum=0.508;DP=94;ExcessHet=3.0103;MLEAC=0,0,0,1,0,0,0;MLEAF=0,0,0,0.5,0,0,0;MQRankSum=0;RAW_MQandDP=338400,94;REF_BASES=GCCGGCCTGGATTTTTTTTTT;ReadPosRankSum=-0.812      GT:AD:DP:F1R2:F2R1:GQ:PL:SB     0/4:9,3,3,11,15,8,3,0:52:8,2,2,8,12,6,3,0:1,1,1,3,3,2,0,0:56:603,504,1526,335,1171,1118,56,661,640,608,0,362,313,183,335,336,500,389,187,171,527,655,864,622,277,169,466,1026,597,1101,953,645,465,625,861,1133:8,1,33,10

It's a messy site for sure, an indel in a long homopolymer-T, but I think that's a separate issue. If I run the following on that gVCF:

gatk GenotypeGVCFs \
  -R hg19.fa -V test.g.vcf -O test.vcf \
  -A ClippingRankSumTest -A Coverage -A ExcessHet -A FisherStrand \
  -A MappingQualityRankSumTest -A OxoGReadCounts -A QualByDepth -A ReadPosRankSumTest \
  -A ReferenceBases -A RMSMappingQuality -A StrandOddsRatio -A TandemRepeat \
  -L chr11:6637730-6637750 \
  -stand-call-conf 18.0 \

then I get the following output to the VCF just like I'd expect:

chr11   6637739 .       ATT     A       565.73  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-1.400e-02;ClippingRankSum=0.508;DP=94;ExcessHet=3.0103;FS=1.779;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=23.57;REF_BASES=GCCGGCCTGGATTTTTTTTTT;RPA=15,13;RU=T;ReadPosRankSum=-8.120e-01;SOR=0.386;STR    GT:AD:DP:F1R2:F2R1:GQ:PL        0/1:9,15:52:8,2,2,8,12,6,3,0:1,1,1,3,3,2,0,0:99:603,0,335

QUAL is unchanged since I'm genotyping a single-sample gVCF. However, if I raise my -stand-call-conf threshold to 19.0, GenotypeGVCFs no longer outputs any variants. 565.73 >> 19.0, so I'm confused as to why that variant is no longer emitted.

Steps to reproduce

Run GenotypeGVCFs on the above example with -stand-call-conf values ranging up to ~550.

Expected behavior

The variant should be emitted into the VCF.

Actual behavior

The variant is not emitted if -stand-call-conf is set to 19 or higher.

@droazen
Copy link
Contributor

droazen commented Mar 13, 2019

@lbergelson and/or @ldgauthier, your thoughts? Louis thought there was something fishy in the handling of -stand-call-conf when he ported GenotypeGVCFs to GATK4 a few years ago...

@lbergelson
Copy link
Member

lbergelson commented Mar 13, 2019

This sounds really familiar... I feel like this was something I noticed a long time ago and it was explained away by someone.

-stand-call-conf is used to filter variants at several points in the code, and not always compared against the final GQ.

In particular I think there's this site here

final boolean isPlausible = afCalculationResult.isPolymorphicPhredScaledQual(allele, configuration.genotypeArgs.STANDARD_CONFIDENCE_FOR_CALLING);
which decides if a variant isPlausible. If I remember correctly, this can filter sites that would end up with higher than stand-call-conf qual if they continued through.

This was years ago though, so my memory is hazy. I remember thinking something seemed really wrong off about the parameter though.

@ldgauthier
Copy link
Contributor

ldgauthier commented Mar 14, 2019 via email

@nh13
Copy link
Contributor

nh13 commented Mar 21, 2019

@ldgauthier it looks like using --use-new-qual-calculator did the trick:

chr11	6637739	.	ATT	A	595.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-1.400e-02;ClippingRankSum=0.508;DP=94;ExcessHet=3.0103;FS=1.779;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=24.82;REF_BASES=GCCGGCCTGGATTTTTTTTTT;RPA=15,13;RU=T;ReadPosRankSum=-8.120e-01;SOR=0.386;STR	GT:AD:DP:F1R2:F2R1:GQ:PL	0/1:9,15:52:8,2,2,8,12,6,3,0:1,1,1,3,3,2,0,0:99:603,0,335

It looks like this is fixed in #5484 (and so 4.1.0.0).

@tfenne are you ok closing this?

@tfenne
Copy link
Contributor Author

tfenne commented Mar 21, 2019

Closing as resolved with the new QUAL calculator.

@tfenne tfenne closed this as completed Mar 21, 2019
@tfenne tfenne reopened this Apr 2, 2019
@tfenne
Copy link
Contributor Author

tfenne commented Apr 2, 2019

I'm reopening this as I'm still having this same problem when using the latest release (4.1.1.0). The variant shown above is not emitted by GenotypeGVCFs when using -stand-call-conf 100 but is when using a lower threshold, e.g. -stand-call-conf 50. It would be really nice to be able to use -stand-call-conf and have that just filter out records that would result in records in the output VCF with QUAL < threshold.

@davidbenjamin
Copy link
Contributor

It's kind of tricky because suppose eg that we have three alt alleles each with an allele qual of 19, so that the overall variant qual is roughly 3x19 = 57. If we filter alleles with a confidence of 20, we get no alleles and the variant qual changes to 0.

Now, if instead of filtering by allele we only filter by overall variant qual we then have to keep an arbitrary number of sketchy alleles. I mean, what if we have 30 alleles each with a qual of 1? The current behavior seems preferable to me because the usual question users would ask downstream is whether some allele is real, not whether some site exhibits variation. As long as we define -stand-call-conf to pertain to alleles everything is consistent.

@tfenne
Copy link
Contributor Author

tfenne commented Apr 4, 2019

@davidbenjamin I'm not sure I follow your logic. But if you believe the current implementing is doing the expected thing I'd like to understand.

In the example above, the site is multi-allelic in the gVCF. However, when run through GenotypeGVCFs it's reduced to being bi-allelic, and the QUAL of the bi-allelic site in the genotyped GVCF doesn't change - it's still 595.64.

To rephrase the issue - I find that if I run GenotypeGVCFs -stand-call-conf 0.0 on that variant, it emits a variant with QUAL 595.64. But if I run GenotypeGVCFs -stand-call-conf 100 that variant doesn't get emitted. I think that's wrong, or at the very least misleading.

@ldgauthier
Copy link
Contributor

@tfenne would it make more sense to separate out the confidences into per-allele and per-site thresholds (where a multi-allelic site's QUAL accumulates evidence from all alleles)?

The bi-allelic QUAL seems wrong if it's the same as when all the alleles are present. The genotype you give above is for the only sample in the VCF?

@tfenne
Copy link
Contributor Author

tfenne commented Apr 9, 2019

@ldgauthier I think it's probably best to try and get to the bottom of why this variant's qual isn't being adjusted as it's reduced from 7 alleles in the gVCF to 2 alleles in the called VCF. This is, as you guessed, all single-sample. I've attached a reduced gVCF that just includes the variant in question, and the resulting genotyped VCF from running the command line below (and their indices)
here.

gatk GenotypeGVCFs \
  -R /Work/refseq/hg19/hg19.fasta \
  -V HG02568.g.vcf \
  -O HG02568.vcf \
  -L chr11:6637700-6637800 \
  -stand-call-conf 30

A few observations from running the above command but varying the -stand-call-conf at that locus, all performed with GATK 4.1.1.0:

  • Running with -stand-call-conf 30 results in a reduction from 7->2 alleles but no change at all in QUAL
  • Running with -stand-call-conf 0 results in the same genotype and QUAL, but another allele squeaks through even though it's not referenced in the genotype
  • Running with -stand-call-conf 100 results in no variants being emitted

Circling back to one of my original statements, I believe the least confusing way for this to work would be to think of it this way:

  • If you run with -stand-call-conf 0 you should see all variants
  • If you run with -stand-call-conf n you should only lose variants that were previously emitted with -stand-call-conf 0 that had QUAL < n

That said, it sounds like maybe the problem is less with the filtering on QUAL and more to do with the calculation of the final QUAL that ends up in the VCF?

@ldgauthier
Copy link
Contributor

So we definitely don't update the QUAL if we drop alternate alleles:

final double probOfAtLeastOneAltAllele = Math.pow(10, AFresult.getLog10PosteriorOfAFGT0());

Note that the QUAL is based off of the AFResult that had alleles removed if they exceeded the output limit, but not if they had less evidence than the calling confidence threshold.

@davidbenjamin I really hate to run the AF calculator again if we drop low quality alleles. Or maybe the new qual isn't as bad as I think? Would it be a decent approximation to add up the per-allele quals for the remaining alleles?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants