Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What's the correct output for a SNP with a spanning MNP #5523

Open
tfenne opened this issue Dec 14, 2018 · 9 comments
Open

What's the correct output for a SNP with a spanning MNP #5523

tfenne opened this issue Dec 14, 2018 · 9 comments

Comments

@tfenne
Copy link
Contributor

tfenne commented Dec 14, 2018

Bug Report

We've run into a bit of a weird situation with HaplotypeCaller where what we appear to have is two hapolotypes within one individual where one haplotype carries a 3bp MNP (GGC->TGT) and the other haplotype carries only the second substitution (i.e. GGC -> GGT).

Here's an IGV snapshot of the region in question:

mnp_and_snp

What we have is:

  • depth around 85-88X
  • About 50% carries TGT
  • About 50% carries GGT
  • Two annoying reads (likely contaminants) that carry GGC (ref)

The calls that get produced are:

# Full calls  
chr6  42932200  .  GGC  TGT  1632.77  PASS  AC=1;AF=0.5;AN=2;BaseQRankSum=-1.628;ClippingRankSum=1.78;DP=86;ExcessHet=3.0103;FS=2.004;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=19.67;ReadPosRankSum=-2.36;SOR=0.454  GT:AD:DP:F1R2:F2R1:GQ:PL  0/1:39,44:83:10,14,0:29,30,0:99:1661,0,1458
chr6  42932202  .  C    T    3439.77  PASS  AC=2;AF=1.0;AN=2;BaseQRankSum=1.81;ClippingRankSum=1.38;DP=85;ExcessHet=3.0103;FS=0.0;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=28.5;ReadPosRankSum=-1.268;SOR=0.853      GT:AD:DP:F1R2:F2R1:GQ:PL  1/1:1,37:82:0,10,14,0:1,27,30,0:99:1807,141,0

# Easier to read w/o INFO
chr6  42932200  .  GGC  TGT  1632.77  PASS  GT:AD:DP:F1R2:F2R1:GQ:PL  0/1:39,44:83:10,14,0:29,30,0:99:1661,0,1458
chr6  42932202  .  C    T    3439.77  PASS  GT:AD:DP:F1R2:F2R1:GQ:PL  1/1:1,37:82:0,10,14,0:1,27,30,0:99:1807,141,0

The calls aren't exactly wrong, but I'm trying to figure out whether a 1/1 call at the second position is the right thing or not. Especially since it's only counting half the depth. I'm curious if this has been discussed with respect to MNP calling? I'm somewhat leaning towards thinking that the better genotype for the second locus would be */1 or 1/* to indicate that there's a non-ref allele and an allele that's called by an upstream spanning variant, but it's not quite the same as a spanning deletion and the spec says * is really only for spanning deletions.

Also, it looks to me like #5513 might be a similar problem, though I think it's probably easier to reason about with HC and a germline sample.

@tfenne
Copy link
Contributor Author

tfenne commented Dec 14, 2018

cc @nh13 who's also been thinking about this a bunch.

@bhandsaker
Copy link

bhandsaker commented Dec 14, 2018 via email

@cwhelan
Copy link
Member

cwhelan commented Dec 14, 2018

I'm actually a little surprised that it's not emitting a 1/* at the second location (despite that perhaps not being compatible with the spec; I agree with @bhandsaker 's view of what * should mean) given the changes I made in #4963 to support spanning deletions. Those changes were not made with MNPs in mind but reading the code I'm surprised the * allele is not injected into the variant context. What version of HaplotypeCaller are you using @tfenne?

@tfenne
Copy link
Contributor Author

tfenne commented Dec 14, 2018

Thanks for the input @bhandsaker and @cwhelan. The calls were made with version 4.0.10.0. I thought that was the latest - but it looks like we're a couple of minor versions behind. I'll retry making those calls with the latest version and see if it persists. Are the changes you're talking about in the latest release version or just on master?

@cwhelan
Copy link
Member

cwhelan commented Dec 14, 2018

The spanning deletion changes were in 4.0.9.0 so I must be mistaken about my expectation about how the code would work in this scenario. Once we figure out what the desired behavior is I can try to figure out how to make the changes if we do want to use the * allele.

@nh13
Copy link
Contributor

nh13 commented Dec 14, 2018

Why not just output one record? Based on the haplotype assembly BAM I would have expected just this call:

chr6  42932200  .  GGC  GGT,TGT ... GT:AD:DP 1/2:1,39,44:83

@ldgauthier
Copy link
Contributor

That representation would make the most sense, but the reason we don't do it right now is that the haplotypes are treated separately when we enumerate the variants, so the single SNP haplotype doesn't know about the MNP that contains the same SNP and the single SNP representation is carried through as if it's an independent event, which it isn't really. This goes back to David's dream in #1700 of actually calling haplotypes. I don't think it's a super quick fix, but it should be doable in the medium term.

@tfenne
Copy link
Contributor Author

tfenne commented Dec 24, 2018

@cwhelan & @ldgauthier sorry for the delay on getting back this issue. Firstly, I just grabbed the new release (4.0.12.0) and re-ran with that to generate both gVCF and VCF. The VCF output still generates the 1/1 genotype unfortunately. What's interesting though is that the gVCF is capturing the spanning allele! So it looks like GenotypeGVCFs is causing the problem.

Here are the rows from the gVCF and VCF respectively (with INFO elided for compactness):

# gVCF
chr6  42932200  .  GGC  TGT,<NON_REF>  1623.73  .  GT:AD:DP:GQ:PL:SB  0/1:39,44,0:83:99:1661,0,1458,1778,1591,3369:10,29,14,30
chr6  42932202  .  C    T,*,<NON_REF>  3439.77  .  GT:AD:DP:GQ:PL:SB  1/2:1,37,44,0:82:99:3468,1802,1661,1518,0,1449,3406,1802,1577,3436:0,1,24,57

# VCF
chr6  42932200  .  GGC  TGT  1632.77  .  GT:AD:DP:GQ:PL  0/1:39,44:83:99:1661,0,1458
chr6  42932202  .  C    T    3439.77  .  GT:AD:DP:GQ:PL  1/1:1,37:82:99:1807,141,0

For completeness I also ran HaplotypeCaller going direct to VCF without making a gVCF first. The results are fairly similar to the VCF above, except for some AD/DP differences:

# Direct to VCF
chr6  42932200  .  GGC  TGT  1623.73  .  GT:AD:DP:GQ:PL  0/1:39,44:83:99:1661,0,1458
chr6  42932202  .  C    T    3439.77  .  GT:AD:DP:GQ:PL  1/1:1,50:51:99:1807,141,0

Going back to what's the right representation - I think I largely agree with @nh13 and @ldgauthier that long term it would be nice, when running with MNP support, to integrate the two haplotypes into a single variant output. But that sounds like it might be a big project and not happening any time soon? In the meantime if there's an easier fix to have the */spanning allele called in the VCF propagated into the genotyped VCF that would really help.

@nh13
Copy link
Contributor

nh13 commented Mar 21, 2019

@cwhelan @ldgauthier what do you think of @tfenne's suggestion of propagating the */spanning allele into the genotyped VCF? Are the VCF lines above enough for you to reproduce?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants