What's the correct output for a SNP with a spanning MNP #5523
cc @nh13 who's also been thinking about this a bunch. |
For what it's worth, my view when the * allele was proposed/added was that the spec should say "incompatible spanning allele", instead of "spanning deletion", but I didn't prevail.
…-Bob
On 12/14/18 10:32 AM, Tim Fennell wrote:
Bug Report
We've run into a bit of a weird situation with HaplotypeCaller where what we appear to have is two haplotypes within one individual where one haplotype carries a 3bp MNP (GGC->TGT) and the other haplotype carries only the second substitution (i.e. GGC -> GGT).
Here's an IGV snapshot of the region in question:
mnp_and_snp <https://user-images.githubusercontent.com/1609210/50011462-29ffcd80-ff8a-11e8-8ae5-854016104999.png>
What we have is:
* depth around 85-88X
* About 50% carries TGT
* About 50% carries GGT
* Two annoying reads (likely contaminants) that carry GGC (ref)
The calls that get produced are:
# Full calls
chr6 42932200 . GGC TGT 1632.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-1.628;ClippingRankSum=1.78;DP=86;ExcessHet=3.0103;FS=2.004;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=19.67;ReadPosRankSum=-2.36;SOR=0.454 GT:AD:DP:F1R2:F2R1:GQ:PL 0/1:39,44:83:10,14,0:29,30,0:99:1661,0,1458
chr6 42932202 . C T 3439.77 PASS AC=2;AF=1.0;AN=2;BaseQRankSum=1.81;ClippingRankSum=1.38;DP=85;ExcessHet=3.0103;FS=0.0;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=28.5;ReadPosRankSum=-1.268;SOR=0.853 GT:AD:DP:F1R2:F2R1:GQ:PL 1/1:1,37:82:0,10,14,0:1,27,30,0:99:1807,141,0

# Easier to read w/o INFO
chr6 42932200 . GGC TGT 1632.77 PASS GT:AD:DP:F1R2:F2R1:GQ:PL 0/1:39,44:83:10,14,0:29,30,0:99:1661,0,1458
chr6 42932202 . C T 3439.77 PASS GT:AD:DP:F1R2:F2R1:GQ:PL 1/1:1,37:82:0,10,14,0:1,27,30,0:99:1807,141,0
The calls aren't exactly wrong, but I'm trying to figure out whether a |1/1| call at the second position is the right thing or not. Especially since it's only counting half the depth. I'm curious if this has been discussed with respect to MNP calling? I'm somewhat leaning towards thinking that the better genotype for the second locus would be |*/1| or |1/*| to indicate that there's a non-ref allele /and/ an allele that's called by an upstream spanning variant, but it's not quite the same as a spanning deletion and the spec says |*| is really only for spanning deletions.
Also, it looks to me like #5513 might be a similar problem, though I think it's /probably/ easier to reason about with HC and a germline sample.
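To make the overlap concrete, here is a minimal Python sketch (not GATK code) that parses the two INFO-less records from the report and checks whether the SNP falls inside the reference span of the upstream MNP. The `Call` class and the `parse_call` and `spanned_by` helpers are hypothetical names invented for this illustration. It also compares the sum of AD with DP, which is the "half the depth" observation for the 1/1 call.

```python
from dataclasses import dataclass
from typing import List

# The two records from the report, with INFO elided.
MNP_LINE = ("chr6 42932200 . GGC TGT 1632.77 PASS GT:AD:DP:F1R2:F2R1:GQ:PL "
            "0/1:39,44:83:10,14,0:29,30,0:99:1661,0,1458")
SNP_LINE = ("chr6 42932202 . C T 3439.77 PASS GT:AD:DP:F1R2:F2R1:GQ:PL "
            "1/1:1,37:82:0,10,14,0:1,27,30,0:99:1807,141,0")


@dataclass
class Call:
    chrom: str
    pos: int        # 1-based position of the first REF base
    ref: str
    alt: str
    gt: str
    ad: List[int]   # per-allele read depths
    dp: int

    @property
    def end(self) -> int:
        """Last reference position covered by REF (1-based, inclusive)."""
        return self.pos + len(self.ref) - 1


def parse_call(line: str) -> Call:
    """Parse 'CHROM POS ID REF ALT QUAL FILTER FORMAT SAMPLE' (no INFO column)."""
    f = line.split()
    sample = dict(zip(f[7].split(":"), f[8].split(":")))
    ad = [int(x) for x in sample["AD"].split(",")]
    return Call(f[0], int(f[1]), f[3], f[4], sample["GT"], ad, int(sample["DP"]))


def spanned_by(call: Call, upstream: Call) -> bool:
    """True if `call` starts inside the REF span of an earlier multi-base record."""
    return (call.chrom == upstream.chrom
            and len(upstream.ref) > 1
            and upstream.pos < call.pos <= upstream.end)


mnp = parse_call(MNP_LINE)
snp = parse_call(SNP_LINE)
print("SNP spanned by MNP:", spanned_by(snp, mnp))
print("SNP AD sum / DP:", sum(snp.ad), "/", snp.dp)
print("MNP AD sum / DP:", sum(mnp.ad), "/", mnp.dp)
```

Note that VCF genotypes are allele indices, so a `1/*` call as written above would in practice mean listing `*` as a second ALT allele and emitting the matching index pair.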
|
I'm actually a little surprised that it's not emitting a 1/* at the second location (despite that perhaps not being compatible with the spec; I agree with @bhandsaker 's view of what |
Thanks for the input @bhandsaker and @cwhelan. The calls were made with version |
The spanning deletion changes were in |
Why not just output one record? Based on the haplotype assembly BAM I would have expected just this call:
|
That representation would make the most sense, but the reason we don't do it right now is that the haplotypes are treated separately when we enumerate the variants, so the single SNP haplotype doesn't know about the MNP that contains the same SNP and the single SNP representation is carried through as if it's an independent event, which it isn't really. This goes back to David's dream in #1700 of actually calling haplotypes. I don't think it's a super quick fix, but it should be doable in the medium term. |
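The difference between enumerating events per haplotype and describing both haplotypes jointly can be sketched in a few lines of Python. This is a toy model, not HaplotypeCaller's implementation: `events_for_haplotype` and `joint_record` are hypothetical helpers, and the rule of emitting one event from the first to the last mismatch is an assumption made only for this example. The joint output is one plausible single-record representation, not necessarily the exact call proposed above.

```python
from typing import List, Tuple


def events_for_haplotype(ref: str, hap: str, start: int) -> List[Tuple[int, str, str]]:
    """Toy per-haplotype enumeration for equal-length haplotypes.

    Assumption: all mismatches on a haplotype are reported as one event
    running from the first to the last mismatching base.
    """
    mismatches = [i for i, (r, h) in enumerate(zip(ref, hap)) if r != h]
    if not mismatches:
        return []
    lo, hi = mismatches[0], mismatches[-1]
    return [(start + lo, ref[lo:hi + 1], hap[lo:hi + 1])]


def joint_record(ref: str, haps: List[str], start: int) -> Tuple[int, str, str, str]:
    """Describe all haplotypes over one shared span as a single multi-allelic site."""
    alts: List[str] = []
    for hap in haps:
        if hap != ref and hap not in alts:
            alts.append(hap)
    # Genotype indices: 0 is REF, 1..n follow the order of the ALT list.
    gt = "/".join(str(0 if hap == ref else alts.index(hap) + 1) for hap in haps)
    return (start, ref, ",".join(alts), gt)


REF = "GGC"
HAPLOTYPES = ["TGT", "GGT"]   # the two haplotypes described in the report
START = 42932200

# Each haplotype on its own: the SNP has no knowledge of the MNP that contains it.
independent = [events_for_haplotype(REF, hap, START) for hap in HAPLOTYPES]
print(independent)

# Both haplotypes over the same span: one record with two ALT alleles.
print(joint_record(REF, HAPLOTYPES, START))
```

In the independent view the C>T substitution at 42932202 appears as its own event even though the MNP haplotype carries the same base change, which is how a 1/1 genotype can arise at that position.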
@cwhelan & @ldgauthier sorry for the delay on getting back to this issue. Firstly, I just grabbed the new release (4.0.12.0) and re-ran with that to generate both gVCF and VCF. The VCF output still generates the 1/1 genotype unfortunately. What's interesting though is that the gVCF is capturing the spanning allele! So it looks like Here are the rows from the gVCF and VCF respectively (with INFO elided for compactness):
For completeness I also ran HaplotypeCaller going direct to VCF without making a gVCF first. The results are fairly similar to the VCF above, except for some AD/DP differences:
Going back to what's the right representation - I think I largely agree with @nh13 and @ldgauthier that long term it would be nice, when running with MNP support, to integrate the two haplotypes into a single variant output. But that sounds like it might be a big project and not happening any time soon? In the meantime if there's an easier fix to have the |
@cwhelan @ldgauthier what do you think of @tfenne's suggestion of propagating the |