-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555
Comments
Oh, also, INDELs that have a upstream deletion covering suffer from this as well. |
@yfarjoun Good catch. I'm curious why such a genotype in VCF is represented as I think I would treat these a little bit differently for: What do you think? |
I'm failing to see the discord...so I'll give a specific example (the one I've been using for testing):
This is what I meant by "an upstream deletion will count as the reference allele". |
The point I was trying to make is that if you count a spanning deletion as the reference allele, then at sites where a sample has one alt and a spanning deletion, you'll call that site as a het, when I think it would be more accurate to describe it as a hom var, since there is no reference allele present. Wouldn't change the TOTAL metrics, but would change the het/hom ratio. |
but for the purposes of HET-HOM ration we probably want this to count as a het...given that the probability if it happening is P and not P^2...(ignoring the upstream deletion) |
But from a functional point of view they behave like HOMs, since there is no reference copy present... How common are these in large call-sets? Maybe the right answer is to simply exclude them from the het/hom ratio. With only one allele present due to the upstream deletion that seems like the actual correct answer to me. |
This came up in a recent 100 wgs sample (!) callset where out of 3.5M snps about 40K were "moved" from the TOTAL_SNPS column to the TOTAL_MULTIALLELIC_SNP column. so about 1% of TOTAL_SNPS but 300% of TOTAL_MULTIALLELIC_SNP... |
Yikes! So what do you think of just counting these for totals, but excluding from het/hom computation? I would say the only other option that seems sensible to me would be to introduce a new category of compound-hets and count those separately, but that seems like a lot of extra work. |
|
Het:Hom ratio is a commonly used way of measuring that is quite useful. You're right that in a neutral case you might expect it to be I'm not sure how I feel about the
instead of
Would you agree? |
|
Ugh, you and your valid points! Of course you're right, and it's even more complicated - what if there's a deletion large enough than it spans more than one even downstream? Without implementing some complicated look-ahead or look-behind there's no way to account for that correctly; heck I'm not even sure what the right answer is! So I think we're back to either:
|
I like your approach of the unified representation. Indeed, we should be able to recreate the metrics that would arise from that. Analyzing a compact example we get:
Since this is a complex variant, TOTAL_SNP is not increased here..I"m not sure if that's the right thing to do, but at least it is consistent! Writing this in expanded form we get:
which gives us 2 more UPSTREAM_DEL_POLY and 1 more COMPOUND_HETS... |
My current opinion:
|
I think that's as good as we're going to get @yfarjoun. |
The new fields only seem relevant in the detail metrics...do you agree? |
What about filtering...should I keep TOTAL_UPSTREAM_DEL_POLY & FILTERED_UPSTREAM_DEL_POLY? (same for COMPOUND_HETS)? |
As I started implementing this I've another dilemma: What is the type of a Ref/* or Alt/* variant? currently we have
is it symbolic? is it a new kind? |
note that the * allele itself is not symbolic...it has it's own kind: Allele.SPAN_DEL.... |
Is this still a thing or has the relevant work been done/closed? |
still a thing! on my "todo" list too!
…On Wed, Jan 18, 2017 at 8:23 PM, Geraldine Van der Auwera < ***@***.***> wrote:
Is this still a thing or has the relevant work been done/closed?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#555 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACnk0g_H44yzYQaMI0WOnT6KtDNzXTEFks5rTrsXgaJpZM4ItK4k>
.
|
This needs to be put on hold until I modify VariantContext in HtsJdk... |
The current VCF spec allows for a
*
allele (no brackets):CollectVariantCallingMetrics treats this as a third (size 1!!) allele so that in the case of
both the 0/2 and 1/2 genotypes in the second line are counted towards TOTAL_MULTIALLELIC_SNPS (for the detailed metrics) Also, both of these genotype will not be counted towards the TOTAL_SNPS (as that only captures bi-alleleic SNPs). So upstream deletions are "hurting" both the monomorphic samples (as they get an inflated TOTAL_MULTIALLELIC_SNPS ) and the polymorphic samples (as they get a deflated TOTAL_SNPS count)
I propose changing this behavior so that an upstream deletion will count as the reference allele for the purpose of metrics.
I will also add a few column or two to capture the number of upstream deletions, perhaps counting the 0/2 separately from the 1/2 genotypes.
Does this sounds reasonable to folks?
@eitanbanks @tfenne ?
The text was updated successfully, but these errors were encountered: