Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

yfarjoun · 2016-06-03T03:20:45Z

The current VCF spec allows for a * allele (no brackets):

"The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion."

CollectVariantCallingMetrics treats this as a third (size 1!!) allele so that in the case of

1   10347   .   TAAACCCTA   T   100 .   AC=2    GT  0/1 0/1
1   10350   .   A           C,* 100 .   AC=3    GT  1/2 0/2

both the 0/2 and 1/2 genotypes in the second line are counted towards TOTAL_MULTIALLELIC_SNPS (for the detailed metrics) Also, both of these genotype will not be counted towards the TOTAL_SNPS (as that only captures bi-alleleic SNPs). So upstream deletions are "hurting" both the monomorphic samples (as they get an inflated TOTAL_MULTIALLELIC_SNPS ) and the polymorphic samples (as they get a deflated TOTAL_SNPS count)

I propose changing this behavior so that an upstream deletion will count as the reference allele for the purpose of metrics.

I will also add a few column or two to capture the number of upstream deletions, perhaps counting the 0/2 separately from the 1/2 genotypes.

Does this sounds reasonable to folks?

@eitanbanks @tfenne ?

The text was updated successfully, but these errors were encountered:

yfarjoun · 2016-06-03T03:48:04Z

Oh, also, INDELs that have a upstream deletion covering suffer from this as well.

tfenne · 2016-06-03T09:47:30Z

@yfarjoun Good catch. I'm curious why such a genotype in VCF is represented as 0/2 instead of just 0, but that's beside the point.

I think I would treat these a little bit differently for:
0/2 / A/* - if the non-deleted allele is ref then I would just ignore the site, as the deletion is already captured in the metrics
0/1 / C/* - if the non-deleted allele is an alt ... maybe call this a hom-alt?

What do you think?

yfarjoun · 2016-06-03T10:35:11Z

I'm failing to see the discord...so I'll give a specific example (the one I've been using for testing):

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       146  .   AC         A     100  .  AC=1  GT   0/0  0/1 
1       180  .   T          C     100  .  AC=1  GT   0/1  0/0
1       247  .   TAAACCCTA  T     100  .  AC=1  GT   0/0  0/1
1       250  .   A          C,*   100  .  AC=1  GT   0/0  0/2  <- no effect on TOTAL_SNP
1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=1 for summary and SM1
1       447  .   TAAACCCTA  T     100  .  AC=1  GT   0/0  0/1
1       450  .   A          AC,*  100  .  AC=1  GT   0/0  0/2  <- no effect on TOTAL_INDEL
1       547  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       650  .   A          AC,*  100  .  AC=2  GT   1/2  0/0  <- TOTAL_INDEL+=1 for summary and SM1

This is what I meant by "an upstream deletion will count as the reference allele".

tfenne · 2016-06-03T10:39:06Z

The point I was trying to make is that if you count a spanning deletion as the reference allele, then at sites where a sample has one alt and a spanning deletion, you'll call that site as a het, when I think it would be more accurate to describe it as a hom var, since there is no reference allele present. Wouldn't change the TOTAL metrics, but would change the het/hom ratio.

yfarjoun · 2016-06-03T10:41:40Z

but for the purposes of HET-HOM ration we probably want this to count as a het...given that the probability if it happening is P and not P^2...(ignoring the upstream deletion)

tfenne · 2016-06-03T10:44:31Z

But from a functional point of view they behave like HOMs, since there is no reference copy present...

How common are these in large call-sets? Maybe the right answer is to simply exclude them from the het/hom ratio. With only one allele present due to the upstream deletion that seems like the actual correct answer to me.

yfarjoun · 2016-06-03T10:51:57Z

This came up in a recent 100 wgs sample (!) callset where out of 3.5M snps about 40K were "moved" from the TOTAL_SNPS column to the TOTAL_MULTIALLELIC_SNP column. so about 1% of TOTAL_SNPS but 300% of TOTAL_MULTIALLELIC_SNP...

tfenne · 2016-06-03T10:54:21Z

Yikes! So what do you think of just counting these for totals, but excluding from het/hom computation? I would say the only other option that seems sensible to me would be to introduce a new category of compound-hets and count those separately, but that seems like a lot of extra work.

yfarjoun · 2016-06-03T11:04:57Z

I agree regarding the functional aspect. But I'm not sure that this is what HET_HOM_RATIO is measuring....In my mind, HET_HOM_RATIO is a funny way of approximating <1/P> for a single sample (since it's <P>/<P^2> which is not <1/P> but at least the units are right! 😄 ). The functional aspect is difficult to evaluate since these SNP will be a mixture of neutral, LoF and GoF mutations and without knowing the mixture, not much can be said about the functional "burden".
I was planning to introduce two new categories: UPSTREAM_DEL_MONO (for the 0/2) and UPSTREAM_DEL_POLY (for the 0/1) so that we can see how prevalent this is in the data. But I could also add COMPOUND_HETS while I'm at it, (which will include the 1/2 but not the 0/2)

tfenne · 2016-06-03T11:16:47Z

Het:Hom ratio is a commonly used way of measuring that is quite useful. You're right that in a neutral case you might expect it to be p/p^2, but deviation from this can highlight interesting phenomenon (biologically higher rate of heterozygosity, lots of artifactual SNP calls, etc.).

I'm not sure how I feel about the UPSTREAM* metrics. To my way of thinking UPSTREAM_DEL_MONO would only exist because some other sample had an allele at a site, causing it to be listed in the VCF - i.e. membership in this set is almost entirely dictated by the other samples that are present in the same VCF. Given that I would find it hard to reason about or make predictions about, so I'm not sure how useful it is.

UPSTREAM_DEL_POLY at least would be consistent for a sample regardless of the other samples in the VCF. That said, I think this is still more of an artifact of how the data is represented in VCF. What we're really saying is that a sample has two alleles over a small region. You could just as easily represent your example as:

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T,TAACCCCTA     100  .  AC=1  GT   1/2  0/0

instead of

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=1 for summary and SM1

Would you agree?

yfarjoun · 2016-06-03T11:23:45Z

I agree about UPSTREAM_DEL_MONO, no need for this category.
regarding UPSTREAM_DEL_POLY, if we just add a COMPOUND_HET category, the upstream deletion will still get counted toward TOTAL_INDELS and so will be double counted depending on representation. We can either make the code only put a 1/2 variant into COMPOUND_HET even if the 2 comes from an upstream deletion, but that would be difficult to code, or we could account for the UPSTREAM_DEL_POLY and then you will know the amount of the COMPOUND_HETS that are double counted...

tfenne · 2016-06-03T11:37:03Z

Ugh, you and your valid points! Of course you're right, and it's even more complicated - what if there's a deletion large enough than it spans more than one even downstream? Without implementing some complicated look-ahead or look-behind there's no way to account for that correctly; heck I'm not even sure what the right answer is!

So I think we're back to either:

Just classifying alts in spanning deletions as either hets or homs
Counting them separately as compound hets, and accepting that things won't add up exactly

yfarjoun · 2016-06-03T14:00:12Z

I like your approach of the unified representation. Indeed, we should be able to recreate the metrics that would arise from that.

Analyzing a compact example we get:

#CHROM  POS  ID  REF        ALT          QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T,TAAcCaCTA  100  .  AC=1  GT   1/2  0/0  <- TOTAL_INDEL++, TOTAL_SNP+=0, COMPOUND_HETS++, numHets++, numHoms+=0

Since this is a complex variant, TOTAL_SNP is not increased here..I"m not sure if that's the right thing to do, but at least it is consistent!

Writing this in expanded form we get:

1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0  <- TOTAL_INDEL++, numHets++
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=0, UPSTREAM_DEL_POLY++, and COMPOUND_HETS++
1       352  .   C          A,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=0, UPSTREAM_DEL_POLY++, and COMPOUND_HETS++

which gives us 2 more UPSTREAM_DEL_POLY and 1 more COMPOUND_HETS...

yfarjoun · 2016-06-03T14:09:46Z

My current opinion:

count the original deletion as a HET, but not the spanning deletions records
increment COMPOUND_HETS and UPSTREAM_DEL_POLY for each of the 1/2 genotypes in the spanning deletion variants
not count the 1/2 spanning deletions towards TOTAL_SNPS

tfenne · 2016-06-03T15:37:47Z

I think that's as good as we're going to get @yfarjoun.

yfarjoun · 2016-06-04T03:33:12Z

The new fields only seem relevant in the detail metrics...do you agree?

yfarjoun · 2016-06-04T03:36:31Z

What about filtering...should I keep TOTAL_UPSTREAM_DEL_POLY & FILTERED_UPSTREAM_DEL_POLY? (same for COMPOUND_HETS)?

yfarjoun · 2016-06-04T15:40:11Z

As I started implementing this I've another dilemma: What is the type of a Ref/* or Alt/* variant?

currently we have

        NO_VARIATION,
        SNP,
        MNP,    // a multi-nucleotide polymorphism
        INDEL,
        SYMBOLIC,
        MIXED,

is it symbolic? is it a new kind?

yfarjoun · 2016-06-04T15:41:30Z

note that the * allele itself is not symbolic...it has it's own kind: Allele.SPAN_DEL....

vdauwera · 2017-01-19T01:23:34Z

Is this still a thing or has the relevant work been done/closed?

yfarjoun · 2017-01-19T03:58:21Z

still a thing! on my "todo" list too!

…

On Wed, Jan 18, 2017 at 8:23 PM, Geraldine Van der Auwera < ***@***.***> wrote: Is this still a thing or has the relevant work been done/closed? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#555 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACnk0g_H44yzYQaMI0WOnT6KtDNzXTEFks5rTrsXgaJpZM4ItK4k> .

yfarjoun · 2017-02-14T16:25:10Z

This needs to be put on hold until I modify VariantContext in HtsJdk...

vdauwera mentioned this issue Jan 19, 2017

add percent GQ0 to CollectVariantCallingMetrics #286

Closed

vdauwera added the can of worms label Jan 19, 2017

yfarjoun mentioned this issue Feb 14, 2017

Proposal for changes to Allele VariantContext, and Genotype Context for upstream deletion compatibility samtools/htsjdk#806

Open

rickymagner mentioned this issue Oct 2, 2023

VariantContext has inconsistent behavior on getting allele type with spanning deletions samtools/htsjdk#1686

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

yfarjoun commented Jun 3, 2016 •

edited

Loading

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016 •

edited

Loading

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016 •

edited

Loading

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

vdauwera commented Jan 19, 2017

yfarjoun commented Jan 19, 2017 via email

yfarjoun commented Feb 14, 2017

Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

Comments

yfarjoun commented Jun 3, 2016 • edited Loading

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016 • edited Loading

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 3, 2016 • edited Loading

yfarjoun commented Jun 3, 2016

tfenne commented Jun 3, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

yfarjoun commented Jun 4, 2016

vdauwera commented Jan 19, 2017

yfarjoun commented Jan 19, 2017 via email

yfarjoun commented Feb 14, 2017

yfarjoun commented Jun 3, 2016 •

edited

Loading

yfarjoun commented Jun 3, 2016 •

edited

Loading

yfarjoun commented Jun 3, 2016 •

edited

Loading