Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upstream deletions and CollectVariantCallingMetrics do not play nice right now. #555

Open
yfarjoun opened this issue Jun 3, 2016 · 22 comments

Comments

@yfarjoun
Copy link
Contributor

yfarjoun commented Jun 3, 2016

The current VCF spec allows for a * allele (no brackets):

"The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion."

CollectVariantCallingMetrics treats this as a third (size 1!!) allele so that in the case of

1   10347   .   TAAACCCTA   T   100 .   AC=2    GT  0/1 0/1
1   10350   .   A           C,* 100 .   AC=3    GT  1/2 0/2

both the 0/2 and 1/2 genotypes in the second line are counted towards TOTAL_MULTIALLELIC_SNPS (for the detailed metrics) Also, both of these genotype will not be counted towards the TOTAL_SNPS (as that only captures bi-alleleic SNPs). So upstream deletions are "hurting" both the monomorphic samples (as they get an inflated TOTAL_MULTIALLELIC_SNPS ) and the polymorphic samples (as they get a deflated TOTAL_SNPS count)

I propose changing this behavior so that an upstream deletion will count as the reference allele for the purpose of metrics.

I will also add a few column or two to capture the number of upstream deletions, perhaps counting the 0/2 separately from the 1/2 genotypes.

Does this sounds reasonable to folks?

@eitanbanks @tfenne ?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

Oh, also, INDELs that have a upstream deletion covering suffer from this as well.

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

@yfarjoun Good catch. I'm curious why such a genotype in VCF is represented as 0/2 instead of just 0, but that's beside the point.

I think I would treat these a little bit differently for:
0/2 / A/* - if the non-deleted allele is ref then I would just ignore the site, as the deletion is already captured in the metrics
0/1 / C/* - if the non-deleted allele is an alt ... maybe call this a hom-alt?

What do you think?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

I'm failing to see the discord...so I'll give a specific example (the one I've been using for testing):

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       146  .   AC         A     100  .  AC=1  GT   0/0  0/1 
1       180  .   T          C     100  .  AC=1  GT   0/1  0/0
1       247  .   TAAACCCTA  T     100  .  AC=1  GT   0/0  0/1
1       250  .   A          C,*   100  .  AC=1  GT   0/0  0/2  <- no effect on TOTAL_SNP
1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=1 for summary and SM1
1       447  .   TAAACCCTA  T     100  .  AC=1  GT   0/0  0/1
1       450  .   A          AC,*  100  .  AC=1  GT   0/0  0/2  <- no effect on TOTAL_INDEL
1       547  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       650  .   A          AC,*  100  .  AC=2  GT   1/2  0/0  <- TOTAL_INDEL+=1 for summary and SM1

This is what I meant by "an upstream deletion will count as the reference allele".

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

The point I was trying to make is that if you count a spanning deletion as the reference allele, then at sites where a sample has one alt and a spanning deletion, you'll call that site as a het, when I think it would be more accurate to describe it as a hom var, since there is no reference allele present. Wouldn't change the TOTAL metrics, but would change the het/hom ratio.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

but for the purposes of HET-HOM ration we probably want this to count as a het...given that the probability if it happening is P and not P^2...(ignoring the upstream deletion)

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

But from a functional point of view they behave like HOMs, since there is no reference copy present...

How common are these in large call-sets? Maybe the right answer is to simply exclude them from the het/hom ratio. With only one allele present due to the upstream deletion that seems like the actual correct answer to me.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

This came up in a recent 100 wgs sample (!) callset where out of 3.5M snps about 40K were "moved" from the TOTAL_SNPS column to the TOTAL_MULTIALLELIC_SNP column. so about 1% of TOTAL_SNPS but 300% of TOTAL_MULTIALLELIC_SNP...

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

Yikes! So what do you think of just counting these for totals, but excluding from het/hom computation? I would say the only other option that seems sensible to me would be to introduce a new category of compound-hets and count those separately, but that seems like a lot of extra work.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

  • I agree regarding the functional aspect. But I'm not sure that this is what HET_HOM_RATIO is measuring....In my mind, HET_HOM_RATIO is a funny way of approximating <1/P> for a single sample (since it's <P>/<P^2> which is not <1/P> but at least the units are right! 😄 ). The functional aspect is difficult to evaluate since these SNP will be a mixture of neutral, LoF and GoF mutations and without knowing the mixture, not much can be said about the functional "burden".
  • I was planning to introduce two new categories: UPSTREAM_DEL_MONO (for the 0/2) and UPSTREAM_DEL_POLY (for the 0/1) so that we can see how prevalent this is in the data. But I could also add COMPOUND_HETS while I'm at it, (which will include the 1/2 but not the 0/2)

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

Het:Hom ratio is a commonly used way of measuring that is quite useful. You're right that in a neutral case you might expect it to be p/p^2, but deviation from this can highlight interesting phenomenon (biologically higher rate of heterozygosity, lots of artifactual SNP calls, etc.).

I'm not sure how I feel about the UPSTREAM* metrics. To my way of thinking UPSTREAM_DEL_MONO would only exist because some other sample had an allele at a site, causing it to be listed in the VCF - i.e. membership in this set is almost entirely dictated by the other samples that are present in the same VCF. Given that I would find it hard to reason about or make predictions about, so I'm not sure how useful it is.

UPSTREAM_DEL_POLY at least would be consistent for a sample regardless of the other samples in the VCF. That said, I think this is still more of an artifact of how the data is represented in VCF. What we're really saying is that a sample has two alleles over a small region. You could just as easily represent your example as:

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T,TAACCCCTA     100  .  AC=1  GT   1/2  0/0

instead of

#CHROM  POS  ID  REF        ALT   QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=1 for summary and SM1

Would you agree?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

  • I agree about UPSTREAM_DEL_MONO, no need for this category.
  • regarding UPSTREAM_DEL_POLY, if we just add a COMPOUND_HET category, the upstream deletion will still get counted toward TOTAL_INDELS and so will be double counted depending on representation. We can either make the code only put a 1/2 variant into COMPOUND_HET even if the 2 comes from an upstream deletion, but that would be difficult to code, or we could account for the UPSTREAM_DEL_POLY and then you will know the amount of the COMPOUND_HETS that are double counted...

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

Ugh, you and your valid points! Of course you're right, and it's even more complicated - what if there's a deletion large enough than it spans more than one even downstream? Without implementing some complicated look-ahead or look-behind there's no way to account for that correctly; heck I'm not even sure what the right answer is!

So I think we're back to either:

  1. Just classifying alts in spanning deletions as either hets or homs
  2. Counting them separately as compound hets, and accepting that things won't add up exactly

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

I like your approach of the unified representation. Indeed, we should be able to recreate the metrics that would arise from that.

Analyzing a compact example we get:

#CHROM  POS  ID  REF        ALT          QUAL FT INFO  FMT  SM1  SM2 
1       347  .   TAAACCCTA  T,TAAcCaCTA  100  .  AC=1  GT   1/2  0/0  <- TOTAL_INDEL++, TOTAL_SNP+=0, COMPOUND_HETS++, numHets++, numHoms+=0

Since this is a complex variant, TOTAL_SNP is not increased here..I"m not sure if that's the right thing to do, but at least it is consistent!

Writing this in expanded form we get:

1       347  .   TAAACCCTA  T     100  .  AC=1  GT   0/1  0/0  <- TOTAL_INDEL++, numHets++
1       350  .   A          C,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=0, UPSTREAM_DEL_POLY++, and COMPOUND_HETS++
1       352  .   C          A,*   100  .  AC=2  GT   1/2  0/0  <- TOTAL_SNP+=0, UPSTREAM_DEL_POLY++, and COMPOUND_HETS++

which gives us 2 more UPSTREAM_DEL_POLY and 1 more COMPOUND_HETS...

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 3, 2016

My current opinion:

  • count the original deletion as a HET, but not the spanning deletions records
  • increment COMPOUND_HETS and UPSTREAM_DEL_POLY for each of the 1/2 genotypes in the spanning deletion variants
  • not count the 1/2 spanning deletions towards TOTAL_SNPS

@tfenne
Copy link
Collaborator

tfenne commented Jun 3, 2016

I think that's as good as we're going to get @yfarjoun.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 4, 2016

The new fields only seem relevant in the detail metrics...do you agree?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 4, 2016

What about filtering...should I keep TOTAL_UPSTREAM_DEL_POLY & FILTERED_UPSTREAM_DEL_POLY? (same for COMPOUND_HETS)?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 4, 2016

As I started implementing this I've another dilemma: What is the type of a Ref/* or Alt/* variant?

currently we have

        NO_VARIATION,
        SNP,
        MNP,    // a multi-nucleotide polymorphism
        INDEL,
        SYMBOLIC,
        MIXED,

is it symbolic? is it a new kind?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jun 4, 2016

note that the * allele itself is not symbolic...it has it's own kind: Allele.SPAN_DEL....

@vdauwera
Copy link
Contributor

Is this still a thing or has the relevant work been done/closed?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Jan 19, 2017 via email

@yfarjoun
Copy link
Contributor Author

This needs to be put on hold until I modify VariantContext in HtsJdk...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants