Proposal for changes to Allele, VariantContext, and GenotypeContext for upstream deletion compatibility #806
Comments
@yfarjoun Thanks for tackling this! I'm in agreement with most of what you propose but would like to nit-pick a couple of points if I may! The latest released version of the spec I could find (4.3) has slightly different language than you quote: "The ‘*’ allele is reserved to indicate that the allele is missing due to an overlapping deletion". I think that terminology is better because "upstream" in this terminology really just means "upstream in the file", because the deletion in question spans the current context. So I'd suggest calling the new allele type something else, suggestions include: By the same logic I would argue that your case of This raises further questions like if you have a SNP variant with four genotyped samples, but one of the samples has
SPANNING_DELETION would be consistent with the language we've been using in GATK already, I think. I agree with Tim that UPSTREAM is strange.
I have one concern about this: if "the ‘*’ allele is reserved to indicate that the allele is missing due to an overlapping deletion", that does not mean that this overlapping deletion was called or is even present in the file. What I would like to point out is that I'm in favor of naming it as an From my point of view, in In the case of the In conclusion, I support @lbergelson's idea of treating it just as symbolic and leaving the handling of it to the implementation of each tool/framework. There are too many things that could be confusing with the spanning deletion allele.
@magicDGS's points reminded me of something I've been thinking about, but which might be beyond the scope of this PR. Specifically, that with larger callsets and/or more complicated genomic regions the idea of a
The problem with treating star as symbolic is that in callsets like ExAC almost every variant will have a spanning deletion, so all the variant types will become SYMBOLIC... so that's not good either. Alternatively, we could add a boolean to the VC Thus we will have: Allele VariantContext GenotypeContext
I see your point with Regarding the Anyway, I have several unrelated concerns about the design of genotypes in HTSJDK, so maybe my opinion is not that important in this case. I would prefer a
I'm delighted to see this being discussed -- but I would point out that it will be crucial to make whatever nomenclature comes out of this agnostic of ploidy. The amount of data being produced for non-diploid organisms is growing all the time (from bacterial genomes to agricultural staples; lots of cereals are polyploid, iirc), so we can't ignore it and should plan for it; otherwise we'll just have to revise the system again, which is disruptive.
Apart from the cases that @vdauwera mentioned, it will also be important for pooled samples such as cancer, virus, or Pool-Seq datasets. Actually, I was planning to discuss the diploid design of HTSJDK in a different issue, because this one is focused only on the spanning deletion.
The current VCF spec allows for a * allele (no brackets):
"The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion."
Currently, Allele, VariantContext, and GenotypeContext give no special consideration to the star allele, and therefore consider the following to be true:
Allele
"*" is a SNP and in particular not SYMBOLIC
VariantContext
Ref = A Alt = C,* is of type SNP
Ref = A Alt = AC,* is of type MIXED
GenotypeContext
A*/* (meaning A as reference and star allele) is type HET
A/* (meaning A is alternate) is also type HET
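To make the current behaviour concrete, here is a minimal, self-contained sketch. It is a simplified model and not htsjdk's actual code: the class `CurrentStarAlleleSketch`, its enum, and both methods are hypothetical names, and the only rule modelled is "an alt of the same length as the ref is a SNP, anything else is an INDEL", assuming a single-base reference and at least one alt allele.

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical model of the behaviour described above; not htsjdk code.
class CurrentStarAlleleSketch {
    enum Type { SNP, INDEL, MIXED }

    // No special case for "*": it has the same length as a single-base
    // reference, so it is classified as a SNP.
    static Type alleleType(String ref, String alt) {
        return alt.length() == ref.length() ? Type.SNP : Type.INDEL;
    }

    // A site whose alt alleles disagree on type is reported as MIXED.
    static Type variantType(String ref, List<String> alts) {
        EnumSet<Type> seen = EnumSet.noneOf(Type.class);
        for (String alt : alts) {
            seen.add(alleleType(ref, alt));
        }
        return seen.size() == 1 ? seen.iterator().next() : Type.MIXED;
    }

    public static void main(String[] args) {
        System.out.println(variantType("A", List.of("C", "*")));  // SNP
        System.out.println(variantType("A", List.of("AC", "*"))); // MIXED
    }
}
```

Under this model, any insertion or deletion site that also carries a star allele comes out as MIXED, which is the effect described for large cohorts.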
This means that for large cohorts, where the star allele is becoming a fact of life, all variants will be of type MIXED with HET genotypes. This is uninformative and unacceptable (IMHO).
In particular, this is causing downstream problems, as we can see in the following picard issue:
broadinstitute/picard#555
I propose the following modification:
Allele
"*" is a new allele type: UPSTREAM_DELETION, not SYMBOLIC, not a SNP
VariantContext
Ref = A Alt = C,* is of type SNP
Ref = A Alt = AC,* is of type INDEL
Ref = A Alt = * is a new type UPSTREAM_DELETION
GenotypeContext
A*/* (meaning A as reference and star allele) is type HOM_REF (since the deletion was already "counted" upstream).
A/* (meaning A is alternate) remains of type HET due to the non-ref A allele.
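The proposed rules could be sketched roughly as follows. This is an illustration under stated assumptions rather than an htsjdk implementation: `StarAlleleProposalSketch` and everything in it are hypothetical names, the reference is assumed to be a single base, and each site or genotype is assumed to have at least one allele. The genotype rule treats `*` like the reference allele when deciding zygosity, which reproduces the two genotype cases listed above; any other case (for example `*/*`) is an extrapolation on my part. The check counts distinct alleles, so it does not depend on ploidy.

```java
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposal; not htsjdk code.
class StarAlleleProposalSketch {
    enum VariantType { SNP, INDEL, MIXED, UPSTREAM_DELETION }
    enum GenotypeType { HOM_REF, HET, HOM_VAR }

    static final String STAR = "*";

    // The star allele is skipped when typing a site, so it only determines
    // the type when it is the sole alt allele.
    static VariantType variantType(String ref, List<String> alts) {
        EnumSet<VariantType> seen = EnumSet.noneOf(VariantType.class);
        for (String alt : alts) {
            if (alt.equals(STAR)) {
                continue;
            }
            seen.add(alt.length() == ref.length() ? VariantType.SNP : VariantType.INDEL);
        }
        if (seen.isEmpty()) {
            return VariantType.UPSTREAM_DELETION; // every alt was "*"
        }
        return seen.size() == 1 ? seen.iterator().next() : VariantType.MIXED;
    }

    // Assumption: "*" is folded into the reference for zygosity purposes,
    // because the deletion was already counted at the upstream record.
    static GenotypeType genotypeType(String ref, List<String> calledAlleles) {
        Set<String> distinct = new HashSet<>();
        for (String allele : calledAlleles) {
            distinct.add(allele.equals(STAR) ? ref : allele);
        }
        if (distinct.size() > 1) {
            return GenotypeType.HET;
        }
        return distinct.contains(ref) ? GenotypeType.HOM_REF : GenotypeType.HOM_VAR;
    }

    public static void main(String[] args) {
        System.out.println(variantType("A", List.of("C", "*")));   // SNP
        System.out.println(variantType("A", List.of("AC", "*")));  // INDEL
        System.out.println(variantType("A", List.of("*")));        // UPSTREAM_DELETION
        System.out.println(genotypeType("A", List.of("A", "*")));  // HOM_REF
        System.out.println(genotypeType("G", List.of("A", "*")));  // HET
    }
}
```

In the last call the reference is `G`, so `A` is the alternate allele and the genotype stays HET, matching the `A/*` case in the proposal.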
I'm happy to implement this, but I wanted to get community comments....