-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide genotypes (GT) for somatic variant calls #16
Comments
Thanks for your feedback Brad, I agree this is a good opportunity to make our output easier to interpret. We're currently using (almost) the same somatic calling output format from the original Strelka release in 2012. At that time, the objections to filling in the
I think the practical benefit of simpler interpretation for most somatic calls and moving closer to a community standard should outweigh these reservations. Let me start the conversation here and then see if we can move to a broader discussion. |
Chris -- brilliant, thank you for considering this. I'm agreed on the downsides, VCF representation here is imperfect. However, I'm also agreed on the upside that is helps to have a consistent way of representing calls across different methods. We use separate germline calling for identifying pre-existing mutations from the normal, and strictly use the normal in the pair for somatic detection purposes so shouldn't have any issues with your second point. Thanks again for the discussion and thoughts. |
I had an initial discussion with Sangtae and others on this yesterday. We have some preliminary proposals but I also want to get more detail from you on which tools are asserting that a genotype is required (below). Our preliminary thoughts on this:
This leads us down a road of some proposals that could be very simple (Normal GT is always “./.” and Tumor GT is always “./1”), and others that may take some time to gain consensus on and/or implement. For the short term, another side of this question we should be examining is whether the downstream tools in question have a good reason to require a GT field for somatic alleles? If you could provide a short list of a few major tools that have this requirement I’d be interested in following up on this discussion. |
Chris; The main place where I caught missing genotypes being an issue was in rtg vcfeval (https://github.com/RealTimeGenomics/rtg-tools). We could work around this with For your simplified proposal, why do you prefer the half no calls instead of |
Thanks Brad for your comments. It might be best to motivate my above points with a concrete example. The image below shows a segment from a real tumor cell line (normal on top, tumor below). We have orthogonal data suggesting that the 2 base deletion in the tumor sample is likely to be a real somatic variant, and the normal is heterzygous for a 1 base deletion. As our STR models improve these types of examples are not difficult to find. In terms of VCF representation, as you point out, my aside on the simplest possible solution (using unknown/half genotypes) is not ideal. What we want in principal is an extension of the This suggests either just using the "reference allele", as understood to represent all alleles besides the somatic allele, OR working with custom tags outside of the regular Note in the former proposal, the putative somatic indel in the above IGV snapshot would be reported as REF=ATT, ALT=A, NORMAL/GT=0/0, TUMOR/GT=0/1, where (outside the letter of the VCF spec) we understand that allele "0" means the union of the reference and all deletion alleles besides the 2 base deletion here. I'd be curious to hear your feedback on such a representation? |
Chris;
but TUMOR/GT should really be a mix of 1/2 and 0/2 (and I guess that 3 and 4 bp deletion) so I see your point in making 0 represent anything non-normal. I think there is value in also capturing this in the INFO fields so we have the explicit calls even if GT has the simplified approach. Thanks again for the examples and great discussion on this. |
For edge cases where variants not present in REF/ALTs, we now provide these as hets. Temporary workaround until we have improved representation upstream in strelka (Illumina/strelka#16). Thanks to @ohofmann
I am using Strelka upstream of snpEff to determine the predicted severity of mutations in tumor. For this purpose I want to know what changed in going from the germline to the tumor: NOT what changed in going from ref to tumor. I am frustrated that every tool I try focuses on the latter. (snpEff itself does not handle cases where ref was A, germline was GG, tumor was AG correctly in non-coding regions; it refuses to admit that a mutation has happened....) So I'd like to put in a plug for not losing this information. |
Interesting example. I thought homopolymers were sequences that Illumina's sequencing technology struggled to correctly sequence. Are such results published in a journal yet? |
I also ran into this when trying to evaluate the vcfs with rtg vcfeval --threads 12 --ref-overlap --all-records --output-mode ga4gh --baseline results/variants/imgag-somatic-10perc.truth.cov-medium.vcf.gz --calls results/stratified-variants/sarek-strelka-vep-10perc/medium.vcf.gz --output results/vcfeval/sarek-strelka-vep-10perc/medium --template resources/reference/genome-sdf --squash-ploidy --sample truth,TUMOR which always lead to:
Is there any hint, how I could handle this (or even adapt the strelka output) in regards to benchmarking results of different callers / mappers? |
Chris and Sangtae;
This is a follow-up to a bcbio discussion (bcbio/bcbio-nextgen#2112 (comment)) which I thought could use it's own thread. In bcbio we're converting
NT
andSGT
into FORMAT genotypes (GT) for the tumor and normal. Many downstream tools like validation make use of these so having a consistent output from different callers is very useful.In mapping these over we've identified some edge cases where we can't cleanly do this due to 'SGT' not referencing alleles found in the VCF. Would it be possible to have strelka fill in all possible alternative alleles and place genotypes directly so we don't need to post-process?
I recognize there are some trickier edge cases with multiallelic positions and multiple low frequency reads. We've also been discussing these with MuTect2 output (broadinstitute/gatk#3564) and suggested normalizing to report multiple variants at these positions.
Thanks for considering this and starting the discussion.
The text was updated successfully, but these errors were encountered: