Dragen JgVCF merging #162
Comments
Hi, thank you for reporting these issues. For Error 1, I believe this was recently fixed on the GLnexus master branch. I have just tagged a new version, 1.1.4, which includes this patch; hopefully the executable will be available on the Releases page shortly (it is building in Travis CI now). For Error 2, we will need to look into this further. Are you specifically getting this from the sex chromosomes in male samples? Or can DRAGEN produce haploid genotypes in other contexts as well?
Hi Mike, For Error 2: yes, we observed this issue with male X-chromosome genotypes. So far we have not seen any output, because the program fails before producing any.
OK -- I'd appreciate hearing whether v1.1.4 handles the star alleles for you. (Perhaps on some female samples?) If there are remaining problems with those, they can probably be hotfixed quickly, assuming DRAGEN uses star alleles in roughly the same way as GATK, for overlapping deletions. I want to do a little more work on those to get the output representation just right, but it should at least not fail. Handling your male sex chromosomes will probably be a little project, as we haven't worked with the haploid gVCF representation before. It can be interesting to get diploid gVCF from them and use the male X het calls (or female Y calls) as QC/contamination indicators. Are you aware of any publicly available gVCF files reflective of the DRAGEN haploid output? Alternatively, would it be possible to generate a few from public data (e.g. GiaB trios, Polaris) and share them? I could then scope what would be involved in accommodating them.
@MatthiasWielscher just a ping on this -- I would be happy to chat further about whether we can help each other here!
Hi Mike @mlin! I spent a couple of days reading the code and now more or less understand where the problem arises. Solving all the difficulties with non-diploid genotypes is probably not a simple task, but what do you think about the following workaround? We can revise all the sites at which every sample is diploid (or use only diploid samples for estimation) and simply skip revising a site's genotypes if any sample has a non-diploid genotype at that site. This will revise most of the genotypes and fail gracefully for the few that cause problems. A similar thing can be done for the formatters. Right now there is no way to keep genotype-wise fields (like PL) in the format string when combining DRAGEN results, even though PL can be inferred for most sites. I'd suggest skipping the field in such a case rather than raising an error. Can you share your thoughts on this suggestion? Are there any problems I have missed? Do you have any ideas about the best way to implement it (e.g. is it better to prefilter sites before revision, or to fall back when an inconsistency is found)?
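A minimal sketch of the "prefilter" variant of this workaround, written in Python for brevity rather than in GLnexus's C++. The names `Genotype`, `is_diploid`, and `partition_sites` are hypothetical and do not correspond to anything in the GLnexus code base; genotypes are assumed to be tuples of allele indices, with `None` for a missing allele.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical representation: one genotype = allele indices, None = missing ('.').
Genotype = Sequence[Optional[int]]


def is_diploid(gt: Genotype) -> bool:
    """A genotype counts as diploid if it carries exactly two allele slots."""
    return len(gt) == 2


def partition_sites(
    sites: Dict[str, List[Genotype]]
) -> Tuple[List[str], List[str]]:
    """Split sites into those safe to revise and those to leave untouched.

    A site is revisable only if every sample's genotype is diploid;
    otherwise it is skipped rather than raising an error.
    """
    revisable: List[str] = []
    skipped: List[str] = []
    for site, genotypes in sites.items():
        if all(is_diploid(gt) for gt in genotypes):
            revisable.append(site)
        else:
            skipped.append(site)
    return revisable, skipped


example = {
    "chr1:100": [(0, 1), (1, 1)],   # all diploid
    "chrX:200": [(0, 1), (1,)],     # second sample is haploid
}
revisable, skipped = partition_sites(example)
```

Under these assumptions the haploid-containing site ends up in `skipped`, so genotype revision runs only on `revisable` and the rest of the sites pass through unchanged.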
Hi @VorontsovIE, I think I just found a public DRAGEN gVCF for a male to use as an example for this, which is the key thing I was missing before. I expect it shouldn't be too difficult to get some kind of sane output, even if it may need further improvement over time. I'll try to get it to work, although I can't commit to any particular timeline. Here's an excerpt; this is what we're talking about, right?
@mlin, thank you for your answer! Not sure this example was public :) But yes, that looks similar to the problem I have. The genotyping step is necessary for my project, so I will probably work on this issue anyway. If you can advise me on the direction in which you would prefer this to be fixed, I can try to help make it happen sooner, with less effort on your side and with a more predictable result. :)
It looks like our previous (still half-done) work on Strelka2 left a convenient
Maybe simpler: in that linked snippet where it rewrites the haploid genotype to a pseudo-diploid genotype, also extend the PL vector with one additional, missing entry as if it were diploid. Then everything downstream should work?
Hi, @mlin. I also created a transformer that converts haploid format fields with Number=G into pseudo-diploid fields. For a haploid genotype A, the value is stored at the index corresponding to the diploid genotype 0/A, and the entries for genotypes B/A are marked with missing values (this violates the BCF standard for storing vectors, so I will replace them with correct values). I should probably also apply the reverse transformation, from pseudo-diploid back to haploid, before updating the format fields.
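The index mapping described above can be sketched as follows. This is an illustration in Python, not the transformer from the draft: the function names are hypothetical, `None` stands in for a missing value, and the only thing taken from the VCF specification is the ordering of diploid genotype j/k (j <= k) at index k(k+1)/2 + j.

```python
from typing import List, Optional


def diploid_gl_index(j: int, k: int) -> int:
    """Index of unphased diploid genotype j/k in VCF Number=G ordering."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def haploid_to_pseudo_diploid(values: List[Optional[int]]) -> List[Optional[int]]:
    """Expand a haploid Number=G vector (one value per allele) to diploid length.

    The value for haploid allele A goes to the slot of genotype 0/A;
    every other slot is left missing (None).
    """
    n_alleles = len(values)
    out: List[Optional[int]] = [None] * (n_alleles * (n_alleles + 1) // 2)
    for allele, value in enumerate(values):
        out[diploid_gl_index(0, allele)] = value
    return out


def pseudo_diploid_to_haploid(values: List[Optional[int]]) -> List[Optional[int]]:
    """Reverse transformation: read back the 0/A slots of a diploid-length vector."""
    n_alleles = 0
    while n_alleles * (n_alleles + 1) // 2 < len(values):
        n_alleles += 1
    if n_alleles * (n_alleles + 1) // 2 != len(values):
        raise ValueError("length is not a valid diploid Number=G size")
    return [values[diploid_gl_index(0, a)] for a in range(n_alleles)]


biallelic = haploid_to_pseudo_diploid([0, 30])        # [0, 30, None]
triallelic = haploid_to_pseudo_diploid([0, 30, 60])   # [0, 30, None, 60, None, None]
round_trip = pseudo_diploid_to_haploid(triallelic)    # [0, 30, 60]
```

In the biallelic case this reduces to appending a single missing entry to PL, which matches the simpler suggestion earlier in the thread. A real implementation would still need to write values that are legal in BCF instead of leaving gaps in the middle of the vector.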
I also tried to convert haploid genotypes into pseudo-diploid genotypes by transforming fields. That attempt failed because at some point I have to transform the diploid calls back into haploid ones, and it is not trivial to tell where and how to do that. I suppose that it can be done somewhere in [...]. The current implementation uses missing values instead of some fixed value, which does not conform to the VCF 4.2 standard, but I believe that is simple to fix if the approach is worth finalizing. Maybe you will find this draft of the code useful, especially the transformer part.
Hi there,
We are trying to merge jointly called Edico DRAGEN gVCFs (one VCF file per family) and noticed two issues:
Error 1: "Invalid: allele is not a DNA sequence" - this seems to happen when an asterisk is part of the VCF allele notation.
Error 2: "invalid GT entry in gVCF record" - this happens for haploid genotypes.
Thank you!