Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dragen JgVCF merging #162

Open
MatthiasWielscher opened this issue Mar 28, 2019 · 12 comments · May be fixed by #230
Open

Dragen JgVCF merging #162

MatthiasWielscher opened this issue Mar 28, 2019 · 12 comments · May be fixed by #230

Comments

@MatthiasWielscher
Copy link

MatthiasWielscher commented Mar 28, 2019

Hi there,
We trying to merge joint Edico-Dragen called gVCFs (one VCF file per family) and noticed two issues:

Error 1: "Invalid: allele is not a DNA sequence" - that seems to happen if an asterix is part of the VCF file allele notation.
Error2: "invalid GT entry in gVCF record" happened for haploid genotypes.

Thank you!

@mlin
Copy link
Contributor

mlin commented Mar 29, 2019

Hi, thank you for reporting these issues.

For Error 1, I believe this was recently fixed on the GLnexus master branch. I have just tagged a new version 1.1.4 which includes this patch, hopefully the executable is available on the Releases page shortly. (building in Travis CI now)

For Error 2, we will need to look into this further. Are you specifically getting this from the sex chromosomes in male samples? Or is it able to produce haploid genotypes in other contexts as well?

@MatthiasWielscher
Copy link
Author

Hi Mike,
Good to know that Error 1 is fixed.

For error 2: Yes, we observed this issue with male X-chromosome genotypes. So far we did not get to see any output as the program fails before that.

@mlin
Copy link
Contributor

mlin commented Mar 29, 2019

OK -- appreciate hearing if v1.1.4 handles the star alleles for you. (Perhaps on some female samples?) If there are remaining problems with those it can probably be hotfixed quickly, assuming DRAGEN uses those in roughly the same way as GATK, for overlapping deletions. I want to do a little more work on those to get the output representation just right, but it should at least not fail.

Handling your male sex chromosomes will probably be a little project as we haven't worked with the haploid gVCF representation before. It can be interesting to get diploid gVCF from them and use the male X het calls (or female Y calls) as QC/contamination indicators.

Are you aware of any publicly available gVCF files reflective of the DRAGEN haploid output? Alternatively, would it be possible to generate a few from public data (e.g. GiaB trios, Polaris) and share? I could then scope what would be involved in accommodating them.

@mlin
Copy link
Contributor

mlin commented Apr 8, 2019

@MatthiasWielscher just a ping on this -- would be happy to chat further whether we can help each other here!

@VorontsovIE
Copy link
Contributor

VorontsovIE commented Jun 2, 2020

Hi Mike @mlin!
I was trying to combine gVCFs obtained by Dragen and stucked with the same problem of genotyping haploid regions on chrX. I was able to run genotyping using config with revise_genotypes: false and missing PL formatter as in Strelka2 config. But I'd like to use GLNexus with revise_genotypes: true so I am looking for a way to accomplish it.

I spent a pair of days reading the code and now more or less understand where the problem arises. Probably solving all the difficulties with non-diploid genotypes is not a simple task, but what would you tell about the following workaround:

We can revise all the sites for which all samples are diploid (or use only diploid samples for estimations) and just skips revisiting a site genotypes if some of samples have non-diploid genotype at these site. It will revise most of the genotypes and will fail gracefully for a few ones which cause problems.

A similar thing can be done to formatters. Right now there is no way to leave genotype-wise fields (like PL) in format string for combining Dragen result, even though for most of the sites PL can be inferred. I'd suggest not to raise an error in such case but to skip a field.
I think GLNexus will benefit from being more tolerant to violations of expected format.

Can you share your thoughts regarding my suggestion? Are there any problems which I've missed? Do you have any ideas, what's the best way to implement it (e.g. is it better to prefilter sites from revision or is it better to fallback on inconsistencies)?
Should I go ahead with developing a patch or do you prefer to implement it yourself (it such a soltion is possible at all)?

@mlin
Copy link
Contributor

mlin commented Jun 4, 2020

Hi @VorontsovIE I think I just found a public DRAGEN gVCF for a male to use as an example for this, which is the key thing I was missing before. I expect it shouldn't be too difficult to get some kind of sane output, even if it may be need further improvement over time. I'll try to get it to work although I can't commit to any particular timeline.

Here's an excerpt, this is what we're talking about right?

X       11366005        .       T       <NON_REF>       .       LowGQ   END=11366018    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  .:17,0:17:0:15:0,0:0,255:0,0
X       11366019        .       G       GA,<NON_REF>    95.15   PASS    DP=15;MQ=250.00;FractionInformativeReads=1.000  GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB  1:0,15,0:1.000,0.000:15:0,7,0:0,8,0:95:124,0,601:255,0:0,15:9.5148e+01,0.0000e+00,4.5000e+02:0.00,29.00,34.77:0,0,7,8:0,0,9,6
X       11366020        .       G       <NON_REF>       .       LowGQ   END=11366020    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  .:0,15:15:0:15:0,0:255,0:14,0
X       11366021        .       A       <NON_REF>       .       PASS    END=11366062    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:17,0:17:99:15:0,585:0,255:17,0
X       11366063        .       A       <NON_REF>       .       PASS    END=11367125    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:18,0:18:99:8:0,310:0,255:27,0
X       11367126        .       A       <NON_REF>       .       PASS    END=11367200    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:23,1:24:99:23:0,665:0,255:24,3
X       11367201        .       G       <NON_REF>       .       PASS    END=11369121    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:24,0:24:99:11:0,490:0,255:34,1
X       11369122        .       G       <NON_REF>       .       PASS    END=11369215    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:27,0:27:99:23:0,607:0,255:26,5
X       11369216        .       A       <NON_REF>       .       PASS    END=11372369    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:22,0:22:99:13:0,469:0,255:36,0
X       11372370        .       C       <NON_REF>       .       PASS    END=11372414    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:25,0:25:99:23:0,270:0,255:26,1
X       11372415        .       TTC     T,<NON_REF>     23.74   PASS    DP=22;MQ=232.76;MQRankSum=0.736;ReadPosRankSum=-0.736;FractionInformativeReads=0.273;R2_5P_bias=0.000  GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB       1:1,5,0:0.833,0.000:6:0,2,0:1,3,0:24:26,0,40:0,255:2,3:2.3743e+01,1.8382e-02,7.2607e+01:0.00,2.00,34.77:0,1,4,1:1,0,3,2
X       11372416        .       T       <NON_REF>       .       PASS    END=11372446    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:21,0:21:90:20:0,90:0,255:7,3
X       11372447        .       C       <NON_REF>       .       LowGQ   END=11372447    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  .:6,14:20:0:20:0,0:214,0:5,2
X       11372448        .       T       <NON_REF>       .       PASS    END=11372484    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:16,0:16:99:13:0,315:0,255:12,2
X       11372485        .       G       <NON_REF>       .       PASS    END=11376576    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:22,0:22:99:11:0,342:0,255:40,1
X       11376577        .       C       <NON_REF>       .       PASS    END=11376619    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:15,1:16:99:12:0,99:0,99:16,1
X       11376620        .       A       <NON_REF>       .       PASS    END=11376620    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:4,7:11:1:11:0,1:0,1:7,1
X       11376621        .       C       <NON_REF>       .       PASS    END=11376621    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:13,4:17:99:17:0,259:0,255:11,1
X       11376622        .       A       <NON_REF>       .       LowGQ   END=11376622    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  .:4,6:10:0:10:0,0:13,0:7,1
X       11376623        .       G       <NON_REF>       .       PASS    END=11376660    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:16,0:16:99:12:0,100:0,100:16,1
X       11376661        .       G       <NON_REF>       .       PASS    END=11390848    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:24,1:25:99:10:0,385:0,255:40,1
X       11390849        .       G       <NON_REF>       .       PASS    END=11390959    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:19,2:21:99:18:0,279:0,255:21,3
X       11390960        .       T       <NON_REF>       .       PASS    END=11392064    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:23,1:24:99:9:0,332:0,255:31,0
X       11392065        .       A       <NON_REF>       .       PASS    END=11392137    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:16,0:16:99:14:0,468:0,255:19,0
X       11392138        .       A       <NON_REF>       .       PASS    END=11398316    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:24,1:25:99:15:0,496:0,255:37,1
X       11398317        .       A       <NON_REF>       .       PASS    END=11398353    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:18,0:18:99:17:0,630:0,255:19,0
X       11398354        .       T       C,<NON_REF>     115.29  PASS    DP=22;MQ=249.30;FractionInformativeReads=1.000  GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB  1:0,22,0:1.000,0.000:22:0,10,0:0,12,0:115:150,0,870:255,0:0,0:1.1529e+02,0.0000e+00,4.5000e+02:0.00,34.77,34.77:0,0,7,15:0,0,9,13
X       11398355        .       A       <NON_REF>       .       PASS    END=11398391    GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT  0:17,0:17:99:15:0,504:0,255:17,0

@VorontsovIE
Copy link
Contributor

@mlin, thank you for your answer!

Not sure this example was public :) But yes that looks similar to the problem I have.
Though, some haploid variants doesn't cause problems (probably due to speed optimizations which imply skipping of revise-step), so let me know if this dataset won't cause genotyping errors. I would be able to provide a tiny (properly anonymized) extract from my data which causes problems.

Genotyping step is necessary for my project so I probably will work on this issue anyway. If you can consult me about a direction you prefer this to be fixed, I can try to help you to make it happen in earlier time and with less efforts from your side. And with a more predictable result. :)

@mlin
Copy link
Contributor

mlin commented Jun 4, 2020

It looks like our previous (still half-done) work on Strelka2 left a convenient bcf1_t_plus::was_haploid flag available to the genotyper. Possibly you could just have revise_genotypes branch on that flag; when it's true for a sample, just copy in the pseudo-diploid genotype and set the corresponding PL entries to missing for now.

@mlin
Copy link
Contributor

mlin commented Jun 4, 2020

Maybe simpler: in that linked snippet where it rewrites the haploid genotype to a pseudo-diploid genotype, also extend the PL vector with one additional, missing entry as if it were diploid. Then everything downstream should work?

@VorontsovIE
Copy link
Contributor

VorontsovIE commented Jul 21, 2020

Hi, @mlin.
Just to let you know. I'm actively working on this issue. The simplest solution is to add
if (vr.was_haploid) return Status::OK();
at the beginning of revise_genotypes. It works when there are no Number=G[enotypes] fields in a unifier config.

Also I created a transformer to convert format Number=G haploid fields with genotype A into pseudo-diploid fields. Values at indices corresponding to genotype 0/A are filled, values for genotypes B/A just marked with missing values (but this violates bcf standard for storing vectors so I will replace it with correct values). Probably I should also make reverse transformation pseudo-diploid into haploid before updating format fields.
It takes much more changes but I hope this will make haploid calls totally compatible with any unifier configs.

@VorontsovIE VorontsovIE linked a pull request Jul 29, 2020 that will close this issue
@VorontsovIE
Copy link
Contributor

VorontsovIE commented Jul 29, 2020

Hi @mlin. I decided to stick to another approach - and to skip steps which can fail when genotyping haploid calls. Please, check whether you find my pull request #230 is valid.

@VorontsovIE
Copy link
Contributor

VorontsovIE commented Jul 29, 2020

I also tried to convert haploid genotypes into pseudo-diploid genotypes by transforming fields. That attempt failed because at some point I should transform diploid calls back into haploids and it is not trivial to tell where and how to do it. I suppose that it can be done somewhere in combine_format_data but in that case FormatHelpers will need an additional attribute to mark samples which should be transformed from pseudo-diploid back into haploid ones.

Current implementation uses missing values instead of some fixed value, it doesn't conform vcf4.2 standard, but I believe it's simple to fix if that approach worth finalizing it. May be you will find this draft of code useful, especially transformer part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants