-
Notifications
You must be signed in to change notification settings - Fork 729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
issues with complex haplotypes #272
Comments
Thank you Brent! We started an internal discussion about it. Will keep you posted. |
I ran make_examples with dad.sam.gz and recorded output of the realigner (upper part is original BAM and lower part show realigned reads). You are right Brent, realinged reads are mapped to support 1 del and 1 ins. But in the end the haplotype would be the same with 1 DEL and 1 INS or 3 SNPs, it is just a different representation. This behavior depends on penalty scores that we use for Smith–Waterman alignment. |
And child reads should have been aligned the same way as dad's. But we have a special algorithm that decides whether to run realignment. In the case of a child this algorithm decided against realignment. make_examples |
Thanks @akolesnikov ! @brentp if you're using
|
hi, thanks for the replies. so and I don't see much documentation on that option, is there any downside to that? (I assume it just affects run time?) |
Yes, if you use
deepvariant/scripts/run_deepvariant.py Lines 92 to 96 in 97cd861
Regarding the downside of turning off You can find the description of this feature when it's first released in v0.7: "Changed window selector to use a linear decision model for choosing realignment candidates. This can be controlled by a flag. |
@pichuan Is this option gonna increase accuracy and sensitivity in general? |
@kokyriakidis to your question, having ws_use_window_selector_model was to improve runtime, not to improve accuracy or sensitivity. But empirically we expect the trade-off on accuracy to be small. Below you can see my comparison between turning it on/off for WGS and WES on v0.9. (On PACBIO setting, this doesn't affect anything because we don't run realigner for PACBIO.) I ran some numbers based on the v0.9 WGS and WES case study data. I used this type of CPU machine for the runtime. The following results were done in two settings: On WGS case study:
On WES case study:
For the detailed hap.py output, you can see here. |
Hi @pichuan , Thank you very much for this wonderful analysis! Turning it off MAY be worth the increased runtime in cases where someone is looking for rare SNPs in a population or in a clinical environment where you prioritize accuracy over runtime. |
I re-ran DV+glnexus on a cohort of ~150 WGS trios. Using The right side is lower; good. But, then I further filtered that set of variants on your recently released thousand genomes DV VCF, requiring an allele frequency less than 1% there. Now, even though the RHS as dropped by ~20 DNs per trio, the "before" LHS, has a lower number of putative de novos and the end result is that it is worse to run with This must be because you ran the thousand genomes samples without turning off the window selector and I am annotating on exact REF and ALT allele matches, not using haplotype enumeration as does hap.py, but AFAIK, this is how all (allele frequency) annotation software will work. Any recommendations on how to proceed? I did bcftools norm on my set and on the deep variant thousand G calls and I suspect something like vcfallelelicprimitives will not help here either. |
Hi @brentp This is a very good question, but I am not sure I have a good answer for you. For you and @kokyriakidis the fact that window selector can cause inconsistency across a pedigree was not something we previously appreciated when considering the trade-offs between speed and accuracy. In the intermediate future, we hope to profile performance of make_examples to improve speed and at that time we will revisit the decision to enable this by default. This will not occur for the upcoming release, but possibly the following one. It is possible to create a BED file covering de novo sites and force window selector=false across all of them, this should be fast and it could be an exercise we would want to do on the 1KG pVCF. We are also working on some more involved methods for calling in a trio, but that is also in the intermediate timescale. |
Hi Andrew, thanks for the reply. I hadn't previously appreciated the extent of this problem before either--not just in DV, but everywhere. The "kid" and "dad" variants I posted above can be made to be the same variant-set if we know they are on the same haplotype--and running with the window selection off does this, but it still creates a problem with annotating across different call-sets when different representations are used. gnomad prefers the representation of an insertion and a deletion rather than the 3 individual SNPs. So, even if this is further resolved in DV (running with edit: feel free to close this issue as my original issue is addressed. |
I can open a separate issue if it's helpful, but just a couple more things related to this... Another thing that happens with things that DV calls de novos but obviously are not is that the kid will just meet some threshold and have a number of MQ ~40 reads with the de novo, where as the parent will have a number of reads with the allele that are MQ ~18 or lower. But the VCF reports AD[1] == 0 for many of these in the parent. If the count of low-quality alleles were reported in the sample fields in the VCF, it would be simpler to filter to make sure the allele was absent from the parent. |
…e to increase the accuracy and runtime of Illumina WGS and WES runs, but doesn't affect PacBio. This is a follow-up from #272. PiperOrigin-RevId: 298676661
Hi @brentp @kokyriakidis , We (genomics team in Google Health) have released DeepVariant 0.10 on 3/26, which includes improving consistency and accuracy by turning off
For more information, please read our release notes: https://github.com/google/deepvariant/releases/tag/v0.10.0 Thank you! |
hi, I am using DV 0.9 to find de novo variants in trios and so I am enriching for weird stuff.
I am sure your team is aware of some/all of these, but I'll document at least 1 case here for the record.
Nearly all obvious false positive de novo calls follow the same pattern. Mostly there is a haplotype from mom , and a haplotype from dad that are different. The kid inherits both the variable haplotypes, but the combination of DV and glnexus gentoype such that the variant appears as de novo.
But, here is an example where there is just an incorrect call from DV that leads to 3 neighboring spurious de novo calls in the kid (top row):
note that dad in the 3rd row has a single read with a 1-base deletion (dash) followed by an insertion (purple tick). I can't show all of the reads in this image, but I have scrolled through and verified that is the only read.
here is the content of the dad's VCF for that region (the mom's is actually very similar):
note that that is the single-base del and the insertion that occurs in only 1 read. Since the mom's is the same, maybe there's something akin to realignment going on, but
by contrast, here is the kid's (seemingly more sensible) VCF for that region:
here is the content of the gvcf for dad:
and kid:
I am attaching a small sam for kid and dad aligned to hg38
kid.sam.gz
dad.sam.gz
I have other scenarios, but this one is one that seems clearly a deep variant issue and not a problem with glnexus.
The text was updated successfully, but these errors were encountered: