Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions about how GQ does work #326

Closed
ghost opened this issue Jul 16, 2020 · 6 comments
Closed

Questions about how GQ does work #326

ghost opened this issue Jul 16, 2020 · 6 comments

Comments

@ghost
Copy link

ghost commented Jul 16, 2020

Hello!
so some quick background, I am interested in SNP that are unique (= private) to each sample of my cohort. I ran DeepVariant on each sample, then GLnexus as per the best practices recommendations.

I extracted unique variants using bcftools --private option. Then, I wanted to do some filtering on GQ.

Here is the GQ distribution on one of the individual vcf file (so before joint-calling), I don't have much to say about it, it makes sense
GQ

However, here is the GQ distribution of the --private SNPs
GQ_private

First, there are some values that are NA due to SNPs that have . as GQ value. That's all right, errors in sequencing / mapping I guess. In fact, my reasoning is that the --private option will enrich the SNP set in all the errors that are unique to each sequencing data set. Therefore, I was expecting that the GQ distribution would be shifted to the left. However, what we see is that, not only is it shifted to the left but the shape of the distribution is also changed.
So I would like to know more about how GQ is exactly computed by DeepVariant. And why does the GQ seems to abruptly peak at 11-12

I also read on your blog that you consider "high quality variants" as the ones with a GQ of 20. Of course, owing to the distribution of GQ for the private set, setting a GQ threshold at 20 will make a big difference, as seen on this plot
GQ_range_subs_cropped_for_github

Thanks a lot for your insight!

@ghost
Copy link
Author

ghost commented Jul 16, 2020

To give you more perspective, here is the effect of filtering, with filters >= GQ (each dot represents one of my 10 samples)
allGQ_private_crop_github

@tedyun
Copy link
Collaborator

tedyun commented Jul 16, 2020

Hi @aderzelle, thank you for your question. I believe this is coming from the cohort merging step (by GLnexus) assuming that you used --config DeepVariantWGS or --config DeepVariant for merging, not the DeepVariant itself. The merging step by GLnexus includes filtering alleles and revising genotypes and we have conducted an extensive study [1] on finding the best merging parameters for DeepVariant outputs, which was later pushed to the open-source GLnexus.

The current version of the merging parameters uses min_AQ1 = min_AQ2 = 10 (AQ means "allele quality") as you can see here https://github.com/dnanexus-rnd/GLnexus/blob/4d057dcf24b68b33de7a9759ae65ca2b144a3d47/src/cli_utils.cc#L874 The definitions of these parameters can be found at https://github.com/dnanexus-rnd/GLnexus/wiki/Configuration .

I believe this is the reason why you are seeing the sharp drop at GQ = 10 as the alleles corresponding to them were already filtered. If you'd like to merge DeepVariant gVCFs without any filtering or genotype revision, you can download the .yml file here https://gist.github.com/tedyun/1d4f57ca67fb18647b7b251f9e0b35c2 and use --config DeepVariant_nomod.yml instead when running GLnexus.

I hope this helps and please let us know if you have any more questions/comments.

[1] https://doi.org/10.1101/2020.02.10.942086

Best,
Ted

@tedyun tedyun changed the title Questions about how QG does work Questions about how GQ does work Jul 17, 2020
@ghost
Copy link
Author

ghost commented Jul 18, 2020

Thanks for the very helpful answer.
So, actually what I am seeing is the distribution for "private" seems to be centred on 10 and has been amputated of its left part by GLnexus. Interesting ...

From GLnexus definition

Thus we may have a lower quality threshold for alleles observed in multiple individuals, compared to singletons.

That means we should be more stringent on quality filtering for singletons since they are not supported by observations in more than one individual, right?

@tedyun
Copy link
Collaborator

tedyun commented Jul 19, 2020

We can be (that would imply min_AQ1 > min_AQ2), but based on our investigation the current settings were chosen to give the best overall quality for variety of cohort sizes and sequencing depths.

Depending on how the cohort VCF is used, one may want to apply additional filters to the cohort, which can either be applied after merging using a standard VCF modification tool (e.g. bcftools), or by changing the merging settings directly in the .yml file.

@tedyun
Copy link
Collaborator

tedyun commented Jul 20, 2020

FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). dnanexus-rnd/GLnexus#229 If this is accepted, you'll be able to try it out without downloading an external .yml file.

@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)

@ghost
Copy link
Author

ghost commented Jul 25, 2020

FYI I submitted a pull request to GLnexus repo for the "nomod" preset for merging (no filters or genotype revision). dnanexus-rnd/GLnexus#229 If this is accepted, you'll be able to try it out without downloading an external .yml file.

@aderzelle Please let me know if you have any questions/comments related to this issue. If not, please feel free to close it :)

Thanks for your detail explanation! I think I have all I need ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants