Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement in Variant calling #28

Closed
Rohit-Satyam opened this issue Oct 26, 2022 · 5 comments
Closed

Improvement in Variant calling #28

Rohit-Satyam opened this issue Oct 26, 2022 · 5 comments

Comments

@Rohit-Satyam
Copy link

Rohit-Satyam commented Oct 26, 2022

@priesgo
@ibn-salem
@ozlemmuslu

I am using COVIGATOR as a reference to build my own Nextflow pipeline and thanks for doing a wonderful job. Your code is clean and comprehensible. I have few queries and suggestions that I need your insight on

Suggestion

I stumbled upon this post, that recommends use of -x in bcftools mpileup flag to disable read-pair overlap detection, following an issue on bcftools and use of -h500 flag for not missing deletions of high coverage, see.

Also, maybe -m 10

EDIT 1: There is also some issue with IVAR related to spanning deletions that can be temporarily taken care of by using --ignore-overlaps. See issue1 and issue2

EDIT2: I also observe a weird trend of more variants being reported by ivar when we don't use --reference option in samtools pileup. Attaching the results obtained on same sample with the only difference of --reference Sars_cov_2.ASM985889v3.dna.toplevel.fa. Can you tell me what's happening?

ivar-results.xlsx

@Rohit-Satyam
Copy link
Author

Hi @priesgo. Do you have any thoughts about the points I mentioned above. Besides, I have one more doubt. It appears that the GATK HaplotypeCaller default behaviour is to decompose MNPs to SNPs (See Documentation where they use `--max-mnp-distance 0).

This can be achieved using bcftools norm --atomize option (see issue). However, in the normalization step of COVIGATOR, I see that this option is missing. I was wondering if we were to compare the VCF calls generated by covigator (intersect VCF calls from different tools), the presence of MNVs in some VCF and absence of MNVs in GATK might give false results

@priesgo
Copy link
Collaborator

priesgo commented Oct 31, 2022

Thanks for the kind words. Many points in a single issue, I'll try to cover them all and I will spin off particular tasks from this thread.

Starting from the end...

Normalization and MNVs

We actually "phase" the mutations from all variant callers and join them into MNVs if two conditions are met: 1) they are both clonal variants (VAF >= 80 %) and 2) overlap the same amino acid. We use some custom code for this under bin/phasing.py. I am sure there may be corner cases here though, please, let us know if you find any pathological case and we can revisit our approach.

iVar and reference

According to the samtools documentation when reference is provided then the Base Alignment Quality is performed.

BAQ (Base Alignment Quality)

BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic realignment method described in the paper “Improving SNP discovery by base alignment quality”, Heng Li, Bioinformatics, Volume 27, Issue 8 <https://doi.org/10.1093/bioinformatics/btr076>

BAQ is turned on when a reference file is supplied using the -f option. To disable it, use the -B option.

So, you may be filtering FPs when using the reference... will add this option here #29

iVar and spanning deletions

This looks reasonable although we have been focusing on LoFreq results so cannot comment. Will add though an option to the pipeline here #30

Improvements to BCFtools

Again we have not been looking so much into BCFtools results, but the best for the pipeline would to add optional BCFtools arguments for end-users to tweak variant calling. Will do this here #31

@Rohit-Satyam we accept contributions if you would want to extend the CoVigator pipeline to your needs instead of developing your own pipeline. That may help us both advance faster, but we are happy to help whatever your decision is.

@Rohit-Satyam
Copy link
Author

Rohit-Satyam commented Oct 31, 2022

Sorry, I have few more doubt

  1. How do you handle "Consecutive variants called by ivar belonging to the same codon ". I can see that Viralrecon in their release 2.3 inculcated this feature.
  2. Is Phasing necessary to be performed before using VCFs to produce assemblies using bcftools consensus. This is because I can see VCF2FASTA running before normalization and Phasing step.

It would be wonderful if that can happen. The only problem is I have just started learning nextflow. I will try to edit your code and if that doesn't work will try to share the additional modules (maybe via Email/ gists) that can be included in pipeline.

@priesgo
Copy link
Collaborator

priesgo commented Nov 1, 2022

How do you handle "Consecutive variants called by ivar belonging to the same codon ". I can see that Viralrecon in their nf-core/viralrecon#271 inculcated this feature.

We do the same. In order to get the annotation right you need to merge all variants that overlap the same codon. We do this for all variant callers, not only for iVar. This is the "phasing".

Is Phasing necessary to be performed before using VCFs to produce assemblies using bcftools consensus. This is because I can see VCF2FASTA running before normalization and Phasing step.

This is a good question! The phasing does not matter as bcftools consensus just builds the DNA sequence by applying all mutations, phasing is only important when predicting the amino acid change. But I wonder about normalization not sure if it will change things, but it definitely is more correct to apply it before, will handle this in #33

@priesgo
Copy link
Collaborator

priesgo commented Nov 1, 2022

I will try to edit your code and if that doesn't work will try to share the additional modules (maybe via Email/ gists) that can be included in pipeline.

Any contribution will be appreciated. You can also create a github branch and do your work there, I can help you to make it work if you face troubles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants