Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling BAQ in mpileup: better results? #1475

Closed
charbel-gem opened this issue Apr 22, 2021 · 9 comments
Closed

Disabling BAQ in mpileup: better results? #1475

charbel-gem opened this issue Apr 22, 2021 · 9 comments

Comments

@charbel-gem
Copy link

charbel-gem commented Apr 22, 2021

Hello.

I kindly request from you to refer to my post on biostars for my question.
Thank you.

@jkbonfield
Copy link
Contributor

jkbonfield commented Apr 23, 2021

Thanks for the report. I think the documentation in bcftools is misleading.

I added an explanation to your biostars post so others can see it, but also the whole idea of whether to-BAQ or not-to-BAQ has been on my radar for the last few months. See #1474 for a shake up of that idea, which adds partial-BAQ mode. See also samtools/htslib#1273, which is vital for calling on amplicon sequencing.

As an alternative, try bcftools mpileup -x -B -m3 -h500 to disable both BAQ and overlap removal (plus some saner default values). With neither of them you're likely to get the right answer assuming your depth is high (which it almost certainly is for Covid-19).

@charbel-gem
Copy link
Author

Thank you @jkbonfield for your speedy and thorough response, I really appreciate it!

@pd3 pd3 closed this as completed May 24, 2021
@IsabelFE
Copy link

IsabelFE commented Jun 8, 2021

@charbel-gem, would you mind sharing what settings did you end up using for mpileup and ivar consensus/variants? Is this for COVID amplicon data?

@jkbonfield,I am confused about using --reference fasta, but also -B. Aren't those supposed to cancel each other? Or I am understanding -B wrong? Would you recommend the use of bcftools mpileup vs. samtools mpileup? Can bcftools mpileup be piped into ivar?

I explained here andersen-lab/ivar#85 the issues I am having around indels and the use of BAQ.

Thanks

@gkarthik
Copy link

gkarthik commented Jun 8, 2021

Currently, ivar doesn't accept bcftools mpileup output but that has been on my radar for a bit now.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 9, 2021

-B and --reference don't "cancel each other out" per se. It's more that BAQ requires a reference, so not specifying one also implicitly has the same effect as -B. If you do specify a reference, then you get the choice - to use BAQ or not. I should also add that -B doesn't disable BAQ universally! It removes it for SNP calls, but it is still used internally with indels (even without a reference - it generates its own from various consensus sequences). In this second scenario BAQ is mainly used as a proxy for alignment score, to determine which candidate allele each record aligns best against.

If evaluating Bcftools here it would be useful to check both the current 1.12 release and the current develop branch, as they have diverged substantially.

The false ivar variants you show are mainly due to alignment error due to biasing towards the reference rather than other sample reads. There are two approaches - nullifying their quality so they don't count (BAQ) or a local reassembly in that region to get the correct alignments. Locally we have experimented with using FreeBayes for this and it does fix a number of ivar errors. The person doing this though didn't evaluate bcftools so we don't have any side-by-side comparisons sadly.

@IsabelFE
Copy link

IsabelFE commented Jun 9, 2021

Thanks @jkbonfield for the explanation about -B and --reference
What I still dont understand is what is going on here:
Screen Shot 2021-06-09 at 10 12 59 AM

If BAQ is always in use around indels, why do I see a difference here when using -f vs. non using it? In fact when I add -f it fails at detecting what I think is a real indel:

image

2021_0405_01C21_AllPiles_28270.txt This is pileup for the region with all the settings I tried.

@IsabelFE
Copy link

IsabelFE commented Jun 9, 2021

I also tried using bcftools mpileup, but I can't pipe it into ivars.
I tried this to use bcftools consensus like this:

bcftools mpileup -Ou -x -m 10 -d 0 -h 500 -f wuhan_trim3polyA.fasta sorted.bam | bcftools call -Ou -cv --ploidy 1 | bcftools norm -f wuhan_trim3polyA.fasta -Oz -o output.vcf.gz
tabix output.vcf.gz
bcftools consensus -f wuhan_trim3polyA.fasta output.vcf.gz > out.fa

But where the coverage is zero and I should get a stretch of Ns, it gets the reference sequence instead. How could I use bcftools properly? The output.vcf.gz file is only calling the variants, but I am losing the Ns.

@pd3
Copy link
Member

pd3 commented Jun 10, 2021

@IsabelFE the usage of consensus is unrelated to this, so probably would have been better to open a new issue. The command modifies a fasta sequence by applying variants present in the VCF. When there is no information, for example because of no coverage, the reference allele is assumed. If you have additional information about regions with zero coverage, you can use the --mask option.

@IsabelFE
Copy link

Thanks @pd3, I will open a new issue if I decide to explore more the bcftools consensus route

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants