bcftools call ignores deletion with high coverage #1459
There is a maximum depth parameter; places that are too deep get skipped. This can be increased with the --max-depth option.
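As a minimal sketch (added for illustration), assuming the parameter referred to here is mpileup's --max-depth; the file names are placeholders:

```bash
# Raise the per-file read depth cap at the pileup stage, then call variants.
# ref.fa and aln.bam are placeholder names, not files from this issue.
bcftools mpileup --max-depth 100000 -f ref.fa aln.bam \
    | bcftools call -mv -
```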
From what I have seen (and also tested), bcftools call does not have that option. We did use it when executing bcftools mpileup, though:
Sorry, I see what you mean. It's hard to diagnose this without seeing the BAM file too. The record in question has [...]. Please try [...]
Thank you for your idea and suggestion. Using the BAM file and the reference, I executed the following command with the [...] option. Unfortunately it didn't change anything.
I cannot reproduce this. It works absolutely fine for me as a basic pipeline:
That's with 1.11, the same as listed in your bcf file. The mpileup bcf output (pre-call) is:
Note the PL here indicates a REF/ALT call rather than REF only.
Thanks for giving this a try! The results do look quite puzzling. To maximize reproducibility, we ran the following script for both bcftools 1.11 and 1.12:
#!/bin/bash
# download data
rm -f important_positions.bam* NC_045512.2.fasta*
wget https://github.com/samtools/bcftools/files/6271997/important_positions.bam.gz
gunzip important_positions.bam.gz
wget https://github.com/samtools/bcftools/files/6271998/NC_045512.2.fasta.txt
mv NC_045512.2.fasta.txt NC_045512.2.fasta
# check version
bcftools --version
# call variants
bcftools mpileup \
--max-depth 1000000 \
--max-idepth 100000 \
--annotate INFO/AD \
-f NC_045512.2.fasta \
important_positions.bam \
| bcftools call \
-mv \
--keep-alts \
- \
| grep -v '^#'
For bcftools 1.11:
bcftools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2 4253 . C T 222 . DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=359,1,406,0;MQ=60 GT:PL 0/1:255,0,255

For bcftools 1.12:
bcftools 1.12
Using htslib 1.12
Copyright (C) 2021 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2 4253 . C T 222.391 . DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60 GT:PL 0/1:255,0,255

The deletion does seem to be missing in both cases. The PL for the substitution at position 4253 matches. Do you have any suggestions on what else we could do to debug this?
It still seems to be giving different answers for me. I note you don't have a [...]. Also try disabling BAQ (-B), but this doesn't explain why we see different behaviours. I've tried various bcftools versions, and 1.7 onwards work for me, with 1.6 missing this variant. I cannot explain why it behaves differently for you, unless it's some bizarre platform-specific behaviour. @pd3, do you know of anything like this?
I made sure to delete all auxiliary files beforehand, so this is unfortunately not it.
I added that option, so the command is now:
bcftools mpileup \
--max-depth 1000000 \
--max-idepth 100000 \
--annotate INFO/AD \
--no-BAQ \
-f NC_045512.2.fasta \
important_positions.bam
and the deletion appears 🎉
QUAL and the PL still seem to be different from yours, though.
@kpj and I have now executed the commands on our real alignment (BAM file). Earlier, we provided you with a subset of this alignment restricted to the reads overlapping position 4179. We used bcftools 1.11 and 1.12 and executed the following commands:
The deletion appears after the execution of [...]. Do you have any suggestions as to what the reason for this behaviour could be and what we could do to debug it?
I obtained the following (correct) result
using this script, which also includes the bcftools download and compilation. I have never encountered such non-deterministic behaviour before; it has always come down to the user running with different options or a different version. Can you please send the full output of the script, including the VCF header? It might also be useful to run the mpileup step under valgrind and see if anything comes up.
This is getting really interesting now! On macOS (MacBookPro16,2 Darwin) we obtain only one substitution (full log, VCF header):
On Linux (x86_64 GNU/Linux) we obtain both the substitution and the deletion (full log, VCF header):
What should we be on the lookout for when running mpileup through valgrind?
Uninitialised memory accesses can be one cause of non-deterministic behaviour. I'll have an explore myself, also using gcc's address sanitizer.
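A hedged sketch of how that could be attempted, reusing the files from the reproduction script above (exact configure flags depend on the local toolchain):

```bash
# Run only the mpileup step under valgrind, looking for uninitialised reads.
valgrind --track-origins=yes --error-exitcode=1 \
    bcftools mpileup --max-depth 1000000 --max-idepth 100000 \
        --annotate INFO/AD -f NC_045512.2.fasta important_positions.bam \
        -o /dev/null

# Rebuild bcftools from the source tree with gcc's sanitizers enabled
# (a sketch only; adjust the flags for the compiler in use).
./configure CFLAGS="-g -O1 -fsanitize=address,undefined"
make clean && make
```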
I can't get errors from either valgrind or the address sanitizer. Petr, can you confirm that random sampling could show this effect? If so, I think we should be using our own implementation for all OSes and not just Windows, as we don't want system-specific behaviour patterns like this.
I can confirm my suspicions. From the bcftools-1.12 tarball, I ran configure and then manually edited the htslib-1.12/config.h file to comment out the [...] define.
So at least that explains the non-deterministic behaviour: it is caused by random sub-sampling when the depth goes above 255. What's curious is why it's still failing to identify the deletion. I can confirm at least that my work-in-progress PR overhauling calling (including indel detection) does fix this, at least in my current working copy (which needs pushing back up again, so it may differ on GitHub at the moment).
With [...]
The other thing you may wish to try is -x (disabling read-pair overlap removal). In theory overlap removal helps reduce over-confidence elsewhere, but I think at the moment it has issues, as it can create strand imbalance because it always removes the same member of a pair. Ideally it should be done at random (deterministically, either via our own implementation or via a simple hash of the read name). I don't know if a more random removal would fix this or not, though. [Note to self - rerun my variant calling accuracy assessments with -x. Does it actually help overall?]
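As a hedged sketch of that comparison, using mpileup's -x (--ignore-overlaps) to disable overlap removal; file names follow the reproduction script above:

```bash
# With read-pair overlap removal (the default behaviour).
bcftools mpileup -f NC_045512.2.fasta important_positions.bam \
    | bcftools call -mv --keep-alts -

# With overlap removal disabled (-x / --ignore-overlaps).
bcftools mpileup -x -f NC_045512.2.fasta important_positions.bam \
    | bcftools call -mv --keep-alts -
```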
I'm not sure if this plays into it or not, but this particular variant has the G present somewhere centrally in all the forward-strand reads and right at the start (adjacent to a poly-A) in the reverse-strand reads. It's likely BAQ will be removing many if not all of those. Similarly, the deletion is present centrally in the forward-strand reads and at the start of the reverse-strand reads. This shows a heavy strand bias, indicating that overlap removal must be strand-agnostic if it is not going to generate extreme biases, affecting various bcftools INFO filtering rules and (apparently) call confidence. Witness DP4 with and without -x (a different bcftools branch, but you get the picture):
This is definitely a failing of how mpileup works.
The [...]. However, [...]. So I reckon the way to fix this is to have [...].
Currently it always chooses the second sequence (except for the circumstance of differing base calls). This is essentially a random strand and a random coordinate in most library strategies, but some targeted sequencing methods have a very strong strand bias (first is + strand, second is - strand) or positional bias (e.g. PCR amplicons). Given that SNPs near the end of sequences can give rise to poor BAQ scores, both position and strand bias are detrimental.

This change makes it select either read 'a' or 'b' based on a hash of the read name. Unlike using a traditional random number generator, this gives it consistent behaviour regardless of how many sequences have gone before.

An example from SynDip region 1:185M-200M (columns are Q>0 / Filtered):

No overlap removal:
SNP   TP 18830 / 18803   FP  264 / 238   GT  56 / 53    FN 459 / 486
InDel TP  2788 /  2697   FP 1022 /  86   GT 353 / 345   FN 596 / 687

Old removal strategy:
SNP   TP 18841 / 18813   FP  270 / 243   GT  56 / 54    FN 448 / 476
InDel TP  2754 /  2663   FP  985 /  83   GT 413 / 404   FN 630 / 721

This PR:
SNP   TP 18841 / 18814   FP  272 / 242   GT  55 / 53    FN 448 / 475
InDel TP  2765 /  2679   FP  996 /  85   GT 382 / 375   FN 619 / 705

The CPU cost on bcftools mpileup | bcftools call between the latter two tests was 0.4% (which may also just be random fluctuation). Compared with the old removal system, this is a marginal improvement for SNPs and, oddly, a significant improvement for indels. (It's still behind no overlap removal for indels, but I'm unsure of the veracity of small indels in that truth set.)

Fixes samtools/bcftools#1459
Random numbers are used by ks_shuffle, which is called in htslib's errmod code for sub-sampling the data in regions of > 255 depth. This corrects the OS specific randomness seen in samtools#1459.
Hi @pd3 @jkbonfield,
Result for deletion:
I've also tried adding [...].

Screenshot of the region: [screenshot not included]

Since the region has some soft clipping of amplicon primers, I also tried the [...] option. Any idea why this variant might be being missed? Should I be adjusting the command? I have a closely related sample that has the 17-base deletion at a higher frequency (IMF=0.96), and it is correctly called by [...].
Any chance you could create a small test case to reproduce the problem? The screenshot you've shown is not sufficient. Also, next time please open a new issue and, if it is related to an existing issue, just link it.
@charlesfoster, we experienced the same deletion issue in SARS-CoV-2, which @pd3 was kind enough to quickly fix. @pd3, on another note, we are now experiencing non-determinism in variant calling. We have nothing to share that will consistently reproduce it; roughly 1 in 25 times we get an extra variant in high-coverage regions.
@zeeev The read subsetting in high-coverage regions uses a random number generator. Can you make the calling reproducible by adding the [...] option?
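A hedged sketch of such a run, assuming the option in question is mpileup's --seed (the file names are placeholders):

```bash
# Pin the random number seed used when sub-sampling deep regions so that
# repeated runs give identical output. ref.fa and aln.bam are placeholders.
bcftools mpileup --seed 42 --max-depth 1000000 -f ref.fa aln.bam \
    | bcftools call -mv --keep-alts -
```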
I'm still seeing this issue where bcftools does not call a small indel in Omicron -- in this case it's the 211DEL (22194_22196del) in the spike region. We have two samples with ample data covering this variant, but almost no reads pass the bcftools quality filter. In this one, 291/292 reads support the deletion, but only 1 read is counted after filtering, and it does not make it into the call set. bcftools call:
pileup output (filtered by "bcftools call"): [...]

Is there anything that can be done to call this variant? Thanks!
Thank you for the data. I can reproduce the problem, although I don't yet know what causes it. What is your ultimate goal here? If it is to produce a consensus, then maybe the "samtools consensus" PR (samtools/samtools#1557) would help. In its simple mode (basic frequency counting) it gets the correct answer for this sample, as the data is sufficiently deep and of sufficiently high quality that the result is plainly obvious:
The default Bayesian consensus method works on a diploid assumption, and it determines some of the sites as being more likely heterozygous than not. This leads either to ambiguity codes (if option -A is used) or, in the default mode where only "pure" bases are output, to "N"s instead. Note this is probably a weakness we need to fix in the consensus algorithm, as all such cases appear to be bogus alignments which some heuristics could probably cull, e.g. around reference position 11282. It's still possible to force it into a homozygous-style call by setting the probability of a heterozygous base appearing to be vanishingly small:
Alternatively, [...]. This is still a work in progress, but we hope the consensus subcommand will appear in the next release of samtools, hopefully a few weeks away. PS. Meanwhile I'll see if I can figure out why bcftools is rejecting this obvious deletion.
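For orientation, a hedged sketch of invoking the simple frequency-counting mode; the option names (-m simple, -o) are assumptions against the work-in-progress interface and may differ in the released tool:

```bash
# A sketch only: option names follow the work-in-progress samtools consensus
# PR (samtools/samtools#1557) and may change before release.
samtools consensus -m simple -o consensus.fa aln.bam
```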
The option works by switching off an existing heuristic which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has a low score. This is a problem for long reads, so the new `--indel-bias 1.01` is added to the 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
@jrharting I just pushed a tentative fix for this problem. Please try running with the [...] option.
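A hedged sketch of such a run, using the `--indel-bias 1.01` setting named in the commit message above (file names are placeholders; the final option name may differ, see the `--no-indelQ-tweaks` variant further down):

```bash
# Placeholder file names; the --indel-bias value follows the commit above.
bcftools mpileup --indel-bias 1.01 -f reference.fasta sample.bam \
    | bcftools call -mv --keep-alts -
```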
Awesome, thank you so much @pd3
@jkbonfield and @pd3 Thanks for the quick feedback! We will be testing this pronto. Re the end goal: both VCF and consensus are desired outputs for our hifiviral workflow. I do like the consensus PR in samtools, and will absolutely give that a try for future use.
The option switches off an existing heuristic which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has a low score. This is a problem for long reads, so the new `--no-indelQ-tweaks` is added to the 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
The fixed +/- indel_win_size is still used in the construction of the type[] array, but we then reduce that size before doing the alignments and evaluating them. This means that, where appropriate (nice unique data without lots of STRs), we can assess over a smaller window and avoid the negative impact of other nearby indels. It cures the example COVID-19 problem, but it also reduces recall elsewhere: if we *do* still get other nearby indels (e.g. due to low-complexity data between our candidate indel and a neighbouring one), we are now paying a much larger normalised per-base hit because the length is smaller.
An update on this and similar indel-related questions: there is now a new experimental option [...].
When executing bcftools call on the output of bcftools mpileup, it sometimes fails to retain deletions with appropriate coverage.
For example, consider the following bcf file, in particular the entry:

When executing the command

bcftools call -Oz -mv --keep-alts --output indel_problem.bcf important_positions.bcf

the deletion is lost and is not listed in the output bcf file, despite having 752 (out of 1455) raw reads supporting it. Can we somehow keep the deletion?
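For reference, a sketch of one way to inspect the record in question in the attached bcf file (the region coordinates are a placeholder around the positions discussed in this thread, and the file needs an index for region queries):

```bash
# Index the bcf and print the records around the region of interest,
# suppressing the header (-H). The region is a placeholder.
bcftools index -f important_positions.bcf
bcftools view -H important_positions.bcf NC_045512.2:4170-4260
```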