bcftools call ignores deletion with high coverage #1459
There is a maximum depth parameter; places that are too deep get skipped. This can be increased with the --max-depth option.
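As a minimal sketch (added for illustration), assuming the parameter referred to here is mpileup's --max-depth; the file names are placeholders:

```bash
# Raise the per-file read depth cap at the pileup stage, then call variants.
# ref.fa and aln.bam are placeholder names, not files from this issue.
bcftools mpileup --max-depth 100000 -f ref.fa aln.bam \
    | bcftools call -mv -
```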
From what I have seen (and also tested), bcftools call does not have that option. We did use it when executing bcftools mpileup, though:
Sorry, I see what you mean. It's hard to diagnose this without seeing the BAM file too. The record in question has [...]. Please try [...]
Thank you for your idea and suggestion. Using the BAM file and the reference, I executed the following command with the [...] option. Unfortunately it didn't change anything.
I cannot reproduce this. It works absolutely fine for me as a basic pipeline:
That's with 1.11, the same as listed in your bcf file. The mpileup bcf output (pre-call) is:
Note the PL here indicates a REF/ALT call rather than REF only.
Thanks for giving this a try! The results do look quite puzzling. To maximize reproducibility, we ran the following script for both bcftools 1.11 and 1.12:
#!/bin/bash
# download data
rm -f important_positions.bam* NC_045512.2.fasta*
wget https://github.com/samtools/bcftools/files/6271997/important_positions.bam.gz
gunzip important_positions.bam.gz
wget https://github.com/samtools/bcftools/files/6271998/NC_045512.2.fasta.txt
mv NC_045512.2.fasta.txt NC_045512.2.fasta
# check version
bcftools --version
# call variants
bcftools mpileup \
--max-depth 1000000 \
--max-idepth 100000 \
--annotate INFO/AD \
-f NC_045512.2.fasta \
important_positions.bam \
| bcftools call \
-mv \
--keep-alts \
- \
| grep -v '^#'
For bcftools 1.11:
bcftools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2 4253 . C T 222 . DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=359,1,406,0;MQ=60 GT:PL 0/1:255,0,255

For bcftools 1.12:
bcftools 1.12
Using htslib 1.12
Copyright (C) 2021 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2 4253 . C T 222.391 . DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60 GT:PL 0/1:255,0,255

The deletion does seem to be missing in both cases. The PL for the substitution at position 4253 matches. Do you have any suggestions on what else we could do to debug this?
It still seems to be giving different answers for me. I note you don't have a [...]. Also try disabling BAQ (-B), but this doesn't explain why we see different behaviours. I've tried various bcftools versions, and 1.7 onwards work for me, with 1.6 missing this variant. I cannot explain why it behaves differently for you, unless it's some bizarre platform-specific behaviour. @pd3, do you know of anything like this?
I made sure to delete all auxiliary files beforehand, so this is unfortunately not it.
I added that option, so the command is now:
bcftools mpileup \
--max-depth 1000000 \
--max-idepth 100000 \
--annotate INFO/AD \
--no-BAQ \
-f NC_045512.2.fasta \
important_positions.bam
and the deletion appears 🎉
QUAL and the PL still seem to be different from yours, though.
@kpj and I have now executed the commands on our real alignment (BAM file). Earlier, we provided you with a subset of this alignment restricted to the reads overlapping position 4179. We used bcftools 1.11 and 1.12 and executed the following commands:
The deletion appears after the execution of [...]. Do you have any suggestions as to what the reason for this behaviour could be and what we could do to debug it?
I obtained the following (correct) result
using this script, which also includes the bcftools download and compilation. I have never encountered such non-deterministic behaviour before; it has always come down to the user running with different options or a different version. Can you please send the full output of the script, including the VCF header? It might also be useful to run the mpileup step under valgrind and see if anything comes up.
This is getting really interesting now! On macOS (MacBookPro16,2 Darwin) we obtain only one substitution (full log, VCF header):
On Linux (x86_64 GNU/Linux) we obtain both the substitution and the deletion (full log, VCF header):
What should we be on the lookout for when running mpileup through valgrind?
Uninitialised memory accesses can be one cause of non-deterministic behaviour. I'll have an explore myself, also using gcc's address sanitizer.
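A hedged sketch of how that could be attempted, reusing the files from the reproduction script above (exact configure flags depend on the local toolchain):

```bash
# Run only the mpileup step under valgrind, looking for uninitialised reads.
valgrind --track-origins=yes --error-exitcode=1 \
    bcftools mpileup --max-depth 1000000 --max-idepth 100000 \
        --annotate INFO/AD -f NC_045512.2.fasta important_positions.bam \
        -o /dev/null

# Rebuild bcftools from the source tree with gcc's sanitizers enabled
# (a sketch only; adjust the flags for the compiler in use).
./configure CFLAGS="-g -O1 -fsanitize=address,undefined"
make clean && make
```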
I can't get errors from either valgrind or the address sanitizer. Petr, can you confirm that random sampling could show this effect? If so, I think we should be using our own implementation for all OSes and not just Windows, as we don't want system-specific behaviour patterns like this.
I can confirm my suspicions. From the bcftools-1.12 tarball, I ran configure and then manually edited the htslib-1.12/config.h file to comment out the [...] define.
So at least that explains the non-deterministic behaviour: it is caused by random sub-sampling when the depth goes above 255. What's curious is why it's still failing to identify the deletion. I can confirm at least that my work-in-progress PR overhauling calling (including indel detection) does fix this, at least in my current working copy (which needs pushing back up again, so it may differ on GitHub at the moment).
With [...]
The other thing you may wish to try is -x (disabling read-pair overlap removal). In theory overlap removal helps reduce over-confidence elsewhere, but I think at the moment it has issues, as it can create strand imbalance because it always removes the same member of a pair. Ideally it should be done at random (deterministically, either via our own implementation or via a simple hash of the read name). I don't know if a more random removal would fix this or not, though. [Note to self - rerun my variant calling accuracy assessments with -x. Does it actually help overall?]
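As a hedged sketch of that comparison, using mpileup's -x (--ignore-overlaps) to disable overlap removal; file names follow the reproduction script above:

```bash
# With read-pair overlap removal (the default behaviour).
bcftools mpileup -f NC_045512.2.fasta important_positions.bam \
    | bcftools call -mv --keep-alts -

# With overlap removal disabled (-x / --ignore-overlaps).
bcftools mpileup -x -f NC_045512.2.fasta important_positions.bam \
    | bcftools call -mv --keep-alts -
```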
I'm not sure if this plays into it or not, but this particular variant has the G present somewhere centrally in all the forward-strand reads and right at the start (adjacent to a poly-A) in the reverse-strand reads. It's likely BAQ will be removing many if not all of those. Similarly, the deletion is present centrally in the forward-strand reads and at the start of the reverse-strand reads. This shows a heavy strand bias, indicating that overlap removal must be strand-agnostic if it is not going to generate extreme biases, affecting various bcftools INFO filtering rules and (apparently) call confidence. Witness DP4 with and without -x (a different bcftools branch, but you get the picture):
This is definitely a failing of how mpileup works.
The [...]. However, [...]. So I reckon the way to fix this is to have [...].
Currently it always chooses the second sequence (except for the circumstance of differing base calls). This is essentially a random strand and a random coordinate in most library strategies, but some targeted sequencing methods have a very strong strand bias (first is + strand, second is - strand) or positional bias (e.g. PCR amplicons). Given that SNPs near the end of sequences can give rise to poor BAQ scores, both position and strand bias are detrimental.

This change makes it select either read 'a' or 'b' based on a hash of the read name. Unlike using a traditional random number generator, this gives it consistent behaviour regardless of how many sequences have gone before.

An example from SynDip region 1:185M-200M (columns are Q>0 / Filtered):

No overlap removal:
SNP   TP 18830 / 18803   FP  264 / 238   GT  56 / 53    FN 459 / 486
InDel TP  2788 /  2697   FP 1022 /  86   GT 353 / 345   FN 596 / 687

Old removal strategy:
SNP   TP 18841 / 18813   FP  270 / 243   GT  56 / 54    FN 448 / 476
InDel TP  2754 /  2663   FP  985 /  83   GT 413 / 404   FN 630 / 721

This PR:
SNP   TP 18841 / 18814   FP  272 / 242   GT  55 / 53    FN 448 / 475
InDel TP  2765 /  2679   FP  996 /  85   GT 382 / 375   FN 619 / 705

The CPU cost on bcftools mpileup | bcftools call between the latter two tests was 0.4% (which may also just be random fluctuation). Compared with the old removal system, this is a marginal improvement for SNPs and, oddly, a significant improvement for indels. (It's still behind no overlap removal for indels, but I'm unsure of the veracity of small indels in that truth set.)

Fixes samtools/bcftools#1459
Random numbers are used by ks_shuffle, which is called in htslib's errmod code for sub-sampling the data in regions of > 255 depth. This corrects the OS specific randomness seen in samtools#1459.
Hi @pd3 @jkbonfield,
Result for deletion:
I've also tried adding [...].

Screenshot of the region: [screenshot not included]

Since the region has some soft clipping of amplicon primers, I also tried the [...] option. Any idea why this variant might be being missed? Should I be adjusting the command? I have a closely related sample that has the 17-base deletion at a higher frequency (IMF=0.96), and it is correctly called by [...].
Any chance you could create a small test case to reproduce the problem? The screenshot you've shown is not sufficient. Also, next time please open a new issue and, if it is related to an existing issue, just link it.
@charlesfoster, we experienced the same deletion issue in SARS-CoV-2, which @pd3 was kind enough to quickly fix. @pd3, on another note, we are now experiencing non-determinism in variant calling. We have nothing to share that will consistently reproduce it; roughly 1 in 25 times we get an extra variant in high-coverage regions.
@zeeev The read subsetting in high-coverage regions uses a random number generator. Can you make the calling reproducible by adding the [...] option?
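A hedged sketch of such a run, assuming the option in question is mpileup's --seed (the file names are placeholders):

```bash
# Pin the random number seed used when sub-sampling deep regions so that
# repeated runs give identical output. ref.fa and aln.bam are placeholders.
bcftools mpileup --seed 42 --max-depth 1000000 -f ref.fa aln.bam \
    | bcftools call -mv --keep-alts -
```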
I'm still seeing this issue where bcftools does not call a small indel in Omicron -- in this case it's the 211DEL (22194_22196del) in the spike region. We have two samples with ample data covering this variant, but almost no reads pass the bcftools quality filter. In this one, 291/292 reads support the deletion, but only 1 read is counted after filtering, and it does not make it into the call set. bcftools call:
pileup output (filtered by "bcftools call"): [...]

Is there anything that can be done to call this variant? Thanks!
Thank you for the data. I can reproduce the problem, although I don't yet know what causes it. What is your ultimate goal here? If it is to produce a consensus, then maybe the "samtools consensus" PR (samtools/samtools#1557) would help. In its simple mode (basic frequency counting) it gets the correct answer for this sample, as the data is sufficiently deep and of sufficiently high quality that the result is plainly obvious:
The default Bayesian consensus method works on a diploid assumption, and it determines some of the sites as being more likely heterozygous than not. This leads either to ambiguity codes (if option -A is used) or, in the default mode where only "pure" bases are output, to "N"s instead. Note this is probably a weakness we need to fix in the consensus algorithm, as all such cases appear to be bogus alignments which some heuristics could probably cull, e.g. around reference position 11282. It's still possible to force it into a homozygous-style call by setting the probability of a heterozygous base appearing to be vanishingly small:
Alternatively, [...]. This is still a work in progress, but we hope the consensus subcommand will appear in the next release of samtools, hopefully a few weeks away. PS. Meanwhile I'll see if I can figure out why bcftools is rejecting this obvious deletion.
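For orientation, a hedged sketch of invoking the simple frequency-counting mode; the option names (-m simple, -o) are assumptions against the work-in-progress interface and may differ in the released tool:

```bash
# A sketch only: option names follow the work-in-progress samtools consensus
# PR (samtools/samtools#1557) and may change before release.
samtools consensus -m simple -o consensus.fa aln.bam
```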
The option works by switching off an existing heuristic which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has a low score. This is a problem for long reads, so the new `--indel-bias 1.01` is added to the 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
@jrharting I just pushed a tentative fix for this problem. Please try running with the [...] option.
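A hedged sketch of such a run, using the `--indel-bias 1.01` setting named in the commit message above (file names are placeholders; the final option name may differ, see the `--no-indelQ-tweaks` variant further down):

```bash
# Placeholder file names; the --indel-bias value follows the commit above.
bcftools mpileup --indel-bias 1.01 -f reference.fasta sample.bam \
    | bcftools call -mv --keep-alts -
```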
Awesome, thank you so much @pd3
@jkbonfield and @pd3 Thanks for the quick feedback! We will be testing this pronto. Re the end goal: both VCF and consensus are desired outputs for our hifiviral workflow. I do like the consensus PR in samtools, and will absolutely give that a try for future use.
The option switches off an existing heuristic which reduces indelQ and, in the extreme case, has the power to discard an indel entirely when the reference alignment has a low score. This is a problem for long reads, so the new `--no-indelQ-tweaks` is added to the 'ont' and 'pacbio-ccs' presets. Tentative fix to #1459
The fixed +/- indel_win_size is still used in the construction of the type[] array, but we then reduce that size before doing the alignments and evaluating them. This means that, where appropriate (nice unique data without lots of STRs), we can assess over a smaller window and avoid the negative impact of other nearby indels. It cures the example COVID-19 problem, but it also reduces recall elsewhere: if we *do* still get other nearby indels (e.g. due to low-complexity data between our candidate indel and a neighbouring one), we are now paying a much larger normalised per-base hit because the length is smaller.
An update on this and similar indel-related questions: there is now a new experimental option [...].
When executing bcftools call on the output of bcftools mpileup, it sometimes fails to retain deletions with appropriate coverage.
For example, consider the following bcf file, in particular the entry:

When executing the command

bcftools call -Oz -mv --keep-alts --output indel_problem.bcf important_positions.bcf

the deletion is lost and is not listed in the output bcf file, despite having 752 (out of 1455) raw reads supporting it. Can we somehow keep the deletion?
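For reference, a sketch of one way to inspect the record in question in the attached bcf file (the region coordinates are a placeholder around the positions discussed in this thread, and the file needs an index for region queries):

```bash
# Index the bcf and print the records around the region of interest,
# suppressing the header (-H). The region is a placeholder.
bcftools index -f important_positions.bcf
bcftools view -H important_positions.bcf NC_045512.2:4170-4260
```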