
bcftools call ignores deletion with high coverage #1459

Closed
LaraFuhrmann opened this issue Apr 7, 2021 · 28 comments · Fixed by samtools/htslib#1273

@LaraFuhrmann

When executing bcftools call on the output of bcftools mpileup, it sometimes fails to retain deletions with appropriate coverage.

For example, consider the following BCF file, in particular this entry:

NC_045512.2     4179    .       AGG     AG      0       .       INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;I16=360,690,405,0,15744,597840,14027,487333,63000,3.78e+06,24300,1.458e+06,12476,243234,10125,253125;QS=0.548652,0.451348;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0  PL      0,52,61

When executing the command bcftools call -Oz -mv --keep-alts --output indel_problem.bcf important_positions.bcf, the deletion is lost and not listed in the output BCF file, despite 752 (out of 1455) raw reads supporting the deletion.

Can we somehow keep the deletion?

@jkbonfield (Contributor)

There is a maximum depth parameter. Places that are too deep get skipped. This can be increased with the -L or --max-idepth option, which defaults to 250.
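
(For reference, a minimal sketch of raising both depth caps; these are bcftools mpileup options, and the file names and values below are illustrative only.)

# -d/--max-depth caps reads per position, -L/--max-idepth caps reads used
# for indel calling; raise both well above the observed coverage.
bcftools mpileup -Ou -f reference.fasta \
    --max-depth 100000 --max-idepth 100000 \
    important_positions.bam \
  | bcftools call -mv --keep-alts -Oz -o out.vcf.gz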

@LaraFuhrmann (Author)

From what I have seen (and also tested), bcftools call does not have the --max-idepth option.

We used that option though when executing bcftools mpileup:
bcftools mpileup -Ou -f "reference.fasta" --max-depth 1000000 --max-idepth 1000000 --annotate INFO/AD important_positions.bam --output important_positions.bcf.

@jkbonfield (Contributor)

Sorry, I see what you mean.

It's hard to diagnose this without seeing the BAM file too. The record in question has PL 0,52,61 which, if my interpretation is correct, is claiming it's a REF match. I don't know why mpileup concluded this, but I have some suspicions.

Please try mpileup -h500. I found many quite deep indels are rejected because bcftools concludes they're far more likely to be sequencing errors, presumably due to polymerase slippage in homopolymers, of which this is a small one. The defaults for mpileup were written many years ago when sequencing artifacts were considerably more common than they are now, and for indels they are not appropriate for modern data sets. (I have a branch changing this and numerous other things to do with calling, but it's still a work in progress.)

@LaraFuhrmann (Author)

Thank you for your idea and suggestion.

Using the BAM file and the reference, I executed the following command with the -h option:
bcftools mpileup -Ou -f "NC_045512.2.fasta" --max-depth 1000000 --max-idepth 100000 -h 500 --annotate INFO/AD important_positions.bam --output important_positions.bcf.

Unfortunately it didn't change anything.

@jkbonfield (Contributor)

I cannot reproduce this. It works absolutely fine for me as a basic pipeline:

$; bcftools mpileup --max-depth 1000000 --max-idepth 100000 --annotate INFO/AD  -f NC_045512.2.fasta important_positions.bam  | bcftools call -mv --keep-alts - | egrep -v '^#'
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2	4179	.	AGG	AG	32.3652	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=2;DP4=360,690,405,0;MQ=60	GT:PL	0/1:65,0,90
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

That's with 1.11, the same version as listed in your BCF file. The mpileup BCF output (before calling) is:

NC_045512.2	4179	.	AGG	AG	0	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;I16=360,690,405,0,15744,597840,14027,487333,63000,3.78e+06,24300,1.458e+06,12476,243234,10125,253125;QS=0.548652,0.451348;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0	PL	65,0,90

Note the PL here indicates a REF/ALT call rather than REF only.

@kpj commented Apr 7, 2021

Thanks for giving this a try! The results do look quite puzzling.

To maximize reproducibility, we ran the following script for both bcftools 1.11 and 1.12:

#!/bin/bash

# download data
rm -f important_positions.bam* NC_045512.2.fasta*

wget https://github.com/samtools/bcftools/files/6271997/important_positions.bam.gz
gunzip important_positions.bam.gz

wget https://github.com/samtools/bcftools/files/6271998/NC_045512.2.fasta.txt
mv NC_045512.2.fasta.txt NC_045512.2.fasta

# check version
bcftools --version

# call variants
bcftools mpileup \
    --max-depth 1000000 \
    --max-idepth 100000 \
    --annotate INFO/AD \
    -f NC_045512.2.fasta \
    important_positions.bam \
| bcftools call \
    -mv \
    --keep-alts \
    - \
| grep -v '^#'

For bcftools 1.11 the output is (ignoring wget):

bcftools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2	4253	.	C	T	222	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

For bcftools 1.12 we get:

bcftools 1.12
Using htslib 1.12
Copyright (C) 2021 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 1000000
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

The deletion does seem to be missing in both cases. The PL for the substitution at position 4253 matches.

Do you have any suggestion what else we could do to debug this?

@jkbonfield (Contributor)

It still seems to be giving different answers for me. I note you don't have a samtools faidx NC_045512.2.fasta in there. Could it be you have an out of date index somehow? I'm clutching at straws as I went to a totally fresh directory and it worked without an index anyway (I guess it made one).

Also try disabling BAQ (-B), but this doesn't explain why we see different behaviours.

I've tried various bcftools versions, and 1.7 onwards works for me, with 1.6 missing this variant. I cannot explain why it behaves differently for you, unless it's some bizarre platform-specific behaviour. @pd3, do you know of anything like this?
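
(If a stale index is the suspect, a minimal sketch for forcing a rebuild before re-running; file name as used earlier in the thread.)

# Remove any existing FASTA index and regenerate it, then re-run the pipeline.
rm -f NC_045512.2.fasta.fai
samtools faidx NC_045512.2.fasta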

@kpj commented Apr 7, 2021

It still seems to be giving different answers for me. I note you don't have a samtools faidx NC_045512.2.fasta in there. Could it be you have an out of date index somehow? I'm clutching at straws as I went to a totally fresh directory and it worked without an index anyway (I guess it made one).

I made sure to delete all auxiliary files beforehand, so this is unfortunately not it.

Also try disabling BAQ (-B), but this doesn't explain why we see different behaviours.

I added that option, so the mpileup call is now:

bcftools mpileup \
    --max-depth 1000000 \
    --max-idepth 100000 \
    --annotate INFO/AD \
    --no-BAQ \
    -f NC_045512.2.fasta \
    important_positions.bam

and the deletion appears 🎉

NC_045512.2	4179	.	AGG	AG	14.5181	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=2;DP4=360,690,405,0;MQ=60	GT:PL	0/1:47,0,113

QUAL and the PL still seem to be different from yours, though.
So it seems the per-base alignment quality (BAQ) was handled differently for some arcane reason?

@LaraFuhrmann (Author)

@kpj and I have now executed the commands on our real alignment (BAM file). Previously, we had provided you with a subset of this alignment restricted to the reads overlapping position 4179.

We used bcftools 1.11 and 1.12 and executed the following commands:

bcftools mpileup \
    --max-depth 1000000 \
    --max-idepth 100000 \
    --annotate INFO/AD \
    --no-BAQ \
    -Ou \
    -f NC_045512.2.fasta \
    REF_aln.bam \
| bcftools call \
    -mv \
    --keep-alts \
    -Ou \
| bcftools norm \
    -f NC_045512.2.fasta \
    -Oz \
    --output temp.bcf

The deletion appears after the execution of bcftools mpileup but is not called by bcftools call.

Do you have any suggestion as to what the reason for this behaviour could be, and what we could do to debug it?

@pd3 (Member) commented Apr 12, 2021

I obtained the following (correct) result:

NC_045512.2	4179	.	AGG	AG	32.3652	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=2;DP4=360,690,405,0;MQ=60	GT:PL	0/1:65,0,90
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

using this script, which also includes the bcftools download and compilation.

I have never encountered such non-deterministic behaviour before; it has always been down to the user running with different options or a different version. Can you please send the full output of the script, including the VCF header? It might also be useful to run the mpileup step under valgrind and see if anything comes up.
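
(For the valgrind suggestion, a sketch of running just the mpileup step under memcheck; the options are those used earlier in the thread.)

# Any "uninitialised value" or invalid read/write reports from memcheck would
# point at the source of the platform-dependent behaviour.
valgrind --track-origins=yes \
    bcftools mpileup --max-depth 1000000 --max-idepth 100000 \
        --annotate INFO/AD -f NC_045512.2.fasta important_positions.bam \
    > /dev/null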

@kpj commented Apr 12, 2021

This is getting really interesting now!
We executed the script you provided on two machines and the results differ (dramatically).

On macOS (MacBookPro16,2 Darwin) we obtain only one substitution (full log, VCF header):

NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

On Linux (x86_64 GNU/Linux) we obtain both the substitution and the deletion (full log, VCF header):

NC_045512.2	4179	.	AGG	AG	32.3652	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=1050,405;VDB=0.514509;SGB=-0.693147;MQSB=1;MQ0F=0;AC=1;AN=2;DP4=360,690,405,0;MQ=60	GT:PL	0/1:65,0,90
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

What should we be on the lookout for when running mpileup through valgrind?

@jkbonfield (Contributor)

Uninitialised memory accesses can be one cause of non-deterministic behaviour. I'll have an explore myself, also using gcc's AddressSanitizer.
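
(A sketch of one way to get an ASan-instrumented build from the release tarball; the exact configure flags are an assumption about the standard autoconf build, not project-specific instructions.)

# Build with AddressSanitizer and debug info, then re-run the failing pipeline;
# ASan aborts with a stack trace on the first bad memory access.
./configure CFLAGS="-g -O1 -fsanitize=address" LDFLAGS="-fsanitize=address"
make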

@jkbonfield (Contributor)

I can't get errors from either Valgrind or gcc -fsanitize=address. This appears to be a system difference, possibly in ks_shuffle (called by errmod_cal), which uses random numbers.

Petr, can you confirm that random sampling could show this effect? If so, I think we should be using our own implementation on all OSes and not just Windows, as we don't want system-specific behaviour patterns like this.

@jkbonfield (Contributor) commented Apr 13, 2021

I can confirm my suspicions.

From the bcftools-1.12 tarball, I ran configure and then manually edited the htslib-1.12/config.h file to comment out the #define HAVE_DRAND48 1 line. This forces it to build its own implementation of drand48, as it would on Windows. It appears that this particular implementation chooses a subset that triggers the same issue seen before (and I assume it is this or something close to it which is also used on MacOS). It now reports:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	important_positions.bam
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPB=0.984628;MQB=1;MQSB=1;BQB=0.911924;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

So at least that explains the non-deterministic behaviour, caused by random sub-sampling when the depth goes above 255.

What's curious is why it's still failing to identify the deletion.

I can confirm at least that my work-in-progress PR overhauling calling (including indel detection) does fix this, at least in my current local version (which needs pushing back up again, so it may differ on GitHub at the moment).

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	important_positions.bam
NC_045512.2	4179	.	AGG	AG	101.365	.	INDEL;IDV=752;IMF=0.516838;DP=1455;AD=696,752;VDB=0.508484;SGB=-0.693147;RPBZ=7.24249;MQBZ=0;SCBZ=-0.0397288;FS=0;MQ0F=0;AC=1;AN=2;DP4=360,336,405,347;MQ=60	GT:PL	0/1:134,0,117
NC_045512.2	4253	.	C	T	222.391	.	DP=1531;AD=360,406;VDB=0;SGB=-0.693147;RPBZ=-0.234093;BQBZ=-0.341734;SCBZ=-1.06197;FS=0;MQ0F=0;AC=1;AN=2;DP4=359,1,406,0;MQ=60	GT:PL	0/1:255,0,255

With HAVE_DRAND48 defined, the QUAL changes to 102.365: a very marginal difference, but still called.
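
(For anyone wanting to reproduce the bundled-drand48 behaviour described above, a sketch of the config.h edit; the sed expression is illustrative.)

# In the unpacked bcftools-1.12 tarball, after running ./configure:
sed -i.bak 's|^#define HAVE_DRAND48 1|/* #undef HAVE_DRAND48 */|' htslib-1.12/config.h
make
# The build now uses the bundled drand48 replacement, as it would on Windows.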

@jkbonfield (Contributor) commented Apr 13, 2021

The other thing you may wish to try is mpileup -x to disable the read-pair overlap detection. This gives a very confident call on this variant (with or without the local drand48 implementation).
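
(For concreteness, a sketch of the earlier pipeline with overlap detection disabled; input names as above.)

# -x / --ignore-overlaps: both members of an overlapping read pair are kept,
# so both contribute to the depth, DP4 and the genotype likelihoods.
bcftools mpileup -x --max-depth 1000000 --max-idepth 100000 \
    --annotate INFO/AD -f NC_045512.2.fasta important_positions.bam \
  | bcftools call -mv --keep-alts -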

In theory this overlap removal helps reduce over-confidence elsewhere, but I think at the moment it has issues, as it can create strand imbalance because it always removes the same member of a pair. Ideally the choice should be made at random (deterministically, either via our own random number implementation or a simple hash of the read name). I don't know whether a more random removal would fix this or not, though.

[Note to self - rerun my variant calling accuracy assessments with -x. Does it actually help overall?]

@jkbonfield (Contributor)

I'm not sure if this plays into it or not, but this particular variant has the G present somewhere central in all the forward strand reads and right at the start (adjacent to a poly-A) in the reverse strand reads. It's likely BAQ will be removing many if not all of those. Similarly the deletion is present centrally in the forward strand reads and at the start of the reverse strand reads.

This shows a heavy strand bias, indicating that overlap removal must be strand-agnostic if it's not going to generate extreme biases, affecting various bcftools INFO filtering rules and (apparently) call confidence.

Witness DP4 with and without -x (a different bcftools branch, but you get the picture):

DP4=359,1,405,0
DP4=359,360,406,406

This is definitely a failing of how mpileup works.

@jmarshall (Member)

If so I think we should be using our own [drand48 etc] implementation for all OSes and not just Windows, as we don't want system specific behaviour patterns like this.
[…]
I can confirm my suspicions. From the bcftools-1.12 tarball, I [commented] out the #define HAVE_DRAND48 1 line. This forces it to build its own implementation of drand48 [and the results were different].

The drand48()/lrand48()/etc algorithm is specified by POSIX, and equivalently-initialised implementations on different POSIX-compliant platforms are guaranteed to produce the same sequence of random numbers.

However bcftools mpileup is not calling srand48(), so the system implementations are not equivalently-initialised. You can call drand48() without a previous srand48() call, but POSIX says this “is not recommended practice” and does not specify the default initializer values to be used in this case — hence the observed problem.

So I reckon the way to fix this is to have bcftools mpileup (and anything else that uses ks_shuffle() really) seed the random number generator, ideally in the same way as samtools sort does.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 21, 2021
Currently it always chooses the second sequence (except for the
circumstance of differing base calls).  This is essentially random
strand and random coordinate in most library strategies, but some
targeted sequencing methods have a very strong strand bias (first is
+ strand, second is - strand) or positional bias (eg PCR amplicons).
Given SNPs near the end of sequences can give rise to poor BAQ scores,
both position and strand bias are detrimental.

This change makes it select either read 'a' or 'b' based on a hash of
the read name.  Unlike using a traditional random number generator,
this gives it consistent behaviour regardless of how many sequences
have gone before.

An example from SynDip region 1:185M-200M:

No overlap removal:
SNP          Q>0 /   Filtered
SNP   TP   18830 /     18803
SNP   FP     264 /       238
SNP   GT      56 /        53
SNP   FN     459 /       486

InDel TP    2788 /      2697
InDel FP    1022 /        86
InDel GT     353 /       345
InDel FN     596 /       687

Old removal strategy:
SNP          Q>0 /   Filtered
SNP   TP   18841 /     18813
SNP   FP     270 /       243
SNP   GT      56 /        54
SNP   FN     448 /       476

InDel TP    2754 /      2663
InDel FP     985 /        83
InDel GT     413 /       404
InDel FN     630 /       721

This PR:
SNP          Q>0 /   Filtered
SNP   TP   18841 /     18814
SNP   FP     272 /       242
SNP   GT      55 /        53
SNP   FN     448 /       475

InDel TP    2765 /      2679
InDel FP     996 /        85
InDel GT     382 /       375
InDel FN     619 /       705

The CPU cost on bcftools mpileup | bcftools call between the latter
two tests was 0.4% (which may also just be random fluctuation).
Vs the old removal system, this is a marginal improvement for SNPs
and, oddly, a significant improvement to Indels.  (It's still behind
no overlap removal for indels, but I'm unsure on the veracity of
small indels in that truth set).

Fixes samtools/bcftools#1459
jkbonfield added a commit to jkbonfield/bcftools that referenced this issue Apr 21, 2021
Random numbers are used by ks_shuffle, which is called in htslib's
errmod code for sub-sampling the data in regions of > 255 depth.

This corrects the OS-specific randomness seen in samtools#1459.
pd3 pushed a commit to samtools/htslib that referenced this issue Apr 27, 2021
@charlesfoster commented Jul 5, 2021

Hi @pd3 @jkbonfield,
I think I'm seeing a similar (maybe?) problem with my own data from SARS-CoV-2. I've got sequencing data that I previously ran through a different program (ivar), which uses the output of samtools mpileup as the input. When using that program, I find a 17 base deletion in the genome occurring at a frequency of ~26% of reads. However, when using bcftools to do variant calling, the deletion is not recovered at all. I've tried a few of the recommendations from this thread to no avail.

ivar command:

samtools mpileup -A -d 0 -B -Q 0 --reference $REF $BAM | \
ivar variants -q 20 -m 10 -t 0.1 -r $REF

Result for deletion:

REGION POS REF ALT TOTAL_DP ALT_DP ALT_FREQ
NC_045512.2 27606 AAAACACGTCTATCAGTT A 909 239 0.262926

bcftools command:

bcftools mpileup --count-orphans --no-BAQ --max-idepth 10000000 -d 10000000 -Q 0 \
-Ou --min-BQ 20 \
--annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
-f $REF $BAM | \
bcftools call --ploidy 1 --keep-alts --keep-masked-ref --multiallelic-caller --variants-only -Ov --threads 20 | \
bcftools +fill-tags - -Oz  -o ${NAME}.bcftools.raw.vcf.gz -- -t FORMAT/VAF

I've also tried adding -h500 (and -h1000) and -x to the mpileup command. None of these options has recovered the 17 base deletion.

Screenshot of the region:

[screenshot]

Since the region has some soft clipping of amplicon primers, I also tried the bcftools commands on a non-soft-clipped BAM file from the same sample, in case that was a problem (no difference).

Any idea why this variant might be being missed? Should I be adjusting the command? I have a closely related sample that has the 17 base deletion at a higher frequency (IMF=0.96) and it's correctly called by bcftools.

@pd3 (Member) commented Jul 6, 2021

Any chance you could create a small test case to reproduce the problem? The screenshot you've shown is not sufficient. Also, next time please open a new issue and, if it is related to an existing issue, just link to it.

@zeeev commented Oct 19, 2021

@charlesfoster, we experienced the same deletion issue in SARS-CoV-2, which @pd3 was kind enough to quickly fix.

#1569

@pd3, on another note, we are now experiencing non-determinism in variant calling. Nothing to share that reproduces consistently; roughly 1 in 25 runs we get an extra variant in a high-coverage region.

@pd3 (Member) commented Oct 20, 2021

@zeeev The read subsetting in high coverage regions uses a random number generator. Can you make the calling reproducible by adding the mpileup --seed option? A test case would be helpful indeed.
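
(A sketch of the --seed suggestion; the seed value and file names are placeholders.)

# Fix the seed used when sub-sampling regions deeper than 255x so that
# repeated runs of the same pipeline produce identical output.
bcftools mpileup --seed 42 -Ou -f ref.fa aln.bam | bcftools call -mv -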

@jrharting commented Jan 26, 2022

I'm still seeing this issue where bcftools does not call a small indel in Omicron -- in this case it's the 211DEL (22194_22196del) in the spike region. We have two samples with ample data covering this variant, but almost no reads pass the bcftools quality filter.

In this one, 291/292 reads support the deletion, but only 1 read is counted after filtering, and it does not make it into the call set.

[screenshot]

bcftools command:

bcftools --version
bcftools 1.13-37-g4c71bfb
Using htslib 1.13

bcftools mpileup --open-prob 25 --indel-size 450 --gap-frac 0.01 --ext-prob 1 --min-ireads 3 --max-depth 1000 --max-idepth 5000 --seed 1984 -h 500 -B -a FORMAT/AD -f NC_045512.2.fasta aligned.bam | bcftools call -mv -Ov | bcftools norm -f NC_045512.2.fasta

pileup output:
NC_045512.2 22193 . AATT A 0 . INDEL;IDV=291;IMF=0.996575;DP=292;I16=0,0,1,0,0,0,27,729,0,0,60,3600,0,0,25,625;QS=0,1;SGB=-0.379885;RPBZ=-0.518777;FS=0;MQ0F=0 PL:AD 22,3,0:0,1

(filtered by "bcftools call").
sample_bams.zip

Is there anything that can be done to call this variant?
I'm including the data for two samples.

Thanks!

@jrharting

Re the 211DEL -- there does appear to be a difference between strands.

pileup of forward reads only:

[screenshot: forward-strand pileup]
indel called, but mostly filtered.

Reverse strand reads -- no call in the pileup:

[screenshot: reverse-strand pileup]

@jkbonfield (Contributor) commented Jan 27, 2022

Thank you for the data. I can reproduce the problem, although I don't yet know what causes it.

What is your ultimate goal here? If it is to produce a consensus, then maybe the "samtools consensus" PR (samtools/samtools#1557) would help.

In the simple mode (basic frequency counting) it gets the correct answer for this sample, as the data is sufficiently deep and of sufficiently high quality that the result is plainly obvious:

samtools consensus -a -m simple -f fastq B_A11R6.mapped.bam

The default Bayesian consensus method works on a diploid assumption, and some of the sites it determines to be more likely heterozygous than not. This leads to either ambiguity codes (if option -A is used) or, in the default mode where only "pure" bases are output, "N"s instead. Note this is probably a weakness we need to fix in the consensus algorithm, as all such cases appear to be bogus alignments which some heuristics could probably cull, e.g. around reference position 11282.

It's still possible to force it into a homozygous style call by setting the probability of a heterozygous base appearing to be vanishingly small:

samtools consensus -a --P-het=1e-200 -f fastq B_A11R6.mapped.bam

Alternatively, -C0, which shows all pure bases regardless of the het likelihood, also works.
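
(Putting that together, a sketch based on the options named in this comment; the consensus subcommand was still a work-in-progress PR at this point, so option behaviour may differ in the released version.)

# Emit the most likely pure base everywhere, ignoring the heterozygosity cutoff.
samtools consensus -a -C0 -f fastq B_A11R6.mapped.bam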

This is still a work in progress, but we hope the consensus subcommand should appear in the next release of samtools. Hopefully a few weeks away.

PS. Meanwhile I'll see if I can figure out why bcftools is rejecting this obvious deletion.

pd3 added a commit that referenced this issue Jan 27, 2022
by switching off an existing heuristic which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so `--indel-bias 1.01` is newly
added to the 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459
@pd3 (Member) commented Jan 27, 2022

@jrharting I just pushed a tentative fix for this problem. Please try to run with the -X pacbio-ccs preset, or set --indel-bias to a value greater than 1 manually.
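
(A sketch of both suggestions, using the file names from the earlier command; this needs a build containing the new fix, and the --indel-bias value here is illustrative.)

# Either use the long-read preset, which now includes --indel-bias 1.01 ...
bcftools mpileup -X pacbio-ccs -f NC_045512.2.fasta aligned.bam | bcftools call -mv -
# ... or keep your existing options and raise --indel-bias explicitly.
bcftools mpileup --indel-bias 1.5 -f NC_045512.2.fasta aligned.bam | bcftools call -mv -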

@zeeev commented Jan 27, 2022

awesome, thank you so much @pd3

@jrharting commented Jan 27, 2022

@jkbonfield and @pd3 Thanks for the quick feedback! Will be testing this pronto.

Re end goal, both VCF and consensus are desired outputs for our hifiviral workflow. I do like the consensus PR with samtools, and will absolutely give that a try for future use.

pd3 added a commit that referenced this issue Feb 8, 2022
…lling

The option switches off an existing heuristic which reduces indelQ
and, in the extreme case, has the power to discard an indel
entirely when the reference alignment has low score.

This is a problem for long reads, so `--no-indelQ-tweaks` is newly
added to the 'ont' and 'pacbio-ccs' presets.

Tentative fix to #1459
jkbonfield added a commit to jkbonfield/bcftools that referenced this issue Mar 10, 2022
The fixed +/- indel_win_size is still used in construction of the
type[] array, but then we reduce that size down before doing the
alignments and evaluating them.

This means, where appropriate (nice unique data without lots of STRs)
we can assess over a smaller window and avoid the negative impact of
other nearby indels.

It cures the example covid19 problem, but also reduces recall elsewhere:
if we *do* still get other nearby indels (e.g. due to low-complexity data
between our candidate indel and a neighbouring one) then we're now paying
a much larger normalised per-base hit as the length is smaller.
@pd3 (Member) commented Nov 23, 2022

An update on this and similar indel-related questions: there is now a new experimental option, mpileup --indels-2.0, introduced by #1824, which attempts to solve issues like this and provides a testing ground for improvements in indel calling. Note that this is an EXPERIMENTAL feature. If you test it, please let us know the results.
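
(A minimal sketch of enabling it; file names are placeholders.)

# EXPERIMENTAL: switch mpileup to the reworked indel-calling code path.
bcftools mpileup --indels-2.0 -f ref.fa aln.bam | bcftools call -mv -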
