Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mpileup revamp #1474

Merged
merged 16 commits into from
May 10, 2021
Merged

Mpileup revamp #1474

merged 16 commits into from
May 10, 2021

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Apr 22, 2021

This is a rebased, reordered / squashed version of #1363.

A summary of the changes are:

  • BAQ is now tri-state; fully on (still the default), off, and partial. Partial BAQ only runs BAQ on reads which overlap potentially problematic indels. This is both a significant speed increase, but also an improvement in calling rates as BAQ can sometimes be detrimental so we only want to use it when we have a significant chance of fixing something.

  • Added a new calc_mwu_biasZ function (Z-score) to replace calc_mwu_bias. These are currently used if MWU_ZSCORE is defined in bam2bcf.h (which it is). If we decide we prefer these metrics, then we can remove this ifdef and the old code, but for now there is a compilation choice to be made. (And no user way to configure.) Alternatively it could become a run-time choice with a bit more work. Please advise.

  • Added MQBZ and RPBZ INFO fields to indels, for filtering on.

  • Added a soft-clip bias SCBZ for SNPs, based on length of soft-clips near the SNP site, indicating potential misalignments. (Although questionable how helpful this is.)

  • Very significant improvement to GT indel assignment accuracy. In some data sets it reduces incorrect assignment by an order of magnitude.

  • Adjusted default parameters to mpileup. -h has gone from 100 to 500, -Q from 13 to 1, -F from 0.002 to 0.05 and -m from 1 to 2.

  • Added a --indel-bias mpileup option to skew FN vs FP. This appears to be better than modifying call -P as it's specific to indels.

  • Changed the indel scoring algorithm, giving better FP vs FN discrimination.

  • Added the short tandem repeat finder from Crumble. This is used for correcting the indel score. Put simply, indels in low-complexity regions having a higher bar to pass than those in normal regions. This is a modest improvement to FN, FP and genotype assignment errors. (To consider - it may also be worth utilising this on SNPs too.)

  • A major refactoring of the extremely long bcf_call_gap_prep function. This was done part way through my work, meaning it's very hard to reorder / squash commits from both sides of that refactor commit. You may also wish to diff in two stages - up to that commit and beyond that commit, as otherwise it's hard to see what's changed.

  • get_position() has been rewritten, making it significantly faster on long read data. Previously it was quite easy for a 1/3rd of the entire CPU time to be spent here. Now it's negligible.

  • Improved calling accuracy for long-read technologies. This has mainly been tested on PacBio CCS data. ONT SNP calling is significantly more accurate too, but indels are still a right-off and glacial. There is more work to be done on the indel caller's use of BAQ, which cripples speed on long-read technology.

  • The indel caller now only has one set of probaln_glocal parameters instead of two. I initially removed the second to speed up long-read calling, but it improved it too. It turned out fortuitously to also improve the Illumina data. I assume it helped at one point, but on the more modern data sets it served no purpose (was detrimental).

  • Added -X, --config (although perhaps --platform is better? Feel free to suggest) to mpileup. This is simply a collection of parameter values, tuned to work on that specific name. It could perhaps auto-detect by platform RG fields, but I feel being explicit is likely the safer option.

  • Added a --max-BQ option to clip (rather than discard) high quality bases. This is used to limit overly confident platforms from producing unrealistic SNP calls (I'm looking at you CCS!).

  • Added a --delta-BQ option to permit the neighbouring base quality to be utilised, if lower by a specific delta to the SNP base itself. This helps CCS calling where often incorrect bases have the low confidence in the next base rather than the one in error.

  • Replaced the heuristic for indel allele (REF vs ALT) with a newer, albeit more complex, one. This fixes bcftools call is detecting lots of REF reads that's not in the bam file (small test data set included) #1446 and significantly reduces indel false positive rates.

Some questions to consider:

Should we make -D the default? I think yes, but then how do we define full BAQ? I guess that can become -D or a similar option.

I have no idea what impact this has on rare variants or on many-sample variant calling, as I don't have truth sets for either so have no way of validating this myself. I suggest interested parties offer up their services!

This will fail tests, as the output differs in so many ways. However I'm not really sure I'm the best person to judge which differences matter. I think that's a task for review perhaps. Obviously if we're happy, we can just generate new expected test results to silence it.

@jkbonfield jkbonfield mentioned this pull request Apr 22, 2021
@jkbonfield
Copy link
Contributor Author

Demonstration of the results based on known truth sets (SynDip and GIAB) are in the previous PR here:

#1363 (comment)

#1363 (comment)

#1363 (comment)

I need to redo these again as a few tweaks have happened since, if only to check I haven't created any regressions. When done I'll post here, but if someone has a specific truth set they want testing then do please let me know. Note it must be a truth set, and not just "does it increase TS/TV". My experiments there showed it often makes that metric poorer, while getting more results correct, as the previously missing variants were enriched for low-complexity data.

@jkbonfield
Copy link
Contributor Author

As expected I spotted some more minor code tidyups immediately on posting this, so I'll correct and squash again. My apologies.

Also a minor refactoring of bcf_call_glfgen for speed.

BAQ is the dominant CPU time for mpileup.  However it's not needed for
the bulk of reads.

We delay BAQ until we observe that a read has an overlap with others
containing an indel.  No indels means no BAQ.

Furthermore, if we do have an indel, but this particular read spans
the indel with more than 50bp either side then also don't realign it.
This is based on the observation that most misalignments occur near
the ends of reads where the optimal pair-wise alignment (read vs ref)
differs from the optimal multiple alignment (read vs all others).

Evaluations so far show this is around a 1.5 to 2-fold speed increase
(depending on tuning parameters), while simultaneously slightly
increasing SNP calling accuracy.

The latter appears to be due to BAQ in non-indel regions sometimes
reducing all reads to QUAL 0 in low complexity areas, blocking SNP
calling from working.  SNPs flanking indels also get blocked for
similar reasons.

This has two variants (see REALN_DIST #if) as an exploration of
methods.

This is enabled with a new -D option.
By default it is as before, and -B still disables completely.

If we like how it works though it's possible we could make this the
default and add a +B for all-BAQ, -B for no-BAQ and have the default
inbetween as auto-select BAQ.

TODO: Consider adding a depth filter as low depth portions aren't
optimal.

Honour MPLP_REDO_BAQ in mpileup
The previous calc_mwu_bias returns values in the range 0 to 1.
In experimentation I can find little correlation between these values
and the likelihood of a variant being true or false.  This is odd,
given GATK's MQRankSum does a good job here.

It appears it's the squashing of Mann Whitney U score to this range
which may be penalising it, but perhaps there is something different
in implementations too?  (I started again from scratch based on
Wikipedia.)

This new function takes the same approach as GATK, implementing the
Z-score version where a result of 0 means both distributions are
comparable and above / below zero is used to indicate how extreme the
U score is in either direction.  Under testing appears to offer better
discrimination between true and false SNPs, particularly for MQB.  For
clarity, we have new MQBZ, RPBZ, BQBZ and MQSBZ fields as the range of
values differs so old filtering rules would be inappropriate.

On a test set, this filter rule seemed to work effectively:

    -e "QUAL < $qual || DP>$DP || MQB < -4.5 || RPB > 4 || RPB < -4"

On Chr22 of HG002.GRCh38.60x.bam this gave for 3 different values of
$qual:

SNP          Q>0 /   Q>=0 / Filtered
SNP   TP   39777 /   39777 /   39769
SNP   FP     542 /     542 /     483
SNP   FN    1316 /    1316 /    1324

InDel TP    4998 /    4998 /    4981
InDel FP     983 /     983 /      52
InDel FN    1945 /    1945 /    1962

SNP          Q>0 /   Q>=100 / Filtered
SNP   TP   39777 /   39567 /   39559
SNP   FP     542 /      94 /      71
SNP   FN    1316 /    1526 /    1534

InDel TP    4998 /    3934 /    4297
InDel FP     983 /     756 /      22
InDel FN    1945 /    3009 /    2646

SNP          Q>0 /   Q>=200 / Filtered
SNP   TP   39777 /   38074 /   38068
SNP   FP     542 /      36 /      27
SNP   FN    1316 /    3019 /    3025

InDel TP    4998 /    3200 /    2643
InDel FP     983 /     570 /      10
InDel FN    1945 /    3743 /    4300

Filtered is as per Q>=x but with the additional MQB / RPB filters.

Versus naive QUAL filtering, at Q>=100 it removes 24.5% of the FP
while increasing FN by 0.5%.

Indel filtering is essential to remove the excessive FPs and used:

    -e "IDV < 3 || IMF < 0.08 || DP>$DP || QUAL < $qual/5 || \
        ((5+$qual/6)/(1.01-IMF) < $qual && IDV<(3+$DP*$qual/1000))"

[Something like this needs folding into the indel QUAL score as by
itself it is a very poor indicator of quality.  It's still not up to
par even with it, but that's for a future project.]

NB: Above scores are also with a modified BAQ calculation, to come in
another PR.

The old MWU test also now sports a uni-directional mode:
The old p(U) calc gave eg <-4 and >4 both as near 0, but for some
calculations we expect a bias in one direction and not the other.  Eg
MQB (mapping quality bias) is skewed as REF matching reads have higher
MQUAL and ALT matching reads.

The U and Z scores are now selectable via #define using the same
function, but U is default for compatibility.

Also add a soft-clip MWU test.

This looks for REF / ALT bias in the numbers of soft-clipped bases
near variants.  NB: Unsure if it's helpful currently.
This is a multiplicative factor (albeit 1/x) on the BAQ score in the
indel caller.  Given the constant, arbitrary, cap of 111 this is
more or less equivalent to permitting higher scores.  The effect is
to be more permissive in calling, giving greater recall rate at a cost
of slightly reduced precision.

A value of 1 would match the previous computation, but the new default
is 2 (=> score * 1/2).  On a 30x GIAB HG002 chr1 sample filtering a
Q20 this removes 1/3rd of the false negatives, but at a cost of about
1/3rd more FP.  At 60x it's about -40% FN + 20% FP. (Given the
quantity of true calls hugely outweights false, this is a reasonable
trade.)

Also adjusted the default -h parameter from 100 to 500. This gives a
huge reduction in false negative rates (as much as 3 fold), but also
usually has a slightly beneficial decrease in FP.  This parameter is
used to indicate the likely sequencing error rates in homopolymers,
which (for Illumina instruments at least) have greatly reduced since
the original bcftools indel caller was written.  High -h parameters
though does marginally harm SNP calling precision, but not massively.

Minimum base quality has changed from 13 to 1.  Higher quality clip
doesn't seem to convert to more accurate calls.

Adjusted the minimum number of reads needed to call an indel from 1 to
2, and fraction from 0.2% to 5%.
These values weren't captured in the indel code so we can't
post-filter on them.  It's hard to know what is ALT and what is REF,
but for now we take the naive approach of a read that doesn't have an
indel is REF and ALT otherwise, irrespective of quality of match.

Of these, MQBZ shows some discrimination (< -5 for 30x and < -9 for
150x), but RPBZ and SCBZ have overlapping distributions (although RPBZ
may just about have a slim margin worthy of filtering at the right
end).  That said for both the dist of the true variants extends much
further negative than the dist for the false ones, so while we cannot
remove FP without removing TP, we could at least use this for
computing a refined QUAL value.
These are now computed using their own arrays, as there was some
leakage with the SNP stats.

Also noticed while assessing the new metrics that there is a very high
rate of incorrect indel genotypes, particularly HOM to HET.  After
experimentation, I realised that seqQ governs which candidate types
end up in the genotype, and not indelQ.  seqQ is now capped at indelQ
and no higher.  This has had a huge reduction in GT miscalls.

[With amended compare_vcf.sh script to now validate GT also matches,
in their own output line]

15x GIAB HG002 before:
SNP          Q>0 /   Q>=10 / Filtered
SNP   TP  257554 /  255433 /  255224
SNP   FP    1846 /    1049 /     998
SNP   GT     883 /     853 /     882
SNP   FN    6756 /    8877 /    9086

InDel TP   42021 /   41651 /   40762
InDel FP     709 /     594 /     347
InDel GT    7326 /    7306 /    7242
InDel FN    3583 /    3953 /    4842

15x GIAB HG002 after:
SNP          Q>0 /   Q>=10 / Filtered
SNP   TP  257554 /  255433 /  255224
SNP   FP    1846 /    1049 /     998
SNP   GT     883 /     853 /     882
SNP   FN    6756 /    8877 /    9086

InDel TP   43008 /   42651 /   41723
InDel FP     820 /     706 /     454
InDel GT    1098 /    1087 /    1036
InDel FN    2596 /    2953 /    3881

Note reduction to indel FN too (and modest increase in FP).

It's even more pronounced on deep data.

150x GIAB HG002 before:
SNP          Q>0 /   Q>=10 / Filtered
SNP   TP  262711 /  262628 /  262523
SNP   FP    1947 /    1621 /    1323
SNP   GT     253 /     244 /     239
SNP   FN    1599 /    1682 /    1787

InDel TP   44451 /   44435 /   44384
InDel FP    4617 /    4455 /    1279
InDel GT    6333 /    6327 /    6321
InDel FN    1153 /    1169 /    1220

150x GIAB HG002 after:
SNP          Q>0 /   Q>=10 / Filtered
SNP   TP  262711 /  262628 /  262523
SNP   FP    1947 /    1621 /    1323
SNP   GT     253 /     244 /     239
SNP   FN    1599 /    1682 /    1787

InDel TP   45061 /   45053 /   44992
InDel FP    4798 /    4628 /    1408
InDel GT     448 /     446 /     441
InDel FN     543 /     551 /     612

Here we're seeing a 14 fold reduction in the rate of incorrect indel
genotype assignment, and a 2 fold reduction in false negative rates.

Also adjusted the default indel-bias down again from 2 to 1.  This
tweak was needed before, but the change in scoring function makes it
better again with the default params.
The Short Tandem Repeat from Crumble has been added to the indel
caller.  This is used to calculate the STRBZ INFO metric (not so
useful I think, so maybe to remove) and to skew the indel score
slightly for reads overlapping an indel site that start or end in an
STR.

At low filtering levels I get these results or HG002 chr1:

15x previous:
             Q>0 /   Q>=10 / Filtered
InDel TP   42256 /   41744 /   41082
InDel FP     601 /     540 /     375
InDel GT    1265 /    1234 /    1191
InDel FN    3348 /    3860 /    4522

15x this:
             Q>0 /   Q>=10 / Filtered
InDel TP   42354 /   41806 /   41186
InDel FP     554 /     501 /     343
InDel GT    1180 /    1151 /    1121
InDel FN    3250 /    3798 /    4418

That's a decrease to all of FP, GT and FN.

At 150x previous:
             Q>0 /   Q>=10 / Filtered
InDel TP   44854 /   44838 /   44775
InDel FP    3541 /    3387 /     869
InDel GT     517 /     516 /     507
InDel FN     750 /     766 /     829

At 150x this:
             Q>0 /   Q>=10 / Filtered
InDel TP   44917 /   44897 /   44890
InDel FP     778 /     753 /     750
InDel GT     412 /     409 /     410
InDel FN     687 /     707 /     714

This is also a decrease to all of FP, GT and FN.  In particular
unfiltered FP has a huge drop, so naive usage of bcftools will be
considerably more accurate.

TODO: this clearly works well for indels.  Maybe consider adding it
for SNPs too?
I noticed a few issues while doing this, but they'll be resolved in
subsequent commits as it'd be nigh on impossible to identify the
changes if included here.  This should give identical results to the
previous commit, aiding validation that the refactoring hasn't changed
anything.
The checking for tbeg/tend was incorrect as one is in ref coords and
the other is local to the buffer, making the test basically always
true.  This just accepts that and removes the condition, as the end
result appears to show the code works.
- Remove STRBZ and BQBZ as unuseful stats.

- Only compute stats once per indel rather than per type.

- Replace some pointless mallocs with fixed arrays as the maximum
  number of types was already hard-coded.

- General code tidyups
It only needs to check first few and last few elements to obtain
soft-clip data, as it's trying to turn absolute position / length into
clip-adjusted position / length.

This function can easily be 30%+ of the CPU time when using long read
technologies.
These needs new parameters for BAQ (also analous ones in htslib
realn.c) as the error models differ with most errors in indels instead
of substitutions.  Detection of platform is very crude currently -
just length.
TODO: use @rg PL field and DS containing "CCS".

Protect against attempts to call BAQ on negatived sized regions.
"tend" can become less than "tbeg" as tend is left-justified and tbeg
right-justified when it comes to deletions.  If the entire left..right
portion is contained within a single cigar D op, then tend<tbeg.

Protect against excessive BAQ band widths, which is both very costly
in CPU time (with ONT I saw one single BAQ call take 30 mins), and
uses vast amounts of memory.  For now this is a hard cutoff, but we
may consider making it adjustable.  BAQ has minimal benefit to PacBio
CCS and it's untested on ONT as it triggered the out-of-memory killer.

Also add a maximum read length for BAQ (--max-read-len) so we can
trivially simply disable it for all long-read data while keeping it
enabled on short reads, eg mixed data sets with ONT, PacBio and
Illumina.
This was already done for long reads, but it turns out the second call
is detrimental for the short read data sets I tested too.
There is now a maximum quality value, defaulting to 60 so above
Illumina values.  This is useful or PacBio CCS which produces
abnormally high qualities leading to over-certainty in calls,
especially for incorrect genotype assignment.

SNP base quality is now the minimum of this base qual plus the base
quality either-side with an additional delta (0 to just use MIN of
this / neighbour, but also permitting neighbour to only affect us if
significantly lower). CCS data commonly has a neighbouring low qual
value adjacent to in incorrect but high quality substitution.  However
it turns out this is also beneficial to the HG002 Illumina data.

An example of the effect for Illumina 60x data (with adjustment on min
QUAL to get the same FN rate as the scores are slightly suppressed
now):

Before:
SNP          Q>0 /   Q>=20 / Filtered
SNP   TP  262541 /  262287 /  262283
SNP   FP    2313 /    1442 /    1405
SNP   GT     287 /     269 /     269
SNP   FN    1769 /    2023 /    2027

After:
SNP          Q>0 /   Q>=15 / Filtered
SNP   TP  262503 /  262298 /  262294
SNP   FP    1787 /    1349 /    1312   -6.6%
SNP   GT     283 /     268 /     268   ~=
SNP   FN    1807 /    2012 /    2016   -0.5%

For 31x PacBio CCS, the quality assignment suppression is more
pronounced due to the excessively high QUALs in the bam records, but
the FN/FP tradeoff is the same (again tuning to try and compare with
~= FN):

Before:
SNP          Q>0 /   Q>=47 / Filtered
SNP   TP  263847 /  263693 /  263693
SNP   FP    4908 /    3089 /    3089
SNP   GT     804 /     793 /     793
SNP   FN     463 /     617 /     617

After:
SNP          Q>0 /   Q>=15 / Filtered
SNP   TP  263779 /  263692 /  263692
SNP   FP    3493 /    2973 /    2973   -10%
SNP   GT     509 /     503 /     503   -37%
SNP   FN     531 /     618 /     618   ~=

Finally, added -X,--config STR option to specify sets of parameters
tuned at specific platforms. Earlier experiments showed the indel
caller on PacBio CCS data needs very different gap-open and gap-extend
parameters.  It also benefits a bit from raising minimum qual to 5 and
because this is a new option we also enable partial realignment by
default.

The ONT configuration disables indels (they're *glacial* to compute at
the moment for long-cigar reads and really don't give any real
accuracy at present), disables BAQ, and tweaks min(filter) and
max(clip) qualities.

Man page updates for new options: --max-read-len (-M), --config (-X),
--max-BQ and --delta-BQ.
The fixes in commit 3c1205c while still here, no longer worked in
this branch due to the swapping of seqQ and baseQ.  The fix however
seemed inappropriate and caused other issues, such as samtools#1446.

The new logic is to check p->indel to decide if the read needs b=0
(REF) setting rather than a naive qual check.  This greatly reduces
the false positive rate of indel calling, especially when deep,
however it also increases GT errors a bit.  I've tried to get some
middle ground where the new seqQ used for REF calls isn't mostly the
old qual, but with a partial mix in of the base qual.  I don't like
heuristics, but it works well on the data sets tested and fixes both
the issue addressed by 3c1205c and also samtools#1446, while keeping the FP
reduction.

Beneficial at 30x, 60x and 150x both filtered and unfiltered (huge
win).  At 15x it's a small win for unfiltered and a moderate loss for
filtered, almost entirely due to higher GT error rate.  Possibly this
is filtering rules no longer appropriate, or maybe it's simply not
good on shallower data.  Needs investigation.
For backwards compatibility, the MWU_ZSCORE #ifdef has been replaced
by the -U option.  The default is the new Z score, but we can now
output the same INFO fields as before.
The -D option is now --full-BAQ and reinstates the previous mpileup
default.
@jkbonfield
Copy link
Contributor Author

I tidied up a couple of those commits to remove old debug code or fix some no-longer relevant comments.

I also added a couple new commits.

  • Partial-BAQ is now the default behaviour, as it appears to be a win/win scenario and we'd like bcftools out of the box to be fast and effective. The old full-BAQ has now become the -D option.

  • Added a -U, --mwu-u option to enable the older Mann-Whitney U scoring scale (probabilities between 0 and 1). This also changes the names of the INFO tags back to the former ones too (BQB vs BQBZ). This is to permit backwards compatibility, incase people have existing filters.

A question regarding that, should we undo the uni-directional tweak to MWU that I put in for mapping quality, so it is identical? (Also maybe should it call the old implementation, which still exists but does give slightly different scores I think?)

@jkbonfield
Copy link
Contributor Author

I confirmed I got identical results to my more previous PR, so the mass juggling of commits didn't break things it appears.

I decided to cross-validate my assessments using a different tool, this time RTG's vcf-eval by ploting the produced roc files (without --squash-ploidy so genotype errors are taken into consideration). The VCF has already been filtered with basic rules as per best practices. I'm just using Chr1 of HG002.

SNPs:
image

image

Bcftools improved, albeit mostly due to fixing the default parameters for low coverage. On higher coverage the new code is doing much better, beating GATK even.

Indels:
image

image

This is vastly different. I couldn't zoom up to the more appropriate top-right corner as v1.12 doesn't get remotely close. It's clear indels are still poor in Bcftools. Compared to GATK it simply has too many false positives (reducing precision), however the recall is good. The old code had very poor recall and precision on this data set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcftools call is detecting lots of REF reads that's not in the bam file (small test data set included)
2 participants