Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor sensitivity for somatic mutations with VAF below 1% #29

Closed
dancooke opened this issue Jul 18, 2018 · 11 comments
Closed

Poor sensitivity for somatic mutations with VAF below 1% #29

dancooke opened this issue Jul 18, 2018 · 11 comments
Assignees

Comments

@dancooke
Copy link
Member

@chapmanb has reported poor sensitivity for somatic mutations below 1% variant allele frequency in tumour-only mode. I'm opening this issue to move the discussion from a separate VarDict issue that I opened.

@dancooke
Copy link
Member Author

@chapmanb Some of my initial thoughts on this issue:

  • In 0.4.1-alpha, the QUAL score for variants called in tumours (i.e. both SOMATIC and germline) is actually the posterior probability the variant segregates and the variant is correctly classified. This can lead to low QUAL scores, especially in tumour-only mode where the classification is uncertain. From 0.4.2-alpha,QUAL will be calculated as the probability the variant segregates (marginalising over classification), and the posterior probability will be reported separately as an INFO field. This may improve sensitivity, especially affecting variants filtered with low QUAL.
  • In 0.4.1-alpha, octopus' variant proposal algorithm will only find variants with VAF > ~1% (depending on coverage). This will change in 0.4.2-alpha so that the threshold frequency is determined by the min-expected-somatic-frequency command line option.
  • There are six command line options which directly influence somatic variant sensitivity:
    • somatic-snv-mutation-rate / somatic-indel-mutation-rate: Pretty self explanatory; decrease for better sensitivity at cost of more false positives.
    • min-expected-somatic-frequency: Octopus infers a posterior distribution of haplotype frequencies given the read data. To calculate the posterior probability a somatic haplotype segregates we integrate over frequencies greater than this value. If the true somatic haplotype frequency is lower than this then not much of the mass will be captures in the integral and the posterior mass will be low.
    • min-credible-somatic-frequency: A Bayesian credible region is inferred from this posterior frequency distribution (e.g. if the true VAF is 5% then octopus might infer a credible interval of [2%, 7%]. Haplotypes that have lower bound credible mass above this value (e.g. 2% in given example) are called. This complements the min-expected-somatic-frequency option and is intended to improve specificity. It's possible this option could be removed and only use min-expected-somatic-frequency. We need further validation.
    • credible-mass: The probability mass used to calculate the credible interval used by min-credible-somatic-frequency. Higher mass will widen the inferred frequency credible interval meaning better sensitivity but lower specificity.
    • min-somatic-posterior: The probability the variant segregates and is somatic in the samples. Decrease for better sensitivity.
  • How well are haplotypes respected in the test datasets (i.e. are somatic mutations always correctly phased with germline mutations)? If haplotype are not well respected then Octopus will generally do a poor job as it explicitly models haplotypes (e.g. if it the same mutation occurs on separate germline haplotypes in the reads then it will model this as two separate mutation events).
  • Regarding the question of octopus' downsampling raised here. Octopus is unlikely to call variants with less than three supporting reads. Downsampling to 500x results in an expected 2.5 observations with 0.5% VAF so it is just below that threshold. Downsampling to somewhere between 800-1000x might be a better choice.
  • The preferred way to filter variants in octopus is to use the random forest classifier. The current somatic forest has only been trained on (~60x) data so the current version may not be optimal for high depth data. However, going forward, improving the training set for the random forest is definitely a better option than fine-tuning threshold filters (and involves far less effort).

@dancooke dancooke self-assigned this Jul 18, 2018
@dancooke dancooke modified the milestone: v0.4.2-alpha Jul 18, 2018
@chapmanb
Copy link
Contributor

Dan;
This is incredible, thanks for the detailed breakdown of the different optionals. This answers a lot of questions I wasn't sure about when looking at tweaking the defaults. Based on this I'll:

  • Wait for 0.4.2a with the first two changes.
  • Set both --min-expected-somatic-frequency and --min-credible-somatic-frequency to our lower bound frequency.
  • Increase sampling depth, probably to 2000x (unless you think this will kill runtime).
  • Work on integrating the random forest classifier instead of using the threshold filters.

and then can re-run with these on this dataset. This will hopefully give us a baseline and then we can determine if further tweaks will help improve sensitivity/specificity tradeoffs. Thank you again for the help and discussion.

@dancooke
Copy link
Member Author

Hi @chapmanb,

I've been looking into this more and am now getting much better results. I have a couple of questions which may help me resolve a few remaining issues:

  • Some of the validation sets contain 'clustered' mutations. For example, the M0253 BRAF region has five mutations in 2bp, resulting in five unique haplotypes. While a M0253 KRAS region has 7 mutations over a 5bp window, including all three possible SNV mutations at two bases. Do you know how biologically plausible these cases are? In the former example, any tumour phylogeny will require at least one identical independent mutation. In the latter, all mutations are on separate haplotypes implying a completely flat phylogeny. I'm no expert here but this would surprise me. I can get octopus to call these cases but I have to allow a large number of 'somatic haplotypes' (--max-somatic-haplotypes). This essentially determines the maximum 'ploidy' of the tumour. Increasing this results in a considerable runtime/memory performance hit as there are potentially an exponential number of possible 'somatic genotypes' in the ploidy.
  • How are germline variants factored into the analysis as the truth sets only report somatic mutations? Are callers expected to only report somatic mutations? By default octopus will call both germline and somatics (with classification). Sometimes misclassification errors are made, especially around moderate VAFs, but this doesn't seem as bad as making an outright false variant call to me. Is there some way to separate misclassification errors versus call errors into the analysis?

Thanks,
Dan

@chapmanb
Copy link
Contributor

chapmanb commented Sep 4, 2018

Dan -- thanks so much for this work and apologies for the delay in getting back to you on these:

  • M0253 is a Qiagen panel (https://www.qiagen.com/us/shop/sample-technologies/dna/genomic-dna/qiaseq-targeted-dna-panels/?catno=DHS-101Z) so you're right that is has some non-standard characteristics. Your description of how to tune for more sensitivity at the cost of runtime makes good sense, and I think sticking with biologically reasonable defaults which keep the computation feasible is the right thing to do rather than chasing these too much.

  • These truth sets are only for somatic, so we don't factor germline in. Calling germline is nice but only some callers like VarDict attempt to do it alongside somatic, while others focus only on somatic. Considering germline/somatic misclassification separately is a good idea, I'll look into working that into the stratifications and interpretation.

Thank you again for all this work, looking forward to testing it out when the new version is ready.

@dancooke
Copy link
Member Author

dancooke commented Sep 17, 2018

Hi @chapmanb

I've just released v0.5.0-beta and made a pull request to Bioconda. Amongst various other things, this includes improvements for UMI tumour-only calling.

I've created a configs directory in the top-level source tree that contains a configuration file that I put together for UMI calling: UMI.config. This sets various options that improve somatic calling accuracy in UMI data. You can pass this to the --config command line option.

A couple of minor points:

  • I've set the --max-somatic-haplotypes option to 5 in the config file. This is to ensure the clustered somatics in M0253 are called, but it does also result in a very slight accuracy improvement in the other two samples. However, as I mentioned before this does come with a large runtime penalty so you may want to lower it if you find runtimes are too long.
  • You may also want to set the --target-working-memory option to limit peak memory use. The inference model that Octopus uses for somatic calling has an optimisation that makes heavy use of memory. The optimisation is turned on by default as it is doesn't usually cause any issues with WGS data, however, for UMI calling it can result in very high memory use. Setting --target-working-memory will automatically disable the optimisation whenever the memory use goes over the given limit (e.g. 30g). I've added this option to the config file but commented it out.
  • There are two regions (6:30,671,975-30,673,625 and 7:100,673,700-100,687,839) in NA13532 that Octopus struggles to get through due to many systematic errors. Normally, these regions are not an issue, but with the ultra-sensitive UMI setup they can really slow things down. I'm working on a way to better handle these cases, but for now I'd recommend manually skipping them entirely with the --skip-regions or --skip-regions-file option.

Looking forward to seeing the updated benchmarks!

@dancooke
Copy link
Member Author

@chapmanb I've added a case study of N0261 to the wiki. I've also added a wiki section on random forest training, although I haven't yet looked into using random forests for filtering the UMI calls (due to lack of training data).

@chapmanb
Copy link
Contributor

chapmanb commented Oct 9, 2018

Daniel;
Wow, thank you for all this work. I've been working on incorporating all these improvements and ran a validation on 0.5.1 beta on the same low frequency test set we've been working on. Unfortunately I must not have some key parameters set as I still ran into the issue of not being sensitive enough to detect these low frequency events:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#vardict-156-octopus-051b

This has the command line parameters used, which I took from your awesome UMI config. The tweaks I had to make during validation that I know are different from your suggested parameters were reducing downsampling to 2000 and max-haplotypes to 200. This was in response to a number of regions overwhelming the available memory and being killed. Did I destroy sensitivity with these? Do you see other mistakes I made in translating your work over? I'm definitely happy to re-run these with more memory and restored parameters but wanted to sanity check first that my base implementation is decent and these are the likely issues. I can also dig into outputs as well if it should perform decently with these settings and something else is problematic.

Thanks again for all this work and walking me through translating over into automated runs.

@dancooke
Copy link
Member Author

dancooke commented Oct 9, 2018

Hi Brad,

Thanks for trying the new version. The main problem with the parameters that you have used is the combination of decreasing downsampling and decreasing --min-expected-somatic-frequency. This results in a very large number of candidate variants that overwhelms the model - I found many cases of apparently systematic low-frequency errors in these datasets that I haven't noticed elsewhere. To avoid generating too many candidates variants I had to limit --min-expected-somatic-frequency to 0.04. From there, increasing the downsampling limits should only improve accuracy, although I didn't see much improvement beyond 4000.

To address the memory problems that you encountered, I'd suggest lowering the memory-specific options before adjusting algorithmic parameters:

  • --target-working-memory. Controls some dynamic algorithmic working memory. This is set to 20G in the UMI config, and your copied parameters.
  • --target-read-buffer-footprint. Controls read buffering. Defaults to 6G.
  • --max-reference-cache-footprint. Controls reference buffering. Defaults too 500M.

The first two options are not strictly enforced, so memory use can still exceed the total specified in these three options. I'd start with setting --target-working-memory=5g --target-read-buffer-footprint=5g, which will hopefully keep memory use below your 30G limit.

Another thing. How are you comparing variant calls? Octopus emits non-diploid genotypes for somatic variants so germline and somatic variants can be phased. Some comparison tools won't handle this - especially when comparing to diploid truth sets. In the tutorial I wrote here (note: results may have changed slightly as this was written using v0.5.0-beta - I'll update it soon), I use RTG Tools to compare variant calls. However, I had to fudge Octopus' output to diploid to get it to work.

Cheers
Dan

chapmanb added a commit to bcbio/bcbio_validations that referenced this issue Oct 15, 2018
chapmanb added a commit to bcbio/bcbio-nextgen that referenced this issue Oct 15, 2018
Provides parameter adjustments based on discussions and validation work in
luntergroup/octopus#29

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#vardict-156-octopus-051b

- Parameter adjustments for low frequency and UMI samples based
  on Octopus configuration
- Correctly detected UMI bams lacking duplicates even on pre-aligned input BAMs.
- Fix issues with problem '*' alleles in outputs.
- Convert output file GTs into standard diploid calls.
@chapmanb
Copy link
Contributor

Dan;
Thank you for these incredibly helpful tips on both memory usage and validation. This saved me so much time in trying parameters and iterating. I've got a good validation I'm happy with as a starting point. octopus isn't quite as sensitive as VarDict, which is the most sensitive caller I've got right now on these type of data, but compares favorably and picks up some events VarDict filters out:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#vardict-156-octopus-051b

The main change I had to make was that I couldn't get reasonable memory usage with the recommended parameters and had to revert back to --max-haplotypes 200. With the more difficult N13532 sample octopus would still get killed due to memory usage using 400. How much would upping this back to 400 help with sensitivity? Do I just need more cores/memory? I could definitely do but was trying as a first pass to get it going under this set of constraints before scaling out. Here are the parameters we're using:

https://github.com/bcbio/bcbio-nextgen/blob/b2e5d1f8e7a5dffa3eb5a7c3dc29a289ba0f8560/bcbio/variation/octopus.py#L102

On a related topic, what sort of runtimes do you expect for these? Many of the N13532 regions are quite slow (6+ hours) and would be great to speed up to both iterate faster and make it easier to use in production. If this is again just a matter of adding more cores/memory I can try but wanted to get a baseline for expectations.

Thanks also for the tip on converting the GT output. I'd totally missed this and as a first pass added in your conversion approach directly to bcbio to get standard GT genotypes. I'd like to more directly support the super useful phased output but am not sure the best way to do that as keeping the more complex GTs will make it incompatible with most downstream tools. We've also had this issue with strelka2 and our solution there was to encode this in the INFO field and then add standard genotypes (https://github.com/bcbio/bcbio-nextgen/blob/b2e5d1f8e7a5dffa3eb5a7c3dc29a289ba0f8560/bcbio/variation/strelka2.py#L162).

I know VCF and downstream tool support is not really keeping up with the complexity and density of information on changes and phasing you're able to detect. I'd love to bring @ctsa into this discussion as we have also been trying to come up with a good strelka2 specific solution (Illumina/strelka#16). Having this synchronized across callers would at least help with building downstream tools that take advantage of complex phasing and representing germline to somatic changes correctly.

Thanks again for all the help with getting this going and I'm excited to keep pushing octopus support forward.

@dancooke
Copy link
Member Author

Many thanks for the feedback Brad - I'm glad things are looking a bit better now!

Setting --max-haplotypes 200 doesn't appear to make a big difference - I ran a test on N0261 and results were actually slightly better than with --max-haplotypes 400. I was primarily using M0253 for development which I think had a few cases where --max-haplotypes 400 helped, but then M0253 does have the weird haplotype structure that I mentioned before so that could have been the issue. I'll leave max-haplotypes at the default (200) in the UMI config in the next release.

As for runtimes, 6 hours or so is in line with what I'm getting. Octopus is very much geared towards high accuracy rather than speed by default. Unfortunately, there are times when the algorithm gets stuck trying to resolve many missmapped reads leading to dense regions with many haplotypes. I'm yet to find a way to ignore these regions without also ignoring regions with genuine high diversity (e.g. HLA).

It is possible to make Octopus go fast with the --fast and --very-fast options - which produce runtimes similar to Platypus on germline data - however, this will almost certainly result in a noticeable loss of accuracy. The --fast option just sets three options:

  • --max-haplotypes 50
  • --phasing-level=minimal
  • --assembly-candidate-generator=off

Of these, the second has the largest influence on runtime. It controls the length of the haplotypes that Octopus builds - which affects both phasing lengths and accuracy.

Looking at your new results, you're still getting quite a few more false positives than I am. I think the reason is that you're considering both germline and somatic variants called by Octopus; Octopus calls both germline and somatic variants in tumour data (paired-normal and tumour-only), and reports the variants classification with the SOMATIC INFO field. Since germline variants are not in the validation sets, I believe that you only want somatic variants. You can either ask Octopus to remove germline calls with the --somatics-only option, or filter them manually by just selecting calls with the SOMATIC INFO field. Note that you will likely loose a bit of sensitivity after doing this as some true positives will have been called but misclassified as germline variants. Octopus does also report a 'confidence' in classification:

https://github.com/luntergroup/octopus/wiki/Calling-models:-Cancer#qual-vs-pp

so you could try to use these statistics to rescue some of the lost sensitivity from misclassification (i.e. by keeping germline variants with low PP but high QUAL), but then you will likely see some of the false positives caused by real germline variants again.

As for the VCF output. I'm really keen for Octopus' representation to be more widely accepted as I believe it solves many of the problems that current conventions cause. The example that Chris gives is an excellent example of some of these problems. Quite clearly, the tumour data has three segregating haplotypes, yet current convention dictates that we represent the tumour genotype as diploid - leading to ad hoc workarounds. I believe that Octopus would call Chris' example like:

REF	ALT	QUAL	FILTER	INFO	FORMAT NORMAL	TUMOUR
AT	A,*	.	.	.	GT	0|1	0|1|2
ATT	A	.	.	SOMATIC	GT	0|.	0|.|1

The * and . are explained here. This representation clearly shows:

a) That the tumour has three unique haplotypes.
b) Which alleles are somatic and which are germline.
c) Which bases of which haplotypes are reference and non-reference.

I can imagine an argument against this representation would be that the germline can suffer copy-number changes and is therefore not really diploid, but I would say that this information is better encoded in a separate variable that indicates the copy-number or frequency of each segregating haplotype.

Thanks again for the feedback, and for all your work getting Octopus integrated into Bioconda & bcbio.

dancooke pushed a commit that referenced this issue Oct 16, 2018
This doesn't seem to help accuracy in general and increases runtimes (#29).
@dancooke
Copy link
Member Author

dancooke commented Nov 5, 2018

@chapmanb I'm closing this issue as I believe the sensitivity issue has largely been resolved. I've updated the UMI case-study for v0.5.2-beta. Our results agree on sensitivity (TP: 174 vs. 179), but have a difference in false positives (FP: 57 vs. 120). As I mentioned in my previous post, I believe this is because germline calls are being considered from Octopus in your evaluation (which may also explain the small difference in sensitivity). Many thanks for your feedback and work on this issue, and for the bcbio validations.

@dancooke dancooke closed this as completed Nov 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants