Poor sensitivity for somatic mutations with VAF below 1% #29
@chapmanb Some of my initial thoughts on this issue:
Dan;
and then can re-run with these on this dataset. This will hopefully give us a baseline and then we can determine if further tweaks will help improve sensitivity/specificity tradeoffs. Thank you again for the help and discussion.
Hi @chapmanb, I've been looking into this more and am now getting much better results. I have a couple of questions which may help me resolve a few remaining issues:
Thanks,
Dan -- thanks so much for this work and apologies for the delay in getting back to you on these:
Thank you again for all this work, looking forward to testing it out when the new version is ready.
Hi @chapmanb I've just released v0.5.0-beta and made a pull request to Bioconda. Amongst various other things, this includes improvements for UMI tumour-only calling. I've created a A couple of minor points:
Looking forward to seeing the updated benchmarks!
@chapmanb I've added a case study of N0261 to the wiki. I've also added a wiki section on random forest training, although I haven't yet looked into using random forests for filtering the UMI calls (due to lack of training data).
Daniel; This has the command line parameters used, which I took from your awesome UMI config. The tweaks I had to make during validation that I know are different from your suggested parameters were reducing downsampling to 2000 and max-haplotypes to 200. This was in response to a number of regions overwhelming the available memory and being killed. Did I destroy sensitivity with these? Do you see other mistakes I made in translating your work over? I'm definitely happy to re-run these with more memory and restored parameters, but wanted to sanity check first that my base implementation is decent and that these are the likely issues. I can also dig into the outputs if it should perform decently with these settings and something else is problematic. Thanks again for all this work and for walking me through translating it over into automated runs.
Hi Brad, Thanks for trying the new version. The main problem with the parameters that you have used is the combination of decreasing downsampling and decreasing To address the memory problems that you encountered, I'd suggest lowering the memory-specific options before adjusting algorithmic parameters:
The first two options are not strictly enforced, so memory use can still exceed the total specified in these three options. I'd start with setting Another thing. How are you comparing variant calls? Octopus emits non-diploid genotypes for somatic variants so that germline and somatic variants can be phased. Some comparison tools won't handle this - especially when comparing to diploid truth sets. In the tutorial I wrote here (note: results may have changed slightly as this was written using v0.5.0-beta - I'll update it soon), I use RTG Tools to compare variant calls. However, I had to fudge Octopus' output to diploid to get it to work. Cheers
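For illustration, the diploid conversion mentioned above could look something like the sketch below. This is not the script used in the tutorial or in bcbio; the function name `to_diploid_gt` and the rule for collapsing more than two distinct alleles are assumptions made for this example, and phasing information is discarded.

```python
import re


def to_diploid_gt(gt: str) -> str:
    """Collapse a VCF GT string of any ploidy to an unphased diploid GT.

    Hypothetical helper for illustration only; it is lossy by design.
    """
    # Split on both phased ('|') and unphased ('/') separators.
    alleles = re.split(r"[/|]", gt)

    # Distinct called alleles, in order of first appearance.
    called = []
    for allele in alleles:
        if allele != "." and allele not in called:
            called.append(allele)

    if not called:
        # Nothing was called: emit a missing diploid genotype.
        return "./."
    if len(called) == 1:
        # One distinct called allele: report it as homozygous.
        return f"{called[0]}/{called[0]}"
    if len(called) > 2:
        # Assumption: with three or more distinct alleles, drop the
        # reference allele and keep the first two alternates.
        called = [a for a in called if a != "0"][:2]

    # Unphased genotypes are conventionally written in ascending order.
    return "/".join(sorted(called, key=int))
```

For example, `to_diploid_gt("0|0|1")` gives `"0/1"`, so a somatic allele on a third haplotype ends up as an ordinary heterozygous call that diploid comparison tools can read.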
Thanks to Dan Cooke for helpful tips in luntergroup/octopus#29
Provides parameter adjustments based on discussions and validation work in luntergroup/octopus#29
https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#vardict-156-octopus-051b
- Parameter adjustments for low frequency and UMI samples based on Octopus configuration.
- Correctly detect UMI BAMs lacking duplicates, even on pre-aligned input BAMs.
- Fix issues with problem '*' alleles in outputs.
- Convert output file GTs into standard diploid calls.
Dan; The main change I had to make was that I couldn't get reasonable memory usage with the recommended parameters and had to revert back to On a related topic, what sort of runtimes do you expect for these? Many of the N13532 regions are quite slow (6+ hours), and it would be great to speed them up, both to iterate faster and to make Octopus easier to use in production. If this is again just a matter of adding more cores/memory I can try, but I wanted to get a baseline for expectations. Thanks also for the tip on converting the GT output. I'd totally missed this and as a first pass added your conversion approach directly to bcbio to get standard GT genotypes. I'd like to more directly support the super useful phased output but am not sure of the best way to do that, as keeping the more complex GTs will make it incompatible with most downstream tools. We've also had this issue with strelka2, and our solution there was to encode this in the INFO field and then add standard genotypes (https://github.com/bcbio/bcbio-nextgen/blob/b2e5d1f8e7a5dffa3eb5a7c3dc29a289ba0f8560/bcbio/variation/strelka2.py#L162). I know VCF and downstream tool support is not really keeping up with the complexity and density of information on changes and phasing you're able to detect. I'd love to bring @ctsa into this discussion as we have also been trying to come up with a good strelka2 specific solution (Illumina/strelka#16). Having this synchronized across callers would at least help with building downstream tools that take advantage of complex phasing and represent germline to somatic changes correctly. Thanks again for all the help with getting this going and I'm excited to keep pushing octopus support forward.
Many thanks for the feedback Brad - I'm glad things are looking a bit better now! Setting As for runtimes, 6 hours or so is in line with what I'm getting. Octopus is very much geared towards high accuracy rather than speed by default. Unfortunately, there are times when the algorithm gets stuck trying to resolve many mismapped reads, leading to dense regions with many haplotypes. I'm yet to find a way to ignore these regions without also ignoring regions with genuine high diversity (e.g. HLA). It is possible to make Octopus go fast with the
Of these, the second has the largest influence on runtime. It controls the length of the haplotypes that Octopus builds - which affects both phasing lengths and accuracy. Looking at your new results, you're still getting quite a few more false positives than I am. I think the reason is that you're considering both germline and somatic variants called by Octopus; Octopus calls both germline and somatic variants in tumour data (paired-normal and tumour-only), and reports the variant classification with the https://github.com/luntergroup/octopus/wiki/Calling-models:-Cancer#qual-vs-pp so you could try to use these statistics to rescue some of the lost sensitivity from misclassification (i.e. by keeping germline variants with low As for the VCF output. I'm really keen for Octopus' representation to be more widely accepted as I believe it solves many of the problems that current conventions cause. The example that Chris gives illustrates some of these problems well. Quite clearly, the tumour data has three segregating haplotypes, yet current convention dictates that we represent the tumour genotype as diploid - leading to ad hoc workarounds. I believe that Octopus would call Chris' example like:
The a) That the tumour has three unique haplotypes. I can imagine an argument against this representation would be that the germline can suffer copy-number changes and is therefore not really diploid, but I would say that this information is better encoded in a separate variable that indicates the copy-number or frequency of each segregating haplotype. Thanks again for the feedback, and for all your work getting Octopus integrated into Bioconda & bcbio.
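To evaluate only the calls that Octopus classifies as somatic, as suggested in this comment, the VCF could be filtered before comparison along the following lines. This is a minimal sketch: it assumes the classification is exposed as a flag in the VCF INFO column, and the default flag name `SOMATIC` is an assumption for this example rather than something confirmed in the thread, so check the Octopus documentation for the actual field. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def has_info_flag(vcf_line: str, flag: str) -> bool:
    """Return True if the INFO column (8th field) of a VCF record has `flag`."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    # INFO entries are ';'-separated and are either KEY=VALUE or a bare KEY.
    keys = {entry.split("=", 1)[0] for entry in info.split(";")}
    return flag in keys


def keep_flagged(lines: Iterable[str], flag: str = "SOMATIC") -> Iterator[str]:
    """Yield header lines unchanged, plus records carrying `flag` in INFO.

    The default flag name is an assumption, not taken from the thread.
    """
    for line in lines:
        if line.startswith("#") or has_info_flag(line, flag):
            yield line
```

Passing the lines of an uncompressed VCF through `keep_flagged` leaves the header intact and drops every record without the flag, so the comparison tool only sees the calls classified as somatic.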
This doesn't seem to help accuracy in general and increases runtimes (#29).
@chapmanb I'm closing this issue as I believe the sensitivity issue has largely been resolved. I've updated the UMI case-study for v0.5.2-beta. Our results agree on sensitivity (TP: 174 vs. 179), but differ in false positives (FP: 57 vs. 120). As I mentioned in my previous post, I believe this is because germline calls are being considered from Octopus in your evaluation (which may also explain the small difference in sensitivity). Many thanks for your feedback and work on this issue, and for the bcbio validations.
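To put the false-positive difference in terms of precision, the counts quoted above can be plugged into TP / (TP + FP). The sketch assumes that the first number in each pair comes from one evaluation and the second from the other; the comment does not label them explicitly, so the variable names below are deliberately neutral.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported calls that are true positives."""
    return true_positives / (true_positives + false_positives)


# Counts as quoted in the closing comment (TP: 174 vs. 179, FP: 57 vs. 120).
first = precision(174, 57)
second = precision(179, 120)

print(f"first: {first:.3f}, second: {second:.3f}")
# prints "first: 0.753, second: 0.599"
```

With nearly identical true-positive counts, the gap in false positives corresponds to a precision of roughly 0.75 against roughly 0.60.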
@chapmanb has reported poor sensitivity for somatic mutations below 1% variant allele frequency in tumour-only mode. I'm opening this issue to move the discussion from a separate VarDict issue that I opened.