-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before #6744
Comments
@nalinigans @mlathara and/or @fleharty care to comment on this one? |
@yl-h One thing that might help narrow down where the issue is a bit...would it be possible to compare the size of the total vcfs between the 4.1.7.0 tool version and the versions where number of records+alleles decreases significantly? Given the significant difference you see, I would expect that if the issue is with Mutect2, then the total size of the vcfs should also decrease significantly. If the issue is with GenomicsDBImport, the size of the vcfs should be roughly comparable. edit: nevermind, @nalinigans pointed out that you see this issue with either version of Mutect2/GenomicsDBImport so the issue is likely on the query end after GenomicsDBImport. |
@yl-h I have created a new branch genomicsdb_6744 that exposes GenomicsDBArgument Collection to CreateSomaticPanelOfNormals. Can you please run the following to help us narrow down the issues?
|
@nalinigans I tested (1) using your branch, i.e. running I got ~100% records and alleles compared to 4.1.7.0 with some change probably due to minor changes in Mutect2 between versions (allele overlap is about 99.6%). As for using 4.1.7.0 M2 + genomicsDB (still running) as input, the PoN output was identical (all records/variants retained) to one created using 4.1.7.0. As you suggested, loss of PoN variants in 4.1.8.0 appears to be caused by the change to VCFCodec. |
Thanks @yl-h for following up. Would it be possible to include an example vcf line that is not showing up with VCFCodec? That will help us debug this issue further. |
Sure thing @nalinigans. I compared the PoNs created with 4.1.7.0 and 4.1.8.0 using 4.1.7.0 M2 and genomicsDB with
giving
Actually, I just noticed that all of the missing records are multiallelic. Some of the retained records (0.8% of all 4.1.7.0 records) are also multiallelic, however. |
@yl-h, thanks for the information. I have been able to reproduce the dropped variant with VCFCodec using two separate gvcf samples with one overlapping multi-allelic interval e.g.
and Sample 2:
Running SelectVariants after GenomicsDBImport with VCFCodec has these lines in the output. Notice that the
That is different from what we get from running SelectVariants with BCFCodec. The GenomicsDB workspace is the same in both these cases.
@lbergelson, @droazen, any idea why |
FWIW htslib seems to handle
Discussing with @kgururaj, he mentions that the VCFCodec in htsjdk uses AbstractVCFCodec:decodeInts which doesn't like '.' and hence drops the full AD field. BCF2Codec on the other hand uses BCF2Decoder:decodeIntArray - it seems to drop elements after the first missing value is seen - htsjdk is not dealing with missing and vector end characters separately (in conflict with the BCF2 spec). |
@lbergelson, should I open an issue with htsjdk/codecs? |
Looks like these issues are known - see samtools/htsjdk#340 , samtools/htsjdk#961 and samtools/hts-specs#232. |
@yl-h just to confirm, your input VCFs (input to GenomicsDBImport) don't have the <NON_REF> allele (primarily for the multi allelic sites), right? |
@kgururaj Correct, there are no <NON_REF> alleles in the VCFs given to GenomicsDBImport in our PoN pipeline, as expected for Mutect2. |
@nalinigans After looking into this issue a bit more, I have some additional comments/questions:
|
They are from the GenomicsDB query stream to CreateSomaticPanelOfNormals. FWIW : both |
@nalinigans For AD, I don't think we care about the distinction between '.' and 0. If AD is the only problematic field, and we're not seeing any issues with PL or any other attribute, then I'd advocate for a simple '.' -> 0 translation (for AD only!) within GenomicsDB, if such a thing is possible. @ldgauthier do you agree? |
Sounds good to me. |
Missing values will appear in all fields whose lengths are of type A, R or G (so PL field also). I'm assuming that missing values in allele specific annotation fields are handled gracefullyby GATK. |
Why do you say "no <NON_REF> alleles exist" @kgururaj ? In a GVCF, NON_REF should always be present, including at multi-allelic sites. |
And if the inputs here are VCFs and not GVCFs, then that is a problem. We've never supported combining regular VCFs using GenomicsDB, have we @ldgauthier ? |
So, the user is running the workflow as outlined here, which would mean they are using VCFs, not GVCFs, right? Does that workflow need to be altered? It seems that Mutect2 does have a (beta) feature that would support emitting GVCFs - should that be used here? |
Yes, @yl-h seems to have followed what was described in CreateSomaticPanelOfNormals Tools Doc. Perhaps the documentation should be changed to include |
@nalinigans @mlathara @kgururaj Yes, we're having an internal discussion now about whether this panel creation workflow needs to be altered. I think the crux of the problem is that we're trying to combine VCFs in GenomicsDB, which is not something we've ever claimed to support. Tagging @fleharty @davidbenjamin and @ldgauthier |
The solution is definitely not to run Mutect2 in GVCF mode. It's too different from VCF mode and has a big performance cost. |
@davidbenjamin @fleharty In that case, I see two remaining options:
|
@droazen I think the BCF code is broken here too. The problem is fundamental to htsjdk. CombineVariants almost certainly has the same or similar problems because it's fundamental to combining vcfs and the fact that htsjdk doesn't handle partially empty lists. Bcftools likely has similar issues. Or loading the correct output from bcftools will recreate the issuue. What about fixing the combine operation so it can substitute default missing values with a per attribute configuration for what value to substitute? |
@lbergelson We're looking for a pragmatic fix that can go out in the next GATK release (this month!). This is our only pipeline (that I know of...) that uses GenomicsDB in an unsupported way to combine VCFs rather than GVCFs. If this pipeline produced reasonable results with the flawed BCF codec behavior, then it might make sense to revert to that for now -- eg., maybe it doesn't rely on the annotation values that get truncated by BCF in any meaningful way (@davidbenjamin and/or @fleharty can hopefully chime in on this point) In the future we could consider making the more comprehensive changes needed to be able to claim that we support combination of VCFs in GenomicsDB, but this would have to be a project for a future quarter, as it's going to involve a significant amount of development work. |
I don't want to use CombineVariants. A) It's got some hinky combine behavior for certain attributes, B) I still don't want to port it to GATK4, and C) it's a slow as cold molasses. I was stapling together about 200 CNV VCFs, which are on the order of 100 variants each. CombineVariants took 11 hours and bcftools was so fast I couldn't push the button on my stopwatch (which is 17 minutes in the cloud with localizing and pulling Docker images and everything). 11 hours is unacceptable. I'm in favor of specifying the BCF codec for the PoN workflow AND adding a small test to Travis -- it can honestly just be like 10 variants with two multi-allelics. We should test the PoN WDL anyway, which I don't think we do anywhere right now (i.e. not in Terra either.) |
@nalinigans pointed out that even if we do switch this pipeline back to the BCF codec, it's still likely to encounter errors with 64-bit values due to some recent changes in GenomicsDB, which now throws an exception for these values instead of silently truncating them as it did in the past. So just switching to the BCF codec in the WDL might not be enough. A couple of other options we discussed today for a quick fix for this issue:
Whatever option we go with, we'll need to add a good regression test for the PoN workflow that would have caught this issue. Longer term, we'll plan on developing a fix at the HTSJDK level for the way missing values are handled for the AD and PL annotations in the VCF codec (@lbergelson is currently looking into how this could be done). @fleharty What are your thoughts? Is there someone on the M2 team who could work on a hotfix for the next GATK release? |
@droazen I am up for making the change. suggested by @nalingans, but do we believe this is sufficient to resolve this issue? |
@fleharty If you can come up with a way to handle genotypes with missing AD in the tool (that does not involve dropping them completely) I believe that would solve the problem, yes. The following code snippet from the tool shows the current behavior, where genotypes get filtered out if AD is missing:
AD will be missing completely if there were any missing values present after combination, due to the issues in HTSJDK discussed above. These missing values are there only because the pipeline is combining VCFs rather than GVCFs -- with GVCFs, you have the NON_REF allele and can fill in these missing values. Back when GenomicsDB used the BCF codec by default, the AD value would get truncated at the first missing value, instead of completely removed as it does with the VCF codec. This suggests that the AD values the tool was seeing were never correct in the first place, and the tool should probably be relying on a different attribute. Only AD and PL are affected by this HTSJDK issue. Is there another attribute you could use instead of AD in Let me know what @davidbenjamin says |
AD is very important because it's what the tool uses to decide if an allele is germline (and should be excluded from the panel) or an artifact (and should be kept in the panel). If it's missing we could instead rely on the GERMQ annotation emitted by FilterMutectCalls or the AF from Mutect2, but would those be combined properly by GenomicsDB? |
@davidbenjamin, looks like you may be able to use As for using |
@nalinigans |
@fleharty Great, let us know whether it turns out to be viable! As a fallback option, if AF doesn't work out, it might be possible to get a fix for just the AD behavior into HTSJDK within a few weeks. My reading of the code suggests that it could be relatively painless, especially if we don't care about the distinction between '.' and 0 for this annotation. A fix for PL, if it's needed, would be more difficult. |
@droazen Specifically: |
@fleharty It's line 810 in that class (https://github.com/samtools/htsjdk/blob/f15bc9d2c0297a1bde6b89aa95cf2dc45dfc567f/src/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java#L810). We need to switch from calling |
@davidbenjamin Has there been any progress on this? |
@davidbenjamin Thanks for the workaround in the 4.1.9.0 release! I tested the updated Using the record and variant counts in 4.1.7.0 as 100% reference, I'm getting 57% more records (all multiallelic) or 142% more variants. No sites from 4.1.7.0 are missing in 4.1.9.0. As a side note, all of the new records have |
@yl-h Thank you for the update! Although we write tests for bug fixes, there is nothing so reassuring as hearing from users. As you noted, the |
@davidbenjamin Sorry, I have a follow-up comment regarding the new sites in 4.1.9.0. Around 79% of the new sites can be emitted by disabling the germline filter in 4.1.7.0. I had a look at the rest of records in The remaining ~39% and ~33% are respectively sites with ALT alleles each supported by one exclusive sample and the rest. |
Bug Report
Affected tool
CreateSomaticPanelOfNormals
Affected version
Tested on version 4.1.8.0 (likely commit 3e921c6, GenomicsDB 1.3.0 #6654)
Description
Panel of normals generated from version 4.1.8.0 has some ~28% less records (~52% less ALT alleles) than one created with 4.1.7.0 (tested at commit 9cc92e3) with all input data and arguments unchanged. The GenomicsDB version does not seem to matter as PoN created running CreateSomaticPanelOfNormals on 4.1.8.0 has the result is about the same regardless of whether GenomicsDBImport was run on 4.1.7.0 or 4.1.8.0. CreateSomaticPanelOfNormals on 4.1.7.0 fails to run on the new GenomicsDBs.
Steps to reproduce
The PoN was created with GRCh38, scattered over chromosomes. Mutect command:
GenomicsDBImport command:
CreateSomaticPanelOfNormals:
Expected behavior
Based on description of the GenomicsDB 1.3.0 update, CreateSomaticPanelOfNormals is expected to behave similarly in 4.1.8.0 as before with the output PoN containing a similar number of variants.
Actual behavior
28% of PoN records (52% alleles) are missing in 4.1.8.0 compared to 4.1.7.0. Although all spanning deletions are dropped in the new version, they account for only a small portion of the loss (6.8% / 52% missing ALT alleles).
The text was updated successfully, but these errors were encountered: