-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter variant tranches removes old filters by default #5042
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5042 +/- ##
===============================================
+ Coverage 86.504% 86.625% +0.121%
- Complexity 29249 30042 +793
===============================================
Files 1814 1813 -1
Lines 135585 137605 +2020
Branches 15060 15564 +504
===============================================
+ Hits 117286 119200 +1914
- Misses 12832 12906 +74
- Partials 5467 5499 +32
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor changes, and one question about what the scope of filter removal should be.
@@ -82,6 +82,9 @@ | |||
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.") | |||
private String infoKey = GATKVCFConstants.CNN_2D_KEY; | |||
|
|||
@Argument(fullName = "keep-old-filters", shortName = "keep-old-filters", doc = "Keeps filters already in the VCF.", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may as well use the same command line argument name (invalidate-previous-filters
) used by VariantFiltration. The string constant should be moved out of VariantFiltration and into StandardArgumentDefinitions
so it can be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, it seems like we're casting a pretty wide net by having an option on this tool that removes all filters, even ones not added by this tool. Should we limit this to only remove the old tranche filters derived from the same key being currently used (i.e., if the user specifies CNN-2D
, only remove the CNN-2D
tranche filters wheninvalidate-previous-filters
this is true), and maybe direct user to VariantFiltration to remove all filters. @ldgauthier @lucitronix opinions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My motivation for adding this was getting burned by a VCF which had been filtered by VQSR, so I definitely want the ability to remove filters added by other tools, even if it is not the default. My vote is for it to be the default and to emit a warning for every filter removed from the header. @ldgauthier thoughts?
@@ -82,6 +82,9 @@ | |||
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.") | |||
private String infoKey = GATKVCFConstants.CNN_2D_KEY; | |||
|
|||
@Argument(fullName = "keep-old-filters", shortName = "keep-old-filters", doc = "Keeps filters already in the VCF.", optional=true) | |||
private boolean keepOldFilters = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a destructive operation, I think it should require opt-in, rather than be on by default. If we rename it to invalidate-previous-filters
, I'd initialize it to false.
2e3b210
to
9330172
Compare
After seeing plots like the one above, where we have lots of false negative SNPs and lots of false positive indels, it became clear that we should have separate tranche sensitivites for the two variant types. I've added that into this PR. I made the filter removal opt-in as suggested and it now shares the argument with VariantFiltration. @cmnbroad back to you, sorry for moving the goal line for this PR. |
@lucidtronix Just got back to this and noticed that many of the builds failed. Some look like transient git-lfs issues (so I restarted them), but this one looks real. Can you take a look before I re-review ? |
@cmnbroad you're right, I forgot to update the CNNVariantPipeline test with the new arguments. Should be fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good - mostly minor changes requested.
@@ -170,7 +169,7 @@ | |||
/** | |||
* Invalidate previous filters applied to the VariantContext, applying only the filters here | |||
*/ | |||
@Argument(fullName=INVALIDATE_LONG_NAME,doc="Remove previous filters applied to the VCF",optional=true) | |||
@Argument(fullName=StandardArgumentDefinitions.REMOVE_OLD_FILTERS_LONG_NAME, doc="Remove previous filters applied to the VCF", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the command line interface to VariantFiltration - it would be preferable to make FilterVariantTranches conform to this tool, rather than change VariantFiltration, which is already production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked my pithy argument, but fair point, fixed.
@@ -82,6 +88,9 @@ | |||
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.") | |||
private String infoKey = GATKVCFConstants.CNN_2D_KEY; | |||
|
|||
@Argument(fullName = StandardArgumentDefinitions.REMOVE_OLD_FILTERS_LONG_NAME, doc = "Remove filters already in the VCF.", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth explicitly saying "Remove all filters that already exist in the vcf."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
hInfo.add(new VCFFilterHeaderLine(filterKey, filterDescription)); | ||
} | ||
if (removeOldFilters){ | ||
hInfo.removeIf(x -> x instanceof VCFFilterHeaderLine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the right thing to do, but for some reason it doesn't look like VariantFiltration does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it would be good to check the input header to make sure it has the info header line for the info key to be used. Right now we don't find out there is a mismatch until after a complete first pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
@Argument(fullName="indel-tranche", | ||
shortName="indel-tranche", | ||
doc="The levels of truth sensitivity at which to slice the data. (in percents, i.e. 99.9 for 99.9 percent and 1.0 for 1 percent)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc strings for snp-tranche and indel-tranche are both the same now - maybe update and specify which each refers to (snp and indel respectively).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -48,8 +50,11 @@ public void testTrancheFilteringDuplicate() throws IOException { | |||
.addArgument(StandardArgumentDefinitions.OUTPUT_LONG_NAME, "%s") | |||
.addArgument("resource", snpTruthVCF) | |||
.addArgument("resource", indelTruthVCF) | |||
.addArgument("tranche", "99.0") | |||
.addArgument("tranche", "99.0") | |||
.addArgument("snp-tranche", "99.5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the convention of using the tranches in the file names, but now the expected file’s name doesn’t match the tranches. Can we change g94982_20_1m_10m_tranched_99.vcf to g94982_20_1m_10m_tranched_99_99.5.vcf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
.addArgument("snp-tranche", "99.0") | ||
.addArgument("indel-tranche", "99.0") | ||
.addArgument("info-key", "MIX_SMALL_2D_W_DROPOUT") | ||
.addArgument("remove-old-filters", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be removed, and then it will verify the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
logger.info(String.format("Found %d SNPs %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey)); | ||
logger.info(String.format("Found %d SNPs %d indels in the resources.", snpScores.size(), indelScores.size())); | ||
logger.info(String.format("Found %d SNPs and %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey)); | ||
logger.info(String.format("Found %d SNPs and %d indels in the resources.", snpScores.size(), indelScores.size())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity can you rename snpScores/indelScores to something like resourceSNPScores/resourceIndelScores or truthSNPScores/truthIndelScores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
private List<Double> validateTranches(List<Double> tranches){ | ||
if (tranches.size() < 1 || tranches.stream().anyMatch(d -> d < 0 || d >= 100.0)){ | ||
throw new GATKException("At least 1 tranche value must be given and all tranches must be greater than 0 and less than 100."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should throw CommandLineException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
logger.info(String.format("Found %d SNPs %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey)); | ||
logger.info(String.format("Found %d SNPs %d indels in the resources.", snpScores.size(), indelScores.size())); | ||
logger.info(String.format("Found %d SNPs and %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey)); | ||
logger.info(String.format("Found %d SNPs and %d indels in the resources.", snpScores.size(), indelScores.size())); | ||
|
||
if (scoredSnps == 0 || scoredIndels == 0 || snpScores.size() == 0 || indelScores.size() == 0){ | ||
throw new GATKException("VCF must contain SNPs and indels with scores and resources must contain matching SNPs and indels."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a UserException. GATKException is for unexpected errors in the code that are outside the user’s control, like the one thrown by filterStringFromScore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@cmnbroad Thanks for the review, back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One very minor request, then feel free to merge.
if (variant.hasAttribute(infoKey)) { | ||
final double score = Double.parseDouble((String) variant.getAttribute(infoKey)); | ||
if (variant.isSNP() && isTrancheFiltered(score, snpCutoffs)) { | ||
builder.filter(filterStringFromScore(score, snpCutoffs)); | ||
builder.filter(filterStringFromScore("SNP", score, snpTranches, snpCutoffs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract string constants for "SNP" and "INDEL" and use them in the calls to filterStringFromScore
and addTrancheHeaderFields
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Back to @lucidtronix. |
@lucidtronix I also just noticed that the tool-level gatkdoc for this tool is incorrect. It has a line that says:
Might as well fix it while we're in here. |
b3e622a
to
9d2aa2d
Compare
This addresses: #4928
@cmnbroad care to review?