Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter variant tranches removes old filters by default #5042

Merged
merged 1 commit into from
Aug 16, 2018

Conversation

lucidtronix
Copy link
Contributor

This addresses: #4928
@cmnbroad care to review?

@cmnbroad cmnbroad self-assigned this Jul 20, 2018
@codecov-io
Copy link

codecov-io commented Jul 20, 2018

Codecov Report

Merging #5042 into master will increase coverage by 0.121%.
The diff coverage is 94.059%.

@@               Coverage Diff               @@
##              master     #5042       +/-   ##
===============================================
+ Coverage     86.504%   86.625%   +0.121%     
- Complexity     29249     30042      +793     
===============================================
  Files           1814      1813        -1     
  Lines         135585    137605     +2020     
  Branches       15060     15564      +504     
===============================================
+ Hits          117286    119200     +1914     
- Misses         12832     12906       +74     
- Partials        5467      5499       +32
Impacted Files Coverage Δ Complexity Δ
...ellbender/cmdline/StandardArgumentDefinitions.java 0% <ø> (ø) 0 <0> (ø) ⬇️
...der/tools/walkers/vqsr/CNNVariantPipelineTest.java 100% <100%> (ø) 8 <0> (ø) ⬇️
...ender/tools/walkers/filters/VariantFiltration.java 89.076% <100%> (ø) 41 <0> (ø) ⬇️
...ers/vqsr/FilterVariantTranchesIntegrationTest.java 100% <100%> (ø) 6 <1> (+1) ⬆️
...nder/tools/walkers/vqsr/FilterVariantTranches.java 92.105% <89.286%> (-0.526%) 41 <14> (+9)
...itute/hellbender/tools/funcotator/Funcotation.java 33.333% <0%> (-22.222%) 3% <0%> (-2%)
...Sources/gencode/DataProviderForMuc16IndelData.java 80% <0%> (-14.118%) 1% <0%> (ø)
...ead/markduplicates/sparkrecords/EmptyFragment.java 78.571% <0%> (-6.044%) 5% <0%> (ø)
...tools/spark/validation/CompareDuplicatesSpark.java 84.946% <0%> (-4.683%) 24% <0%> (-16%)
...idation/CompareDuplicatesSparkIntegrationTest.java 93.103% <0%> (-3.563%) 5% <0%> (-2%)
... and 72 more

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes, and one question about what the scope of filter removal should be.

@@ -82,6 +82,9 @@
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.")
private String infoKey = GATKVCFConstants.CNN_2D_KEY;

@Argument(fullName = "keep-old-filters", shortName = "keep-old-filters", doc = "Keeps filters already in the VCF.", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may as well use the same command line argument name (invalidate-previous-filters) used by VariantFiltration. The string constant should be moved out of VariantFiltration and into StandardArgumentDefinitions so it can be shared.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, it seems like we're casting a pretty wide net by having an option on this tool that removes all filters, even ones not added by this tool. Should we limit this to only remove the old tranche filters derived from the same key being currently used (i.e., if the user specifies CNN-2D, only remove the CNN-2D tranche filters wheninvalidate-previous-filters this is true), and maybe direct user to VariantFiltration to remove all filters. @ldgauthier @lucitronix opinions ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation for adding this was getting burned by a VCF which had been filtered by VQSR, so I definitely want the ability to remove filters added by other tools, even if it is not the default. My vote is for it to be the default and to emit a warning for every filter removed from the header. @ldgauthier thoughts?

@@ -82,6 +82,9 @@
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.")
private String infoKey = GATKVCFConstants.CNN_2D_KEY;

@Argument(fullName = "keep-old-filters", shortName = "keep-old-filters", doc = "Keeps filters already in the VCF.", optional=true)
private boolean keepOldFilters = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a destructive operation, I think it should require opt-in, rather than be on by default. If we rename it to invalidate-previous-filters, I'd initialize it to false.

@lucidtronix
Copy link
Contributor Author

screenshot 2018-08-01 11 42 49

After seeing plots like the one above, where we have lots of false negative SNPs and lots of false positive indels, it became clear that we should have separate tranche sensitivites for the two variant types. I've added that into this PR. I made the filter removal opt-in as suggested and it now shares the argument with VariantFiltration.

@cmnbroad back to you, sorry for moving the goal line for this PR.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Aug 7, 2018

@lucidtronix Just got back to this and noticed that many of the builds failed. Some look like transient git-lfs issues (so I restarted them), but this one looks real. Can you take a look before I re-review ?

@lucidtronix
Copy link
Contributor Author

@cmnbroad you're right, I forgot to update the CNNVariantPipeline test with the new arguments. Should be fixed now.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good - mostly minor changes requested.

@@ -170,7 +169,7 @@
/**
* Invalidate previous filters applied to the VariantContext, applying only the filters here
*/
@Argument(fullName=INVALIDATE_LONG_NAME,doc="Remove previous filters applied to the VCF",optional=true)
@Argument(fullName=StandardArgumentDefinitions.REMOVE_OLD_FILTERS_LONG_NAME, doc="Remove previous filters applied to the VCF", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the command line interface to VariantFiltration - it would be preferable to make FilterVariantTranches conform to this tool, rather than change VariantFiltration, which is already production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked my pithy argument, but fair point, fixed.

@@ -82,6 +88,9 @@
@Argument(fullName = "info-key", shortName = "info-key", doc = "The key must be in the INFO field of the input VCF.")
private String infoKey = GATKVCFConstants.CNN_2D_KEY;

@Argument(fullName = StandardArgumentDefinitions.REMOVE_OLD_FILTERS_LONG_NAME, doc = "Remove filters already in the VCF.", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth explicitly saying "Remove all filters that already exist in the vcf."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

hInfo.add(new VCFFilterHeaderLine(filterKey, filterDescription));
}
if (removeOldFilters){
hInfo.removeIf(x -> x instanceof VCFFilterHeaderLine);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the right thing to do, but for some reason it doesn't look like VariantFiltration does this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good to check the input header to make sure it has the info header line for the info key to be used. Right now we don't find out there is a mismatch until after a complete first pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


@Argument(fullName="indel-tranche",
shortName="indel-tranche",
doc="The levels of truth sensitivity at which to slice the data. (in percents, i.e. 99.9 for 99.9 percent and 1.0 for 1 percent)",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc strings for snp-tranche and indel-tranche are both the same now - maybe update and specify which each refers to (snp and indel respectively).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -48,8 +50,11 @@ public void testTrancheFilteringDuplicate() throws IOException {
.addArgument(StandardArgumentDefinitions.OUTPUT_LONG_NAME, "%s")
.addArgument("resource", snpTruthVCF)
.addArgument("resource", indelTruthVCF)
.addArgument("tranche", "99.0")
.addArgument("tranche", "99.0")
.addArgument("snp-tranche", "99.5")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the convention of using the tranches in the file names, but now the expected file’s name doesn’t match the tranches. Can we change g94982_20_1m_10m_tranched_99.vcf to g94982_20_1m_10m_tranched_99_99.5.vcf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

.addArgument("snp-tranche", "99.0")
.addArgument("indel-tranche", "99.0")
.addArgument("info-key", "MIX_SMALL_2D_W_DROPOUT")
.addArgument("remove-old-filters", "false")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed, and then it will verify the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

logger.info(String.format("Found %d SNPs %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey));
logger.info(String.format("Found %d SNPs %d indels in the resources.", snpScores.size(), indelScores.size()));
logger.info(String.format("Found %d SNPs and %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey));
logger.info(String.format("Found %d SNPs and %d indels in the resources.", snpScores.size(), indelScores.size()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity can you rename snpScores/indelScores to something like resourceSNPScores/resourceIndelScores or truthSNPScores/truthIndelScores.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


private List<Double> validateTranches(List<Double> tranches){
if (tranches.size() < 1 || tranches.stream().anyMatch(d -> d < 0 || d >= 100.0)){
throw new GATKException("At least 1 tranche value must be given and all tranches must be greater than 0 and less than 100.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should throw CommandLineException.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

logger.info(String.format("Found %d SNPs %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey));
logger.info(String.format("Found %d SNPs %d indels in the resources.", snpScores.size(), indelScores.size()));
logger.info(String.format("Found %d SNPs and %d indels with INFO score key:%s.", scoredSnps, scoredIndels, infoKey));
logger.info(String.format("Found %d SNPs and %d indels in the resources.", snpScores.size(), indelScores.size()));

if (scoredSnps == 0 || scoredIndels == 0 || snpScores.size() == 0 || indelScores.size() == 0){
throw new GATKException("VCF must contain SNPs and indels with scores and resources must contain matching SNPs and indels.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a UserException. GATKException is for unexpected errors in the code that are outside the user’s control, like the one thrown by filterStringFromScore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@lucidtronix
Copy link
Contributor Author

@cmnbroad Thanks for the review, back to you!

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One very minor request, then feel free to merge.

if (variant.hasAttribute(infoKey)) {
final double score = Double.parseDouble((String) variant.getAttribute(infoKey));
if (variant.isSNP() && isTrancheFiltered(score, snpCutoffs)) {
builder.filter(filterStringFromScore(score, snpCutoffs));
builder.filter(filterStringFromScore("SNP", score, snpTranches, snpCutoffs));
Copy link
Collaborator

@cmnbroad cmnbroad Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract string constants for "SNP" and "INDEL" and use them in the calls to filterStringFromScore and addTrancheHeaderFields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@cmnbroad
Copy link
Collaborator

Back to @lucidtronix.

@cmnbroad cmnbroad assigned lucidtronix and unassigned cmnbroad Aug 10, 2018
@cmnbroad
Copy link
Collaborator

cmnbroad commented Aug 10, 2018

@lucidtronix I also just noticed that the tool-level gatkdoc for this tool is incorrect. It has a line that says:

Apply tranche filters based on CNN_1D scores

Might as well fix it while we're in here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants