Added a new suite of tools for variant filtering based on site-level annotations. #7954

Merged: samuelklee merged 10 commits into master from sl_sklearnvarianttrain_scalable on Aug 9, 2022

samuelklee (Contributor) commented Jul 21, 2022

This adds the following tools, which supplant the VQSR workflow: ExtractVariantAnnotations, TrainVariantAnnotationsModel, and ScoreVariantAnnotations. See meta issue #7724.

codecov bot commented Jul 21, 2022

Codecov Report

Merging #7954 (32ce261) into master (f1e7265) will decrease coverage by 0.030%.
The diff coverage is 83.582%.

❗ Current head 32ce261 differs from pull request most recent head 9de8b1c. Consider uploading reports for the commit 9de8b1c to get more accurate results

@@               Coverage Diff               @@
##              master     #7954       +/-   ##
===============================================
- Coverage     86.689%   86.659%   -0.030%     
- Complexity     38394     38771      +377     
===============================================
  Files           2308      2328       +20     
  Lines         180119    181659     +1540     
  Branches       19823     19946      +123     
===============================================
+ Hits          156143    157424     +1281     
- Misses         17036     17239      +203     
- Partials        6940      6996       +56     
Impacted Files Coverage Δ
...hellbender/tools/copynumber/CollectReadCounts.java 85.484% <ø> (ø)
...ools/copynumber/CreateReadCountPanelOfNormals.java 89.831% <ø> (ø)
...e/hellbender/tools/copynumber/utils/HDF5Utils.java 79.787% <ø> (ø)
...scalable/modeling/BGMMVariantAnnotationsModel.java 0.000% <0.000%> (ø)
...calable/modeling/BGMMVariantAnnotationsScorer.java 0.000% <0.000%> (ø)
...oadinstitute/hellbender/utils/NaturalLogUtils.java 77.143% <0.000%> (ø)
...ls/clustering/BayesianGaussianMixtureModeller.java 0.000% <0.000%> (ø)
.../tools/walkers/vqsr/scalable/data/VariantType.java 60.000% <60.000%> (ø)
.../walkers/vqsr/scalable/SystemCommandUtilsTest.java 60.870% <60.870%> (ø)
.../scalable/data/LabeledVariantAnnotationsDatum.java 72.222% <72.222%> (ø)
... and 20 more
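
As a sanity check on the report above, the percentages can be reproduced from the hit and line counts in the diff. This is a minimal sketch that assumes coverage is simply hits divided by lines; Codecov's exact formula and rounding are not stated in this thread, so treat it as an illustration rather than a description of how the bot computes its numbers.

```python
# Counts copied from the Coverage Diff table (master vs. #7954).
master = {"lines": 180119, "hits": 156143, "misses": 17036, "partials": 6940}
pr = {"lines": 181659, "hits": 157424, "misses": 17239, "partials": 6996}


def coverage_percent(counts):
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * counts["hits"] / counts["lines"]


def is_consistent(counts):
    """Every line should be counted exactly once as a hit, miss, or partial."""
    return counts["hits"] + counts["misses"] + counts["partials"] == counts["lines"]


before = coverage_percent(master)
after = coverage_percent(pr)
delta = after - before

print(f"master: {before:.3f}%  PR: {after:.3f}%  change: {delta:+.3f}%")
```

Under that assumption the script reproduces the reported 86.689%, 86.659%, and -0.030% figures, and the hit, miss, and partial counts sum to the line totals on both sides.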

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments
@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch 2 times, most recently from 1cbd55b to 3e00758 Compare July 28, 2022 13:44
@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch from 3e00758 to 1e0da0e Compare July 28, 2022 13:49
samuelklee (Contributor, Author) commented Jul 28, 2022

I still need to finish up the tool-level Javadocs for the TrainVariantAnnotationsModel tool. But since I'll be off on vacation until the end of the week, I wanted to go ahead and open this up for review.

There's a lot here, but not too much of it is production code (<2k LOC). I've split things up into commits that should hopefully make it easier to review. The first commit contains the WDL added in #7932 and has already been reviewed by me, although it may benefit from a second pass. The second commit updates that WDL to account for some changes I added after review.

There are TODOs scattered throughout the code, but some of them are intentionally left as an exercise for future developers. See the meta issue linked above to get an idea of what might be appropriate to leave to future work. Also note that tools are marked BETA, so there’s certainly room for improvement or changes!

There are also stubs throughout for the BGMM implementation, which will be added in a separate PR. Hopefully we can get some ML club reviewers then.

@meganshand @droazen @davidbenjamin mind taking a look or suggesting other reviewers? I would hope that we can get this in by the next release after the other flow-based methods are released, since the IsolationForest filtering method added here is also used in that pipeline. It would also be nice to get this merged by the next release to keep us on track on the malaria side.

@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch from 1e0da0e to 122cb18 Compare July 28, 2022 14:04
samuelklee (Contributor, Author) commented

Also, if the WDL-generation and tab-completion tests continue to fail, I’ll address it after I get back. I think this has something to do with non-ASCII characters in the Javadocs, but I thought I had gotten all of them…
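
One way to hunt down stray non-ASCII characters like the ones suspected above is a small scan over the source tree. The sketch below is illustrative only: the function names, the `*.java` glob, and the assumption that files are UTF-8 encoded are all hypothetical choices, not part of GATK or its test suite.

```python
from pathlib import Path


def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character.

    Line and column numbers are 1-based.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


def scan_tree(root, pattern="*.java"):
    """Map each matching file under root to its non-ASCII hits.

    Files without any non-ASCII characters are omitted. Assumes UTF-8 input.
    """
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        hits = find_non_ascii(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree("src/main/java")` (a hypothetical path) would return a dictionary keyed by file, which makes it easy to spot typographic quotes or ellipses pasted into Javadoc comments.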

@meganshand meganshand self-requested a review July 28, 2022 14:26
@broadinstitute broadinstitute deleted a comment from gatk-bot Jul 28, 2022

meganshand (Contributor) left a comment

This looks great @samuelklee! I thought the layout of classes was very clear and the testing was thorough. I had a few questions and comments. Also, I noticed a few places where there were temp files to read and write data within the tool and I was surprised that was necessary.

davidbenjamin (Contributor) left a comment

This looks really clean. The documentation is excellent.

@broadinstitute broadinstitute deleted a comment from gatk-bot Aug 3, 2022
@broadinstitute broadinstitute deleted a comment from gatk-bot Aug 5, 2022
@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch from 32ce261 to b3d6fec Compare August 8, 2022 16:19

samuelklee (Contributor, Author) commented

OK, thanks for the thorough reviews, @meganshand and @davidbenjamin! I think I've either addressed everything or left it as a TODO; I'll break the TODOs out into issues (or at least add them to the meta issue) later today. I also added the docs for the training tool.

Apologies for the slight delay; I had to get my eyes dilated on Friday morning and was completely useless for the rest of the day!

@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch from b3d6fec to d853337 Compare August 8, 2022 16:23
@samuelklee samuelklee force-pushed the sl_sklearnvarianttrain_scalable branch from d853337 to 9de8b1c Compare August 9, 2022 15:31

meganshand (Contributor) left a comment

👍 Thanks @samuelklee! The added documentation for the training tool looks great.

@samuelklee samuelklee merged commit 05a7634 into master Aug 9, 2022
@samuelklee samuelklee deleted the sl_sklearnvarianttrain_scalable branch August 9, 2022 17:11
rsasch pushed a commit that referenced this pull request Oct 17, 2022
…annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>
koncheto-broad added a commit that referenced this pull request Feb 2, 2023
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)

* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)

* Small changes to JointVCFFiltering WDL

* making default for use_allele_specific_annotations

* addressing comments

* first stab

* wire through WDL changes

* fixed typo

* set model_backend input value

* add gatk_override to JointVcfFiltering call

* typo in indel_annotations

* make model_backend optional

* tabs and spaces

* make all model_backends optional

* use gatk 4.3.0

* no point in changing the table names as this is a POC

* adding new branch to dockstore

* adding in branching logic for classic VQSR vs VQSR-Lite

* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files

* passing classic flag to indel run of CreateFilteringFiles

* Update GvsCreateFilterSet.wdl

cleaning up verbiage

* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)

* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>