Adds wdl that tests joint VCF filtering tools #7932
Conversation
Codecov Report
@@            Coverage Diff             @@
##   sl_sklearnvarianttrain_scalable   #7932   +/- ##
=====================================================
  Coverage     ?   87.031%
  Complexity   ?   37304
=====================================================
  Files        ?   2238
  Lines        ?   175124
  Branches     ?   18897
=====================================================
  Hits         ?   152412
  Misses       ?   16010
  Partials     ?   6702
Thanks so much for putting this together, @meganshand! Mostly minor comments and questions, but there are a few that get a little deeper into questions of tool/pipeline design that we might want to discuss. At the same time, we can talk about plans/timelines for getting this and the branch it's based off merged.
.github/workflows/gatk-tests.yml
Outdated
@@ -291,7 +291,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-       wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL' ]
+       wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_FILTERING_WDL' ]
This is fine for now, but we may want to be a little more specific than "RUN_FILTERING_WDL" in the future. Any thoughts on alternatives?
(Just to be clear, I think we'd want to specify that this is INFO-annotation-based filtering, but not close the door on using this pipeline for single-sample, somatic, non-human, etc.)
**This directory is for GATK devs only**

This directory contains scripts for running CNN Variant WDL tests in the automated travis build environment.
Update "CNN Variant" as appropriate.
Perhaps also a little blurb about the test data?
@@ -0,0 +1,301 @@
version 1.0
Perhaps some basic docs here? Doesn't have to be lengthy. I'd also note, for example, that the input VCFs are sharded.
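To make the suggestion concrete, a minimal header-docs sketch might look like the following (the input names and comments are illustrative, not necessarily what the PR branch uses):

```wdl
version 1.0

# Filters a joint callset using site-level (INFO) annotations.
# Note: the input VCFs are sharded by genomic interval; the workflow
# scatters over the shards and emits one scored/filtered VCF per shard.
workflow JointVcfFiltering {
    input {
        Array[File] vcfs          # sharded input VCFs, one per interval
        Array[File] vcf_indices   # matching .tbi indices, in the same order
        String basename           # prefix used for all output filenames
    }
}
```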
        input_vcf_index = sites_only_vcf_index,
        mode = "SNP",
        annotations = snp_annotations,
        resources = "-resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz -resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz -resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
Let's extract resources here and for indels. Perhaps also use the `--resource` long name to be consistent with elsewhere.
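A sketch of what that extraction could look like, hoisting the string into a workflow-level input with a default and switching to the `--resource` long form (the input name `snp_resource_args` is an assumption, and this is a fragment, not a complete workflow):

```wdl
input {
    # Default SNP resources; callers can override with their own resource string.
    String snp_resource_args = "--resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz --resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz --resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz"
}

# The call site then just references the input:
call ExtractVariantAnnotations as ExtractSnpAnnotations {
    input:
        mode = "SNP",
        resources = snp_resource_args
}
```

An analogous `indel_resource_args` input would cover the indel branch.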
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m" \
ExtractVariantAnnotations \
Perhaps an additional level of indentation here and in the following lines? And the same elsewhere.
runtime {
    docker: gatk_docker
    disks: "local-disk " + disk_size + " LOCAL"
    memory: "14 GB"
Do we want to expose memory and disk size throughout?
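If we do expose them, one common pattern is to make memory and disk task inputs with defaults, so callers can override per environment. A sketch (the task name, input names, and default values below are assumptions, not the PR's actual values):

```wdl
task ScoreVariantAnnotationsExample {
    input {
        String gatk_docker
        Int memory_mb = 14000    # overridable per call
        Int disk_size_gb = 100   # overridable per call
    }
    command <<<
        echo "memory and disk are now caller-controlled"
    >>>
    runtime {
        docker: gatk_docker
        disks: "local-disk " + disk_size_gb + " LOCAL"
        memory: memory_mb + " MB"
    }
}
```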
    --model-prefix ~{basename} \
    ~{annotations} \
    -mode ~{mode} \
    --resource:extracted-training,training=true,calibration=false ~{extracted_training_vcf} \
Note that the VCF output by the extract tool contains those training and calibration variants that were actually extracted to the HDF5 to be used as input to the training tool---i.e., only those that were in the genomic region over which the extract tool was run, and not any other variants in the training/calibration resources that fall outside of this region.
The purpose of providing this VCF to the score tool is simply to correctly propagate this information to the *.annot.hdf5 output of the score tool (which could be considered secondary to the primary output of the scored/filtered VCF, but is very useful if one wants to do downstream analysis of the annotations). We can represent this information by labeling such extracted variants in the HDF5 with an additional label of our choosing; e.g., for the label "extracted", we would have:
--resource:extracted,extracted=true ~{extracted_training_vcf}
The choice of label is arbitrary, as is the name for the resource (here, also chosen to be "extracted", which is probably preferable to "extracted-training" because variants from the "calibration" set are also included). However, it's incorrect to use this VCF to label everything contained in it as "training", as is currently implemented here.
Hope that all makes sense! Sorry if the details of this have changed from the version you might've pulled this line from.
Another note: unfortunately, this functionality currently only extends to the positive training data. Because the current implementation does reservoir sampling of the unlabeled data, should one choose to extract it for the positive-negative approach, we cannot write the retained/sampled unlabeled data to VCF as we traverse. Furthermore, even the HDF5 output containing the extracted sample of unlabeled data is not currently genomically sorted---we just dump the data in the order that the reservoir happens to end up in.
Ah ok, this wasn't super clear to me. I changed it to extracted as you suggested above.
output {
    File scorer = "~{basename}.~{mode_lc}.scorer.pkl"
    File training_scores = "~{basename}.~{mode_lc}.trainingScores.hdf5"
    File truth_scores = "~{basename}.~{mode_lc}.calibrationScores.hdf5"
When we add the BGMM, the scorer will have a different suffix (*.ser). Furthermore, when running the positive-negative approach, there is an additional *.unlabeledScores.hdf5 output here and an additional *.unlabeled.annot.hdf5 output in the extract (that is used as input here in training).
Do you think we could instead consider the tool input/output interfaces to be specified by the `output-prefix` and `model-prefix` basenames and use corresponding globs? Thus, the WDL task interfaces would better reflect the tool interfaces. Perhaps the only thing we would want to separate out is the extracted VCF (so we can pass it to the score task), in which case we could at least glob the labeled/unlabeled HDF5s output in the extract (hmm, although we might then have to check which HDF5s are present in the training step and build the command line accordingly---so perhaps we should glob everywhere except for the extract outputs...) Feel free to let me know if you need more pointers here!
Alternatively, we could have optional inputs/outputs where necessary to cover all the possible workflow branches, but this might be a pain to test and/or update in the future.
This all stems from the question of what the default/canonical workflow should be (i.e., BGMM or IsolationForest? positive-only or positive-negative? etc.) and how many branches we want to expose.
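For concreteness, the two interface styles under discussion might look roughly like this (the glob pattern reuses the basename/mode convention from the outputs above; the optional-output names are assumptions about the positive-negative branch):

```wdl
# Style 1: treat the prefix as the interface and glob everything the tool writes.
output {
    Array[File] model_files = glob("~{basename}.~{mode_lc}.*")
}

# Style 2: enumerate outputs, marking branch-dependent ones as optional
# so workflow branches that do not produce them still validate.
output {
    File training_scores = "~{basename}.~{mode_lc}.trainingScores.hdf5"
    File calibration_scores = "~{basename}.~{mode_lc}.calibrationScores.hdf5"
    File? unlabeled_scores = "~{basename}.~{mode_lc}.unlabeledScores.hdf5"
}
```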
I think glob makes sense here, but probably not for the Extract task. Does the `"~{basename}.~{mode}.annot.hdf5"` output from Extract change filenames as well?
It seems like eventually we'll want multiple test JSONs with different sets of inputs to test BGMM/IsolationForest and positive-only/positive-negative. But I'd suggest we make an issue for that and add them another day to keep this version on track.
The *.annot.hdf5 output of the extract tool, which contains the (positive) training and calibration data, remains unchanged. However, an additional *.unlabeled.annot.hdf5 output is generated, which contains the unlabeled data separately.
And yes, we can work towards testing a few workflow branches and adding functionality to this initial WDL as we go. See the somatic CNV WDL tests for a rough idea.
"JointVcfFiltering.basename": "test_10_samples",
"JointVcfFiltering.indel_sensitivity_threshold": 97.0,
"JointVcfFiltering.snp_sensitivity_threshold": 99.9,
"JointVcfFiltering.snp_annotations": "-A ReadPosRankSum -A FS -A SOR -A QD -A AVERAGE_TREE_SCORE -A AVERAGE_ASSEMBLED_HAPS -A AVERAGE_FILTERED_HAPS",
Following up on comments elsewhere: would it be possible to use the `AVERAGE_ASSEMBLED_HAPS` and `AVERAGE_FILTERED_HAPS` annotations for INDELs, eventually?
@samuelklee I tried to address your comments and tests are now back to passing so I think this is ready for re-review. Thanks!
Excellent, thanks again @meganshand! As discussed on Slack, just extremely minor comments, which I'm happy to handle after we 1) open, review, and merge the branch containing the tools, and then 2) do any necessary rebasing or adding of features here.
After that, I will also try to put something in CARROT that perhaps marginally expands the functionality of the test done here, just to familiarize myself with where things are at in that framework.
.github/workflows/gatk-tests.yml
Outdated
@@ -350,8 +350,8 @@ jobs:
          echo "Running CNN WDL";
          bash scripts/cnn_variant_cromwell_tests/run_cnn_variant_wdl.sh;

-       - name: "FILTERING_WDL_TEST"
        if: ${{ matrix.wdlTest == 'RUN_FILTERING_WDL' }}
+       - name: "VCF_SITE_LEVEL_FILTERING_WDL"
Thanks for picking this! Perhaps `*_WDL_TEST`, just to be consistent with the other tests.
@@ -28,9 +42,9 @@ workflow JointVcfFiltering {
        input_vcf_index = sites_only_vcf_index,
        mode = "SNP",
        annotations = snp_annotations,
-       resources = "-resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz -resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz -resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
+       resources = snp_training_resource_command_line,
Perhaps just `snp_resource_args` and `indel_resource_args`, so that 1) we don't implicitly focus on training resources (since we can include non-training calibration resources, or even just any resources we want to use to label variants), and 2) we are more consistent with the `*_args` WDL parameters in e.g. the M2 WDL.
@@ -160,6 +160,7 @@ task ExtractVariantAnnotations {
        File annots = "~{basename}.~{mode}.annot.hdf5"
        File extracted_training_vcf = "~{basename}.~{mode}.vcf.gz"
        File extracted_training_vcf_index = "~{basename}.~{mode}.vcf.gz.tbi"
+       Array[File] outputs = glob("~{basename}.~{mode}.*")
Just checking---so we are going to output both the glob and the individual files here?
Another thing I missed in my initial review: let's make sure all outputs of the workflow are exposed, and not just the final VCF.
Yeah, that was my intention (to both glob and have them listed out as individual files). This was to make it easier to import the individual files to other tasks and use them in the command line appropriately.
I addressed the other two easy comments above, but I think I'm going to leave adding all the outputs of the workflow to you. I don't think we want to highlight any outputs that are only intended for debugging or analysis. Those files will still be there since they are delocalized by the tasks, but I'm not sure if we want to keep them around long term (making them outputs of the WDL rather than just "intermediate files" that will eventually be cleaned up). If you think they're all worth keeping though then feel free to add them as final workflow outputs.
No problem, thanks!
We expose all outputs in the CNV WDLs, which makes it easier to use them as subworkflows downstream. I think it’s also worth encouraging the investigation of the intermediate outputs (which are relatively small here) as well.
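As a sketch, exposing everything at the workflow level just means re-exporting the task outputs in the workflow's `output` block (the call and output names here are illustrative, not the PR's actual identifiers):

```wdl
output {
    # Final VCFs plus intermediate outputs, so the workflow composes
    # cleanly as a subworkflow and intermediates remain inspectable.
    Array[File] scored_snp_vcfs = ScoreSnpVariantAnnotations.scored_vcf
    File snp_extracted_annotations = ExtractSnpAnnotations.annots
    File snp_training_scores = TrainSnpModel.training_scores
}
```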
Gotcha, makes sense!
@samuelklee would you like to merge this branch into yours? Or did you want to PR your branch first?
Hmm, I guess I can go ahead and merge into my branch once tests pass.
* adding filtering wdl
* renaming pipeline
* addressing comments
* added bash
* renaming json
* adding glob to extract for extra files
* changing dollar signs
* small comments
…annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)
* adding filtering wdl
* renaming pipeline
* addressing comments
* added bash
* renaming json
* adding glob to extract for extra files
* changing dollar signs
* small comments
* Added changes for specifying model backend and other tweaks to WDLs and environment.
* Added classes for representing a collection of labeled variant annotations.
* Added interfaces for modeling and scoring backends.
* Added a new suite of tools for variant filtering based on site-level annotations.
* Added integration tests.
* Added test resources and expected results.
* Miscellaneous changes.
* Removed non-ASCII characters.
* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)
* Adds wdl that tests joint VCF filtering tools (#7932)
* adding filtering wdl
* renaming pipeline
* addressing comments
* added bash
* renaming json
* adding glob to extract for extra files
* changing dollar signs
* small comments
* Added changes for specifying model backend and other tweaks to WDLs and environment.
* Added classes for representing a collection of labeled variant annotations.
* Added interfaces for modeling and scoring backends.
* Added a new suite of tools for variant filtering based on site-level annotations.
* Added integration tests.
* Added test resources and expected results.
* Miscellaneous changes.
* Removed non-ASCII characters.
* Added documentation for TrainVariantAnnotationsModel and addressed review comments.
* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools.

Co-authored-by: meganshand <mshand@broadinstitute.org>
(#8049)

* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)
* Small changes to JointVCFFiltering WDL
* making default for use_allele_specific_annotations
* addressing comments
* first stab
* wire through WDL changes
* fixed typo
* set model_backend input value
* add gatk_override to JointVcfFiltering call
* typo in indel_annotations
* make model_backend optional
* tabs and spaces
* make all model_backends optional
* use gatk 4.3.0
* no point in changing the table names as this is a POC
* adding new branch to dockstore
* adding in branching logic for classic VQSR vs VQSR-Lite
* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files
* passing classic flag to indel run of CreateFilteringFiles
* Update GvsCreateFilterSet.wdl cleaning up verbiage
* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)
* cleanup up sloppy comment

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>
This adds a small test case for the WDL of the filtering pipeline. This still has SNPs and indels separated out. I can combine them if needed, but we'd like to use different annotations for each mode. This also doesn't actually apply the final filtering (with a threshold), since we still need to add a step to determine the correct threshold. The final VCFs from this workflow should have SCORE INFO annotations for each site.
This takes in an array of VCFs (and outputs an array of VCFs) because this is an option for large callsets in the WARP joint genotyping WDL, which is where this WDL will eventually be integrated.
This test only ensures that the WDL runs and doesn't compare to expected results (the same as the other WDL tests in this repo).
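The array-in/array-out shape described above can be sketched as a WDL scatter. This is a minimal, self-contained illustration; the stub task stands in for the real scoring step, and all names are assumptions:

```wdl
version 1.0

workflow ShardedFiltering {
    input {
        Array[File] vcfs         # sharded input VCFs
        Array[File] vcf_indices  # parallel array of .tbi indices
    }

    # One filtering call per shard; outputs stay parallel to the inputs.
    scatter (i in range(length(vcfs))) {
        call ScoreShard {
            input:
                vcf = vcfs[i],
                vcf_index = vcf_indices[i]
        }
    }

    output {
        Array[File] scored_vcfs = ScoreShard.scored_vcf
    }
}

# Stub standing in for the real scoring/filtering task.
task ScoreShard {
    input {
        File vcf
        File vcf_index
    }
    command <<<
        cp ~{vcf} scored.vcf.gz
    >>>
    output {
        File scored_vcf = "scored.vcf.gz"
    }
}
```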