Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds wdl that tests joint VCF filtering tools #7932

Merged
merged 8 commits into from
Jul 19, 2022

Conversation

meganshand
Copy link
Contributor

This adds a small test case for the WDL of the filtering pipeline. This still has indels and snps separated out. I can combine them if needed, but we'd like to use different annotations for each mode. This also doesn't actually apply the final filtering (with a threshold) since we still need to add a step to determine the correct threshold. The final VCFs from this workflow should have SCORE INFO annotations for each site.

This takes in an array of VCFs (and outputs an array of VCFs) because this is an option for large callsets in the WARP joint genotyping WDL which is where this WDL will eventually be integrated.

This test only ensures that the WDL runs and doesn't compare to expected results (the same as the other WDL tests in this repo).

@meganshand meganshand requested a review from samuelklee July 8, 2022 18:23
@samuelklee samuelklee self-assigned this Jul 8, 2022
@codecov
Copy link

codecov bot commented Jul 8, 2022

Codecov Report

❗ No coverage uploaded for pull request base (sl_sklearnvarianttrain_scalable@16e686c). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head a894c2a differs from pull request most recent head 93eebb0. Consider uploading reports for the commit 93eebb0 to get more accurate results

@@                         Coverage Diff                         @@
##             sl_sklearnvarianttrain_scalable     #7932   +/-   ##
===================================================================
  Coverage                                   ?   87.031%           
  Complexity                                 ?     37304           
===================================================================
  Files                                      ?      2238           
  Lines                                      ?    175124           
  Branches                                   ?     18897           
===================================================================
  Hits                                       ?    152412           
  Misses                                     ?     16010           
  Partials                                   ?      6702           

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for putting this together, @meganshand! Mostly minor comments and questions, but there are a few that get a little deeper into questions of tool/pipeline design that we might want to discuss. At the same time, we can talk about plans/timelines for getting this and the branch it's based off merged.

@@ -291,7 +291,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL' ]
wdlTest: [ 'RUN_CNV_GERMLINE_COHORT_WDL', 'RUN_CNV_GERMLINE_CASE_WDL', 'RUN_CNV_SOMATIC_WDL', 'RUN_M2_WDL', 'RUN_CNN_WDL', 'RUN_FILTERING_WDL' ]
Copy link
Contributor

@samuelklee samuelklee Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but we may want to be a little more specific than "RUN_FILTERING_WDL" in the future. Any thoughts on alternatives?

(Just to be clear, I think we'd want to specify that this is INFO-annotation-based filtering, but not close the door on using this pipeline for single-sample, somatic, non-human, etc.)


**This directory is for GATK devs only**

This directory contains scripts for running CNN Variant WDL tests in the automated travis build environment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update CNN Variant as appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also a little blurb about the test data?

@@ -0,0 +1,301 @@
version 1.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps some basic docs here? Doesn't have to be lengthy. I'd also note, for example, that the input VCFs are sharded.

input_vcf_index = sites_only_vcf_index,
mode = "SNP",
annotations = snp_annotations,
resources = "-resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz -resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz -resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
Copy link
Contributor

@samuelklee samuelklee Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract resources here and for indels. Perhaps also use the --resource long name to be consistent with elsewhere.

scripts/filtering_wdl/JointVcfFiltering.wdl Outdated Show resolved Hide resolved
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

gatk --java-options "-Xmx${command_mem}m" \
ExtractVariantAnnotations \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps an additional level of indentation here and in the following lines? And the same elsewhere.

runtime {
docker: gatk_docker
disks: "local-disk " + disk_size + " LOCAL"
memory: "14 GB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to expose memory and disk size throughout?

--model-prefix ~{basename} \
~{annotations} \
-mode ~{mode} \
--resource:extracted-training,training=true,calibration=false ~{extracted_training_vcf} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the VCF output by the extract tool contains those training and calibration variants that were actually extracted to the HDF5 to be used as input to the training tool---i.e., only those that were in the genomic region over which the extract tool was run, and not any other variants in the training/calibration resources that fall outside of this region.

The purpose of providing this VCF to the score tool is simply to correctly propagate this information to the *.annot.hdf5 output of the score tool (which could be considered secondary to the primary output of the scored/filtered VCF, but is very useful if one wants to do downstream analysis of the annotations). We can represent this information by labeling such extracted variants in the HDF5 with an additional label of our choosing; e.g., for the label "extracted", we would have:

--resource:extracted,extracted=true ~{extracted_training_vcf}

The choice of label is arbitrary, as is the name for the resource (here, also chosen to be "extracted", which is probably preferable to "extracted-training" because variants from the "calibration" set are also included). However, it's incorrect to use this VCF to label everything contained in it as "training", as is currently implemented here.

Hope that all makes sense! Sorry if the details of this have changed from the version you might've pulled this line from.

Another note: unfortunately, this functionality currently only extends to the positive training data. Because the current implementation does reservoir sampling of the unlabeled data, should one choose to extract it for the positive-negative approach, we cannot write the retained/sampled unlabeled data to VCF as we traverse. Furthermore, even the HDF5 output containing the extracted sample of unlabeled data is not currently genomically sorted---we just dump the data in the order that the reservoir happens to end up in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, this wasn't super clear to me. I changed it to extracted as you suggested above.

output {
File scorer = "~{basename}.~{mode_lc}.scorer.pkl"
File training_scores = "~{basename}.~{mode_lc}.trainingScores.hdf5"
File truth_scores = "~{basename}.~{mode_lc}.calibrationScores.hdf5"
Copy link
Contributor

@samuelklee samuelklee Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we add the BGMM, the scorer will have a different suffix (*.ser). Furthermore, when running the positive-negative approach, there is an additional *.unlabeledScores.hdf5 output here and an additional *.unlabeled.annot.hdf5 output in the extract (that is used as input here in training).

Do you think we could instead consider the tool input/output interfaces to be specified by the output-prefix and model-prefix basenames and use corresponding globs? Thus, the WDL task interfaces would better reflect the tool interfaces. Perhaps the only thing we would want to separate out is the extracted VCF (so we can pass it to the score task), in which case we could at least glob the labeled/unlabeled HDF5s output in the extract (hmm, although we might then have to check which HD5s are present in the training step and build the command line accordingly---so perhaps we should glob everywhere except for the extract outputs...) Feel free to let me know if you need more pointers here!

Alternatively, we could have optional inputs/outputs where necessary to cover all the possible workflow branches, but this might be a pain to test and/or update in the future.

This all stems from the question of what the default/canonical workflow should be (i.e., BGMM or IsolationForest? positive-only or positive-negative? etc.) and how many branches we want to expose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think glob makes sense here, but probably not for the Extract task. Does the "~{basename}.~{mode}.annot.hdf5" output from Extract change filenames as well?

It seems like eventually we'll want multiple test jsons with different sets of inputs to test BGMM/IsolationForest and positive-only/positive-negative. But I'd suggest we make an issue for that and add them in another day to keep this version on track.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The *.annot.hdf5 output of the extract tool, which contains the (positive) training and calibration data, remains unchanged. However, an additional *.unlabeled.annot.hdf5 output is generated, which contains the unlabeled data separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, we can work towards testing a few workflow branches and adding functionality to this initial WDL as we go. See the somatic CNV WDL tests for a rough idea.

"JointVcfFiltering.basename": "test_10_samples",
"JointVcfFiltering.indel_sensitivity_threshold": 97.0,
"JointVcfFiltering.snp_sensitivity_threshold": 99.9,
"JointVcfFiltering.snp_annotations": "-A ReadPosRankSum -A FS -A SOR -A QD -A AVERAGE_TREE_SCORE -A AVERAGE_ASSEMBLED_HAPS -A AVERAGE_FILTERED_HAPS",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on comments elsewhere: would it be possible to use the AVERAGE_ASSEMBLED_HAPS and AVERAGE_FILTERED_HAPS annotations for INDELs, eventually?

@meganshand
Copy link
Contributor Author

@samuelklee I tried to address your comments and tests are now back to passing so I think this is ready for re-review. Thanks!

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, thanks again @meganshand! As discussed on Slack, just extremely minor comments, which I'm happy to handle after we 1) open, review, and merge the branch containing the tools, and then 2) do any necessary rebasing or adding of features here.

After that, I will also try to put something in CARROT that perhaps marginally expands the functionality of the test done here, just to familiarize myself with where things are at in that framework.

@@ -350,8 +350,8 @@ jobs:
echo "Running CNN WDL";
bash scripts/cnn_variant_cromwell_tests/run_cnn_variant_wdl.sh;

- name: "FILTERING_WDL_TEST"
if: ${{ matrix.wdlTest == 'RUN_FILTERING_WDL' }}
- name: "VCF_SITE_LEVEL_FILTERING_WDL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this! Perhaps *_WDL_TEST just to be consistent with the other tests.

@@ -28,9 +42,9 @@ workflow JointVcfFiltering {
input_vcf_index = sites_only_vcf_index,
mode = "SNP",
annotations = snp_annotations,
resources = "-resource:hapmap,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/hapmap_3.3.hg38.vcf.gz -resource:omni,training=true,calibration=true gs://gcp-public-data--broad-references/hg38/v0/1000G_omni2.5.hg38.vcf.gz -resource:1000G,training=true,calibration=false gs://gcp-public-data--broad-references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
resources = snp_training_resource_command_line,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps just snp_resource_args and indel_resource_args, so that 1) we don't implicitly focus on training resources (since we can include non-training calibration resources, or even just any resources we want to use to label variants), and 2) we are more consistent with the *_args WDL parameters in e.g. the M2 WDL.

@@ -160,6 +160,7 @@ task ExtractVariantAnnotations {
File annots = "~{basename}.~{mode}.annot.hdf5"
File extracted_training_vcf = "~{basename}.~{mode}.vcf.gz"
File extracted_training_vcf_index = "~{basename}.~{mode}.vcf.gz.tbi"
Array[File] outputs = glob("~{basename}.~{mode}.*")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking---so we are going to output both the glob and the individual files here?

Another thing I missed in my initial review: let's make sure all outputs of the workflow are exposed, and not just the final VCF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was my intention (to both glob and have them listed out as individual files). This was to make it easier to import the individual files to other tasks and use them in the command line appropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed the other two easy comments above, but I think I'm going to leave adding all the outputs of the workflow to you. I don't think we want to highlight any outputs that are only intended for debugging or analysis. Those files will still be there since they are delocalized by the tasks, but I'm not sure if we want to keep them around long term (making them outputs of the WDL rather than just "intermediate files" that will eventually be cleaned up). If you think they're all worth keeping though then feel free to add them as final workflow outputs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, thanks!

We expose all outputs in the CNV WDLs, which makes it easier to use them as subworkflows downstream. I think it’s also worth encouraging the investigation of the intermediate outputs (which are relatively small here) as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, makes sense!

@meganshand
Copy link
Contributor Author

@samuelklee would you like to merge this branch into yours? Or did you want to PR your branch first?

@samuelklee
Copy link
Contributor

samuelklee commented Jul 19, 2022

Hmm, I guess I can go ahead and merge into my branch once tests pass.

@samuelklee samuelklee merged commit 24cc416 into sl_sklearnvarianttrain_scalable Jul 19, 2022
@samuelklee samuelklee deleted the ms_filtering_wdl branch July 19, 2022 19:00
samuelklee pushed a commit that referenced this pull request Jul 21, 2022
* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments
samuelklee pushed a commit that referenced this pull request Jul 28, 2022
* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments
samuelklee added a commit that referenced this pull request Aug 9, 2022
…annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>
rsasch pushed a commit that referenced this pull request Oct 17, 2022
…annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>
koncheto-broad added a commit that referenced this pull request Feb 2, 2023
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <mshand@broadinstitute.org>

* Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049)

* Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027)

* Small changes to JointVCFFiltering WDL

* making default for use_allele_specific_annotations

* addressing comments

* first stab

* wire through WDL changes

* fixed typo

* set model_backend input value

* add gatk_override to JointVcfFiltering call

* typo in indel_annotations

* make model_backend optional

* tabs and spaces

* make all model_backends optional

* use gatk 4.3.0

* no point in changing the table names as this is a POC

* adding new branch to dockstore

* adding in branching logic for classic VQSR vs VQSR-Lite

* implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files

* passing classic flag to indel run of CreateFilteringFiles

* Update GvsCreateFilterSet.wdl

cleaning up verbiage

* Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261)

* cleanup up sloppy comment

---------

Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Rebecca Asch <rasch@broadinstitute.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants