broadinstitute · gbggrant · Apr 29, 2022 · Apr 22, 2022 · Apr 26, 2022 · Apr 27, 2022
diff --git a/.dockstore.yml b/.dockstore.yml
@@ -173,6 +173,14 @@ workflows:
        branches:
          - master
          - ah_var_store
+   - name: GvsCalculatePrecisionAndSensitivity
+     subclass: WDL
+     primaryDescriptorPath: /scripts/variantstore/wdl/GvsCalculatePrecisionAndSensitivity.wdl
+     filters:
+       branches:
+         - master
+         - ah_var_store
+         - gg_AddPrecisionAndSensitivityWdl
    - name: MitochondriaPipeline
      subclass: WDL
      primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl

diff --git a/scripts/variantstore/tieout/AoU_PRECISION_SENSITIVITY.md b/scripts/variantstore/tieout/AoU_PRECISION_SENSITIVITY.md
@@ -1,104 +1,82 @@
 # Generating Callset Precision and Sensitivity Values
-## Prerequisites (hopefully only do once)
-1. Use `conda` to create a fresh environment to add these new tools to it:
- ```
- conda create --name gvs python=3.8
- conda activate gvs
- conda install -c bioconda samtools=1.9 --force-reinstall
- conda install -c bioconda bcftools
- conda install -c bioconda bedtools
- conda install -c bioconda rtg-tools
-```
-2. Download reference genome from `gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta` and pass it in to `rtg`:
-```
-rtg format --output human_REF_SDF /path/to/Homo_sapiens_assembly38.fasta
-```
-3. Copy truth sample VCFs to your local environment. Make sure you have truth files for all your control samples (e.g. `HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz` for `UW_HG-002` and `BI_HG-002`); you will need these for the "Evaluate" step.  If you are missing any see [Appendix A](#appendix-a-new-control-samples) and follow the steps there to add and prepare them.
-```
-mkdir -p truth && gsutil -m cp gs://broad-dsp-spec-ops/gvs/truth/* truth/
-```
-4. Complete (at least) through the training step (`GvsCreateFilterSet`) of the [GVS pipeline](../AOU_DELIVERABLES.md) with all the samples (control and non-control).  Then run the callset extract step with a view of the sample-mapping table that's controls-only but with the training (i.e. "filter_set") from all the samples.
-## Generate gVCFs for Chromosome 20
-Once you have the "controls only" run complete, inspect the shards to find the set of shards from `GvsExtractCallset` that encompass chr20 entirely, e.g. looking at different values of `shard-???` where the output of:
-```
-gsutil cat alpha1_1000_???.vcf.gz | gunzip | grep -v "#" | cut -f1-8 | head
-```
-starts with "chr20". Once the range of shards is known, gather those into a single chr20 gVCF for all samples with something like:
+
+## Prerequisites
+Complete (at least) through the training step (`GvsCreateFilterSet`) of the [GVS pipeline](../AOU_DELIVERABLES.md) with all the samples (control and non-control).  Then run the callset extract step with a view of the sample-mapping table that's controls-only but with the training (i.e. "filter_set") from all the samples.
+
+1. Use the GvsCalculatePrecisionAndSensitivity wdl to calculate the precision and sensitivity.
+
+The wdl takes several inputs:
+
+**input_vcfs** - A collection of VCFS - generated by `GvsExtractCallSet`. These need not be subsetted down to the chromosome.
+
 ```
-gatk --java-options -Xms6g GatherVcfsCloud --ignore-safety-checks --gather-type BLOCK \
--I aou_callset_1K_223.vcf.gz \
--I aou_callset_1K_224.vcf.gz \
--I aou_callset_1K_225.vcf.gz \
--I aou_callset_1K_226.vcf.gz \
--I aou_callset_1K_227.vcf.gz \
--I aou_callset_1K_228.vcf.gz \
--I aou_callset_1K_229.vcf.gz \
---output chr20.vcf.gz
+[ "aou_callset_1K_223.vcf.gz", \
+  "aou_callset_1K_224.vcf.gz", \
+  "aou_callset_1K_225.vcf.gz", \
+  "aou_callset_1K_226.vcf.gz", \
+  "aou_callset_1K_227.vcf.gz", \
+  "aou_callset_1K_228.vcf.gz", \
+  "aou_callset_1K_229.vcf.gz"
 ```
-And create an index for that VCF:
+
+**output_basename** - The base name for the output files generated by the pipeline.
+
+**chromosome** - The chromosome on which to run the analysis of Precision and Sensitivity. The default value for this is `chr20`. If it is set to `all` then the analysis will be run across *all* chromosomes.
+
+**sample_names** - A list of the sample names that are controls and that will be used for the analysis. For every element on the list of sample names there must be a corresponding element on the list of `truth_vcfs`, `truth_vcf_indices`, and `truth_beds`.
+
 ```
-tabix chr20.vcf.gz
+[ "NA12878", \
+  "NA24385"
 ```
-Now create single sample gVCFs for the control samples; in this example the sample names for the controls are "BI_HG-002", "UW_HG-002" and "BI_HG-003":
+**truth_vcfs** - A list of the VCFs that contain the truth data used for analyzing the samples in `sample_names`.
+
 ```
-INPUT_VCF=chr20.vcf.gz
-~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-002 --select-type-to-exclude NO_VARIATION -O BI_HG-002.chr20.vcf.gz
-~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name UW_HG-002 --select-type-to-exclude NO_VARIATION -O UW_HG-002.chr20.vcf.gz
-~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-003 --select-type-to-exclude NO_VARIATION -O BI_HG-003.chr20.vcf.gz
+[ "gs://broad-gotc-test-storage/gvs/truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz", \
+  "gs://broad-gotc-test-storage/gvs/truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
 ```
-## Run Script to Add "AS_MAX_VQSLOD" to VCFs
-The first line should contain only the control sample names in your callset (same list as previous step).
+**truth_vcf_indices** - A list of the VCF indices for the truth data VCFs supplied above.
+
 ```
-for sample in BI_HG-002 UW_HG-002 BI_HG-003
-do
-  python ../add_max_as_vqslod.py ${sample}.chr20.vcf.gz | bgzip > ${sample}.chr20.maxas.vcf.gz && tabix ${sample}.chr20.maxas.vcf.gz
-done
+[ "gs://broad-gotc-test-storage/gvs/truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.tbi", \
+  "gs://broad-gotc-test-storage/gvs/truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi"
 ```
-## Evaluate
-First create the ROC curves using only the PASSing records.  For each sample, you will need to find the corresponding files under `truth/` and pass those into the `BASE_CMD`.
 
-```
-BASE_CMD="rtg vcfeval --region chr20 --roc-subset snp,indel  --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
-SUFFIX="_roc_filtered"
+**truth_beds** - A list of the bed files for the truth data used for analyzing the samples in `sample_names`.
 
-${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG-002.chr20.maxas.vcf.gz -o BI_HG-002_aou-bq${SUFFIX}
-${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG-002.chr20.maxas.vcf.gz -o UW_HG-002_aou-bq${SUFFIX}
-${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG-003.chr20.maxas.vcf.gz -o BI_HG-003_aou-bq${SUFFIX}
 ```
-Then do the same thing but use all records.
+[ "gs://broad-gotc-test-storage/gvs/truth/HG001.gvs.evaluation.bed", \
+  "gs://broad-gotc-test-storage/gvs/truth/HG002.gvs.evaluation.bed"
 ```
-SOURCE="gvs"
-BASE_CMD="rtg vcfeval --region chr20 --all-records --roc-subset snp,indel  --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
-SUFFIX="_roc_all"
 
-${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG-002.chr20.maxas.vcf.gz -o BI_HG-002_aou-bq${SUFFIX}
-${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG-002.chr20.maxas.vcf.gz -o UW_HG-002_aou-bq${SUFFIX}
-${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG-003.chr20.maxas.vcf.gz -o BI_HG-003_aou-bq${SUFFIX}
-```
-## Gather Precision and Sensitivity numbers
-Run the following script to display the data for the output file:
-```
-for type in "snp" "indel"
-do
-    for suffix in "_all" "_filtered"
-    do
-      for f in $(ls -1 *${suffix}/${type}*.tsv.gz); do
-        d=$(cat $f | gunzip | tail -1 | cut -f3,5,6,7)
-        s=$(echo $f | cut -d"/" -f1 )
-        echo -e "$s\t$type\t$d"
-      done
-    done
-done
+**ref_fasta** - The cloud path for the reference fasta sequence.
+
+
+## Appendix A: View ROCs
+To view ROC curves for the data you will need to set up rtg-tools locally, using the following steps (typically you need do this only once):
 
+1. Use `conda` to create a fresh environment to add these new tools to it:
+ ```
+ conda create --name gvs python=3.8
+ conda activate gvs
+ conda install -c bioconda rtg-tools
 ```
-The columns here are: sample name, variant type, FPs (false positives), FNs (false negatives), precision, and sensitivity. Copy and paste the output from this call into a TSV file.
-## Optional: View ROCs
+The base files needed to generate the ROC curves are among the outputs of the `GvsCalculatePrecisionAndSensitivity` wdl
+
 You can use wildcards to pull in multiple datasets into the graphical viewer. For example
 ```
 rtg roc NA12878_*_roc*/snp_roc.tsv.gz 
 rtg roc NA12878_*_roc*/indel_roc.tsv.gz 
 ```
-## Appendix A: New Control Samples
+## Appendix B: Control Samples
+
+"Truth" versions of control samples for this analysis are stored in:
+```
+gs://broad-dsp-spec-ops/gvs/truth/*
+```
+
+If you are missing any follow these steps there to add and prepare them to the above location
+
 "Truth" versions of control samples are from the [Genome in a Bottle Consortium (GIAB)](https://www.nist.gov/programs-projects/genome-bottle) and [releases of their data can be found here](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/) organized by source.
 1. Download the `.bed`, `.vcf` and index files associated with the control sample.
 2. Convert the calling region GVS is using to BED format