Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VS-390. Add precision and sensitivity wdl #7813

Merged
merged 8 commits into from
Apr 29, 2022
8 changes: 8 additions & 0 deletions .dockstore.yml
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,14 @@ workflows:
branches:
- master
- ah_var_store
- name: GvsCalculatePrecisionAndSensitivity
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCalculatePrecisionAndSensitivity.wdl
filters:
branches:
- master
- ah_var_store
- gg_AddPrecisionAndSensitivityWdl
- name: MitochondriaPipeline
subclass: WDL
primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl
Expand Down
138 changes: 58 additions & 80 deletions scripts/variantstore/tieout/AoU_PRECISION_SENSITIVITY.md
Original file line number Diff line number Diff line change
@@ -1,104 +1,82 @@
# Generating Callset Precision and Sensitivity Values
## Prerequisites (hopefully only do once)
1. Use `conda` to create a fresh environment to add these new tools to it:
```
conda create --name gvs python=3.8
conda activate gvs
conda install -c bioconda samtools=1.9 --force-reinstall
conda install -c bioconda bcftools
conda install -c bioconda bedtools
conda install -c bioconda rtg-tools
```
2. Download reference genome from `gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta` and pass it in to `rtg`:
```
rtg format --output human_REF_SDF /path/to/Homo_sapiens_assembly38.fasta
```
3. Copy truth sample VCFs to your local environment. Make sure you have truth files for all your control samples (e.g. `HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz` for `UW_HG-002` and `BI_HG-002`); you will need these for the "Evaluate" step. If you are missing any see [Appendix A](#appendix-a-new-control-samples) and follow the steps there to add and prepare them.
```
mkdir -p truth && gsutil -m cp gs://broad-dsp-spec-ops/gvs/truth/* truth/
```
4. Complete (at least) through the training step (`GvsCreateFilterSet`) of the [GVS pipeline](../AOU_DELIVERABLES.md) with all the samples (control and non-control). Then run the callset extract step with a view of the sample-mapping table that's controls-only but with the training (i.e. "filter_set") from all the samples.
## Generate gVCFs for Chromosome 20
Once you have the "controls only" run complete, inspect the shards to find the set of shards from `GvsExtractCallset` that encompass chr20 entirely, e.g. looking at different values of `shard-???` where the output of:
```
gsutil cat alpha1_1000_???.vcf.gz | gunzip | grep -v "#" | cut -f1-8 | head
```
starts with "chr20". Once the range of shards is known, gather those into a single chr20 gVCF for all samples with something like:

## Prerequisites
Complete (at least) through the training step (`GvsCreateFilterSet`) of the [GVS pipeline](../AOU_DELIVERABLES.md) with all the samples (control and non-control). Then run the callset extract step with a view of the sample-mapping table that's controls-only but with the training (i.e. "filter_set") from all the samples.

1. Use the GvsCalculatePrecisionAndSensitivity wdl to calculate the precision and sensitivity.

The wdl takes several inputs:

**input_vcfs** - A collection of VCFS - generated by `GvsExtractCallSet`. These need not be subsetted down to the chromosome.

```
gatk --java-options -Xms6g GatherVcfsCloud --ignore-safety-checks --gather-type BLOCK \
-I aou_callset_1K_223.vcf.gz \
-I aou_callset_1K_224.vcf.gz \
-I aou_callset_1K_225.vcf.gz \
-I aou_callset_1K_226.vcf.gz \
-I aou_callset_1K_227.vcf.gz \
-I aou_callset_1K_228.vcf.gz \
-I aou_callset_1K_229.vcf.gz \
--output chr20.vcf.gz
[ "aou_callset_1K_223.vcf.gz", \
"aou_callset_1K_224.vcf.gz", \
"aou_callset_1K_225.vcf.gz", \
"aou_callset_1K_226.vcf.gz", \
"aou_callset_1K_227.vcf.gz", \
"aou_callset_1K_228.vcf.gz", \
"aou_callset_1K_229.vcf.gz"
```
And create an index for that VCF:

**output_basename** - The base name for the output files generated by the pipeline.

**chromosome** - The chromosome on which to run the analysis of Precision and Sensitivity. The default value for this is `chr20`. If it is set to `all` then the analysis will be run across *all* chromosomes.

**sample_names** - A list of the sample names that are controls and that will be used for the analysis. For every element on the list of sample names there must be a corresponding element on the list of `truth_vcfs`, `truth_vcf_indices`, and `truth_beds`.

```
tabix chr20.vcf.gz
[ "NA12878", \
"NA24385"
```
Now create single sample gVCFs for the control samples; in this example the sample names for the controls are "BI_HG-002", "UW_HG-002" and "BI_HG-003":
**truth_vcfs** - A list of the VCFs that contain the truth data used for analyzing the samples in `sample_names`.

```
INPUT_VCF=chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-002 --select-type-to-exclude NO_VARIATION -O BI_HG-002.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name UW_HG-002 --select-type-to-exclude NO_VARIATION -O UW_HG-002.chr20.vcf.gz
~/gatk SelectVariants -V ${INPUT_VCF} -L chr20 --sample-name BI_HG-003 --select-type-to-exclude NO_VARIATION -O BI_HG-003.chr20.vcf.gz
[ "gs://broad-gotc-test-storage/gvs/truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz", \
"gs://broad-gotc-test-storage/gvs/truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
```
## Run Script to Add "AS_MAX_VQSLOD" to VCFs
The first line should contain only the control sample names in your callset (same list as previous step).
**truth_vcf_indices** - A list of the VCF indices for the truth data VCFs supplied above.

```
for sample in BI_HG-002 UW_HG-002 BI_HG-003
do
python ../add_max_as_vqslod.py ${sample}.chr20.vcf.gz | bgzip > ${sample}.chr20.maxas.vcf.gz && tabix ${sample}.chr20.maxas.vcf.gz
done
[ "gs://broad-gotc-test-storage/gvs/truth/HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz.tbi", \
"gs://broad-gotc-test-storage/gvs/truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi"
```
## Evaluate
First create the ROC curves using only the PASSing records. For each sample, you will need to find the corresponding files under `truth/` and pass those into the `BASE_CMD`.

```
BASE_CMD="rtg vcfeval --region chr20 --roc-subset snp,indel --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
SUFFIX="_roc_filtered"
**truth_beds** - A list of the bed files for the truth data used for analyzing the samples in `sample_names`.

${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG-002.chr20.maxas.vcf.gz -o BI_HG-002_aou-bq${SUFFIX}
${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG-002.chr20.maxas.vcf.gz -o UW_HG-002_aou-bq${SUFFIX}
${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG-003.chr20.maxas.vcf.gz -o BI_HG-003_aou-bq${SUFFIX}
```
Then do the same thing but use all records.
[ "gs://broad-gotc-test-storage/gvs/truth/HG001.gvs.evaluation.bed", \
"gs://broad-gotc-test-storage/gvs/truth/HG002.gvs.evaluation.bed"
```
SOURCE="gvs"
BASE_CMD="rtg vcfeval --region chr20 --all-records --roc-subset snp,indel --vcf-score-field=INFO.MAX_AS_VQSLOD -t human_REF_SDF"
SUFFIX="_roc_all"

${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c BI_HG-002.chr20.maxas.vcf.gz -o BI_HG-002_aou-bq${SUFFIX}
${BASE_CMD} -b truth/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG002.gvs.evaluation.bed -c UW_HG-002.chr20.maxas.vcf.gz -o UW_HG-002_aou-bq${SUFFIX}
${BASE_CMD} -b truth/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -e truth/HG003.gvs.evaluation.bed -c BI_HG-003.chr20.maxas.vcf.gz -o BI_HG-003_aou-bq${SUFFIX}
```
## Gather Precision and Sensitivity numbers
Run the following script to display the data for the output file:
```
for type in "snp" "indel"
do
for suffix in "_all" "_filtered"
do
for f in $(ls -1 *${suffix}/${type}*.tsv.gz); do
d=$(cat $f | gunzip | tail -1 | cut -f3,5,6,7)
s=$(echo $f | cut -d"/" -f1 )
echo -e "$s\t$type\t$d"
done
done
done
**ref_fasta** - The cloud path for the reference fasta sequence.


## Appendix A: View ROCs
To view ROC curves for the data you will need to set up rtg-tools locally, using the following steps (typically you need do this only once):

1. Use `conda` to create a fresh environment to add these new tools to it:
```
conda create --name gvs python=3.8
conda activate gvs
conda install -c bioconda rtg-tools
```
The columns here are: sample name, variant type, FPs (false positives), FNs (false negatives), precision, and sensitivity. Copy and paste the output from this call into a TSV file.
## Optional: View ROCs
The base files needed to generate the ROC curves are among the outputs of the `GvsCalculatePrecisionAndSensitivity` wdl

You can use wildcards to pull in multiple datasets into the graphical viewer. For example
```
rtg roc NA12878_*_roc*/snp_roc.tsv.gz
rtg roc NA12878_*_roc*/indel_roc.tsv.gz
```
## Appendix A: New Control Samples
## Appendix B: Control Samples

"Truth" versions of control samples for this analysis are stored in:
```
gs://broad-dsp-spec-ops/gvs/truth/*
```

If you are missing any follow these steps there to add and prepare them to the above location

"Truth" versions of control samples are from the [Genome in a Bottle Consortium (GIAB)](https://www.nist.gov/programs-projects/genome-bottle) and [releases of their data can be found here](https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/) organized by source.
1. Download the `.bed`, `.vcf` and index files associated with the control sample.
2. Convert the calling region GVS is using to BED format
Expand Down
Loading