Remove Sites only step from the VAT creation WDL #7510

Merged · 26 commits · Oct 28, 2021

Changes from all commits (26 commits):
49c9971  clean up (RoriCremer, Sep 16, 2021)
9ff496b  is this step dropping rows (RoriCremer, Sep 29, 2021)
d845e01  add filtering on the FT tag level (RoriCremer, Sep 29, 2021)
e8646e4  fixup on the file length (RoriCremer, Sep 29, 2021)
46b81e1  remove duplicates by 4 cols, not only 2 (RoriCremer, Sep 29, 2021)
59b6cb9  cleanup (RoriCremer, Sep 30, 2021)
1e358aa  does this cleanly remove sites only? (RoriCremer, Sep 30, 2021)
d168d27  add to dockstore for testing (RoriCremer, Sep 30, 2021)
e9f413c  lil cheaper (RoriCremer, Sep 30, 2021)
f93b896  testing the normalized vcfs (RoriCremer, Sep 30, 2021)
55fa305  make annotations machine bigger (RoriCremer, Sep 30, 2021)
b2d0ffe  add duplicate count info (RoriCremer, Oct 1, 2021)
d84ad82  dont rename things---use the names that already exist! (RoriCremer, Oct 4, 2021)
3d8b151  drop all the excessive AS_QUALapprox info since it just adds weight (RoriCremer, Oct 4, 2021)
615aa28  swap gzip (RoriCremer, Oct 4, 2021)
7a12d55  check with Lee about VAT - markings (RoriCremer, Oct 6, 2021)
a815a86  template update (RoriCremer, Oct 6, 2021)
55d5636  add in line counts for metrics (RoriCremer, Oct 6, 2021)
7a14fb2  get the count of high alt allele sites (RoriCremer, Oct 7, 2021)
5d8f1ed  AC_hemi for all (RoriCremer, Oct 7, 2021)
f57707a  add new docker image (RoriCremer, Oct 8, 2021)
c853b63  get sample counts (RoriCremer, Oct 8, 2021)
7f0864b  clean up files for memory preservation (RoriCremer, Oct 13, 2021)
27ca164  re-order errors (RoriCremer, Oct 15, 2021)
2602a56  pass dropped variants (RoriCremer, Oct 20, 2021)
5284dee  comments cleanup (RoriCremer, Oct 27, 2021)
1 change: 1 addition & 0 deletions .dockstore.yml
@@ -128,6 +128,7 @@ workflows:
branches:
- master
- ah_var_store
- rc-remove-sites-only-step
- name: GvsValidateVat
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl
@@ -1,7 +1,7 @@
#title=gvsAnnotations
#assembly=GRCh38
#matchVariantsBy=allele
#CHROM POS REF ALT AC AN AF AC_Hom AC_Het AC_afr AN_afr AF_afr AC_Hom_afr AC_Het_afr AC_amr AN_amr AF_amr AC_Hom_amr AC_Het_amr AC_eas AN_eas AF_eas AC_Hom_eas AC_Het_eas AC_eur AN_eur AF_eur AC_Hom_eur AC_Het_eur AC_mid AN_mid AF_mid AC_Hom_mid AC_Het_mid AC_oth AN_oth AF_oth AC_Hom_oth AC_Het_oth AC_sas AN_sas AF_sas AC_Hom_sas AC_Het_sas
#categories . . . AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount
#descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#type . . . number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number
#CHROM POS REF ALT AC AN AF AC_Hom AC_Het AC_Hemi AC_afr AN_afr AF_afr AC_Hom_afr AC_Het_afr AC_Hemi_afr AC_amr AN_amr AF_amr AC_Hom_amr AC_Het_amr AC_Hemi_amr AC_eas AN_eas AF_eas AC_Hom_eas AC_Het_eas AC_Hemi_eas AC_eur AN_eur AF_eur AC_Hom_eur AC_Het_eur AC_Hemi_eur AC_mid AN_mid AF_mid AC_Hom_mid AC_Het_mid AC_Hemi_mid AC_oth AN_oth AF_oth AC_Hom_oth AC_Het_oth AC_Hemi_oth AC_sas AN_sas AF_sas AC_Hom_sas AC_Het_sas AC_Hemi_sas
#categories . . . AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount AlleleCount AlleleNumber AlleleFrequency AlleleCount AlleleCount AlleleCount
#descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#type . . . number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number number
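
This template change adds an AC_Hemi column overall plus one per subpopulation (afr, amr, eas, eur, mid, oth, sas), growing each header row of the custom annotations template from 44 to 52 tab-separated fields. A hedged sanity check that the four column-describing rows of the 7-line header stay in sync (the local file name is an assumption):

```bash
# Rows 4-7 of the template (#CHROM, #categories, #descriptions, #type)
# must all carry the same number of tab-separated fields (52 after this change).
awk -F'\t' 'NR >= 4 && NR <= 7 { print NR ": " NF " fields" }' custom_annotations_template.tsv
```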
3 changes: 2 additions & 1 deletion scripts/variantstore/wdl/GvsSitesOnlyVCF.example.inputs.json
@@ -12,5 +12,6 @@
"GvsSitesOnlyVCF.project_id": "spec-ops-aou",
"GvsSitesOnlyVCF.dataset_name": "anvil_100_for_testing",
"GvsSitesOnlyVCF.AnAcAf_annotations_template": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/custom_annotations_template.tsv",
"GvsSitesOnlyVCF.ancestry_file": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/subpop.tsv"
"GvsSitesOnlyVCF.ancestry_file": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/subpop.tsv",
"GvsSitesOnlyVCF.reference": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"
}
162 changes: 72 additions & 90 deletions scripts/variantstore/wdl/GvsSitesOnlyVCF.wdl
@@ -3,8 +3,6 @@ workflow GvsSitesOnlyVCF {
input {
File inputFileofFileNames
File inputFileofIndexFileNames
String output_sites_only_file_name
String output_annotated_file_name
String project_id
String dataset_name
File nirvana_data_directory
@@ -34,6 +32,9 @@

## Scatter across the shards from the GVS jointVCF
scatter(i in range(length(MakeSubpopulationFiles.input_vcfs)) ) {
## Create a sites-only VCF from the original GVS jointVCF
## Calculate AC/AN/AF for subpopulations and extract them for custom annotations
String input_vcf_name = basename(MakeSubpopulationFiles.input_vcfs[i], ".vcf.gz")
call ExtractAnAcAfFromVCF {
input:
input_vcf = MakeSubpopulationFiles.input_vcfs[i],
@@ -44,29 +45,20 @@
ref = reference
}

## Create a sites-only VCF from the original GVS jointVCF
call SitesOnlyVcf {
input:
input_vcf = ExtractAnAcAfFromVCF.output_vcf,
input_vcf_index = ExtractAnAcAfFromVCF.output_vcf_index,
service_account_json_path = service_account_json_path,
output_filename = "${output_sites_only_file_name}_${i}.sites_only.vcf.gz"
}

## Use Nirvana to annotate the sites-only VCF and include the AC/AN/AF calculations as custom annotations
call AnnotateVCF {
input:
input_vcf = SitesOnlyVcf.output_vcf,
input_vcf_index = SitesOnlyVcf.output_vcf_idx,
output_annotated_file_name = "${output_annotated_file_name}_${i}",
input_vcf = ExtractAnAcAfFromVCF.output_vcf,
input_vcf_index = ExtractAnAcAfFromVCF.output_vcf_index,
output_annotated_file_name = "${input_vcf_name}_annotated",
nirvana_data_tar = nirvana_data_directory,
custom_annotations_file = ExtractAnAcAfFromVCF.annotations_file,
}

call PrepAnnotationJson {
input:
annotation_json = AnnotateVCF.annotation_json,
output_file_suffix = "${i}.json.gz",
output_file_suffix = "${input_vcf_name}.json.gz",
output_path = output_path,
service_account_json_path = service_account_json_path
}
@@ -90,12 +82,13 @@
project_id = project_id,
dataset_name = dataset_name,
counts_variants = ExtractAnAcAfFromVCF.count_variants,
track_dropped_variants = ExtractAnAcAfFromVCF.track_dropped,
table_suffix = table_suffix,
service_account_json_path = service_account_json_path,
load_jsons_done = BigQueryLoadJson.done
}
scatter(i in range(length(contig_array)) ) {

scatter(i in range(length(contig_array)) ) {
call BigQueryExportVat {
input:
contig = contig_array[i],
@@ -158,7 +151,7 @@ task MakeSubpopulationFiles {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20210916"
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20211007"
memory: "1 GB"
preemptible: 3
cpu: "1"
@@ -200,7 +193,8 @@ task ExtractAnAcAfFromVCF {
String normalized_vcf_compressed = "normalized.vcf.gz"
String normalized_vcf_indexed = "normalized.vcf.gz.tbi"

# separate multi-allelic sites into their own lines, remove deletions and extract the an/ac/af & sc
# separate multi-allelic sites into their own lines, remove deletions and filtered sites, and make a sites-only VCF
# while extracting the AN/AC/AF & SC by subpopulation into a TSV
command <<<
set -e

@@ -217,8 +211,12 @@
gsutil cp ~{input_vcf_index} ~{local_input_vcf_index}
gsutil cp ~{ref} Homo_sapiens_assembly38.fasta

# TODO Compare the ancestry list with the sample list and throw an error (but dont fail the job) if there are samples that are in one, but not the other. Two different errors.
# awk '{print $2}' ~{subpopulation_sample_list} | tail -n +2 | sort -u > collected_subpopulations.txt
## compare the ancestry sample list with the VCF sample list
# TODO: throw an error (but don't fail the job) if there are samples in one list but not the other. Throw two different errors.
# Currently commented out to save on IO for the AoU beta VAT creation
# awk '{print $1}' ~{subpopulation_sample_list} | tail -n +2 | sort -u > collected_subpopulation_samples.txt
# bcftools query --list-samples ~{local_input_vcf} | sort -u > collected_samples.txt
# diff collected_subpopulation_samples.txt collected_samples.txt
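
A hedged sketch of how that TODO could report the two error classes separately without failing the task, using `comm` on the two sorted sample lists (the file names here are invented stand-ins for the workflow's inputs):

```bash
# Both inputs must be sorted, which the sort -u calls above guarantee.
awk '{print $1}' subpop.tsv | tail -n +2 | sort -u > ancestry_samples.txt
bcftools query --list-samples input.vcf.gz | sort -u > vcf_samples.txt
# Samples listed in the ancestry file but absent from the VCF:
comm -23 ancestry_samples.txt vcf_samples.txt > only_in_ancestry.txt
# Samples present in the VCF but absent from the ancestry file:
comm -13 ancestry_samples.txt vcf_samples.txt > only_in_vcf.txt
if [ -s only_in_ancestry.txt ]; then echo "WARNING: ancestry-only samples:"; cat only_in_ancestry.txt; fi
if [ -s only_in_vcf.txt ]; then echo "WARNING: VCF-only samples:"; cat only_in_vcf.txt; fi
```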

# expected_subpopulations = [
# "afr",
@@ -230,39 +228,56 @@
# "sas"
#]


bcftools norm -m-any ~{local_input_vcf} | \
bcftools norm --check-ref w -f Homo_sapiens_assembly38.fasta > ~{normalized_vcf}


## make a file of just the first 4 columns of the tsv (maybe I could just bcftools query it?)
bcftools query ~{normalized_vcf} -f '%CHROM\t%POS\t%REF\t%ALT\n' > check_duplicates.tsv
## track the dropped variants with more than 500 alt alleles or N's in the reference (since Nirvana can't handle N as a base, drop them for now)
bcftools view -i 'N_ALT>500 || REF~"N"' ~{local_input_vcf} | bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' > track_dropped.tsv

## filter out sites with too many alt alleles
bcftools view -e 'N_ALT>500 || REF~"N"' --no-update ~{local_input_vcf} | \
## filter out the non-passing sites
bcftools view -f 'PASS,.' --no-update | \
## normalize, left align and split multi allelic sites to new lines, remove duplicate lines
bcftools norm -m- --check-ref w -f Homo_sapiens_assembly38.fasta | \
## filter out spanning deletions and variants with an AC of 0
bcftools view -e 'ALT[0]="*" || AC=0' --no-update | \
## ensure that we respect the FT tag
bcftools filter -i "FORMAT/FT='PASS,.'" --set-GTs . > ~{normalized_vcf}
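
Two details worth noting: the `-i` expression used for tracking is the exact complement of the `-e` expression that starts this pipeline, so every site lands either in the dropped-variants TSV or in the filtered stream, and the closing `bcftools filter --set-GTs .` appears to null out individual genotypes whose FT tag is not PASS (or missing) rather than dropping whole sites. A toy demonstration of the `-i`/`-e` pairing (invented two-record VCF; requires bcftools):

```bash
# Build a tiny VCF with printf so the columns are real tabs.
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1>\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr1\t100\t.\tN\tT\t.\tPASS\t.\nchr1\t200\t.\tA\tC\t.\tPASS\t.\n' >> toy.vcf

bcftools view -i 'REF~"N"' toy.vcf | grep -v '^##'   # selects chr1:100, the site being dropped
bcftools view -e 'REF~"N"' toy.vcf | grep -v '^##'   # keeps chr1:200 for downstream use
```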

## clean up unneeded file
rm ~{local_input_vcf}

## make a file of just the first 5 columns of the tsv
bcftools query ~{normalized_vcf} -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' > check_duplicates.tsv
## check it for duplicates and put them in a new file
sort check_duplicates.tsv | uniq -d | cut -f1,2 > duplicates.tsv
## remove those rows (this will be ALL rows with this position--so good rows too, potentially we want to grab f1,4 to do this with)
grep -v -wFf duplicates.tsv ~{normalized_vcf} | grep -v "AC=0;" > deduplicated.vcf

wc -l duplicates.tsv

bcftools plugin fill-tags -- deduplicated.vcf -S ~{subpopulation_sample_list} -t AC,AF,AN,AC_het,AC_hom | bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF\t%AC_Hom\t%AC_Het\t%AC_afr\t%AN_afr\t%AF_afr\t%AC_Hom_afr\t%AC_Het_afr\t%AC_amr\t%AN_amr\t%AF_amr\t%AC_Hom_amr\t%AC_Het_amr\t%AC_eas\t%AN_eas\t%AF_eas\t%AC_Hom_eas\t%AC_Het_eas\t%AC_eur\t%AN_eur\t%AF_eur\t%AC_Hom_eur\t%AC_Het_eur\t%AC_mid\t%AN_mid\t%AF_mid\t%AC_Hom_mid\t%AC_Het_mid\t%AC_oth\t%AN_oth\t%AF_oth\t%AC_Hom_oth\t%AC_Het_oth\t%AC_sas\t%AN_sas\t%AF_sas\t%AC_Hom_sas\t%AC_Het_sas\n' \
| grep -v "*" >> ~{custom_annotations_file_name}

### for validation of the pipeline
wc -l ~{custom_annotations_file_name} | awk '{print $1}' > count.txt
# Should this be where we do the filtering of the AC/AN/AF values rather than in the python?

### compress the vcf and index it
bcftools view -I deduplicated.vcf -Oz -o ~{normalized_vcf_compressed}
sort check_duplicates.tsv | uniq -d | cut -f1,2,3,4,5 > duplicates.tsv
rm check_duplicates.tsv ## clean up
## remove those rows (that match up to the first 5 cols)
grep -v -wFf duplicates.tsv ~{normalized_vcf} > deduplicated.vcf
rm ~{normalized_vcf} ## clean up

## add duplicates to the file tracking dropped variants
cat duplicates.tsv >> track_dropped.tsv
rm duplicates.tsv ## clean up unneeded file
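
Note that `grep -v -wFf duplicates.tsv` removes every line containing a duplicated CHROM/POS/ID/REF/ALT row as a whole-word fixed string, so all copies of a duplicated variant are dropped, not just the extras, and those rows are recorded in track_dropped.tsv. A toy run of the mechanism:

```bash
# Three rows; the chr1:100 A>T row appears twice.
printf 'chr1\t100\t.\tA\tT\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tG\tC\n' > check_duplicates.tsv
sort check_duplicates.tsv | uniq -d > duplicates.tsv   # each duplicated row, listed once
# -F: fixed strings, -w: whole-word matches, -f: read patterns from a file
grep -v -wFf duplicates.tsv check_duplicates.tsv       # only the chr1:200 row survives
```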

## calculate annotations for all subpopulations
bcftools plugin fill-tags -- deduplicated.vcf -S ~{subpopulation_sample_list} -t AC,AF,AN,AC_het,AC_hom,AC_Hemi | bcftools query -f \
'%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF\t%AC_Hom\t%AC_Het\t%AC_Hemi\t%AC_afr\t%AN_afr\t%AF_afr\t%AC_Hom_afr\t%AC_Het_afr\t%AC_Hemi_afr\t%AC_amr\t%AN_amr\t%AF_amr\t%AC_Hom_amr\t%AC_Het_amr\t%AC_Hemi_amr\t%AC_eas\t%AN_eas\t%AF_eas\t%AC_Hom_eas\t%AC_Het_eas\t%AC_Hemi_eas\t%AC_eur\t%AN_eur\t%AF_eur\t%AC_Hom_eur\t%AC_Het_eur\t%AC_Hemi_eur\t%AC_mid\t%AN_mid\t%AF_mid\t%AC_Hom_mid\t%AC_Het_mid\t%AC_Hemi_mid\t%AC_oth\t%AN_oth\t%AF_oth\t%AC_Hom_oth\t%AC_Het_oth\t%AC_Hemi_oth\t%AC_sas\t%AN_sas\t%AF_sas\t%AC_Hom_sas\t%AC_Het_sas\t%AC_Hemi_sas\n' \
>> ~{custom_annotations_file_name}
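
The `fill-tags` plugin derives the per-subpopulation tags from a two-column samples file (sample name, population), and `bcftools query` then flattens them into the 52-column TSV rows appended under the template header. A minimal, hedged sketch of that mechanism with invented sample names (`bcftools plugin fill-tags` and `bcftools +fill-tags` invoke the same plugin):

```bash
printf 'sample1\tafr\nsample2\tafr\nsample3\teas\n' > groups.tsv
# Adds INFO/AC_afr, AN_afr, AF_afr, AC_eas, ... alongside the overall AC/AN/AF.
bcftools +fill-tags deduplicated.vcf -- -S groups.tsv -t AC,AN,AF |
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC_afr\t%AN_afr\t%AF_afr\n' | head
```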

## for validation of the pipeline: count the data lines (total line count minus the 7-line template header)
wc -l ~{custom_annotations_file_name} | awk '{print $1 - 7}' > count.txt

## compress the vcf and index it, make it sites-only
bcftools view --no-update --drop-genotypes deduplicated.vcf -Oz -o ~{normalized_vcf_compressed}
## if we can spare the IO and want to pass a smaller file, we can also drop the INFO field with bcftools annotate -x INFO
bcftools index --tbi ~{normalized_vcf_compressed}
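
A hedged sketch of the alternative the comment above mentions, dropping INFO as well as genotypes to shrink the file further (untested here against the downstream annotation step; file names are assumptions):

```bash
bcftools annotate -x INFO deduplicated.vcf |
  bcftools view --no-update --drop-genotypes -Oz -o sites_only_no_info.vcf.gz
bcftools index --tbi sites_only_no_info.vcf.gz
```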

>>>
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20210916"
memory: "20 GB"
preemptible: 1
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20211007"
memory: "32 GB"
preemptible: 3
cpu: "2"
disks: "local-disk 500 SSD"
}
@@ -271,52 +286,12 @@ task ExtractAnAcAfFromVCF {
output {
File annotations_file = "~{custom_annotations_file_name}"
Int count_variants = read_int("count.txt")
File track_dropped = "track_dropped.tsv"
File output_vcf = "~{normalized_vcf_compressed}"
File output_vcf_index = "~{normalized_vcf_indexed}"
}
}

task SitesOnlyVcf {
input {
File input_vcf
File input_vcf_index
String? service_account_json_path
String output_filename
}
String output_vcf_idx = basename(output_filename) + ".tbi"

command <<<
set -e

# Adding `--add-output-vcf-command-line false` so that the VCF header doesn't have a timestamp
# in it so that downstream steps can call cache

gatk --java-options "-Xmx12288m" \
SelectVariants \
-V ~{input_vcf} \
--add-output-vcf-command-line false \
--exclude-filtered \
--sites-only-vcf-output \
-O ~{output_filename}

>>>
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "broadinstitute/gatk:4.2.0.0"
memory: "15 GB"
preemptible: 3
cpu: "2"
disks: "local-disk 250 HDD"
}
# ------------------------------------------------
# Outputs:
output {
File output_vcf="~{output_filename}"
File output_vcf_idx="~{output_vcf_idx}"
}
}

task AnnotateVCF {
input {
File input_vcf
@@ -375,10 +350,10 @@
# Runtime settings:
runtime {
docker: "annotation/nirvana:3.14"
memory: "20 GB"
memory: "32 GB"
cpu: "2"
preemptible: 5
disks: "local-disk 250 SSD"
disks: "local-disk 500 SSD"
}
# ------------------------------------------------
# Outputs:
@@ -430,7 +405,7 @@ task PrepAnnotationJson {
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20210916"
docker: "us.gcr.io/broad-dsde-methods/variantstore:ah_var_store_20211007"
memory: "8 GB"
preemptible: 5
cpu: "1"
@@ -668,6 +643,7 @@ task BigQuerySmokeTest {
String project_id
String dataset_name
Array[Int] counts_variants
Array[File] track_dropped_variants
String table_suffix
String? service_account_json_path
Boolean load_jsons_done
@@ -695,6 +671,12 @@
# sum all the initial input variants across the shards

INITIAL_VARIANT_COUNT=$(python -c "print(sum([~{sep=', ' counts_variants}]))")
## concatenate the per-shard dropped-variant tracking files: ~{sep=', ' track_dropped_variants}
awk 'FNR==1{print ""}1' ~{sep=' ' track_dropped_variants} > dropped_variants.txt

echo "Dropped variants:"
cat dropped_variants.txt
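
The awk one-liner merges the per-shard tracking files: `FNR==1` fires on the first line of each input file and prints a blank separator line, and the trailing `1` is the always-true pattern that prints every line. A toy run:

```bash
printf 'a\nb\n' > shard1.txt
printf 'c\n' > shard2.txt
awk 'FNR==1{print ""}1' shard1.txt shard2.txt
# Output: a blank line, "a", "b", another blank line, "c".
```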

# Count number of variants in the VAT
bq query --nouse_legacy_sql --project_id=~{project_id} --format=csv 'SELECT COUNT (DISTINCT vid) AS count FROM `~{dataset_name}.~{vat_table}`' > bq_variant_count.csv
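
The comparison against INITIAL_VARIANT_COUNT sits in the collapsed remainder of this task; a hedged sketch of what such a check plausibly looks like, given that `--format=csv` emits a header row followed by the value (the VAT_COUNT variable name is invented, and the real code may differ):

```bash
VAT_COUNT=$(tail -1 bq_variant_count.csv)   # skip the "count" CSV header row
if [ "$INITIAL_VARIANT_COUNT" -ne "$VAT_COUNT" ]; then
  echo "Mismatch: $INITIAL_VARIANT_COUNT input variants vs $VAT_COUNT distinct vids in the VAT"
fi
```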