VS-280 Create a VAT intermediary (#7657)
* gsutil cp them up

* creation of new files and splitting of code

* store in between states

* add to dockstore!

* validate wdl

* add service account before gsutil

* wait until subpipeline is over!

* make it cheaper!

* remove cruft from dockstore

* remove in-case-of-fire step since subworkflow works much better

* add readme and examples

* stylize readme for readability

* add examples to dockstore

* remove gatk override param

* clean up params

* tsv note
RoriCremer authored Feb 25, 2022
1 parent d2ebecb commit 3aa3c3b
Showing 9 changed files with 613 additions and 324 deletions.
7 changes: 3 additions & 4 deletions .dockstore.yml
@@ -152,16 +152,15 @@ workflows:
branches:
- master
- ah_var_store
- name: GvsSitesOnlyVCF
- name: GvsCreateVAT
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsSitesOnlyVCF.wdl
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVAT.wdl
testParameterFiles:
- /scripts/variantstore/wdl/GvsSitesOnlyVCF.example.inputs.json
- /scripts/variantstore/wdl/GvsCreateVAT.example.inputs.json
filters:
branches:
- master
- ah_var_store
- rc-remove-sites-only-step
- name: GvsValidateVat
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl
3 changes: 1 addition & 2 deletions .gitattributes
@@ -5,8 +5,7 @@ src/test/resources/large/funcotator/funcotator_dataSources/dna_repair_genes/hg38
src/test/resources/large/funcotator/funcotator_dataSources/familial/hg38 -filter=lfs -diff=lfs -merge=lfs -text
src/test/resources/large/funcotator/funcotator_dataSources/hgnc/hg38 -filter=lfs -diff=lfs -merge=lfs -text
src/test/resources/large/funcotator/funcotator_dataSources/simple_uniprot/hg38 -filter=lfs -diff=lfs -merge=lfs -text

#Otherwise, track everything in large
src/test/resources/large/** filter=lfs diff=lfs merge=lfs -text
src/main/resources/large/** filter=lfs diff=lfs merge=lfs -text

*.psd filter=lfs diff=lfs merge=lfs -text
2 changes: 0 additions & 2 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -36,8 +36,6 @@ These are the required parameters which must be supplied to the workflow:
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| external_sample_names | datamodel (e.g `this.samples.sample_id`) |
| workspace_namespace | name of the current workspace namespace |
| workspace_name | name of the current workspace |

## 1.2 Load data

75 changes: 75 additions & 0 deletions scripts/variantstore/variant_annotations_table/README.md
@@ -0,0 +1,75 @@
# Creating the Variant Annotations Table

### The VAT pipeline is a set of WDLs
- [GvsCreateVAT.wdl](/scripts/variantstore/wdl/GvsCreateVAT.wdl)
- [GvsValidateVAT.wdl](/scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl)

The pipeline takes in a joint VCF and outputs a table in BigQuery.

**GvsCreateVAT** creates the table, while **GvsValidateVAT** checks and validates the VAT.


### Run GvsCreateVAT:

Most of the inputs are constants, like the reference or a table schema, and don't require any additional work (paths can be found in the [example inputs json](/scripts/variantstore/wdl/GvsCreateVAT.example.inputs.json)). However, for the specific data being put in the VAT, three inputs need to be created.

The first two of these inputs are files of file names: one listing the VCF shards you want to use for the VAT, and one listing their corresponding index files. These are labeled `inputFileofFileNames` and `inputFileofIndexFileNames`, and both need to be copied into a GCP bucket that this pipeline will have access to (e.g. `gs://aou-genomics-curation-prod-processing/vat/`) for easy access during the workflow.
The third input, labeled `ancestry_file`, is the ancestry file from the ancestry pipeline; it is used to calculate AC, AN, and AF for all subpopulations. It also needs to be copied into a GCP bucket that this pipeline will have access to.
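As a rough sketch of that setup step (nothing here is prescribed by the workflow: the gs:// paths, the local file names, and the `.tbi` index extension are all placeholder assumptions), building and uploading the two files of file names might look like this:

```bash
# Sketch only: list the joint VCF shards and their index files, write the two
# files of file names, and copy them plus the ancestry file to a bucket the
# workflow can read. All gs:// paths and file names are placeholders.
gsutil ls 'gs://my-gvs-extract-bucket/shards/*.vcf.gz'     > inputFileofFileNames.txt
gsutil ls 'gs://my-gvs-extract-bucket/shards/*.vcf.gz.tbi' > inputFileofIndexFileNames.txt

gsutil cp inputFileofFileNames.txt inputFileofIndexFileNames.txt ancestry_file.tsv \
    gs://aou-genomics-curation-prod-processing/vat/
```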

Most of the other inputs are specific to where the VAT will live: the `project_id` and `dataset_name`, the `table_suffix` (the VAT table itself will be named vat_`table_suffix`), and a GCP bucket location, the `output_path`, used for the intermediary files and the VAT export in tsv form.

The provided [example inputs json](/scripts/variantstore/wdl/GvsCreateVAT.example.inputs.json) indicates all of the inputs.
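Before launching, it can also be worth validating the WDL against a filled-in copy of that json with Cromwell's womtool (the jar path and the inputs file name below are placeholders):

```bash
# Sketch only: validate the workflow and a filled-in inputs json before submitting.
java -jar womtool.jar validate scripts/variantstore/wdl/GvsCreateVAT.wdl \
    --inputs GvsCreateVAT.my.inputs.json
```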


### Notes:

When running this pipeline, I currently try to stay around 5k-10k shards. I have yet to successfully run more than that, but I think that is more about nerves than anything else. At 20k shards the shards do tend to step on each other; nothing fails or is wrong, but I'd rather get results from a smaller run that I then repeat multiple times than wait on queued jobs.

Note that two temporary tables are created in addition to the main VAT table: the Genes and VT tables. They have a time to live of 24 hours.
The VAT table is re-created fresh by the join query each time, so there is no risk of duplicates there.
HOWEVER, the Genes and VT tables are not re-created. They are cleaned up after 24 hours, but this code needs to be tweaked so that you cannot get into a state where duplicates are created in them. The open question is whether there is a use case where adding to a VAT that was created, say, weeks ago would be beneficial; given that calculations occur at a sample-summing level, this seems unlikely.
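If you ever need to inspect or reset that 24-hour time to live by hand, the `bq` CLI can do it; the table names in this sketch are only guesses at the naming for a `table_suffix` of `v1`, so substitute the real project, dataset, and table names:

```bash
# Sketch only: check and (if needed) reset the 24-hour expiration on the
# temporary tables. Project, dataset, and table names are assumed.
bq show --format=prettyjson my-project:my_dataset.vat_vt_v1    | grep -i expiration
bq show --format=prettyjson my-project:my_dataset.vat_genes_v1 | grep -i expiration
bq update --expiration 86400 my-project:my_dataset.vat_vt_v1
bq update --expiration 86400 my-project:my_dataset.vat_genes_v1
```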


To check that all of the shards have successfully made it past the first of the sub-workflow steps (the most complicated, and the most likely to fail), count the annotated shards; each shard is transformed into a json file and put here:
`gsutil ls [output_path]/annotations/ | wc -l`

Then, once they have been transformed by the python script and are ready to be loaded into BQ, they will be here:
`gsutil ls [output_path]/genes/ | wc -l`
`gsutil ls [output_path]/vt/ | wc -l`
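To compare those counts against the number of shards you submitted, a quick sanity check along these lines can help (this is not part of the workflow, and `OUTPUT_PATH` and `INPUT_FOFN` are placeholders for your own values):

```bash
# Sketch only: compare the number of shards in the input file of file names
# against what has landed in each output directory so far.
expected=$(gsutil cat "${INPUT_FOFN}" | wc -l)
annotated=$(gsutil ls "${OUTPUT_PATH}/annotations/" | wc -l)
genes=$(gsutil ls "${OUTPUT_PATH}/genes/" | wc -l)
vt=$(gsutil ls "${OUTPUT_PATH}/vt/" | wc -l)
echo "expected: ${expected}  annotated: ${annotated}  genes: ${genes}  vt: ${vt}"
```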

These numbers are cumulative. The names of these json files are retained from the original shard names so as not to cause collisions: if you run the same shards through the VAT twice, the second run should overwrite the first and the total number of jsons should not change.
Once the shards have made it into the /genes/ and /vt/ directories, the majority of the expense and transformation needed for each shard is complete, and they are ready to be loaded into BQ. Past this step, all that remains is to create the BQ tables, load them, and run a join query; the remaining steps are all validations or an export into tsv.
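As a rough illustration of what that load step amounts to (the workflow performs it itself, so this is only a sketch; the project, dataset, table names, and `OUTPUT_PATH` are placeholders, and the schema files are local copies of the ones referenced in the example inputs json):

```bash
# Sketch only: load the shard jsons into the temporary VT and Genes tables.
# The workflow does this itself; all names and paths here are placeholders.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    my-project:my_dataset.vat_vt_v1 \
    "${OUTPUT_PATH}/vt/*.json.gz" \
    vt_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON \
    my-project:my_dataset.vat_genes_v1 \
    "${OUTPUT_PATH}/genes/*.json.gz" \
    genes_schema.json
```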


There are often a fair number of failures from Google in this workflow; so far they have all been 503s. Because of the re-working of the workflow, they should not interrupt unaffected shards, but the affected shards will need to be collected, put into a new file of file names, and re-run.
In theory you could simply re-run the entire pipeline: since the 503s seem to be completely random and intermittent, the same shards are unlikely to fail twice and you would get everything you need, but at low efficiency and high expense.
To grab any files that are in the file of file names but not yet in the bucket:

`gsutil ls [output_path]/genes/ | awk '{print substr($0,90)}' RS='.json.gz\n' > successful_vcf_shards.txt`
`gsutil cat [inputFileofFileNames] | awk '{print substr($0,41)}' RS='.vcf.gz\n' > all_vcf_shards.txt`
`comm -3 all_vcf_shards.txt successful_vcf_shards.txt > diff_vcf_shards.txt`
`awk '{print "[bucket the GVS output VCF shards are in]/" $0 ".vcf.gz"}' diff_vcf_shards.txt > fofn_missing_shards.txt`

_(don't forget to do this for the indices as well)_
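Note that the `substr()` offsets above depend on the exact length of your bucket paths, and `comm` expects sorted input; a slightly more portable sketch of the same comparison (with `OUTPUT_PATH`, `INPUT_FOFN`, and `SHARD_BUCKET` as placeholders) is:

```bash
# Sketch only: find shards that are in the file of file names but have not yet
# produced a genes json, without hard-coded character offsets.
gsutil ls "${OUTPUT_PATH}/genes/" | xargs -n1 basename | sed 's/\.json\.gz$//' | sort > successful_vcf_shards.txt
gsutil cat "${INPUT_FOFN}" | xargs -n1 basename | sed 's/\.vcf\.gz$//' | sort > all_vcf_shards.txt
comm -13 successful_vcf_shards.txt all_vcf_shards.txt > diff_vcf_shards.txt
sed "s|^|${SHARD_BUCKET}/|; s|$|.vcf.gz|" diff_vcf_shards.txt > fofn_missing_shards.txt
```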




There is a line of code in ExtractAnAcAfFromVCF (the most expensive task in the workflow, in both cost and time) that can be removed; it is only used to track the variants that are dropped. _TODO: Rori to make using it a parameter_
15 changes: 15 additions & 0 deletions scripts/variantstore/wdl/GvsCreateVAT.example.inputs.json
@@ -0,0 +1,15 @@
{
"GvsValidateVatTable.inputFileofFileNames": "FILE",
"GvsValidateVatTable.inputFileofIndexFileNames": "FILE",
"GvsValidateVatTable.project_id": "PROJECT_ID",
"GvsValidateVatTable.dataset_name": "DATASET",
"GvsValidateVatTable.nirvana_data_directory": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/NirvanaData.tar.gz",
"GvsValidateVatTable.vat_schema_json_file": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/schemas/vat_schema.json",
"GvsValidateVatTable.variant_transcript_schema_json_file": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/schemas/vt_schema.json",
"GvsValidateVatTable.genes_schema_json_file": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/schemas/genes_schema.json",
"GvsValidateVatTable.output_path": "PATH",
"GvsValidateVatTable.table_suffix": "v1",
"GvsValidateVatTable.service_account_json_path": "SERVICE_ACCOUNT",
"GvsValidateVatTable.AnAcAf_annotations_template": "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/vat/custom_annotations_template.tsv",
"GvsValidateVatTable.ancestry_file": "ANCESTRY_FILE"
}