First pass at a Terra QuickStart #7267

Merged: 5 commits, Jun 16, 2021
104 changes: 104 additions & 0 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -0,0 +1,104 @@
# Quickstart - Joint Calling with the Broad Genomic Variant Store

**Note:** The markdown source for this quickstart is maintained in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections, or improvements in a pull request there. Do not edit this file directly.

## Overview
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR-filtered joint callset VCF for whole-genome samples.

The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).

**Note:** VQSR fails with the default/recommended configuration, so the SNP `max-gaussians` value is set to 4 here.

## Prerequisites

This quickstart assumes that you are familiar with Terra workspaces and the Terra data model, and that you know how to provide input parameters and launch workflows.

1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).

**Review comment:** I wasn't able to add roles (for prerequisites 2 & 3) as an "Editor".

Suggested change, from

> 1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).

to

> 1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on) in a Google project where you are an "Owner."

**Contributor Author:** Is this because you need to be an owner so that you can grant these IAM roles? Or do you need to be an owner for some other reason (i.e., if someone else granted you these roles, would that work)?

**Contributor Author:** Also, the directions didn't really talk about making a temp dataset. Was that a problem? Did you have to make one, @rsasch?

**@rsasch (Jun 1, 2021):**

> Is this because you need to be an owner so that you can grant these IAM roles? Or do you need to be an owner for some other reason (i.e., if someone else granted you these roles, would that work)?

Yes, I needed to be "Owner" to add those roles for my proxy group. I didn't test a scenario where those roles were added by someone else. I believe being "Owner" also solved the error localizing GATK (my pet service account needed `storage.objects.list` access to the Google Cloud Storage bucket).

> Also, the directions didn't really talk about making a temp dataset. Was that a problem? Did you have to make one, @rsasch?

I think I might have misread my notes. It looks like I ran into the issue in the `CreateVetTables` call ("BigQuery error in mk operation: Access Denied: Dataset"), but it worked once I used "BigQuery Data Editor" instead.

2. Grant the "BigQuery Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and looks something like `PROXY_3298237498237948372@firecloud.org`.
3. Grant the following roles on the Google **project** containing the dataset to your proxy group (see the command-line sketch after this list):
- BigQuery Data Editor
- BigQuery Job User
- BigQuery Read Session User
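
If you prefer the command line to the Cloud Console for prerequisite 3, here is a minimal sketch, assuming the `gcloud` CLI is installed and authenticated and using placeholder values for the project and proxy group (the dataset-level "BigQuery Editor" grant from prerequisite 2 is most easily done from the dataset's sharing panel in the BigQuery console):

```bash
# Placeholder values -- substitute your Google project ID and your Terra proxy group.
PROJECT_ID="my-gvs-project"
PROXY_GROUP="PROXY_3298237498237948372@firecloud.org"

# Grant the three project-level roles from prerequisite 3 to the proxy group.
for ROLE in roles/bigquery.dataEditor roles/bigquery.jobUser roles/bigquery.readSessionUser; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="group:${PROXY_GROUP}" \
    --role="${ROLE}"
done
```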

## 1. Import Data

First, we have to import your reblocked gVCF files into GVS by running the `GvsImportGenomes` workflow.

The workflow should be run against a sample set indicating the samples to load. The sample table should have columns for the gVCFs (`hg38_reblocked_gvcf`) and their indexes (`hg38_reblocked_gvcf_index`).

A sample set for the quickstart (`gvs_demo-10`) has already been created with 10 samples.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| sample_map | Use `"gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_demo_10_sample_map.csv"` |
| output_directory | A unique GCS path to be used for loading; this can be in the workspace bucket, e.g. `gs://fc-124-12-132-123-31/gvs/demo1` |


**Note:** If your workflow fails, you will need to manually remove a lockfile from the output directory. It is called `LOCKFILE` and can be removed with `gsutil rm`, as shown below.
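
For example, a minimal sketch reusing the hypothetical `output_directory` from the table above:

```bash
# Remove the stale lockfile left behind by a failed GvsImportGenomes run.
# Substitute your own output_directory; this path is the example from the table above.
gsutil rm gs://fc-124-12-132-123-31/gvs/demo1/LOCKFILE
```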

## 1.1 Create Alt Allele Table
**Note:** This is a bit of a kludge until we gain more confidence that the data loaded into the `ALT_ALLELE` table for feature training are optimal and we can automate this process.

You'll need to run this from the BigQuery Console for your dataset.

Load the SQL script you can find in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/bq/alt_allele_creation.example.sql).

There are three places where you need to replace the string `spec-ops-aou.gvs_tieout_acmg_v1` with your project and dataset name in the form `PROJECT.DATASET`.

Execute the script; it should take 30-60 seconds to run and results in the creation of the `ALT_ALLELE` table in your dataset. A command-line alternative is sketched below.
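
If you prefer the command line to the BigQuery Console, here is a minimal sketch of the same substitution and execution, assuming you have saved the script locally as `alt_allele_creation.example.sql` and have the `bq` CLI configured (project and dataset names are placeholders):

```bash
# Placeholder values -- substitute your own project and dataset.
PROJECT_ID="my-gvs-project"
DATASET="my_gvs_dataset"

# Replace the example PROJECT.DATASET string (it appears in three places in the script)
# with your own, then run the resulting script against BigQuery.
sed "s/spec-ops-aou\.gvs_tieout_acmg_v1/${PROJECT_ID}.${DATASET}/g" \
  alt_allele_creation.example.sql > alt_allele_creation.sql

bq --project_id="${PROJECT_ID}" query --use_legacy_sql=false < alt_allele_creation.sql
```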

## 2. Create Filter Set

This step calculates features from the `ALT_ALLELE` table, trains the VQSR filtering model along with site-level QC filters, and loads them into a series of `filter_set_*` tables in BigQuery.

This is done by running the `GvsCreateFilterSet` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | A unique name to identify this filter set (e.g. `my_demo_filters`); you will want to make note of this for use in step 4 |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`. A sketch of how these inputs might be written out as a file is shown below.
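
For reference, here is a minimal sketch of these three inputs written out as a Cromwell-style inputs JSON file. The `GvsCreateFilterSet.<input>` key names and all values are illustrative assumptions; check the input names shown on the workflow's inputs tab in Terra:

```bash
# Write a hypothetical inputs file for GvsCreateFilterSet.
# The WorkflowName.input_name keys are an assumption based on the parameter table above;
# the project, dataset, and filter set values are placeholders.
cat > gvs_create_filter_set.inputs.json <<'EOF'
{
  "GvsCreateFilterSet.data_project": "my-gvs-project",
  "GvsCreateFilterSet.default_dataset": "my_gvs_dataset",
  "GvsCreateFilterSet.filter_set_name": "my_demo_filters"
}
EOF
```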

## 3. Prepare Callset
This step performs the heavy lifting in BigQuery to gather all the data required to create a jointly called VCF.

This is done by running the `GvsPrepareCallset` workflow with the following parameters:


| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| destination_cohort_table_prefix | A unique, descriptive name for the tables containing the callset (e.g. `demo_10_wgs_callset`); you will want to make note of this for use in the next step |
| sample_names_to_extract | A file of sample names to be extracted in the callset (use `gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_quickstart_10_samples.txt`) |


**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

## 4. Extract Cohort

This step extracts the data prepared in BigQuery by `GvsPrepareCallset` and transforms it into a sharded, jointly called VCF.

This is done by running the `GvsExtractCallset` workflow with the following parameters:


| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | The `filter_set_name` identifier you created in step 2 |
| fq_cohort_extract_table_prefix | The fully qualified name of the `destination_cohort_table_prefix` from step 3, in the form `<project>.<dataset>.<destination_cohort_table_prefix>` (e.g. `my-gvs-project.my_gvs_dataset.demo_10_wgs_callset`) |
| output_file_base_name | Base name for generated VCFs |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

## 5. Your VCF is ready!!

The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`. If you want local copies, you can download them with `gsutil`, as in the sketch below.
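
A minimal sketch for pulling the shards down to a local directory, assuming you have copied the actual GCS paths out of the Terra job history (the paths shown here are placeholders):

```bash
# Copy the sharded VCFs and their indexes to a local directory.
# Replace these placeholder GCS paths with the real values listed under
# ExtractTask.output_vcf and ExtractTask.output_vcf_index in the Terra job history.
mkdir -p gvs_demo_vcfs
gsutil -m cp \
  gs://fc-124-12-132-123-31/gvs/demo1/my_demo_shard_0.vcf.gz \
  gs://fc-124-12-132-123-31/gvs/demo1/my_demo_shard_0.vcf.gz.tbi \
  gvs_demo_vcfs/
```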