First pass at a Terra QuickStart #7267
Merged
# Quickstart - Joint Calling with the Broad Genomic Variant Store

**Note:** The markdown source for this quickstart is maintained in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections, or improvements in a pull request there. Do not edit this file directly.

## Overview
Through this QuickStart you will learn how to use the Broad Genomic Variant Store (GVS) to create a GATK VQSR-filtered joint callset VCF for whole genome samples.

The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).

**Note:** VQSR fails with the default/recommended configuration on this data, so SNP max-gaussians is set to 4 here.

## Prerequisites

This quickstart assumes that you are familiar with Terra workspaces and the data model, and with providing input parameters and launching workflows.

1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).
2. Grant the "BigQuery Data Editor" role on that **dataset** to your Terra proxy group. Your proxy group name can be found on your Terra Profile page and looks something like `PROXY_3298237498237948372@firecloud.org`.
3. Grant the following roles on the Google **project** containing the dataset to your proxy group (see the sketch after this list):
   - BigQuery Data Editor
   - BigQuery Job User
   - BigQuery Read Session User

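If you prefer the command line to the Cloud Console, the project-level grants in step 3 can be made with `gcloud`. This is a minimal sketch, assuming you have rights to set IAM policy on the project (e.g. Owner); the project ID and proxy group shown are placeholders to substitute.

```bash
# Placeholders -- substitute your own Google project and Terra proxy group.
PROJECT_ID="my-gvs-project"
PROXY_GROUP="PROXY_3298237498237948372@firecloud.org"

# Grant the three project-level BigQuery roles to the proxy group.
for ROLE in roles/bigquery.dataEditor roles/bigquery.jobUser roles/bigquery.readSessionUser; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="group:${PROXY_GROUP}" \
    --role="${ROLE}"
done
```

The dataset-level grant in step 2 is simplest to make from the dataset's sharing/permissions panel in the BigQuery console.
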
## 1. Import Data

First, import your reblocked gVCF files into GVS by running the `GvsImportGenomes` workflow.

The workflow should be run against a sample set indicating the samples to load. The sample table should have columns for the gVCFs (`hg38_reblocked_gvcf`) and their indexes (`hg38_reblocked_gvcf_index`).

A sample set for the quickstart (`gvs_demo-10`) has already been created with 10 samples.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the Google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| sample_map | Use `"gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_demo_10_sample_map.csv"` |
| output_directory | A unique GCS path to be used for loading; can be in the workspace bucket (e.g. `gs://fc-124-12-132-123-31/gvs/demo1`) |

**NOTE:** If your workflow fails, you will need to manually remove a lockfile from the output directory before re-running. It is called `LOCKFILE` and can be removed with `gsutil rm`, as shown below.

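A minimal sketch of clearing the lockfile, assuming the example `output_directory` from the table above; substitute the path you actually used:

```bash
# Use the same GCS path you passed to GvsImportGenomes as output_directory (placeholder shown).
OUTPUT_DIR="gs://fc-124-12-132-123-31/gvs/demo1"

# A failed run leaves a LOCKFILE behind; remove it before retrying the workflow.
gsutil rm "${OUTPUT_DIR}/LOCKFILE"
```
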
## 1.1 Create Alt Allele Table
**NOTE:** This is a bit of a kludge until we gain more confidence that the data loaded into the `ALT_ALLELE` table for feature training are optimal and we can automate this process.

You'll need to run this from the BigQuery console for your dataset.

Load the SQL script from the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/bq/alt_allele_creation.example.sql).

There are three places where you need to replace the string `spec-ops-aou.gvs_tieout_acmg_v1` with your project and dataset name, in the form `PROJECT.DATASET`.

Execute the script; it should take 30-60 seconds to run and creates the `ALT_ALLELE` table in your dataset. If you prefer the `bq` CLI, see the sketch below.

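This is an untested sketch of performing the same substitution and query with the `bq` CLI instead of the console. It assumes the SQL script has been saved locally as `alt_allele_creation.example.sql`; the project and dataset names are placeholders.

```bash
# Placeholders -- substitute your own project and dataset.
PROJECT="my-gvs-project"
DATASET="my_gvs_dataset"

# Replace every occurrence of the example project.dataset with yours (three in the script).
sed "s/spec-ops-aou\.gvs_tieout_acmg_v1/${PROJECT}.${DATASET}/g" \
    alt_allele_creation.example.sql > alt_allele_creation.sql

# Run the edited script as standard SQL against your project.
bq query --project_id="${PROJECT}" --use_legacy_sql=false < alt_allele_creation.sql
```
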
## 2. Create Filter Set

This step calculates features from the `ALT_ALLELE` table, trains the VQSR filtering model along with site-level QC filters, and loads them into a series of `filter_set_*` tables in BigQuery.

This is done by running the `GvsCreateFilterSet` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the Google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | A unique name to identify this filter set (e.g. `my_demo_filters`); you will want to make note of this for use in step 4 |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

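If you find it easier to fill in the Terra inputs form by uploading a JSON file, a sketch is shown below. The fully qualified input keys are assumptions based on the parameter table above and should be verified against the workflow's WDL; the values are placeholders.

```bash
# Hypothetical inputs JSON for GvsCreateFilterSet; key names are assumed, values are placeholders.
cat > gvs_create_filter_set.inputs.json <<'EOF'
{
  "GvsCreateFilterSet.data_project": "my-gvs-project",
  "GvsCreateFilterSet.default_dataset": "my_gvs_dataset",
  "GvsCreateFilterSet.filter_set_name": "my_demo_filters"
}
EOF
```
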
## 3. Prepare Callset
This step performs the heavy lifting in BigQuery to gather all the data required to create a jointly called VCF.

This is done by running the `GvsPrepareCallset` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the Google project containing the dataset |
| default_dataset | The name of the dataset |
| destination_cohort_table_prefix | A unique, descriptive name for the tables containing the callset (e.g. `demo_10_wgs_callset`); you will want to make note of this for use in the next step |
| sample_names_to_extract | A file of sample names to be extracted in the callset (use `gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_quickstart_10_samples.txt`) |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

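To confirm the cohort tables were created, you can list the dataset's tables and look for the prefix you chose. A minimal sketch with the `bq` CLI, using placeholder names:

```bash
# Placeholders -- substitute your own project, dataset, and destination_cohort_table_prefix.
PROJECT="my-gvs-project"
DATASET="my_gvs_dataset"

# List tables in the dataset and filter for the cohort table prefix.
bq ls --project_id="${PROJECT}" --max_results=1000 "${DATASET}" | grep "demo_10_wgs_callset"
```
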
## 4. Extract Cohort

This step extracts the data in BigQuery prepared by `GvsPrepareCallset` and transforms it into a sharded, jointly called VCF.

This is done by running the `GvsExtractCallset` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the Google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | The name of the filter set identifier created in step 2 |
| fq_cohort_extract_table_prefix | The fully qualified name of the `destination_cohort_table_prefix` from step 3, of the form `<project>.<dataset>.<destination_cohort_table_prefix>` |
| output_file_base_name | Base name for the generated VCFs |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

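As in step 2, the inputs can be supplied by uploading a JSON file; the sketch below also illustrates how `fq_cohort_extract_table_prefix` is composed from the project, dataset, and step-3 prefix. The input keys are assumptions based on the parameter table above and the values are placeholders; verify against the workflow's WDL.

```bash
# Hypothetical inputs JSON for GvsExtractCallset; key names are assumed, values are placeholders.
cat > gvs_extract_callset.inputs.json <<'EOF'
{
  "GvsExtractCallset.data_project": "my-gvs-project",
  "GvsExtractCallset.default_dataset": "my_gvs_dataset",
  "GvsExtractCallset.filter_set_name": "my_demo_filters",
  "GvsExtractCallset.fq_cohort_extract_table_prefix": "my-gvs-project.my_gvs_dataset.demo_10_wgs_callset",
  "GvsExtractCallset.output_file_base_name": "gvs_demo_10"
}
EOF
```
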
## 5. Your VCF is ready!

The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`.

I wasn't able to add the roles for prerequisites 2 & 3 as an "Editor".

Is this because you need to be an Owner so that you can grant these IAM roles? Or do you need to be Owner for some other reason (i.e., if someone else granted you these roles, would that work)?

Also -- the directions didn't really talk about making a temp dataset... was that a problem? Did you have to make one, @rsasch?

Yes, I needed to be "Owner" to add those roles for my proxy group. I didn't test a scenario where those roles were added by someone else. I believe being "Owner" also solved the error localizing GATK (my pet needed "storage.objects.list" access to the Google Cloud Storage bucket).

I think I might have misread my notes. It looks like I ran into the issue in the CreateVetTables call: “BigQuery error in mk operation: Access Denied: Dataset”, but it worked once I used “BigQuery Data Editor” instead.