First pass at a Terra QuickStart #7267

Merged: 5 commits, Jun 16, 2021
104 changes: 104 additions & 0 deletions scripts/variantstore/TERRA_QUICKSTART.md
@@ -0,0 +1,104 @@
# Quickstart - Joint Calling with the Broad Genomic Variant Store

**Note:** The markdown source for this quickstart is maintained in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/TERRA_QUICKSTART.md). Submit any feedback, corrections, or improvements in a pull request there. Do not edit this file directly.

## Overview
Through this QuickStart you will learn how to use the Broad Genomic Variant Store to create a GATK VQSR-filtered joint callset VCF for whole-genome samples.

The sequencing data in this quickstart came from the [AnVIL 1000G High Coverage workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).

**Note:** VQSR fails with the default/recommended configuration, so the SNP `max-gaussians` value is set to 4 here.

## Prerequisites

This quickstart assumes that you are familiar with Terra workspaces and the Terra data model, and that you know how to provide input parameters and launch workflows.

1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).

**Review comment:** I wasn't able to add roles (for prerequisites 2 & 3) as an "Editor".

Suggested change, from

> 1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on).

to

> 1. You will need to have or create a BigQuery dataset (we'll call this `datasetname` later on) in a Google project where you are an "Owner."

**Contributor Author:** Is this because you need to be an owner so that you can grant these IAM roles? Or do you need to be an owner for some other reason (i.e., if someone else granted you these roles, would that work)?

**Contributor Author:** Also, the directions didn't really talk about making a temp dataset. Was that a problem? Did you have to make one, @rsasch?

**@rsasch (Jun 1, 2021):**

> Is this because you need to be an owner so that you can grant these IAM roles? Or do you need to be an owner for some other reason (i.e., if someone else granted you these roles, would that work)?

Yes, I needed to be "Owner" to add those roles for my proxy group. I didn't test a scenario where those roles were added by someone else. I believe being "Owner" also solved the error localizing GATK (my pet service account needed `storage.objects.list` access to the Google Cloud Storage bucket).

> Also, the directions didn't really talk about making a temp dataset. Was that a problem? Did you have to make one, @rsasch?

I think I might have misread my notes. It looks like I ran into the issue in the `CreateVetTables` call ("BigQuery error in mk operation: Access Denied: Dataset"), but it worked once I used "BigQuery Data Editor" instead.

2. Grant the "BigQuery Editor" role on that **dataset** to your Terra PROXY group. Your proxy group name can be found on your Terra Profile page and looks something like `PROXY_3298237498237948372@firecloud.org`.
3. Grant the following roles on the Google **project** containing the dataset to your proxy group (see the command-line sketch after this list):
- BigQuery Data Editor
- BigQuery Job User
- BigQuery Read Session User
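
If you prefer the command line to the Cloud Console for prerequisite 3, here is a minimal sketch, assuming the `gcloud` CLI is installed and authenticated and using placeholder values for the project and proxy group (the dataset-level "BigQuery Editor" grant from prerequisite 2 is most easily done from the dataset's sharing panel in the BigQuery console):

```bash
# Placeholder values -- substitute your Google project ID and your Terra proxy group.
PROJECT_ID="my-gvs-project"
PROXY_GROUP="PROXY_3298237498237948372@firecloud.org"

# Grant the three project-level roles from prerequisite 3 to the proxy group.
for ROLE in roles/bigquery.dataEditor roles/bigquery.jobUser roles/bigquery.readSessionUser; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="group:${PROXY_GROUP}" \
    --role="${ROLE}"
done
```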

## 1. Import Data

First, we have to import your reblocked gVCF files into GVS by running the `GvsImportGenomes` workflow.

The workflow should be run against a sample set indicating the samples to load. The sample table should have columns for the gVCFs (`hg38_reblocked_gvcf`) and their indexes (`hg38_reblocked_gvcf_index`).

A sample set for the quickstart (`gvs_demo-10`) has already been created with 10 samples.

These are the required parameters which must be supplied to the workflow:

| Parameter | Description |
| ----------------- | ----------- |
| project_id | The name of the google project containing the dataset |
| dataset_name | The name of the dataset you created above |
| sample_map | Use `"gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_demo_10_sample_map.csv"` |
| output_directory | A unique GCS path to be used for loading; this can be in the workspace bucket, e.g. `gs://fc-124-12-132-123-31/gvs/demo1` |


**Note:** If your workflow fails, you will need to manually remove a lockfile from the output directory. It is called `LOCKFILE` and can be removed with `gsutil rm`, as shown below.
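
For example, a minimal sketch reusing the hypothetical `output_directory` from the table above:

```bash
# Remove the stale lockfile left behind by a failed GvsImportGenomes run.
# Substitute your own output_directory; this path is the example from the table above.
gsutil rm gs://fc-124-12-132-123-31/gvs/demo1/LOCKFILE
```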

## 1.1 Create Alt Allele Table
**Note:** This is a bit of a kludge until we gain more confidence that the data loaded into the `ALT_ALLELE` table for feature training are optimal and we can automate this process.

You'll need to run this from the BigQuery Console for your dataset.

Load the SQL script you can find in the [GATK GitHub Repository](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/bq/alt_allele_creation.example.sql).

There are three places where you need to replace the string `spec-ops-aou.gvs_tieout_acmg_v1` with your project and dataset name in the form `PROJECT.DATASET`.

Execute the script; it should take 30-60 seconds to run and results in the creation of the `ALT_ALLELE` table in your dataset. A command-line alternative is sketched below.
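
If you prefer the command line to the BigQuery Console, here is a minimal sketch of the same substitution and execution, assuming you have saved the script locally as `alt_allele_creation.example.sql` and have the `bq` CLI configured (project and dataset names are placeholders):

```bash
# Placeholder values -- substitute your own project and dataset.
PROJECT_ID="my-gvs-project"
DATASET="my_gvs_dataset"

# Replace the example PROJECT.DATASET string (it appears in three places in the script)
# with your own, then run the resulting script against BigQuery.
sed "s/spec-ops-aou\.gvs_tieout_acmg_v1/${PROJECT_ID}.${DATASET}/g" \
  alt_allele_creation.example.sql > alt_allele_creation.sql

bq --project_id="${PROJECT_ID}" query --use_legacy_sql=false < alt_allele_creation.sql
```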

## 2. Create Filter Set

This step calculates features from the `ALT_ALLELE` table, trains the VQSR filtering model along with site-level QC filters, and loads them into a series of `filter_set_*` tables in BigQuery.

This is done by running the `GvsCreateFilterSet` workflow with the following parameters:

| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | A unique name to identify this filter set (e.g. `my_demo_filters`); you will want to make note of this for use in step 4 |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`. A sketch of how these inputs might be written out as a file is shown below.
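
For reference, here is a minimal sketch of these three inputs written out as a Cromwell-style inputs JSON file. The `GvsCreateFilterSet.<input>` key names and all values are illustrative assumptions; check the input names shown on the workflow's inputs tab in Terra:

```bash
# Write a hypothetical inputs file for GvsCreateFilterSet.
# The WorkflowName.input_name keys are an assumption based on the parameter table above;
# the project, dataset, and filter set values are placeholders.
cat > gvs_create_filter_set.inputs.json <<'EOF'
{
  "GvsCreateFilterSet.data_project": "my-gvs-project",
  "GvsCreateFilterSet.default_dataset": "my_gvs_dataset",
  "GvsCreateFilterSet.filter_set_name": "my_demo_filters"
}
EOF
```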

## 3. Prepare Callset
This step performs the heavy lifting in BigQuery to gather all the data required to create a jointly called VCF.

This is done by running the `GvsPrepareCallset` workflow with the following parameters:


| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| destination_cohort_table_prefix | A unique, descriptive name for the tables containing the callset (e.g. `demo_10_wgs_callset`); you will want to make note of this for use in the next step |
| sample_names_to_extract | A file of sample names to be extracted in the callset (use `gs://fc-2b4456d7-974b-4b67-90f8-63c2fd2c03d4/gvs_quickstart_10_samples.txt`) |


**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

## 4. Extract Cohort

This step extracts the data prepared in BigQuery by `GvsPrepareCallset` and transforms it into a sharded, jointly called VCF.

This is done by running the `GvsExtractCallset` workflow with the following parameters:


| Parameter | Description |
| ----------------- | ----------- |
| data_project | The name of the google project containing the dataset |
| default_dataset | The name of the dataset |
| filter_set_name | The `filter_set_name` identifier you created in step 2 |
| fq_cohort_extract_table_prefix | The fully qualified name of the `destination_cohort_table_prefix` from step 3, in the form `<project>.<dataset>.<destination_cohort_table_prefix>` (e.g. `my-gvs-project.my_gvs_dataset.demo_10_wgs_callset`) |
| output_file_base_name | Base name for generated VCFs |

**Note:** This workflow does not use the Terra Entity model to run, so be sure to select `Run workflow with inputs defined by file paths`.

## 5. Your VCF is ready!!

The sharded VCF output files are listed in the `ExtractTask.output_vcf` workflow output, and the associated index files are listed in `ExtractTask.output_vcf_index`. If you want local copies, you can download them with `gsutil`, as in the sketch below.
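
A minimal sketch for pulling the shards down to a local directory, assuming you have copied the actual GCS paths out of the Terra job history (the paths shown here are placeholders):

```bash
# Copy the sharded VCFs and their indexes to a local directory.
# Replace these placeholder GCS paths with the real values listed under
# ExtractTask.output_vcf and ExtractTask.output_vcf_index in the Terra job history.
mkdir -p gvs_demo_vcfs
gsutil -m cp \
  gs://fc-124-12-132-123-31/gvs/demo1/my_demo_shard_0.vcf.gz \
  gs://fc-124-12-132-123-31/gvs/demo1/my_demo_shard_0.vcf.gz.tbi \
  gvs_demo_vcfs/
```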