From e267ca5b146ac0da4852110b13bf31ff384ddadd Mon Sep 17 00:00:00 2001
From: kayleemathews
Date: Tue, 6 Sep 2022 15:51:00 -0400
Subject: [PATCH 1/4] Added storage cost section

---
 .../variantstore/beta_docs/gvs-overview.md    |  2 +-
 .../variantstore/beta_docs/gvs-quickstart.md  | 24 +++++++++++--------
 .../beta_docs/run-your-own-samples.md         | 22 ++++++++++-------
 3 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/scripts/variantstore/beta_docs/gvs-overview.md b/scripts/variantstore/beta_docs/gvs-overview.md
index 0bf8a612930..7cdf1c3ab58 100644
--- a/scripts/variantstore/beta_docs/gvs-overview.md
+++ b/scripts/variantstore/beta_docs/gvs-overview.md
@@ -4,7 +4,7 @@
 | :----: | :---: | :----: | :--------------: |
 | [GvsJointVariantCalling](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/wdl/GvsJointVariantCalling.wdl) | June, 2022 | [Kaylee Mathews](mailto:kmathews@broadinstitute.org) and [Aurora Cremer](mailto:aurora@broadinstitute.org) | If you have questions or feedback, contact the [Broad Variants team](mailto:variants@broadinstitute.org) |
 
-![Diagram depicting the Genomic Variant Store workflow. Sample GVCF files are imported into the core data model. A filtering model is trained using Variant Quality Score Recalibration, or VQSR, and then used to extract cohorts and produce sharded joint VCF files. Each step integrates BigQuery and GATK tools.](/scripts/variantstore/beta_docs/genomic-variant-store_diagram.png)
+![Diagram depicting the Genomic Variant Store workflow. Sample GVCF files are imported into the core data model. A filtering model is trained using Variant Quality Score Recalibration, or VQSR, and then used to extract cohorts and produce sharded joint VCF files. Each step integrates BigQuery and GATK tools.](./genomic-variant-store_diagram.png)
 
 ## Introduction to the Genomic Variant Store workflow
 
diff --git a/scripts/variantstore/beta_docs/gvs-quickstart.md b/scripts/variantstore/beta_docs/gvs-quickstart.md
index 5c37f41fe97..cfa7321b636 100644
--- a/scripts/variantstore/beta_docs/gvs-quickstart.md
+++ b/scripts/variantstore/beta_docs/gvs-quickstart.md
@@ -8,7 +8,7 @@ The [GVS beta workspace](https://app.terra.bio/#workspaces/gvs-prod/Genomic_Vari
 
 ## Workflow Overview
 
-![Diagram depicting the Genomic Variant Store workflow. Sample GVCF files are imported into the core data model. A filtering model is trained using Variant Quality Score Recalibration, or VQSR, and then applied while the samples are extracted as cohorts in sharded joint VCF files. Each step integrates BigQuery and GATK tools.](/scripts/variantstore/beta_docs/genomic-variant-store_diagram.png)
+![Diagram depicting the Genomic Variant Store workflow. Sample GVCF files are imported into the core data model. A filtering model is trained using Variant Quality Score Recalibration, or VQSR, and then applied while the samples are extracted as cohorts in sharded joint VCF files. Each step integrates BigQuery and GATK tools.](./genomic-variant-store_diagram.png)
 
 The [GVS workflow](https://github.com/broadinstitute/gatk/blob/ah_var_store/scripts/variantstore/wdl/GvsJointVariantCalling.wdl) is an open-source, cloud-optimized workflow for joint calling at a large scale using the Terra platform. The workflow takes in single sample GVCF files with indices and produces sharded joint VCF files with indices, a manifest file, and metrics.
 
@@ -16,7 +16,7 @@ To learn more about the GVS workflow, see the [Genomic Variant Store workflow ov
 
 ### What data does it require as input?
 
-- reblocked single sample GVCF files (`input_vcfs`)
+- Reblocked single sample GVCF files (`input_vcfs`)
 - GVCF index files (`input_vcf_indexes`)
 
 Example GVCF and index files in the Data tab of the [GVS beta workspace](https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta) are hosted in a public Google bucket and links are provided in the sample data table.
@@ -27,9 +27,9 @@ While the GVS workflow has been tested with 100,000 single sample GVCF files as
 
 The following files are stored in the workspace Google bucket and links to the files are written to the `sample_set` data table:
 
-- sharded joint VCF files and index files
-- size of output VCF files in MB
-- manifest file containing the destinations and sizes in B of the output sharded joint VCF and index files
+- Sharded joint VCF files and index files
+- Size of output VCF files in MB
+- Manifest file containing the destinations and sizes in B of the output sharded joint VCF and index files
 
 ## Setup
 
@@ -127,6 +127,14 @@ Below is an example of the time and cost of running the workflow with the sample
 
 For more information about controlling Cloud costs, see [this article](https://support.terra.bio/hc/en-us/articles/360029748111).
 
+#### Storage cost
+
+The GVS workflow produces several intermediate files in your BigQuery dataset, and storing these files in the cloud will increase the storage cost associated with your callset. To reduce cloud storage costs, you can delete some of the intermediate files after your callset has been created successfully.
+
+If you plan to create subcohorts of your data, you can delete the tables with `_REF_DATA`, `_SAMPLES`, and `_VET_DATA` at the end of the table name in your BigQuery dataset by following the instructions in the Google Cloud article, [Managing tables](https://cloud.google.com/bigquery/docs/managing-tables#deleting_a_table).
+
+If you don’t plan to create subcohorts of your data, you can delete your BigQuery dataset by following the instructions in the Google Cloud article, [Managing datasets](https://cloud.google.com/bigquery/docs/managing-datasets#deleting_a_dataset). Note that the data will be deleted permanently from this location, but output files can still be found in the workspace bucket.
+
 ---
 
 ### Additional Resources
@@ -148,8 +156,4 @@ If you use plan to publish data analyzed using the GVS workflow, please cite the
 
 Details on citing Terra workspaces can be found here: [How to cite Terra](https://support.terra.bio/hc/en-us/articles/360035343652)
 
-Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
-
-### License
-**Copyright Broad Institute, 2022 | BSD-3**
-All code provided in the workspace is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/develop/LICENSE). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file
+Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
\ No newline at end of file
diff --git a/scripts/variantstore/beta_docs/run-your-own-samples.md b/scripts/variantstore/beta_docs/run-your-own-samples.md
index 9081675aa89..0d21c1e0922 100644
--- a/scripts/variantstore/beta_docs/run-your-own-samples.md
+++ b/scripts/variantstore/beta_docs/run-your-own-samples.md
@@ -12,7 +12,7 @@ To learn more about the GVS workflow, see the [Genomic Variant Store workflow ov
 
 ### What does it require as input?
 
-- reblocked single sample GVCF files (`input_vcfs`) with specific annotations described below
+- Reblocked single sample GVCF files (`input_vcfs`) with specific annotations described below
 - GVCF index files (`input_vcf_indexes`)
 
 While the GVS workflow has been tested with 100,000 single sample GVCF files as input, only datasets of up to 10,000 files are being used for beta testing.
@@ -45,9 +45,9 @@ Input GVCF files for the GVS workflow must include the annotations described in
 
 The following files are stored in the workspace Google bucket and links to the files are written to the sample_set data table:
 
-- sharded joint VCF files and index files
-- size of output VCF files in MB
-- manifest file containing the output destination of additional files and other metadata
+- Sharded joint VCF files and index files
+- Size of output VCF files in MB
+- Manifest file containing the output destination of additional files and other metadata
 
 ## Setup
 
@@ -184,6 +184,14 @@ Below are several examples of the time and cost of running the workflow.
 
 For more information about controlling Cloud costs, see [this article](https://support.terra.bio/hc/en-us/articles/360029748111).
 
+#### Storage cost
+
+The GVS workflow produces several intermediate files in your BigQuery dataset, and storing these files in the cloud will increase the storage cost associated with your callset. To reduce cloud storage costs, you can delete some of the intermediate files after your callset has been created successfully.
+
+If you plan to create subcohorts of your data, you can delete the tables with `_REF_DATA`, `_SAMPLES`, and `_VET_DATA` at the end of the table name in your BigQuery dataset by following the instructions in the Google Cloud article, [Managing tables](https://cloud.google.com/bigquery/docs/managing-tables#deleting_a_table).
+
+If you don’t plan to create subcohorts of your data, you can delete your BigQuery dataset by following the instructions in the Google Cloud article, [Managing datasets](https://cloud.google.com/bigquery/docs/managing-datasets#deleting_a_dataset). Note that the data will be deleted permanently from this location, but output files can still be found in the workspace bucket.
+
 ---
 
 ### Additional Resources
@@ -205,8 +213,4 @@ If you use plan to publish data analyzed using the GVS workflow, please cite the
 
 Details on citing Terra workspaces can be found here: [How to cite Terra](https://support.terra.bio/hc/en-us/articles/360035343652)
 
-Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
-
-### License
-**Copyright Broad Institute, 2020 | BSD-3**
-All code provided in this workspace is released under the WDL open source code license (BSD-3) (full license text at https://github.com/broadinstitute/warp/blob/develop/LICENSE). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file
+Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
\ No newline at end of file

From d26456262c0f5f3de619e6e75cfd7a545eedd109 Mon Sep 17 00:00:00 2001
From: kayleemathews
Date: Wed, 7 Sep 2022 11:38:20 -0400
Subject: [PATCH 2/4] Added license info

---
 scripts/variantstore/beta_docs/gvs-overview.md         | 4 ++++
 scripts/variantstore/beta_docs/gvs-quickstart.md       | 6 +++++-
 scripts/variantstore/beta_docs/run-your-own-samples.md | 6 +++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/scripts/variantstore/beta_docs/gvs-overview.md b/scripts/variantstore/beta_docs/gvs-overview.md
index 7cdf1c3ab58..de972931f57 100644
--- a/scripts/variantstore/beta_docs/gvs-overview.md
+++ b/scripts/variantstore/beta_docs/gvs-overview.md
@@ -124,6 +124,10 @@ Details on citing Terra workspaces can be found here: [How to cite Terra](https:
 
 Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
 
+### License
+**Copyright Broad Institute, 2021 | Apache**
+The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
+
 ## Feedback
 
 Please help us improve our tools by contacting the [Broad Variants team](mailto:variants@broadinstitute.org) for workflow-related suggestions or questions.
\ No newline at end of file
diff --git a/scripts/variantstore/beta_docs/gvs-quickstart.md b/scripts/variantstore/beta_docs/gvs-quickstart.md
index cfa7321b636..88458bfa293 100644
--- a/scripts/variantstore/beta_docs/gvs-quickstart.md
+++ b/scripts/variantstore/beta_docs/gvs-quickstart.md
@@ -156,4 +156,8 @@ If you use plan to publish data analyzed using the GVS workflow, please cite the
 
 Details on citing Terra workspaces can be found here: [How to cite Terra](https://support.terra.bio/hc/en-us/articles/360035343652)
 
-Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
\ No newline at end of file
+Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
+
+### License
+**Copyright Broad Institute, 2021 | Apache**
+The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file
diff --git a/scripts/variantstore/beta_docs/run-your-own-samples.md b/scripts/variantstore/beta_docs/run-your-own-samples.md
index 0d21c1e0922..b3d55daf78f 100644
--- a/scripts/variantstore/beta_docs/run-your-own-samples.md
+++ b/scripts/variantstore/beta_docs/run-your-own-samples.md
@@ -213,4 +213,8 @@ If you use plan to publish data analyzed using the GVS workflow, please cite the
 
 Details on citing Terra workspaces can be found here: [How to cite Terra](https://support.terra.bio/hc/en-us/articles/360035343652)
 
-Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
\ No newline at end of file
+Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
+
+### License
+**Copyright Broad Institute, 2021 | Apache**
+The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file

From 34226343427fb6265d2ad15e4c98aab3c6f7ae32 Mon Sep 17 00:00:00 2001
From: Kaylee Mathews <95316074+kayleemathews@users.noreply.github.com>
Date: Wed, 7 Sep 2022 12:07:30 -0400
Subject: [PATCH 3/4] Update scripts/variantstore/beta_docs/gvs-overview.md

Co-authored-by: Kylee Degatano
---
 scripts/variantstore/beta_docs/gvs-overview.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/variantstore/beta_docs/gvs-overview.md b/scripts/variantstore/beta_docs/gvs-overview.md
index de972931f57..8644a61c5a1 100644
--- a/scripts/variantstore/beta_docs/gvs-overview.md
+++ b/scripts/variantstore/beta_docs/gvs-overview.md
@@ -125,7 +125,7 @@ Details on citing Terra workspaces can be found here: [How to cite Terra](https:
 Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
 
 ### License
-**Copyright Broad Institute, 2021 | Apache**
+**Copyright Broad Institute, 2022 | Apache**
 The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
 
 ## Feedback

From 7c9eb91215aa003e68e670f5f3763032a9a7b13a Mon Sep 17 00:00:00 2001
From: kayleemathews
Date: Wed, 7 Sep 2022 12:08:48 -0400
Subject: [PATCH 4/4] Update license

---
 scripts/variantstore/beta_docs/gvs-quickstart.md       | 2 +-
 scripts/variantstore/beta_docs/run-your-own-samples.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/variantstore/beta_docs/gvs-quickstart.md b/scripts/variantstore/beta_docs/gvs-quickstart.md
index 88458bfa293..ef18ca2ff53 100644
--- a/scripts/variantstore/beta_docs/gvs-quickstart.md
+++ b/scripts/variantstore/beta_docs/gvs-quickstart.md
@@ -159,5 +159,5 @@ Details on citing Terra workspaces can be found here: [How to cite Terra](https:
 Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
 
 ### License
-**Copyright Broad Institute, 2021 | Apache**
+**Copyright Broad Institute, 2022 | Apache**
 The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file
diff --git a/scripts/variantstore/beta_docs/run-your-own-samples.md b/scripts/variantstore/beta_docs/run-your-own-samples.md
index b3d55daf78f..2365277a61f 100644
--- a/scripts/variantstore/beta_docs/run-your-own-samples.md
+++ b/scripts/variantstore/beta_docs/run-your-own-samples.md
@@ -216,5 +216,5 @@ Details on citing Terra workspaces can be found here: [How to cite Terra](https:
 Data Sciences Platform, Broad Institute (*Year, Month Day that the workspace was last modified*) gvs-prod/Genomic_Variant_Store_Beta [workspace] Retrieved *Month Day, Year that workspace was retrieved*, https://app.terra.bio/#workspaces/gvs-prod/Genomic_Variant_Store_Beta
 
 ### License
-**Copyright Broad Institute, 2021 | Apache**
+**Copyright Broad Institute, 2022 | Apache**
 The workflow script is released under the Apache License, Version 2.0 (full license text at https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT). Note however that the programs called by the scripts may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running these tools.
\ No newline at end of file
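
Note on the "Storage cost" section added in patch 1/4: the section tells readers to delete tables whose names end in `_REF_DATA`, `_SAMPLES`, or `_VET_DATA`, but leaves the selection step to the reader. A minimal sketch of that step, which is not part of the patch, might look like the following. The function names, the project and dataset IDs, and the example table names are all hypothetical. The sketch only builds `bq rm` command strings for review and does not run them or call BigQuery.

```python
"""Sketch: pick BigQuery tables whose names end with the suffixes named in the docs."""

# Suffixes quoted from the "Storage cost" section of the patched docs.
INTERMEDIATE_SUFFIXES = ("_REF_DATA", "_SAMPLES", "_VET_DATA")


def select_intermediate_tables(table_names, suffixes=INTERMEDIATE_SUFFIXES):
    """Return the table names that end with one of the given suffixes."""
    return [name for name in table_names if name.endswith(tuple(suffixes))]


def bq_rm_commands(project, dataset, table_names):
    """Build (but do not run) one `bq rm` command per table, for manual review."""
    return [f"bq rm -f -t {project}:{dataset}.{name}" for name in table_names]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    example = ["alt_allele", "demo_REF_DATA", "demo_SAMPLES", "demo_VET_DATA"]
    selected = select_intermediate_tables(example)
    for command in bq_rm_commands("my-project", "my_dataset", selected):
        print(command)
```

Printing the commands instead of executing them keeps the deletion step manual, which matches the docs' warning that deleted data is removed permanently.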