The CH Toolkit is a collection of utilities and tools created by Irenaeus Chan (chani@wustl.edu) and Indraniel for handling and maintaining variants called by the ArCH Pipeline.
Variant calling is performed using two separate variant callers: GATK's Mutect2 and AstraZeneca's VarDict.
To improve on cost and runtime performance, only genomic regions where Mutect2 successfully called and passed variants are used as input BED windows to VarDict.
Details behind this WDL workflow can be found here under the ArCH WGS Variant Calling Pipeline.
The subsequent variants called by both callers are then sanitized, normalized, and filtered for common Germline variants before being used as inputs for this toolkit.
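The BED-window step described above can be illustrated with a minimal sketch. It is not the workflow's actual implementation: the function name `passed_variant_windows` and the 100 bp `padding` are hypothetical, and the real pipeline may build its windows differently.

```python
def passed_variant_windows(vcf_lines, padding=100):
    """Return merged BED intervals (chrom, start, end) around PASS records.

    `padding` is a hypothetical value chosen for illustration only.
    """
    intervals = []
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, filt = fields[0], int(fields[1]), fields[3], fields[6]
        if filt != "PASS":
            continue
        # VCF POS is 1-based; BED intervals are 0-based and half-open.
        start = max(0, pos - 1 - padding)
        end = pos - 1 + len(ref) + padding
        intervals.append((chrom, start, end))

    # Merge overlapping or touching windows on the same chromosome.
    # Chromosomes are sorted lexicographically, which is enough for merging.
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tT\t.\tPASS\t.",
    "chr1\t1050\t.\tAC\tA\t.\tPASS\t.",
    "chr1\t5000\t.\tG\tC\t.\tweak_evidence\t.",
]
for chrom, start, end in passed_variant_windows(example):
    print(f"{chrom}\t{start}\t{end}")
```

Restricting the second caller to these windows is what reduces its runtime: regions where the first caller passed no variants are never visited.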
To consolidate as much information as possible prior to filtering for CH variants, several annotation steps must be performed, including:
- VEP - The effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
- Panel of Normal (PoN) Pileup - Estimation of the statistical noise within a given region or location from sequencing
- AnnotatePD (RScript) - The classification of CH variants through prior large-scale sequencing projects and population data including OncoKB, COSMIC, gnomAD, etc.
While this tool is meant to be run as stand-alone infrastructure, a WDL workflow has been written for automation. Its inputs are simply the VCF files produced by Mutect2 and VarDict, sample phenotype information, and the local database filepath for the storage and maintenance of the variant database.
Please use pip to install ch-toolkit.
If you want to try the latest bleeding edge, run the following command:
pip install git+https://github.com/kbolton-lab/ch-toolkit.git@main
For a stable Docker image of ch-toolkit, please visit Docker Hub.
    Usage: ch-toolkit [OPTIONS] COMMAND [ARGS]...

      A collection of db related tools for handling sample data.

    Options:
      --version   Show the version and exit.
      -h, --help  Show this message and exit.

    Commands:
      calculate-fishers-test  Updates the variants inside Mutect or Vardict tables
                              with p-value from Fisher's Exact Test
      chromosome-to-caller    Combines all chromosome databases into a single
                              <Mutect|Vardict> database
      database-to-chromosome  Splits <Mutect|Vardict|Variant|Annotation|Pileup>
                              database into individual chromosomes
      dump-annotations        dumps all variant annotations inside duckdb into a
                              CSV file
      dump-ch                 Outputs CH Variants from Database
      dump-variants           dumps all variants inside duckdb into a VCF file
      dump-variants-pileup    dumps all variants inside duckdb into a VCF file
                              that needs pileup
      import-annotate-pd      annotates variants with their pathogenicity
      import-pon-pileup       updates variants inside duckdb with PoN pileup
                              information
      import-sample-variants  Register the variants for a VCF file into a variant
                              database
      import-sample-vcf       import a vcf file into sample variant database
      import-samples          Loads a CSV containing samples into samples database
      import-vep              updates variants inside duckdb with VEP information
      merge-batch-variants    Combines all sample variant databases into a single
                              database
      merge-batch-vcf         Combines all sample vcfs databases into a single
                              database
      reduce-db               Reduces the size of the mutect_db and vardict_db
                              databases to only CH possible variants
import-samples | |
---|---|
Goal: | Create the samples.db database which will contain the information for the samples |
Input: | A CSV file containing the information relevant to the samples being processed |
Output: | samples.db database |
ch-toolkit import-samples \
--samples washu-cad-1.csv \
--sdb database/samples.db \
--batch-number 1
import-sample-variants | |
---|---|
Goal: | Create individual sample.variant.db database for each sample that will contain the variant specific information |
Input: | Sample VCF file containing the information about the variants |
Output: | sample.variant.db database |
ch-toolkit import-sample-variants \
--input-vcf mutect.sample_name.vcf.gz \
--vdb variant_databases/mutect.sample_name.db \
--batch-number 1
ch-toolkit import-sample-variants \
--input-vcf vardict.sample_name.vcf.gz \
--vdb variant_databases/vardict.sample_name.db \
--batch-number 1
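Since `import-sample-variants` runs once per sample and per caller, the two invocations above can be generated for a whole batch. The sketch below only builds the argument lists, using the flags and file-naming pattern shown above; the helper name `import_variant_commands` and the sample names are hypothetical.

```python
import shlex


def import_variant_commands(sample_names, batch_number):
    """Build one import-sample-variants command per sample and caller."""
    commands = []
    for sample in sample_names:
        for caller in ("mutect", "vardict"):
            commands.append([
                "ch-toolkit", "import-sample-variants",
                "--input-vcf", f"{caller}.{sample}.vcf.gz",
                "--vdb", f"variant_databases/{caller}.{sample}.db",
                "--batch-number", str(batch_number),
            ])
    return commands


# Print the commands for review instead of executing them.
for cmd in import_variant_commands(["sample_A", "sample_B"], batch_number=1):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.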
merge-batch-variants | |
---|---|
Goal: | Create a single variants.db that will contain ALL unique variants found within the individual sample.variant.db databases |
Input: | Path where the sample.variant.db database files are located |
Output: | variants.db database |
ch-toolkit merge-batch-variants \
--db-path variant_databases/ \
--vdb database/variants.db \
--batch-number 1
dump-variants | |
---|---|
Goal: | Convert variants within variants.db into a VCF file for use in annotation steps such as VEP, PoN Pileup, etc. |
Input: | variants.db database containing all unique variants that need to be converted into a VCF file |
Output: | A VCF file containing all unique variants from variants.db |
ch-toolkit dump-variants \
--vdb database/variants.db \
--header-type simple \
--batch-number 1
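To make the conversion concrete, here is a minimal sketch of turning unique variants into a sites-only VCF. It does not reproduce the output of `dump-variants`: the function name `variants_to_vcf` is hypothetical, and what `--header-type simple` actually writes is not documented here, so the two-line header below is an assumption.

```python
def variants_to_vcf(variants):
    """Render (chrom, pos, ref, alt) tuples as a minimal sites-only VCF string."""
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Deduplicate, then sort by chromosome name (lexicographic) and position.
    for chrom, pos, ref, alt in sorted(set(variants)):
        # Missing ID, QUAL, FILTER and INFO values are written as ".".
        lines.append(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.")
    return "\n".join(lines) + "\n"


print(variants_to_vcf([
    ("chr1", 100, "A", "T"),
    ("chr1", 100, "A", "T"),  # duplicate, written once
    ("chr1", 50, "G", "C"),
]), end="")
```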
import-sample-vcf | |
---|---|
Goal: | Create individual sample.caller.db databases for each sample that will contain the caller specific information |
Input: | Sample VCF file containing the information from the specific caller |
Output: | sample.caller.db database |
ch-toolkit import-sample-vcf \
--caller mutect \
--input-vcf mutect.sample_name.vcf.gz \
--cdb mutect_databases/mutect.sample_name.db \
--batch-number 1
ch-toolkit import-sample-vcf \
--caller vardict \
--input-vcf vardict.sample_name.vcf.gz \
--cdb vardict_databases/vardict.sample_name.db \
--batch-number 1
Create a Single Centralized Mutect and Vardict Caller Database from all Individual Sample Caller Databases
merge-batch-vcf | |
---|---|
Goal: | Create a single mutect.db and vardict.db database that will contain ALL caller specific information found within the individual sample.caller.db databases |
Input: | Path where the sample.caller.db database files are located |
Output: | mutect.db database or vardict.db database |
ch-toolkit merge-batch-vcf \
--db-path mutect_databases/ \
--cdb database/mutect.db \
--vdb database/variants.db \
--sdb database/samples.db \
--caller mutect \
--batch-number 1
ch-toolkit merge-batch-vcf \
--db-path vardict_databases/ \
--cdb database/vardict.db \
--vdb database/variants.db \
--sdb database/samples.db \
--caller vardict \
--batch-number 1
import-vep | |
---|---|
Goal: | Import annotation information produced from annotating variants from the dump-variants step using VEP |
Input: | The resulting TSV file produced from running VEP in --tab mode |
Output: | annotations.db database |
ch-toolkit import-vep \
--adb database/annotations.db \
--vdb database/variants.db \
--vep VEP_annotated.tsv \
--batch-number 1
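For orientation, VEP's `--tab` output begins with `##` metadata lines, followed by a single header line starting with `#Uploaded_variation`, and then one tab-separated row per annotation. The sketch below reads that layout into dictionaries. It is an illustration of the file format only, not how `import-vep` parses it, and the function name `read_vep_tab` is hypothetical.

```python
import io


def read_vep_tab(handle):
    """Parse VEP --tab output into a list of {column: value} dictionaries."""
    header = None
    rows = []
    for line in handle:
        # Skip blank lines and "##" metadata lines.
        if not line.strip() or line.startswith("##"):
            continue
        if header is None:
            # The header line starts with a single "#".
            header = line.lstrip("#").rstrip("\n").split("\t")
            continue
        rows.append(dict(zip(header, line.rstrip("\n").split("\t"))))
    return rows


example = io.StringIO(
    "## ENSEMBL VARIANT EFFECT PREDICTOR\n"
    "#Uploaded_variation\tLocation\tAllele\tGene\tConsequence\n"
    "var_1\tchr2:25234373\tT\tGENE_A\tmissense_variant\n"
)
for row in read_vep_tab(example):
    print(row["Uploaded_variation"], row["Consequence"])
```

The identifiers and values in `example` are made up for the demonstration.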
dump-annotations | |
---|---|
Goal: | Export variants that are putative drivers to be annotated by the custom AnnotatePD RScript |
Input: | annotations.db database containing information about the variants that need to be converted into a CSV file |
Output: | A CSV file containing all unique variants from annotations.db that will be annotated using AnnotatePD |
ch-toolkit dump-annotations \
--adb database/annotations.db \
--batch-number 1
import-annotate-pd | |
---|---|
Goal: | Import annotation information produced from annotating variants from the dump-annotations step using AnnotatePD |
Input: | The resulting CSV file produced from running AnnotatePD |
Output: | annotations.db database |
ch-toolkit import-annotate-pd \
--adb database/annotations.db \
--pd annotatePD_results.csv \
--batch-number 1
After Performing the PoN Pileup Workflow, Import Pileup Information
import-pon-pileup | |
---|---|
Goal: | Import pileup information produced from running the PoN Pileup Workflow on the variants from the dump-variants step |
Input: | The resulting PoN Pileup VCF file produced from the PoN Pileup Workflow |
Output: | pileup.db database |
ch-toolkit import-pon-pileup \
--vdb database/variants.db \
--pdb database/pileup.db \
--pon-pileup pon_pileup.vcf.gz \
--batch-number 1
calculate-fishers-test | |
---|---|
Goal: | Perform a Fisher's Exact Test comparing the proportion of variant alleles to reference alleles to the proportion of the same variant alleles to reference alleles found within the PoN Samples |
Input: | The pileup information within the pileup.db database along with variant information found within the mutect.db or vardict.db databases |
Output: | mutect.db and vardict.db databases updated with a p-value indicating the significance of the detected variant signal relative to the expected noise at the same location |
ch-toolkit calculate-fishers-test \
--pdb database/pileup.db \
--cdb database/mutect.db \
--caller mutect \
--batch-number 1
ch-toolkit calculate-fishers-test \
--pdb database/pileup.db \
--cdb database/vardict.db \
--caller vardict \
--batch-number 1
dump-ch | |
---|---|
Goal: | Process all information currently stored in the databases and detect CH variants with pathogenic support |
Input: | mutect.db, vardict.db, and annotations.db databases |
Output: | A CSV file containing all of the variants and all relevant information pertaining to said variants predicted to be pathogenic |
ch-toolkit dump-ch \
--mcdb database/mutect.db \
--vcdb database/vardict.db \
--adb database/annotations.db
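The commands up to this point depend on each other's outputs, so their order matters. The sketch below records those dependencies and derives a valid run order with the standard library's `graphlib`. The dependency map is inferred from the Input and Output rows above and is an assumption, not something the toolkit enforces. External steps (running VEP, AnnotatePD, and the PoN Pileup Workflow) are left out.

```python
from graphlib import TopologicalSorter

# Each command maps to the commands whose outputs it needs (inferred, see above).
STEP_DEPENDENCIES = {
    "import-samples": [],
    "import-sample-variants": [],
    "import-sample-vcf": [],
    "merge-batch-variants": ["import-sample-variants"],
    "dump-variants": ["merge-batch-variants"],
    "merge-batch-vcf": ["import-sample-vcf", "merge-batch-variants", "import-samples"],
    "import-vep": ["dump-variants"],
    "dump-annotations": ["import-vep"],
    "import-annotate-pd": ["dump-annotations"],
    "import-pon-pileup": ["dump-variants"],
    "calculate-fishers-test": ["import-pon-pileup", "merge-batch-vcf"],
    "dump-ch": ["calculate-fishers-test", "import-annotate-pd"],
}


def run_order(dependencies=STEP_DEPENDENCIES):
    """Return the commands in an order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


for position, step in enumerate(run_order(), start=1):
    print(position, step)
```

Several orders are valid; for example, the VEP and PoN pileup branches are independent of each other and can run in parallel.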
database-to-chromosome | |
---|---|
Goal: | Divide the database into individual chromosome components. Useful for when the database is too large |
Input: | mutect.db, vardict.db, annotations.db, variants.db, or pileup.db database |
Output: | The original database used as input, now split by chromosome (1-22, X, and Y) |
ch-toolkit database-to-chromosome \
--db database/variants.db \
--which_db variants \
--batch-number 1 \
--threads 4
reduce-db | |
---|---|
Goal: | Reduce the caller databases to variants in exonic regions. Most CH mutations occur in exons, so intronic variants may not be needed during processing |
Input: | mutect.db or vardict.db and annotations.db databases |
Output: | The original database used as input but only containing variants situated within the exonic regions of the genome |
ch-toolkit reduce-db \
--cdb database/mutect.db \
--caller mutect \
--adb database/annotations.db \
--threads 4
ch-toolkit reduce-db \
--cdb database/vardict.db \
--caller vardict \
--adb database/annotations.db \
--threads 4