
Run vcf_annotation_pipeline on a high performance cluster

Table of contents

1. Fork the pipeline repo to a personal or lab account
2. Take the pipeline to the data on the HPC
3. Setup files and directories
4. Get prerequisite software/hardware
5. Create a local copy of the GATK resource bundle (either b37 or hg38)
6. Create a local copy of other databases (either GRCh37 or GRCh38)
7. Modify the configuration file
8. Configure to run on an HPC
9. Modify the run scripts
10. Create and activate a conda environment with python and snakemake installed
11. Run the pipeline
12. Evaluate the pipeline run
13. Commit and push to your forked version of the github repo
14. Repeat step 13 each time you re-run the analysis with different parameters
15. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)

1. Fork the pipeline repo to a personal or lab account

See here for help forking a repository

2. Take the pipeline to the data on the HPC

Clone the forked vcf_annotation_pipeline repo into the same directory as your vcf files to be processed.

See here for help cloning a repository

3. Setup files and directories

Note: for single sample analysis, you'll need the mapped BAM files along with your VCF files. BAM files are not needed for cohort analyses.

Required folder structure and file naming convention:

.
|___human_genomics_pipeline/results/called/
|                                      |___sample1_raw_snps_indels.vcf
|                                      |___sample1_raw_snps_indels.vcf.idx
|                                      |___sample2_raw_snps_indels.vcf
|                                      |___sample2_raw_snps_indels.vcf.idx
|                                      |___ ...
|
|___human_genomics_pipeline/results/mapped/
|                                      |___sample1_recalibrated.bam
|                                      |___sample1_recalibrated.bai
|                                      |___sample2_recalibrated.bam
|                                      |___sample2_recalibrated.bai
|                                      |___ ...
|
|___vcf_annotation_pipeline/

If you're analysing cohorts of samples, you will need a directory with a pedigree file for each cohort/family, using the following folder structure and file naming convention:

.
|___human_genomics_pipeline/results/called/
|                                      |___sample1_raw_snps_indels.vcf
|                                      |___sample1_raw_snps_indels.vcf.idx
|                                      |___sample2_raw_snps_indels.vcf
|                                      |___sample2_raw_snps_indels.vcf.idx
|                                      |___sample3_raw_snps_indels.vcf
|                                      |___sample3_raw_snps_indels.vcf.idx
|                                      |___sample4_raw_snps_indels.vcf
|                                      |___sample4_raw_snps_indels.vcf.idx
|                                      |___sample5_raw_snps_indels.vcf
|                                      |___sample5_raw_snps_indels.vcf.idx
|                                      |___sample6_raw_snps_indels.vcf
|                                      |___sample6_raw_snps_indels.vcf.idx
|                                      |___ ...
|
|___pedigrees/
|     |___proband1_pedigree.ped
|     |___proband2_pedigree.ped
|     |___ ...
|
|___vcf_annotation_pipeline/

Requirements:

  • Currently, the filename of each pedigree file needs to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement)
  • Singletons and cohorts need to be run in separate pipeline runs
  • It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file; see the example pedigree file below)
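
For illustration, a minimal tab-delimited pedigree file for a trio (here proband1_pedigree.ped, with hypothetical sample names) contains the standard six columns: family ID, individual ID, father, mother, sex (1 = male, 2 = female) and phenotype (1 = unaffected, 2 = affected). The affected proband carries the 2 in the sixth column:

family1	proband1	father1	mother1	1	2
family1	father1	0	0	1	1
family1	mother1	0	0	2	1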

Test data

The provided test dataset can be used. Set up the test dataset before running the pipeline on this data, choosing either a single sample analysis or a cohort analysis with the -a flag. For example:

cd ./vcf_annotation_pipeline
bash ./test/setup_test.sh -a cohort
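
For a single sample test run, the same script would be called with the other value for the -a flag (the exact value accepted here is an assumption; check ./test/setup_test.sh if it differs):

cd ./vcf_annotation_pipeline
bash ./test/setup_test.sh -a single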

4. Get prerequisite software/hardware

For GPU accelerated runs, you'll need NVIDIA GPUs and NVIDIA CLARA PARABRICKS and its dependencies. Talk to your system administrator to see if the HPC has this hardware and software available.

Other software is required to get set up and run the pipeline. Most of this software is commonly pre-installed on HPCs, likely available as modules that can be loaded. Talk to your system administrator if you need help with this.
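
For example, on clusters using an environment module system you can check what is already installed and load it (the module names below are hypothetical and versions will differ between HPCs):

module avail            # list the software modules available on the cluster
module load Singularity # hypothetical module name; versions are cluster-specific
module load Miniconda3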

5. Create a local copy of the GATK resource bundle (either b37 or hg38)

b37

Download from Google Cloud Bucket

gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/

hg38

Download from Google Cloud Bucket

gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
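
You can also list the bucket contents and check the total download size first, for example:

gsutil ls gs://genomics-public-data/resources/broad/hg38
gsutil du -s -h gs://genomics-public-data/resources/broad/hg38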

6. Create a local copy of other databases (either GRCh37 or GRCh38)

GRCh37

Download the Ensembl-VEP database using a conda version of Ensembl-VEP

conda create -n download_data_env python=3.7
conda activate download_data_env
conda install -c bioconda ensembl-vep=99.2
vep_install -a cf -s homo_sapiens -y GRCh37 -c /output/file/path/GRCh37 --CONVERT
conda deactivate

Download the CADD database and its associated index file.

wget https://krishna.gs.washington.edu/download/CADD/v1.4/GRCh37/whole_genome_SNVs.tsv.gz
wget https://krishna.gs.washington.edu/download/CADD/v1.4/GRCh37/whole_genome_SNVs.tsv.gz.tbi

Create a custom dbNSFP database build by following this documentation

GRCh38

Download Ensembl-VEP database using a conda install of Ensembl-VEP

mamba create -n download_data_env python=3.7
conda activate download_data_env
mamba install -c bioconda ensembl-vep=99.2
vep_install -a cf -s homo_sapiens -y GRCh38 -c /output/file/path/GRCh38 --CONVERT
conda deactivate

Download the CADD database and its associated index file.

wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz
wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz.tbi
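
Keep the database and its index file together in one directory. For example, to download both straight into a directory matching the CADD path used later in the configuration (the /scratch/publicData location is just the example path used throughout this document):

mkdir -p /scratch/publicData/CADD/GRCh38/
wget -P /scratch/publicData/CADD/GRCh38/ https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz
wget -P /scratch/publicData/CADD/GRCh38/ https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SNVs.tsv.gz.tbi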

Create a custom dbNSFP database build by following this documentation

7. Modify the configuration file

Edit 'config.yaml' found within the config directory.

Overall workflow

Specify the build of the reference genome (either 'GRCh37' (for b37) or 'GRCh38' (for hg38)) to use

BUILD: "GRCh37"

Specify whether the data is to be analysed on its own ('Single') or as a part of a cohort of samples ('Cohort'). For example:

DATA: "Single"

Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires NVIDIA GPUs and NVIDIA CLARA PARABRICKS)

GPU_ACCELERATED: "No"

Specify whether the data should be prepared to be ingested into scout (either 'Yes' or 'No'). This will output an additional vcf file that has undergone additional filtering steps. For example:

PREPARE_FOR_SCOUT: "Yes"

Set the file path to the reference human genome file (b37 or hg38). For example:

REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta"

Set the file path to your dbSNP database file (b37 or hg38). For example:

dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf"

Set the path to a temporary file directory. Make sure this is a location with a fair amount of storage space for large intermediate analysis files. For example:

TEMPDIR: "/scratch/tmp/"

If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see here for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. For example:

If NOT analysing WES data, leave these fields blank

WES:
  # File path to the exome capture regions over which to operate
  INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed"
  # Padding (in bp) to add to each region
  PADDING: "100"

Pipeline resources

These settings allow you to configure the resources per rule/sample

Set the number of threads to use per sample/rule for multithreaded rules (rule gatk_CNNScoreVariants, rule vep and genmod models). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the --set-threads flag in the runscript (see step 9).

THREADS: 8
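
If your version of snakemake supports the --set-threads flag, a per-rule override can be added to the snakemake command in the run scripts (the rule names below are the multithreaded rules mentioned above; check the workflow for their exact names), for example:

--set-threads gatk_CNNScoreVariants=8 vep=8 \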

Set the maximum memory usage per rule/sample (e.g. '40g' for 40 gigabytes; this should suffice for exomes)

MAXMEMORY: "40g"

Set the maximum number of GPUs to be used per rule/sample for GPU-accelerated runs (e.g. 1 for 1 GPU)

GPU: 1

It is a good idea to consider the number of samples that you are processing. For example, if you set THREADS: "8" and set the maximum number of cores to be used by the pipeline in the run script to -j/--cores 32 (see step 9), a maximum of 4 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set THREADS: "1" and -j/--cores 32, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting MAXMEMORY + --resources mem_mb and GPU + --resources gpu.

Variant filtering

Set the tranche filtering level for SNPs and indels (used by gatk FilterVariantTranches and gatk VariantRecalibrator for single samples and cohorts respectively). For example:

TRANCHE:
  SNPS: "99.95"
  INDELS: "99.4"

Single samples

If analysing single sample data, define the resources to be used to filter variants with gatk FilterVariantTranches. For example:

If NOT analysing single sample data, leave this section blank

SINGLE:

  # Provide a list of resources
  RESOURCES:
    - /scratch/publicData/b37/hapmap_3.3.b37.vcf
    - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

Cohort samples

If analysing cohort data instead, define the resources to be used to filter variants with gatk VariantRecalibrator. You'll need to add some additional information for each resource. For example:

If NOT analysing cohort data, leave this section blank

COHORT:

  # For indels...
  INDELS:

    # ...provide a list of resources
    RESOURCES:
      - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
      - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf
      - /scratch/publicData/b37/dbsnp_138.b37.vcf

    # ...provide associated machine learning parameters
    PARAMS:
      - mills,known=false,training=true,truth=true,prior=12.0
      - 1000G,known=false,training=true,truth=false,prior=10.0
      - dbsnp,known=true,training=false,truth=false,prior=2.0

  # For SNPs...
  SNPS:

    # ...provide a list of resources
    RESOURCES:
    - /scratch/publicData/b37/hapmap_3.3.b37.vcf
    - /scratch/publicData/b37/1000G_omni2.5.b37.vcf
    - /scratch/publicData/b37/1000G_phase1.snps.high_confidence.b37.vcf
    - /scratch/publicData/b37/dbsnp_138.b37.vcf

    # ...provide associated machine learning parameters
    PARAMS:
    - hapmap,known=false,training=true,truth=true,prior=15.0
    - omni,known=false,training=true,truth=false,prior=12.0
    - 1000G,known=false,training=true,truth=false,prior=10.0
    - dbsnp,known=true,training=false,truth=false,prior=2.0

VCF annotation

Set the file paths to the other VCF annotation databases (GRCh37 or GRCh38). For example:

VEP: "/scratch/publicData/vep/GRCh37/"
dbNSFP: "/scratch/publicData/dbNSFP/GRCh37/dbNSFPv4.0a.hg19.custombuild.gz"
CADD: "/scratch/publicData/CADD/GRCh37/whole_genome_SNVs.tsv.gz"

8. Configure to run on an HPC

This will deploy the non-GPU-accelerated rules to slurm and deploy the GPU-accelerated rules locally (pbrun_cnnscorevariants). Therefore, if running the pipeline GPU accelerated, the pipeline should be deployed from the machine with the GPUs.

In theory, this cluster configuration should be adaptable to other job scheduler systems. In fact, snakemake is designed to allow pipelines/workflows such as this to be portable to different job schedulers by keeping any cluster-specific configuration outside of the workflow/pipeline. Here I will provide a minimal example for deploying this pipeline to slurm.

Configure account: and partition: in the default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation here and here). For example:

{
    "__default__" :
    {
        "account" : "lkemp",
        "partition" : "prod",
        "output" : "logs/slurm-%j_{rule}_{wildcards.sample}.out"
    }
}

There are a plethora of additional slurm parameters that can be configured (and can be configured per rule). If you set additional slurm parameters, remember to pass them to the --cluster flag in the runscripts. See here for a good working example of deploying a snakemake workflow to NeSI.
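
For example, if you added a hypothetical "time" key to cluster.json (either under __default__ or under an individual rule name), the sbatch command in the run scripts would need to reference it as well:

--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
-t {cluster.time} \
-o {cluster.output}"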

9. Modify the run scripts

Set the singularity bind location to a directory that contains your pipeline working directory with the --singularity-args '-B' flag. Set the maximum number of cores to be used with the --cores flag and the maximum amount of memory to be used (in megabytes) with the --resources mem_mb= flag. If running GPU accelerated, also set the maximum number of GPUs to be used with the --resources gpu= flag. For example:

Dry run (dryrun_hpc.sh):

snakemake \
--dryrun \
--cores 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
--conda-frontend mamba \
--latency-wait 120 \
--use-singularity \
--singularity-args '-B /scratch/' \
--configfile ../config/config.yaml \
--cluster-config ../config/cluster.json \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
-o {cluster.output}"

Full run (run_hpc.sh):

snakemake \
--cores 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
--conda-frontend mamba \
--latency-wait 120 \
--use-singularity \
--singularity-args '-B /scratch/' \
--configfile ../config/config.yaml \
--cluster-config ../config/cluster.json \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
-o {cluster.output}"

See the snakemake documentation for additional run parameters.

10. Create and activate a conda environment with python and snakemake installed

cd ./workflow/
mamba env create -f pipeline_run_env.yml
conda activate pipeline_run_env
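
Before moving on, it can be worth confirming that the environment activated correctly and that snakemake is on your PATH:

snakemake --version
python --version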

11. Run the pipeline

First carry out a dry run

bash dryrun_hpc.sh

If there are no issues, start a full run

bash run_hpc.sh

12. Evaluate the pipeline run

Generate an interactive html report

bash report.sh
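
If you need to adapt the report generation, this kind of interactive html report is produced by snakemake's built-in --report functionality; a rough sketch of such a script (the output path and flags are assumptions; check report.sh for the exact command used) is:

snakemake \
--report report.html \
--configfile ../config/config.yaml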

13. Commit and push to your forked version of the github repo

To maintain reproducibility, commit and push the following (see the example commands below):

  • All configuration files
  • All run scripts
  • The final report
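
For example (the paths and commit message are illustrative; adjust them to wherever your configuration files, run scripts and report actually live):

git add config/ workflow/ report.html
git commit -m "Re-run vcf_annotation_pipeline with updated filtering parameters"
git push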

14. Repeat step 13 each time you re-run the analysis with different parameters

15. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)

See the README for info on how to contribute back to the pipeline!