A long-read, whole-genome structural variant (SV) caller. It takes as input long read alignments (BAM), the corresponding reference genome (FASTA), a VCF with high-quality SNPs (e.g. via GATK, Deepvariant, NanoCaller, and gnomAD database VCF files with SNP population frequencies for each chromosome. Class documentation is available at https://wglab.openbioinformatics.org/ContextSV
First, install Anaconda.
Next, create a new environment. This installation has been tested with Python 3.11:
conda create -n contextsv python=3.11
conda activate contextsv
ContextSV can then be installed using the following command:
conda install -c bioconda -c wglab contextsv=1.0.0
First install Anaconda. Then follow the instructions below to install LongReadSum and its dependencies:
git clone https://github.com/WGLab/ContextSV
cd ContextSV
conda env create -f environment.yml
make
SNP population allele frequency information is used for copy number predictions in this tool (see PennCNV for specifics). We recommend downloading this data from the Genome Aggregation Database (gnomAD).
Download links for genome VCF files are located here (last updated April 3, 2024):
-
gnomAD v4.0.0 (GRCh38): https://gnomad.broadinstitute.org/downloads#4
-
gnomAD v2.1.1 (GRCh37): https://gnomad.broadinstitute.org/downloads#2
download_dir="~/data/gnomad/v4.0.0/"
chr_list=("1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "X" "Y")
for chr in "${chr_list[@]}"; do
echo "Downloading chromosome ${chr}..."
wget "https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr${chr}.vcf.bgz" -P "${download_dir}"
done
Finally, create a text file that specifies the chromosome and its corresponding gnomAD filepath. This file will be passed in as an argument:
gnomadv4_filepaths.txt
1=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr1.vcf.bgz
2=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr2.vcf.bgz
3=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr3.vcf.bgz
...
X=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrX.vcf.bgz
Y=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrY.vcf.bgz
# Activate the environment
conda activate contextsv
# Set the input reference genome
ref_file="~/data/GRCh38.fa"
# Set the input alignment file (e.g. from minimap2)
long_read_bam="~/data/HG002.GRCh38.bam"
# Set the input SNPs file (e.g. from NanoCaller)
snps_file="~/data/variant_calls.snps.vcf.gz"
# Set the SNP population frequencies filepath
pfb_file="~/data/gnomadv4_filepaths.txt"
# Set the output directory
output_dir=~/data/contextSV_output
# Specify the number of threads (system-specific)
thread_count=40
# Run SV calling (~3-4 hours for whole-genome, 40 cores)
python contextsv --threads $thread_count -o $output_dir -lr $long_read_bam --snps $snps_file --reference $ref_file --pfb $pfb_file
# The output VCF filepath is located here:
output_vcf=$output_dir/sv_calls.vcf
# Merge SVs (~3-4 hours for whole-genome, 40 cores)
python contextsv --merge $output_vcf
# The final merged VCF filepath is located here:
merged_vcf=$output_dir/sv_calls.merged.vcf
python contextsv --help
ContextSV: A tool for integrative structural variant detection.
options:
-h, --help show this help message and exit
-lr LONG_READ, --long-read LONG_READ
path to the long read alignment BAM file
-g REFERENCE, --reference REFERENCE
path to the reference genome FASTA file
-s SNPS, --snps SNPS path to the SNPs VCF file
--pfb PFB path to the file with SNP population frequency VCF filepaths (see docs for format)
-o OUTPUT, --output OUTPUT
path to the output directory
-r REGION, --region REGION
region to analyze (e.g. chr1, chr1:1000-2000). If not provided, the entire genome will be analyzed
-t THREADS, --threads THREADS
number of threads to use
--hmm HMM path to the PennCNV HMM file
--window-size WINDOW_SIZE
window size for calculating log2 ratios for CNV predictions (default: 10 kb)
-d, --debug debug mode (verbose logging)
-v, --version print the version number and exit
For release history, please visit here.
Please refer to the contextSV issue pages for posting your issues. We will also respond your questions quickly. Your comments are critical to improve our tool and will benefit other users.