Snakemake workflow for microbial variant calling, recombination detection and phylogenetic tree reconstruction.
Ensure you are cloning the repository in your scratch directory. The path should look something like this: /scratch/esnitkin_root/esnitkin1/your_uniqname/
Clone the GitHub repository onto your system.
git clone https://github.com/Snitkin-Lab-Umich/snpkit-smk.git
Ensure you have successfully cloned snpkit. Type
ls
and you should see the newly created directory snpkit-smk. Move into the newly created directory.
cd snpkit-smk
Load necessary modules from Great Lakes modules.
module load Bioinformatics
module load snakemake singularity htslib
This workflow makes use of Singularity containers available through the State Public Health Bioinformatics group. If you are working on Great Lakes (the UMich cluster), you can load the snakemake and singularity modules as shown above. However, if you are running it on your local machine or another computing platform, ensure you have snakemake and singularity installed.
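A quick way to confirm both tools are available after loading the modules (or after installing them locally):
snakemake --version
singularity --version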
As an input, the snakemake file takes a config file where you can set the path to samples.tsv, the path to your raw sequencing reads, the path to the adapter fasta file, etc. Instructions on how to modify config/config.yaml are found in config.yaml itself.
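For illustration, the relevant entries in config/config.yaml follow the general shape sketched below. The key names here are assumptions, not the actual schema; defer to the keys and comments in the shipped config.yaml:
# hypothetical sketch only; use the real key names from config/config.yaml
samples: config/samples.tsv               # sample sheet created in the next step
reads_dir: /path/to/your/raw_reads        # directory containing *_R1/_R2.fastq.gz
adapter_file: config/adapters.fasta       # adapter fasta used for read trimming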
Add samples to config/samples.tsv following the explanation provided below. samples.tsv should be a comma-separated file consisting of one column: sample_id.
sample_id is the prefix that should be extracted from your FASTQ reads. For example, if your raw FASTQ files directory contains a file called Rush_KPC_110_R1.fastq.gz, your sample_id would be Rush_KPC_110.
You can create the samples.tsv file using the following for loop. Replace path_to_your_raw_reads below with the actual path to your raw sequencing reads.
echo "sample_id" > config/samples.tsv
for read1 in path_to_your_raw_reads/*_R1.fastq.gz; do
    sample_id=$(basename "$read1" | sed 's/_R1.fastq.gz//g')
    echo "$sample_id"
done >> config/samples.tsv
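For example, if the reads directory contained Rush_KPC_110_R1.fastq.gz and Rush_KPC_111_R1.fastq.gz (the second name is illustrative), the resulting config/samples.tsv would be:
sample_id
Rush_KPC_110
Rush_KPC_111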
Before you run snpkit, it is essential to establish a snpEff database tailored to your reference genome.
First, create a directory named after your reference genome within the designated reference directory. For instance, if your reference genome is KPNIH1, create a directory named KPNIH1 within reference.
mkdir reference/KPNIH1
Upload the corresponding GenBank (.gbk) and GFF (.gff) files into the newly created directory. You can download the files from NCBI; alternatively, check whether your reference genome of interest is already available under /nfs/esnitkin/bin_group/variant_calling_bin/reference and, if it is there, move it to the reference folder.
After uploading the files, it's crucial to rename them to ensure they are compatible with snpEff.
Rename the genbank file to genes.gbk. For example, if your genbank file is named KPNIH1.gbk, rename it to genes.gbk.
mv KPNIH1.gbk genes.gbk
Similarly, rename the GFF file to genes.gff. For instance, if your GFF file is named KPNIH1.gff, rename it to genes.gff.
mv KPNIH1.gff genes.gff
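As a quick sanity check, list the directory to confirm both renamed files are in place:
ls reference/KPNIH1
The output should show genes.gbk and genes.gff.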
Next, generate the files needed to use your reference fasta with BWA (the aligner used in this pipeline) and GATK.
Indexing a genome can be compared to indexing a book. To find the page where a specific word appears or where a chapter begins, it's far more efficient to consult a pre-built index than to flip through every page until you locate it. The same principle applies to alignments. Indices help the aligner quickly pinpoint the likely location of a query sequence within the genome, thereby saving both time and memory.
Read about why indexing is important here and why we need all the following files for variant calling.
Load bwa module
module load bwa
Generate index files with bwa index.
bwa index your_reference_genome.fasta
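bwa index writes five companion files next to the fasta, with the extensions .amb, .ann, .bwt, .pac, and .sa. You can confirm they were all created with:
ls your_reference_genome.fasta.*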
Note that bwa index and samtools faidx build different indices: bwa index produces the alignment index BWA uses to place reads, while samtools faidx produces a .fai index for fast random access to the fasta sequence.
Load samtools module
module load samtools
Generate the .fai index file.
samtools faidx your_reference_genome.fasta
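Each line of the resulting .fai file describes one contig with five tab-separated columns: contig name, length, byte offset of its first base, bases per line, and bytes per line. A quick look:
head your_reference_genome.fasta.fai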
The .dict file lists the contig names and sizes based on the reference genome.
Load GATK module
module load gatk
Generate the .dict file.
gatk CreateSequenceDictionary -R your_reference_genome.fasta
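The resulting .dict file (your_reference_genome.dict) is a SAM-style header: an @HD line followed by one @SQ line per contig giving its name (SN) and length (LN). You can inspect it with:
head your_reference_genome.dict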
Preview the steps in snpkit by performing a dryrun of the pipeline.
snakemake -s snpkit.smk --dryrun -p
Run snpkit on the Great Lakes HPC.
snakemake -s snpkit.smk -p --use-conda --use-singularity -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes} -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem}" --conda-frontend conda --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 1000
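The head snakemake process must stay alive for the duration of the run so it can keep submitting cluster jobs. One common approach (a suggestion, not a pipeline requirement) is to launch it inside a screen or tmux session so it survives SSH disconnects:
screen -S snpkit
# run the snakemake command above inside this session;
# detach with Ctrl-a d and reattach later with: screen -r snpkit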
Run the second part of the pipeline only after the first run has completed successfully.
Start an interactive session in your current directory, i.e. snpkit-smk.
srun --account=esnitkin1 --nodes=1 --ntasks-per-node=1 --mem-per-cpu=5GB --cpus-per-task=1 --time=12:00:00 --pty /bin/bash
Activate the snpkit conda environment.
conda activate /nfs/turbo/umms-esnitkin/conda/snpkit/
Install the package required for parsing VCF files.
pip install cyvcf2
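To verify the install (assuming the installed cyvcf2 release exposes __version__, which current versions do):
python3 -c "import cyvcf2; print(cyvcf2.__version__)"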
The final step of the pipeline is to generate SNP/indel matrices. Follow the steps below:
- Replace /path/to/snpkit.py with the path to your snpkit.py script. It should look something like this: /gpfs/accounts/esnitkin_root/esnitkin1/your_uniqname/snpkit-smk/python_scripts/snpkit_parser/snpkit.py, where your_uniqname is your uniqname.
- Edit config in snpkit-smk/python_scripts/snpkit_parser/config. Scroll down until you see [index] and note how an index is defined. Follow the same format as the other reference genomes specified in that section and edit the config with the reference genome of your choice.
- Replace path/to/your/reads/directory with the path to your sequencing reads.
- Replace your_results_output_directory with the directory where you would like the results of this run to be written. Ideally, this should be the same folder as the output from the first step, i.e. /snpkit-smk/results/date_of_your_run_snpkit-smk/.
python3 /path/to/snpkit.py -type PE -index your_index -readsdir path/to/your/reads/directory -steps parse -analysis Test_Snpkit_Parser -outdir your_results_output_directory -cluster cluster -scheduler SLURM -gubbins yes -mask
An example of how the above command might look:
python3 /gpfs/accounts/esnitkin_root/esnitkin1/dhatrib/snpkit-smk/python_scripts/snpkit_parser/snpkit.py -type PE -index KPNIH1 -readsdir /nfs/turbo/umms-esnitkin/Github/test_data/fastq -steps parse -analysis Test_Snpkit_Parser -outdir /gpfs/accounts/esnitkin_root/esnitkin1/dhatrib/snpkit-smk/results/2024-05-29_snpkit-smk/ -cluster cluster -scheduler SLURM -gubbins yes -mask