Prok-SNPTree is a pipeline that generates a phylogenetic tree from core substitutions (SNPs) between a reference genome and whole-genome Illumina sequencing reads.
Prok-SNPTree was originally designed and optimized for small prokaryotic genomes, but can perform reasonably well with larger genomes up to 100 Mb.
Prok-SNPTree comes with a helper sbatch file for providing the required input information. The file has preset information for running through the SLURM scheduler, but the pipeline should also be able to run without SLURM.
Place the helper batch file and the script in the working directory, then make the script executable:
chmod +x prok-snptree.sh
The following tools must be installed, and their executables must be available in the environment (on the PATH).
Function | Tools/scripts |
---|---|
Parallelized sample processing | GNU Parallel |
Quality check for raw reads | FastQC, MultiQC |
Adapter identification and trimming | cutadapt, trim_galore |
Genome indexing and read alignment | bwa |
Binary conversion and sorting | Samtools |
Variant calling and selection | GATK |
Phylogenetic tree | RAxML |
Pairwise SNP count | FastaToSNPCount.sh |
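Before a first run, it can help to confirm that each dependency can be found on the PATH. The following is a minimal sketch and is not part of Prok-SNPTree. The executable names are assumptions (for example, RAxML may be installed as `raxmlHPC` or under another name), so adjust the list to the local installation.

```shell
# Print the name of every given executable that is not found on the PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed executable names; edit to match the local installation.
tools="parallel fastqc multiqc cutadapt trim_galore bwa samtools gatk raxmlHPC"

# $tools is left unquoted on purpose so that it splits into one word per tool.
if missing_tools=$(check_tools $tools); then
    echo "All required tools were found."
else
    # Unquoted so the missing names are printed on a single line.
    echo "Missing tools:" $missing_tools
fi
```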
- All paired, gzipped FASTQ files can be placed in the working directory or in a subdirectory named `fastq`. The pipeline does not currently support singletons.
- The reference genome should be in the `refs` subdirectory and named `genome.fna`. Symbolic links are accepted. Alternatively, the script can download it automatically (with wget) if a direct link is provided (see arguments below).
- Reference annotation is not currently used. For future-proofing, it may be provided inside the `refs` subdirectory as `genes.gtf`. See alternative methods in arguments.
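The input layout described above could be prepared along these lines. This is a sketch only: the function name `stage_inputs` and the paths in the usage comment are hypothetical, and the reference genome is linked rather than copied because symbolic links are accepted.

```shell
# Create the fastq and refs subdirectories in a working directory and
# link an existing reference genome as refs/genome.fna.
# Usage: stage_inputs <working_directory> <absolute_path_to_reference_fasta>
stage_inputs() {
    local workdir="$1" genome="$2"
    mkdir -p "$workdir/fastq" "$workdir/refs"
    # Use an absolute path for the genome so that the link resolves
    # regardless of where it is read from.
    ln -sf "$genome" "$workdir/refs/genome.fna"
}

# Example call (hypothetical paths):
# stage_inputs "$PWD" /data/references/my_reference.fna
```

The paired, gzipped FASTQ files can then be copied or linked into the `fastq` subdirectory.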
The script accepts the following arguments, which are supplied from the helper sbatch file.
- `refgenome` (Reference genome)
  Direct link (URL) to the gzipped reference genome. This is for convenience; the intended goal is to be able to download the reference genome from NCBI etc. Leave it empty if the genome will be supplied manually (see input files above).
- `refannotation` (Annotation file for reference genome)
  Direct link (URL) to the gzipped reference annotation, used in the same way as `refgenome`. This option is here for future functionality and may be left empty for now.
- `minDP`, `minQD`, `maxRDP`, and `minADP` (VCF filtration parameters)
  - `minDP`: Minimum sequencing depth (positions that fail are considered absent).
  - `minQD`: Minimum depth-normalized quality (SNPs that fail are ignored).
  - `maxRDP`: Maximum reference allele depth for SNPs to be accepted as real.
  - `minADP`: Minimum alternate allele depth for SNPs to be accepted as real.
- `ncpu` and `nmem` (Parallelization parameters)
  Number of CPUs and total memory to use. If fewer than 8 CPUs are available, one sample is processed at a time using all available CPUs. Otherwise, 8 CPUs are used per sample, and samples are processed in parallel according to the total number of CPUs.
- `nboot` (Bootstrap)
  Number of bootstrap replicates to use when building the phylogenetic tree with RAxML.
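The CPU allocation rule for `ncpu` can be illustrated with a short sketch. The function name `plan_jobs` is hypothetical, and the use of integer division for the number of concurrent samples (leaving any remainder idle) is an assumption; the script itself may distribute CPUs differently.

```shell
# Print "<concurrent_samples> <cpus_per_sample>" for a given CPU count.
plan_jobs() {
    local ncpu="$1" per_sample=8
    if [ "$ncpu" -lt "$per_sample" ]; then
        # Fewer than 8 CPUs: one sample at a time, using all of them.
        echo "1 $ncpu"
    else
        # Assumption: as many 8-CPU jobs as fit; leftover CPUs stay idle.
        echo "$((ncpu / per_sample)) $per_sample"
    fi
}

plan_jobs 4     # prints: 1 4
plan_jobs 8     # prints: 1 8
plan_jobs 20    # prints: 2 8
```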
If SLURM is available, edit the resource parameters in the sbatch file and run it as `sbatch slurm.batch`.
If running without SLURM, run it as `bash slurm.batch`. This has not been thoroughly tested and is not officially supported in the current version.
The pipeline is written to support resuming and rerunning. The main output files are named systematically and are used as checkpoints.
Some examples:
- Trimming: non-empty `<sample>.og` files in the `fastq` subdirectory.
- Alignment: non-empty `.bam` file in the `align/<sample>` subdirectory.
- Variant calling: non-empty `.vcf` file in the `variants/<sample>` subdirectory.
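The checkpoint idea above (skip a step when its output file already exists and is non-empty) can be sketched as follows. This is an illustration, not the pipeline's actual code: `run_step` and `fake_align` are hypothetical names, and `fake_align` stands in for a real alignment command.

```shell
# Run a command only if its checkpoint file is missing or empty.
# Usage: run_step <checkpoint_file> <command> [args...]
run_step() {
    local checkpoint="$1"
    shift
    if [ -s "$checkpoint" ]; then
        echo "skip $checkpoint"
    else
        echo "run $checkpoint"
        "$@"
    fi
}

# Stand-in for a real step: writes some content to the given file.
fake_align() {
    echo "alignment data" > "$1"
}

# Toy demonstration in a temporary directory.
demo_dir=$(mktemp -d)
run_step "$demo_dir/sample1.bam" fake_align "$demo_dir/sample1.bam"   # first call runs the step
run_step "$demo_dir/sample1.bam" fake_align "$demo_dir/sample1.bam"   # second call skips it
rm -r "$demo_dir"
```

Because the test is `-s` (exists and is non-empty), an empty file left behind by an interrupted step does not count as a completed checkpoint.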
If some samples fail, fix the inputs and rerun the pipeline. Completed samples with valid outputs are not processed again.
If the pipeline exits in the middle of an operation, just rerun it, and the pipeline will pick up from the last completed operation.
If you add new samples, just rerun the pipeline to process them.
A peer-reviewed paper is pending publication. Until then, please cite the Zenodo record as follows:
Sharma A. 2022. rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes (v0.1b). Zenodo. https://doi.org/10.5281/zenodo.7445133