Prok-SNPTree

Prok-SNPTree is a pipeline that generates a phylogenetic tree from core substitutions between a reference genome and Illumina whole-genome sequencing reads.
Prok-SNPTree was originally designed and optimized for small prokaryotic genomes, but it can perform reasonably well with larger genomes of up to 100 Mb.

Prok-SNPTree comes with a helper sbatch file (slurm.batch) for providing the required input information. The file has preset directives for running through the SLURM scheduler, but the pipeline should also be able to run without SLURM.

Installation

Place the helper batch file and the script in the working directory, then make the script executable.

chmod +x prok-snptree.sh

Dependencies

The following tools must be installed, and their executables must be available in the environment (on PATH).

Function	Tools/scripts
Parallelized sample processing	GNU Parallel
Quality check for raw reads	FastQC, MultiQC
Adapter identification and trimming	cutadapt, trim_galore
Genome indexing and read alignment	bwa
Binary conversion and sorting	Samtools
Variant calling and selection	GATK
Phylogenetic tree	RAxML
Pairwise SNP count	FastaToSNPCount.sh
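
Before a long run, it can help to confirm that the dependencies are reachable. The following is a minimal sketch, not part of the pipeline: `missing_tools` is a hypothetical helper, and the executable names passed to it are assumptions that may differ on your system (RAxML in particular is installed under several binary names).

```shell
# Report which of the given executables cannot be found on PATH.
# Prints one missing name per line; returns 1 if anything is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names below are assumptions; adjust them to your installation
# (for example, RAxML is often installed as raxmlHPC or raxmlHPC-PTHREADS).
if ! missing=$(missing_tools parallel fastqc multiqc cutadapt trim_galore bwa samtools gatk raxmlHPC); then
    echo "Missing from PATH:" $missing >&2
fi
```

FastaToSNPCount.sh is a script rather than an installed package, so check separately that it is present and executable wherever you keep it.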

Input files

All paired, gzipped fastq files can be placed in the working directory or in a subdirectory named fastq. The pipeline does not currently support singletons.
The reference genome should be in the refs subdirectory and named genome.fna. Symbolic links are accepted. Alternatively, the script can download it automatically (with wget) if a direct link is provided (see arguments below).
Reference annotation is not currently used. For future-proofing, it may be provided inside the refs subdirectory as genes.gtf. See alternative methods in arguments.
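
Because singletons are not supported, it may be worth checking that every forward-read file has a mate before starting. The sketch below assumes a hypothetical naming scheme of `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`; the pipeline's actual naming requirements are not stated here, so adapt the suffixes as needed. `unpaired_reads` is an illustrative helper, not part of the pipeline.

```shell
# List forward-read files that have no matching reverse-read file.
# Assumes the hypothetical naming <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
unpaired_reads() {
    local dir=$1 r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz   # expected mate of this file
        [ -e "$r2" ] || echo "$r1"
    done
}

# Prints nothing when every forward file in fastq/ has a mate.
unpaired_reads fastq
```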

Arguments

The script accepts the following arguments, which are supplied from the helper sbatch file.

refgenome (Reference genome)
Direct link (URL) to the reference genome (gzipped). This is for convenience; the intended goal is to be able to download the reference genome from NCBI or a similar source. Leave its value empty if the genome will be supplied manually (see input files above).
refannotation (Annotation file for reference genome)
Direct link (URL) to the reference annotation (gzipped), supplied in the same way as refgenome. This option is reserved for future functionality and may be left empty for now.
minDP, minQD, maxRDP, and minADP (VCF filtration parameters)
- minDP: Minimum sequencing depth (Positions that fail are considered absent)
- minQD: Minimum depth-normalized quality (SNPs that fail are ignored)
- maxRDP: Maximum reference allele depth for SNPs to be accepted as real.
- minADP: Minimum alternate allele depth for SNPs to be accepted as real.
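
To make the four thresholds concrete, the sketch below classifies a single site from its depth values. It is an illustration only, not the pipeline's implementation: the threshold values are hypothetical, the order of the checks is an assumption, and the labels (`absent`, `ignored`, `rejected`, `accepted`) are names chosen for this example.

```shell
# Hypothetical threshold values, for illustration only.
minDP=10; minQD=10; maxRDP=2; minADP=8

# Classify one site from its depth (dp), depth-normalized quality (qd),
# reference allele depth (rdp), and alternate allele depth (adp).
classify_site() {
    awk -v dp="$1" -v qd="$2" -v rdp="$3" -v adp="$4" \
        -v minDP="$minDP" -v minQD="$minQD" -v maxRDP="$maxRDP" -v minADP="$minADP" '
        BEGIN {
            if (dp < minDP)                        print "absent"    # too shallow to call
            else if (qd < minQD)                   print "ignored"   # low-confidence SNP
            else if (rdp > maxRDP || adp < minADP) print "rejected"  # allele depths do not support a real SNP
            else                                   print "accepted"
        }'
}

classify_site 40 28.5 0 38   # prints: accepted
classify_site 5 30 0 5       # prints: absent
```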
ncpu and nmem (Parallelization parameters)
Number of CPUs and total memory to use. If fewer than 8 CPUs are available, only one sample is processed at a time, using all available CPUs. Otherwise, 8 CPUs are used per sample, and the number of samples processed in parallel is determined by the total number of CPUs.
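
The rule above can be sketched as a small calculation. `plan_jobs` is a hypothetical helper written for this example, and rounding down when ncpu is not a multiple of 8 is an assumption about how leftover CPUs are handled.

```shell
# Print "<threads per sample> <samples in parallel>" for a given CPU count.
plan_jobs() {
    local ncpu=$1
    if [ "$ncpu" -lt 8 ]; then
        echo "$ncpu 1"             # one sample at a time, all CPUs
    else
        echo "8 $((ncpu / 8))"     # integer division is an assumption
    fi
}

plan_jobs 4    # prints: 4 1
plan_jobs 32   # prints: 8 4
```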
nboot (Bootstrap)
Number of bootstrap replicates to run when building the phylogenetic tree with RAxML.

Running the program

If SLURM is available, edit the resource parameters in the sbatch file and run it as sbatch slurm.batch.
If running without SLURM, run it as bash slurm.batch. This has not been thoroughly tested and is not officially supported in the current version.

Optimizations

The pipeline is written to support resuming and rerunning. The main output files are named systematically and are used as checkpoints.
Some examples:

Trimming: non-empty <sample>.og files in fastq subdirectory.
Alignment: non-empty .bam file in align/<sample> subdirectory.
Variant calling: non-empty .vcf file in variants/<sample> subdirectory.
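
The checkpoint idea can be sketched with the shell's `-s` test, which is true only for a file that exists and is non-empty. `run_step` below is a hypothetical wrapper for illustration; the pipeline's own checks may be written differently.

```shell
# Run a step only if its checkpoint file is missing or empty.
# Usage: run_step <checkpoint-file> <command> [args...]
run_step() {
    local checkpoint=$1
    shift
    if [ -s "$checkpoint" ]; then
        echo "skip: $checkpoint"   # valid output already present
    else
        "$@"                       # run the step that produces the checkpoint
    fi
}
```

With this pattern, a step whose output file was left empty by an interrupted run is executed again, while a step with a non-empty output is skipped.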

If some samples fail, correct the inputs and rerun the pipeline. Completed samples with valid outputs are not processed again.

If the pipeline exits in the middle of an operation, rerun it, and it will pick up from the last completed operation.

If you add new samples, rerun the pipeline to process them.

Citation

A peer-reviewed paper is pending publication. In the meantime, please cite the Zenodo record as follows:

Sharma A. 2022. rknx/Prok-SNPTree: Phylogenetic Tree from Next-gen Sequencing of Prokaryotes (v0.1b). Zenodo. https://doi.org/10.5281/zenodo.7445133
