This document describes the output produced by the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- Fetch from ENA (Optional): fetch reads from the ENA
- Trim reads: read trimming using Trimmomatic
- Estimate genome size
- Downsample reads
- Map reads
- Call variants
- Filter variants
- Pseudogenome creation
- Pseudogenome alignment creation
- Recombination removal (Optional)
- Invariant site removal
- Phylogenetic tree creation (Optional)
This process fetches reads from the ENA using the enaDataGet tool from ENA Browser Tools
Reads are trimmed with Trimmomatic using the parameters ILLUMINACLIP:adapter_file.fas:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15, with MIN_LEN determined dynamically as 30% of the read length
Output directory: <OUTPUT DIR>/trimmed_fastqs
Fastq files post trimming will be written here
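
As an illustration of how the dynamic minimum length could be derived, the following is a minimal sketch and not the pipeline's own code. The function name `trimmomatic_steps`, the `min_len_percent` argument and the choice to round down are assumptions; the step strings are the ones listed above, and `MINLEN` is Trimmomatic's name for the minimum-length step.

```python
def trimmomatic_steps(read_length: int,
                      adapter_file: str = "adapter_file.fas",
                      min_len_percent: int = 30) -> list[str]:
    """Return the Trimmomatic trimming steps for a given read length.

    The minimum length is taken as a percentage of the read length.
    Rounding down is an assumption; the pipeline may round differently.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Integer arithmetic avoids floating point surprises such as 3 * 0.3.
    min_len = read_length * min_len_percent // 100
    return [
        f"ILLUMINACLIP:{adapter_file}:2:30:10",
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:15",
        f"MINLEN:{min_len}",
    ]


if __name__ == "__main__":
    # For 150 bp reads the last step becomes MINLEN:45.
    print(" ".join(trimmomatic_steps(150)))
```

With 150 bp reads, any read shorter than 45 bases after trimming would be discarded under these assumptions.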
Estimate the size of the genome using Mash
If the --depth_cutoff parameter is specified, reads will be downsampled to the specified depth using seqtk
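
The relationship between the estimated genome size, the depth cutoff and the amount of downsampling can be sketched as follows. This is a hedged illustration and not the pipeline's implementation: the function name `downsample_fraction` and its arguments are hypothetical, and it assumes depth is approximated as total sequenced bases divided by the estimated genome size.

```python
def downsample_fraction(total_bases: int, genome_size: int,
                        depth_cutoff: float) -> float:
    """Fraction of reads to keep so that depth does not exceed depth_cutoff.

    Assumes depth ~ total_bases / genome_size (an approximation).
    Returns 1.0 when the sample is already at or below the cutoff.
    """
    if total_bases <= 0 or genome_size <= 0 or depth_cutoff <= 0:
        raise ValueError("all arguments must be positive")
    current_depth = total_bases / genome_size
    if current_depth <= depth_cutoff:
        return 1.0
    return depth_cutoff / current_depth


if __name__ == "__main__":
    # 1 Gb of sequence over a 5 Mb genome is about 200x; keep half for 100x.
    print(downsample_fraction(1_000_000_000, 5_000_000, 100))
```

A fraction computed this way is the kind of value that `seqtk sample` accepts as its sampling argument; using the same seed for both files of a pair keeps the read pairs in sync.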
The reads will be mapped to the specified reference genome using bwa mem
Output directory: <OUTPUT DIR>/sorted_bams
Sorted bam files will be written here
Variants will be called using samtools
Variants will be filtered using bcftools in order to flag low quality SNPs using the default filter of %QUAL<25 || FORMAT/DP<10 || MAX(FORMAT/ADF)<5 || MAX(FORMAT/ADR)<5 || MAX(FORMAT/AD)/SUM(FORMAT/DP)<0.9 || MQ<30 || MQ0F>0.1
Output directory: <OUTPUT DIR>/filtered_bcfs
Filtered vcf files will be written here
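
To make the default filter expression easier to read, the sketch below restates it as a Python predicate for a single-sample record. It is an illustration only: the function name and argument layout are hypothetical, and because there is one sample per file it assumes SUM(FORMAT/DP) equals that sample's DP.

```python
def is_low_quality(qual: float, dp: int, adf: list[int], adr: list[int],
                   ad: list[int], mq: float, mq0f: float) -> bool:
    """Return True if a site would be flagged by the default filter.

    Mirrors: %QUAL<25 || FORMAT/DP<10 || MAX(FORMAT/ADF)<5 ||
             MAX(FORMAT/ADR)<5 || MAX(FORMAT/AD)/SUM(FORMAT/DP)<0.9 ||
             MQ<30 || MQ0F>0.1
    Single-sample assumption: SUM(FORMAT/DP) is taken to be dp.
    """
    return (
        qual < 25                 # low site quality
        or dp < 10                # too few reads (also guards the division)
        or max(adf) < 5           # weak forward-strand support
        or max(adr) < 5           # weak reverse-strand support
        or max(ad) / dp < 0.9     # best allele supported by < 90% of reads
        or mq < 30                # low mapping quality
        or mq0f > 0.1             # too many mapping-quality-zero reads
    )


if __name__ == "__main__":
    # A well-supported site: 39 of 40 reads agree on the alternate allele.
    print(is_low_quality(60, 40, [0, 20], [0, 19], [0, 39], 60, 0.0))
```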
A pseudogenome based on the variants called is created, where missing positions are encoded as - characters and low quality positions as N. All other positions either match the reference or are encoded as a SNV of either G, A, T or C. The script filtered_bcf_to_fasta.py is used.
Output directory: <OUTPUT DIR>/pseudogenomes
A pseudogenome for each sample will be written here
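
The encoding rules for a pseudogenome can be sketched as below. This is not the logic of filtered_bcf_to_fasta.py, which is not reproduced in this document; the function name `build_pseudogenome` and the `calls` structure (1-based position mapped to the called base and whether it passed the filter) are assumptions made for illustration.

```python
def build_pseudogenome(reference: str,
                       calls: dict[int, tuple[str, bool]]) -> str:
    """Build a pseudogenome the same length as the reference.

    calls maps a 1-based reference position to (called_base, passed_filter).
    Positions absent from calls are treated as missing and written as '-';
    positions that failed the filter are written as 'N'; all others carry
    the called base, which is either the reference base or a SNV.
    """
    sequence = []
    for position, _ref_base in enumerate(reference, start=1):
        if position not in calls:
            sequence.append("-")
            continue
        called_base, passed_filter = calls[position]
        sequence.append(called_base if passed_filter else "N")
    return "".join(sequence)


if __name__ == "__main__":
    calls = {1: ("A", True), 2: ("T", True), 3: ("G", False), 5: ("A", True)}
    # Position 2 is a SNV, 3 is low quality, 4 is missing.
    print(build_pseudogenome("ACGTA", calls))
```

Because every reference position produces exactly one character, all pseudogenomes built against the same reference have the same length, which is what allows them to be combined into an alignment.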
The pseudogenomes from the previous step are concatenated to make a whole genome alignment
Output directory: <OUTPUT DIR>/pseudogenomes
The multi-sample pseudogenome alignment will be written here
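
A minimal sketch of the concatenation step, assuming each sample's pseudogenome is already available as a string: the sequences are written one after another as records of a multi-FASTA file. The function name `build_alignment` and the equal-length check are assumptions for illustration, not taken from the pipeline.

```python
def build_alignment(pseudogenomes: dict[str, str]) -> str:
    """Combine per-sample pseudogenomes into multi-FASTA text.

    pseudogenomes maps a sample name to its sequence. All sequences must
    have the same length for the result to be a valid alignment.
    """
    lengths = {len(sequence) for sequence in pseudogenomes.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return "".join(
        f">{name}\n{sequence}\n" for name, sequence in pseudogenomes.items()
    )


if __name__ == "__main__":
    print(build_alignment({"sample1": "ACGT", "sample2": "AC-T"}), end="")
```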
Recombination is removed from the alignment using gubbins
Invariant sites are removed using snp-sites
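
The idea behind invariant site removal can be sketched as follows. This is a simplified illustration and not a reimplementation of snp-sites: it assumes a column is variable when it contains more than one distinct base among A, C, G and T, ignoring - and N, and snp-sites' exact handling of gaps and ambiguous characters may differ. The function name `variable_sites` is hypothetical.

```python
def variable_sites(alignment: dict[str, str]) -> dict[str, str]:
    """Keep only alignment columns with more than one distinct A/C/G/T base.

    alignment maps a sample name to its aligned sequence; all sequences
    are assumed to have the same length.
    """
    known_bases = set("ACGT")
    sequences = list(alignment.values())
    keep = [
        index
        for index, column in enumerate(zip(*sequences))
        if len({base.upper() for base in column} & known_bases) > 1
    ]
    return {
        name: "".join(sequence[index] for index in keep)
        for name, sequence in alignment.items()
    }


if __name__ == "__main__":
    alignment = {"s1": "ACGTA", "s2": "ATGNA", "s3": "AC-TC"}
    # Only the second and fifth columns vary among known bases.
    print(variable_sites(alignment))
```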
A maximum likelihood tree is generated using IQ-TREE
Output directory: <OUTPUT DIR>
The consensus tree aligned_pseudogenome.variants_only.contree, including bootstrap values, will be written here
- Trimmomatic: A flexible read trimming tool for Illumina NGS data.
- mash: Fast genome and metagenome distance estimation using MinHash.
- seqtk: A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
- bwa mem: Burrows-Wheeler Aligner for short-read alignment.
- samtools: Utilities for the Sequence Alignment/Map (SAM) format.
- bcftools: Utilities for variant calling and manipulating VCFs and BCFs.
- filtered_bcf_to_fasta.py: Python utility to create a pseudogenome from a bcf file where each position in the reference genome is included.
- gubbins: Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences.
- snp-sites: Finds SNP sites from a multi-FASTA alignment file.
- IQ-TREE: Efficient software for phylogenomic inference.