nf-core/bactmap: Output

This document describes the output produced by the pipeline.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Fetch from ENA

Reads are fetched from the European Nucleotide Archive (ENA) using the enaDataGet tool from ENA Browser Tools.

Trim Reads

Reads are trimmed with Trimmomatic using the parameters ILLUMINACLIP:adapter_file.fas:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15, with MIN_LEN determined dynamically as 30% of the read length.

Output directory: <OUTPUT DIR>/trimmed_fastqs

FASTQ files produced by trimming are written here.
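As a rough illustration of the dynamic minimum length, the sketch below computes 30% of a read length and assembles the trimming steps listed above. The function names `min_len_for` and `trimmomatic_steps` are hypothetical, rounding down to a whole base is an assumption, and the last step is written with Trimmomatic's `MINLEN` keyword; the pipeline's own calculation may differ.

```python
def min_len_for(read_length, percent=30):
    """Return the minimum read length to keep, as a percentage of the read length.

    Integer arithmetic is used so the result is rounded down to a whole base
    (an assumption of this sketch).
    """
    return (read_length * percent) // 100


def trimmomatic_steps(read_length, adapter_file="adapter_file.fas"):
    """Build the list of Trimmomatic trimming steps described in the text."""
    return [
        f"ILLUMINACLIP:{adapter_file}:2:30:10",
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:15",
        f"MINLEN:{min_len_for(read_length)}",
    ]


# Example: steps for 150 bp reads, joined as they would appear on a command line.
print(" ".join(trimmomatic_steps(150)))
```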

Genome Size Estimation

The size of the genome is estimated using Mash.

Downsample reads

If the --depth_cutoff parameter is specified, reads are downsampled to the specified depth using seqtk.
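One way to picture the downsampling decision: the estimated depth is the total number of sequenced bases divided by the estimated genome size, and reads only need subsampling when that depth exceeds the cutoff. The sketch below is an assumption about how such a fraction could be derived, not the pipeline's own code; `estimated_depth` and `sampling_fraction` are hypothetical names.

```python
def estimated_depth(total_bases, genome_size):
    """Approximate sequencing depth: sequenced bases per genome position."""
    return total_bases / genome_size


def sampling_fraction(total_bases, genome_size, depth_cutoff):
    """Return the fraction of reads to keep so that depth does not exceed the cutoff.

    A value of 1.0 means the sample is already at or below the cutoff
    and no downsampling is needed.
    """
    depth = estimated_depth(total_bases, genome_size)
    if depth <= depth_cutoff:
        return 1.0
    return depth_cutoff / depth


# Example: 500 Mb of reads over a 5 Mb genome with a depth cutoff of 50.
print(sampling_fraction(500_000_000, 5_000_000, 50))
```

A fraction like this is the kind of value that `seqtk sample` accepts as its sampling argument.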

Map reads

The reads are mapped to the specified reference genome using bwa mem.

Output directory: <OUTPUT DIR>/sorted_bams

Sorted BAM files are written here.

Call variants

Variants are called using samtools.

Filter variants

Variants are filtered using bcftools to flag low quality SNPs. The default filter is %QUAL<25 || FORMAT/DP<10 || MAX(FORMAT/ADF)<5 || MAX(FORMAT/ADR)<5 || MAX(FORMAT/AD)/SUM(FORMAT/DP)<0.9 || MQ<30 || MQ0F>0.1

Output directory: <OUTPUT DIR>/filtered_bcfs

Filtered VCF files are written here.
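To make the default filter easier to read, the sketch below restates its conditions as a plain Python predicate: a site is flagged when any one condition holds. This is only an illustration, not how bcftools evaluates expressions. The function name `is_low_quality` and its argument names are hypothetical, and a single-sample file is assumed so that SUM(FORMAT/DP) is simply the sample's DP.

```python
def is_low_quality(qual, dp, adf, adr, ad, mq, mq0f):
    """Return True when a site would be flagged by the default filter.

    qual : site quality (QUAL)
    dp   : sample read depth (FORMAT/DP); single sample assumed
    adf  : allelic depths on the forward strand (FORMAT/ADF)
    adr  : allelic depths on the reverse strand (FORMAT/ADR)
    ad   : total allelic depths (FORMAT/AD)
    mq   : average mapping quality (MQ)
    mq0f : fraction of reads with mapping quality zero (MQ0F)
    """
    return (
        qual < 25
        or dp < 10            # also guards the division below against dp == 0
        or max(adf) < 5
        or max(adr) < 5
        or max(ad) / dp < 0.9  # the best-supported allele must dominate
        or mq < 30
        or mq0f > 0.1
    )


# A well-supported alternate allele: 40 reads, all supporting the second allele.
print(is_low_quality(60, 40, [0, 20], [0, 20], [0, 40], 60, 0.0))
```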

Pseudogenome creation

A pseudogenome is created from the called variants: missing positions are encoded as - characters and low quality positions as N. All other positions either match the reference or are encoded as an SNV of G, A, T or C. The script filtered_bcf_to_fasta.py is used.

Output directory: <OUTPUT DIR>/pseudogenomes

A pseudogenome for each sample is written here.
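The encoding rules can be illustrated with a minimal sketch. It is not the filtered_bcf_to_fasta.py script itself: the function `build_pseudogenome` and the shape of the `calls` dictionary (0-based position mapped to a small record) are assumptions made for the example.

```python
def build_pseudogenome(reference, calls):
    """Build a pseudogenome string with one character per reference position.

    calls maps a 0-based position to a dict with:
      "low_quality": True if the site failed filtering
      "alt": the called SNV base, or absent/None if the site matches the reference
    Positions absent from calls are treated as missing.
    """
    sequence = []
    for position, ref_base in enumerate(reference):
        call = calls.get(position)
        if call is None:
            sequence.append("-")       # missing position
        elif call["low_quality"]:
            sequence.append("N")       # low quality position
        else:
            sequence.append(call.get("alt") or ref_base)  # SNV or reference base
    return "".join(sequence)


reference = "ACGTA"
calls = {
    0: {"low_quality": False},
    1: {"low_quality": False, "alt": "T"},
    2: {"low_quality": True},
    4: {"low_quality": False},
}
print(build_pseudogenome(reference, calls))
```

Because every reference position produces exactly one character, all pseudogenomes built against the same reference have the same length.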

Pseudogenome alignment creation

The pseudogenomes from the previous step are concatenated to make a whole genome alignment.

Output directory: <OUTPUT DIR>/pseudogenomes

The multi-sample pseudogenome alignment is written here.
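Since each pseudogenome covers every reference position, joining them into one multi-FASTA file already yields an alignment. The sketch below shows that idea with a length check; `make_alignment` and the sample names are hypothetical and this is not the pipeline's own code.

```python
def make_alignment(pseudogenomes):
    """Join per-sample pseudogenomes into a multi-FASTA alignment string.

    pseudogenomes maps a sample name to its pseudogenome sequence.
    All sequences must have the same length to form a valid alignment.
    """
    lengths = {len(sequence) for sequence in pseudogenomes.values()}
    if len(lengths) > 1:
        raise ValueError(f"pseudogenomes differ in length: {sorted(lengths)}")
    return "".join(
        f">{name}\n{sequence}\n" for name, sequence in pseudogenomes.items()
    )


print(make_alignment({"sample_1": "ACGT", "sample_2": "AC-T"}), end="")
```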

Recombination removal

Recombination is removed from the alignment using gubbins.

Invariant sites

Invariant sites are removed using snp-sites.
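Conceptually, removing invariant sites means keeping only the alignment columns in which the samples differ. The sketch below illustrates the idea rather than the snp-sites implementation: the function names are hypothetical, and treating - and N as uninformative when deciding whether a column varies is an assumption of this example.

```python
def variable_columns(sequences):
    """Return the indices of columns with more than one distinct A/C/G/T base."""
    columns = []
    for index, column in enumerate(zip(*sequences)):
        # Gaps and N are ignored when deciding whether a column varies.
        bases = {base.upper() for base in column} & set("ACGT")
        if len(bases) > 1:
            columns.append(index)
    return columns


def strip_invariant_sites(sequences):
    """Return the sequences reduced to their variable columns only."""
    keep = variable_columns(sequences)
    return ["".join(sequence[i] for i in keep) for sequence in sequences]


alignment = ["ACGT", "ACTT", "A-GN"]
print(variable_columns(alignment))
print(strip_invariant_sites(alignment))
```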

Phylogenetic tree creation

A maximum likelihood tree is generated using IQ-TREE.

Output directory: <OUTPUT DIR>

The consensus tree aligned_pseudogenome.variants_only.contree, including bootstrap values, is written here.

Software used within the pipeline

  • Trimmomatic A flexible read trimming tool for Illumina NGS data.
  • mash Fast genome and metagenome distance estimation using MinHash.
  • seqtk A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
  • bwa mem Burrows-Wheeler Aligner for short-read alignment
  • samtools Utilities for the Sequence Alignment/Map (SAM) format
  • bcftools Utilities for variant calling and manipulating VCFs and BCFs
  • filtered_bcf_to_fasta.py Python utility to create a pseudogenome from a bcf file where each position in the reference genome is included
  • gubbins Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences
  • snp-sites Finds SNP sites from a multi-FASTA alignment file
  • IQ-TREE Efficient software for phylogenomic inference