LegioVue is a Nextflow pipeline for whole-genome analysis of Legionella pneumophila. It performs in silico sequence typing, genome assembly, and core-genome analysis, and it reports detailed quality information for L. pneumophila genomes. The name is an homage to the Bellevue-Stratford Hotel, site of the first known outbreak of Legionnaires' Disease.
This repository holds the LegioVue analysis pipeline along with validation notes and information on follow-up data analysis steps such as clustering. LegioVue is part of a GRDI-funded research project to assess and implement a whole-genome sequencing scheme for rapid resolution of Legionella pneumophila outbreaks within Canada, to better protect vulnerable populations. The goal is to generate and nationally deploy a standardized pipeline that will shift L. pneumophila analysis from conventional sequence-based typing to whole-genome sequence-based typing and clustering, for rapid detection of and response to Legionnaires' Disease outbreaks in Canada.
LegioVue combines tools for de novo assembly, sequence typing, cgMLST, and quality control of all input samples, with the end goal of producing analyzed and formatted data that can be used to confirm outbreak clusters. Clustering is not currently included in the pipeline, but it is planned for a future release. In the meantime, we include additional instructions on how to perform cluster analysis on the generated outputs.
- Installation
- Resource Requirements
- Quick Usage
- Quick Outputs
- Pipeline Components and Settings
- Limitations
- Citations
- Contributing
- Legal
Installation requires both nextflow (minimum version tested 23.10.1) and a dependency management system to run.
Steps:
- Download and install nextflow
  - Download and install with conda
    - Conda command: conda create -n nextflow -c conda-forge -c bioconda nextflow
  - Install with the instructions at https://www.nextflow.io/
- Determine which dependency management system works best for you
  - Note: Currently the plotting process is using a custom docker container
- Run the pipeline with one of the following profiles to handle dependencies (or use your own profile if you have one for your institution! The NML one is included as an example):
  - conda
  - mamba
  - singularity
  - docker
By default, the kraken2 and SPAdes steps have a minimum resource usage allocation set to 8 cpus and 48GB memory using the nf-core process_high label.
This can be adjusted (along with the other labels) by creating and passing a custom configuration file with -c <config> or by adjusting the --max_cpus and --max_memory parameters. More info can be found in the usage doc.
The recommended kraken2 database is the 8GB standard database, which can be found on the AWS Index server or the kraken2 database zone. With this database, the required memory can be lowered considerably (16GB) with minimal impact if resources are a limiting factor.
Detailed run and parameter instructions are found in the usage doc here.
To get started, one of the following basic commands is all that is required. The only difference between the two is how the input fastq data is specified:
Directory Input:
nextflow run phac-nml/legiovue \
-profile <PROFILE> \
--fastq_dir </PATH/TO/PAIRED_FASTQS> \
--kraken2_db </PATH/TO/KRAKEN2_DB> \
[Optional Args]
Where:
- -profile <PROFILE>: The nextflow profile to use.
  - Specification of a dependency management system (docker, singularity, conda)
- --fastq_dir </PATH/TO/PAIRED_FASTQS>: Path to directory containing paired Illumina _R1 and _R2 fastq files
  - Fastqs must be formatted as <NAME>_{R1,R2}*.fastq*
  - At the moment everything before the first _R1/_R2 is kept as the sample name
- --kraken2_db </PATH/TO/KRAKEN2_DB>: Path to a kraken2 database
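As a rough illustration of the naming rule above, the following Python sketch derives a sample name by keeping everything before the first _R1/_R2 in a file name. It is not the pipeline's own code: the function name `sample_name` and the check that "fastq" appears after the read tag are assumptions made for this example.

```python
import re
from typing import Optional

# Assumption: the read tag is the first "_R1" or "_R2" found in the file name.
_READ_TAG = re.compile(r"_R[12]")


def sample_name(filename: str) -> Optional[str]:
    """Return everything before the first _R1/_R2, or None if the name does not fit.

    Hypothetical helper that mirrors the <NAME>_{R1,R2}*.fastq* pattern.
    """
    match = _READ_TAG.search(filename)
    if match is None or match.start() == 0:
        return None  # no read tag, or nothing in front of it to use as a name
    if "fastq" not in filename[match.end():]:
        return None  # the read tag is not followed by a fastq extension
    return filename[:match.start()]


for name in ["LP-01_S5_R1_001.fastq.gz", "isolate_R2.fastq", "notes.txt"]:
    print(name, "->", sample_name(name))
```

Note that a name which itself contains _R1 or _R2 (for example a sample called run_R1_test) would be cut at that first occurrence, so such names are best avoided.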
Samplesheet CSV Input:
nextflow run phac-nml/legiovue \
-profile <PROFILE> \
--input </PATH/TO/INPUT.csv> \
--kraken2_db </PATH/TO/KRAKEN2_DB> \
[Optional Args]
Where:
- -profile <PROFILE>: The nextflow profile to use.
  - Specification of a dependency management system (docker, singularity, conda)
- --input </PATH/TO/INPUT.csv>: Path to a CSV file with the header line sample,fastq_1,fastq_2
  - sample is the name of the sample
  - fastq_1 and fastq_2 are the paths to the two paired fastq read files
  - Note that paired-end sequencing is required at this time!
  - Example file
- --kraken2_db </PATH/TO/KRAKEN2_DB>: Path to a kraken2 database
Note
The recommended 8GB kraken2 standard database can be found on the AWS Index server or the kraken2 database zone. Download this before running the pipeline!
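For samplesheet input, a CSV with the header line sample,fastq_1,fastq_2 can be written by hand or generated. The Python sketch below shows one possible way to build it from a directory of paired fastqs. The function name `build_samplesheet`, the use of absolute paths, and the rule of skipping samples that lack a mate are assumptions for this example rather than pipeline behaviour.

```python
import csv
import re
from pathlib import Path


def build_samplesheet(fastq_dir: str, out_csv: str) -> int:
    """Write sample,fastq_1,fastq_2 rows for every complete R1/R2 pair.

    Returns the number of samples written. Hypothetical helper.
    """
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = re.search(r"_(R[12])", path.name)
        if match is None or "fastq" not in path.name[match.end():]:
            continue  # not a paired fastq file name
        sample = path.name[:match.start()]
        # Assumption: absolute paths are the safest thing to put in the sheet.
        pairs.setdefault(sample, {})[match.group(1)] = str(path.resolve())

    # Keep only samples where both mates were found (paired-end is required).
    rows = [
        (sample, reads["R1"], reads["R2"])
        for sample, reads in sorted(pairs.items())
        if "R1" in reads and "R2" in reads
    ]
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        writer.writerows(rows)
    return len(rows)
```

Calling `build_samplesheet("fastqs/", "input.csv")` (both paths are placeholders) would produce a file that can then be passed with --input.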
All outputs are described in the output docs. By default, outputs are written to the results folder. Some of the major outputs are:
- spades/: Contains the SPAdes assemblies (contigs as .fasta files) for each sample.
- el_gato/el_gato_st.tsv: Summarized el_gato ST calls for all samples.
- chewbbaca/allele_calls/cgMLST/: cgMLST profiles that can be used for downstream visualization.
- overall.qc.csv: Final quality summary report for each sample throughout the different pipeline steps. Important quality flags can be found in this file.
Kraken2 and Bracken
Kraken2 is used to taxonomically profile the paired Illumina reads against the standard Kraken RefSeq database with a confidence level of 0.1 (--confidence 0.1). Bracken is then used to estimate taxonomic abundances (including potential contaminants) from the Kraken profile.
Trimmomatic
Trimmomatic is used to remove Illumina adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True) and trim reads according to quality (LEADING:3, TRAILING:3, SLIDINGWINDOW:4:20). Reads shorter than 100bp are dropped (MINLEN:100).
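To show how these trimming steps fit together on a command line, here is a small Python sketch that assembles a paired-end Trimmomatic invocation. It is an approximation and not the command the pipeline runs: the `trimmomatic` executable name (as provided by the conda wrapper), the output file names, and the helper name `trimmomatic_args` are assumptions. The trimming steps themselves are the ones listed above.

```python
import shlex
from typing import List


def trimmomatic_args(r1: str, r2: str, prefix: str, min_len: int = 100) -> List[str]:
    """Build a Trimmomatic PE argument list (hypothetical helper).

    Trimmomatic's PE mode takes two inputs followed by four outputs:
    paired/unpaired for the forward read, then paired/unpaired for the reverse.
    """
    return [
        "trimmomatic", "PE", r1, r2,
        f"{prefix}_R1.paired.fastq.gz", f"{prefix}_R1.unpaired.fastq.gz",  # assumed names
        f"{prefix}_R2.paired.fastq.gz", f"{prefix}_R2.unpaired.fastq.gz",  # assumed names
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True",  # adapter removal
        "LEADING:3",           # drop low-quality bases at the start of a read
        "TRAILING:3",          # drop low-quality bases at the end of a read
        "SLIDINGWINDOW:4:20",  # cut once a 4-base window averages below Q20
        f"MINLEN:{min_len}",   # discard reads shorter than min_len
    ]


args = trimmomatic_args("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
print(shlex.join(args))  # printed only; nothing is executed here
```

Steps are applied in the order given, so MINLEN is placed last to filter on the length that remains after clipping and trimming.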
FastQC
FastQC provides quality information about the trimmed reads including estimates of duplication, %GC, and N content. Samples retaining fewer than 150,000 high-quality read pairs after trimming are removed unless --min_reads <COUNT> is specified.
SPAdes and QUAST
High-quality reads (both paired and unpaired) are then assembled into Legionella genomes using the SPAdes assembler with the --careful option, which aims to minimize mismatches and short indels in the assembly. The quality of the resulting assemblies is evaluated with QUAST. At this step, genomes are compared to a Legionella pneumophila reference genome and an assembly quality score is calculated for each sample using a custom script.
The quast_analyzer.py script assigns a score to each SPAdes assembly based on pre-cgMLST metrics (e.g., similarity to RefSeq complete Lp genomes, N50, # contigs, %GC content) originally outlined in the supplementary appendix (Supplementary Table 2) of the following paper:
Gorzynski, J., Wee, B., Llano, M., Alves, J., Cameron, R., McMenamin, J., et al. (2022). Epidemiological analysis of Legionnaires’ disease in Scotland: a genomic study. The Lancet Microbe 3, e835–e845. doi: 10.1016/S2666-5247(22)00231-2
Quality thresholds and score effects have been updated in this pipeline to better capture quality issues that are likely to affect the interpretation of the resulting cgMLST profile. Assemblies are assigned a quality score out of 6, where a score of 6/6 represents an "excellent" high-quality Legionella pneumophila assembly.
el_gato
el_gato performs in silico Sequence-based Typing (SBT) of Legionella pneumophila sequences based on the identification and comparison of 7 loci (flaA, pilE, asd, mip, mompS, proA, neuA/neuAh) against an allele database. In this pipeline SBT is first called on Illumina paired-end reads using a mapping/alignment approach that is recommended by the el_gato developers. If samples are not initially assigned a sequence type (ST = MA? or MD-), el_gato is run again on the assembled genome using an in silico PCR-based approach. The resulting allele and ST calls are reported in el_gato_st.tsv.
Note: if the ST results are inconclusive after both approaches have been tried, users are encouraged to review the possible_mlsts.txt intermediate output for that sample in the pipeline results folder under el_gato/reads/.
chewBBACA
Assembled Legionella pneumophila genomes are passed to chewBBACA, which performs Core Genome MultiLocus Sequence Typing (cgMLST) according to the published Ridom SeqSphere 1521-loci cgMLST schema for L. pneumophila.
cgMLST Visualization and Clustering
PHYLOViZ and ReporTree
Note: ReporTree requires an update before it can be properly incorporated into the nextflow pipeline. For now, users can run ReporTree separately on their pipeline output to produce the same visualizations.
Visualize cgMLST profiles alongside sample metadata using one of the following two methods:
i) Drop the cgMLST profile (e.g., cgMLST100.tsv) directly into PHYLOViZ and upload metadata for visualization, or
ii) Perform partitioning (clustering) with ReporTree, which will generate outputs (MST and metadata) that can be visualized with the local version of GrapeTree.
Detailed instructions for clustering and visualization are provided separately.
Quality Summary
LegioVue outputs a summary of quality metrics and warnings for each step of the workflow in the overall.qc.csv file.
The final quality summary has two columns, qc_status and qc_message, that can be used to quickly determine if a sample is good or may have an issue. The qc_status column will be any of the following statuses:
- Pass: The sample passes all checks!
- Warn: The sample was flagged for a specific warning
- Fail: The sample has failed out of the pipeline and may not be included in the final cgMLST profile.
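A quick way to triage a run is to tally the qc_status column of overall.qc.csv. The Python sketch below is one possible approach, under the assumptions that the file is a plain comma-separated table with a header row and that status values may differ in letter case (the function name `tally_qc_status` is made up for this example). Only the qc_status and qc_message columns named in this document are relied on.

```python
import csv
from collections import Counter
from typing import Dict, List, Tuple


def tally_qc_status(csv_path: str) -> Tuple[Counter, List[Dict[str, str]]]:
    """Count samples per qc_status and collect every row that did not pass.

    Hypothetical helper; statuses are upper-cased so "Pass" and "PASS" match.
    """
    counts: Counter = Counter()
    flagged: List[Dict[str, str]] = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            status = row["qc_status"].strip().upper()
            counts[status] += 1
            if status != "PASS":
                flagged.append(row)  # keep the full row, including qc_message
    return counts, flagged
```

For example, `counts, flagged = tally_qc_status("results/overall.qc.csv")` (the path assumes the default results folder) gives a per-status count plus the rows whose qc_message should be reviewed.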
The qc_message column contains the reason for the qc_status and includes:
| Message | Associated Status | Flag Reason |
|---|---|---|
| low_lpn_abundance | WARN | Low (< 75%) L. pneumophila abundance is not expected with isolate sequencing and may indicate contamination. |
| low_read_count | WARN | Low read count (< 150,000 reads default) has been shown to lead to poor, uninformative assemblies. |
| low_n50 | WARN | Low N50 scores (< 100,000) have been shown to negatively affect clustering outputs by inflating observed allele differences. |
| low_exact_allele_calls | WARN | Low chewBBACA exact allele calls (< 90%) indicate that there may be issues in the assembly, possibly affecting the cgMLST profile. |
| low_qc_score | WARN | Low QUAST-Analyzer QC score (< 4) indicates that there may be issues in the assembly, possibly affecting the cgMLST profile. |
| no_lpn_detected | FAIL | Very low (< 10% default) L. pneumophila abundance flags that the sample may not be L. pneumophila, and the sample is removed from the remainder of the pipeline. |
| failing_read_count | FAIL | Post-trimming read count below the failing threshold (< 60,000 reads default) has been shown to lead to poor, uninformative assemblies, and the sample is removed. |
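The thresholds in the table can be read as a simple decision procedure. The Python sketch below restates them for illustration only and is not the pipeline's implementation. The function and parameter names are hypothetical, and the order of checks (FAIL conditions first, stopping at the first one because a failed sample is removed from later steps) is an assumption.

```python
from typing import List, Tuple


def classify_sample(
    lpn_abundance_pct: float,
    trimmed_reads: int,
    n50: int,
    exact_allele_calls_pct: float,
    qc_score: int,
) -> Tuple[str, List[str]]:
    """Return (status, messages) using the default thresholds from the table."""
    # FAIL conditions: the sample leaves the pipeline, so later metrics are not checked.
    if lpn_abundance_pct < 10:
        return "FAIL", ["no_lpn_detected"]
    if trimmed_reads < 60_000:
        return "FAIL", ["failing_read_count"]

    # WARN conditions: the sample continues but is flagged for review.
    messages: List[str] = []
    if lpn_abundance_pct < 75:
        messages.append("low_lpn_abundance")
    if trimmed_reads < 150_000:
        messages.append("low_read_count")
    if n50 < 100_000:
        messages.append("low_n50")
    if exact_allele_calls_pct < 90:
        messages.append("low_exact_allele_calls")
    if qc_score < 4:
        messages.append("low_qc_score")
    return ("WARN" if messages else "PASS"), messages


print(classify_sample(98.5, 420_000, 310_000, 97.2, 6))
print(classify_sample(82.0, 100_000, 45_000, 95.0, 5))
```

How the pipeline combines several warnings into a single qc_message value is not described here, so the sketch simply returns them as a list.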
This pipeline is intended to be run on Legionella pneumophila paired-end Illumina isolate sequencing data. In the future, Nanopore long-read sequencing data will also be supported.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Detailed citations for the tools and data used in this pipeline are found in CITATIONS.md.
Contributions are welcome through pull requests or issues.
Copyright 2024 Government of Canada
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
https://opensource.org/license/mit/
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.