Genomic epidemiology of E. coli strains within wild hosts

Supplementary material for the MSc thesis titled "Genomic epidemiology of E. coli strains within wild hosts".

Contents

Supplementary figures

Supplementary figure 1: Multidimensional scaling (MDS) of pairwise distances between isolates, calculated from the Roary core genome alignment. A: isolates labelled by phylogroup, showing clearly defined clustering. B: isolates labelled by the taxonomic order of the host species.

Supplementary methods

Isolate acquisition: The North American collection consisted of 107 E. coli isolates and 11 Enterobacter cloacae isolates from Mexico, plus one E. coli isolate from Costa Rica. Three E. coli isolates were provided from Venezuela in South America, and one unclassified isolate from Africa. Thirteen of the contributed E. coli isolates and two E. cloacae isolates came from unknown continents and countries. In total, 138 isolates were provided, 123 of them E. coli; the E. cloacae isolates were excluded from downstream analysis. The E. coli isolates were recovered from the faeces of hosts spanning 50 genera and 64 species, covering both land-dwelling and flying animals. The isolates were sequenced as described by Moradigaravand et al.¹

Antibiotic susceptibility of the isolates to ampicillin, cefotaxime, ceftazidime, cefuroxime, cephalothin, ciprofloxacin, gentamicin, tobramycin and trimethoprim had been established by phenotypic testing.

Quality control of short paired-end reads: The paired-end reads were quality controlled to ensure that they were E. coli and of high enough quality for downstream analysis. The Kraken² taxonomic assignment tool uses exact alignment of k-mers and a custom classification algorithm to assign taxonomic labels quickly and sensitively. Its output gives the percentage of reads belonging to each species; from this we identified contaminated isolates, those with less than 40% of reads assigned to E. coli, and removed them from all analyses.
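The contamination filter described above can be sketched as follows. This is a minimal illustration, not the thesis script: it assumes the per-isolate E. coli read percentages have already been extracted from the Kraken reports, and the isolate names and values are hypothetical.

```python
def filter_contaminated(ecoli_percent, threshold=40.0):
    """Split isolates by the percentage of reads assigned to E. coli.

    Isolates below the threshold are treated as contaminated and removed;
    isolates at or above it are kept.
    """
    kept = {name: pct for name, pct in ecoli_percent.items() if pct >= threshold}
    removed = {name: pct for name, pct in ecoli_percent.items() if pct < threshold}
    return kept, removed


# Hypothetical isolate names and percentages, for illustration only.
example = {"isolate_A": 92.3, "isolate_B": 12.8, "isolate_C": 40.0}
kept, removed = filter_contaminated(example)
print("kept:", sorted(kept))
print("removed:", sorted(removed))
```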

The stats.sh shell tool from BBMap³ was used to obtain the N50 and number of contigs for each assembly file. Taken together, the N50 (higher is better) and the number of contigs (lower is better) indicate the quality of the assembly and how completely the whole genome sequence of the isolate was recovered. A custom shell script was written to calculate and extract these values.
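For readers unfamiliar with the metric, the sketch below computes N50 under its conventional definition: the contig length at which the contigs of that length or longer cover at least half of the total assembly. It is an illustration only and does not reproduce the output format of stats.sh; the contig lengths are toy values.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy contig lengths, for illustration only.
toy_lengths = [100, 200, 300, 400, 500]
print("contigs:", len(toy_lengths), "N50:", n50(toy_lengths))
```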

De novo assembly and pangenome analysis: Paired-end reads were de novo assembled with Velvet⁴, following the workflow and parameters of Moradigaravand et al.¹ The assembled genomes were then annotated with Prokka, the rapid prokaryotic whole-genome annotator⁵. The annotated .gff files were used as input for the pangenome pipeline Roary⁶, using fast core alignment to create a multi-FASTA alignment of core genes (roary -n -e -z *.gff). Roary extracts the coding regions and runs BLASTP on them with a defined identity threshold. The output summary statistics and core genome alignment were used in downstream analysis. SNPs within the core genome were identified with SNP-sites from the Sanger Institute⁷. Approximately maximum-likelihood trees were generated with FastTree⁸, and the trees and associated metadata were visualised in FigTree and iTOL⁹. R was used to generate MDS plots of the pairwise distances between isolates, clustering by isolate phylogroup and host taxonomic level.
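To illustrate what extracting SNPs from a core genome alignment involves, the sketch below finds the alignment columns where more than one nucleotide is observed. It is a simplified stand-in, not the SNP-sites implementation: ignoring gaps and ambiguous bases is an assumption made here, and the sequences are toy data.

```python
def variable_sites(alignment):
    """Return 0-based column indices where more than one of A/C/G/T occurs.

    `alignment` maps isolate name -> aligned sequence. Gaps and ambiguous
    bases are ignored when deciding whether a column is variable
    (an assumption of this sketch).
    """
    seqs = [seq.upper() for seq in alignment.values()]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for index, column in enumerate(zip(*seqs)):
        bases = {base for base in column if base in "ACGT"}
        if len(bases) > 1:
            sites.append(index)
    return sites


# Toy alignment with hypothetical isolate names.
toy_alignment = {"iso1": "ACGTAC", "iso2": "ACGTTC", "iso3": "AC-TAC"}
print(variable_sites(toy_alignment))
```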

Pairwise distance analysis: Pairwise distances between isolates were calculated from both the core gene alignment and the core SNP alignment, using the R package ape¹⁰ to read each multi-FASTA alignment file and generate a pairwise distance matrix. The distributions of pairwise distances, excluding phylogroup Clade I and isolates with an unidentified phylogroup, were plotted and analysed by splitting isolate pairs into those sharing and those differing in isolate phylogroup or in a taxonomic level of the host species. Clustering of the pairwise distances was analysed by multidimensional scaling (MDS), applying the cmdscale function from the built-in R stats package to the pairwise distance matrices. Its output can be read directly into ggplot2; Clade I isolates and those with an unidentified phylogroup were again excluded. Data points were labelled by isolate phylogroup and by differing taxonomic levels of the host species to observe the clustering patterns.
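The pairwise distance step and the same-group versus different-group split can be sketched as below. The analysis itself was done in R with ape; this Python version is only an illustration, and it assumes a raw distance (the proportion of differing sites), which may not be the substitution model used in the thesis. Isolate names, sequences and phylogroup labels are toy values.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, counting only sites where both are A/C/G/T."""
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Symmetric nested-dict distance matrix over all isolates."""
    names = list(alignment)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = p_distance(alignment[n], alignment[m])
    return dist


def split_by_group(dist, group):
    """Split pairwise distances into same-group and different-group pairs."""
    same, different = [], []
    for n, m in combinations(list(dist), 2):
        (same if group[n] == group[m] else different).append(dist[n][m])
    return same, different


# Toy data with hypothetical isolate names and phylogroup labels.
toy = {"iso1": "ACGTACGT", "iso2": "ACGTTCGT", "iso3": "ACGAACGA"}
groups = {"iso1": "A", "iso2": "A", "iso3": "B2"}
same, different = split_by_group(distance_matrix(toy), groups)
print("same phylogroup:", same)
print("different phylogroup:", different)
```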

Correlation analysis of virulence factor frequency and adult body mass of the host species: Linear regression of the virulence factor frequency of each isolate against the adult body mass of its host species was conducted using adult body masses from the PanTHERIA¹¹ ecological database, with NAs removed and the extrapolated and measured data merged. The combined virulence factor frequency and adult body mass dataset was bootstrapped to produce 100 datasets of 82 observations each, giving distributions of adjusted R-squared, F-statistic and p-values. Correlation between the two variables was also assessed with Spearman's correlation tests on the same bootstrapped datasets, giving distributions of the t-statistic, correlation coefficient and p-values. Spearman's correlation was used because it does not assume a linear relationship.
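The bootstrapped Spearman analysis can be sketched as follows. This is an illustration rather than the thesis R script: it assumes resampling with replacement, computes only the correlation coefficient (not the test statistic or p-value), and uses made-up body masses and virulence factor counts.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_spearman(x, y, n_datasets=100, n_obs=82, seed=0):
    """Resample (x, y) pairs with replacement and return rho for each dataset."""
    rng = random.Random(seed)
    pairs = list(zip(x, y))
    rhos = []
    for _ in range(n_datasets):
        xs, ys = zip(*(rng.choice(pairs) for _ in range(n_obs)))
        rhos.append(spearman(xs, ys))
    return rhos


# Made-up values, for illustration only.
body_mass = [12.0, 250.0, 3.5, 48.0, 900.0, 75.0, 5.0, 30.0]
vf_count = [4, 9, 2, 7, 6, 8, 3, 5]
rhos = bootstrap_spearman(body_mass, vf_count)
print("datasets:", len(rhos), "mean rho:", sum(rhos) / len(rhos))
```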

References

1. Moradigaravand, D. et al. Evolution of the Staphylococcus argenteus ST2250 Clone in Northeastern Thailand Is Linked with the Acquisition of Livestock-Associated Staphylococcal Genes. MBio (2017). doi:10.1128/mbio.00802-17
2. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. (2014). doi:10.1186/gb-2014-15-3-r46
3. Bushnell, B. BBMap download | SourceForge.net. Available at: https://sourceforge.net/projects/bbmap/. (Accessed: 27th June 2019)
4. Zerbino, D. R. & Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. (2008). doi:10.1101/gr.074492.107
5. Seemann, T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics (2014). doi:10.1093/bioinformatics/btu153
6. Page, A. J. et al. Roary: Rapid large-scale prokaryote pan genome analysis. Bioinformatics (2015). doi:10.1093/bioinformatics/btv421
7. Page, A. J. et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb. Genomics (2016). doi:10.1099/mgen.0.000056
8. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLoS One (2010). doi:10.1371/journal.pone.0009490
9. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. (2019). doi:10.1093/nar/gkz239
10. Paradis, E. & Schliep, K. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics (2019). doi:10.1093/bioinformatics/bty633
11. Jones, K. E. et al. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology (2009). doi:10.1890/08-1494.1

Data

Virulence Finder data

ResFinder data

Scripts

Bash scripts: functionality may depend on the local system.

R scripts

Rob-murphys/MSc-Bioinformatics-thesis
