
00. Background & Considerations

Rauf Salamzade edited this page Jul 29, 2022 · 32 revisions

Large Scale Genus or Species Evolutionary Investigations of BGCs

lsaBGC's primary aim is to investigate the general BGC-ome of a single genus or species. In lsaBGC-Easy.py this is achieved by: (i) selecting distinct representative genomes on which to perform direct BGC predictions using antiSMASH, GECCO, or DeepBGC, (ii) grouping complete instances of BGCs into GCFs, and (iii) comprehensively searching for those GCFs in all genomes using lsaBGC-AutoExpansion.py (including re-searching the representative/primary genomes). Distinct representatives are selected using Treemmer on a genomic phylogeny constructed with GToTree from 74 core genes universal to bacteria. After BGCs are identified and clustered into GCFs, and additional instances are found in non-representative genomes, lsaBGC-Easy.py constructs another phylogeny using GToTree (preferably with a larger, taxonomically specific core gene set offered by GToTree, e.g. 138 core genes specific to Actinobacteria) and uses the core alignment built for that phylogeny to dereplicate genomes prior to starting lsaBGC-AutoAnalyze.py. While lsaBGC-AutoAnalyze.py is scalable, dereplication boosts computational efficiency and, more importantly, attempts to make the calculated evolutionary statistics less biased by uneven genome representation in databases, where certain strains can be excessively sequenced.

The final results from lsaBGC-AutoAnalyze.py can be found in the Final_Results/ sub-directory and are described on the Wiki here, with example results provided on Google Drive here. One of the major results is a spreadsheet that lists homolog groups as rows, in consensus order for each GCF, with annotation and conservation/evolutionary statistics, which greatly simplifies exploring the entire BGC-ome of any taxon.
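To make the idea of dereplication from a core alignment concrete, here is a minimal sketch in Python. It is not the procedure lsaBGC-Easy.py actually runs; the function names, the toy alignment, and the identity threshold are all assumptions chosen for illustration. It greedily keeps a genome only if it is sufficiently different from every genome already kept.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical positions, skipping columns where either sequence has a gap."""
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def dereplicate(alignment, threshold=0.99):
    """Greedy dereplication (hypothetical helper, not an lsaBGC function).

    `alignment` maps genome name -> aligned core sequence (equal lengths).
    A genome is kept only if its identity to every kept genome is below `threshold`.
    """
    kept = []
    for name in sorted(alignment):
        if all(pairwise_identity(alignment[name], alignment[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy core alignment; real core alignments span many genes and thousands of columns.
toy_alignment = {
    "strain_A": "MKTAYIAKQR",
    "strain_B": "MKTAYIAKQR",  # identical to strain_A, so redundant
    "strain_C": "MKTAYLAKHR",  # differs from strain_A at 2 of 10 positions
    "strain_D": "MKT-YLAKHR",  # matches strain_C at every non-gap column
}

representatives = dereplicate(toy_alignment, threshold=0.95)
print(representatives)
```

With this toy input, strain_B and strain_D are collapsed into strain_A and strain_C respectively, which is the effect that keeps heavily sequenced strains from dominating downstream statistics.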

Note that GCFs are searched for comprehensively and sensitively across the full set of genomes for a taxon using lsaBGC-AutoExpansion.py, which leverages the genus/species context of the application, together with auxiliary gene information on key domains from identified complete BGC instances, to find fragmented GCF instances distributed across multiple scaffolds in other genomes (e.g. those of draft quality or MAGs). This means that, for each GCF, more comprehensive analyses can be performed using the analytical programs in lsaBGC-AutoAnalyze.py if desired. The expanded GCFs can also be used to build homolog-group specific catalogs of observed alleles in available assemblies and subsequently to mine short-read metagenomic data for novel BGC variants using lsaBGC-DiscoVary.py (see below).

Relation to Other Software for Comparative/Evolutionary Analysis of BGCs

The selection of representative genomes for initial BGC identification, together with the dereplication component of lsaBGC-Easy.py, means that it will likely miss rare BGCs and GCFs present in the taxon. Thus, if you are interested in whether a new isolate you have sequenced has novel BGC content, we recommend assessing this using BiG-SLiCE (which has an awesome web server interface at https://bigfam.bioinformatics.nl) or BiG-SCAPE, which avoids an upfront all-vs-all protein alignment by applying linear annotation of Pfam domains and thus makes it possible to comprehensively cluster GCFs in a larger set of primary/representative genomes. If you have run antiSMASH/GECCO/DeepBGC annotation on all genomes and would prefer to do one upfront clustering event to ensure you capture the full extent of BGCs, we actually recommend a combination of BiG-SCAPE with lsaBGC-AutoExpansion.py. Unlike lsaBGC-Cluster.py, which was designed for complete BGCs and employs a global syntenic order similarity metric, BiG-SCAPE uses a neighbor-based approach to determine whether two BGCs are similar in domain order when grouping them into GCFs. This will allow you to get accurate GCFs from draft BGC instances. You can then re-search the primary genomes for missing GCF instances using lsaBGC-AutoExpansion.py, which we showed in the lsaBGC manuscript leads to more sensitive identification of auxiliary components of BGCs when they are at scaffold/contig edges (see Supplemental Figure S3).
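The contrast between a global order metric and a neighbor-based one can be illustrated with a small Python sketch. Neither function below is the actual implementation in lsaBGC-Cluster.py or BiG-SCAPE; the names, the choice of Pearson correlation for the global metric, the Jaccard index over adjacent pairs for the neighbor-based one, and the toy gene orders are assumptions made purely for illustration. Items are assumed to be unique within each ordered list.

```python
def order_correlation(order_a, order_b):
    """Global metric (illustrative): Pearson correlation between the positions
    of shared items in two ordered lists. Returns 0.0 if undefined."""
    shared = [x for x in order_a if x in order_b]
    if len(shared) < 3:
        return 0.0
    xs = [order_a.index(x) for x in shared]
    ys = [order_b.index(x) for x in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def adjacency_index(order_a, order_b):
    """Neighbor-based metric (illustrative): Jaccard index of the sets of
    unordered adjacent pairs found in each list."""
    def adjacent_pairs(order):
        return {frozenset(pair) for pair in zip(order, order[1:])}

    pairs_a = adjacent_pairs(order_a)
    pairs_b = adjacent_pairs(order_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


# Toy example: the second instance has its two halves swapped, as could happen
# when a draft BGC instance is pieced together from separate contigs.
complete_bgc = ["A", "B", "C", "D", "E", "F"]
rearranged_bgc = ["D", "E", "F", "A", "B", "C"]

global_score = order_correlation(complete_bgc, rearranged_bgc)
neighbor_score = adjacency_index(complete_bgc, rearranged_bgc)
print(f"global order correlation: {global_score:.3f}")
print(f"adjacency index:          {neighbor_score:.3f}")
```

In this toy case the global correlation is negative even though four of the six distinct adjacencies are shared, which gives some intuition for why a neighbor-based measure can be more forgiving of fragmented or rearranged draft instances.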


The final results from lsaBGC-AutoAnalyze.py are GCF-centric, though we now highlight and provide input for CORASON analysis, which is gene-centric and allows looking at how the context of BGC genes can change across GCFs.

If finding antibiotic-producing BGCs is a goal, we also want to encourage folks to check out ARTS2.0 (https://arts.ziemertlab.com/index), which allows for identification of potential self-resistance enzymes associated with GCFs and offers many other useful evolutionary approaches for investigating GCFs!

Metagenomic Mining for Taxa-Specific GCFs and New Genic Alleles within Them

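lsaBGC-AutoExpansion.py allows for the comprehensive identification of BGCs across the genomes of a taxon. Building on the resulting homolog-group specific catalogs of observed alleles, lsaBGC-DiscoVary.py mines short-read metagenomic data at strain resolution to identify novel variants within the scope of BGCs, an approach very much inspired by methods to identify novel variants in COVID (e.g. Nextstrain).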
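As a rough illustration of what a catalog of observed alleles per homolog group makes possible, the Python sketch below flags positions in a candidate sequence that carry a base not seen at that position in any known allele. This is not how lsaBGC-DiscoVary.py is implemented (it works from short-read data); the function names, the homolog group identifiers, and the assumption that all sequences are pre-aligned and of equal length are hypothetical simplifications.

```python
from collections import defaultdict


def build_allele_catalog(records):
    """Group observed allele sequences by homolog group.

    `records` is an iterable of (homolog_group_id, aligned_sequence) tuples.
    """
    catalog = defaultdict(set)
    for homolog_group, sequence in records:
        catalog[homolog_group].add(sequence.upper())
    return catalog


def novel_sites(candidate, known_alleles):
    """Return (position, base) pairs where `candidate` carries a base that no
    known allele of the same length has at that position."""
    candidate = candidate.upper()
    comparable = [a for a in known_alleles if len(a) == len(candidate)]
    sites = []
    for position, base in enumerate(candidate):
        observed = {allele[position] for allele in comparable}
        if observed and base not in observed:
            sites.append((position, base))
    return sites


# Toy alleles taken from assemblies (hypothetical identifiers and sequences).
assembly_alleles = [
    ("HG_1", "ATGGCA"),
    ("HG_1", "ATGGCT"),
    ("HG_2", "ATGAAA"),
]
catalog = build_allele_catalog(assembly_alleles)

# A candidate sequence for HG_1, e.g. a consensus recovered from a metagenome.
metagenome_candidate = "ATGTCT"
novel = novel_sites(metagenome_candidate, catalog["HG_1"])
print(novel)
```

Here only position 3 is reported: the T at position 5 is already present in one catalogued allele, so it is not novel, whereas the T at position 3 has not been observed in any assembly.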


Alternative Software for BGC Exploration in Metagenomic Datasets to Consider

Many great bioinformatic tools exist for identifying BGCs and novelty in metagenomes. While lsaBGC-DiscoVary.py aims to identify microdiversity in BGCs within a taxonomic context and to more confidently identify novel variants lying in key domains of known BGCs, other tools such as MetaBGC and BiG-MEx aim to uncover new BGCs broadly and irrespective of taxa. Another tool, BiG-MAP, aims to quantify GCFs in metagenomes across taxa and is especially useful for differential analysis.

Application of lsaBGC to Natural Products Discovery

In addition to generating simplified reports that ease exploration of a taxon's BGC-ome, which should hopefully aid natural products discovery researchers, another interesting future direction is to use lsaBGC to study the evolution of antiSMASH-predicted BGCs and characterized BGCs in MIBiG. The evolutionary signatures of these BGCs could then be applied to narrow down predictions from more sensitive, yet more false-positive-prone, BGC prediction methods such as ClusterFinder or DeepBGC, which are also more likely to find truly novel BGCs.

"... biology is not random. Even though the space of hypothetical possibilities is huge, much of it doesn't exist. It is empty." - Dr. Aviv Regev

In the same spirit as Dr. Regev's quote, we developed lsaBGC to explore the evolutionary trends and microdiversity of BGCs in a lineage/taxonomy of interest in an automated and reproducible fashion.