
00. Background & Considerations

Rauf Salamzade edited this page Jun 21, 2023 · 32 revisions

Simplifying Genus-Wide and Species-Wide Evolutionary Investigations of Biosynthetic Gene Clusters

Biosynthetic gene clusters (BGCs) are discrete sets of genes encoding enzymes that together synthesize a specific "specialized" or "secondary" metabolite. Such metabolites are often not universally essential but can be "conditionally" essential in certain settings. BGC "regions" can be predicted by multiple tools, the most popular of which is antiSMASH. These regions can contain co-located or hybrid "protocluster" components, which may be independent or may encode enzymes that function together to synthesize common metabolite(s). Multi-type BGC regions can also exist. This concise documentation page from antiSMASH lists the types of BGCs that exist and that the software is able to identify using largely rule-based approaches.

lsaBGC's primary aim is to ease in-depth investigations of the pan-BGC-ome of a single genus, species, or lineage.

In the lsaBGC-Easy.py workflow, this is achieved by: (i) selecting distinct representative genomes on which to run direct BGC region predictions using antiSMASH, GECCO, or DeepBGC, (ii) grouping complete instances of BGCs into gene cluster families (GCFs), and (iii) performing in-depth investigations of GCFs, profiling the annotation and evolutionary trends of individual genes (or ortholog groups) found within them.

Distinct representative genomes are selected within lsaBGC-Easy based on an amino-acid sequence similarity threshold along a single-copy core constructed using GToTree (which is also used to create a species phylogeny; the default is 74 genes common to all Bacteria). Dereplication not only speeds up computation but also allows evolutionary statistics computed for ortholog groups within GCFs to be broadly representative of trends across the taxon and less affected by the over-representation of heavily sequenced strains in genome databases.
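To make the idea of threshold-based dereplication concrete, here is a minimal sketch of a generic greedy approach. It is not lsaBGC-Easy's actual implementation: the function name `greedy_dereplicate`, the pairwise `identity` dictionary, and the threshold value are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List


def greedy_dereplicate(
    genomes: List[str],
    identity: Dict[FrozenSet[str], float],
    threshold: float,
) -> List[str]:
    """Keep a genome only if its identity to every representative
    selected so far is below `threshold` (illustrative sketch only)."""
    representatives: List[str] = []
    for genome in genomes:
        if all(identity[frozenset((genome, rep))] < threshold
               for rep in representatives):
            representatives.append(genome)
    return representatives


# Hypothetical pairwise amino-acid identities along a core-gene alignment.
identity = {
    frozenset(("A", "B")): 0.999,  # near-identical strains
    frozenset(("A", "C")): 0.950,
    frozenset(("B", "C")): 0.951,
}
reps = greedy_dereplicate(["A", "B", "C"], identity, threshold=0.99)
print(reps)  # ['A', 'C']
```

In this toy example, genome B is dropped because it is nearly identical to A, so the over-sequenced A/B lineage contributes only one representative to downstream statistics.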

Important Note: After dereplication, you should have at most 200 genomes, and ideally 100 or fewer. This is because OrthoFinder is run on the dereplicated set of genomes to determine homolog/ortholog groups, which is an O(n^2) operation and can take a long time.
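As a rough illustration of why genome count matters, the sketch below counts ordered genome-versus-genome comparisons for an all-versus-all search. This is only a proxy for the O(n^2) scaling mentioned above, not a runtime estimate; the function name `all_vs_all_comparisons` is hypothetical.

```python
def all_vs_all_comparisons(n_genomes: int) -> int:
    """Number of ordered genome pairs (self comparisons included)
    in an all-versus-all search; a proxy for O(n^2) scaling."""
    return n_genomes * n_genomes


for n in (50, 100, 200):
    print(n, all_vs_all_comparisons(n))

# Doubling the number of genomes quadruples the number of comparisons.
ratio = all_vs_all_comparisons(200) / all_vs_all_comparisons(100)
print(ratio)  # 4.0
```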

By default, lsaBGC-Easy uses only complete BGC region instances (those located away from scaffold edges) because lsaBGC-Cluster was designed for such cases. An alternate strategy to consider is running BiG-SCAPE (now integrated into and requestable through lsaBGC-Easy) instead of lsaBGC-Cluster, with incomplete BGC predictions included to increase sensitivity. BiG-SCAPE's algorithms are better able to handle fragmented BGCs, but this could still lead to unintuitive results downstream when interpreting evolutionary trends within GCFs. For instance, if an instance of a GCF is likely incomplete/fragmented (near a scaffold edge), are the missing genes truly absent, or is this simply an assembly issue that causes variability in gene content to be overestimated?
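The notion of a "complete" region can be sketched as a simple coordinate check. This is an assumption-laden illustration, not how lsaBGC-Easy decides completeness: the function `is_away_from_edges` and the `edge_margin` value of 5000 bp are hypothetical.

```python
def is_away_from_edges(
    region_start: int,
    region_end: int,
    scaffold_length: int,
    edge_margin: int = 5000,  # hypothetical margin in base pairs
) -> bool:
    """Return True if a BGC region lies at least `edge_margin` bp
    from both ends of its scaffold (illustrative sketch only)."""
    return (region_start >= edge_margin
            and region_end <= scaffold_length - edge_margin)


# A region in the middle of a 100 kb scaffold versus one touching its end.
print(is_away_from_edges(20_000, 60_000, 100_000))  # True
print(is_away_from_edges(70_000, 99_500, 100_000))  # False
```

A region flagged `False` here is the kind of instance for which missing genes may reflect an assembly break rather than true absence.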

Important Note: GCF clustering parameters in BiG-SCAPE are optimized for grouping together BGCs which encode similar metabolites. In contrast, we recommend selecting clustering parameters in lsaBGC-Cluster.py for each focal taxonomy by manually assessing parameter impact on the number of singleton BGCs, the number of distinct GCFs, and the number of GCFs with multiple copies per genome. If running lsaBGC manually (not currently available through lsaBGC-Easy), a report showing the effects of clustering parameters can be requested as described on the lsaBGC-Cluster wiki page.
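The three quantities named above can be tallied from any table of BGC-to-GCF assignments. The sketch below assumes a hypothetical input of `(bgc_id, genome_id, gcf_id)` tuples and treats "distinct GCFs" as families with at least two members; both choices are assumptions for illustration and do not reflect the format of the lsaBGC-Cluster report.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_clustering(
    assignments: List[Tuple[str, str, str]],
) -> Dict[str, int]:
    """Summarize one clustering run from (bgc_id, genome_id, gcf_id) rows."""
    gcf_sizes = Counter(gcf for _, _, gcf in assignments)
    copies_per_genome = Counter((gcf, genome) for _, genome, gcf in assignments)
    multi_copy = {gcf for (gcf, _), n in copies_per_genome.items() if n > 1}
    return {
        # BGCs that ended up alone in their family
        "singleton_bgcs": sum(1 for size in gcf_sizes.values() if size == 1),
        # families with two or more members (assumed definition)
        "distinct_gcfs": sum(1 for size in gcf_sizes.values() if size > 1),
        # families present more than once in at least one genome
        "multi_copy_gcfs": len(multi_copy),
    }


assignments = [
    ("bgc1", "genome1", "GCF_1"),
    ("bgc2", "genome2", "GCF_1"),
    ("bgc3", "genome1", "GCF_2"),
    ("bgc4", "genome1", "GCF_2"),  # second GCF_2 copy in genome1
    ("bgc5", "genome3", "GCF_3"),  # singleton
]
summary = summarize_clustering(assignments)
print(summary)
```

Running this for each candidate parameter set and comparing the three counts side by side is one way to carry out the manual assessment described above.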

If you want to see how easy it can be to run lsaBGC-Easy, please check out the wiki page describing the lsaBGC-Easy and lsaBGC-Euk-Easy workflows.

If you want to know what you will get at the end of lsaBGC-Easy, check out the wiki page on lsaBGC-AutoAnalyze results.

The lsaBGC-Easy workflow

[Figure: overview of the lsaBGC-Easy workflow]

Relation to Other Software for Comparative/Evolutionary Analysis of BGCs

Because lsaBGC-Easy.py selects representative genomes for initial BGC identification and dereplicates the input set, it will likely miss rare BGCs and GCFs present in the taxon (dereplication can be turned off, however). Thus, if you are primarily interested in whether a new isolate you have sequenced has novel BGC content, we recommend assessing this using BiG-SLICE/BiG-FAM, by checking whether it matches already cataloged BGCs, or checking against MIBiG to see if it resembles a characterized/reference BGC.

lsaBGC-Easy.py has also incorporated BiG-SCAPE, which users can run for BGC clustering instead of the default lsaBGC-Cluster.py. BiG-SCAPE features a neat approach to measure syntenic similarity between BGCs (in other words, how similar domain order is between BGCs) using a localized adjacency index and a glocal mode to identify partial similarities between BGCs. This should allow accurate GCFs to be obtained from fragmented BGC instances on scaffold edges and can be paired with lsaBGC-AutoExpansion.py to identify additional parts of such GCF instances on other scaffolds. lsaBGC-AutoExpansion.py is currently turned off by default in lsaBGC-Easy.py because it could lead to false positives in BGC-rich taxa; however, turning it on is recommended when working with taxa with low or moderate counts of BGCs per genome. In the lsaBGC manuscript, we showed the potential for lsaBGC-AutoExpansion.py to recover a more comprehensive overview of GCFs by capturing more BGC segments when dealing with draft-quality assemblies (and potentially MAGs). However, we strongly advise inspecting the lsaBGC-See.py and lsaBGC-ComprehenSeeIve.py visualizations to ensure that lsaBGC-AutoExpansion is not detecting likely false positives.

CORASON is another great tool to check out. The final results from lsaBGC-AutoAnalyze.py are mainly GCF-centric, though we now highlight and provide input for CORASON analysis, which allows looking at how the context of specific BGC genes can change across GCFs (a gene-centric approach).

Finally, if finding antibiotic-producing BGCs is a goal, we also want to encourage folks to check out ARTS2.0 (https://arts.ziemertlab.com/index), which allows for identification of potential self-resistance enzymes associated with GCFs, along with many other useful evolutionary approaches for investigating GCFs!

Metagenomic Mining for Taxa-Specific GCFs and New Genic Alleles within Them

lsaBGC-AutoExpansion.py allows for the comprehensive identification of BGCs in available genomic assemblies for a taxonomy (turned off by default in lsaBGC-Easy.py). This allows us to use lsaBGC-DiscoVary.py, our metagenomics exploration tool in lsaBGC, to mine metagenomic short-read sets for novel single nucleotide variants (SNVs) within BGC genes which have not previously been observed in assemblies. This approach mimics and was inspired by methods to detect novel variants of the COVID virus (e.g. Nextstrain). Similar to how novel variants in viruses can change protein structure, certain mutational sites in BGCs, such as those within key domains of major enzymes, can potentially result in major differences in metabolite chemical structure. In our paper, we showed that lsaBGC-DiscoVary.py's ability to profile strain-specific alleles was concordant with another tool's strain presence predictions, which used whole-genome information (Figure 4). In addition to producing a report on potentially novel SNVs of BGC genes, lsaBGC-DiscoVary.py also clusters GCF instances from different metagenomes by their similarity to each other and utilizes DESMAN in a novel way to phase BGC gene alleles from a metagenome using a codon-alignment-based approach. Similar to BiG-MEx, it can produce phylogenies showing how novel alleles in metagenomes compare to alleles of genes already observed in assemblies, only at higher resolution (constrained to GCF and taxon). Check out more information on DiscoVary on its Wiki page.
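The core notion of a "novel" SNV, a base seen in metagenomic reads at a position where no assembly carries it, can be sketched with plain set operations. This is only a conceptual illustration: lsaBGC-DiscoVary.py's real pipeline involves read alignment and filtering steps not shown here, and the function `novel_variants` and its inputs are hypothetical.

```python
from typing import Dict, Set


def novel_variants(
    known: Dict[int, Set[str]],
    observed: Dict[int, Set[str]],
) -> Dict[int, Set[str]]:
    """Return, per alignment position, bases observed in reads that are
    absent from the alleles already seen in assemblies (sketch only)."""
    novel: Dict[int, Set[str]] = {}
    for position, bases in observed.items():
        unseen = bases - known.get(position, set())
        if unseen:
            novel[position] = unseen
    return novel


# Hypothetical alleles of one BGC gene, keyed by alignment position.
known_alleles = {10: {"A"}, 25: {"C", "T"}}
read_alleles = {10: {"A", "G"}, 25: {"C"}}
candidates = novel_variants(known_alleles, read_alleles)
print(candidates)  # {10: {'G'}}
```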

[Figure: Identification of novel SNVs in BGC genes from metagenomes using lsaBGC-DiscoVary]

Alternate Software for BGC Exploration in Metagenomic Datasets to Consider:

Many great bioinformatic tools exist for identifying BGCs and novelty in metagenomes. While lsaBGC-DiscoVary.py aims to identify microdiversity in BGCs within a taxonomic context and more confidently identify novel variants lying in key domains of known BGCs, other tools such as MetaBGC and BiG-MEx aim to uncover new BGCs broadly and irrespective of taxa. Another tool, BiG-MAP, aims to quantify GCFs in metagenomes across taxa and is especially useful for differential analysis between environments or conditions.

Application of lsaBGC to Natural Products Discovery

In addition to generating simplified reports to ease exploration of a taxon's pan-BGC-ome, which should generally aid researchers in natural products discovery, another interesting future direction is to use the suite to study the evolution of antiSMASH-predicted BGCs or characterized BGCs in MIBiG. By understanding evolutionary trends of individual genes within GCFs, we can develop new criteria to narrow down predictions from more sensitive but more false-positive-prone BGC prediction methods such as ClusterFinder or DeepBGC.
