Biosynthetic Gene clusters - Super Linear Clustering Engine
- Clustering now uses cosine-like (via l2-normalization) distances (as in https://www.nature.com/articles/s41564-022-01110-2)
- pHMM databases have been updated to PFAM 35.0
- BGC class definition has been updated to antiSMASH v7.0.0
- Switched from HMMER to pyHMMER (faster, and can now be fully installed via pip)
- General speed improvement
- Ability to export the pre-calculated BGC and GCF tables to TSV files (use the --export-csv parameter)
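To illustrate why l2-normalization gives "cosine-like" distances, here is a minimal sketch in plain Python. It is not taken from the BiG-SLiCE code base: the function names and the three feature vectors are made up for illustration. It shows that, once vectors are scaled to unit length, the squared Euclidean distance equals 2 × (1 − cosine similarity), so vector magnitude no longer influences the distance.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors (made-up values, not real BiG-SLiCE features).
bgc_a = [3.0, 0.0, 4.0]
bgc_b = [6.0, 0.0, 8.0]  # same direction as bgc_a, twice the magnitude
bgc_c = [0.0, 5.0, 0.0]  # orthogonal to bgc_a

# After normalization, vectors pointing the same way become identical.
same_direction = euclidean_distance(l2_normalize(bgc_a), l2_normalize(bgc_b))

# Squared Euclidean distance on unit vectors equals 2 * (1 - cosine similarity).
squared = euclidean_distance(l2_normalize(bgc_a), l2_normalize(bgc_c)) ** 2
via_cosine = 2 * (1 - cosine_similarity(bgc_a, bgc_c))

print(same_direction, squared, via_cosine)
```

Because the two quantities are monotonically related, ranking by Euclidean distance on normalized vectors gives the same order as ranking by cosine distance.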
- Make sure you have HMMER (version 3.2b1 or later) installed.
- Install BiG-SLiCE using pip:
- from PyPI (stable)
user@local:~$ pip install bigslice
- from source (bleeding edge -- only do this when you know what you are doing!)
user@local:~$ pip install git+https://github.com/medema-group/bigslice.git
- Fetch the latest HMM models (approximately 271MB gzipped):
user@local:~$ download_bigslice_hmmdb
- Check your installation:
user@local:~$ bigslice --version .
==============
BiG-SLiCE version 2.0.0
HMM databases version: bigslice-models-2022-11-30
Biosynthetic-pfam md5: 37495cac452bf1dd8aff2c4ad92065fe
Sub-pfam md5: 2e6b41d06f3c318c61dffb022798091e
==============
- Run a BiG-SLiCE clustering analysis (see wiki: Input folder on how to prepare the input folder):
user@local:~$ bigslice -i <input_folder> <output_folder>
For a "minimal" test run, you can use the example input folder that we provided.
Querying antiSMASH BGCs
Using the --query mode, you can perform a fast query of a putative BGC against the pre-processed set of Gene Cluster Family (GCF) models that BiG-SLiCE outputs (for example, our pre-processed result on ~1.2M microbial BGCs from the NCBI database, a 17GB zipped file download). Note that there is currently no pre-processed result for BiG-SLiCE v2; we will work to make it available soon. You will get a ranked list of GCFs and BGCs similar to the BGC in question, which will help in determining the function and/or novelty of said BGC. To perform a GCF query, simply use:
user@local:~$ bigslice --query <antismash_output_folder> --n_ranks <int> <output_folder>
This performs a query analysis on the latest clustering result contained inside the output folder (see wiki: Program parameters for more advanced options). The top-(n_ranks) matching GCFs will be returned along with their similarity measurements. You can then view the query results using the interactive output (see below).
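The idea of a top-(n_ranks) query can be sketched as a nearest-neighbour ranking. This is only an illustration under stated assumptions, not BiG-SLiCE's implementation: `rank_gcfs`, the representation of each GCF as a single centroid vector, and all values below are hypothetical, and the real similarity measurements may be computed differently.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def rank_gcfs(query_features, gcf_centroids, n_ranks):
    """Return the n_ranks closest (gcf_id, distance) pairs, nearest first.

    gcf_centroids is a hypothetical mapping {gcf_id: feature_vector}.
    """
    query = l2_normalize(query_features)
    scored = [
        (math.dist(query, l2_normalize(centroid)), gcf_id)
        for gcf_id, centroid in gcf_centroids.items()
    ]
    # nsmallest avoids sorting the full list when only a few ranks are needed.
    return [(gcf_id, dist) for dist, gcf_id in heapq.nsmallest(n_ranks, scored)]


# Made-up centroids and query vector, for illustration only.
centroids = {
    "GCF_1": [1.0, 0.0, 0.0],
    "GCF_2": [0.0, 1.0, 0.0],
    "GCF_3": [1.0, 1.0, 0.0],
}
top_hits = rank_gcfs([2.0, 0.1, 0.0], centroids, n_ranks=2)

for gcf_id, dist in top_hits:
    print(gcf_id, round(dist, 3))
```

A smaller distance means a closer match, so a query that sits far from every returned GCF may point to a novel BGC.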
To perform GCF analyses on BGCs not covered by antiSMASH/MIBiG (e.g., from tools like ClusterFinder and DeepBGC, or BGCs with manually-refined cluster borders), you can use the converter script that we provided, which takes a (genome) GenBank file along with a comma-separated descriptor file for every BGC to be generated (please see the example input files provided in the script's folder).
BiG-SLiCE's output folder contains both the processed input data (in the form of an SQLite3 database file) and some scripts that power a mini web-app to visualize that data. To run this visualization engine, follow these steps:
- Fulfill the web-app's package requirements:
user@local:~$ pip install -r <output_folder>/requirements.txt
- Run the flask server:
user@local:~$ bash <output_folder>/start_server.sh <port(optional)>
- Open a web browser, then go to the URL printed by the previous step. For example, if the server prints
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
then go to http://0.0.0.0:5000 in your browser.
To access BiG-SLiCE's preprocessed data, (advanced) users need to be able to run SQL(ite) queries. Although the learning curve might be steeper than with conventional tabular output files, once you are familiar with it, the SQL database provides an easy-to-use yet very powerful way to wrangle the data. Please refer to our publication manuscript to get an idea of what can be done with the output data. Additionally, you can download and reuse some Jupyter notebook scripts that we wrote to perform all analyses and generate figures for the manuscript.
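As a starting point for exploring an unfamiliar SQLite file, the sketch below lists every table together with its row count using Python's built-in `sqlite3` module. It makes no assumption about BiG-SLiCE's actual schema: `summarize_tables` is a hypothetical helper, and the demo runs on a throwaway in-memory database with a made-up table so that it is self-contained.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    summary = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        summary[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return summary


# Demo on an in-memory database with a made-up table (not BiG-SLiCE's schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY, label TEXT)")
demo.executemany("INSERT INTO example_table (label) VALUES (?)", [("a",), ("b",)])
summary = summarize_tables(demo)
demo.close()

print(summary)
```

To inspect real output, open the SQLite3 database file from your output folder with `sqlite3.connect(<path_to_database_file>)` and pass the connection to the same function; the table names it reports can then be used in your own queries.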
Bacteria and fungi produce a vast array of bioactive compounds in nature, which can be useful to us as antibiotics (see this list), antivirals (see this list) and anticancer drugs (see Salinisporamide). To optimize and retain the production of those complex chemical agents, microbes organize the responsible genes into genomic 'clumps' colloquially termed "Biosynthetic Gene Clusters (BGCs)" (above picture, left panel). Using bioinformatics tools such as antiSMASH, we can now take a genome sequence, identify its BGCs and predict the secondary metabolites that the organism may produce (see this example analysis for the S. coelicolor genome). Furthermore, by doing a large-scale comparative analysis of homologous BGCs sharing similar domain architectures (we call them "Gene Cluster Families (GCFs)"), we can practically chart an atlas of biosynthetic diversity among all sequenced microbes (above picture, right panel).
To enable such a large-scale analysis, BiG-SLiCE was specifically designed with scalability and speed as the top priority (Figure 1A), as opposed to our previous tool, BiG-SCAPE, which can sensitively capture the slightest difference in both domain architecture and sequence similarity between pairs of BGCs (see our paper for the details). As a result, BiG-SLiCE can reliably take input data of more than 1.2 million BGCs and process it in less than a week of runtime on a 36-core machine with 128GB RAM (Figure 1B) while keeping enough sensitivity to delineate the essential biosynthetic 'signals' among the input BGCs (Figure 1C). Moreover, to facilitate exploration and investigation of the analysis results, BiG-SLiCE also produces an interactive, easy-to-use output visualization that can be run with minimal software / hardware requirements.
This software was initially developed and is currently maintained by Satria Kautsar (twitter: @satriaphd) as part of a fully funded PhD project granted to Dr. Marnix Medema (website: marnixmedema.nl, twitter: @marnixmedema) by the Graduate School of Experimental Plant Sciences, NL. Contributions and feedback are very welcome. Feel free to drop us an e-mail if you have any questions regarding or related to BiG-SLiCE. In the future, we aim to make BiG-SLiCE a comprehensive platform for all sorts of downstream large-scale BGC analysis, taking advantage of its portable and powerful SQLite3-based data storage combined with the flexible Flask-based web app architecture as the foundation.
Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154. https://doi.org/10.1093/gigascience/giaa154