Skip to content

Latest commit

 

History

History
166 lines (110 loc) · 7.45 KB

README.md

File metadata and controls

166 lines (110 loc) · 7.45 KB

Python Status License: MIT

BioDeepFuse

BioDeepFuse: Empowering Researchers in Life Sciences with Deep Learning

HomeInstallingHow To UseCitation

Abstract

The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNA, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into ncRNAs in cellular processes and disease manifestations. In addition to its original application in the context of bacterial organisms, the methodologies and techniques integrated into our framework can potentially render BioDeepFuse effective in various and broader domains.

Authors

  • Anderson P. Avila Santos $^\dagger$, Breno L. S. de Almeida $^\dagger$, Robson P. Bonidia, Peter F. Stadler, Polonca Stefanic, Ines Mandic-Mulec, Ulisses Rocha, Danilo S. Sanches, and André C.P.L.F. de Carvalho

  • Correspondence: Ulisses Nunes da Rocha. Email: ulisses.rocha@ufz.de

$\dagger$ These authors have contributed equally.

Publication

Abstract accepted and presented at the 16th symposium on Genetics and Bacterial Ecology 2023 (BAGECO 2023), with the title: Feature Extraction Techniques based on Deep Learning Framework for Enhanced Classification of Non-coding RNA.

Full paper: https://doi.org/10.1080/15476286.2024.2329451

Installing dependencies and package

Conda - Terminal

Installing the application using miniconda, e.g.:

$ git clone https://github.com/brenoslivio/BioDeepFuse.git BioDeepFuse

$ cd BioDeepFuse

$ git submodule init

$ git submodule update

1 - Install Miniconda:

See documentation: https://docs.conda.io/en/latest/miniconda.html

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ chmod +x Miniconda3-latest-Linux-x86_64.sh

$ ./Miniconda3-latest-Linux-x86_64.sh

$ export PATH=~/miniconda3/bin:$PATH

2 - Create environment:

conda env create -f environment.yml -n biodeepfuse

3 - Activate environment:

conda activate biodeepfuse

4 - You can deactivate the environment, using:

conda deactivate

How to use

When running the application, it is possible to use diverse arguments. They are shown using the --help argument. The options available are:

  -train TRAIN, --train TRAIN
                        Folder with FASTA training files

  -test TEST, --test TEST
                        Folder with FASTA testing files

  -epochs EPOCHS, --epochs EPOCHS
                        Number of epochs to train

  -patience PATIENCE, --patience PATIENCE
                        Epochs to stop training after loss plateau

  -encoding ENCODING, --encoding ENCODING
                        Encoding - 0: One-hot encoding, 1: K-mer embedding, 2: No encoding (only feature extraction), 3: All encodings (without feature extraction)

  -k K, --k K           Length of k-mers

  -concat CONCAT, --concat CONCAT
                        Concatenation type - 1: Directly, 2: Using dense layer before concatenation

  -feat_extraction FEAT_EXTRACTION [FEAT_EXTRACTION ...], --feat_extraction FEAT_EXTRACTION [FEAT_EXTRACTION ...]
                        Features to be extracted, e.g., 1 2 3 4 5 6. 1 = NAC, 2 = DNC, 3 = TNC, 4 = kGap, 5 = ORF, 6 = Fickett Score

  -features_exist FEATURES_EXIST, --features_exist FEATURES_EXIST
                        Features extracted previously - 0: False, 1: True; Default: False

  -algorithm ALGORITHM, --algorithm ALGORITHM
                        Algorithm - 0: Support Vector Machines (SVM), 1: Extreme Gradient Boosting (XGBoost), 2: Deep Learning

  -num_convs NUM_CONVS, --num_convs NUM_CONVS
                        Number of convolutional layers

  -activation ACTIVATION, --activation ACTIVATION
                        Activation to use - 0: ReLU, 1: Leaky ReLU; Default: ReLU

  -batch_norm BATCH_NORM, --batch_norm BATCH_NORM
                        Use Batch Normalization for Convolutional Layers - 0: False, 1: True; Default: False

  -cnn_dropout CNN_DROPOUT, --cnn_dropout CNN_DROPOUT
                        Dropout rate between Convolutional layers - 0 to 1

  -num_lstm NUM_LSTM, --num_lstm NUM_LSTM
                        Number of LSTM layers

  -bidirectional BIDIRECTIONAL, --bidirectional BIDIRECTIONAL
                        Use Bidirectional LSTM - 0: False, 1: True; Default: False

  -lstm_dropout LSTM_DROPOUT, --lstm_dropout LSTM_DROPOUT
                        Dropout rate between LSTM layers - 0 to 1

  -output OUTPUT, --output OUTPUT
                        Output folder for classification reports.

Alternatively, you can run the experiments used for the paper in run_experiments.sh.

Citation

If you use this application in a scientific publication, we would appreciate citations to the following research paper:

Anderson Avila Santos, Breno L S de Almeida, Robson P. Bonidia, Peter F. Stadler, Polonca Stefanic, Ines Mandic-Mulec, Ulisses Rocha, Danilo S. Sanches, and André CPLF de Carvalho. "BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification." RNA biology 21, no. 1 (2024): 1-12.

@article{doi:10.1080/15476286.2024.2329451,
author = {Anderson P. Avila Santos, Breno L. S. de Almeida, Robson P. Bonidia, Peter F. Stadler, Polonca Stefanic, Ines Mandic-Mulec, Ulisses Rocha, Danilo S. Sanches and André C.P.L.F. de Carvalho},
title = {BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification},
journal = {RNA Biology},
volume = {21},
number = {1},
pages = {1-12},
year = {2024},
publisher = {Taylor & Francis},
doi = {10.1080/15476286.2024.2329451},
note ={PMID: 38528797},
URL = {https://doi.org/10.1080/15476286.2024.2329451},
eprint ={https://doi.org/10.1080/15476286.2024.2329451}
}