Promotech-CNN is a bacterial promoter prediction tool built using dilated convolutional neural networks.
The core of the code is a class module that reads FASTA files to retrieve DNA sequences. Once read, the sequences are converted to a one-hot encoding and passed to the CNN for building and training. Our CNN architecture was guided by the BPNet architecture; however, this solution does not use as many layers and has several tweaks to the architecture.
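The one-hot conversion described above can be illustrated with a minimal sketch. The function name `one_hot_encode` and the A, C, G, T channel order are assumptions made for illustration; the order actually used by Promotech-CNN is not stated here.

```python
import numpy as np

# Assumed channel order (hypothetical; not confirmed by this README).
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a (len(sequence), 4) float32 array with a single 1.0 per row."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base not in index:
            raise ValueError(f"unexpected character {base!r} at position {position}")
        encoded[position, index[base]] = 1.0
    return encoded


# Each row marks which of the four bases occurs at that position.
example = one_hot_encode("ACGT")
```

Under this encoding a 40nt sequence becomes a 40 x 4 array, which is the kind of input a 1D convolutional network typically expects.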
Requirements:
-- Operating System: Linux
-- Python: 3.8.2
-- tensorflow: 2.5.0
-- numpy: 1.2.0
-- pandas: 1.2.4
About the Files:
-- Promotech-CNN-20210720.h5 - H5 model file saved from the trained network. It is loaded during predictions.
-- - Base class module that is used to build and train a network if required. It handles the conversion of promoter sequences before they are fed into the neural network. The main prediction method is also defined here.
-- - Main file that loads the CNN model and obtains predictions for sequences in FASTA format.
Predictions are made by loading the H5 model file and then running the prediction on an input promoter sequence file.
-- You may run predictions on FASTA files or text files with only one 40nt sequence per line. Sequences must contain only A, C, G and T.
   python Promotech-CNN-20210720.h5 sequences.fasta
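To check an input file against these rules before running predictions, something like the following sketch could be used. The helper `read_sequences` is hypothetical and not part of Promotech-CNN; it also assumes that the 40nt length applies to FASTA records as well as to plain text lines.

```python
def read_sequences(lines, expected_length=40):
    """Collect sequences from FASTA lines or one-sequence-per-line text.

    Hypothetical helper: raises ValueError when a sequence has the wrong
    length or contains characters other than A, C, G and T.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    is_fasta = bool(cleaned) and cleaned[0].startswith(">")

    sequences = []
    if is_fasta:
        parts = []  # pieces of the record currently being read
        for line in cleaned:
            if line.startswith(">"):
                if parts:
                    sequences.append("".join(parts))
                parts = []
            else:
                parts.append(line.upper())
        if parts:
            sequences.append("".join(parts))
    else:
        sequences = [line.upper() for line in cleaned]

    for number, sequence in enumerate(sequences, start=1):
        if len(sequence) != expected_length:
            raise ValueError(f"sequence {number} has length {len(sequence)}")
        if set(sequence) - set("ACGT"):
            raise ValueError(f"sequence {number} contains characters other than A, C, G, T")
    return sequences
```

The function takes any iterable of lines, so an open file handle can be passed directly, for example `read_sequences(open("sequences.fasta"))`.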
The output is a comma-delimited text file with four columns: the sequence, the probability of not being a promoter, the probability of being a promoter, and the prediction (0 - non-promoter, 1 - promoter). The first row contains the column headers.
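The output file described above could be read back with the standard `csv` module, as in this sketch. The column header names in the sample are placeholders (the real header text is not given here), so the rows are read by position rather than by name; `read_predictions` is a hypothetical helper.

```python
import csv
import io


def read_predictions(handle):
    """Parse the four-column output into (sequence, p_non, p_promoter, label) tuples."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    rows = []
    for sequence, p_non, p_promoter, label in reader:
        # int(float(...)) tolerates labels written as "1" or "1.0".
        rows.append((sequence, float(p_non), float(p_promoter), int(float(label))))
    return rows


# Made-up sample data with placeholder header names, for illustration only.
sample = io.StringIO(
    "sequence,non_promoter,promoter,prediction\n"
    + "ACGT" * 10 + ",0.9,0.1,0\n"
    + "TTGA" * 10 + ",0.2,0.8,1\n"
)
predictions = read_predictions(sample)
promoters = [row[0] for row in predictions if row[3] == 1]
```

For a real output file, pass an open file handle instead of the `io.StringIO` sample.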
The model was built and trained in a Linux environment. To run on Windows, change lines 115 and 116 so that the '/' path separator becomes '\'.
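As an alternative to editing separators by hand, paths can be assembled from their parts so that the separator follows the platform. This is only a sketch of that general approach, not the code on lines 115 and 116; the helper `data_path` and the file name `train.fasta` are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def data_path(path_class, *parts):
    """Join path parts using the separator of the given path flavour."""
    return str(path_class(*parts))


# The two flavours are shown explicitly so the difference is visible on any OS.
linux_path = data_path(PurePosixPath, "TrainingData", "40nt-sequences", "train.fasta")
windows_path = data_path(PureWindowsPath, "TrainingData", "40nt-sequences", "train.fasta")
```

In everyday code, `pathlib.Path` selects the flavour of the current operating system automatically, so the same source can run unchanged on Linux and Windows.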
If you intend to rebuild the network from scratch, you may run the commands below - this is ideal for new training data.
-- Import the Python Class.
   from PromotechCNN import PromotechCNN
   code = PromotechCNN()
-- Setup the training data (Path to directory with training data is used).
   code.createTrainFastaFiles("/TrainingData/40nt-sequences/bacteria-1-10-ratio")
   code.buildData()
-- Build and Train the network with 100 epochs, early stopping is in place so the network might stop before 100 epochs.
   code.buildBPNet(100)
-- Predict the validation data. Used to build AUPRC and AUROC graphs
   code.predictValidation("/ValidationData/40nt-sequences/bacteria-1-1-ratio")
-- Run prediction on some promoter sequence.
   code.predictSequenceFile("BACILLUS_negative.fasta")