sjang92/plastNN

PlastNN

This is the public repository for PlastNN, a neural-network based predictor for Apicoplast proteins in Plasmodium falciparum.

How to Run

PlastNN was developed and tested with Python 3 on Ubuntu 16.04, Bash on Ubuntu on Windows, and macOS. We have also checked that it runs on Windows with Visual Studio 2017, but we do not provide tools for running the scripts in that environment.

To set up the Python environment and install dependencies, clone the repo and run:

    source setup.sh

Once this is done, you should find yourself in the virtual environment named 'venv'.

To train the model with default parameters (recommended) and produce output, run:

    ./train.sh

By default, this script uses 6-fold cross-validation to train 6 fully-connected models. Once training is over, these 6 models then vote on the unlabeled protein datapoints to assign labels.

Results are written to the newly created 'results' directory. 'perf.csv' contains the evaluation results obtained after each epoch, and 'vote.csv' contains the final votes cast by the 6 models trained using 6-fold cross-validation.

The past_results directory contains saved models from previous runs. The model with the best performance, and its output, are saved in the plastNN_final_saved_model directory.

If you want to run the script with different hyperparameters (learning rate, number of layers and neurons, etc.), check the TensorFlow app flags defined in 'src/trainer.py' and re-run accordingly.

Data

The training data includes the following data types for 205 positive-label (apicoplast) proteins and 451 negative-label (non-apicoplast) proteins:

  1. Protein sequence (positive.txt and negative.txt)
  2. Position of the first nucleotide after the end of the signal peptide, predicted by SignalP 3.0 (1) (pos_tp.txt and neg_tp.txt)
  3. Transcript levels corresponding to each protein at 8 time points, from Bártfai et al. (2) (pos_rna.txt and neg_rna.txt)

The unlabeled data contains similar files for 450 unlabeled proteins. Both sets of data can be found in the data directory.

Featurization

For each protein, PlastNN constructs a feature vector of length 28. The first 20 elements are the frequencies of the 20 canonical amino acids in the 50-amino-acid region immediately after the predicted signal peptide, and the next 8 elements are transcript levels at 8 time points. These vectors are used as input to the neural network.
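The featurization described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in this repository: the function name `featurize`, the amino-acid ordering, the 0-based position convention, and the choice to normalize counts by the window length are all assumptions.

```python
# Hypothetical sketch of the 28-element feature vector (not the repository's code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids; this ordering is an assumption
WINDOW = 50  # length of the region after the predicted signal peptide


def featurize(sequence, cleavage_pos, transcript_levels):
    """Return 20 amino-acid frequencies followed by 8 transcript levels.

    cleavage_pos is assumed to be the 0-based index of the first residue
    after the predicted signal peptide.
    """
    if len(transcript_levels) != 8:
        raise ValueError("expected transcript levels at 8 time points")
    region = sequence[cleavage_pos:cleavage_pos + WINDOW]
    if not region:
        raise ValueError("no residues found after the signal peptide")
    # Assumption: frequencies are counts divided by the actual window length,
    # which may be shorter than 50 for short proteins.
    freqs = [region.count(aa) / len(region) for aa in AMINO_ACIDS]
    return freqs + [float(level) for level in transcript_levels]
```

For example, `featurize(seq, 20, levels)` uses residues 20 through 69 of `seq` for the frequency part and appends the 8 values in `levels` unchanged.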

Model

PlastNN is a simple fully-connected neural network with 3 hidden layers of 64, 64, and 16 neurons, respectively.
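To make the layer sizes concrete, here is a dependency-free sketch of a forward pass through a network of this shape. It is not the TensorFlow model in 'src/trainer.py'. The ReLU hidden activations, the single sigmoid output unit, and the weight initialization are assumptions made for illustration only.

```python
import math
import random

# 28 inputs, hidden layers of 64, 64, and 16 units, then an assumed single output unit.
LAYER_SIZES = [28, 64, 64, 16, 1]


def init_params(sizes, seed=0):
    """Create small random weights and zero biases for each layer (illustrative init)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def forward(params, x):
    """Return the network output for one 28-element feature vector."""
    activations = x
    for layer_index, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if layer_index == len(params) - 1:
            # Assumed output layer: sigmoid, read as an apicoplast score in (0, 1).
            activations = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            # Assumed hidden activation: ReLU.
            activations = [max(0.0, v) for v in z]
    return activations[0]


def count_parameters(sizes):
    """Count weights plus biases across all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

Under these assumptions the network has 7,073 trainable parameters, which is small enough to train quickly on a dataset of a few hundred proteins.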

Training and Evaluation

We trained models using 6-fold cross-validation; that is, we trained 6 separate models with the same architecture, each using 5 of the 6 folds for training and the remaining fold as a cross-validation set to evaluate performance. Accuracy, coverage, and PPV were calculated on the cross-validation set. When predicting on the test set, the final predictions were generated by a majority vote of all 6 models.
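The procedure above can be sketched as follows. This is an illustration rather than the repository's implementation: the helper names are hypothetical, ties in the 6-model vote are assumed to count as negative, and "coverage" is assumed to mean the fraction of true positives that are recovered.

```python
import random


def k_fold_indices(n_samples, k=6, seed=0):
    """Return k (train, validation) index pairs; each fold is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        splits.append((train, folds[held_out]))
    return splits


def majority_vote(votes):
    """Combine 0/1 predictions from several models.

    Assumption: a strict majority is required for a positive call,
    so a 3-3 tie among 6 models yields 0.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0


def evaluate(y_true, y_pred):
    """Compute accuracy, coverage (assumed: TP / (TP + FN)), and PPV (TP / (TP + FP))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "coverage": tp / (tp + fn) if tp + fn else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }
```

With the 656 labeled proteins, `k_fold_indices(656)` yields validation folds of 109 or 110 proteins each, and `majority_vote` would be applied once per unlabeled protein to the 6 model predictions.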

All models were trained using the RMSProp optimization algorithm with a learning rate of 0.0001.
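For readers unfamiliar with RMSProp, the update rule can be sketched in plain Python. Only the learning rate of 0.0001 comes from this README. The decay rate and epsilon below are illustrative values, not settings taken from this repository, and the function is not the optimizer used in 'src/trainer.py'.

```python
import math


def rmsprop_step(params, grads, cache, learning_rate=1e-4, decay=0.9, epsilon=1e-8):
    """Apply one RMSProp update to a flat list of parameters.

    cache holds the running average of squared gradients, one value per parameter.
    decay and epsilon are assumed values chosen for illustration.
    """
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        # Exponential moving average of the squared gradient.
        c = decay * c + (1.0 - decay) * g * g
        new_cache.append(c)
        # Scale the step by the root of that average, so parameters with
        # consistently large gradients take proportionally smaller steps.
        new_params.append(p - learning_rate * g / (math.sqrt(c) + epsilon))
    return new_params, new_cache
```

The cache starts at zero for every parameter and is passed back in on each subsequent step.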

Results

The results are described in the following paper:

Boucher, M.J., Ghosh, S., Zhang, L., Lal, A., Jang, S.W., Ju, A., Zhang, S., Wang, X., Ralph, S.A., Zou, J. and Elias, J.E., 2018. Integrative proteomics and bioinformatic prediction enable a high-confidence apicoplast proteome in malaria parasites. PLoS biology, 16(9), p.e2005895. https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005895

References

  1. Nielsen, H., 2017. Predicting Secretory Proteins with SignalP. Protein Function Prediction: Methods and Protocols, pp.59-73.
  2. Bártfai R, Hoeijmakers WAM, Salcedo-Amaya AM, Smits AH, Janssen-Megens E, Kaan A, et al. (2010) H2A.Z Demarcates Intergenic Regions of the Plasmodium falciparum Epigenome That Are Dynamically Marked by H3K9ac and H3K4me3. PLoS Pathog 6(12): e1001223. https://doi.org/10.1371/journal.ppat.1001223
