Skip to content

Latest commit

 

History

History
224 lines (173 loc) · 9.42 KB

README.md

File metadata and controls

224 lines (173 loc) · 9.42 KB

DeepVEP

DeepVEP: mutation impact prediction on post-translational modifications using deep learning

Table of contents:

Installation

Download DeepVEP

$ git clone https://github.com/bzhanglab/DeepVEP

Installation

DeepVEP is a python3 package. TensorFlow (>=2.6) is supported. Its dependencies can be installed via

$ pip install -r requirements.txt

DeepVEP has been tested on both Linux and Windows systems. It supports training and prediction on both CPU and GPU.

Download model files

The pretrained model files are available at DeepVEP model repository. After the model files are downloaded, decompress the *.tar.gz file and move all the files to the models folder as shown below:

├── README.md
├── deepvep.py
├── lib
│   ├── DataIO.py
│   ├── Metrics.py
│   ├── ModelView.py
│   ├── MutationUtils.py
│   ├── PTModels.py
│   ├── PeptideEncode.py
│   ├── RegCallback.py
│   ├── Utils.py
│   └── __init__.py
├── models
└── requirements.txt

Usage

Run the following command line to show all command line options:

python deepmp.py predict -h
  -h or --help            show this help message and exit
  -i or --input           Input data for prediction
  -d or --db              Protein database
  -o or --out_dir         Output directory
  -w or --window_size     Window size for mutation impact prediction. In default, 7 amino acids in both side of a mutation.
  -m or --model           Trained model path
  -t or --task            Prediction type: 1=Mutation impact prediction, 2=PTM site prediction
  -e or --ensemble        Ensemble method, 1: average, 2: meta_lr, default is 1.
  -s or --explain_model   Perform model interpretability analysis
  -b or --bg_data         Data used as background data in model interpretability

Mutation impact prediction:

Below is an example for mutation impact prediction. The input for -i is a TSV format file which contains mutation information. The input for -d is a protein database file in FASTA format which contains the protein sequences for the wild type of proteins.

python deepvep.py predict -m models/ -i example/mutation_input.tsv -d example/Q5S007.fasta -t 1 -o mutation_output_folder

The required columns for input of -i include Protein, AA_Ref, AA_Pos and AA_Var. An example ("example/mutation_input.tsv") is shown below:

Protein  AA_Ref  AA_Pos  AA_Var
Q5S007   R       1441    C
Q5S007   R       1441    H

Below please find the description of each column in the output file "example/mutation_input.tsv":

Column name Description
Protein the protein name in the input fasta file
AA_Ref the wild type amino acid
AA_Pos the mutation position on the protein (1-based: the position of the first amino acid on the protein is 1)
AA_Var the mutation amino acid

For the above example input (-i), the wild type protein sequence of protein Q5S007 should be present in the input protein database for -d. The first 10 lines of the file "example/Q5S007.fasta" are shown below:

>Q5S007
MASGSCQGCEEDEETLKKLIVRLNNVQEGKQIETLVQILEDLLVFTYSERASKLFQGKNI
HVPLLIVLDSYMRVASVQQVGWSLLCKLIEVCPGTMQSLMGPQDVGNDWEVLGVHQLILK
MLTVHNASVNLSVIGLKTLDLLLTSGKITLLILDEESDIFMLIFDAMHSFPANDEVQKLG
CKALHVLFERVSEEQLTEFVENKDYMILLSALTNFKDEEEIVLHVLHCLHSLAIPCNNVE
VLMSGNVRCYNIVVEAMKAFPMSERIQEVSCCLLHRLTLGNFFNILVLNEVHEFVVKAVQ
QYPENAALQISALSCLALLTETIFLNQDLEEKNENQENDDEGEEDKLFWLEACYKALTWH
RKNKHVQEAACWALNNLLMYQNSLHEKIGDEDGHFPAHREVMLSMLMHSSSKEVFQASAN
ALSTLLEQNVNFRKILLSKGIHLNVLELMQKHIHSPEVAESGCKMLNHLFEGSNTSLDIM
AAVVPKILTVMKRHETSLPVQLEALRAILHFIVPGMPEESREDTEFHHKLNMVKKQCFKN

The output folder ("mutation_output_folder") of the example command line looks like below:

mutation_output_folder
├── deepvep-mutation_impact.tsv
├── acetylation_k
├── glycosylation_n
├── methylation_k
├── methylation_r
├── phosphorylation_st
├── phosphorylation_y
├── sumoylation_k
└── ubiquitination_k

The output file "mutation_output_folder/deepvep-mutation_impact.tsv" contains the predicted mutation impact on all the PTM sites supported by DeepVEP. This is the only file that users need to use for downstream analysis. The other folders contain intermediate prediction files.

Protein  AA_Ref  AA_Pos  AA_Var  pos   diff_pos  w_pep            m_pep            w_prob      m_prob      delta_prob   ptm
Q5S007   R       1441    C       1443  -2        FNIKARASSSPVILV  FNIKACASSSPVILV  0.7837325   0.2552052   -0.5285273   phosphorylation_st
Q5S007   R       1441    C       1444  -3        NIKARASSSPVILVG  NIKACASSSPVILVG  0.88949174  0.43282443  -0.45666731  phosphorylation_st
Q5S007   R       1441    C       1445  -4        IKARASSSPVILVGT  IKACASSSPVILVGT  0.8771168   0.49486965  -0.38224715  phosphorylation_st
Q5S007   R       1441    H       1443  -2        FNIKARASSSPVILV  FNIKAHASSSPVILV  0.7837325   0.26106784  -0.52266466  phosphorylation_st
Q5S007   R       1441    H       1444  -3        NIKARASSSPVILVG  NIKAHASSSPVILVG  0.88949174  0.44303632  -0.44645542  phosphorylation_st
Q5S007   R       1441    H       1445  -4        IKARASSSPVILVGT  IKAHASSSPVILVGT  0.8771168   0.48524436  -0.39187244  phosphorylation_st

Below please find the description of each column in the output file "output_folder/site_prediction.tsv":

Column name Description
Protein the protein name in the input fasta file
AA_Ref the wild type amino acid
AA_Pos the mutation position on the protein (1-based: the position of the first amino acid on the protein is 1)
AA_Var the mutation amino acid
pos the PTM site on the protein (1-based: the position of the first amino acid on the protein is 1)
diff_pos the distance between the PTM site and the mutation site
w_pep the wild type peptide sequence in which the center is PTM site
m_pep the mutant peptide sequence in which the center is PTM site
w_prob predicted PTM site probability for wild type sequence
m_prob predicted PTM site probability for mutant sequence
delta_prob mutation impact on the PTM site: m_prob - w_prob
ptm PTM name

The prediction took less than 3 minutes using CPU on a Linux server (64G RAM and 16 CPUs).

PTM site prediction:

Below is an example for PTM site prediction. The input (-d) is a protein database file in FASTA format which contains the protein sequences to predict. The following command line is used to predict all PTM sites supported by DeepVEP.

python deepvep.py predict -m models/ -d example/Q5S007.fasta -t 2 -o output_folder

The output folder ("output_folder") of the example command line looks like below:

output_folder/
├── site_prediction.tsv
├── acetylation_k
├── glycosylation_n
├── methylation_k
├── methylation_r
├── phosphorylation_st
├── phosphorylation_y
├── sumoylation_k
└── ubiquitination_k

The output file "output_folder/site_prediction.tsv" contains all the predicted PTM sites. This is the only file that users need to use for downstream analysis. The other folders contain intermediate prediction files.

protein  aa  pos   x                                y_pred          fpr                 ptm
Q5S007   K   17    ASGSCQGCEEDEETLKKLIVRLNNVQEGKQI  0.99253386      0.0027506112469437  acetylation_k
Q5S007   K   18    SGSCQGCEEDEETLKKLIVRLNNVQEGKQIE  0.54781103      0.1937652811735941  acetylation_k
Q5S007   K   30    TLKKLIVRLNNVQEGKQIETLVQILEDLLVF  0.0013190061    0.9529339853300732  acetylation_k
Q5S007   K   53    ILEDLLVFTYSERASKLFQGKNIHVPLLIVL  0.70806825      0.1253056234718826  acetylation_k
Q5S007   K   58    LVFTYSERASKLFQGKNIHVPLLIVLDSYMR  0.0148314405    0.7331907090464548  acetylation_k
Q5S007   K   87    MRVASVQQVGWSLLCKLIEVCPGTMQSLMGP  0.0005026354    0.9819682151589242  acetylation_k
Q5S007   K   120   VGNDWEVLGVHQLILKMLTVHNASVNLSVIG  0.00056051335   0.9801344743276283  acetylation_k
Q5S007   K   137   LTVHNASVNLSVIGLKTLDLLLTSGKITLLI  0.007324457     0.82059902200489    acetylation_k
Q5S007   K   147   SVIGLKTLDLLLTSGKITLLILDEESDIFML  0.0003533643    0.9865525672371638  acetylation_k

Below please find the description of each column in the output file "output_folder/site_prediction.tsv":

Column name Description
protein the protein name from the input fasta file
aa the amino acid PTM site to predict
pos the PTM site on the protein (1-based: the position of the first amino acid on the protein is 1)
x a peptide sequence of 31 amino acids in which the center is the predicted PTM site
y_pred predicted probability
fpr false positive rate using the predicted probability as the threshold to define positive PTM site
ptm PTM name

The prediction took less than 3 minutes using CPU on a Linux server (64G RAM and 16 CPUs).

How to cite:

There is no a manuscript to cite yet.