This predictor takes protein sequences in FASTA format and predicts the structure of each amino acid in 3-state format.
It is trained on the cas2.3line dataset.
data/testset.dat is an example of the sequences it can predict.
To use this predictor, set the working directory to /StrucPred and run predictor.py.
Features:
- Uses evolutionary information (PSI-BLAST and PSSM).
- Uses information from neighboring amino acids in the prediction (by building amino acid windows).
- A variety of SVM methods to choose from, including linear SVM and RBF SVM.
- Cross-validation methods are used to split the dataset.
- Other machine learning methods, including random forest and a simple decision tree, can be used to compare prediction results.
- Predictions and their evaluations are stored in the result folder. (You can find evaluations of some predictions I have tried there.)
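The amino acid windows mentioned in the features can be sketched as follows. This is a minimal illustration, not the code in predictor.py: the function names `one_hot` and `build_windows`, the window size, the one-hot encoding, and the zero padding at the sequence ends are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Return a 20-element one-hot vector; unknown residues map to all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def build_windows(sequence, window_size=5):
    """Build one feature vector per residue from the residue and its neighbors.

    Window positions that fall outside the sequence are zero-padded.
    (Hypothetical helper; window_size=5 is an arbitrary default.)
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    half = window_size // 2
    features = []
    for center in range(len(sequence)):
        vector = []
        for pos in range(center - half, center + half + 1):
            if 0 <= pos < len(sequence):
                vector.extend(one_hot(sequence[pos]))
            else:
                vector.extend([0] * len(AMINO_ACIDS))  # padding
        features.append(vector)
    return features


windows = build_windows("MKV", window_size=3)
print(len(windows), len(windows[0]))  # 3 60
```

Each residue yields one vector of length `window_size * 20`, so a classifier sees the residue together with its neighbors on both sides.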
This predictor is written in Python 3.
Download:
Packages required:
To use this predictor, you need pickle (part of the Python standard library), pandas 0.22.0, and scikit-learn 0.19.1.
There are several models to choose from in the model folder. You can change the model at line 293 of predictor.py.
The evaluations are stored in the /result folder and include Q3 and the correlation coefficient.
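For reference, Q3 is the fraction of residues whose predicted state matches the observed state. The sketch below shows that calculation. The function name `q3_score` and the state letters H, E, and C are assumptions for illustration, and the files in the result folder may have been produced differently.

```python
def q3_score(true_states, predicted_states):
    """Fraction of residues whose predicted 3-state label matches the true label."""
    if len(true_states) != len(predicted_states):
        raise ValueError("sequences must have the same length")
    if not true_states:
        raise ValueError("sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_states, predicted_states))
    return correct / len(true_states)


# Made-up labels: 7 of the 8 positions agree.
observed = "HHHEECCC"
predicted = "HHHECCCC"
print(q3_score(observed, predicted))  # 0.875
```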
You can also get the cross-validation score and prediction accuracy by removing the triple quotes around the corresponding code. This may take a long time.
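As a rough picture of what a cross-validation split does, here is a library-free sketch of k-fold splitting. The predictor presumably relies on scikit-learn's own cross-validation utilities instead. The function name `k_fold_indices` and the fold counts used here are assumptions for the example.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs for contiguous k-fold splitting.

    Hypothetical helper: every sample lands in exactly one test fold.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(7, n_folds=3))
for train, test in folds:
    print("test:", test, "train:", train)
```

The model is trained once per fold on the train indices and scored on the held-out test indices, which is why running the full cross-validation takes a long time.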
- model_*.py files are the scripts I used to create the different models.
- additional_dataset_parser.py is used to parse the additional 50 protein sequences.
In the pssm folder:
- Folder 'Sequences': sequences to be run through PSI-BLAST
- Folder 'psiblast_pssm': raw PSSM results produced by PSI-BLAST
- Folder 'pssmMatrix': PSSMs in CSV format
- formatdb.sh: formats the PSI-BLAST database
- psiblast.sh: runs PSI-BLAST
- extractPSSM.py: converts raw PSSM results to pssm.csv
- parser_PSSMtoSVM_MultipleFiles.py: parses the PSSMs for later use in the SVM
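The step from a raw PSSM to CSV can be sketched as below. This is not the code in extractPSSM.py. It assumes the raw file uses the usual PSI-BLAST ASCII PSSM layout, where each residue row starts with a position number and the residue letter, followed by 20 integer scores. The function names `parse_ascii_pssm` and `pssm_to_csv` and the sample rows are invented for illustration.

```python
import csv
import io

# Column order used in PSI-BLAST ASCII PSSM files.
PSSM_COLUMNS = list("ARNDCQEGHILKMFPSTWYV")


def parse_ascii_pssm(text):
    """Return a list of (residue, [20 scores]) rows from ASCII PSSM text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Residue rows start with a position number, then the residue letter.
        if len(fields) >= 22 and fields[0].isdigit():
            rows.append((fields[1], [int(v) for v in fields[2:22]]))
    return rows


def pssm_to_csv(rows):
    """Serialize parsed rows to CSV text, one line per residue."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["residue"] + PSSM_COLUMNS)
    for residue, scores in rows:
        writer.writerow([residue] + scores)
    return buffer.getvalue()


# Two made-up residue rows, for illustration only.
sample = """
Last position-specific scoring matrix computed
           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    1 M   -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
    2 K   -1  2  0 -1 -3  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
"""
rows = parse_ascii_pssm(sample)
print(pssm_to_csv(rows))
```

Header and footer lines are skipped because they do not start with a position number. Real PSSM files carry extra columns after the 20 scores, which this sketch ignores.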