The aim of this project is to predict the localization of a protein from its sequence using an RNN. This problem has a long history: there are many available tools and algorithms that predict the subcellular localization of a protein based on its structure. Machine learning and natural language processing are reasonable instruments for creating a predictive model. This review collects the recent achievements in that field.
A typical reported accuracy is around 65%, and this value can be used as a reference point.
For training the model I used the UniProt database, which contains more than 40,000 proteins with described localizations. The typical keywords are membrane, cytoplasm, mitochondrion and nucleus, which represent the usual localizations of proteins (add picture)
The actual dataset was obtained from the UniProt database, from the uniprot_trembl_human file.
The localization information was converted into a 1x2 vector, where 2 is the number of target localizations (membrane and nucleus); each position records the occurrence of the corresponding keyword. For example:
Text
CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm. Note=Shuttles between the
CC       nucleus and the cytoplasm. Upon muscle cells differentiation, it
CC       accumulates in the nuclei of myotubes, suggesting a positive role of
CC       nuclear HDAC4 in muscle differentiation. The export to cytoplasm
CC       depends on the interaction with a 14-3-3 chaperone protein and is due
CC       to its phosphorylation at Ser-246, Ser-467 and Ser-632 by CaMK4 and
CC       SIK1. The nuclear localization probably depends on sumoylation.
CC       Interaction with SIK3 leads to HDAC4 retention in the cytoplasm (By
CC       similarity). {ECO:0000250|UniProtKB:Q6NZM9}.
Encoded location vector
[0, 1]
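The keyword-to-vector step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name encode_location and the choice to ignore everything after "Note=" are assumptions made here, and the vector order (membrane, nucleus) is inferred from the example above.

```python
# Target localizations, in the order assumed for the label vector.
TARGETS = ("membrane", "nucleus")


def encode_location(comment):
    """Return a 0/1 list marking which target keywords the annotation lists.

    Hypothetical helper: the free-text note after "Note=" is dropped
    (an assumption) so that only the listed locations are matched.
    """
    listed = comment.split("Note=", 1)[0].lower()
    return [int(keyword in listed) for keyword in TARGETS]


example = (
    "SUBCELLULAR LOCATION: Nucleus. Cytoplasm. "
    "Note=Shuttles between the nucleus and the cytoplasm."
)
print(encode_location(example))  # prints [0, 1]
```

Substring matching also counts compound terms such as "Cell membrane" as membrane; whether that is the desired behaviour depends on how the keywords were defined in the project.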
The protein sequence is encoded as one-hot vectors using 24 residue symbols (including non-standard ones, like B, which is aspartic acid or asparagine) plus '#' as the placeholder when the sequence is shorter than the threshold.
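A minimal sketch of this encoding is shown below. The exact set of 24 symbols is not given in the text, so the alphabet here (20 standard residues plus B, Z, U and X) is an assumption, as are the function name one_hot_encode and the decision to truncate sequences longer than the threshold.

```python
# Assumed alphabet: 20 standard residues, 4 non-standard codes, and '#' padding.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "BZUX" + "#"
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def one_hot_encode(sequence, max_len):
    """Encode a sequence as max_len one-hot rows of length len(ALPHABET).

    Shorter sequences are padded with '#'; longer ones are truncated
    (truncation is an assumption, the text only describes padding).
    A symbol outside ALPHABET raises KeyError.
    """
    padded = sequence[:max_len].ljust(max_len, "#")
    encoded = []
    for symbol in padded:
        row = [0] * len(ALPHABET)
        row[INDEX[symbol]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("MKB", max_len=5)
print(len(matrix), len(matrix[0]))  # prints 5 25
```

Each row contains exactly one 1, and the last two rows of the example mark the '#' placeholder.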
The model is an LSTM (input size 512, hidden size 128) followed by 2 fully connected layers.
The model was trained for 70 epochs with a batch size of 256.
The balanced accuracy for this model is 73% for membrane and 76% for nuclear localization.
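For reference, balanced accuracy for one binary label is the mean of sensitivity and specificity, which makes it less sensitive to class imbalance than plain accuracy. The sketch below assumes this standard binary definition is the one used here and computes it separately for each localization; the function name and the toy labels are illustrative only and are not results from this project.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return (sensitivity + specificity) / 2


# Toy labels for a single localization (not project data).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # prints 0.625
```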