Intro

The aim of this project is to predict the subcellular localization of a protein from its sequence using a recurrent neural network (RNN). The problem has a long history, and many tools and algorithms are available for predicting the subcellular localization of a protein from its structure. Machine learning and natural language processing are reasonable instruments for building a predictive model. This review collects recent achievements in the field.

The typical accuracy is around 65%, for example, and this value can be used as a reference point.

Dataset

To train the model I used the UniProt database, which contains more than 40,000 proteins with described localizations. Typical keywords are membrane, cytoplasm, mitochondrion and nucleus, which represent the usual localizations of proteins (add picture).

Data preprocessing

The actual dataset was obtained from the UniProt database (the uniprot_trembl_human file).
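As a rough illustration of this step, the sketch below reads entries from a UniProt flat-text file and collects the accession, the comment (CC) lines and the sequence of each entry. The function name `read_entries` and the returned tuple layout are hypothetical and not taken from the repository; the sketch only assumes the standard flat-file layout (two-letter line codes, `SQ` followed by indented sequence lines, `//` as the entry terminator).

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_entries(handle: TextIO) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (accession, cc_lines, sequence) for each flat-file entry.

    Hypothetical helper: the repository's own parser may differ.
    """
    accession, cc_lines, seq_parts, in_seq = None, [], [], False
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # "//" terminates an entry
            if accession is not None:
                yield accession, cc_lines, "".join(seq_parts)
            accession, cc_lines, seq_parts, in_seq = None, [], [], False
        elif line.startswith("AC   ") and accession is None:
            # keep only the primary (first) accession
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("CC   "):
            cc_lines.append(line[5:])
        elif line.startswith("SQ   "):
            in_seq = True
        elif in_seq and line.startswith("     "):
            # sequence lines are indented and split into blocks of 10
            seq_parts.append(line.replace(" ", ""))


# Tiny made-up entry, used only to show the call
sample = (
    "ID   TEST_HUMAN   Unreviewed;   12 AA.\n"
    "AC   A0A000;\n"
    "CC   -!- SUBCELLULAR LOCATION: Nucleus.\n"
    "SQ   SEQUENCE   12 AA;  1234 MW;  ABCDEF CRC64;\n"
    "     MKTAYIAKQR QI\n"
    "//\n"
)
entries = list(read_entries(io.StringIO(sample)))
```

Reading the file as a stream keeps memory use low, which matters for a file the size of a TrEMBL export.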

Localization

The localization information is converted into a 1x2 vector, where 2 is the number of target localizations (membrane and nucleus) and each position records the occurrence of the corresponding keyword. For example, the following annotation text:

CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm. Note=Shuttles between the
CC       nucleus and the cytoplasm. Upon muscle cells differentiation, it
CC       accumulates in the nuclei of myotubes, suggesting a positive role of
CC       nuclear HDAC4 in muscle differentiation. The export to cytoplasm
CC       depends on the interaction with a 14-3-3 chaperone protein and is due
CC       to its phosphorylation at Ser-246, Ser-467 and Ser-632 by CaMK4 and
CC       SIK1. The nuclear localization probably depends on sumoylation.
CC       Interaction with SIK3 leads to HDAC4 retention in the cytoplasm (By
CC       similarity). {ECO:0000250|UniProtKB:Q6NZM9}.

Encoded location vector: [0, 1] (membrane absent, nucleus present)
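The keyword encoding can be sketched as follows. This is a minimal illustration, not the repository's code: the name `encode_location` is hypothetical, and the choice to look only at the location terms before the free-text `Note=` part is an assumption (the README does not say whether the note text is searched too).

```python
from typing import List

# Order of the positions in the vector: [membrane, nucleus]
TARGETS = ("membrane", "nucleus")


def encode_location(comment: str) -> List[int]:
    """Encode a SUBCELLULAR LOCATION comment as a 0/1 keyword vector."""
    # Drop the "CC" line code and rejoin the wrapped lines
    lines = [ln[2:].strip() if ln.startswith("CC") else ln.strip()
             for ln in comment.splitlines()]
    text = " ".join(lines)
    marker = "SUBCELLULAR LOCATION:"
    start = text.find(marker)
    if start == -1:
        return [0] * len(TARGETS)
    text = text[start + len(marker):]
    text = text.split("-!-")[0]    # stop at the next comment topic
    text = text.split("Note=")[0]  # assumption: ignore the free-text note
    text = text.lower()
    return [int(keyword in text) for keyword in TARGETS]


example = (
    "CC   -!- SUBCELLULAR LOCATION: Nucleus. Cytoplasm. Note=Shuttles between the\n"
    "CC       nucleus and the cytoplasm."
)
vector = encode_location(example)  # [0, 1]
```

Substring matching means that terms such as "Cell membrane" also set the membrane position, which may or may not match the original preprocessing.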

Sequence

The protein sequence is encoded as a one-hot vector using 24 residue symbols (including non-standard ones, like B, which is aspartic acid or asparagine) plus '#' as the placeholder when the sequence is shorter than the threshold.
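A minimal sketch of such an encoding is shown below. The exact set and order of the 24 symbols is not given in this README, so the alphabet here (20 standard residues plus B, Z, U and X) is an assumption, as are the name `one_hot` and the decision to truncate sequences longer than the threshold.

```python
import numpy as np

# Assumed alphabet: 20 standard residues + B, Z, U, X, then '#' for padding
ALPHABET = "ACDEFGHIKLMNPQRSTVWYBZUX" + "#"
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def one_hot(sequence: str, max_len: int) -> np.ndarray:
    """Return a (max_len, 25) one-hot matrix for a protein sequence."""
    # Pad short sequences with '#'; truncating long ones is an assumption
    padded = sequence[:max_len].ljust(max_len, "#")
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for position, symbol in enumerate(padded):
        # A symbol outside the alphabet raises KeyError
        encoded[position, INDEX[symbol]] = 1.0
    return encoded


matrix = one_hot("MKB", max_len=5)  # 3 residues followed by 2 '#' rows
```

Each row has exactly one non-zero entry, so the padding positions remain distinguishable from real residues.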

Model

The model is an LSTM (input size 512, hidden size 128) followed by 2 fully connected layers.

The model was trained for 70 epochs with a batch size of 256.
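The architecture described above could look roughly like this in PyTorch. Only the LSTM input size (512), the hidden size (128) and the number of fully connected layers (2) come from this README; the framework, the class name `LocalizationLSTM`, the width of the first fully connected layer, the ReLU activation, the use of the last hidden state and the loss function are all assumptions. How the input size of 512 relates to the one-hot encoding is also not stated here.

```python
import torch
from torch import nn


class LocalizationLSTM(nn.Module):
    """LSTM followed by two fully connected layers (illustrative sketch)."""

    def __init__(self, input_size: int = 512, hidden_size: int = 128,
                 fc_size: int = 64, n_targets: int = 2) -> None:
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, fc_size),  # fc_size is an assumed width
            nn.ReLU(),
            nn.Linear(fc_size, n_targets),    # one logit per localization
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, steps, input_size); h_n: (num_layers, batch, hidden_size)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])


model = LocalizationLSTM()
logits = model(torch.zeros(3, 7, 512))  # dummy batch: 3 samples, 7 steps
# Assumed loss: independent sigmoid per label, since a protein can be
# annotated with both membrane and nucleus
targets = torch.tensor([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

A training loop would iterate over batches of 256 samples for 70 epochs, as stated above.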

Results

The balanced accuracy for this model is 73% for membrane and 76% for nuclear localization.
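For reference, balanced accuracy for a binary label is the mean of sensitivity and specificity, computed separately for each localization. The sketch below shows the calculation; the function name is hypothetical, the numbers in the usage example are made up, and the repository may well rely on a library implementation instead.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Made-up labels for one localization, for illustration only
score = balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])  # 0.625
```

Unlike plain accuracy, this metric is not inflated when one class (for example, non-membrane proteins) dominates the dataset.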

knawel/Sequence_to_Cell_Localization