MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization
This repository replicates the experiments from the DISAE paper on fine-tuning a classifier with a pretrained ALBERT model.
- Python 3.7
- PyTorch
- RDKit
- Transformers (Hugging Face, version 2.3.0)
All data can be downloaded here and should be placed under this repository, i.e. in the same directory as the
The data folder contains four subdirectories:
- activity: the train/dev/test split, based on protein similarity at a bitscore threshold of 0.035
- albertdata: the pretrained ALBERT model, which was pretrained on distilled triplets from the whole of Pfam
- Integrated: chemicals collected from several databases
- protein: the mapping from UniProt ID to triplet form
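Before training, it can help to confirm that the download was unpacked into the expected layout. The following is a minimal sketch, not part of the repository: the helper name `missing_subdirs` is hypothetical, and the demo runs against a temporary directory rather than the real data folder.

```python
import tempfile
from pathlib import Path

# Subdirectory names as listed above
EXPECTED_SUBDIRS = ("activity", "albertdata", "Integrated", "protein")


def missing_subdirs(data_root):
    """Return the expected subdirectories that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demo on a temporary folder that contains only two of the four subdirectories
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "activity").mkdir()
    (Path(tmp) / "albertdata").mkdir()
    missing = missing_subdirs(tmp)

print("Missing subdirectories:", missing)
```

In practice, pass the path of the downloaded data folder to `missing_subdirs`; an empty list means all four subdirectories are present.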
To run the ALBERT model (default: frozen ALBERT transformer):
python --protein_embedding_type="albert"
To try other freezing options, change "frozen_list" to select which modules are frozen.
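As an illustration of what freezing by module name typically means in PyTorch, here is a minimal sketch. It is an assumption about the general technique, not the repository's actual code: the helper `freeze_modules` and the module names `embeddings`, `encoder`, and `classifier` are hypothetical, and the toy model stands in for the real ALBERT-based classifier.

```python
import torch.nn as nn


def freeze_modules(model, frozen_list):
    """Disable gradients for parameters under each listed module name.

    Returns the names of the parameters that were frozen.
    """
    frozen = []
    for name, param in model.named_parameters():
        # Match the module itself or any parameter nested below it
        if any(name == p or name.startswith(p + ".") for p in frozen_list):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Toy stand-in model; the module names here are hypothetical
model = nn.ModuleDict({
    "embeddings": nn.Embedding(100, 16),
    "encoder": nn.Linear(16, 16),
    "classifier": nn.Linear(16, 2),
})

frozen_names = freeze_modules(model, ["embeddings", "encoder"])
trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]

print("Frozen:", frozen_names)
print("Trainable:", trainable_names)
```

Only parameters with `requires_grad=True` receive gradient updates, so an optimizer built from the trainable parameters would update just the classifier in this toy setup.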
To run the LSTM model:
python --protein_embedding_type="lstm"