MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization
This is the repository to replicate the experiments in the DISAE paper: fine-tuning a classifier on top of a pretrained ALBERT model. Dependencies:
- Python 3.7
- PyTorch
- RDKit
- Transformers (Hugging Face, version 2.3.0)
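As an optional sanity check (not part of the repository), the following snippet confirms the dependencies above are importable and prints their versions before you start training:

```python
# Optional sanity check: import the dependencies listed above and print
# their versions. Transformers should report 2.3.0.
import torch
import transformers
from rdkit import rdBase

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("RDKit:", rdBase.rdkitVersion)
```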
All data can be downloaded here and should be placed under this repository, i.e. in the same directory as `finetuning_train.py`.
The data folder contains four subdirectories:
- activity: the train/dev/test split, based on protein similarity at a bitscore threshold of 0.035
- albertdata: the pretrained ALBERT model, pretrained on distilled triplets of the whole Pfam (a loading sketch follows this list)
- Integrated: chemicals collected from several databases
- protein: the mapping from UniProt IDs to triplet form
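For orientation, here is a minimal sketch of loading a pretrained ALBERT with the Transformers 2.3.0 API. The directory path is a hypothetical placeholder, not the repository's actual layout; `finetuning_train.py` performs its own loading.

```python
# Minimal loading sketch; "data/albertdata" is a hypothetical placeholder
# for wherever the downloaded pretrained model actually lives.
from transformers import AlbertConfig, AlbertModel

albert_dir = "data/albertdata"  # placeholder path
config = AlbertConfig.from_pretrained(albert_dir)
model = AlbertModel.from_pretrained(albert_dir, config=config)
```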
To run the ALBERT model (default: frozen ALBERT transformer):

```bash
python finetuning_train.py --protein_embedding_type="albert"
```
To try other freezing options, change `frozen_list` to choose which modules are frozen. The sketch below illustrates what freezing a module does.
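In PyTorch, freezing a module simply disables gradients on its parameters, so the optimizer never updates those weights during fine-tuning. A minimal sketch; the `encoder` module here is a stand-in, not an actual entry of `frozen_list`:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    # Disable gradients so the optimizer never updates these weights.
    for param in module.parameters():
        param.requires_grad = False

# Illustrative use on a placeholder module; in the repository this would be
# applied to the modules named in frozen_list (e.g., the ALBERT encoder).
encoder = nn.LSTM(input_size=128, hidden_size=128)
freeze(encoder)
print(all(not p.requires_grad for p in encoder.parameters()))  # True
```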
To run the LSTM model:

```bash
python finetuning_train.py --protein_embedding_type="lstm"
```