This repository implements the Protein Structure Transformer (PST). The PST model endows the pretrained protein sequence model ESM-2 with structural knowledge, enabling the extraction of representations of protein structures. Full details of PST can be found in the paper.
Please use the following to cite our work:
```bibtex
@misc{chen2024endowing,
      title={Endowing Protein Language Models with Structural Knowledge},
      author={Dexiong Chen and Philip Hartout and Paolo Pellizzoni and Carlos Oliver and Karsten Borgwardt},
      year={2024},
      eprint={2401.14819},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}
```
PST uses a structure extractor to incorporate protein structures into existing pretrained protein language models (PLMs) such as ESM-2.
The structure extractor adopts a GNN to extract subgraph representations of the 8Å-neighborhood protein structure graph centered at each residue (i.e., each node of the graph). The resulting residue-level subgraph representations are then added to the corresponding residue-level representations of the sequence model, injecting structural information into ESM-2.
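The sketch below illustrates this idea. It is not the repository's actual implementation: the class name, the choice of GIN layers via `torch_geometric`, and the point at which the structural embeddings are injected are all simplifying assumptions.

```python
# A minimal sketch of the structure-extractor idea, assuming torch_geometric
# and torch_cluster are available. It is NOT the repository's implementation;
# the class name, GIN layers, and injection point are illustrative choices.
import torch
from torch_cluster import radius_graph
from torch_geometric.nn import GINConv


class ToyStructureExtractor(torch.nn.Module):
    def __init__(self, embed_dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            GINConv(torch.nn.Sequential(
                torch.nn.Linear(embed_dim, embed_dim),
                torch.nn.ReLU(),
                torch.nn.Linear(embed_dim, embed_dim),
            ))
            for _ in range(num_layers)
        ])

    def forward(self, token_embeddings: torch.Tensor, ca_coords: torch.Tensor) -> torch.Tensor:
        # token_embeddings: [num_residues, embed_dim] sequence representations
        # ca_coords:        [num_residues, 3] C-alpha coordinates
        edge_index = radius_graph(ca_coords, r=8.0)  # 8Å-neighborhood structure graph
        h = token_embeddings
        for layer in self.layers:
            h = layer(h, edge_index)  # residue-level subgraph representations
        # Add the structural embeddings to the residue-level sequence embeddings.
        return token_embeddings + h
```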
Below you can find an overview of PST with ESM-2 as the sequence backbone. The ESM-2 model weights were frozen during training of the structure extractor, which was trained on AlphaFold SwissProt, a dataset of 542K proteins with predicted structures. The resulting PST model can then be fine-tuned on a downstream task, e.g., a torchdrug or proteinshake task, or used simply to extract representations of protein structures.
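As a hedged illustration of this "train struct only" setup (the function and argument names below are hypothetical, not the repository's API), freezing the sequence backbone in PyTorch looks like this:

```python
# Illustration of the "train struct only" mode: freeze the pretrained sequence
# backbone and optimize only the structure extractor. Names are hypothetical.
import torch


def freeze_backbone(seq_model: torch.nn.Module, struct_extractor: torch.nn.Module,
                    lr: float = 1e-4) -> torch.optim.Optimizer:
    for param in seq_model.parameters():
        param.requires_grad = False  # ESM-2 weights stay fixed
    return torch.optim.Adam(struct_extractor.parameters(), lr=lr)
```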
| Model name | Sequence model | #Layers | Embed dim | Notes | Model URL |
|---|---|---|---|---|---|
| pst_t6 | esm2_t6_8M_UR50D | 6 | 320 | Standard | link |
| pst_t6_so | esm2_t6_8M_UR50D | 6 | 320 | Train struct only | link |
| pst_t12 | esm2_t12_35M_UR50D | 12 | 480 | Standard | link |
| pst_t12_so | esm2_t12_35M_UR50D | 12 | 480 | Train struct only | link |
| pst_t30 | esm2_t30_150M_UR50D | 30 | 640 | Standard | link |
| pst_t30_so | esm2_t30_150M_UR50D | 30 | 640 | Train struct only | link |
| pst_t33 | esm2_t33_650M_UR50D | 33 | 1280 | Standard | link |
| pst_t33_so | esm2_t33_650M_UR50D | 33 | 1280 | Train struct only | link |
The dependencies are managed by mamba or conda:

```bash
mamba env create -f environment.yaml
mamba activate pst
pip install -e .
```
Optionally, you can install the following dependencies to run the experiments:

```bash
pip install torchdrug
```
You can use PST to simply extract representations of protein structures stored in PDB files. Just run:

```bash
python scripts/pst_extract.py --help
```
If you want to work with your own dataset, just create a `my_dataset` directory in `scripts`, put all the PDB files into `my_dataset/raw/`, and run:

```bash
python scripts/pst_extract.py --datadir ./scripts/my_dataset --model pst_t33_so --include_seq
```
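For example, you could stage your PDB files with a small helper like the one below; the source directory is a placeholder, and only the `scripts/my_dataset/raw/` layout comes from the instructions above.

```python
# Hypothetical helper to stage your own PDB files in the layout expected by
# scripts/pst_extract.py; the source directory below is a placeholder.
import shutil
from pathlib import Path

src = Path("/path/to/your/pdbs")       # placeholder: where your PDB files live
dst = Path("scripts/my_dataset/raw")   # layout required by the extraction script
dst.mkdir(parents=True, exist_ok=True)

for pdb_file in src.glob("*.pdb"):
    shutil.copy(pdb_file, dst / pdb_file.name)
```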
You can use PST to perform Gene Ontology (GO) prediction, Enzyme Commission (EC) number prediction, and other protein function prediction tasks.
To train an MLP on top of the representations extracted by the pretrained PST models for Enzyme Commission prediction, run:

```bash
python experiments/fixed/predict_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction
```
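Conceptually, this fixed-representation setting amounts to training a small classifier head on precomputed protein embeddings. The sketch below only illustrates that idea; the `embeddings` and `labels` tensors and the hyperparameters are assumptions, not the script's actual configuration.

```python
# Illustration only: train a small MLP head on precomputed protein embeddings
# for a multi-label task such as EC number prediction. `embeddings`
# ([num_proteins, dim]) and `labels` ([num_proteins, num_classes]) are assumed
# to have been produced beforehand, e.g. with the extraction script above.
import torch


def train_mlp_head(embeddings: torch.Tensor, labels: torch.Tensor,
                   hidden: int = 512, epochs: int = 50, lr: float = 1e-3) -> torch.nn.Module:
    model = torch.nn.Sequential(
        torch.nn.Linear(embeddings.size(1), hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(hidden, labels.size(1)),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()  # multi-label objective
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(embeddings), labels.float())
        loss.backward()
        optimizer.step()
    return model
```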
To finetune the PST model for function prediction tasks, run:

```bash
python experiments/finetune/finetune_gearnet.py dataset=gearnet_ec # dataset=gearnet_go_bp, gearnet_go_cc or gearnet_go_mf for GO prediction
```
Run the following code to train a PST model based on the 6-layer ESM-2 model by only training the structure extractor:

```bash
python train_pst.py base_model=esm2_t6 model.train_struct_only=true
```

You can replace `esm2_t6` with `esm2_t12`, `esm2_t30`, `esm2_t33`, or any pretrained ESM-2 model.
For our VEP datasets, we folded structures that were not available in the PDB. You can download the dataset from here and unzip it in `./datasets`, provided your current path is the root of this repository. Similarly, download the SCOP dataset here.