The repository contains the code for the winning project of the 2023 BioHack NYC hackathon. Our goal was to use graph neural networks to predict to predict the identity of amino acids within ligand binding pockets.
##Inspiration
One of the major goals in the field of protein design is creating a protein that can bind any arbitraty small molecule. We were particularly struck by the work of McCann et al.. They were able to design a protein that selectily binds VX nerve agent. The protein was designed using an algorithm called protEvolver, a physics-based genetic algorithim that optimizes protein-ligand interfaces. Methods like these are slow and computationally expensive. The success of proteinMPNN inspired us to try to create a smiliar machine learning model that could incorporate small molecule information in the sequence prediction task. Such a model copuld be used to greatly speed up the design process.
n.b. Since our work on this project, the developers of proteinMPNN have released LigandMPNN
To run this repository on your own the first thing you need to do is download the PDBind refined set and extract the download into this repository.
This dataset contains high resolution protein ligand complexes with well characterized binding information. You can read more about the dataset on the website.
Use the yml file to install all required dependencies
conda env create -f environment.yml
Our model achieved 78% sequence revocery across the entire validation set.
There appears to be a slight bias towards over represeneted amino acids and away from under represented amino acids.
As expected, the models performance was better on proteins with a large number of homologs. Interestingly, the average sequence recovery on proteins with no homologs was 30% with some individual proteins achieving 100% sequence recovery. This demonstrates the models ability to generalize to novel protein ligand complexes but is still too limited to be considered a valuable tool for design.
The notebooks need to be run in a specific order.
- Clean Pockets and Examine Ligand Data
- Generate Amino Acid Embeddings
- Compile Graphs
- Writing Fasta Optional
- Sequence Similarity Optional
- Model Training
- Model Evaluation