Skip to content

MulinforCPI: enhancing precision of compound-protein interaction prediction through novel perspectives on multi-level information integration

Notifications You must be signed in to change notification settings

dmis-lab/MulinforCPI

Repository files navigation

MulinforCPI

A PyTorch Implementation of Paper:

MulinforCPI: enhancing precision of compound-protein interaction prediction through novel perspectives on multi-level information integration

Our repository uses 3DInformax from https://github.com/HannesStark/3DInfomax as a backbone for pretraining PNA for compound information extraction and ESM_Fold from https://github.com/facebookresearch/esm for predicting protein fold.

Forecasting the interaction between compounds and proteins is crucial for discovering new drugs. However, previous sequence-based studies have not utilized three-dimensional (3D) information on compounds and proteins, such as atom coordinates and distance matrices, to predict binding affinity. Furthermore, numerous widely adopted computational techniques have relied on sequences of amino acid characters for protein representations. This approach may constrain the model's ability to capture meaningful biochemical features, impeding a more comprehensive understanding of the underlying proteins. Here, we propose a two-step deep learning strategy named MulinforCPI that incorporates transfer learning techniques with multi-level resolution features to overcome these limitations. Our approach leverages 3D information from both proteins and compounds and acquires a profound understanding of the atomic-level features of proteins. Besides, our research highlights the divide between first-principle and data-driven methods, offering new research prospects for compound protein interaction tasks. We applied the proposed method to six datasets: Davis, Metz, KIBA, CASF-2016, DUD-E, and BindingDB, to evaluate the effectiveness of our approach. Image

Clustering the dataset:

In our experiment we use cross, we used the cross-cluster validation technique. Leave one out for testing while the validation set is randomly taken from the training set with a ratio 20/80.
data_file: The file contains dataset (Davis,KIBA,metz)
output_folder: The folder contains five clusters

python prepare_cluster_data_2023 #data_file #output_folder

Image

To train MulinforCPI model:

Set up the environment: In our experiment we use, Python 3.7 with PyTorch 1.12.1 + CUDA 11.7.

git clone https://github.com/dmis-lab/MulinforCPI.git
conda env create -f environment.yml

Your data should be in the format .csv, and the column names are: 'smiles', 'sequence', 'label'.

  1. Generate the 3D fold of protein from the dataset.
    data_folder: Folder of dataset
python generate_protein_fold.py #data_folder

Image

The output of ESM fold. The 3D fold contains 1) The Card, 2) Atom Number, 3) Atom Type, 4) Three-Letter Amino Acid Code, 5) Chain ID, 6) Residue Number, 7) Atom Coordinates, 8) Atom Occupancy, 9) Atomic Displacement Parameter, 10) Element.
  1. Calculate the Alpha Carbon distances.
    input_folder: Output folder from ESM prediction.(Output of step 1.)
    output_folder: Folder of processed file
    data_name: Name of dataset
python generate_distance_map.py #input_folder #output_folder #data_name
  1. Align the training dataset following the output of ESM fold. (FOR data-making purposes)
    data_folder: Folder of Training dataset
    output_folder: Folder of processed file
    prot_dict: _data.csvdic.csv file in output folder of ESM prediction.
    data_name: "davis or bindingdb or kiba or metz..."
python match_protein_index_esm.py #data_folder #output_folder #prot_dict #data_name
  1. Generate the pickle .pt file for training
    data name: Folder of Training dataset
    data_path: Aligned dataset (Output of step 3.)
    output_folder: Folder of Training dataset
    distance metric pt file: Alpha Carbon distances. ( Output of step 2.)
    esm prediction folder: Output folder from ESM prediction. (Output of step 1.)
python create_train_cuscpidata_ecfp.py #data_name #data_path #output_folder #distance_metric_pdb_file #esm_prediction_folder 
  1. Train the model
    Change the data_path: the processed data folder in .pt format ( Output of step 4.) in best_configs/tune_cus_cpi.yml
    To save time and result reimplementation please download the pre-trained file: Download runs files: https://github.com/HannesStark/3DInfomax/tree/master/runs/PNA_qmugs_NTXentMultiplePositives_620000_123_25-08_09-19-52
python train_cuscpi.py --config best_configs/tune_cus_cpi.yml

Dataset:

The related links are as follows:

KIBA, Davis:https://github.com/kexinhuang12345/DeepPurpose
Metz: https://github.com/sirimullalab/KinasepKipred
BindingDB: https://github.com/IBM/InterpretableDTIP
DUD-E Diverse: http://dude.docking.org/subsets/diverse
QMugs: https://libdrive.ethz.ch/index.php/s/X5vOBNSITAG5vzM
CASF-2016: http://www.pdbbind.org.cn/casf.php

To take inference:

Change the test_data_path and checkpoint in best_configs/inference_cpi.yml to take the inference (with test_data_path in made following step 1-2-3)

python inferencecpi.py --config=best_configs/inference_cpi.yml

Citation:

If you find the models useful in your research, please consider citing the paper:

@article{nguyen2024mulinforcpi,
  title={MulinforCPI: enhancing precision of compound--protein interaction prediction through novel perspectives on multi-level information integration},
  author={Nguyen, Ngoc-Quang and Park, Sejeong and Gim, Mogan and Kang, Jaewoo},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={1},
  pages={bbad484},
  year={2024},
  publisher={Oxford University Press}
}

License

MIT

About

MulinforCPI: enhancing precision of compound-protein interaction prediction through novel perspectives on multi-level information integration

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published