Official implementation for our paper:
MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning
Nianzu Yang, Kaipeng Zeng, Qitian Wu, Junchi Yan* (* denotes correspondence)
Proceedings of the ACM Web Conference 2023 (TheWebConf (a.k.a. WWW) 2023)
MoleRec has been incorporated into the PyHealth package as a benchmark method for the combinatorial drug recommendation task! 👏
- data/ folder contains the necessary data and the scripts for generating data.
  - drug-atc.csv, ndc2atc_level4.csv, ndc2rxnorm_mapping.txt: mapping files for drug code transformation.
  - atc2rxnorm.pkl: maps ATC-4 codes to RxNorm codes, which are then used to query DrugBank.
  - idx2SMILES.pkl: a dictionary from drug ID (we use ATC-4 level codes as drug IDs) to the drug's SMILES string.
  - drug-DDI.csv: a file containing the drug-drug interaction (DDI) information, coded by CID. This file is large; you can download it from https://drive.google.com/file/d/1s3sHmz9ueVA8YAGTARY8jwrhRdRvVaXs/view?usp=sharing.
  - ddi_mask_H.pkl: a mask matrix encoding the relations between molecules and substructures. If drug molecule $i$ contains substructure $j$, the entry in row $i$, column $j$ of the matrix is set to 1.
  - substructure_smiles.pkl: a list containing the SMILES strings of all the substructures.
  - ddi_mask_H.py: the Python script that generates ddi_mask_H.pkl and substructure_smiles.pkl.
  - processing.py: the Python script that generates voc_final.pkl, records_final.pkl, data_final.pkl and ddi_A_final.pkl.
- src/ folder contains all the source code.
  - modules/: code for the model definition.
  - utils.py: code for metric calculations and some data preparation.
  - training.py: code for the functions used in training and evaluation.
  - main.py: trains or evaluates our MoleRec model.
Remark: data/ only contains part of the data. See the Data Generation section for more details.
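To make the layout of the mask matrix stored in ddi_mask_H.pkl concrete, here is a minimal sketch of how such a matrix could be built from a list of substructures. The function name build_mask and the toy SMILES fragments are illustrative assumptions, not code or data from this repository.

```python
def build_mask(drug_substructures, substructure_list):
    """Return a 0/1 matrix H where H[i][j] == 1 iff drug i contains substructure j."""
    # Column index of each substructure is its position in the list.
    column_of = {smiles: j for j, smiles in enumerate(substructure_list)}
    mask = [[0] * len(substructure_list) for _ in drug_substructures]
    for i, parts in enumerate(drug_substructures):
        for smiles in parts:
            mask[i][column_of[smiles]] = 1
    return mask


# Toy, made-up fragments (hypothetical, not taken from the repository's data).
substructures = ["C", "O", "c1ccccc1"]
drugs = [
    {"C", "O"},      # drug 0 contains substructures 0 and 1
    {"c1ccccc1"},    # drug 1 contains substructure 2
]

H = build_mask(drugs, substructures)
for row in H:
    print(row)
```

Each row corresponds to one drug and each column to one entry of the substructure list, so the two files are only meaningful when used together.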
MoleRec.yml lists all the dependencies of MoleRec. To quickly set up an environment for our model, use the following command:
conda env create -f MoleRec.yml
The use of the MIMIC-III dataset requires certification, so we cannot legally provide the raw data here. To access MIMIC-III, you must first obtain the certification and then download the dataset from https://physionet.org/content/mimiciii/.
After downloading the MIMIC-III dataset, put the three CSV files PRESCRIPTIONS.csv, DIAGNOSES_ICD.csv and PROCEDURES_ICD.csv from the raw data into the data/ folder. Then generate the remaining files needed for training and evaluation (those not already provided in the data/ folder) with the following commands:
cd data
python processing.py
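Before running processing.py, it can help to confirm that the three raw CSV files are actually in place. The following is a small sketch of such a check; the helper missing_raw_files is a hypothetical name and is not part of this repository.

```python
from pathlib import Path

# The three raw MIMIC-III files named in this README.
REQUIRED_CSVS = ("PRESCRIPTIONS.csv", "DIAGNOSES_ICD.csv", "PROCEDURES_ICD.csv")


def missing_raw_files(data_dir="data"):
    """Return the names of required CSV files that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_CSVS if not (data_dir / name).is_file()]


missing = missing_raw_files("data")
if missing:
    print("Missing raw files:", ", ".join(missing))
else:
    print("All raw MIMIC-III files found.")
```

Run it from the repository root, or pass a different directory if your data/ folder lives elsewhere.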
For an explanation of each output file, please refer to the SafeDrug repository. Note that in our paper we follow the same data processing procedure as SafeDrug after commit c7218d0.
If you want to re-generate ddi_mask_H.pkl and substructure_smiles.pkl, use the following commands:
cd data
python ddi_mask_H.py
Note that the BRICS decomposition method generates substructures in a random order. Since ddi_mask_H.pkl and substructure_smiles.pkl are affected by this order, please re-train the model if you re-generate these two files. For convenience, we have already provided our generated files in the data/ folder, which can be used directly for training and evaluation.
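The need to re-train follows from how the column indices are assigned. The toy sketch below, using made-up fragments and a hypothetical helper mask_row, shows that the same drug gets a different mask row when the same set of substructures is listed in a different order, so weights learned against one ordering do not line up with another.

```python
def mask_row(drug_parts, substructure_list):
    """Mask row for one drug: 1 where the drug contains the listed substructure."""
    return [1 if smiles in drug_parts else 0 for smiles in substructure_list]


drug = {"C", "O"}  # toy drug made of two hypothetical fragments

# The same three substructures, as two different runs might order them.
order_run_1 = ["C", "O", "c1ccccc1"]
order_run_2 = ["c1ccccc1", "C", "O"]

row_1 = mask_row(drug, order_run_1)
row_2 = mask_row(drug, order_run_2)

print(row_1)
print(row_2)
print("rows match:", row_1 == row_2)
```

A model trained with the first ordering treats column 0 as one substructure, while the second ordering puts a different substructure there.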
We provide two versions of our model, which learn the substructure representations using an embedding table and GNNs, respectively. To train or evaluate our model, first change your working directory:
cd src
To train the embedding-table version, use the following command:
python main.py --device ${device} --embedding --lr ${learning rate} --dp ${dropout rate} --dim ${dim} --target_ddi ${expected ddi} --coef ${coefficient of annealing weight} --epochs ${epochs}
To evaluate a well-trained model, use the following command:
python main.py --Test --embedding --resume_path ${model_path}
We have provided our well-trained model in the best_models/ folder. To evaluate it, use the following command:
python main.py --Test --embedding --resume_path ../best_models/embedding_table/MoleRec.model
The second version learns the substructure representations using GNNs, which is more powerful but has more parameters. To train it, use the following command:
python main.py --device ${device} --lr ${learning rate} --dp ${dropout rate} --dim ${dim} --target_ddi ${expected ddi} --coef ${coefficient of annealing weight} --epochs ${epochs}
To evaluate a well-trained model, use the following command:
python main.py --Test --resume_path ${model_path}
We also provide well-trained model weights for this version, which can be evaluated with:
python main.py --Test --resume_path ../best_models/GNN/MoleRec.model
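The commands above differ only in whether --embedding and --Test are present, so they can be assembled programmatically when scripting several runs. The sketch below is one possible way to do that; build_command is a hypothetical helper, and it only uses flags that appear in this README.

```python
import shlex


def build_command(test=False, embedding=False, resume_path=None, **hparams):
    """Assemble a main.py command line from the flags shown in this README."""
    cmd = ["python", "main.py"]
    if test:
        cmd.append("--Test")
    if embedding:
        cmd.append("--embedding")  # embedding-table version; omit for the GNN version
    if resume_path is not None:
        cmd += ["--resume_path", str(resume_path)]
    # Remaining keyword arguments become "--name value" pairs (e.g. lr, dp, dim).
    for name, value in hparams.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Evaluation commands for the two provided checkpoints.
embedding_eval = build_command(
    test=True,
    embedding=True,
    resume_path="../best_models/embedding_table/MoleRec.model",
)
gnn_eval = build_command(test=True, resume_path="../best_models/GNN/MoleRec.model")

print(shlex.join(embedding_eval))
print(shlex.join(gnn_eval))
```

The resulting lists can be passed to subprocess.run from within the src/ directory. Training hyperparameters such as lr, dp, dim, target_ddi, coef and epochs can be supplied as keyword arguments with values of your own choosing.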
If you find our work useful in your research, please consider citing:
@inproceedings{yang2023molerec,
title={MoleRec: Combinatorial Drug Recommendation with Substructure-Aware Molecular Representation Learning},
author={Yang, Nianzu and Zeng, Kaipeng and Wu, Qitian and Yan, Junchi},
booktitle={Proceedings of the ACM Web Conference 2023},
pages={4075--4085},
year={2023}
}
Feel free to contact us at yangnianzu@sjtu.edu.cn or zengkaipeng@sjtu.edu.cn with any questions.
We sincerely thank the GAMENet and SafeDrug repositories for their well-implemented pipelines, upon which we build our codebase.