Readme for the supporting files of Hierarchical Network for Drug Response Prediction with Attention (HiDRA)
1. Training
The training script is run with the following command:
    python HiDRA_training.py -t training_list -v validation_list -i input_dir/ -e epoch -o output_model
To run HiDRA with the toy example, use:
    python HiDRA_training.py -t Training.csv -v Validation.csv -i input_dir/ -e epoch -o output_model.hdf5
The training and validation list files are CSV files with three columns: Drug name, Cell line name, IC50. The column names should be ['Drug name','Cell line name','IC50'].
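Before training, it can help to confirm that a list file carries the expected header. The following is a minimal sketch using only the Python standard library. The helper name `check_list_header` and the sample row are illustrative assumptions and are not part of the HiDRA scripts.

```python
import csv
import io

# Column names expected in the training and validation list files (from this README).
EXPECTED_COLUMNS = ["Drug name", "Cell line name", "IC50"]


def check_list_header(handle, expected=EXPECTED_COLUMNS):
    """Return True if the first CSV row equals the expected column names.

    Hypothetical helper for illustration; not part of the HiDRA scripts.
    """
    reader = csv.reader(handle)
    header = next(reader, None)  # None when the file is empty
    return header == list(expected)


# Example with an in-memory file; the drug, cell line, and IC50 value are made up.
sample = io.StringIO("Drug name,Cell line name,IC50\nDrugA,CellX,1.23\n")
print(check_list_header(sample))
```

For a file on disk, the same helper could be called as `check_list_header(open("Training.csv", newline=""))`.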
The input directory is the directory where the input files are stored. The input files consist of:
0.csv - 186.csv: The normalized expression values of the member genes in pathways 0 - 186. The first column is the cell line name, followed by the normalized gene expression values. The first column name should be 'Cell line name', and the remaining column names should be the exact gene symbols.
drug.csv: The Morgan fingerprint of each drug. The first column is the drug name, followed by the 512-bit Morgan fingerprint in binary form. The first column name should be 'Drug name', and the remaining column names should be 0 to 511.
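A quick way to verify that an input directory is complete is to list which of the expected files are absent. This is a sketch under the file layout described above (0.csv through 186.csv plus drug.csv). The function name `missing_input_files` is a hypothetical helper, not something provided by HiDRA.

```python
import os
import tempfile

N_PATHWAYS = 187  # pathway files 0.csv through 186.csv, as described in this README


def missing_input_files(input_dir, n_pathways=N_PATHWAYS):
    """Return the expected file names that are not present in input_dir.

    Hypothetical helper for illustration; not part of the HiDRA scripts.
    """
    expected = [f"{i}.csv" for i in range(n_pathways)] + ["drug.csv"]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(input_dir, name))
    ]


# Example: an empty temporary directory is missing every expected file.
with tempfile.TemporaryDirectory() as tmp:
    print(len(missing_input_files(tmp)))
```

An empty result means every pathway file and drug.csv were found. The check does not inspect the file contents.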
The epoch is the number of training iterations (integer). More epochs generally give a lower training loss, but they also increase the risk of overfitting. In my study, the model was trained for 20 epochs.
The output model is the path (directory and file name) where the trained model file will be stored. The file name extension should be hdf5. This trained model is used in the Prediction process.
2. Prediction
The prediction script is run with the following command:
    python HiDRA_prediction.py -m model_file -p prediction_list -i input_dir -o output_file
To run HiDRA with the toy example, use:
    python HiDRA_prediction.py -m model.hdf5 -p Prediction.csv -i input_dir -o prediction_result.csv
The model file is an hdf5 file that stores a trained HiDRA model. It can be generated with the Training process (HiDRA_training.py).
The prediction list file is a CSV file with two columns: Drug name, Cell line name. The column names should be ['Drug name','Cell line name'].
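A prediction list covering every combination of a set of drugs and cell lines can be generated programmatically. The sketch below uses only the standard library. The function name `write_prediction_list` and the drug and cell line names are placeholders. It assumes, without confirmation from the HiDRA scripts, that every name in the list must also appear in the input files.

```python
import csv
import io
import itertools


def write_prediction_list(handle, drugs, cell_lines):
    """Write every drug / cell line pair under the header given in this README.

    Hypothetical helper for illustration; not part of the HiDRA scripts.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Drug name", "Cell line name"])
    for drug, cell in itertools.product(drugs, cell_lines):
        writer.writerow([drug, cell])


# Example with placeholder names, written to an in-memory buffer.
buffer = io.StringIO()
write_prediction_list(buffer, ["DrugA", "DrugB"], ["CellX", "CellY"])
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("Prediction.csv", "w", newline="")`.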
The input directory is the same as in the Training process.
The output file is the path (directory and file name) where the prediction result will be stored. The output file consists of three columns: Drug name, Cell line name, predicted_IC50.
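If measured IC50 values are available for the predicted pairs, the output file can be compared against them. The sketch below computes the root-mean-square error over pairs present in both files. RMSE is used here only as a common illustrative metric and is not necessarily the one used in the study. The function name `rmse_from_csv` and the sample values are assumptions.

```python
import csv
import io
import math


def rmse_from_csv(pred_handle, true_handle):
    """Root-mean-square error over (drug, cell line) pairs found in both files.

    Hypothetical helper for illustration; not part of the HiDRA scripts.
    """
    truth = {
        (row["Drug name"], row["Cell line name"]): float(row["IC50"])
        for row in csv.DictReader(true_handle)
    }
    squared_errors = []
    for row in csv.DictReader(pred_handle):
        key = (row["Drug name"], row["Cell line name"])
        if key in truth:
            squared_errors.append((float(row["predicted_IC50"]) - truth[key]) ** 2)
    if not squared_errors:
        raise ValueError("no overlapping drug / cell line pairs")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Example with made-up values held in memory.
predicted = io.StringIO(
    "Drug name,Cell line name,predicted_IC50\nDrugA,CellX,1.0\nDrugA,CellY,3.0\n"
)
measured = io.StringIO(
    "Drug name,Cell line name,IC50\nDrugA,CellX,2.0\nDrugA,CellY,3.0\n"
)
print(rmse_from_csv(predicted, measured))
```

Pairs that appear in only one of the two files are ignored, so the result reflects only the overlap.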
I included a toy example so that readers can run the training and prediction steps. However, it contains only 10 cell lines and 10 drugs, for two reasons: the redistribution policy of the dataset and the file size limit. The full GDSC dataset cannot be uploaded online without permission from the GDSC team, and the supporting file is limited to 10 MB, which the original data exceeded. If you want to train HiDRA with the original dataset that I used, or to run predictions with the model that I generated in order to check the performance of HiDRA, please send an e-mail to skchin53@gist.ac.kr.