This repository demonstrates how to use the IMPROVE library v0.1.0-alpha for building a drug response prediction (DRP) model using DeepCDR, and provides examples with the benchmark cross-study analysis (CSA) dataset.
This version, tagged as v0.1.0-2024-09-27, introduces a new API designed to encourage broader adoption of IMPROVE and its curated models by the research community.
Installation instructions are detailed below in Step-by-step instructions.
ML framework:
- TensorFlow -- deep learning framework for building the prediction model
IMPROVE dependencies:
Benchmark data for cross-study analysis (CSA) can be downloaded from this site.
The data tree is shown below:
csa_data/raw_data/
├── splits
│ ├── CCLE_all.txt
│ ├── CCLE_split_0_test.txt
│ ├── CCLE_split_0_train.txt
│ ├── CCLE_split_0_val.txt
│ ├── CCLE_split_1_test.txt
│ ├── CCLE_split_1_train.txt
│ ├── CCLE_split_1_val.txt
│ ├── ...
│ ├── GDSCv2_split_9_test.txt
│ ├── GDSCv2_split_9_train.txt
│ └── GDSCv2_split_9_val.txt
├── x_data
│ ├── cancer_copy_number.tsv
│ ├── cancer_discretized_copy_number.tsv
│ ├── cancer_DNA_methylation.tsv
│ ├── cancer_gene_expression.tsv
│ ├── cancer_miRNA_expression.tsv
│ ├── cancer_mutation_count.tsv
│ ├── cancer_mutation_long_format.tsv
│ ├── cancer_mutation.parquet
│ ├── cancer_RPPA.tsv
│ ├── drug_ecfp4_nbits512.tsv
│ ├── drug_info.tsv
│ ├── drug_mordred_descriptor.tsv
│ └── drug_SMILES.tsv
└── y_data
    └── response.tsv
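The split files in the tree above appear to follow the pattern `<source>_split_<id>_<stage>.txt`. The sketch below builds those paths for one source and split index and reports any that are missing. The helper names `split_files` and `missing_split_files` are hypothetical and not part of IMPROVE; the naming pattern is inferred from the listing only.

```python
from pathlib import Path

# Stage suffixes as they appear in the splits listing.
STAGES = ("train", "val", "test")


def split_files(raw_data_dir, source, split_id):
    """Return a dict mapping each stage to its split file path.

    Assumes the naming pattern <source>_split_<id>_<stage>.txt seen in the tree.
    """
    splits_dir = Path(raw_data_dir) / "splits"
    return {
        stage: splits_dir / f"{source}_split_{split_id}_{stage}.txt"
        for stage in STAGES
    }


def missing_split_files(raw_data_dir, source, split_id):
    """List the split files (as strings) that do not exist on disk."""
    paths = split_files(raw_data_dir, source, split_id).values()
    return [str(p) for p in paths if not p.is_file()]


if __name__ == "__main__":
    # Example: check split 0 of the CCLE source in the default data location.
    for path in missing_split_files("csa_data/raw_data", "CCLE", 0):
        print("missing:", path)
```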
Note that the original_work folder contains the data files and scripts used to train and evaluate DeepCDR for the original paper.
- deepcdr_preprocess_improve.py -- takes benchmark data files and transforms them into files for training and inference
- deepcdr_train_improve.py -- trains a DeepCDR DRP model
- deepcdr_infer_improve.py -- runs inference with the trained DeepCDR model
- model_params_def.py -- definitions of parameters that are specific to the model
- deepcdr_params.txt -- default parameter file (parameter values specified in this file override the defaults)
git clone https://github.com/JDACS4C-IMPROVE/DeepCDR.git
cd DeepCDR
git checkout develop
Option 1: Create the conda env using the yml file.
conda env create -f parsl_env.yml
Option 2: Use the following commands to create the environment.
conda create --name DeepCDR_IMPROVE_env python=3.10
conda activate DeepCDR_IMPROVE_env
conda install tensorflow-gpu=2.10.0
pip install rdkit==2023.9.6
pip install deepchem==2.8.0
pip install PyYAML
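After creating the environment, it can help to confirm that the pip-pinned packages resolved to the expected versions. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper `check_pins` is hypothetical. It covers only the two pip pins listed above, since the distribution name under which conda registers `tensorflow-gpu` may differ between builds.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the pip commands in Option 2.
PINNED = {"rdkit": "2023.9.6", "deepchem": "2.8.0"}


def check_pins(pins, get_version=version):
    """Return {name: (wanted, found, matches)} for each pinned package.

    `found` is None when the package is not installed.
    """
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```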
source setup_improve.sh
This will:
- Download cross-study analysis (CSA) benchmark data into ./csa_data/.
- Clone the IMPROVE repo (checkout develop) outside the DeepCDR model repo.
- Set up env variables: IMPROVE_DATA_DIR (to ./csa_data/) and PYTHONPATH (adds IMPROVE repo).
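Because the setup script is sourced, its environment variables exist only in the current shell. A quick check like the sketch below can confirm they are visible before running the pipeline. The function `check_improve_env` is hypothetical, and the test for an "IMPROVE" directory name on `PYTHONPATH` is a heuristic that assumes the cloned repo keeps its default folder name.

```python
import os


def check_improve_env(env=None):
    """Return a list of human-readable problems with the IMPROVE environment."""
    env = os.environ if env is None else env
    problems = []

    # IMPROVE_DATA_DIR should be set and point to an existing directory.
    data_dir = env.get("IMPROVE_DATA_DIR")
    if not data_dir:
        problems.append("IMPROVE_DATA_DIR is not set")
    elif not os.path.isdir(data_dir):
        problems.append(f"IMPROVE_DATA_DIR does not point to a directory: {data_dir}")

    # Heuristic (assumption): the IMPROVE clone has "IMPROVE" in its folder name.
    entries = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if not any("IMPROVE" in os.path.basename(os.path.normpath(p)) for p in entries):
        problems.append("PYTHONPATH has no entry that looks like the IMPROVE repo")

    return problems


if __name__ == "__main__":
    for problem in check_improve_env():
        print("warning:", problem)
```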
python deepcdr_preprocess_improve.py --input_dir ./csa_data/raw_data --output_dir exp_result
Preprocesses the CSA data and creates train, validation (val), and test datasets.
Generates:
- five model input data files: cancer_dna_methy_model, cancer_gen_expr_model, cancer_gen_mut_model, drug_features.pickle, norm_adj_mat.pickle
- three tabular data files, each containing the drug response values (i.e. AUC) and corresponding metadata: train_y_data.csv, val_y_data.csv, test_y_data.csv
exp_result
├── param_log_file.txt
├── cancer_dna_methy_model
├── cancer_gen_expr_model
├── cancer_gen_mut_model
├── test_y_data.csv
├── train_y_data.csv
├── val_y_data.csv
├── drug_features.pickle
└── norm_adj_mat.pickle
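To confirm that preprocessing produced everything listed above before moving on to training, a small existence check is enough. This is a sketch only: `missing_outputs` is a hypothetical helper, and the file names are copied from the listing rather than read from the IMPROVE library.

```python
from pathlib import Path

# Outputs of the preprocessing step, as listed in the tree above.
PREPROCESS_OUTPUTS = (
    "param_log_file.txt",
    "cancer_dna_methy_model",
    "cancer_gen_expr_model",
    "cancer_gen_mut_model",
    "drug_features.pickle",
    "norm_adj_mat.pickle",
    "train_y_data.csv",
    "val_y_data.csv",
    "test_y_data.csv",
)


def missing_outputs(output_dir, expected=PREPROCESS_OUTPUTS):
    """Return the expected entries that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in expected if not (out / name).exists()]


if __name__ == "__main__":
    missing = missing_outputs("exp_result")
    if missing:
        print("missing preprocessing outputs:", ", ".join(missing))
    else:
        print("all preprocessing outputs found")
```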
python deepcdr_train_improve.py --input_dir exp_result --output_dir exp_result
Trains DeepCDR using the model input data generated in the previous step.
Generates:
- trained model: DeepCDR_model
- predictions on val data (tabular data): val_y_data_predicted.csv
- prediction performance scores on val data: val_scores.json
exp_result
├── param_log_file.txt
├── cancer_dna_methy_model
├── cancer_gen_expr_model
├── cancer_gen_mut_model
├── test_y_data.csv
├── train_y_data.csv
├── val_y_data.csv
├── drug_features.pickle
├── norm_adj_mat.pickle
├── DeepCDR_model
├── val_scores.json
└── val_y_data_predicted.csv
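The validation scores can be inspected directly from val_scores.json. The sketch below assumes the file holds a single JSON object mapping metric names to values; that layout is an assumption, not something stated here, so the code makes no claims about which metrics are present. `load_scores` and `format_scores` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_scores(scores_path):
    """Load a scores file, assuming it contains one JSON object."""
    scores = json.loads(Path(scores_path).read_text(encoding="utf-8"))
    if not isinstance(scores, dict):
        raise ValueError(f"expected a JSON object in {scores_path}")
    return scores


def format_scores(scores):
    """Render scores as 'name: value' lines, sorted by metric name."""
    return "\n".join(f"{name}: {value}" for name, value in sorted(scores.items()))


if __name__ == "__main__":
    print(format_scores(load_scores("exp_result/val_scores.json")))
```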
python deepcdr_infer_improve.py --input_data_dir exp_result --input_model_dir exp_result --output_dir exp_result --calc_infer_score true
Evaluates the performance on a test dataset with the trained model.
Generates:
- predictions on test data (tabular data): test_y_data_predicted.csv
- prediction performance scores on test data: test_scores.json
exp_result
├── param_log_file.txt
├── cancer_dna_methy_model
├── cancer_gen_expr_model
├── cancer_gen_mut_model
├── test_y_data.csv
├── train_y_data.csv
├── val_y_data.csv
├── drug_features.pickle
├── norm_adj_mat.pickle
├── DeepCDR_model
├── val_scores.json
├── val_y_data_predicted.csv
├── test_scores.json
└── test_y_data_predicted.csv
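The three stages above can be chained from one script. The sketch below builds the same command lines shown in this README and, by default, only prints them; passing `dry_run=False` runs each stage and stops at the first failure. `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the script assumes it is started from the repository root inside the activated environment.

```python
import shlex
import subprocess
import sys


def pipeline_commands(raw_data_dir="./csa_data/raw_data", exp_dir="exp_result",
                      python=sys.executable):
    """Return the preprocess, train, and infer commands as argument lists."""
    return [
        [python, "deepcdr_preprocess_improve.py",
         "--input_dir", raw_data_dir, "--output_dir", exp_dir],
        [python, "deepcdr_train_improve.py",
         "--input_dir", exp_dir, "--output_dir", exp_dir],
        [python, "deepcdr_infer_improve.py",
         "--input_data_dir", exp_dir, "--input_model_dir", exp_dir,
         "--output_dir", exp_dir, "--calc_infer_score", "true"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(pipeline_commands(), dry_run=True)
```

Keeping the dry run as the default makes it easy to review the exact commands before committing to a long training run.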