DeepCDR

This repository demonstrates how to use the IMPROVE library v0.1.0-alpha for building a drug response prediction (DRP) model using DeepCDR, and provides examples with the benchmark cross-study analysis (CSA) dataset.

This version, tagged as v0.1.0-2024-09-27, introduces a new API which is designed to encourage broader adoption of IMPROVE and its curated models by the research community.

Dependencies

Installation instructions are detailed below in Step-by-step instructions.

ML framework:

TensorFlow -- deep learning framework for building the prediction model

IMPROVE dependencies:

IMPROVE tag v0.1.0-2024-09-27

Dataset

Benchmark data for cross-study analysis (CSA) can be downloaded from this site.

The data tree is shown below:

csa_data/raw_data/
├── splits
│   ├── CCLE_all.txt
│   ├── CCLE_split_0_test.txt
│   ├── CCLE_split_0_train.txt
│   ├── CCLE_split_0_val.txt
│   ├── CCLE_split_1_test.txt
│   ├── CCLE_split_1_train.txt
│   ├── CCLE_split_1_val.txt
│   ├── ...
│   ├── GDSCv2_split_9_test.txt
│   ├── GDSCv2_split_9_train.txt
│   └── GDSCv2_split_9_val.txt
├── x_data
│   ├── cancer_copy_number.tsv
│   ├── cancer_discretized_copy_number.tsv
│   ├── cancer_DNA_methylation.tsv
│   ├── cancer_gene_expression.tsv
│   ├── cancer_miRNA_expression.tsv
│   ├── cancer_mutation_count.tsv
│   ├── cancer_mutation_long_format.tsv
│   ├── cancer_mutation.parquet
│   ├── cancer_RPPA.tsv
│   ├── drug_ecfp4_nbits512.tsv
│   ├── drug_info.tsv
│   ├── drug_mordred_descriptor.tsv
│   └── drug_SMILES.tsv
└── y_data
    └── response.tsv

Note that the original_work folder contains the data files and scripts used to train and evaluate DeepCDR for the original paper.
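
Before preprocessing, it can help to confirm that the download produced the layout shown in the data tree. The following is a minimal sketch, not part of the repository: the function name `missing_raw_data` is hypothetical, and only a few representative files from the tree are checked.

```python
from pathlib import Path

# Subdirectories and a few representative files taken from the data tree above.
EXPECTED_DIRS = ("splits", "x_data", "y_data")
EXPECTED_FILES = (
    "x_data/cancer_gene_expression.tsv",
    "x_data/drug_SMILES.tsv",
    "y_data/response.tsv",
)


def missing_raw_data(raw_dir):
    """Return the expected paths (relative to raw_dir) that are not present."""
    root = Path(raw_dir)
    # Directories are reported first, then individual files.
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_raw_data("csa_data/raw_data")` returns an empty list when every checked path exists.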

Model scripts and parameter file

deepcdr_preprocess_improve.py - takes benchmark data files and transforms them into files for training and inference
deepcdr_train_improve.py - trains a DeepCDR DRP model
deepcdr_infer_improve.py - runs inference with the trained DeepCDR model
model_params_def.py - definitions of parameters that are specific to the model
deepcdr_params.txt - default parameter file (parameter values specified in this file override the defaults)
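
The override rule noted for deepcdr_params.txt can be pictured as a dictionary merge. This is only an illustration of the precedence, not how the IMPROVE library loads parameters: the function `resolve_params` and the parameter names in the usage note are hypothetical.

```python
def resolve_params(defaults, file_values):
    """Merge two parameter mappings; values from the parameter file win."""
    resolved = dict(defaults)      # start from the built-in defaults
    resolved.update(file_values)   # entries from the parameter file override them
    return resolved
```

For instance, `resolve_params({"epochs": 10, "batch_size": 32}, {"epochs": 50})` keeps the default `batch_size` and takes `epochs` from the file values.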

Step-by-step instructions

1. Clone the model repository

git clone https://github.com/JDACS4C-IMPROVE/DeepCDR.git
cd DeepCDR
git checkout develop

2. Set computational environment

Option 1: Create the conda env using the yml file.

conda env create -f parsl_env.yml

Option 2: Use the following commands to create the environment.

conda create --name DeepCDR_IMPROVE_env python=3.10
conda activate DeepCDR_IMPROVE_env
conda install tensorflow-gpu=2.10.0
pip install rdkit==2023.9.6
pip install deepchem==2.8.0
pip install PyYAML

3. Run `setup_improve.sh`.

source setup_improve.sh

This will:

Download cross-study analysis (CSA) benchmark data into ./csa_data/.
Clone IMPROVE repo (checkout develop) outside the DeepCDR model repo.
Set up env variables: IMPROVE_DATA_DIR (to ./csa_data/) and PYTHONPATH (adds IMPROVE repo).
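
Because the script is sourced, the variables exist only in the current shell session. A small check such as the one below can confirm they are visible to Python before the later steps run. It is a sketch under stated assumptions: the helper name `check_improve_env` is hypothetical, and it verifies only that the variables are set and point to existing directories.

```python
import os


def check_improve_env(environ=None):
    """Return a list of problems found with the IMPROVE environment variables."""
    env = os.environ if environ is None else environ
    problems = []

    data_dir = env.get("IMPROVE_DATA_DIR")
    if not data_dir:
        problems.append("IMPROVE_DATA_DIR is not set")
    elif not os.path.isdir(data_dir):
        problems.append(f"IMPROVE_DATA_DIR is not a directory: {data_dir}")

    python_path = env.get("PYTHONPATH")
    if not python_path:
        problems.append("PYTHONPATH is not set")
    else:
        # PYTHONPATH may hold several entries separated by os.pathsep.
        for entry in python_path.split(os.pathsep):
            if entry and not os.path.isdir(entry):
                problems.append(f"PYTHONPATH entry is not a directory: {entry}")
    return problems
```

An empty list from `check_improve_env()` means both variables are set and resolve to directories.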

4. Preprocess CSA benchmark data (raw data) to construct model input data (ML data)

python deepcdr_preprocess_improve.py --input_dir ./csa_data/raw_data --output_dir exp_result

Preprocesses the CSA data and creates train, validation (val), and test datasets.

Generates:

five model input data files: cancer_dna_methy_model, cancer_gen_expr_model, cancer_gen_mut_model, drug_features.pickle, norm_adj_mat.pickle
three tabular data files, each containing the drug response values (i.e. AUC) and corresponding metadata: train_y_data.csv, val_y_data.csv, test_y_data.csv

exp_result
 ├── param_log_file.txt
 ├── cancer_dna_methy_model
 ├── cancer_gen_expr_model
 ├── cancer_gen_mut_model
 ├── test_y_data.csv
 ├── train_y_data.csv
 ├── val_y_data.csv
 ├── drug_features.pickle
 └── norm_adj_mat.pickle
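
A quick way to sanity-check the preprocessing output is to count the rows in the three response files. The sketch below uses only the standard library and assumes each CSV begins with a single header row. The column layout is not inspected, and the function name `split_sizes` is hypothetical.

```python
import csv
from pathlib import Path


def split_sizes(ml_data_dir):
    """Count data rows in train/val/test response files (header row excluded)."""
    sizes = {}
    for stage in ("train", "val", "test"):
        path = Path(ml_data_dir) / f"{stage}_y_data.csv"
        with path.open(newline="") as fh:
            reader = csv.reader(fh)
            next(reader, None)  # skip the header row (assumed to be present)
            sizes[stage] = sum(1 for _ in reader)
    return sizes
```

Running `split_sizes("exp_result")` after this step returns a dictionary with one row count per split.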

5. Train DeepCDR model

python deepcdr_train_improve.py --input_dir exp_result --output_dir exp_result

Trains DeepCDR using the model input data generated in the previous step.

Generates:

trained model: DeepCDR_model
predictions on val data (tabular data): val_y_data_predicted.csv
prediction performance scores on val data: val_scores.json

exp_result
 ├── param_log_file.txt
 ├── cancer_dna_methy_model
 ├── cancer_gen_expr_model
 ├── cancer_gen_mut_model
 ├── test_y_data.csv
 ├── train_y_data.csv
 ├── val_y_data.csv
 ├── drug_features.pickle
 ├── norm_adj_mat.pickle
 ├── DeepCDR_model
 ├── val_scores.json
 └── val_y_data_predicted.csv
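
The validation scores can be inspected directly from val_scores.json. The sketch below assumes the file holds a flat JSON object that maps metric names to numbers. That structure is an assumption rather than something documented here, and the function name `format_scores` is hypothetical.

```python
import json


def format_scores(path):
    """Read a scores file and return one 'name: value' line per metric."""
    with open(path) as fh:
        scores = json.load(fh)
    if not isinstance(scores, dict):
        raise ValueError(f"expected a JSON object in {path}")
    lines = []
    for name in sorted(scores):
        value = scores[name]
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(f"{name}: {value:.4f}")  # numeric metrics, 4 decimals
        else:
            lines.append(f"{name}: {value}")      # anything else is shown as-is
    return "\n".join(lines)
```

For example, `print(format_scores("exp_result/val_scores.json"))` lists the metrics in alphabetical order.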

6. Run inference on test data with the trained model

python deepcdr_infer_improve.py --input_data_dir exp_result --input_model_dir exp_result --output_dir exp_result --calc_infer_score true

Evaluates the performance on a test dataset with the trained model.

Generates:

predictions on test data (tabular data): test_y_data_predicted.csv
prediction performance scores on test data: test_scores.json

exp_result
 ├── param_log_file.txt
 ├── cancer_dna_methy_model
 ├── cancer_gen_expr_model
 ├── cancer_gen_mut_model
 ├── test_y_data.csv
 ├── train_y_data.csv
 ├── val_y_data.csv
 ├── drug_features.pickle
 ├── norm_adj_mat.pickle
 ├── DeepCDR_model
 ├── val_scores.json
 ├── val_y_data_predicted.csv
 ├── test_scores.json
 └── test_y_data_predicted.csv
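
With both score files in place, validation and test performance can be compared metric by metric. As a hedged sketch, the code below assumes both files are flat JSON objects that map metric names to numbers, and it reports the difference for metrics present in both. The function name `score_gaps` is hypothetical.

```python
import json


def score_gaps(val_path, test_path):
    """Return {metric: test_value - val_value} for numeric metrics in both files."""
    with open(val_path) as fh:
        val = json.load(fh)
    with open(test_path) as fh:
        test = json.load(fh)
    gaps = {}
    for name in sorted(set(val) & set(test)):
        v, t = val[name], test[name]
        # Only numeric entries can be subtracted; skip anything else.
        if isinstance(v, (int, float)) and isinstance(t, (int, float)):
            gaps[name] = t - v
    return gaps
```

Calling `score_gaps("exp_result/val_scores.json", "exp_result/test_scores.json")` gives a per-metric gap. Whether a positive gap is good or bad depends on the metric (an error versus a correlation, for instance).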
