- Members: Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy (all three contributed equally)
- Publication: arXiv (2018), Elsevier (updated 2019)
- If you use this code in your work, please cite the following publication: Huang, J., Osorio, C., & Sy, L. W. (2019). An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine, 177, 141–153. https://doi.org/10.1016/j.cmpb.2019.05.024
- (optional) Clean NOTEEVENTS.csv using PostgreSQL: import NOTEEVENTS.csv by adapting the MIMIC-III GitHub scripts, then collapse newlines with `select regexp_replace(field, E'[\n\r]+', ' ', 'g')`. The cleaned version (NOTEEVENTS-2.csv) can be downloaded from the Google Drive mentioned in "Environment Setup (local)". A pandas alternative is sketched after this list.
- Run `preprocess.ipynb` to produce DATA_HADM and DATA_HADM_CLEANED.
- Run `describe_icd9code.ipynb` and `describe_icd9category.ipynb` to produce the descriptive statistics.
- (optional) Run `word2vec-generator.ipynb` to produce the word2vec models (a gensim sketch follows this list).
- Run `feature_extraction_seq.ipynb` and `feature_extraction_nonseq.ipynb` to produce the input features for the machine learning and deep learning classifiers.
- Run `ml_baseline.py` to get the results for Logistic Regression and Random Forest (an illustrative scikit-learn sketch also follows this list).
- Run `nn_baseline_train.py` and `nn_baseline_test.py` to get the results for the Feed-Forward Neural Network.
- Run `wordseq_train.py` and `wordseq_test.py` to get the results for Conv1D, RNN, LSTM, and GRU (refer to the scripts' help or the guide below on training and testing the Keras deep learning models).
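If you would rather skip the PostgreSQL step, a minimal pandas sketch of an equivalent cleanup (assuming the standard MIMIC-III `TEXT` column) could look like this:

```python
import pandas as pd

# Collapse newlines/carriage returns in the note text into single spaces,
# mirroring the PostgreSQL regexp_replace command above.
notes = pd.read_csv("NOTEEVENTS.csv")
notes["TEXT"] = notes["TEXT"].str.replace(r"[\n\r]+", " ", regex=True)
notes.to_csv("NOTEEVENTS-2.csv", index=False)
```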
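For orientation, here is a minimal gensim sketch of what the word2vec generation amounts to; the notebook's actual tokenization and hyperparameters may differ, and the corpus below is a toy placeholder:

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the tokenized clinical notes.
sentences = [
    ["patient", "admitted", "with", "chest", "pain"],
    ["chest", "pain", "resolved", "after", "treatment"],
]

# vector_size matches the *dim suffix of the generated model files
# (in gensim < 4 this parameter was called size).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("model_word2vec_v2_100dim.txt", binary=False)
```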
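Likewise, the `ml_baseline.py` baselines boil down to something like the following scikit-learn sketch; the features, label matrix, and hyperparameters here are placeholders, not the script's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Stand-ins for the extracted features and the multi-label ICD-9 matrix.
X = np.random.rand(100, 20)
Y = np.random.randint(0, 2, size=(100, 10))

# Logistic Regression needs a one-vs-rest wrapper for multi-label output;
# Random Forest handles a 2-D label matrix natively.
lr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
rf = RandomForestClassifier(n_estimators=100).fit(X, Y)
print(lr.predict(X[:2]))
print(rf.predict(X[:2]))
```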
- Prerequisites: Keras + TensorFlow, or Keras + Theano
- Models are specified in `nn_baseline_models.py`.
- Run `nn_baseline_preprocessing` to prepare the data for training and testing.
Training:
- Run training with default arguments: `python nn_baseline_train.py`
- Or run the training script with customized input arguments: `python nn_baseline_train.py --epoch 10 --batch_size 128 --model_name nn_model_1 --pre_train False`
- Please refer to the `parse_args()` function in `nn_baseline_train.py` for the full list of input arguments (an illustrative sketch follows).
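For reference, here is a sketch of an argument parser consistent with the flags shown above; the authoritative version is `parse_args()` in `nn_baseline_train.py`, and the defaults here are assumptions:

```python
import argparse

def parse_args():
    # Flags mirror the example command above; defaults are illustrative.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--model_name", type=str, default="nn_model_1")
    # argparse's type=bool is a known pitfall (any non-empty string is truthy),
    # so parse the string explicitly.
    parser.add_argument("--pre_train", type=lambda s: s == "True", default=False)
    return parser.parse_args()
```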
Testing:
- Test the model with the default model and data file: `python tfidf_test.py`
- Please refer to the `parse_args()` function in `nn_baseline_train.py` for the full list of input arguments.
- Similar to the Feed-Forward Neural Network, users can run training and testing with the default settings in `wordseq_train.py` and `wordseq_test.py`. All the model architectures are specified in `wordseq_models.py` (a minimal illustrative model is sketched below).
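For a flavor of what those architectures look like, here is a minimal Keras LSTM sketch; the real models live in `wordseq_models.py`, and the vocabulary size, sequence length, and label count below are placeholders:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential([
    # Placeholder sizes: 50k-word vocabulary, 100-dim embeddings, 500-token notes.
    Embedding(input_dim=50000, output_dim=100, input_length=500),
    LSTM(128),
    # One sigmoid unit per ICD-9 label: multi-label output, not softmax.
    Dense(10, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```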
- conda env create -f environment.yml
- Install Spark or download the Spark binary from here.
- pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
- The above command should install Toree. If it fails, refer to the GitHub link.
- Note that Toree was not included in environment.yml because including it there did not work for me.
- jupyter toree install --user --spark_home=/spark-2.1.0-bin-hadoop2.7 --interpreters=PySpark
- Extract the following files to the directory `code/data`:
  - DIAGNOSES_ICD.csv (from the MIMIC-III database)
  - NOTEEVENTS-2.csv (cleaned version of MIMIC-III NOTEEVENTS.csv, with '\n' replaced by ' ') link
  - D_ICD_DIAGNOSES.csv (from the MIMIC-III database)
  - model_word2vec_v2_*dim.txt (generated word2vec)
  - bio_nlp_vec/PubMed-shuffle-win-*.txt, download here (you will need to convert the .bin files to .txt; I used gensim to do this, as sketched after this list)
  - model_doc2vec_v2_*dim_final.csv (generated doc2vec)
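The .bin-to-.txt conversion mentioned above can be done with gensim along these lines (file names are examples matching the wildcard):

```python
import os
from gensim.models import KeyedVectors

# Load the binary BioNLP word2vec file and re-save it in text format.
os.makedirs("bio_nlp_vec", exist_ok=True)
vectors = KeyedVectors.load_word2vec_format("PubMed-shuffle-win-2.bin", binary=True)
vectors.save_word2vec_format("bio_nlp_vec/PubMed-shuffle-win-2.txt", binary=False)
```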
- To run the data preprocessing, data statistics, and other notebook-related steps, start Jupyter Notebook. Don't forget to set the kernel to "Toree Pyspark".
- jupyter notebook
- To run the deep learning experiments, follow the corresponding guide below.
- Set up Docker with GPU support following this guide.
- Using the Azure portal, select the VM's firewall (in my case, it appeared as "azure01-firewall" under "all resources"), then allow port 22 (ssh) and 8888 (jupyter) for both inbound and outbound traffic.
- You can ssh into the VM through one of the following:
  - docker-machine ssh azure01
  - ssh docker-user@public_ip_addr
- Spark can be installed by following the instructions in "Environment Setup (local)", but note that this will not be as powerful as HDInsight. I recommend taking advantage of the VM's large memory by setting the Spark memory to a higher value in `/conf/spark-defaults.conf` (see the example after this list).
- If you have a Jupyter notebook running in this VM, you can access it via http://public_ip_addr:8888/
- To enable the GPUs for deep learning, follow the instructions on the TensorFlow website link.
- You can check the GPUs' status with `nvidia-smi`.
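As an example of the Spark memory setting mentioned above (the 32g value is an assumption; size it to your VM's RAM):

```
# /conf/spark-defaults.conf
spark.driver.memory    32g
```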