An Empirical Evaluation of CNNs and RNNs for ICD-9 Code Assignment using MIMIC-III Clinical Notes

  • Members: Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy (all three contributed equally)
  • Publication: arXiv (2018), Elsevier (updated 2019)
  • If you use this code in your work, please cite the following publication: Huang, J., Osorio, C., & Sy, L. W. (2019). An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine, 177, 141–153. https://doi.org/10.1016/J.CMPB.2019.05.024

General Pipeline

  1. (optional) Clean NOTEEVENTS.csv using PostgreSQL: import NOTEEVENTS.csv by adapting the MIMIC-III GitHub scripts, then strip embedded newlines with "select regexp_replace(field, E'[\n\r]+', ' ', 'g')". The cleaned version (NOTEEVENTS-2.csv) can be downloaded from the Google Drive mentioned in "Environment Setup (local)". A pandas alternative is sketched after this list.
  2. Run preprocess.ipynb to produce DATA_HADM and DATA_HADM_CLEANED.
  3. Run describe_icd9code.ipynb and describe_icd9category.ipynb to produce the descriptive statistics.
  4. (optional) Run word2vec-generator.ipynb to produce the word2vec models (a minimal gensim sketch also appears after this list).
  5. Run feature_extraction_seq.ipynb and feature_extraction_nonseq.ipynb to produce the input features for the machine learning and deep learning classifiers.
  6. Run ml_baseline.py to get the results for Logistic Regression and Random Forest.
  7. Run nn_baseline_train.py and nn_baseline_test.py to get the results for the Feed-Forward Neural Network.
  8. Run wordseq_train.py and wordseq_test.py to get the results for Conv1D, RNN, LSTM, and GRU (refer to the scripts' help output or the guides below for training and testing the Keras deep learning models).
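If PostgreSQL is not available, the newline cleanup in step 1 can also be done in Python. This is a minimal pandas sketch, not the authors' pipeline; it assumes the standard MIMIC-III column name TEXT and that the file fits in memory (otherwise read it in chunks):

    import pandas as pd

    # Read the raw MIMIC-III notes; the TEXT column contains embedded newlines.
    notes = pd.read_csv("NOTEEVENTS.csv")

    # Collapse runs of \n and \r into single spaces, mirroring the PostgreSQL
    # call regexp_replace(field, E'[\n\r]+', ' ', 'g') quoted above.
    notes["TEXT"] = notes["TEXT"].str.replace(r"[\n\r]+", " ", regex=True)

    notes.to_csv("NOTEEVENTS-2.csv", index=False)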
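The word2vec models in step 4 come from word2vec-generator.ipynb, which is not reproduced here. For orientation only, a minimal gensim (>= 4) sketch that trains and saves a model in the plain-text format the pipeline expects; the toy corpus and the 100-dimension choice are placeholders:

    from gensim.models import Word2Vec

    # Hypothetical corpus: one tokenized clinical note per list entry.
    sentences = [["patient", "admitted", "with", "chest", "pain"],
                 ["discharged", "in", "stable", "condition"]]

    # The repo ships several dimensions (model_word2vec_v2_*dim.txt); 100 is illustrative.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    # Save as plain-text word2vec vectors.
    model.wv.save_word2vec_format("model_word2vec_v2_100dim.txt", binary=False)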

Training and Testing for the Feed-Forward Neural Network

  • Prerequisites: Keras + TensorFlow, or Keras + Theano

  • Models are specified in nn_baseline_models.py.

  • Run nn_baseline_preprocessing to prepare the data for training and testing.

  • Training:

    • Run training with the default arguments: python nn_baseline_train.py
    • Or run the training script with custom arguments: python nn_baseline_train.py --epoch 10 --batch_size 128 --model_name nn_model_1 --pre_train False
    • Please refer to the parse_args() function in nn_baseline_train.py for the full list of input arguments (an illustrative parser sketch appears below)
  • Testing:

    • Test the model with the default model and data file: python nn_baseline_test.py
    • Please refer to the parse_args() function in nn_baseline_test.py for the full list of input arguments
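The actual argument parser lives in the scripts themselves. As a rough illustration of the interface documented above, here is a minimal argparse sketch; the flag names match the example command, but the defaults and help strings are assumptions:

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="Train the feed-forward baseline.")
        parser.add_argument("--epoch", type=int, default=10, help="number of training epochs")
        parser.add_argument("--batch_size", type=int, default=128, help="mini-batch size")
        parser.add_argument("--model_name", default="nn_model_1",
                            help="a model defined in nn_baseline_models.py")
        # The example command passes this flag as a string ("--pre_train False").
        parser.add_argument("--pre_train", default="False",
                            help="whether to start from pre-trained weights")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(args.epoch, args.batch_size, args.model_name, args.pre_train)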

Training and Testing for Recurrent and Convolutional Neural Networks

  • Similar to the Feed-Forward Neural Network, users can run training and testing with the default settings in wordseq_train.py and wordseq_test.py. All the model architectures are specified in wordseq_models.py (an illustrative Keras sketch appears below).
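The real architectures are defined in wordseq_models.py and are not reproduced here. For a rough sense of the model family, this is a minimal Keras (2.x-style) sketch of an LSTM for multi-label ICD-9 assignment; the vocabulary size, sequence length, embedding dimension, and label count are all placeholders:

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    VOCAB_SIZE, SEQ_LEN, N_LABELS = 40000, 1500, 10  # placeholder sizes

    model = Sequential([
        Embedding(VOCAB_SIZE, 100, input_length=SEQ_LEN),  # word2vec-sized embeddings
        LSTM(128),                                         # GRU or Conv1D variants slot in here
        Dense(N_LABELS, activation="sigmoid"),             # one sigmoid per ICD-9 label
    ])

    # Multi-label targets, so per-label binary cross-entropy.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()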

Environment Setup (local)

  1. conda env create -f environment.yml
  2. Install Spark or download a Spark binary from here
  3. pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
    • The above command should install Toree; if it fails, refer to the GitHub link.
    • Note that Toree is not included in environment.yml because including it there did not work for me before.
  4. jupyter toree install --user --spark_home=/spark-2.1.0-bin-hadoop2.7 --interpreters=PySpark
  5. Extract the following files to the directory "code/data":
    • DIAGNOSES_ICD.csv (from MIMIC-III database)
    • NOTEEVENTS-2.csv (cleaned version of MIMIC-III NOTEEVENTS.csv, with '\n' replaced by ' ') link
    • D_ICD_DIAGNOSES.csv (from MIMIC-III database)
    • model_word2vec_v2_*dim.txt (generated word2vec)
    • bio_nlp_vec/PubMed-shuffle-win-*.txt Download here (you will need to convert the .bin files to .txt; I used gensim to do this, and a conversion sketch appears after this list)
    • model_doc2vec_v2_*dim_final.csv (generated doc2vec)
  6. To run the data preprocessing, data statistics, and other notebooks, start Jupyter and set the kernel to "Toree Pyspark":
    • jupyter notebook
  7. To run the deep learning experiments, follow the corresponding guide above.
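For the .bin-to-.txt conversion mentioned in step 5, a minimal gensim sketch; the win-2 file name instantiates the wildcard for illustration only:

    from gensim.models import KeyedVectors

    # Load the binary word2vec vectors and re-save them as plain text.
    vectors = KeyedVectors.load_word2vec_format(
        "bio_nlp_vec/PubMed-shuffle-win-2.bin", binary=True)
    vectors.save_word2vec_format(
        "bio_nlp_vec/PubMed-shuffle-win-2.txt", binary=False)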

Environment Setup (azure)

  1. Set up Docker with GPU support following this guide
  2. Using Azure's portal, select the VM's firewall (in my case, it showed "azure01-firewall" in "all resources"), then allow ports 22 (ssh) and 8888 (jupyter) for both inbound and outbound traffic.
  3. You can ssh into the VM through one of the following:
    • docker-machine ssh azure01
    • ssh docker-user@public_ip_addr
  4. Spark can be installed by following the instructions in "Environment Setup (local)", but note that this will not be as powerful as HDInsight. I recommend taking advantage of the VM's large memory by setting the Spark memory to a higher value in conf/spark-defaults.conf (an illustrative snippet appears after this list).
  5. If you have a Jupyter notebook running on this VM, you can access it via http://public_ip_addr:8888/
  6. To enable the GPUs for deep learning, follow the instructions on the TensorFlow website link
    • You can check the GPUs' status with "nvidia-smi"
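For step 4's memory tuning, the relevant spark-defaults.conf keys look like the following; the 32g values are illustrative, so pick them to match the VM's RAM:

    # conf/spark-defaults.conf (illustrative values)
    spark.driver.memory   32g
    spark.executor.memory 32g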
