This repository provides code for the analysis of the Clinical Reading Comprehension task in the ACL 2020 paper: Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset
@inproceedings{yue2020CliniRC,
title={Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset},
author={Xiang Yue and Bernal Jimenez Gutierrez and Huan Sun},
booktitle={ACL},
year={2020}
}
Run the following commands to clone the repository and install the requirements. The code requires Python 3.5 or higher, PyTorch 1.0 or higher, and TensorFlow 1.1 or higher; the other dependencies are listed in requirements.txt.
$ git clone https://github.com/xiangyue9607/CliniRC.git
$ pip install -r requirements.txt
Our analysis is based on the recently released clinical QA dataset: emrQA [EMNLP'18].
Note that the emrQA dataset is generated from the n2c2 (previously "i2b2") datasets.
We do not have the right to include either the emrQA dataset or the n2c2 datasets in this repo.
Users need to first sign the n2c2 data use agreement and then follow the instructions in the emrQA repo to generate the emrQA dataset.
After you generate the emrQA dataset, create the directory ./data/datasets and put data.json into it.
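For example (assuming the generated data.json sits in your current working directory):
$ mkdir -p ./data/datasets
$ cp data.json ./data/datasets/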
We first provide a preprocessing script to help clean up the generated emrQA dataset. Specifically, the preprocessing script does the following:
- Removes extra whitespace, punctuation, and newlines, and joins sentences into one paragraph;
- Reformats the dataset into the SQuAD format;
- Randomly splits the dataset into train/dev/test sets (7:1:2).
$ python src/preprocessing.py \
--data_dir ./data/datasets \
--filename data.json \
--out_dir ./data/datasets
Note that there are 5 subsets in the emrQA dataset. We only use the Medication and Relation subsets, as (1) they make up 80% of the entire emrQA dataset and (2) their format is consistent with the span extraction task, which is more challenging and meaningful for clinical decision support.
After running the preprocessing.py script, you will obtain 6 JSON files in your output directory (i.e., train, dev, and test sets for the Medication and Relation datasets).
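Each of these files follows the standard SQuAD v1.1 JSON structure. As a quick sanity check, you can inspect one of them with a few lines of Python (a minimal sketch; the field names are the standard SQuAD v1.1 ones, and the file name assumes the default output naming used in the commands below):

import json

# load one of the generated splits
with open("./data/datasets/medication-dev.json") as f:
    dataset = json.load(f)

# SQuAD-style layout: data -> paragraphs -> (context, qas -> answers)
paragraph = dataset["data"][0]["paragraphs"][0]
print(paragraph["context"][:200])        # clinical report text, joined into one paragraph
qa = paragraph["qas"][0]
print(qa["question"])                    # generated question
print(qa["answers"][0]["text"])          # answer span
print(qa["answers"][0]["answer_start"])  # character offset into the context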
As we demonstrate in the paper (Section 4.1), though there are more than 1 million questions in the emrQA dataset, many questions and their patterns are very similar since they are generated from the same question templates. We show that a CliniRC system does not need that many questions for training: using a sampled subset achieves roughly the same performance as training on the entire dataset.
To randomly sample questions from the original dataset, run:
$ python src/sample_dataset.py \
--data_dir ./data/datasets \
--filename medication-train \
--out_dir ./data/datasets \
--sample_ratio 0.2
$ python src/sample_dataset.py \
--data_dir ./data/datasets \
--filename relation-train \
--out_dir ./data/datasets \
--sample_ratio 0.05
The --sample_ratio flag controls the fraction of questions sampled from each document.
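For intuition, here is a minimal sketch of this kind of per-document sampling (hypothetical code, not the actual src/sample_dataset.py; it assumes the SQuAD-style files produced by the preprocessing step):

import json
import random

def sample_questions(in_path, out_path, sample_ratio, seed=42):
    """Keep roughly `sample_ratio` of the QA pairs under every document."""
    random.seed(seed)
    with open(in_path) as f:
        dataset = json.load(f)
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            qas = paragraph["qas"]
            k = max(1, int(len(qas) * sample_ratio))
            paragraph["qas"] = random.sample(qas, k)
    with open(out_path, "w") as f:
        json.dump(dataset, f)

# e.g., keep ~20% of the Medication training questions
sample_questions("./data/datasets/medication-train.json",
                 "./data/datasets/medication-train-sampled-0.2.json", 0.2)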
In our paper, we compare several state-of-the-art QA models on the emrQA dataset. Here, we give two examples: BERT and DocReader. For the other QA models tested in the paper, please refer to their GitHub repos for further details.
- Download the pretrained BERT models (including bert-base-cased, BioBERT-base-PubMed, and ClinicalBERT). Feel free to try other BERT models as well.
$ chmod +x download_pretrained_models.sh; ./download_pretrained_models.sh
- Train (fine-tune) a BERT model on the emrQA Medication/Relation dataset. The training script is adapted from the BERT GitHub repo.
$ CUDA_VISIBLE_DEVICES=0 python ./BERT/run_squad.py \
--vocab_file=./pretrained_bert_models/clinicalbert/vocab.txt \
--bert_config_file=./pretrained_bert_models/clinicalbert/bert_config.json \
--init_checkpoint=./pretrained_bert_models/clinicalbert/model.ckpt-100000 \
--do_train=True \
--train_file=./data/datasets/relation-train-sampled-0.05.json \
--do_predict=True \
--do_lower_case=False \
--predict_file=./data/datasets/relation-dev.json \
--train_batch_size=6 \
--learning_rate=3e-5 \
--num_train_epochs=4.0 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=./output/bert_models/clinicalbert_relation_0.05/
- Inference on the test set:
$ python ./BERT/run_squad.py \
--vocab_file=./pretrained_bert_models/clinicalbert/vocab.txt \
--bert_config_file=./pretrained_bert_models/clinicalbert/bert_config.json \
--init_checkpoint=./output/bert_models/clinical_relation_0.05_epoch51/model.ckpt-21878 \
--do_train=False \
--do_predict=True \
--do_lower_case=False \
--predict_file=./data/datasets/relation-test.json \
--train_batch_size=6 \
--learning_rate=3e-5 \
--num_train_epochs=3.0 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=./output/bert_models/clinical_relation_0.05_epoch51_test/
- Evaluate the model. We use the official evaluation script from SQuAD v1.1.
$ python ./src/evaluate-v1.1.py ./data/datasets/medication-dev.json ./output/bert_models/bertbase_medication_0.2/predictions.json
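The script prints exact match (EM) and F1 as a small JSON dict. For reference, here is a simplified sketch of how the two metrics are computed (mirroring the official SQuAD v1.1 logic; the real script additionally takes the maximum over multiple ground-truth answers and averages over all questions):

import collections
import re
import string

def normalize_answer(s):
    # lowercase, drop punctuation and articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1(prediction, ground_truth):
    pred = normalize_answer(prediction).split()
    gold = normalize_answer(ground_truth).split()
    common = collections.Counter(pred) & collections.Counter(gold)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(gold)
    return 2 * precision * recall / (precision + recall)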
We adopt the Document Reader (DocReader) code from the DrQA GitHub repo.
- Set up DrQA:
$ git clone https://github.com/facebookresearch/DrQA.git
$ cd DrQA; python setup.py develop
- Download the pretrained GloVe embeddings and put them into the data/embeddings directory. You can also run our script to finish this step automatically:
$ chmod +x ../download_glove_embeddings.sh; ../download_glove_embeddings.sh
- Preprocess the train/dev files:
$ python scripts/reader/preprocess.py \
../data/datasets/ \
../data/datasets/ \
--split relation-train-sampled-0.05 \
--tokenizer spacy
$ python scripts/reader/preprocess.py \
../data/datasets/ \
../data/datasets/ \
--split relation-dev \
--tokenizer spacy
- Train the Reader:
$ python scripts/reader/train.py \
--embedding-file glove.840B.300d.txt \
--tune-partial 1000 \
--train-file relation-train-sampled-0.05-processed-spacy.txt \
--dev-file relation-dev-processed-spacy.txt \
--dev-json relation-dev.json \
--random-seed 20 \
--batch-size 16 \
--test-batch-size 16 \
--official-eval True \
--valid-metric exact_match \
--checkpoint True \
--model-dir ../output/drqa-models/relation \
--data-dir ../data/datasets \
--embed-dir ../data/embeddings \
--data-workers 0 \
--max-len 30
- Inference on the test set:
$ python scripts/reader/predict.py \
../data/datasets/relation-test.json \
--model ../output/drqa-models/[YOUR MODEL NAME] \
--batch-size 16 \
--official \
--tokenizer spacy \
--out-dir ../output/drqa-models/ \
--embedding ../data/embeddings/glove.840B.300d.txt
- Evaluate the model. We use the official evaluation script from SQuAD v1.1.
$ cd ..
$ python ./src/evaluate-v1.1.py ./data/datasets/medication-dev.json ./output/drqa-models/predictions.json