PHICON

1. Introduction

This repository contains the source code for the paper "PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation" (accepted at the 3rd Clinical Natural Language Processing Workshop, EMNLP'20). PHICON is a simple yet effective data augmentation method for alleviating the generalization issue in de-identification. It consists of PHI augmentation and Context augmentation (as shown in Figure 1). PHI augmentation creates augmented training corpora by replacing PHI entities with named entities sampled from external sources. Context augmentation creates them by changing the background context through synonym replacement or random word insertion.

Figure 1: Toy examples of our PHICON data augmentation. SR: synonym replacement. RI: random insertion.

2. Usage

Setup

Download the Stanford Parser and update the corresponding path in the rule_modules.py file.
Install the spaCy package.
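
Before running the augmentation scripts, it can help to confirm that both prerequisites are in place. The snippet below is a minimal sketch and not part of this repository: the function name check_setup and the example directory are hypothetical, and the directory should be whatever path you set in rule_modules.py.

```python
import importlib.util
from pathlib import Path


def check_setup(stanford_parser_dir):
    """Report whether the two prerequisites from the Setup steps look available."""
    return {
        # True if the spaCy package can be imported in this environment.
        "spacy": importlib.util.find_spec("spacy") is not None,
        # True if the given Stanford Parser directory exists on disk.
        "stanford_parser": Path(stanford_parser_dir).is_dir(),
    }


if __name__ == "__main__":
    # Hypothetical location; use the path you configured in rule_modules.py.
    print(check_setup("/path/to/stanford-parser"))
```

A False value for either key means the corresponding setup step still needs to be completed.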

PHI Augmentation

The i2b2 2006 and i2b2 2014 de-identification datasets can be accessed from: https://portal.dbmi.hms.harvard.edu.

Data processing mainly follows the guidance from:
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2006
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2014

We also show detailed steps for data processing and PHI augmentation in the following two notebooks:
PHI augmentation-i2b2-2006 dataset.ipynb
PHI augmentation-i2b2-2014 dataset.ipynb

If you already have a de-identification dataset in BIO format, you can run PHI augmentation directly by following the guidance in this notebook:
PHI augmentation-your-own-dataset.ipynb
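
To illustrate the idea behind PHI augmentation on BIO-format data, here is a minimal sketch. It is not the code from the notebooks: the function name augment_phi, the surrogates dictionary, and the labels PATIENT and LOCATION are assumptions made for this example. Each tagged PHI span is replaced with an entity of the same type sampled from an external list, and the BIO tags are regenerated to match the new span length.

```python
import random


def augment_phi(tokens, tags, surrogates, rng):
    """Replace each BIO-tagged PHI span with a sampled surrogate of the same type."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + label:
                j += 1
            pool = surrogates.get(label)
            # Keep the original span when no surrogate exists for this type.
            replacement = rng.choice(pool).split() if pool else tokens[i:j]
            new_tokens.extend(replacement)
            new_tags.extend(["B-" + label] + ["I-" + label] * (len(replacement) - 1))
            i = j
        else:
            # Non-PHI tokens are copied unchanged.
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


if __name__ == "__main__":
    # Toy sentence and toy surrogate lists (hypothetical data).
    tokens = ["Mr.", "John", "Smith", "visited", "Boston", "."]
    tags = ["O", "B-PATIENT", "I-PATIENT", "O", "B-LOCATION", "O"]
    surrogates = {
        "PATIENT": ["Maria Garcia", "Wei Chen"],
        "LOCATION": ["Salt Lake City", "Columbus"],
    }
    print(augment_phi(tokens, tags, surrogates, random.Random(0)))
```

Passing an explicit random.Random instance keeps the augmented corpus reproducible across runs.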

Context Augmentation

Run the Context augmentation script:

python Context_Aug.py
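
The two Context augmentation operations, synonym replacement (SR) and random insertion (RI), can be sketched as follows. This is an illustrative approximation and not the logic of Context_Aug.py: the function names, the toy synonyms dictionary, and the choice to restrict edits to tokens tagged O are assumptions made for this example.

```python
import random


def synonym_replacement(tokens, tags, synonyms, n, rng):
    """Replace up to n non-PHI tokens with a synonym; tags stay unchanged."""
    tokens = list(tokens)
    candidates = [
        i for i, (tok, tag) in enumerate(zip(tokens, tags))
        if tag == "O" and tok.lower() in synonyms
    ]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(synonyms[tokens[i].lower()])
    return tokens, list(tags)


def random_insertion(tokens, tags, synonyms, n, rng):
    """Insert up to n synonyms of non-PHI tokens without splitting PHI spans."""
    tokens, tags = list(tokens), list(tags)
    for _ in range(n):
        sources = [
            tok for tok, tag in zip(tokens, tags)
            if tag == "O" and tok.lower() in synonyms
        ]
        if not sources:
            break
        word = rng.choice(synonyms[rng.choice(sources).lower()])
        # A slot is valid if it is the end of the sentence or if the token
        # currently at that position does not continue an entity (no "I-" tag).
        slots = [
            p for p in range(len(tokens) + 1)
            if p == len(tokens) or not tags[p].startswith("I-")
        ]
        pos = rng.choice(slots)
        tokens.insert(pos, word)
        tags.insert(pos, "O")
    return tokens, tags


if __name__ == "__main__":
    # Toy sentence and toy synonym dictionary (hypothetical data).
    tokens = ["Mr.", "John", "Smith", "visited", "the", "clinic", "."]
    tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O", "O"]
    synonyms = {"visited": ["attended"], "clinic": ["hospital"]}
    rng = random.Random(0)
    print(synonym_replacement(tokens, tags, synonyms, 1, rng))
    print(random_insertion(tokens, tags, synonyms, 1, rng))
```

Both functions leave PHI tokens and their tags untouched, so the augmented sentences remain valid BIO-labeled training examples.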

3. Citation

Please cite the paper if you use the code or any resources in this repository:

@inproceedings{yue2020phicon,
 title={PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation},
 author={Xiang Yue and Shuang Zhou},
 booktitle={Proceedings of the 3rd Clinical Natural Language Processing Workshop},
 year={2020}
}
