PHICON

1. Introduction

This repository contains the source code for the paper "PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation" (accepted at the 3rd Clinical Natural Language Processing Workshop, EMNLP 2020). PHICON is a simple yet effective data augmentation method that alleviates the generalization issue in de-identification. It consists of two components, PHI augmentation and Context augmentation (see Figure 1). PHI augmentation creates augmented training corpora by replacing PHI entities with named entities sampled from external sources. Context augmentation does so by changing the background context through synonym replacement or random word insertion.

Figure 1: Toy examples of our PHICON data augmentation. SR: synonym replacement. RI: random insertion.

2. Usage

Setup

Download the Stanford Parser and update the corresponding path in the rule_modules.py file.
Install the spaCy package.

PHI Augmentation

The i2b2 2006 and i2b2 2014 de-identification datasets can be accessed at: https://portal.dbmi.hms.harvard.edu.

Data processing largely follows the guidance at:
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2006
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2014

Detailed steps for data processing and PHI augmentation are shown in the following two notebooks:
PHI augmentation-i2b2-2006 dataset.ipynb
PHI augmentation-i2b2-2014 dataset.ipynb

If you already have a de-identification dataset in BIO format, you can run PHI augmentation directly by following the guidance in this notebook:
PHI augmentation-your-own-dataset.ipynb
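
For readers who want a feel for the idea before opening the notebooks, here is a minimal sketch of PHI augmentation on BIO-tagged tokens. It is not the notebook code: the function name `phi_augment`, the `surrogates` dictionary, and the entity type names are assumptions made for illustration. Each PHI span is swapped for a surrogate of the same type and the span is re-tagged so the labels stay aligned.

```python
import random


def phi_augment(tokens, tags, surrogates, rng):
    """Replace each PHI span (B-X followed by I-X tags) with a sampled surrogate.

    `surrogates` is a hypothetical mapping from entity type to a list of
    replacement strings (for example, names collected from external sources).
    """
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the current span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            pool = surrogates.get(etype)
            if pool:
                replacement = rng.choice(pool).split()
                new_tokens.extend(replacement)
                new_tags.extend(
                    ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
                )
            else:
                # No surrogate available for this type: keep the span as-is.
                new_tokens.extend(tokens[i:j])
                new_tags.extend(tags[i:j])
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


# Toy example; the sentence and surrogate lists are made up.
tokens = ["Mr.", "John", "Smith", "visited", "Boston", "."]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "B-LOCATION", "O"]
surrogates = {"PATIENT": ["Maria Garcia Lopez"], "LOCATION": ["Springfield"]}
aug_tokens, aug_tags = phi_augment(tokens, tags, surrogates, random.Random(0))
print(list(zip(aug_tokens, aug_tags)))
```

Because a surrogate may have a different number of tokens than the original entity, the tag sequence is rebuilt rather than copied.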

Context Augmentation

python Context_Aug.py
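
The two context operations named in the introduction, synonym replacement (SR) and random insertion (RI), can be sketched roughly as below. This is an illustration only and does not reproduce the logic of Context_Aug.py. The function names, the toy `synonyms` dictionary, and the `vocab` list are assumptions, as is the choice to modify only tokens tagged `O` so that PHI spans are left intact.

```python
import random


def synonym_replace(tokens, tags, synonyms, n, rng):
    """Replace up to n non-PHI tokens with a synonym from a lookup table."""
    tokens = list(tokens)
    candidates = [
        i for i, (tok, tag) in enumerate(zip(tokens, tags))
        if tag == "O" and tok.lower() in synonyms
    ]
    for i in rng.sample(candidates, min(n, len(candidates))):
        tokens[i] = rng.choice(synonyms[tokens[i].lower()])
    return tokens, list(tags)


def random_insert(tokens, tags, vocab, n, rng):
    """Insert n words from vocab at random positions, tagged O."""
    tokens, tags = list(tokens), list(tags)
    for _ in range(n):
        # Inserting before an I- token would split a PHI span, so skip those slots.
        slots = [
            p for p in range(len(tokens) + 1)
            if p == len(tokens) or not tags[p].startswith("I-")
        ]
        p = rng.choice(slots)
        tokens.insert(p, rng.choice(vocab))
        tags.insert(p, "O")
    return tokens, tags


# Toy example; the synonym table and insertion vocabulary are made up.
rng = random.Random(0)
tokens = ["The", "patient", "John", "Smith", "was", "admitted"]
tags = ["O", "O", "B-PATIENT", "I-PATIENT", "O", "O"]
synonyms = {"patient": ["subject"], "admitted": ["hospitalized"]}
sr_tokens, sr_tags = synonym_replace(tokens, tags, synonyms, 2, rng)
ri_tokens, ri_tags = random_insert(tokens, tags, ["also", "then"], 2, rng)
print(sr_tokens)
print(list(zip(ri_tokens, ri_tags)))
```

Both functions return a tag list aligned with the new tokens, so the output can be written back in BIO format.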

3. Citation

Please cite the paper if you use the code or any resources in this repo:

@inproceedings{yue2020phicon,
 title={PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation},
 author={Xiang Yue and Shuang Zhou},
 booktitle={Proceedings of the 3rd Clinical Natural Language Processing Workshop},
 year={2020}
}
