Python 2.7
DyNet version: commit 284838815ece9297a7100cc43035e1ea1b133a5
In the data/ directory, create a directory per language as shown for data/Spanish. Download the CoNLL train/dev/test NER datasets for that language here. To acquire LDC datasets, please get the required access.
For storing the trained models, create a directory saved_models in the parent folder.
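The directory setup described above can be scripted. The following is a minimal sketch, not part of the repository: the helper name make_layout is hypothetical, and it assumes that "the parent folder" means the repository root that contains data/.

```python
import os


def make_layout(repo_root, languages):
    """Create data/<language> directories and saved_models under repo_root.

    Assumption: saved_models sits next to data/ in the repository root.
    Returns the list of directories that now exist.
    """
    paths = [os.path.join(repo_root, "data", lang) for lang in languages]
    paths.append(os.path.join(repo_root, "saved_models"))
    for path in paths:
        # os.makedirs fails if the directory exists, so check first
        # (keeps the sketch usable on Python 2.7 as well as Python 3).
        if not os.path.isdir(path):
            os.makedirs(path)
    return paths
```

For example, make_layout(".", ["Spanish"]) run from the repository root would create data/Spanish and saved_models if they are missing.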
Combine monolingual data acquired from Wikipedia with the plain text extracted from the labeled data. Train 100-d GloVe embeddings on the combined text.
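One way to prepare the text for embedding training is sketched below. It is an illustration under stated assumptions, not the authors' preprocessing: the function names are hypothetical, the labeled files are assumed to be CoNLL-style with the token in the first column and blank lines between sentences, and the Wikipedia text is assumed to be already extracted to plain text with one sentence or paragraph per line.

```python
import io


def conll_to_sentences(conll_lines):
    """Yield one space-joined sentence per blank-line-delimited block."""
    tokens = []
    for line in conll_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("-DOCSTART-"):
            # A blank line (or a CoNLL-2003 document marker) ends a sentence.
            if tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        # Keep the first column (the token); drop the label column(s).
        tokens.append(stripped.split()[0])
    if tokens:
        yield " ".join(tokens)


def build_embedding_corpus(wiki_path, conll_paths, out_path):
    """Write the Wikipedia text followed by the plain text of the labeled data."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        with io.open(wiki_path, encoding="utf-8") as wiki:
            for line in wiki:
                line = line.strip()
                if line:
                    out.write(line + u"\n")
        for path in conll_paths:
            with io.open(path, encoding="utf-8") as conll_file:
                for sentence in conll_to_sentences(conll_file):
                    out.write(sentence + u"\n")
```

The resulting file can then be passed to whichever GloVe training setup you use, with the vector size set to 100.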
The best NER performance was obtained using the fine-tuning training scheme. The scripts below run simulated active learning experiments for different active learning strategies:
cd commands
- ETAL + Partial-CRF + CT (Proposed recipe)
./ETAL_PARTIAL_CRF_CT.sh
- ETAL + Full-CRF + CT
./ETAL_FULL_CRF_CT.sh
- CFEAL + Full-CRF + CT
./CFEAL_PARTIAL_CRF_CT.sh
- SAL + CT
./SAL_CT.sh
Things to note:
We load the vocabulary from the path given by --aug_lang_train_path. Therefore, create a CoNLL-formatted file with dummy labels from the unlabeled text.
For our experiments, we concatenated the transferred data with the unlabeled data (which was the entire training dataset) into a single CoNLL-formatted file.
The CoNLL format is a tab-separated two-column format, as shown below:
El O
grupo O
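The vocabulary file described above could be produced along the following lines. This is a sketch under assumptions, not the authors' script: the function names are hypothetical, "O" is used as the dummy label to match the example, unlabeled sentences are assumed to be whitespace-tokenized unicode strings, and sentences are assumed to be separated by a blank line, which is the usual CoNLL convention but is not stated in this README.

```python
import io


def sentences_to_dummy_conll(sentences, dummy_label="O"):
    """Return CoNLL lines (token<TAB>label) with an empty line after each sentence."""
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        if not tokens:
            continue
        for token in tokens:
            lines.append(token + "\t" + dummy_label)
        lines.append("")  # blank line marks the sentence boundary (assumed)
    return lines


def write_vocab_conll(transferred_conll_path, unlabeled_sentences, out_path):
    """Concatenate the transferred data and dummy-labeled unlabeled text."""
    with io.open(transferred_conll_path, encoding="utf-8") as src:
        content = src.read()
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(content)
        # Make sure the transferred data ends with a sentence boundary.
        if content and not content.endswith(u"\n\n"):
            out.write(u"\n" if content.endswith(u"\n") else u"\n\n")
        for line in sentences_to_dummy_conll(unlabeled_sentences):
            out.write(line + u"\n")
```

The output path would then be the value passed to --aug_lang_train_path.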
The LDC NER label set differs from the CoNLL label set by one tag. Therefore, add --misc to the argument set when running any experiments on CoNLL data. The label set has been hard-coded in the data_loaders/data_loader.py file.
We used the model proposed by (Xie et al. 2018) to get the cross-lingually transferred data from English. Please refer to their code here.
For the Fine-Tune training scheme, train a base NER model on the transferred data as follows:
MODEL_NAME="spanish_full_transfer_baseline"
python -u ../main.py \
--dynet-seed 3278657 \
--word_emb_dim 100 \
--batch_size 10 \
--model_name ${MODEL_NAME} \
--lang es \
--fixedVocab \
--test_conll \
--tot_epochs 1000 \
--aug_lang_train_path $DATA/vocab.conll \
--init_lr 0.015 \
--valid_freq 1300 \
--misc \
--pretrain_emb_path $DATA/esp.vec \
--dev_path $DATA/esp.dev \
--test_path $DATA/esp.test \
--train_path $DIR/transferred_data.conll 2>&1 | tee ${MODEL_NAME}.log
If you make use of this software for research purposes, we would appreciate it if you cite the following:
@inproceedings{chaudhary19emnlp,
title = {A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers},
author = {Aditi Chaudhary and Jiateng Xie and Zaid Sheikh and Graham Neubig and Jaime Carbonell},
booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
address = {Hong Kong},
month = {November},
url = {http://arxiv.org/abs/1908.08983},
year = {2019}
}
For any issues, please feel free to reach out to aschaudh@andrew.cmu.edu.