forked from fastnlp/TENER

Codes for "TENER: Adapting Transformer Encoder for Named Entity Recognition"

zeionara/TENER

 
 

Set up the Russian dataset

mkdir -p ../data/conll2003ru
cp conll2003ru.txt ../data/conll2003ru/train.txt
ln ../data/conll2003ru/train.txt ../data/conll2003ru/test.txt
ln ../data/conll2003ru/train.txt ../data/conll2003ru/dev.txt

Train locally

python train_tener_ru.py --training_dataset conll2003ru-bio-super-distinct --models_folder /home/dima/models/ner

After line 81 of static_embedding.py, set the path to the Russian embeddings:

elif model_dir_or_name == 'ru':
    model_path = '/home/nami/models/ArModel100w2v.txt'
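Before training, it may help to confirm that the embeddings file parses cleanly. The sketch below is not part of this repository: the function name `inspect_embedding_lines` is hypothetical, and it assumes a whitespace-separated text format (one word followed by its vector per line, with an optional word2vec-style header line holding the vocabulary size and dimension).

```python
from typing import Iterable, Optional, Tuple


def inspect_embedding_lines(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number_of_vectors, dimension) for text-format embeddings.

    Assumption: each line is "<word> <v1> <v2> ...", optionally preceded by
    a header line "<vocab_size> <dimension>".
    """
    count = 0
    dim: Optional[int] = None
    for index, raw in enumerate(lines):
        parts = raw.split()
        if not parts:
            continue  # skip empty lines
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # optional header line
        width = len(parts) - 1  # everything after the word is the vector
        if dim is None:
            dim = width
        elif width != dim:
            raise ValueError(
                f"line {index + 1}: expected {dim} values, found {width}"
            )
        count += 1
    return count, dim or 0


# Tiny in-memory example with made-up vectors.
example = ["2 3", "кот 0.1 0.2 0.3", "дом 0.4 0.5 0.6"]
summary = inspect_embedding_lines(example)
```

To check a real file, pass an open file handle instead of the list, for example `inspect_embedding_lines(open(model_path, encoding='utf-8'))`.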

Train on Google's GPU

A Jupyter notebook is available here.

Test

python test_tener_ru.py --training_dataset conll2003ru-bio-super-distinct --testing_dataset conll2003ru-bio-super-distinct --model_file /home/dima/models/ner/bio  --subset dev

Predict on dataset

python predict_tener_ru.py --training_dataset conll2003ru-bio-super-distinct --prediction_dataset conll2003ru-bio-super-distinct --model_file /home/dima/models/ner/bio  --subset dev --output_file preds.txt

Predict on raw text

python extract_entities_ru.py --input raw/text.txt --output predictions.txt

Predict on labelled text

python predict_tener_ru.py --training_dataset conll2003ru-big --prediction_dataset ../../ner-comparison/eval.tagged.txt --model_file /home/dima/model/big --output_file ../../ner-comparison/eval.tagged.tener.txt

TENER: Adapting Transformer Encoder for Named Entity Recognition

This is the code for the paper TENER.

TENER (Transformer Encoder for Named Entity Recognition) is a Transformer-based model for the NER task. Compared with the vanilla Transformer, we found that relative position embedding is quite important for NER. Experiments on English and Chinese NER datasets demonstrate its effectiveness.

Requirements

This project requires fastNLP, a Python package for natural language processing. You can install it with the following command:

pip install fastNLP

Run the code

(1) Prepare the English dataset.

Conll2003

Your file should look like the following (the first token in a line is the word, the last token is the NER tag):

LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
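The column convention above can be sketched in a few lines of Python. This is only an illustration of the format, not the loader that fastNLP uses; the function name `read_conll_columns` is hypothetical, and it assumes that blank lines separate sentences.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (words, tags)


def read_conll_columns(lines: Iterable[str]) -> List[Sentence]:
    """Collect (words, tags) per sentence from CoNLL-style lines.

    The first column is taken as the word and the last column as the NER
    tag; any columns in between (POS, chunk) are ignored.
    """
    sentences: List[Sentence] = []
    words: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence (assumption).
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])
        tags.append(columns[-1])
    if words:  # flush the last sentence if the input has no trailing blank line
        sentences.append((words, tags))
    return sentences


sample = """LONDON NNP B-NP B-LOC
1996-08-30 CD I-NP O

West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-rounder NN I-NP O
Phil NNP I-NP B-PER
"""
parsed = read_conll_columns(sample.splitlines())
```

Here `parsed` holds two sentences; the first pairs the words `LONDON` and `1996-08-30` with the tags `B-LOC` and `O`.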

OntoNotes

We suggest using the following code to prepare your data: OntoNotes-5.0-NER. Alternatively, you can prepare the data in the Conll2003 style and then replace OntoNotesNERPipe with Conll2003NERPipe in the code.

For English datasets, we use the GloVe 100d pretrained embedding. fastNLP will download it automatically.

You can run the code with the following command (make sure you have changed the data path):

python train_tener_en.py --dataset conll2003

or

python train_tener_en.py --dataset en-ontonotes

Although we tried hard to make sure you can reproduce our results, the results may still disappoint you. This usually happens because the best dev performance does not correlate well with the test performance. Running the training several times should help.
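One way to act on this advice is to record the dev and test F1 after every epoch, take the test F1 at the epoch with the best dev F1, and then average that value over several runs. The sketch below is an assumption about how such bookkeeping could look; the function names are hypothetical, the scores are made up for illustration, and nothing here is produced by the training scripts.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple

EpochScores = Tuple[float, float]  # (dev_f1, test_f1) for one epoch


def selected_test_f1(history: Sequence[EpochScores]) -> float:
    """Return the test F1 at the epoch with the highest dev F1."""
    _best_dev, test_at_best = max(history, key=lambda pair: pair[0])
    return test_at_best


def summarize_runs(runs: Sequence[Sequence[EpochScores]]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the selected test F1."""
    scores: List[float] = [selected_test_f1(history) for history in runs]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Made-up scores for three runs, for illustration only.
runs = [
    [(0.90, 0.88), (0.93, 0.89), (0.92, 0.91)],
    [(0.91, 0.90), (0.94, 0.92)],
    [(0.95, 0.90), (0.93, 0.93)],
]
average, spread = summarize_runs(runs)
```

In the first made-up run the best dev epoch is the second one, although the third epoch has the higher test F1; this is the mismatch described above.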

The ELMo version (fastNLP will download the ELMo weights automatically; you only need to change the data path in train_elmo_en.py):

python train_elmo_en.py --dataset en-ontonotes

MSRA, OntoNotes4.0, Weibo, Resume

Your data should have only two columns: the first is the character and the second is the tag, as in the following example:

口 O
腔 O
溃 O
疡 O
加 O
上 O
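A quick check that a file follows this two-column layout might look like the sketch below. The function name `find_malformed_lines` is hypothetical, and the sketch assumes that blank lines separate sentences and are therefore allowed.

```python
from typing import Iterable, List, Tuple


def find_malformed_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every non-blank line that does not
    consist of exactly two whitespace-separated columns."""
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines are treated as sentence separators
        if len(line.split()) != 2:
            problems.append((number, line))
    return problems


well_formed = ["口 O", "腔 O", "", "溃 O"]
broken = ["口 O", "腔", "溃 NN O"]

ok_report = find_malformed_lines(well_formed)
bad_report = find_malformed_lines(broken)
```

An empty report means every non-blank line has a character and a tag; for `broken`, lines 2 and 3 are reported.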

For the Chinese datasets, you can download the pretrained unigram and bigram embeddings from Baidu Cloud. Download 'gigaword_chn.all.a2b.uni.iter50.vec' and 'gigaword_chn.all.a2b.bi.iter50.vec', then replace the embedding paths in train_tener_cn.py.

You can run the code with the following command:

python train_tener_cn.py --dataset ontonotes
