This is an implementation of a simple maximum entropy model for the named entity recognition (NER) task.
- Python 3.6
- NLTK version 3.2 or newer
To enable plotting, the following packages are also needed:
- Matplotlib version 2.0 or newer
- NumPy version 1.13 or newer
Once all dependencies are installed, you'll also need to install this package in editable mode.
git clone https://github.com/kmkurn/ner-task.git # clone this repository
cd ner-task # move into this project directory
pip install -e . # install the package in editable mode
One corpus that can be used is the CoNLL corpus. This corpus has two files, train.conll and dev.conll. The first 20 lines from train.conll are:
-DOCSTART- O

EU ORG
rejects O
German MISC
call O
to O
boycott O
British MISC
lamb O
. O

Peter PER
Blackburn PER

BRUSSELS LOC
1996-08-22 O

The O
European ORG
As we can see, each word is on its own line with its tag, separated by a tab (\t) character. Sentences are separated by a blank line, and documents are separated by a special line with -DOCSTART- as the word. Any corpus that is compatible with this format can be used.
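To make the format concrete, here is a minimal sketch of a reader for it. The function name `read_documents` and the `Sentence` alias are hypothetical and are not part of this package; the sketch only assumes what is described above (one tab-separated word/tag pair per line, blank lines between sentences, and a -DOCSTART- line between documents).

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (word, tag) pairs. This alias is illustrative only.
Sentence = List[Tuple[str, str]]


def read_documents(lines: Iterable[str]) -> List[List[Sentence]]:
    """Group tab-separated word/tag lines into documents of sentences."""
    documents: List[List[Sentence]] = []
    sentence: Sentence = []

    def flush() -> None:
        # Close the current sentence, if any, and attach it to the last document.
        if sentence:
            if not documents:
                documents.append([])
            documents[-1].append(list(sentence))
            sentence.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            flush()  # a blank line ends the current sentence
            continue
        word, tag = line.split("\t")
        if word == "-DOCSTART-":
            flush()
            documents.append([])  # a -DOCSTART- line opens a new document
            continue
        sentence.append((word, tag))
    flush()  # the input may not end with a blank line
    return documents


sample = [
    "-DOCSTART-\tO\n",
    "\n",
    "EU\tORG\n",
    "rejects\tO\n",
    "\n",
    "Peter\tPER\n",
    "Blackburn\tPER\n",
]
docs = read_documents(sample)
print(len(docs), "document(s),", len(docs[0]), "sentence(s)")
```

In practice the lines would come from an open file, for example `read_documents(open("train.conll", encoding="utf-8"))`.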
The src/corpus.py script can print a corpus summary and perform sampling (of sentences, or of words having a certain tag). To print a corpus summary, use
python src/corpus.py summarize [corpus file, e.g. train.conll]
This will print statistics of the corpus file, such as the number of sentences and words. To sample sentences, use
python src/corpus.py --size 5 sample [corpus file]
This will sample 5 sentences from the corpus file. To sample words instead, use
python src/corpus.py --size 5 -w sample [corpus file]
By default, this will sample only words with tag O. To specify another tag, use the -t option. For more info, run python src/corpus.py -h.
The src/vocab.py script can be used to unkify the corpus:
python src/vocab.py train.conll train.conll > train.conll.unk
python src/vocab.py train.conll dev.conll > dev.conll.unk
The first argument is the training file from which the vocabulary will be built. The second argument is the corpus file to unkify. All words that are not in the vocabulary will be converted into a special UNK token. By default, this token is -UNK-. Specify another token with the --unk-token [UNK token] option. Also, by default, only words that occur at least twice in the training file are included in the vocabulary. To change this threshold, use --min-count [cutoff]. As usual, more info can be viewed by running python src/vocab.py -h.
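The unkifying step can be sketched as follows. This is an illustration of the behavior described above, not the code in src/vocab.py; the function names `build_vocab` and `unkify` are hypothetical, and only the default minimum count of 2 and the default token -UNK- come from the description.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(words: Iterable[str], min_count: int = 2) -> Set[str]:
    """Keep only words that occur at least `min_count` times in the training data."""
    counts = Counter(words)
    return {word for word, count in counts.items() if count >= min_count}


def unkify(words: Iterable[str], vocab: Set[str], unk_token: str = "-UNK-") -> List[str]:
    """Replace every word that is not in the vocabulary with the UNK token."""
    return [word if word in vocab else unk_token for word in words]


# Toy training words: "EU" and "call" occur twice, the others only once.
train_words = ["EU", "rejects", "German", "call", "EU", "call"]
vocab = build_vocab(train_words)
print(unkify(["EU", "lamb", "call", "German"], vocab))
```

Note that the vocabulary is always built from the training file, even when unkifying dev.conll, so words seen only in the development set are also mapped to the UNK token.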
Model training is provided by the src/main.py script. The full usage of this script is
usage: main.py [-h] --model-name {majority,memo,maxent} --corpus CORPUS
               --model-path MODEL_PATH [--mode {train,test}] [--cutoff CUTOFF]
               [--max-iter MAX_ITER] [--contexts [CONTEXTS [CONTEXTS ...]]]

The main script to run NER models

optional arguments:
  -h, --help            show this help message and exit
  --model-name {majority,memo,maxent}, -n {majority,memo,maxent}
                        model name
  --corpus CORPUS, -c CORPUS
                        path to corpus file
  --model-path MODEL_PATH, -m MODEL_PATH
                        path to save/load the trained model
  --mode {train,test}   whether to do training or testing/inference (default:
                        train)
  --cutoff CUTOFF       feature count cutoff for maxent (default: 2)
  --max-iter MAX_ITER   max number of training iteration for maxent (default:
                        50)
  --contexts [CONTEXTS [CONTEXTS ...]]
                        contexts to include as features for maxent (default:
                        -2 -1 0 1 2)
To train the baseline model (which only memorizes word-tag assignment in the training data), run
python src/main.py -n memo -c train.conll.unk -m memo-model.pkl --mode train > train-memo.log 2>&1
This will save the trained model to memo-model.pkl and the training log to train-memo.log. Similarly, training the maximum entropy model can be done by
python src/main.py -n maxent -c train.conll.unk -m maxent-model.pkl --mode train > train-maxent.log 2>&1
By default, this will use 5 features (the current word, the two words before it, and the two words after it). The --contexts option can be used to specify which words to include as features. The --cutoff option sets the minimum number of times a feature must occur to be included in the model (features occurring fewer times than the cutoff are discarded), and the --max-iter option sets the number of training iterations.
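As a rough illustration of what the context and cutoff options control, the sketch below extracts one feature per context offset and then drops rare features. It is not the code used by src/main.py: the function names, the feature naming scheme `w[offset]`, the `<PAD>` placeholder for positions outside the sentence, and the choice to count (feature, value) pairs are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def extract_features(
    words: Sequence[str],
    index: int,
    contexts: Sequence[int] = (-2, -1, 0, 1, 2),
    pad: str = "<PAD>",  # hypothetical placeholder for out-of-sentence positions
) -> Dict[str, str]:
    """Return one feature per context offset for the word at `index`."""
    features = {}
    for offset in contexts:
        position = index + offset
        inside = 0 <= position < len(words)
        features[f"w[{offset}]"] = words[position] if inside else pad
    return features


def apply_cutoff(feature_dicts: List[Dict[str, str]], cutoff: int = 2) -> List[Dict[str, str]]:
    """Discard (feature, value) pairs that occur fewer than `cutoff` times overall."""
    counts = Counter(
        (name, value) for feats in feature_dicts for name, value in feats.items()
    )
    return [
        {name: value for name, value in feats.items() if counts[(name, value)] >= cutoff}
        for feats in feature_dicts
    ]


words = ["EU", "rejects", "German", "call"]
all_features = [extract_features(words, i) for i in range(len(words))]
print(all_features[0])
print(apply_cutoff(all_features)[0])
```

With contexts set to `(0,)`, for example, only the current word would be used as a feature.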
To evaluate the model against a development/testing set, the same src/main.py script can be used. For example,
python src/main.py -n maxent -c dev.conll.unk -m maxent-model.pkl --mode test > output-maxent.conll 2> test-maxent.log
This will write the predicted tags of the words in the dev.conll.unk file to output-maxent.conll in the same tab-delimited format, and the log messages to test-maxent.log. The log file will also contain the precision, recall, F1 score, and the confusion matrix of the model on the development set.
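For readers unfamiliar with these metrics, here is a minimal sketch of per-tag precision, recall, and F1 computed at the word level. Whether the scripts in this package score per word or per entity, and how they average across tags, is not stated here, so treat the function `per_tag_scores` as a hypothetical illustration rather than a description of the actual implementation.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_tag_scores(
    gold: Sequence[str], predicted: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for each tag, scored word by word."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_tag, pred_tag in zip(gold, predicted):
        if gold_tag == pred_tag:
            true_pos[gold_tag] += 1
        else:
            false_pos[pred_tag] += 1  # predicted this tag, but it was wrong
            false_neg[gold_tag] += 1  # missed the true tag

    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        predicted_total = true_pos[tag] + false_pos[tag]
        gold_total = true_pos[tag] + false_neg[tag]
        # Scores are defined as 0.0 when the denominator would be zero.
        precision = true_pos[tag] / predicted_total if predicted_total else 0.0
        recall = true_pos[tag] / gold_total if gold_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


gold = ["ORG", "O", "MISC", "O", "PER"]
predicted = ["ORG", "O", "O", "O", "LOC"]
for tag, (precision, recall, f1) in per_tag_scores(gold, predicted).items():
    print(f"{tag}\tP={precision:.2f}\tR={recall:.2f}\tF1={f1:.2f}")
```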
A more complete evaluation is provided by the src/evaluation.py script. For example,
python src/evaluation.py -v dev.conll.unk output-maxent.conll > report-maxent.out 2> report-maxent.err
This will output the scores (like those in test-maxent.log) to report-maxent.out and a list of misclassified words (along with their true and predicted tags) to report-maxent.err. To plot the confusion matrix and save it to a file, use the --save-cm-to [filename] option. Run python src/evaluation.py -h for more info.
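The comparison between a gold file and a prediction file can be sketched as below. This is only an illustration, not the code in src/evaluation.py: the function name `compare` is hypothetical, and the sketch assumes that both files are aligned line by line (including blank lines) and use the tab-delimited format described earlier.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def compare(
    gold_lines: Iterable[str], pred_lines: Iterable[str]
) -> Tuple[Counter, List[Tuple[str, str, str]]]:
    """Count (true tag, predicted tag) pairs and collect misclassified words."""
    confusion: Counter = Counter()
    errors: List[Tuple[str, str, str]] = []
    for gold_line, pred_line in zip(gold_lines, pred_lines):
        if not gold_line.strip():
            continue  # skip blank lines that separate sentences
        word, gold_tag = gold_line.rstrip("\n").split("\t")
        _, pred_tag = pred_line.rstrip("\n").split("\t")
        confusion[(gold_tag, pred_tag)] += 1
        if gold_tag != pred_tag:
            errors.append((word, gold_tag, pred_tag))
    return confusion, errors


gold_lines = ["EU\tORG\n", "rejects\tO\n", "\n", "Peter\tPER\n"]
pred_lines = ["EU\tORG\n", "rejects\tO\n", "\n", "Peter\tO\n"]
confusion, errors = compare(gold_lines, pred_lines)
for word, gold_tag, pred_tag in errors:
    print(f"{word}\ttrue={gold_tag}\tpredicted={pred_tag}")
```

The `confusion` counter holds the cells of a confusion matrix keyed by (true tag, predicted tag), which is the kind of data a plotting step would consume.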
This software is licensed under the MIT license. See LICENSE.txt for the full text.