Code by Thai-Hoang Pham at Alt Inc. (It reuses some code from another repository.)
A demo website is available at nnvlp.org
NNVLP is a Python implementation of the system described in the paper NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit. The system handles common sequence labeling tasks for Vietnamese: part-of-speech (POS) tagging, chunking, and named entity recognition (NER). Its architecture combines a bi-directional Long Short-Term Memory (Bi-LSTM) network, a Conditional Random Field (CRF), and word embeddings formed by concatenating pre-trained word embeddings learned with the skip-gram model and character-level word features learned by a Convolutional Neural Network (CNN).
Figure 1: The CNN layer for extracting character-level word features of the word Học_sinh (Student).

Figure 2: The Bi-LSTM-CRF layers for the input sentence Anh rời EU hôm qua (UK left EU yesterday).
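
As a rough illustration of how the word representation described above could be assembled, the NumPy sketch below computes character-level features with a 1-D convolution followed by max-over-time pooling and concatenates them with a word embedding. The dimensions, the random weights, and the function names are assumptions made for this example; it is not the toolkit's Theano/Lasagne code, and the Bi-LSTM and CRF layers are not shown.

```python
import numpy as np


def char_cnn_features(char_embeds, filters):
    """Max-over-time pooled 1-D convolution over one word's characters.

    char_embeds: (num_chars, char_dim) character embeddings of the word
    filters:     (num_filters, window, char_dim) convolution filters
    """
    num_filters, window, char_dim = filters.shape
    # Zero-pad both ends so words shorter than the window still give a value.
    pad = np.zeros((window - 1, char_dim))
    padded = np.vstack([pad, char_embeds, pad])
    # Collect every window of consecutive characters.
    windows = np.stack([padded[i:i + window]
                        for i in range(padded.shape[0] - window + 1)])
    # conv[t, f] = sum over window positions and embedding dimensions.
    conv = np.einsum("twd,fwd->tf", windows, filters)
    # Keep the strongest response of each filter over all positions.
    return conv.max(axis=0)


def word_representation(word_embed, char_embeds, filters):
    """Concatenate the word embedding with the character-level features."""
    return np.concatenate([word_embed, char_cnn_features(char_embeds, filters)])


# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
word_dim, char_dim, num_filters, window = 300, 30, 30, 3
word = "Học_sinh"
word_embed = rng.normal(size=word_dim)
char_embeds = rng.normal(size=(len(word), char_dim))
filters = rng.normal(size=(num_filters, window, char_dim))

vec = word_representation(word_embed, char_embeds, filters)
print(vec.shape)
```

In a full model, one such vector per word would form the input sequence of the Bi-LSTM layer.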
Our system achieves an accuracy of 91.92% for POS tagging and F1 scores of 84.11% and 92.91% for chunking and NER, respectively.
The following tables compare the performance of NNVLP with previous toolkits on the POS tagging, chunking, and NER tasks, respectively.

POS tagging:

System | Accuracy (%) |
---|---|
Vitk | 88.41 |
vTools | 90.73 |
RDRPOSTagger | 91.96 |
NNVLP | 91.92 |

Chunking (precision P, recall R, and F1, in %):

System | P | R | F1 |
---|---|---|---|
vTools | 82.79 | 83.55 | 83.17 |
NNVLP | 83.93 | 84.28 | 84.11 |

NER (precision P, recall R, and F1, in %):

System | P | R | F1 |
---|---|---|---|
Vitk | 88.36 | 89.20 | 88.78 |
vie-ner-lstm | 91.09 | 93.03 | 92.05 |
NNVLP | 92.76 | 93.07 | 92.91 |
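
For readers who want to relate the P, R, and F1 columns, the short sketch below recomputes F1 as the harmonic mean of precision and recall, using the values from the chunking and NER tables above. The function name is chosen for this example. Because the reported P and R are already rounded, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the tables above, in percent.
rows = {
    "vTools (chunking)": (82.79, 83.55, 83.17),
    "NNVLP (chunking)": (83.93, 84.28, 84.11),
    "Vitk (NER)": (88.36, 89.20, 88.78),
    "vie-ner-lstm (NER)": (91.09, 93.03, 92.05),
    "NNVLP (NER)": (92.76, 93.07, 92.91),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```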
This software depends on NumPy, Theano, and Lasagne. You must install them before using NNVLP.
The simplest way to install them is with pip:
$ pip install -U numpy theano lasagne
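
After installing, a quick check like the one below can confirm that Python can find the three packages. This is a minimal sketch using only the standard library; the helper name is made up for this example, and it only checks that the packages are discoverable, not that their versions are compatible with each other.

```python
import importlib.util

# Import names of the dependencies listed above.
REQUIRED = ("numpy", "theano", "lasagne")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies found.")
```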
NNVLP expects input data in CoNLL format. The POS tagging corpus has two columns: word and POS tag. The chunking corpus has three columns: word, POS tag, and chunk. The NER corpus has four columns, in this order: word, POS tag, chunk, and named entity. The table below shows an example Vietnamese sentence from the NER corpus.
Word | POS | Chunk | NER |
---|---|---|---|
Từ | E | B-PP | O |
Singapore | NNP | B-NP | B-LOC |
, | CH | O | O |
chỉ | R | O | O |
khoảng | N | B-NP | O |
vài | L | B-NP | O |
chục | M | B-NP | O |
phút | Nu | B-NP | O |
ngồi | V | B-VP | O |
phà | N | B-NP | O |
là | V | B-VP | O |
đến | V | B-VP | O |
được | R | O | O |
Batam | NNP | B-NP | B-LOC |
. | CH | O | O |
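
To make the column layout concrete, here is a minimal sketch of a reader for such files, plus a helper that collects entity spans from the NER column. It assumes whitespace-separated columns, one token per line, and a blank line between sentences; those details and both function names are assumptions for this example and may differ from how NNVLP itself loads data.

```python
def read_conll(lines, num_columns):
    """Group CoNLL-style lines into sentences, each a list of column tuples.

    Assumes whitespace-separated columns and a blank line between sentences.
    """
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != num_columns:
            raise ValueError(
                f"expected {num_columns} columns, got {len(fields)}: {line!r}")
        current.append(tuple(fields))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(tags):
    """Collect (label, start, end) spans from BIO tags; end is exclusive."""
    spans, start, label = [], None, None
    # The extra "O" acts as a sentinel that closes a span ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# First three tokens of the example sentence above (NER corpus, four columns).
sample = """Từ E B-PP O
Singapore NNP B-NP B-LOC
, CH O O
"""

sentences = read_conll(sample.splitlines(), num_columns=4)
words = [row[0] for row in sentences[0]]
ner_tags = [row[3] for row in sentences[0]]
print(words)
print(entity_spans(ner_tags))
```

Passing `num_columns=2` or `num_columns=3` would handle the POS tagging and chunking corpora in the same way.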
To access the full dataset of VLSP, you need to sign the user agreement of the VLSP consortium.
You can run NNVLP with the following shell commands:
For POS tagging:
$ bash pos.sh
For chunking:
$ bash chunk.sh
For NER:
$ bash ner.sh
Arguments in these scripts:
- train_dir: path for training data
- dev_dir: path for development data
- test_dir: path for testing data
- word_dir: path for word dictionary
- vector_dir: path for vector dictionary
- char_embedd_dim: character embedding dimension
- num_units: number of hidden units for LSTM
- num_filters: number of filters for CNN
- grad_clipping: gradient clipping
- peepholes: whether to use peepholes (True or False)
- learning_rate: learning rate
- decay_rate: decay rate
- dropout: whether to apply dropout to the input data (True or False)
- batch_size: size of the input batch used for training
- patience: number used for early stopping in the training stage
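
To show how the arguments listed above fit together, the sketch below declares them with `argparse`. Only the argument names come from this README; the `--name` flag syntax, the int/float types, the boolean parsing, and every value in the example call are assumptions for illustration and are not taken from the actual scripts.

```python
import argparse


def str2bool(value):
    """Parse the True/False strings used for the boolean options."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Hypothetical parser; types and flag syntax are assumptions."""
    parser = argparse.ArgumentParser(description="Hypothetical NNVLP-style options")
    for name in ("train_dir", "dev_dir", "test_dir", "word_dir", "vector_dir"):
        parser.add_argument(f"--{name}", required=True)
    for name in ("char_embedd_dim", "num_units", "num_filters",
                 "batch_size", "patience"):
        parser.add_argument(f"--{name}", type=int, required=True)
    for name in ("grad_clipping", "learning_rate", "decay_rate"):
        parser.add_argument(f"--{name}", type=float, required=True)
    for name in ("peepholes", "dropout"):
        parser.add_argument(f"--{name}", type=str2bool, required=True)
    return parser


# Placeholder paths and values, not the scripts' real settings.
example = [
    "--train_dir", "data/train.txt",
    "--dev_dir", "data/dev.txt",
    "--test_dir", "data/test.txt",
    "--word_dir", "embedding/words.txt",
    "--vector_dir", "embedding/vectors.npy",
    "--char_embedd_dim", "30",
    "--num_units", "300",
    "--num_filters", "30",
    "--grad_clipping", "5.0",
    "--peepholes", "False",
    "--learning_rate", "0.01",
    "--decay_rate", "0.05",
    "--dropout", "True",
    "--batch_size", "10",
    "--patience", "5",
]

args = build_parser().parse_args(example)
print(args.num_units, args.learning_rate, args.dropout)
```

To see the values actually used for each task, open pos.sh, chunk.sh, and ner.sh and read the arguments they pass.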
Note: The first time you run NNVLP, the system automatically downloads Vietnamese word embeddings from the internet. This may take a long time because the embedding set is about 1 GB. If the automatic download fails, you can download the files manually (vector, unknown vector, word) and put them into the embedding directory.
@inproceedings{Pham:2017b,
title={NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit},
author={Thai-Hoang Pham and Xuan-Khoai Pham and Tuan-Anh Nguyen and Phuong Le-Hong},
booktitle={Proceedings of The 8th International Joint Conference on Natural Language Processing},
year={2017},
}
@inproceedings{Pham:2017a,
title={End-to-end Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-level vs. Character-level},
author={Thai-Hoang Pham and Phuong Le-Hong},
booktitle={Proceedings of The 15th International Conference of the Pacific Association for Computational Linguistics},
year={2017},
}
Thai-Hoang Pham <phamthaihoang.hn@gmail.com>
Alt Inc, Hanoi, Vietnam