Subword encoding for Word Segmentation using Lattice LSTM.
Models and results can be found in our paper Subword Encoding in Lattice LSTM for Chinese Word Segmentation.
Python: 2.7
PyTorch: 0.3.0
Data is in CoNLL format (the BMES tag scheme is preferred), with one character and its label per line. Sentences are separated by a blank line.
中 B-SEG
国 E-SEG
最 B-SEG
大 E-SEG
氨 B-SEG
纶 M-SEG
丝 E-SEG
生 B-SEG
产 E-SEG
基 B-SEG
地 E-SEG
在 S-SEG
连 B-SEG
云 M-SEG
港 E-SEG
建 B-SEG
成 E-SEG
新 B-SEG
华 M-SEG
社 E-SEG
北 B-SEG
京 E-SEG
十 B-SEG
二 M-SEG
月 E-SEG
二 B-SEG
十 M-SEG
六 M-SEG
日 E-SEG
电 S-SEG
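To make the format concrete, here is a minimal sketch of reading such a file and recovering the segmented words from the BMES tags. The function names `read_bmes_sentences` and `bmes_to_words` are hypothetical helpers written for illustration and are not part of this repository. It assumes each non-blank line holds one character and one tag separated by whitespace.

```python
# -*- coding: utf-8 -*-
# Illustrative helpers only; these names are assumptions, not part of this repo.

def read_bmes_sentences(lines):
    """Yield sentences as lists of (char, tag) pairs; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        char, tag = line.split()
        sentence.append((char, tag))
    if sentence:
        yield sentence


def bmes_to_words(sentence):
    """Join characters into words: B begins, M continues, E ends, S is a single-character word."""
    words, current = [], ""
    for char, tag in sentence:
        prefix = tag.split("-")[0]
        if prefix in ("B", "S") and current:
            # A new word starts before the previous one was closed; flush it.
            words.append(current)
            current = ""
        current += char
        if prefix in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


sample = [
    "中 B-SEG", "国 E-SEG", "最 B-SEG", "大 E-SEG",
    "",
    "在 S-SEG", "连 B-SEG", "云 M-SEG", "港 E-SEG",
]
sentences = list(read_bmes_sentences(sample))
segmented = [bmes_to_words(s) for s in sentences]
```

For a file on disk, the same functions can be applied to the lines of an opened file (decoded as UTF-8).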
The pretrained character and word embeddings are the same as those used in the baseline of RichWordSegmentor:
- Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec): Google Drive or Baidu Pan
- Character bigram embeddings (gigaword_chn.all.a2b.bi.ite50.vec): Google Drive or Baidu Pan
- Word embeddings (ctb.50d.vec): Google Drive or Baidu Pan
- Subword(BPE) embeddings: zh.wiki.bpe.op200000.d50.w2v.txt
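As a quick sanity check on a downloaded embedding file, the sketch below loads vectors into a dictionary. It assumes the common plain-text layout (one token per line followed by space-separated floats, optionally preceded by a word2vec-style header line); this layout is an assumption and has not been verified against the files listed above. `load_text_embeddings` is a hypothetical helper, not a function from this repository.

```python
# -*- coding: utf-8 -*-
# Illustrative helper only; the file layout is an assumption (see the note above).

def load_text_embeddings(lines, dim=50):
    """Return {token: [float, ...]} from lines of the form 'token v1 v2 ... v_dim'."""
    embeddings = {}
    for line in lines:
        fields = line.rstrip().split(" ")
        if len(fields) != dim + 1:
            # Skips a 'count dim' header line and any malformed row.
            continue
        embeddings[fields[0]] = [float(x) for x in fields[1:]]
    return embeddings


# Tiny in-memory example with 3-dimensional vectors.
demo_lines = [
    "2 3",
    "中 0.1 0.2 0.3",
    "国 0.4 0.5 0.6",
]
demo = load_text_embeddings(demo_lines, dim=3)
```

The embeddings named above are 50-dimensional according to their file names, so `dim=50` is the default here.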
- Download the character embeddings, character bigram embeddings, and BPE (or word) embeddings, and set their directories in main.py.
- Modify run_seg.py by adding your train/dev/test file directories.
- Run: sh run_seg.py
Cite our paper as:
@inproceedings{yang2019subword,
title={Subword Encoding in Lattice LSTM for Chinese Word Segmentation},
author={Jie Yang and Yue Zhang and Shuailong Liang},
booktitle={NAACL},
year={2019}
}