Chinese embeddings and segmentation research

This repo contains code and notes related to experiments with Chinese embeddings. We would like to use embeddings of Chinese words as input to our disambiguation and involvement models. However, since Chinese is not a naturally segmented language (there are no spaces between words), this is problematic. For any questions about this repo, contact Jan Bogar.

This README is organized as follows:

  1. Papers and other sources I read, with notes
  2. Experiments with segmentation
  3. Conclusions from the experiments and where to go next
  4. For the future: where to get data

Papers

With segmentation

Fasttext

Primer (just a blog)

Chinese NER

Character-enhanced Chinese embeddings

Cw2vec

Without segmentation

Sembei

Combination of Sembei and Fasttext

Ngram2vec

Experiments with segmentation

Idea: preprocess the whole index with segmentation. How much would it cost? How long would it take?

The indexed zhtw Factiva corpus is about 120 GB in total (there is also a zhcn Factiva corpus). The Jieba segmenter (https://github.com/fxsjy/jieba) claims a throughput of 1.5 MB/s, which means the whole indexed zhtw Factiva corpus could be segmented in roughly a day on a single machine.

Jieba also has Java and Rust implementations, which might be even faster.
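
As a quick sanity check of that throughput claim (and a minimal sketch of the preprocessing step itself), something like the following could be used. The file names are placeholders, not paths from this repo.

```python
import time
import jieba

def segment_file(in_path, out_path):
    """Segment a plain-text file line by line and write space-separated tokens."""
    read_bytes = 0
    start = time.time()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            read_bytes += len(line.encode("utf-8"))
            # jieba.cut yields tokens in their original order
            fout.write(" ".join(jieba.cut(line.strip())) + "\n")
    elapsed = time.time() - start
    print(f"throughput: {read_bytes / 1e6 / elapsed:.2f} MB/s")

# hypothetical file names
segment_file("articles_zhtw.txt", "articles_zhtw.segmented.txt")
```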

Accuracy of segmentation

In this notebook I compared the accuracy of the Jieba segmenter and the Stanford NLP segmenter against an annotated dataset: https://github.com/oapio/nlp-chinese-experiments/blob/master/segmenters%20test.ipynb

Results:

  • average boundaries per sentence: 26.3
  • average wrong boundaries, Jieba: 4.38 (ratio 0.167)
  • average wrong boundaries, Stanford: 3.49 (ratio 0.133)
  • dataset size: 46364 sentences
  • sentences without any error: 4622 for Jieba, 4843 for Stanford NLP
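
The linked notebook contains the actual evaluation. Purely for illustration, the sketch below shows one plausible way to count wrong boundaries: compare the character offsets at which two segmentations of the same sentence place word boundaries. The metric in the notebook may be defined slightly differently.

```python
def boundary_offsets(tokens):
    """Character offsets at which word boundaries fall (excluding sentence start and end)."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets

def wrong_boundaries(gold_tokens, pred_tokens):
    """Number of boundaries on which the two segmentations disagree (symmetric difference)."""
    return len(boundary_offsets(gold_tokens) ^ boundary_offsets(pred_tokens))

# toy example with a made-up gold segmentation
gold = ["我", "喜欢", "自然语言", "处理"]
pred = ["我", "喜欢", "自然", "语言处理"]
print(wrong_boundaries(gold, pred))  # -> 2
```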

Word length distribution

https://github.com/oapio/nlp-chinese-experiments/blob/master/word%20lengths.ipynb

Results: 95 % of words have three characters or fewer, and 90 % have two characters or fewer.
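
For illustration only, a minimal sketch of how such a distribution can be computed from space-separated segmented text (the linked notebook is the authoritative version):

```python
from collections import Counter

def word_length_shares(segmented_lines):
    """Print the cumulative share of words with at most k characters."""
    counts = Counter()
    for line in segmented_lines:
        for word in line.split():
            counts[len(word)] += 1
    total = sum(counts.values())
    running = 0
    for k in sorted(counts):
        running += counts[k]
        print(f"<= {k} chars: {running / total:.1%}")

# toy input; the notebook runs this over the segmented corpus
word_length_shares(["我 喜欢 自然语言 处理"])
```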

Conclusion

The majority of approaches to Chinese NLP (including embeddings) assume segmentation of Chinese sentences as a first step. In light of that, I evaluated two tools for Chinese segmentation: the Stanford NLP Segmenter and Jieba.

On my human-annotated dataset, only about 10 % of sentences are segmented without any error. Even allowing for the fact that segmentation rules are not always clear-cut, that is an alarmingly low success ratio. Boundary-level accuracy is below 90 % for both Jieba and the Stanford segmenter.

Jieba would be capable of segmenting the whole Factiva corpus in 1-2 days on a single machine. There are three possible ways to use Chinese embeddings in our pipeline:

Segmentation + embeddings with Fasttext

  • Segmentation could simplify other NLP tasks.
  • This would keep the Chinese pipeline similar to the English pipeline, which would simplify development.
  • Fasttext embeddings for Chinese words are freely available and would be easy to use. Training the embeddings on our own datasets would also be relatively easy (a sketch follows this list).
  • Fasttext might also be partially immune to the effects of erroneous segmentation, since it uses subword information when learning embeddings, and therefore might assign an approximately correct vector even to a word that is incomplete or has extra characters attached. This is, however, an untested hypothesis.
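
A minimal sketch of this option, assuming Jieba for segmentation and gensim's FastText implementation for training. The corpus path and hyperparameters are placeholders, not settings used in this repo.

```python
import jieba
from gensim.models import FastText

# hypothetical corpus file: one raw (unsegmented) Chinese article per line;
# for the full corpus this should be streamed rather than loaded into memory
with open("articles_zhtw.txt", encoding="utf-8") as f:
    sentences = [list(jieba.cut(line.strip())) for line in f]

model = FastText(
    sentences,
    vector_size=100,   # called "size" in gensim < 4.0
    window=5,
    min_count=5,
    min_n=1,           # character n-grams of length 1-3 cover the word-length
    max_n=3,           # distribution measured above
)

vector = model.wv["自然语言"]  # subword information also yields vectors for unseen words
```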

Segmentation-free approach

  • Segmentation is unreliable and introduces another source of error early in the pipeline.
  • Implementations of segmentation-free approaches are few. Training our own embeddings would require a lot more effort (at best about a week of research and coding for a working prototype, unless Ngram2vec proves to be a viable option).
  • The pipeline would be simplified (but would also diverge from the English pipeline) and precision could potentially be higher.

Use Fasttext on unsegmented text (e.g. use all n-grams in the text instead of words)

  • Very easy
  • Likely worse than both of the above

My recommendation is to use segmentation + Fasttext, plus Fasttext on unsegmented text, as a reasonable first step. As a second iteration I would focus on Ngram2vec and trick it into treating each character as a separate word: since Ngram2vec is supposed to learn embeddings for n-grams of words, it would instead learn embeddings for n-grams of characters. Since it is likely to work almost out of the box, it is a reasonable segmentation-free approach, and we would get all the utilities shipped with it for free.
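
The character-as-word trick amounts to a trivial preprocessing step: insert a space between every pair of characters and feed the result to Ngram2vec (or to Fasttext, for the unsegmented variant) unchanged. A minimal sketch:

```python
def characters_as_words(line):
    """Insert spaces so that downstream tools treat every character as a separate 'word'."""
    return " ".join(ch for ch in line.strip() if not ch.isspace())

print(characters_as_words("自然语言处理"))  # -> 自 然 语 言 处 理
```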

For the future: where to get data

For any future Chinese embeddings research, we will need a huge corpus of raw Chinese text. Luckily, we have a whole Factiva clone in jsonl format in Google Storage (also indexed in Elasticsearch).

The data is described at the beginning of this document: https://docs.google.com/document/d/1j_5AYKNEM0tbRgixkmM1OzGjLbUHWIB52kYPRzCdVbY/edit#heading=h.xtuqoz5uvrzr

Link to the data is: https://console.cloud.google.com/storage/browser/factiva-snapshots-processed/snapshot-full?project=merlon-182319&authuser=0&pli=1&angularJsUrl=%2Fstorage%2Fbrowser%2Ffactiva-snapshots-processed%3Fproject%3Dmerlon-182319%26authuser%3D1%26pli%3D1

To download the data and for other operations, I strongly recommend the gsutil tool. If you will train embeddings in Google Cloud, you don't have to download the data at all, so just download one month or one year for experiments. It is really just a bunch of jsonl files with one article per line, sorted into directories by language, year and month.
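
For illustration, a minimal sketch of reading one of these files; the local path and the article field name are assumptions, not the actual schema (see the document linked above for that).

```python
import json

# hypothetical local path following the language/year/month layout
with open("snapshot-full/zhtw/2018/01/articles.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)      # one article per line
        text = article.get("body", "")  # field name is an assumption
        # ... feed `text` to segmentation / embedding training
```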

The contact person for the data is Michal Nanasi.
