Skip to content

Get started N-grams with insuranceqa-corpus and srilm toolkit

License

Notifications You must be signed in to change notification settings

hailiang-wang/n-grams-get-started

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

n-grams-get-started

Welcome

N-gram介绍: http://blog.csdn.net/lengyuhong/article/details/6022053

Ngram语言模型: https://flystarhe.github.io/2016/08/16/ngram/

N-gram应用

Google N-grams Downloader: http://blog.csdn.net/liweibin1994/article/details/77387485

n元模型与word2vec http://xiaosheng.me/2017/06/08/article69/

Install tools

srilm

Download srilm toolkit 1.7.0 to downloads/srilm-1.7.0.tgz.

scripts/install_srilm.sh # verified on Mac OSX and Ubuntu 16.04
source tools/env.sh

Corpus

corpus/insuranceqa/ngrams/iqa.ngram.vocab 词汇

corpus/insuranceqa/ngrams/iqa.ngram.valid 验证: fit hyper params

corpus/insuranceqa/ngrams/iqa.ngram.train 训练: train language models

corpus/insuranceqa/ngrams/iqa.ngram.test 测试: evaluate language models

Train

Count File

$ ./scripts/srilm_0_count.sh # almost run 5seconds on my laptop, i5 cpu, 8GB mem.
corpus/iqa.ngram.train: line 39445: 39445 sentences, 2.16426e+06 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1

Language Model

$ ./scripts/srilm_1_lm.sh # almost run 5mins on my laptop, i5 cpu, 8GB mem.
CONTEXT 此 支付 numerator 0.389585 denominator 0.988694 BOW -0.404459
CONTEXT 此 是 numerator 0.383682 denominator 0.887665 BOW -0.364277
CONTEXT 此 找到 numerator 0.389585 denominator 0.942731 BOW -0.383785
CONTEXT 此 具有 numerator 0.325904 denominator 0.935851 BOW -0.458116
writing 24999 1-grams
writing 113392 2-grams
writing 98471 3-grams

Calculate perplexity for test data

$ ./scripts/srilm_2_ppl.sh
reading 24999 1-grams
reading 522852 2-grams
reading 1337703 3-grams
        p( </s> | 否定 ...)     = [3gram] 0.029511 [ -1.53002 ]
1 sentences, 58 words, 0 OOVs
0 zeroprobs, logprob= -111.2 ppl= 76.6906 ppl1= 82.649
58 words, rank1= 0.206897 rank5= 0.37931 rank10= 0.534483
59 words+sents, rank1wSent= 0.20339 rank5wSent= 0.389831 rank10wSent= 0.542373 qloss= 0.91776 absloss= 0.900032

Trouble Shotting

pysrilm[not work in Mac OSX]

cd tools/pysrilm
source ~/venv-py2/bin/activate # use py2
pip install cython
python setup.py install
  1. clang: error: unsupported option '-fopenmp' on Mac OSX ppwwyyxx/OpenPano#16
brew install gcc --without-multilib
export CXX=/usr/local/Cellar/gcc/7.2.0/bin/g++-7

About

Get started N-grams with insuranceqa-corpus and srilm toolkit

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages