script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset

shihono/evaluate_japanese_w2v

evaluate_japanese_w2v

A script for evaluating a word2vec model on Japanese word-similarity datasets.

Tokenization (word segmentation) with mecab-python3 and SudachiPy is supported.

Requirements

  • chardet
  • numpy
  • scipy
  • gensim
  • mecab-python3
  • sudachipy
  • sudachidict-core

Usage

$ python eval.py model data [option]
  • model: a word2vec model file that can be loaded by gensim.

  • data: a CSV file, or a directory containing CSV files. Each file must have three columns: word1, word2, and a numeric score (such as a similarity rating).

    • The indexes of the three columns can be specified with --col (default: [0,1,2])
optional arguments:
  -h, --help            show this help message and exit
  --col COL COL COL     indexes of word1, word2, similarity
  --verbose, -v         verbose
  --mecab, -m           use mecab
  --mecab_dict MECAB_DICT, -d MECAB_DICT
                        mecab dictionary path
  --sudachi, -s         use sudachi
  --sudachi_mode SUDACHI_MODE
                        select sudachi tokenizer mode: A or B or C
  --output OUTPUT, -o OUTPUT
                        output csv path or directory path
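
To make the data format and the --col option more concrete, here is a minimal sketch of how three columns could be picked out of a CSV file by index. It is not the code used in eval.py: the helper name `load_pairs`, the comma delimiter, and the rule of skipping rows whose score is not numeric (for example a header row) are all assumptions made for illustration.

```python
import csv
import io


def load_pairs(lines, cols=(0, 1, 2)):
    """Hypothetical helper: collect (word1, word2, score) tuples from CSV rows.

    `cols` plays the role of --col: indexes of word1, word2 and the score.
    """
    w1_idx, w2_idx, score_idx = cols
    pairs = []
    for row in csv.reader(lines):
        if len(row) <= max(cols):
            continue  # skip empty or too-short rows
        try:
            score = float(row[score_idx])
        except ValueError:
            continue  # assumption: a non-numeric score marks a header row
        pairs.append((row[w1_idx], row[w2_idx], score))
    return pairs


# Toy data with extra columns, read the way `--col 1 2 4` would select them.
sample = io.StringIO(
    "id,word1,word2,pos,similarity\n"
    "1,犬,猫,noun,4.5\n"
    "2,山,海,noun,2.0\n"
)
pairs = load_pairs(sample, cols=(1, 2, 4))
print(pairs)
```

With a real dataset, an open file object (for example `open(path, encoding="utf-8", newline="")`) could be passed in place of the `io.StringIO` sample.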

Example

Example for Mecab

$ python eval.py /path/to/latest-ja-word2vec-gensim-model/word2vec.gensim.model \
    /path/to/JWSAN/jwsan-1400.txt \
    -v --col 1 2 4 -m --mecab_dict /usr/local/lib/mecab/dic/mecab-ipadic-neologd 

Output:

[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] set logger
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Word vector 50 dim, Vocab size 335476
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Use mecab : dict setting is /usr/local/lib/mecab/dic/mecab-ipadic-neologd
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] load filepath : /path/to/JWSAN/jwsan-1400.csv, 1400 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Evaluate 1359 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] spearmanr SpearmanrResult(correlation=0.4155930561711437, pvalue=6.97399627506598e-58)
Data    1400
OOV     41
Corr    0.416
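
In the summary above, 41 of the 1400 pairs are out of vocabulary (OOV), so the correlation is computed over the remaining 1359 pairs. The sketch below illustrates that kind of evaluation on toy data: pairs with a missing word are counted as OOV and skipped, and the gold scores are compared with cosine similarities by Spearman rank correlation. The log shows that the script reports a `spearmanr` result from scipy; here the rank correlation is reimplemented with the standard library only so that the example runs without dependencies. The function names, the toy vectors, and the exact OOV rule are assumptions, not the script's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, pairs):
    """Hypothetical evaluation loop: skip OOV pairs, correlate the rest."""
    gold, predicted, oov = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1  # at least one word has no vector
            continue
        gold.append(score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return {"data": len(pairs), "oov": oov, "corr": spearman(gold, predicted)}


# Toy 2-dimensional "word vectors"; a real model would come from gensim.
vectors = {"犬": [1.0, 0.0], "猫": [0.9, 0.1], "山": [0.0, 1.0], "海": [0.5, 0.5]}
pairs = [("犬", "猫", 4.5), ("犬", "山", 1.0), ("犬", "海", 2.5), ("猫", "未知語", 3.0)]
result = evaluate(vectors, pairs)
print("Data", result["data"])
print("OOV ", result["oov"])
print("Corr", round(result["corr"], 3))
```

Because Spearman correlation depends only on rank order, the toy model scores a perfect correlation here even though the cosine values are on a different scale from the gold scores.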

More results are available in 学習済み日本語word2vecとその評価について ("On pre-trained Japanese word2vec models and their evaluation", written in Japanese).
