Distributed vector representations of the words and Wikipedia entities that appear in Japanese Wikipedia. The data is built according to the specification of "Japanese-Wikipedia Entity Vectors (in Japanese)", released by Masatoshi Suzuki of the Inui-Suzuki Laboratory at Tohoku University.
The following examples use the Gensim library. Segmented words are registered as word2vec words.
>>> from gensim.models import KeyedVectors
>>> w2v_model = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True, unicode_errors='ignore')
>>> print(w2v_model.most_similar(['こと']))
[('事', 0.9618349075317383), ('もの', 0.7732754349708557), ('ため', 0.7425500154495239), ... ]
Entities that link to Wikipedia articles are registered as word2vec words with square brackets on either side, in the format [entity].
>>> from gensim.models import KeyedVectors
>>> w2v_model = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True, unicode_errors='ignore')
>>> print(w2v_model.most_similar(['[72時間ホンネテレビ]']))
[('[AbemaPrime]', 0.6929168701171875), ('[原宿AbemaNews]', 0.6854027509689331), ...]
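Because words and entities share one vocabulary, it can be useful to tell them apart by the bracket convention described above. The following is a minimal sketch; the helper names `is_entity`, `entity_title`, and `only_entities` are hypothetical and not part of Gensim or of the distributed files. It assumes that any key wrapped in square brackets is an entity, which could in principle misclassify an ordinary token that happens to start with `[` and end with `]`.

```python
def is_entity(token: str) -> bool:
    """Return True if a vocabulary key uses the [entity] bracket format."""
    # Assumption: a key is an entity iff it is wrapped in square brackets
    # and has at least one character between them.
    return len(token) > 2 and token.startswith("[") and token.endswith("]")


def entity_title(token: str) -> str:
    """Strip the surrounding square brackets from an entity key."""
    if not is_entity(token):
        raise ValueError(f"not an entity key: {token!r}")
    return token[1:-1]


def only_entities(neighbours):
    """Keep only the (key, score) pairs whose key is an entity."""
    return [(key, score) for key, score in neighbours if is_entity(key)]
```

For example, `only_entities(w2v_model.most_similar(['こと'], topn=50))` would keep only the entity neighbours of a plain word.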
Because the files are large, they are hosted on Dropbox.
https://www.dropbox.com/sh/601gucye55nr1gq/AABekRrz4IYtp2n0_lTrKsGma
Dictionary: mecab-ipadic
File | jawikicorpus | Dictionary | md5 |
---|---|---|---|
jawikivec.ipadic.20181120.tar.xz | jawikicorpus.20181120 | mecab-ipadic-2.7.0-20070801 | bc370d107f9076f9abbfd70ab74b1972 |
jawikivec.ipadic.20181101.tar.xz | jawikicorpus.20181101 | mecab-ipadic-2.7.0-20070801 | ce2b0a197555021e5c0aac96e428c08c |
jawikivec.ipadic.20181020.tar.xz | jawikicorpus.20181020 | mecab-ipadic-2.7.0-20070801 | 2524636714d1418cba5ff0cbf1947c50 |
jawikivec.ipadic.20181001.tar.xz | jawikicorpus.20181001 | mecab-ipadic-2.7.0-20070801 | 693a9d75b936c9a2cb25147575f51eea |
jawikivec.ipadic.20180920.tar.xz | jawikicorpus.20180920 | mecab-ipadic-2.7.0-20070801 | ae63c1cb0c64382773ddfc823c0fce10 |
jawikivec.ipadic.20180901.tar.xz | jawikicorpus.20180901 | mecab-ipadic-2.7.0-20070801 | 0a55a6a33e8e79151f7347378f70e5b5 |
jawikivec.ipadic.20180820.tar.xz | jawikicorpus.20180820 | mecab-ipadic-2.7.0-20070801 | cc524c551cccf8fae29b086add0252b5 |
jawikivec.ipadic.20180720.tar.xz | jawikicorpus.20180720 | mecab-ipadic-2.7.0-20070801 | b3841ad1b46a024b403ed384609d4aad |
jawikivec.ipadic.20180701.tar.xz | jawikicorpus.20180701 | mecab-ipadic-2.7.0-20070801 | 65ee15ad182adf96cfc722b55c17b9ea |
jawikivec.ipadic.20180620.tar.xz | jawikicorpus.20180620 | mecab-ipadic-2.7.0-20070801 | ac7afc5daaf15080b0beb3985281636b |
jawikivec.ipadic.20180601.tar.xz | jawikicorpus.20180601 | mecab-ipadic-2.7.0-20070801 | a72e03aec91be9c287678ea7f3e17527 |
jawikivec.ipadic.20180520.tar.xz | jawikicorpus.20180520 | mecab-ipadic-2.7.0-20070801 | 898b2562d6b851b84e4b467b92e5782a |
Dictionary: mecab-ipadic-NEologd
File | jawikicorpus | Dictionary | md5 |
---|---|---|---|
jawikivec.ipadic-neologd.20181120.tar.xz | jawikicorpus.20181120 | mecab-ipadic-NEologd,b3f3ac6fbdb5130894243c40726a9c4878075649 | f8d04ec98699a88c215601b3c86017e4 |
jawikivec.ipadic-neologd.20181101.tar.xz | jawikicorpus.20181101 | mecab-ipadic-NEologd,b3f3ac6fbdb5130894243c40726a9c4878075649 | 315a1ca2d0fee5d302ceef8cebbd2fe5 |
jawikivec.ipadic-neologd.20181020.tar.xz | jawikicorpus.20181020 | mecab-ipadic-NEologd,b3f3ac6fbdb5130894243c40726a9c4878075649 | 4b4713493a7ffd8e1104bc393e0b3344 |
jawikivec.ipadic-neologd.20181001.tar.xz | jawikicorpus.20181001 | mecab-ipadic-NEologd,1e9da37787c202f157e59d4c9b19cd4636d8a60d | 03c0536e5e68310f8ce40559728a7c06 |
jawikivec.ipadic-neologd.20180920.tar.xz | jawikicorpus.20180920 | mecab-ipadic-NEologd,3326dc5bb7467b51e7875f0f332cef6d89049617 | 2d8e0a4e38dc31f073eb97a32d14e684 |
jawikivec.ipadic-neologd.20180901.tar.xz | jawikicorpus.20180901 | mecab-ipadic-NEologd,3326dc5bb7467b51e7875f0f332cef6d89049617 | 084942f0153444c5e56ff76db81706dd |
jawikivec.ipadic-neologd.20180820.tar.xz | jawikicorpus.20180820 | mecab-ipadic-NEologd,5dc3499bc3fcd28eed960ed03cd51765c5330fe2 | 8d361239c9ec57df78b1f2d527029f44 |
jawikivec.ipadic-neologd.20180720.tar.xz | jawikicorpus.20180720 | mecab-ipadic-NEologd,172cfaa0aad1375d53879d273426cefe4a322e98 | 1587854da8d6efb742117d9e2933ab02 |
jawikivec.ipadic-neologd.20180701.tar.xz | jawikicorpus.20180701 | mecab-ipadic-NEologd,f4d27e2d50c5980a375d326fd8f0e95c881ed1ca | a6c996ab30adbf924270fcb3f292268e |
jawikivec.ipadic-neologd.20180620.tar.xz | jawikicorpus.20180620 | mecab-ipadic-NEologd,1c6e9eb600bba348fa772e218b8ce57d4ce70d85 | 1431b93833a8431689fa2b8eef5d45c4 |
jawikivec.ipadic-neologd.20180601.tar.xz | jawikicorpus.20180601 | mecab-ipadic-NEologd,3f6f113bc2b7b9eecbce45103a628ba715af3b33 | 2c88c8685ad9a821ffdc0ea833475e9a |
jawikivec.ipadic-neologd.20180520.tar.xz | jawikicorpus.20180520 | mecab-ipadic-NEologd,b8b282537589becf7256e74c80c543aa2eba5674 | 9d67c83dfe2ceb79bb3ac446a42ede40 |
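The md5 column can be used to check that a download is intact. The sketch below uses only Python's standard `hashlib`; the function names `md5_of_file` and `matches_md5` are illustrative, and the archive is read in chunks so that a large file does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Compare a file's MD5 digest with the value listed in the table."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_md5('jawikivec.ipadic.20181120.tar.xz', 'bc370d107f9076f9abbfd70ab74b1972')` should return `True` for an uncorrupted copy of that archive.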
Decompressing an archive with the following tar command creates five files.
tar xvJf jawikivec.[dictionary].yyyyMMdd.tar.xz
An output file saved in binary word2vec format.
An output file saved in text word2vec format.
A TSV file containing the terms that appear in the plain text and their corresponding Wikipedia entities. More details are described in the Japanese-Wikipedia Wikification Corpus.
A YAML file that records version information for the dictionary and corpus used.
A document describing the license.
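Where the tar command is not available, the archive can also be unpacked from Python. This is a minimal sketch using the standard `tarfile` module; `extract_archive` is a hypothetical helper name, and the code assumes the archive comes from a trusted source, since `extractall` writes whatever paths the archive contains.

```python
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.xz archive and return the names of its regular files."""
    with tarfile.open(archive_path, mode="r:xz") as tar:
        # Collect the file names first so the caller can see what was unpacked.
        names = [member.name for member in tar.getmembers() if member.isfile()]
        tar.extractall(path=dest_dir)
    return names
```

Calling `extract_archive('jawikivec.ipadic.20181120.tar.xz')` returns the list of extracted file names, which can be compared against the five files described above.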
The distributed vector representations are created with the following word2vec settings.
Option | Value |
---|---|
-size | 200 |
-window | 5 |
-sample | 1e-3 |
-negative | 5 |
-hs | 0 |
-iter | 5 |
-min-count | 5 |
-cbow | 1 |
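The options in the table above look like the flags of the original word2vec command-line tool. For readers who want to train comparable vectors with Gensim, the sketch below translates them into keyword arguments for `gensim.models.Word2Vec`, assuming Gensim 4.x parameter names (in Gensim 3.x, `vector_size` and `epochs` were called `size` and `iter`). This is not necessarily how the distributed vectors were produced, and `to_gensim_kwargs` is a hypothetical helper.

```python
# word2vec options as listed in the settings table.
WORD2VEC_OPTIONS = {
    "-size": 200,
    "-window": 5,
    "-sample": 1e-3,
    "-negative": 5,
    "-hs": 0,
    "-iter": 5,
    "-min-count": 5,
    "-cbow": 1,
}


def to_gensim_kwargs(options):
    """Translate word2vec CLI options into Gensim 4.x Word2Vec keyword arguments."""
    return {
        "vector_size": int(options["-size"]),
        "window": int(options["-window"]),
        "sample": float(options["-sample"]),
        "negative": int(options["-negative"]),
        "hs": int(options["-hs"]),
        "epochs": int(options["-iter"]),
        "min_count": int(options["-min-count"]),
        # -cbow 1 selects CBOW, which Gensim expresses as sg=0 (sg=1 is skip-gram).
        "sg": 0 if int(options["-cbow"]) == 1 else 1,
    }
```

With Gensim installed, a model could then be trained with `Word2Vec(sentences=tokenized_corpus, **to_gensim_kwargs(WORD2VEC_OPTIONS))`, where `tokenized_corpus` is an iterable of token lists supplied by the reader.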