A Tornado-based web application template for interacting with word2vec models trained using gensim/fasttext. The web app is optimized for both desktop and mobile browsers using the Bootstrap library.
Type the word you want to search for in your word2vec model and press `GO` to find the 10 most similar words. Simple word arithmetic (+/-) is also supported.
Note that the `+` and `-` operators must be half-width (en) characters, not full-width (em) characters. Be careful when using models in languages that commonly use full-width characters, such as Chinese and Japanese.
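As a rough illustration of the word arithmetic described above, the sketch below splits a query into words to add and words to subtract. The names `parse_query` and `FULLWIDTH_TO_ASCII` are hypothetical and not taken from the app's source, and the full-width normalization step is an optional extra that the app itself may not perform.

```python
import re

# Hypothetical helper, not part of the app: maps the full-width plus
# (U+FF0B) and full-width hyphen-minus (U+FF0D) to their ASCII forms.
FULLWIDTH_TO_ASCII = str.maketrans({"\uff0b": "+", "\uff0d": "-"})


def parse_query(query):
    """Split a query such as 'king - man + woman' into two word lists."""
    query = query.translate(FULLWIDTH_TO_ASCII)
    positive, negative = [], []
    sign = "+"  # words before the first operator are added
    # The capture group keeps the operators in the split result.
    for token in re.split(r"([+-])", query):
        token = token.strip()
        if token in ("+", "-"):
            sign = token
        elif token:
            (positive if sign == "+" else negative).append(token)
    return positive, negative


# With a gensim KeyedVectors object named `model` (assumed to be loaded
# elsewhere), the two lists could be passed on like this:
# positive, negative = parse_query("king - man + woman")
# model.most_similar(positive=positive, negative=negative, topn=10)
```

One limitation of this sketch is that hyphenated words such as `well-known` are split at the hyphen, so it only suits vocabularies without hyphenated entries.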
Results are displayed with similarity scores and corresponding images.
The image search uses getsy as client-side web scraper.
If you check the `LOG` box on the left side of the search bar, your search history is stored at the bottom of the page.
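To make the similarity scores more concrete: gensim ranks neighbours by cosine similarity, so, assuming the app shows the scores gensim returns, a score close to 1 means two word vectors point in nearly the same direction. The pure-Python sketch below reproduces the idea on a toy vocabulary. The `toy` vectors and both function names are made up for illustration and are not part of the app.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, topn=10):
    """Return up to `topn` (word, score) pairs, highest score first."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for this example only.
toy = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
}

for neighbour, score in most_similar("cat", toy, topn=2):
    print(f"{neighbour}\t{score:.3f}")
```

A real model does the same ranking over hundreds of dimensions and the whole vocabulary, which is why loading and querying large models takes noticeably longer.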
First you need to clone this repository to your local disk:
git clone https://github.com/superkerokero/word2vec-search-app.git
Then you need to install the dependencies of the app:
pip install tornado gensim
Go to the directory of the app you have cloned and edit the `config.json` file in the root folder:
{
"model": "YOUR WORD2VEC MODEL FILE PATH",
"fasttext": false,
"debug": false
}
Change `model` to the path of your word2vec model file.
If you are using a fasttext word2vec model, change `fasttext` to `true`. Note that the model path for fasttext models should point to the `*.vec` file, not the `*.bin` file.
If you don't have your own word2vec model, refer to the Pre-trained-word2vec-models section for pre-trained word2vec models you can download from the internet.
If you need debugging, change `debug` to `true`.
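The sketch below shows one way the three settings above could be read and used. It is not the app's actual loading code: `load_config` and `load_model` are hypothetical names, the `*.vec` check is an added safeguard based on the note above, and the choice of `binary=True` for non-fasttext models is an assumption that only holds for files in the original word2vec binary format.

```python
import json


def load_config(path="config.json"):
    """Read config.json and fill in defaults for the optional flags."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    cfg.setdefault("fasttext", False)
    cfg.setdefault("debug", False)
    # Added safeguard (not from the app): fasttext models are expected
    # to be given as the text-format *.vec file.
    if cfg["fasttext"] and not cfg["model"].endswith(".vec"):
        raise ValueError("fasttext models should point to a *.vec file")
    return cfg


def load_model(cfg):
    """Load word vectors with gensim according to the config."""
    # Imported here so that load_config works without gensim installed.
    from gensim.models import KeyedVectors

    if cfg["fasttext"]:
        # *.vec files are plain text in word2vec format.
        return KeyedVectors.load_word2vec_format(cfg["model"], binary=False)
    # Assumption: the file is in the word2vec binary format. Models saved
    # with gensim's own save() would need KeyedVectors.load() instead.
    return KeyedVectors.load_word2vec_format(cfg["model"], binary=True)


# Typical use:
# cfg = load_config()
# model = load_model(cfg)
```

Importing gensim inside `load_model` keeps the config check cheap, so a wrong path or flag can be reported before the slow model load starts.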
Next, open a terminal in the root folder and start the web server with the following command:
python server.py
You should see a message in the terminal indicating that your model is being loaded:
loading word2vec model...
After the loading is complete, you will see this message:
Word2vec model load complete.
Now you can open your browser and enter the following address to use the app:
http://localhost:8000
There are many pre-trained word2vec models available on the internet.
English word2vec models (from 3Top):
Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context window - size | Web page |
---|---|---|---|---|---|---|---|---|
Google News | 300 | Google News (100B) | 3M | | word2vec | negative sampling | BoW - ~5 | link |
Freebase IDs | 1000 | Google News (100B) | 1.4M | | word2vec, skip-gram | ? | BoW - ~10 | link |
Freebase names | 1000 | Google News (100B) | 1.4M | | word2vec, skip-gram | ? | BoW - ~10 | link |
DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
Models for other languages (from Kyubyong), where (w) is the word2vec model and (f) is the fastText model:
Language (w) | Language (f) | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
---|---|---|---|---|---|
Bengali (w) | Bengali (f) | bn | 300 | 147M | 10059 |
Catalan (w) | Catalan (f) | ca | 300 | 967M | 50013 |
Chinese (w) | Chinese (f) | zh | 300 | 1G | 50101 |
Danish (w) | Danish (f) | da | 300 | 295M | 30134 |
Dutch (w) | Dutch (f) | nl | 300 | 1G | 50160 |
Esperanto (w) | Esperanto (f) | eo | 300 | 1G | 50597 |
Finnish (w) | Finnish (f) | fi | 300 | 467M | 30029 |
French (w) | French (f) | fr | 300 | 1G | 50130 |
German (w) | German (f) | de | 300 | 1G | 50006 |
Hindi (w) | Hindi (f) | hi | 300 | 323M | 30393 |
Hungarian (w) | Hungarian (f) | hu | 300 | 692M | 40122 |
Indonesian (w) | Indonesian (f) | id | 300 | 402M | 30048 |
Italian (w) | Italian (f) | it | 300 | 1G | 50031 |
Japanese (w) | Japanese (f) | ja | 300 | 1G | 50108 |
Javanese (w) | Javanese (f) | jv | 100 | 31M | 10019 |
Korean (w) | Korean (f) | ko | 200 | 339M | 30185 |
Malay (w) | Malay (f) | ms | 100 | 173M | 10010 |
Norwegian (w) | Norwegian (f) | no | 300 | 1G | 50209 |
Norwegian Nynorsk (w) | Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 |
Polish (w) | Polish (f) | pl | 300 | 1G | 50035 |
Portuguese (w) | Portuguese (f) | pt | 300 | 1G | 50246 |
Russian (w) | Russian (f) | ru | 300 | 1G | 50102 |
Spanish (w) | Spanish (f) | es | 300 | 1G | 50003 |
Swahili (w) | Swahili (f) | sw | 100 | 24M | 10222 |
Swedish (w) | Swedish (f) | sv | 300 | 1G | 50052 |
Tagalog (w) | Tagalog (f) | tl | 100 | 38M | 10068 |
Thai (w) | Thai (f) | th | 300 | 696M | 30225 |
Turkish (w) | Turkish (f) | tr | 200 | 370M | 30036 |
Vietnamese (w) | Vietnamese (f) | vi | 100 | 74M | 10087 |