This repository has been archived by the owner on May 20, 2022. It is now read-only.

A tornado based web application template used for interaction with word2vec models.


superkerokero/word2vec-search-app

word2vec-search-app

License: MIT

A Tornado-based web application template for interacting, through a browser, with word2vec models trained using gensim or fastText. The web app is optimized for both desktop and mobile browsers using the Bootstrap library.

Type the word you want to look up in your word2vec model and press GO to find the 10 most similar words. Simple word arithmetic (+/-) is also supported.
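As an illustration of how such a query could be evaluated, here is a minimal sketch. It assumes the model object exposes gensim's `most_similar(positive=..., negative=..., topn=...)` method (as gensim's `KeyedVectors` does); the helper names `parse_query` and `top_similar` are hypothetical and are not taken from this repository. Because `-` is treated as an operator, hyphenated words would be split by this sketch.

```python
import re


def parse_query(query):
    """Split a query such as 'king - man + woman' into (positive, negative) word lists."""
    positive, negative = [], []
    # Each match is an optional sign followed by a word; a missing sign counts as '+'.
    for sign, word in re.findall(r"([+-]?)\s*([^\s+-]+)", query):
        (negative if sign == "-" else positive).append(word)
    return positive, negative


def top_similar(model, query, topn=10):
    """Return the topn most similar words for a query (hypothetical helper).

    `model` is assumed to provide a gensim-style most_similar() method.
    """
    positive, negative = parse_query(query)
    return model.most_similar(positive=positive, negative=negative, topn=topn)
```

With a loaded model, `top_similar(model, "king - man + woman")` would ask for the 10 words closest to the combined vector.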

Note that the +/- operators must be typed as half-width (ASCII) characters, not full-width ones. Be careful when using models in languages whose input methods produce full-width characters, such as Chinese and Japanese.
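If you want to guard against wide operator characters before a query reaches the model, one option is to map them to their ASCII forms first. The sketch below is an assumption about how this could be done, not something the app is known to do; `normalize_operators` is a hypothetical helper. It deliberately translates only the operator characters, so the query words themselves are left untouched.

```python
# Full-width plus (U+FF0B), full-width hyphen-minus (U+FF0D) and the
# mathematical minus sign (U+2212), mapped to their ASCII equivalents.
_OPERATOR_MAP = str.maketrans({"\uff0b": "+", "\uff0d": "-", "\u2212": "-"})


def normalize_operators(query):
    """Replace wide or typographic +/- characters with ASCII '+' and '-'."""
    return query.translate(_OPERATOR_MAP)
```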

[Image: Web application interface]

Results are displayed with similarity scores and corresponding images like this:

[Image: Results]

The image search uses getsy as a client-side web scraper.

If you check the LOG box on the left side of the search bar, your search history will be kept at the bottom of the page.

How to set up

First you need to clone this repository to your local disk:

git clone https://github.com/superkerokero/word2vec-search-app.git

Then you need to install the dependencies of the app:

pip install tornado gensim

Go to the directory of the app you have cloned and edit the config.json file in the root folder:

{
    "model": "YOUR WORD2VEC MODEL FILE PATH",
    "fasttext": false,
    "debug": false
}

Set model to the path of your word2vec model file. If you are using a fastText model, set fasttext to true. Note that for fastText models the path should point to the *.vec file, not the *.bin file.

If you don't have your own word2vec model, see the Pre-trained-word2vec-models section for pre-trained models that can be downloaded from the internet.
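Before starting the server, it can help to sanity-check the configuration. The following is a minimal sketch under the assumption that config.json contains exactly the three keys shown above; `validate_config` and `load_config` are hypothetical helpers and are not part of this repository.

```python
import json
from pathlib import Path


def validate_config(config):
    """Check the three expected keys and the fastText file-extension rule."""
    for key, expected in (("model", str), ("fasttext", bool), ("debug", bool)):
        if not isinstance(config.get(key), expected):
            raise ValueError(f"config key {key!r} must be of type {expected.__name__}")
    # fastText models are expected as *.vec files rather than *.bin files.
    if config["fasttext"] and not config["model"].endswith(".vec"):
        raise ValueError("fastText models should point to a *.vec file")
    return config


def load_config(path="config.json"):
    """Read the JSON config file and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling `load_config()` from the root folder raises a `ValueError` with a short explanation if a key is missing, has the wrong type, or if a fastText path does not end in .vec.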

If you need debugging, change debug to true.

Next open a terminal from the root folder, and use the following command to start the web server:

python server.py

You should see a message in the terminal indicating that your model is being loaded:

loading word2vec model...

After the loading is complete, you will see this message:

Word2vec model load complete.

Now you can open your browser and enter the following address to use the app.

http://localhost:8000

Pre-trained-word2vec-models

Many pre-trained word2vec models are available on the internet.

For English word2vec models (from 3Top):

| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context window - size | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW - ~5 | link |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
| DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |

For other languages (from Kyubyong):

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
| --- | --- | --- | --- | --- |
| Bengali (w) / Bengali (f) | bn | 300 | 147M | 10059 |
| Catalan (w) / Catalan (f) | ca | 300 | 967M | 50013 |
| Chinese (w) / Chinese (f) | zh | 300 | 1G | 50101 |
| Danish (w) / Danish (f) | da | 300 | 295M | 30134 |
| Dutch (w) / Dutch (f) | nl | 300 | 1G | 50160 |
| Esperanto (w) / Esperanto (f) | eo | 300 | 1G | 50597 |
| Finnish (w) / Finnish (f) | fi | 300 | 467M | 30029 |
| French (w) / French (f) | fr | 300 | 1G | 50130 |
| German (w) / German (f) | de | 300 | 1G | 50006 |
| Hindi (w) / Hindi (f) | hi | 300 | 323M | 30393 |
| Hungarian (w) / Hungarian (f) | hu | 300 | 692M | 40122 |
| Indonesian (w) / Indonesian (f) | id | 300 | 402M | 30048 |
| Italian (w) / Italian (f) | it | 300 | 1G | 50031 |
| Japanese (w) / Japanese (f) | ja | 300 | 1G | 50108 |
| Javanese (w) / Javanese (f) | jv | 100 | 31M | 10019 |
| Korean (w) / Korean (f) | ko | 200 | 339M | 30185 |
| Malay (w) / Malay (f) | ms | 100 | 173M | 10010 |
| Norwegian (w) / Norwegian (f) | no | 300 | 1G | 50209 |
| Norwegian Nynorsk (w) / Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 |
| Polish (w) / Polish (f) | pl | 300 | 1G | 50035 |
| Portuguese (w) / Portuguese (f) | pt | 300 | 1G | 50246 |
| Russian (w) / Russian (f) | ru | 300 | 1G | 50102 |
| Spanish (w) / Spanish (f) | es | 300 | 1G | 50003 |
| Swahili (w) / Swahili (f) | sw | 100 | 24M | 10222 |
| Swedish (w) / Swedish (f) | sv | 300 | 1G | 50052 |
| Tagalog (w) / Tagalog (f) | tl | 100 | 38M | 10068 |
| Thai (w) / Thai (f) | th | 300 | 696M | 30225 |
| Turkish (w) / Turkish (f) | tr | 200 | 370M | 30036 |
| Vietnamese (w) / Vietnamese (f) | vi | 100 | 74M | 10087 |
