Train Word2Vec and FastText word embedding models
The models are trained on the five Game of Thrones books (A Song of Ice and Fire).
Four models are trained:
- Word2Vec with CBOW (Continuous Bag of Words)
- Word2Vec with Skip-Gram
- FastText with CBOW
- FastText with Skip-Gram
Because the .bin models are very large, only the .txt models are uploaded.
The code is written in Python 3.6. Install the dependencies with:
pip install -r requirements.txt
To train word embeddings from machine-readable documents in .pdf or .txt format:
python word_embedding.py -m [model_type] -sg [0 or 1] -s [stopwords_file] -p [path_to_data_folder] -epoch [number_of_epochs]
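For reference, the training step boils down to a few Gensim calls. Here is a minimal sketch; the toy corpus and file names are assumptions, and parameter names follow Gensim 4 (older versions use size/iter instead of vector_size/epochs):

```python
from gensim.models import Word2Vec, FastText

# Toy corpus: in the real script, sentences come from the tokenized books.
sentences = [["jon", "snow", "went", "north"],
             ["winterfell", "is", "cold"]]

# sg=0 trains CBOW, sg=1 trains Skip-Gram (same flag for both classes).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0, epochs=10)
ft = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0, epochs=10)

# Save only the word vectors in the plain-text format uploaded in this repo.
w2v.wv.save_word2vec_format("model_word2vec.txt", binary=False)
```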
To load a word embedding model and search for associated words:
python word_embedding_load.py -m [model_type] -sg [0 or 1] -w [words_to_search] -p [model_path] -topn [number_of_top_similar_words_to_show]
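Under the hood, loading and searching map onto Gensim's KeyedVectors API; a minimal sketch (the model file name is an assumption):

```python
from gensim.models import KeyedVectors

# Load vectors saved in the plain-text word2vec format (the .txt models here).
wv = KeyedVectors.load_word2vec_format("model_word2vec.txt", binary=False)

# Search several words at once; most_similar defaults to topn=10.
for word in ["stark", "jon_snow"]:
    print(word, wv.most_similar(word, topn=10))
```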
If you only want to convert PDF files to CSV:
python convert_pdf_text.py -i [directory] -o [output_file_name]
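A minimal sketch of such a conversion, assuming pdfminer.six for text extraction (the actual script may use a different library):

```python
import csv
import glob

from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Extract the text of every PDF in a directory into one CSV row per file.
rows = [(path, extract_text(path)) for path in glob.glob("./data/*.pdf")]
with open("text.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "text"])
    writer.writerows(rows)
```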
The model type must be either "word2vec" or "fasttext".
Setting -sg to 0 selects CBOW; setting it to 1 selects Skip-Gram.
The code loads models in .bin format by default.
If the model is in .txt format, pass the file name as the model type.
The default topn is 10, per the Gensim documentation.
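Depending on how a .bin file was produced, it loads one of two ways in Gensim (a hedged sketch; this repo's exact save method isn't stated):

```python
from gensim.models import KeyedVectors, Word2Vec

# If saved with model.save(...): Gensim's native format, a full trainable model.
model = Word2Vec.load("model_word2vec.bin")

# If saved with save_word2vec_format(..., binary=True): vectors only.
wv = KeyedVectors.load_word2vec_format("model_word2vec.bin", binary=True)
```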
We can search for multiple words and phrases:
- To search for multiple words, separate them with spaces.
- To search for phrases, join the words within a phrase with underscores (e.g. jon_snow); see the sketch below.
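The underscore-joined tokens such as jon_snow suggest bigram detection during preprocessing; a sketch with Gensim's Phrases, assuming that is how this repo builds its phrases:

```python
from gensim.models.phrases import Phrases

sentences = [["jon", "snow", "knows", "nothing"],
             ["jon", "snow", "rode", "north"]]

# Frequently co-occurring pairs are merged into single underscore tokens.
bigram = Phrases(sentences, min_count=1, threshold=1)
print(bigram[["jon", "snow", "knows", "nothing"]])
# ['jon_snow', 'knows', 'nothing']
```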
Examples:
python convert_pdf_text.py -i ./data -o text.csv
python word_embedding.py -m word2vec -sg 0 -s stopwords.txt -p ./data -epoch 1000
python word_embedding_load.py -m word2vec -sg 0 -w stark jon_snow -p ./GOT_model
Results:
Model name: model_word2vec
Word to search: stark
Most similar words: [('winterfell', 0.3728424310684204), ('father', 0.35336238145828247), ('robb', 0.3526594638824463), ('lord', 0.30873656272888184), ('king', 0.303555965423584), ('catelyn', 0.28472280502319336), ('son', 0.26981428265571594), ('realm', 0.269493043422699), ('lady', 0.2655394971370697), ('dead', 0.25631558895111084)]
Word to search: jon_snow
Most similar words: [('jon', 0.44456174969673157), ('bran', 0.2948411703109741), ('castle_black', 0.2853333055973053), ('man', 0.2648966908454895), ('wall', 0.25946474075317383), ('mance', 0.25083184242248535), ('winterfell', 0.24593092501163483), ('night_watch', 0.24114950001239777), ('crow', 0.23541709780693054), ('alfyn_crowkiller', 0.2346397042274475)]
Another example with the FastText model:
python word_embedding_load.py -m fasttext -sg 0 -w stark jon_snow -p ./GOT_model
Result:
Model name: model_fasttext
Word to search: stark
Most similar words: [('starks', 0.5628231763839722), ('starks_winterfell', 0.5375465750694275), ('lord_stark', 0.5259763598442078), ('lord_eddard_stark', 0.50931715965271), ('ward_eddard_stark', 0.48959609866142273), ('stark_winterfell', 0.48720109462738037), ('lady_stark', 0.48462986946105957), ('son_eddard_stark', 0.47510993480682373), ('house_stark', 0.4697829484939575), ('karstark', 0.46301573514938354)]
Word to search: jon_snow
Most similar words: [('jon_snow_ygritte', 0.6690560579299927), ('jon_snow_reflected', 0.561360776424408), ('jon', 0.5090968608856201), ('fallen_snow', 0.42110997438430786), ('night_watch', 0.39625218510627747), ('night_watch_takes', 0.3886195421218872), ('lord_snow', 0.3877685070037842), ('lord_commander_night_watch', 0.37715011835098267), ('benjen_stark', 0.37189415097236633), ('wildlings', 0.35399898886680603)]
Another example with Skip-Gram:
python word_embedding_load.py -m word2vec -sg 1 -w stark jon_snow -p ./GOT_model
Result:
Model name: model_word2vec_sg
Word to search: stark
Most similar words: [('ward_lady_catelyn', 0.3863148093223572), ('jammos', 0.3757057785987854), ('iord', 0.36879587173461914), ('white_field', 0.3511703610420227), ('father_ward', 0.34293508529663086), ('ser_whalen', 0.3387035131454468), ('ser_forley', 0.33681389689445496), ('pardoned', 0.3281732499599457), ('lady_lysa_arryn', 0.32632267475128174), ('sack_king_landing', 0.32433879375457764)]
Word to search: jon_snow
Most similar words: [('ned_stark_bastard', 0.32651451230049133), ('father_ward', 0.32257044315338135), ('squire_dalbridge', 0.3176880180835724), ('dark_anger', 0.3141580820083618), ('raider_leader_war_band', 0.31240659952163696), ('raised_hood', 0.30803531408309937), ('mance_rayder', 0.3076817989349365), ('ragwyle', 0.30176836252212524), ('wolves_shadowcat', 0.3009984791278839), ('jon', 0.2983658015727997)]
Compared to Word2Vec, FastText chunks words into subwords (character n-grams), so the most similar words tend to share subwords. FastText is therefore useful if, for example, we want to find misspelled words in our corpus.
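For instance, because subword n-grams let FastText build a vector for any string, a misspelled query still lands near its correct spelling. A sketch of this, assuming the full .bin model is available (the .txt vectors drop the subword information needed for out-of-vocabulary lookup):

```python
from gensim.models import FastText

model = FastText.load("model_fasttext.bin")  # native Gensim save assumed

# "wynterfell" is out-of-vocabulary, but its character n-grams overlap
# heavily with "winterfell", so the nearest neighbours should include it.
print(model.wv.most_similar("wynterfell", topn=5))
```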
For this corpus, the CBOW model seems to perform better than Skip-Gram.