Train Word2Vec and FastText word embedding models
The models are trained on the five Game of Thrones books (A Song of Ice and Fire).
Four models are trained:
- Word2Vec with CBOW (Continuous Bag of Words)
- Word2Vec with Skip-Gram
- FastText with CBOW
- FastText with Skip-Gram
Because the .bin models are very large, only the .txt models are uploaded.
The code is written in Python 3.6. Install the dependencies with:
pip install -r requirements.txt
To train word embeddings from machine-readable documents in .pdf or .txt format:
python word_embedding.py -m [model_type] -sg [0 or 1] -s [stopwords_file] -p [path_to_data_folder] -epoch [number_of_epochs]
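For reference, the training step boils down to a few Gensim calls. Here is a minimal sketch; the toy corpus and file names are assumptions, and parameter names follow Gensim 4 (older versions use size/iter instead of vector_size/epochs):

```python
from gensim.models import Word2Vec, FastText

# Toy corpus: in the real script, sentences come from the tokenized books.
sentences = [["jon", "snow", "went", "north"],
             ["winterfell", "is", "cold"]]

# sg=0 trains CBOW, sg=1 trains Skip-Gram (same flag for both classes).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0, epochs=10)
ft = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0, epochs=10)

# Save only the word vectors in the plain-text format uploaded in this repo.
w2v.wv.save_word2vec_format("model_word2vec.txt", binary=False)
```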
To load a word embedding model and search for associated words:
python word_embedding_load.py -m [model_type] -sg [0 or 1] -w [words_to_search] -p [model_path] -topn [number_of_top_similar_words_to_show]
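Under the hood, loading and searching map onto Gensim's KeyedVectors API; a minimal sketch (the model file name is an assumption):

```python
from gensim.models import KeyedVectors

# Load vectors saved in the plain-text word2vec format (the .txt models here).
wv = KeyedVectors.load_word2vec_format("model_word2vec.txt", binary=False)

# Search several words at once; most_similar defaults to topn=10.
for word in ["stark", "jon_snow"]:
    print(word, wv.most_similar(word, topn=10))
```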
If you only want to convert PDF files to CSV:
python convert_pdf_text.py -i [directory] -o [output_file_name]
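A minimal sketch of such a conversion, assuming pdfminer.six for text extraction (the actual script may use a different library):

```python
import csv
import glob

from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Extract the text of every PDF in a directory into one CSV row per file.
rows = [(path, extract_text(path)) for path in glob.glob("./data/*.pdf")]
with open("text.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "text"])
    writer.writerows(rows)
```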
The model type must be either "word2vec" or "fasttext".
Setting -sg to 0 selects CBOW; setting it to 1 selects Skip-Gram.
The code loads models in .bin format by default.
If the model is in .txt format, pass the file name as the model type.
The default topn is 10, per the Gensim documentation.
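Depending on how a .bin file was produced, it loads one of two ways in Gensim (a hedged sketch; this repo's exact save method isn't stated):

```python
from gensim.models import KeyedVectors, Word2Vec

# If saved with model.save(...): Gensim's native format, a full trainable model.
model = Word2Vec.load("model_word2vec.bin")

# If saved with save_word2vec_format(..., binary=True): vectors only.
wv = KeyedVectors.load_word2vec_format("model_word2vec.bin", binary=True)
```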
We can search for multiple words and phrases:
- To search for multiple words, separate them with spaces.
- To search for phrases, join the words within a phrase with underscores (e.g. jon_snow); see the sketch below.
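The underscore-joined tokens such as jon_snow suggest bigram detection during preprocessing; a sketch with Gensim's Phrases, assuming that is how this repo builds its phrases:

```python
from gensim.models.phrases import Phrases

sentences = [["jon", "snow", "knows", "nothing"],
             ["jon", "snow", "rode", "north"]]

# Frequently co-occurring pairs are merged into single underscore tokens.
bigram = Phrases(sentences, min_count=1, threshold=1)
print(bigram[["jon", "snow", "knows", "nothing"]])
# ['jon_snow', 'knows', 'nothing']
```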
Examples:
python convert_pdf_text.py -i ./data -o text.csv
python word_embedding.py -m word2vec -sg 0 -s stopwords.txt -p ./data -epoch 1000
python word_embedding_load.py -m word2vec -sg 0 -w stark jon_snow -p ./GOT_model
Results:
Model name: model_word2vec
Word to search: stark
Most similar words: [('winterfell', 0.3728424310684204), ('father', 0.35336238145828247), ('robb', 0.3526594638824463), ('lord', 0.30873656272888184), ('king', 0.303555965423584), ('catelyn', 0.28472280502319336), ('son', 0.26981428265571594), ('realm', 0.269493043422699), ('lady', 0.2655394971370697), ('dead', 0.25631558895111084)]
Word to search: jon_snow
Most similar words: [('jon', 0.44456174969673157), ('bran', 0.2948411703109741), ('castle_black', 0.2853333055973053), ('man', 0.2648966908454895), ('wall', 0.25946474075317383), ('mance', 0.25083184242248535), ('winterfell', 0.24593092501163483), ('night_watch', 0.24114950001239777), ('crow', 0.23541709780693054), ('alfyn_crowkiller', 0.2346397042274475)]
Another example with the FastText model:
python word_embedding_load.py -m fasttext -sg 0 -w stark jon_snow -p ./GOT_model
Result:
Model name: model_fasttext
Word to search: stark
Most similar words: [('starks', 0.5628231763839722), ('starks_winterfell', 0.5375465750694275), ('lord_stark', 0.5259763598442078), ('lord_eddard_stark', 0.50931715965271), ('ward_eddard_stark', 0.48959609866142273), ('stark_winterfell', 0.48720109462738037), ('lady_stark', 0.48462986946105957), ('son_eddard_stark', 0.47510993480682373), ('house_stark', 0.4697829484939575), ('karstark', 0.46301573514938354)]
Word to search: jon_snow
Most similar words: [('jon_snow_ygritte', 0.6690560579299927), ('jon_snow_reflected', 0.561360776424408), ('jon', 0.5090968608856201), ('fallen_snow', 0.42110997438430786), ('night_watch', 0.39625218510627747), ('night_watch_takes', 0.3886195421218872), ('lord_snow', 0.3877685070037842), ('lord_commander_night_watch', 0.37715011835098267), ('benjen_stark', 0.37189415097236633), ('wildlings', 0.35399898886680603)]
Another example with Skip-Gram:
python word_embedding_load.py -m word2vec -sg 1 -w stark jon_snow -p ./GOT_model
Result:
Model name: model_word2vec_sg
Word to search: stark
Most similar words: [('ward_lady_catelyn', 0.3863148093223572), ('jammos', 0.3757057785987854), ('iord', 0.36879587173461914), ('white_field', 0.3511703610420227), ('father_ward', 0.34293508529663086), ('ser_whalen', 0.3387035131454468), ('ser_forley', 0.33681389689445496), ('pardoned', 0.3281732499599457), ('lady_lysa_arryn', 0.32632267475128174), ('sack_king_landing', 0.32433879375457764)]
Word to search: jon_snow
Most similar words: [('ned_stark_bastard', 0.32651451230049133), ('father_ward', 0.32257044315338135), ('squire_dalbridge', 0.3176880180835724), ('dark_anger', 0.3141580820083618), ('raider_leader_war_band', 0.31240659952163696), ('raised_hood', 0.30803531408309937), ('mance_rayder', 0.3076817989349365), ('ragwyle', 0.30176836252212524), ('wolves_shadowcat', 0.3009984791278839), ('jon', 0.2983658015727997)]
Compared to Word2Vec, FastText chunks words into subwords (character n-grams), so the most similar words tend to share subwords. FastText is therefore useful if, for example, we want to find misspelled words in our corpus.
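For instance, because subword n-grams let FastText build a vector for any string, a misspelled query still lands near its correct spelling. A sketch of this, assuming the full .bin model is available (the .txt vectors drop the subword information needed for out-of-vocabulary lookup):

```python
from gensim.models import FastText

model = FastText.load("model_fasttext.bin")  # native Gensim save assumed

# "wynterfell" is out-of-vocabulary, but its character n-grams overlap
# heavily with "winterfell", so the nearest neighbours should include it.
print(model.wv.most_similar("wynterfell", topn=5))
```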
For this corpus, the CBOW model seems to perform better than Skip-Gram.