- Authors: Omer Levy and Yoav Goldberg
- NIPS 2014
- My literature review is here link
subsample.py
(if needed): remove words with a frequency-based probability to mitigate the effect of high-frequency words.
- -f, --file_path: path, corpus
- -t, --threshold: float, threshold of the removal probability (default is 1e-3)
Many papers set t to 1e-5, but on this corpus (Kokoro by Soseki Natsume; 6383 types, 102763 tokens) that removes 87% of the tokens.
With t = 0.01, 21% of the tokens are removed, which seems about right.
These results show that the parameter t needs to be tuned.
Note that the last line of each log prints the removed fraction (0.87 means 87%), although it is labeled with a percent sign.

Log with t = 1e-5:

    INFO:root:[main] args: Namespace(file_path='kokoro_processed.txt', threshold=1e-05)
    INFO:root:[main] Count (raw) word frequency...
    INFO:root:[main] - most frequent words: [('の', 5818), ('た', 5350), ('。', 4654), ('に', 4363), ('は', 4037)]
    INFO:root:[main] - total_freq: 102763
    INFO:root:[main] Subsampling...
    INFO:root:[main] - remove probability of most frequent words: [('の', 0.9867097996283141), ('た', 0.9861406936020674), ('。', 0.9851404657378731), ('に', 0.9846529191631387), ('は', 0.984045286407889)]
    INFO:root:[main] Save processed document...
    INFO:root:[main] Count (subsampled) word frequency...
    INFO:root:[main] - most frequent words: [('た', 83), ('、', 76), ('は', 71), ('。', 71), ('て', 69)]
    INFO:root:[main] - total_freq: 13380
    INFO:root:[main] - 0.869797495207419% words are removed

Log with t = 0.01:

    INFO:root:[main] args: Namespace(file_path='kokoro_processed.txt', threshold=0.01)
    INFO:root:[main] Count (raw) word frequency...
    INFO:root:[main] - most frequent words: [('の', 5818), ('た', 5350), ('。', 4654), ('に', 4363), ('は', 4037)]
    INFO:root:[main] - total_freq: 102763
    INFO:root:[main] Subsampling...
    INFO:root:[main] - remove probability of most frequent words: [('の', 0.5797269626545619), ('た', 0.5617302499238903), ('。', 0.5301002676236953), ('に', 0.5146826912079521), ('は', 0.4954676563328255)]
    INFO:root:[main] Save processed document...
    INFO:root:[main] Count (subsampled) word frequency...
    INFO:root:[main] - most frequent words: [('の', 2401), ('た', 2277), ('。', 2181), ('に', 2101), ('は', 2051)]
    INFO:root:[main] - total_freq: 81176
    INFO:root:[main] - 0.21006587974270896% words are removed

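
The removal probabilities in the logs are numerically consistent with the word2vec subsampling rule, where a word with relative frequency f is removed with probability 1 - sqrt(t / f). The sketch below is a minimal illustration under that assumption; the function names `remove_probability` and `subsample` are hypothetical and are not taken from subsample.py.

```python
import math
import random
from collections import Counter


def remove_probability(count, total, threshold):
    """Probability of dropping one occurrence of a word.

    Assumes the word2vec-style rule 1 - sqrt(t / f), clipped at 0 so that
    words rarer than the threshold are never removed.
    """
    freq = count / total
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-3, seed=0):
    """Return a copy of tokens with frequent words randomly removed."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    counts = Counter(tokens)
    total = len(tokens)
    return [
        tok for tok in tokens
        if rng.random() >= remove_probability(counts[tok], total, threshold)
    ]


# 'の' occurs 5818 times among 102763 tokens in the logged corpus.
p = remove_probability(5818, 102763, 1e-5)
print(round(p, 4))
```

With t = 1e-5 the most frequent word is dropped almost every time, which is why such a small corpus loses most of its tokens at that setting.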
make_id2word.py
: obtain target words from the corpus.
- -f, --file_path: path, corpus
- -t, --threshold: int, threshold of target words
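
As a rough illustration of what obtaining target words can look like, the sketch below builds an index-to-word dictionary from a token list. It assumes the threshold is a minimum frequency, which the option description does not state explicitly; the function name `make_id2word` and the ordering by descending frequency are likewise assumptions, not the script's confirmed behavior.

```python
from collections import Counter


def make_id2word(tokens, threshold=1):
    """Map integer ids to words that occur at least `threshold` times.

    Assumption: `threshold` is a minimum frequency. Ids are assigned in
    order of descending frequency (ties keep first-seen order).
    """
    counts = Counter(tokens)
    vocab = [word for word, n in counts.most_common() if n >= threshold]
    return dict(enumerate(vocab))


id2word = make_id2word(["a", "b", "a", "c", "b", "a"], threshold=2)
print(id2word)
```

A dictionary like this can be saved with `pickle.dump` so that a later step can load it through an option such as `--pickle_id2word`.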
- -f, --file_path: path, corpus you want to train
- -p, --pickle_id2word: path, pickle of index2word dictionary
- --cooccur_pretrained: path, output text file of pre-trained co-occur matrix
- --sppmi_pretrained: path, output text file of pre-trained sppmi matrix
- -t, --threshold: int, threshold to apply to the co-occurrence matrix, if any
- -a, --has_abs_dis: bool (True if the flag is given, else False), whether to apply absolute discounting smoothing
- -c, --has_cds: bool (True if the flag is given, else False), whether to apply contextual distribution smoothing
- -w, --window_size: int, window size used when counting co-occurrences
- -s, --shift: int, number of negative samples in word2vec (in SPPMI-SVD, SPPMI shifts PMI by -log(#negative samples))
- -d, --dim: int, size of the word vectors
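
To make the roles of the window size, shift, and contextual distribution smoothing options concrete, here is a minimal pure-Python sketch of SPPMI as defined by Levy and Goldberg: SPPMI(w, c) = max(PMI(w, c) - log k, 0), where k is the number of negative samples. The function names, the symmetric window, and the smoothing exponent 0.75 are assumptions made for illustration and may differ from this repository's implementation; absolute discounting and the co-occurrence threshold are left out.

```python
import math
from collections import Counter


def count_cooccurrences(tokens, window_size=2):
    """Count (word, context) pairs inside a symmetric window."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs[(word, tokens[j])] += 1
    return pairs


def sppmi(pairs, shift=1, cds_alpha=None):
    """Return {(word, context): SPPMI}, storing only strictly positive cells.

    shift     -- k, the number of negative samples; PMI is lowered by log(k).
    cds_alpha -- if set (e.g. 0.75, an assumed value), context counts are
                 raised to this power before normalizing.
    """
    total = sum(pairs.values())
    word_counts, ctx_counts = Counter(), Counter()
    for (w, c), n in pairs.items():
        word_counts[w] += n
        ctx_counts[c] += n
    if cds_alpha is None:
        ctx_weights = dict(ctx_counts)
    else:
        ctx_weights = {c: n ** cds_alpha for c, n in ctx_counts.items()}
    ctx_total = sum(ctx_weights.values())

    result = {}
    for (w, c), n in pairs.items():
        p_wc = n / total
        p_w = word_counts[w] / total
        p_c = ctx_weights[c] / ctx_total
        value = math.log(p_wc / (p_w * p_c)) - math.log(shift)
        if value > 0:
            result[(w, c)] = value
    return result


pairs = count_cooccurrences(["a", "b", "a", "b"], window_size=1)
print(sppmi(pairs, shift=1))
```

In SPPMI-SVD the resulting sparse matrix is then factorized with a truncated SVD, and the vector size option corresponds to the number of singular vectors kept. That step is omitted here to keep the sketch free of third-party dependencies.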