
Version 1.1

Released by @bmschmidt on 08 Jan, 23:11

A few changes, primarily to the functions for training vector spaces, to produce higher-quality models. A number of these changes are merged back in from @sunecasp's fork of this repo. Thanks!

Some bug fixes

Filenames can now be up to 1024 characters long. Some parameters governing alpha decay may also be fixed; I'm not entirely sure what @sunecasp's changes do.

Changes to the default number of iterations

Models now default to 5 iterations through the text rather than 1. Training may therefore take about 5 times as long, but, particularly for small corpora, the vectors should be of higher quality. See below for an example.
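As a minimal sketch of the new default (the file names here are placeholders, not files shipped with the package):

```r
library(wordVectors)

# As of this release, train_word2vec makes 5 passes over the text by default.
model = train_word2vec("corpus.txt", "corpus.vectors")

# The old single-pass behavior is still available by setting iter explicitly.
fast_model = train_word2vec("corpus.txt", "corpus_fast.vectors", iter = 1)
```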

More training arguments

You can now pass more flags to the underlying word2vec code. `?train_word2vec` gives the full list, but these are particularly useful (a short sketch follows the list):

  1. `window` now accurately sets the window size.
  2. `iter` sets the number of training iterations. For very large corpora, `iter = 1` will train most quickly; for very small corpora, `iter = 15` will give substantially better vectors (see below). You should set this as high as you can stand within reason (setting `iter` higher than `window` is probably not that useful), but more text is better than more iterations.
  3. `min_count` gives a cutoff for vocabulary inclusion: tokens occurring fewer than `min_count` times will be dropped from the model. Setting this high can be useful. (Note, though, that a trained model is sorted in order of frequency, so if you have the RAM to train a big model, you can shrink it for analysis by subsetting to the first 10,000 or however many rows; see the sketch below.)
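Putting those flags together, here's a hedged sketch (paths and parameter values are illustrative only, and the last line relies on the frequency-sorted row order described above):

```r
library(wordVectors)

# Train with an explicit window, extra iterations, and a vocabulary cutoff.
model = train_word2vec(
  "corpus.txt", "corpus.vectors",
  window = 12,      # context window size
  iter = 15,        # number of passes; helps most on small corpora
  min_count = 10    # drop tokens appearing fewer than 10 times
)

# Rows are sorted by token frequency, so a big model can be cut down
# for analysis by keeping just the most frequent rows.
smaller = model[1:10000, ]
```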

Example of vectors trained with different numbers of iterations

Here's an example of training on a small corpus (c. 1,000 speeches from the floor of the House of Commons in the early 19th century).

    > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)})
    Error in train_word2vec("~/tmp2.txt", "~/1_iter.vectors", iter = 1) :
      The output file '~/1_iter.vectors' already exists: delete or give a new destination.
    > proc.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)})
    Starting training using file /Users/bschmidt/tmp2.txt
    Vocab size: 4469
    Words in train file: 407583
    Alpha: 0.000711  Progress: 99.86%  Words/thread/sec: 67.51k
    Error in proc.time({ : 1 argument passed to 'proc.time' which requires 0
    > ?proc.time
    > system.time({one = train_word2vec("~/tmp2.txt","~/1_iter.vectors",iter = 1)})
    Starting training using file /Users/bschmidt/tmp2.txt
    Vocab size: 4469
    Words in train file: 407583
    Alpha: 0.000711  Progress: 99.86%  Words/thread/sec: 66.93k
       user  system elapsed
      6.753   0.055   6.796
    > system.time({two = train_word2vec("~/tmp2.txt","~/2_iter.vectors",iter = 3)})
    Starting training using file /Users/bschmidt/tmp2.txt
    Vocab size: 4469
    Words in train file: 407583
    Alpha: 0.000237  Progress: 99.95%  Words/thread/sec: 67.15k
       user  system elapsed
     18.846   0.085  18.896

    > two %>% nearest_to(two[["debt"]]) %>% round(3)
          debt  remainder        Jan  including   drawback manufactures  prisoners mercantile subsisting
         0.000      0.234      0.256      0.281      0.291        0.293      0.297      0.314      0.314
           Dec
         0.318
    > one %>% nearest_to(one[["debt"]]) %>% round(3)
          debt  Christmas  exception preventing     Indies     import  remainder        eye   eighteen  labouring
         0.000      0.150      0.210      0.214      0.215      0.220      0.221      0.223      0.225      0.227

    > system.time({ten = train_word2vec("~/tmp2.txt","~/10_iter.vectors",iter = 10)})
    Starting training using file /Users/bschmidt/tmp2.txt
    Vocab size: 4469
    Words in train file: 407583
    Alpha: 0.000071  Progress: 99.98%  Words/thread/sec: 66.13k
       user  system elapsed
     62.070   0.296  62.333
    > ten %>% nearest_to(ten[["debt"]]) %>% round(3)
          debt    surplus        Dec  remainder manufacturing     grants        Jan   drawback  prisoners
         0.000      0.497      0.504      0.510         0.519      0.520      0.533      0.536      0.546
    compelling
         0.553
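Finally, a trained model doesn't need to be retrained to be used again. A sketch, assuming `read.vectors` (the package's reader for files written by `train_word2vec`) as the loader:

```r
library(wordVectors)
library(magrittr)  # for %>%, if not already attached

# Reload the ten-iteration model from disk rather than retraining it.
ten = read.vectors("~/10_iter.vectors")

# nearest_to returns cosine *distances*, so smaller values are closer;
# a word is always at distance 0.000 from itself.
ten %>% nearest_to(ten[["debt"]]) %>% round(3)
```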