Version 2.0 beta

@bmschmidt bmschmidt released this 24 Jan 13:53
· 48 commits to dev since this release

Upgrade focusing on ease of use (with new, simpler syntax) and CRAN-ability. Bumping major version because of a breaking change in the behavior of nearest_to, which now returns a data.frame.

Changes

Change in nearest_to behavior.

There's a change in nearest_to that will break some existing code: it now returns a data.frame instead of a list. The data.frame columns have elaborate names so they can easily be manipulated with dplyr and/or plotted with ggplot. Pass as_df=FALSE to return to the old behavior.
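To illustrate why a tabular result is easier to work with, here is a minimal base R sketch. It does not call the package: the scores are made up, and the column names `word` and `similarity` are placeholders, not the names nearest_to actually uses.

```r
# Made-up similarity scores, standing in for a non-tabular result.
old_style <- c(queen = 0.91, princess = 0.84, monarch = 0.80)

# The same information as a data.frame.
# Column names are hypothetical placeholders, not the package's real ones.
as_frame <- data.frame(
  word = names(old_style),
  similarity = unname(old_style),
  stringsAsFactors = FALSE
)

# Tabular output can be filtered with ordinary data.frame tools.
strong <- as_frame[as_frame$similarity > 0.82, ]
print(strong)
```

Once the scores sit in columns, the same filtering could be written with dplyr verbs, or the frame handed straight to ggplot.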

New syntax for vector addition.

This package now allows formula scoping for the most common operations, with string inputs looked up in the context of a particular matrix. This makes the bread-and-butter word2vec operations much nicer to write.

For instance, instead of writing (in normal matrix format, not the existing enhancements)

vectors %>% nearest_to(vectors[rownames(vectors)=="king",] - vectors[rownames(vectors)=="man",] + vectors[rownames(vectors)=="woman",])

(whew!), you can now write

vectors %>% nearest_to(~"king" - "man" + "woman")

Most basic math is supported in this interface; to overweight some words, say, you could just multiply out the vectors:

vectors %>% nearest_to(~"king" - "man"*2 + "woman" + "lady")
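For readers who want to see the arithmetic that the formula shorthand stands for, here is a rough base R sketch on a toy matrix. The three-dimensional values are invented for illustration, `cosine_to` is a hypothetical helper rather than a package function, and the sketch says nothing about how the package evaluates formulas internally.

```r
# Toy "embedding" with invented values; rows are words.
vectors <- rbind(
  king  = c(1, 1, 0),
  queen = c(1, 0, 1),
  man   = c(0, 1, 0),
  woman = c(0, 0, 1)
)

# Hypothetical helper: cosine similarity between each row of m and vector v.
cosine_to <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# The long-form equivalent of "king" - "man" + "woman".
target <- vectors["king", ] - vectors["man", ] + vectors["woman", ]

sims <- cosine_to(vectors, target)
names(sims) <- rownames(vectors)

# Word whose row is closest to the combined vector.
closest <- names(sort(sims, decreasing = TRUE))[1]
print(closest)
```

In this toy setup the combined vector coincides with the row for "queen", which is the usual motivating example for this kind of vector arithmetic.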

Reading tweaks.

In keeping with the goal of allowing manipulation of models in low-memory environments, it's now possible to read only the rows whose words match certain criteria by passing an argument to read.binary.vectors(): either rowname_list for a fixed list, or rowname_regexp for a regular expression. (You could, say, read only the words ending in "ing" from a file by entering rowname_regexp = ".*ing$".)
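The two kinds of filter can be tried out on a plain character vector before pointing them at a large file. This is a small base R sketch with a made-up word list; it uses `%in%` and `grepl` directly and makes no assumption about how read.binary.vectors() applies its arguments.

```r
# Made-up vocabulary for illustration.
words <- c("running", "ring", "kingdom", "swimming", "sing", "walked")

# Fixed-list style filter: keep only words named explicitly.
wanted <- c("kingdom", "walked")
kept_by_list <- words[words %in% wanted]

# Regular-expression style filter: keep words that end in "ing".
kept_by_regexp <- words[grepl("ing$", words)]

print(kept_by_list)
print(kept_by_regexp)
```

Note that a pattern anchored on "ing" also keeps words such as "ring" and "sing", so a regular expression only approximates "the gerunds"; "kingdom" is excluded because the match is anchored to the end of the word.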

Test Suite

The package now includes a test suite.

Other materials for rOpenSci and JOSS.

This package has enough users that it might be nice to get it on CRAN. I'm trying to do so through rOpenSci, which requires a lot of small files scattered throughout.