- Extensions to skip-gram model to learn high-quality distributed vector representations
- Speed gains and more accurate representations
- "Air" + "Canada" cannot easily be combined into "Air Canada" - the challenge is how to find such phrases in text
- Historically, representing words with similar meanings close together in the same vector space has worked well
- More recently, the skip-gram model was introduced - it needs no dense matrix multiplications, so a single machine can train on more than 100B words in a day
- Surprisingly, simple vector arithmetic captures linguistic regularities, e.g. vec("Madrid") - vec("Spain") + vec("France") ≈ vec("Paris") (a sketch follows below)
- Subsampling frequent words results in 2x-10x speedup
- Going from word model to phrase models such as "Air Canada", "Boston Globe" makes the model way more expressive
- 1st identify phrases
- 2nd treat phrases as individual tokens
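As a quick illustration of the analogy arithmetic mentioned above, here is a minimal sketch using gensim's pre-trained Google News vectors; gensim and the model name are assumptions on my part, not something from the paper notes:

```python
# Minimal sketch of vec("Madrid") - vec("Spain") + vec("France") ≈ vec("Paris").
# Assumes gensim is installed and the pre-trained Google News model is available
# (large download on first use; returns a KeyedVectors object).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# "Madrid" - "Spain" + "France" should rank "Paris" near the top.
print(wv.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=3))
```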
- Objective: find word representations that are useful for predicting surrounding (-c, c) words
- Maximize the average log probability
  $$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$
- The basic formulation defines this probability with a softmax, $p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}$, where $v$ is the input and $v'$ the output vector representation
- This formulation is impractical because the cost of computing $\nabla \log p(w_O \mid w_I)$ is proportional to $W$, the size of the vocabulary (see the sketch below)
- Hierarchical Softmax is an alternative: use a binary Huffman tree to assign short codes to frequent words
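To make that O(W) cost concrete, here is a minimal NumPy sketch of the full-softmax skip-gram probability; the array names and sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative sizes (assumptions): vocabulary W, embedding dimension d.
W, d = 10_000, 300
rng = np.random.default_rng(0)
V_in = rng.normal(scale=0.01, size=(W, d))   # input vectors v_w
V_out = rng.normal(scale=0.01, size=(W, d))  # output vectors v'_w

def log_p(center: int, context: int) -> float:
    """log p(w_O | w_I) under the full softmax.

    Note the sum over all W output vectors - this is exactly the O(W)
    term that makes the exact gradient impractical at scale.
    """
    scores = V_out @ V_in[center]   # shape (W,)
    scores -= scores.max()          # numerical stability
    log_z = np.log(np.exp(scores).sum())
    return scores[context] - log_z

print(log_p(center=10, context=42))
```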
- Negative Sampling is another alternative: train logistic regression to distinguish the observed context word from k sampled noise words, i.e. a good model should separate data from noise (see the sketch below)
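A minimal sketch of the negative-sampling objective for one (center, context) pair; the unigram^(3/4) noise distribution and k around 5 follow the paper, while the names, sizes, and placeholder counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W, d, k = 10_000, 300, 5                      # vocab size, dim, negatives per pair
V_in = rng.normal(scale=0.01, size=(W, d))
V_out = rng.normal(scale=0.01, size=(W, d))

# Noise distribution: unigram counts raised to the 3/4 power (paper's choice).
counts = rng.integers(1, 1_000, size=W).astype(float)   # placeholder counts
P_noise = counts ** 0.75
P_noise /= P_noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(center: int, context: int) -> float:
    """log σ(v'_ctx · v_c) + Σ_i log σ(-v'_neg_i · v_c).

    Replaces the O(W) softmax sum with only k + 1 dot products.
    """
    negatives = rng.choice(W, size=k, p=P_noise)
    pos = np.log(sigmoid(V_out[context] @ V_in[center]))
    neg = np.log(sigmoid(-(V_out[negatives] @ V_in[center]))).sum()
    return pos + neg

print(neg_sampling_objective(center=10, context=42))
```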
- Subsampling of frequent words is another extension: the most frequent words ('the', 'a', etc.) occur hundreds of millions of times but carry little information, so each occurrence is randomly discarded with a probability that grows with the word's frequency (see the sketch below)
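A minimal sketch of the subsampling rule, using the paper's discard probability P(w) = 1 - sqrt(t / f(w)) with the suggested threshold t = 1e-5; the example word frequencies are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1e-5  # paper's suggested threshold

# Illustrative unigram frequencies (assumptions): 'the' is very frequent,
# 'volga' is rare.
freqs = {"the": 5e-2, "france": 1e-4, "volga": 1e-6}

def discard_prob(f: float) -> float:
    """P(w) = 1 - sqrt(t / f); rare words (f <= t) are never discarded."""
    return max(0.0, 1.0 - np.sqrt(t / f))

for word, f in freqs.items():
    print(f"{word:8s} f={f:.0e}  P(discard)={discard_prob(f):.3f}")

# During training, each occurrence of w is simply skipped with probability P(w):
def keep(word: str) -> bool:
    return rng.random() >= discard_prob(freqs[word])
```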
- In theory we can train the skip-gram using all of the n-grams but that'd be too computationally intensive
- Phrases are multi-word units such as "Golden State Warriors", but also "Mark Zuckerberg", "Steve Ballmer", "New York Times"
- To find which phrases occur in the text, score word pairs using only their unigram and bigram counts:
  $$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$
- δ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words from being formed (see the sketch below)
- Bigrams scoring above a chosen threshold are merged into single tokens; repeating the pass a few times with decreasing thresholds yields phrases longer than two words
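A minimal sketch of the phrase score; the counts below are made-up illustrative numbers, not real corpus statistics:

```python
def score(bigram_count: int, c_wi: int, c_wj: int, delta: float = 5.0) -> float:
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))."""
    return (bigram_count - delta) / (c_wi * c_wj)

# Illustrative counts (assumptions): "new york" co-occurs far more often than
# chance given its unigram counts, while "this is" does not, even though both
# bigrams are frequent in absolute terms.
print(score(bigram_count=80_000, c_wi=200_000, c_wj=100_000))      # "new york"
print(score(bigram_count=50_000, c_wi=5_000_000, c_wj=9_000_000))  # "this is"

# A threshold between the two scores would merge "new_york" into one token
# and leave "this is" as two tokens.
```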
- Best representations of phrases are learned by a model with the hierarchical softmax and subsampling
- Why can words be meaningfully combined by element-wise addition of their vector representations?
- Word vectors are LINEARLY related to the inputs of the softmax nonlinearity
- Vectors are a representation of the context in which a word appears (since they are trained to predict before/after words)
- The sum of two word vectors corresponds to the product of the two context distributions, and the product works as an AND: words assigned high probability by both vectors get high probability
- Thus, if "Volga River" frequently appears in the same sentences as "Russian" and "river", then vec("Russian") + vec("river") ≈ vec("Volga River") (see the sketch below)
- Empirically, nearest-neighbor tokens are much closer in meaning than those produced by previous models
- Notice that even though word2vec was trained on ~30B words, about 2 orders of magnitude more data than other models, it was trained in 1 day!
- Hyperparameter specification should be a task-specific decision.