The data consists of about 12,000 reviews, which is not enough to learn meaningful word vectors. We need to perform data augmentation to artificially create more information. When performing data augmentation in an image-related task, one could add noise or apply transformations to the images, and that could be sufficient. With words, however, we can't simply add random words to our reviews, as this would destroy the syntactic relation between the words. Instead, we're going to duplicate the dataset and drop words randomly depending on their frequency. Words that are very common will have a higher probability of being dropped, whereas very uncommon words will have a lower probability of being dropped. This is meant to retain critical information while artificially producing more data.
To determine whether a word is dropped, we're going to iterate through our vocabulary and assign a rejection probability to each word.
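A minimal sketch of this step, assuming the reviews are already tokenized into lists of words. The exact rejection formula is an assumption here (this uses the subsampling rule from the original word2vec paper), and the names `tokenized_reviews` and `threshold` are illustrative:

```python
from collections import Counter
import math

def rejection_probabilities(tokenized_reviews, threshold=1e-4):
    """Assign a rejection probability to every word in the vocabulary.

    Assumed formula: P_drop(w) = 1 - sqrt(threshold / freq(w)), so very
    frequent words get a high drop probability and rare words get a low one.
    """
    counts = Counter(word for review in tokenized_reviews for word in review)
    total = sum(counts.values())
    probs = {}
    for word, count in counts.items():
        freq = count / total
        probs[word] = max(0.0, 1.0 - math.sqrt(threshold / freq))
    return probs
```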
To augment the data, a new sentence with fewer words will be created by iterating through each word in the original sentence and comparing its rejection probability with a random number. We will only keep the sentences that we consider to be long enough, such as three or more words.
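A sketch of the augmentation pass under the same assumptions; the helper name `augment` and the `min_length` parameter are hypothetical:

```python
import random

def augment(tokenized_reviews, reject_prob, min_length=3, seed=0):
    """Create a shortened copy of each review by randomly dropping words.

    A word is kept only if a random draw exceeds its rejection probability,
    and the shortened review is kept only if it still has at least
    `min_length` words.
    """
    rng = random.Random(seed)
    augmented = []
    for review in tokenized_reviews:
        shortened = [w for w in review if rng.random() > reject_prob.get(w, 0.0)]
        if len(shortened) >= min_length:
            augmented.append(shortened)
    return augmented
```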
We want to calculate the probabilities of getting certain arrangements of words within some context. In other words, we want to find the probability of some context word appearing given a center word.
As an example, let's consider the sentence:
Again, like the coin toss, a multiplication of probabilities gives the likelihood of observing all of these context words together.
To generalize, the likelihood of a large body of text is given by taking the product over all the words in the corpus (each treated in turn as the center word) and, for each of them, the product over the words in its context.
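As a sketch in standard skip-gram notation (the symbols $T$, $m$, $w_t$, and $\theta$ are assumptions, not taken from the text above), this likelihood can be written as

$$
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)
$$

where $T$ is the number of words in the corpus, $m$ is the context window size, and $\theta$ stands for all the word vector parameters.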
Instead of using a single vector representation as assumed above, we will use two vector representations for each word in our implementation. This is because it makes more sense to have one vector for a word when it is a center word and another vector for when it is a context or outside word. This doesn't change the mathematics.
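A minimal sketch of what this looks like in code, assuming NumPy; the sizes and variable names are illustrative, not taken from the text:

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 100  # illustrative sizes, not from the text

# Two separate matrices: row i of `center_vectors` is word i's vector when it
# acts as the center word, and row i of `context_vectors` is its vector when
# it acts as a context (outside) word.
rng = np.random.default_rng(0)
center_vectors = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))
context_vectors = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))
```

After training, either matrix (or their average) can serve as the final word vectors.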
To summarize, we want to calculate the probability of a sequence of words appearing together, which will depend on our word vector components.
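One common way to express the probability of an outside word $o$ given a center word $c$ is the standard skip-gram softmax (whether the text uses exactly this form is an assumption; $u_o$ denotes the context vector of $o$, $v_c$ the center vector of $c$, and $V$ the vocabulary):

$$
P(o \mid c) = \frac{\exp\!\left(u_o^\top v_c\right)}{\sum_{w \in V} \exp\!\left(u_w^\top v_c\right)}
$$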