diff --git a/docs/notebooks/doc2vec-IMDB.ipynb b/docs/notebooks/doc2vec-IMDB.ipynb index 5b680eddad..61e83a9459 100644 --- a/docs/notebooks/doc2vec-IMDB.ipynb +++ b/docs/notebooks/doc2vec-IMDB.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset" + "# Gensim `Doc2Vec` Tutorial on the IMDB Sentiment Dataset" ] }, { @@ -16,7 +16,7 @@ "In this tutorial, we will learn how to apply Doc2vec using gensim by recreating the results of Le and Mikolov 2014. \n", "\n", "### Bag-of-words Model\n", - "Previous state-of-the-art document representations were based on the bag-of-words model, which represent input documents as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents \n", + "Early state-of-the-art document representations were based on the bag-of-words model, which represent input documents as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents \n", "(1) `John likes to watch movies. Mary likes movies too.` \n", "(2) `John also likes to watch football games.` \n", "are used to construct a length 10 list of words \n", @@ -26,24 +26,29 @@ "(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]` \n", "Bag-of-words models are surprisingly effective but still lose information about word order. Bag of n-grams models consider word phrases of length n to represent documents as fixed-length vectors to capture local word order but suffer from data sparsity and high dimensionality.\n", "\n", - "### Word2vec Model\n", - "Word2vec is a more recent model that embeds words in a high-dimensional vector space using a shallow neural network. The result is a set of word vectors where vectors close together in vector space have similar meanings based on context, and word vectors distant to each other have differing meanings. For example, `strong` and `powerful` would be close together and `strong` and `Paris` would be relatively far. 
There are two versions of this model based on skip-grams and continuous bag of words.\n", + "### `Word2Vec`\n", + "`Word2Vec` is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant from each other have differing meanings. For example, `strong` and `powerful` would be close together and `strong` and `Paris` would be relatively far. There are two versions of this model based on skip-grams (SG) and continuous-bag-of-words (CBOW), both implemented by the gensim `Word2Vec` class.\n", "\n", "\n", - "#### Word2vec - Skip-gram Model\n", - "The skip-gram word2vec model, for example, takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network based on the fake task of given an input word, giving us a predicted probability distribution of nearby words to the input. The hidden-to-output weights in the neural network give us the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings. We use one-hot encoding for the words.\n", + "#### `Word2Vec` - Skip-gram Model\n", + "The skip-gram word2vec model, for example, takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network on the synthetic task of predicting, given an input word, a probability distribution over the words near it. A virtual one-hot encoding of words goes through a 'projection layer' to the hidden layer; these projection weights are later interpreted as the word embeddings. So if the hidden layer has 300 neurons, this network will give us 300-dimensional word embeddings.\n", "\n", - "#### Word2vec - Continuous-bag-of-words Model\n", - "Continuous-bag-of-words Word2vec is very similar to the skip-gram model. 
It is also a 1-hidden-layer neural network. The fake task is based on the input context words in a window around a center word, predict the center word. Again, the hidden-to-output weights give us the word embeddings and we use one-hot encoding.\n", + "#### `Word2Vec` - Continuous-bag-of-words Model\n", + "Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network. The synthetic training task now uses the average of multiple input context words, rather than a single word as in skip-gram, to predict the center word. Again, the projection weights that turn one-hot words into averageable vectors, of the same width as the hidden layer, are interpreted as the word embeddings. \n", "\n", - "### Paragraph Vector\n", - "Le and Mikolov 2014 introduces the Paragraph Vector, which outperforms more naïve representations of documents such as averaging the Word2vec word vectors of a document. The idea is straightforward: we act as if a paragraph (or document) is just another vector like a word vector, but we will call it a paragraph vector. We determine the embedding of the paragraph in vector space in the same way as words. Our paragraph vector model considers local word order like bag of n-grams, but gives us a denser representation in vector space compared to a sparse, high-dimensional representation.\n", + "But, Word2Vec doesn't yet get us fixed-size vectors for longer texts.\n", + "\n", + "\n", + "### Paragraph Vector, aka gensim `Doc2Vec`\n", + "The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the Paragraph Vector, which usually outperforms such simple-averaging.\n", + "\n", + "The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. 
Gensim's `Doc2Vec` class implements this algorithm. \n", "\n", "#### Paragraph Vector - Distributed Memory (PV-DM)\n", - "This is the Paragraph Vector model analogous to Continuous-bag-of-words Word2vec. The paragraph vectors are obtained by training a neural network on the fake task of inferring a center word based on context words and a context paragraph. A paragraph is a context for all words in the paragraph, and a word in a paragraph can have that paragraph as a context. \n", + "This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.\n", "\n", "#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n", - "This is the Paragraph Vector model analogous to Skip-gram Word2vec. The paragraph vectors are obtained by training a neural network on the fake task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.\n", + "This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)\n", "\n", "### Requirements\n", "The following python modules are dependencies for this tutorial:\n", @@ -63,7 +68,9 @@ "metadata": {}, "source": [ "Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial. 
\n", - "The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/" + "The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/\n", + "\n", + "This cell will only reattempt steps (such as downloading the compressed data) if their output isn't already present, so it is safe to re-run until it completes successfully. " ] }, { @@ -75,11 +82,22 @@ "name": "stdout", "output_type": "stream", "text": [ - "Total running time: 0.00035199999999990794\n" + "IMDB archive directory already available without download.\n", + "Cleaning up dataset...\n", + " train/pos: 12500 files\n", + " train/neg: 12500 files\n", + " test/pos: 12500 files\n", + " test/neg: 12500 files\n", + " train/unsup: 50000 files\n", + "Success, alldata-id.txt is available for next steps.\n", + "CPU times: user 17.3 s, sys: 14.1 s, total: 31.3 s\n", + "Wall time: 1min 2s\n" ] } ], "source": [ + "%%time \n", + "\n", "import locale\n", "import glob\n", "import os.path\n", @@ -87,11 +105,13 @@ "import tarfile\n", "import sys\n", "import codecs\n", - "import smart_open\n", + "from smart_open import smart_open\n", + "import re\n", "\n", "dirname = 'aclImdb'\n", "filename = 'aclImdb_v1.tar.gz'\n", "locale.setlocale(locale.LC_ALL, 'C')\n", + "all_lines = []\n", "\n", "if sys.version > '3':\n", " control_chars = [chr(0x85)]\n", @@ -104,14 +124,9 @@ " # Replace breaks with spaces\n", " norm_text = norm_text.replace('
', ' ')\n", " # Pad punctuation with spaces on both sides\n", - " for char in ['.', '\"', ',', '(', ')', '!', '?', ';', ':']:\n", - " norm_text = norm_text.replace(char, ' ' + char + ' ')\n", + " norm_text = re.sub(r\"([\\.\\\",\\(\\)!\\?;:])\", \" \\\\1 \", norm_text)\n", " return norm_text\n", "\n", - "import time\n", - "import smart_open\n", - "start = time.clock()\n", - "\n", "if not os.path.isfile('aclImdb/alldata-id.txt'):\n", " if not os.path.isdir(dirname):\n", " if not os.path.isfile(filename):\n", @@ -119,52 +134,44 @@ " print(\"Downloading IMDB archive...\")\n", " url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + filename\n", " r = requests.get(url)\n", - " with smart_open.smart_open(filename, 'wb') as f:\n", + " with smart_open(filename, 'wb') as f:\n", " f.write(r.content)\n", + " # if error here, try `tar xfz aclImdb_v1.tar.gz` outside notebook, then re-run this cell\n", " tar = tarfile.open(filename, mode='r')\n", " tar.extractall()\n", " tar.close()\n", + " else:\n", + " print(\"IMDB archive directory already available without download.\")\n", "\n", - " # Concatenate and normalize test/train data\n", + " # Collect & normalize test/train data\n", " print(\"Cleaning up dataset...\")\n", " folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']\n", - " alldata = u''\n", " for fol in folders:\n", " temp = u''\n", + " newline = \"\\n\".encode(\"utf-8\")\n", " output = fol.replace('/', '-') + '.txt'\n", " # Is there a better pattern to use?\n", " txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))\n", - " for txt in txt_files:\n", - " with smart_open.smart_open(txt, \"rb\") as t:\n", - " t_clean = t.read().decode(\"utf-8\")\n", - " for c in control_chars:\n", - " t_clean = t_clean.replace(c, ' ')\n", - " temp += t_clean\n", - " temp += \"\\n\"\n", - " temp_norm = normalize_text(temp)\n", - " with smart_open.smart_open(os.path.join(dirname, output), \"wb\") as n:\n", - " n.write(temp_norm.encode(\"utf-8\"))\n", - " alldata 
+= temp_norm\n", + " print(\" %s: %i files\" % (fol, len(txt_files)))\n", + " with smart_open(os.path.join(dirname, output), \"wb\") as n:\n", + " for i, txt in enumerate(txt_files):\n", + " with smart_open(txt, \"rb\") as t:\n", + " one_text = t.read().decode(\"utf-8\")\n", + " for c in control_chars:\n", + " one_text = one_text.replace(c, ' ')\n", + " one_text = normalize_text(one_text)\n", + " all_lines.append(one_text)\n", + " n.write(one_text.encode(\"utf-8\"))\n", + " n.write(newline)\n", "\n", - " with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:\n", - " for idx, line in enumerate(alldata.splitlines()):\n", + " # Save to disk for instant re-use on any future runs\n", + " with smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:\n", + " for idx, line in enumerate(all_lines):\n", " num_line = u\"_*{0} {1}\\n\".format(idx, line)\n", " f.write(num_line.encode(\"utf-8\"))\n", "\n", - "end = time.clock()\n", - "print (\"Total running time: \", end-start)" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "import os.path\n", - "assert os.path.isfile(\"aclImdb/alldata-id.txt\"), \"alldata-id.txt unavailable\"" + "assert os.path.isfile(\"aclImdb/alldata-id.txt\"), \"alldata-id.txt unavailable\"\n", + "print(\"Success, alldata-id.txt is available for next steps.\")" ] }, { @@ -176,28 +183,32 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "100000 docs: 25000 train-sentiment, 25000 test-sentiment\n" + "100000 docs: 25000 train-sentiment, 25000 test-sentiment\n", + "CPU times: user 5.3 s, sys: 1.25 s, total: 6.55 s\n", + "Wall time: 6.74 s\n" ] } ], "source": [ + "%%time\n", + "\n", "import gensim\n", "from gensim.models.doc2vec import TaggedDocument\n", "from collections import namedtuple\n", - "from smart_open import smart_open\n", 
"\n", + "# this data object class suffices as a `TaggedDocument` (with `words` and `tags`) \n", + "# plus adds other state helpful for our later evaluation/reporting\n", "SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')\n", "\n", - "alldocs = [] # Will hold all docs in original order\n", - "with smart_open('aclImdb/alldata-id.txt', 'rb') as alldata:\n", - " alldata = alldata.read().decode('utf-8')\n", + "alldocs = []\n", + "with smart_open('aclImdb/alldata-id.txt', 'rb', encoding='utf-8') as alldata:\n", " for line_no, line in enumerate(alldata):\n", " tokens = gensim.utils.to_unicode(line).split()\n", " words = tokens[1:]\n", @@ -208,9 +219,26 @@ "\n", "train_docs = [doc for doc in alldocs if doc.split == 'train']\n", "test_docs = [doc for doc in alldocs if doc.split == 'test']\n", - "doc_list = alldocs[:] # For reshuffling per pass\n", "\n", - "print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))" + "print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the native document-order has similar-sentiment documents in large clumps – which is suboptimal for training – we work with a once-shuffled copy of the training set." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from random import shuffle\n", + "doc_list = alldocs[:] \n", + "shuffle(doc_list)" ] }, { @@ -229,7 +257,7 @@ "`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`\n", "\n", "We vary the following parameter choices:\n", - "* 100-dimensional vectors, as the 400-d vectors of the paper don't seem to offer much benefit on this task\n", + "* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of memory and, in our tests of this task, don't seem to offer much benefit\n", "* Similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out\n", "* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`\n", "* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)\n", @@ -245,13 +273,16 @@ "name": "stdout", "output_type": "stream", "text": [ - "Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)\n", - "Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)\n", - "Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)\n" + "Doc2Vec(dbow,d100,n5,mc2,t4) vocabulary scanned & state initialized\n", + "Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4) vocabulary scanned & state initialized\n", + "Doc2Vec(dm/c,d100,n5,w5,mc2,t4) vocabulary scanned & state initialized\n", + "CPU times: user 28.7 s, sys: 414 ms, total: 29.1 s\n", + "Wall time: 29.1 s\n" ] } ], "source": [ + "%%time\n", "from gensim.models import Doc2Vec\n", "import gensim.models.doc2vec\n", "from collections import OrderedDict\n", @@ -261,20 +292,21 @@ "assert gensim.models.doc2vec.FAST_VERSION > -1, \"This will be painfully slow otherwise\"\n", "\n", "simple_models = [\n", - " # PV-DM 
w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size\n", - " Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),\n", - " # PV-DBOW \n", - " Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),\n", - " # PV-DM w/ average\n", - " Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),\n", + " # PV-DBOW plain\n", + " Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, \n", + " epochs=20, workers=cores),\n", + " # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes\n", + " Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0, \n", + " epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),\n", + " # PV-DM w/ concatenation - big, slow, experimental mode\n", + " # window=5 (both sides) approximates paper's apparent 10-word total window size\n", + " Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0, \n", + " epochs=20, workers=cores),\n", "]\n", "\n", - "# Speed up setup by sharing results of the 1st model's vocabulary scan\n", - "simple_models[0].build_vocab(alldocs) # PV-DM w/ concat requires one special NULL word so it serves as template\n", - "print(simple_models[0])\n", - "for model in simple_models[1:]:\n", - " model.reset_from(simple_models[0])\n", - " print(model)\n", + "for model in simple_models:\n", + " model.build_vocab(alldocs)\n", + " print(\"%s vocabulary scanned & state initialized\" % model)\n", "\n", "models_by_name = OrderedDict((str(model), model) for model in simple_models)" ] @@ -283,20 +315,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Le and Mikolov notes that combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. We will follow, pairing the models together for evaluation. 
Here, we concatenate the paragraph vectors obtained from each model." + "Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. We will follow, pairing the models together for evaluation. Here, we concatenate the paragraph vectors obtained from each model with the help of a thin wrapper class included in a gensim test module. (Note that this is a separate, later concatenation of output-vectors, distinct from the kind of input-window-concatenation enabled by the `dm_concat=1` mode above.)" ] }, { "cell_type": "code", "execution_count": 5, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from gensim.test.test_doc2vec import ConcatenatedDoc2Vec\n", - "models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])\n", - "models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])" + "models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])\n", + "models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])" ] }, { @@ -317,49 +347,34 @@ "cell_type": "code", "execution_count": 6, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/daniel/miniconda3/envs/gensim/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. 
Please use the pandas.tseries module instead.\n", - " from pandas.core import datetools\n" - ] - } - ], + "outputs": [], "source": [ "import numpy as np\n", "import statsmodels.api as sm\n", "from random import sample\n", - "\n", - "# For timing\n", - "from contextlib import contextmanager\n", - "from timeit import default_timer\n", - "import time \n", - "\n", - "@contextmanager\n", - "def elapsed_timer():\n", - " start = default_timer()\n", - " elapser = lambda: default_timer() - start\n", - " yield lambda: elapser()\n", - " end = default_timer()\n", - " elapser = lambda: end-start\n", " \n", "def logistic_predictor_from_data(train_targets, train_regressors):\n", + " \"\"\"Fit a statsmodel logistic predictor on supplied data\"\"\"\n", " logit = sm.Logit(train_targets, train_regressors)\n", " predictor = logit.fit(disp=0)\n", " # print(predictor.summary())\n", " return predictor\n", "\n", - "def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):\n", + "def error_rate_for_model(test_model, train_set, test_set, \n", + " reinfer_train=False, reinfer_test=False, \n", + " infer_steps=None, infer_alpha=None, infer_subsample=0.2):\n", " \"\"\"Report error rate on test_doc sentiments, using supplied model and train_docs\"\"\"\n", "\n", - " train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])\n", + " train_targets = [doc.sentiment for doc in train_set]\n", + " if reinfer_train:\n", + " train_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in train_set]\n", + " else:\n", + " train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]\n", " train_regressors = sm.add_constant(train_regressors)\n", " predictor = logistic_predictor_from_data(train_targets, train_regressors)\n", "\n", " test_data = test_set\n", - " if infer:\n", + " if reinfer_test:\n", " if infer_subsample < 1.0:\n", " 
test_data = sample(test_data, int(infer_subsample * len(test_data)))\n", " test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]\n", @@ -379,18 +394,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Bulk Training" + "## Bulk Training & Per-Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) with added shuffling of corpus on each pass.\n", - "\n", - "Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.\n", + "Note that doc-vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.\n", "\n", - "We evaluate each model's sentiment predictive power based on error rate, and the evaluation is repeated after each pass so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. \n", + "We evaluate each model's sentiment predictive power based on error rate, and the evaluation is done for each model. 
\n", "\n", "(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3 main models takes about an hour.)" ] @@ -398,13 +411,11 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", - "best_error = defaultdict(lambda: 1.0) # To selectively print only best errors achieved" + "error_rates = defaultdict(lambda: 1.0) # To selectively print only best errors achieved" ] }, { @@ -416,208 +427,82 @@ "name": "stdout", "output_type": "stream", "text": [ - "START 2017-07-08 17:48:01.470463\n", - "*0.404640 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 80.4s 2.3s\n", - "*0.361200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 80.4s 10.9s\n", - "*0.247520 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.0s 1.1s\n", - "*0.201200 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 31.0s 3.5s\n", - "*0.264120 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 38.5s 0.7s\n", - "*0.203600 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 38.5s 4.7s\n", - "*0.216600 : 1 passes : dbow+dmm 0.0s 1.7s\n", - "*0.199600 : 1 passes : dbow+dmm_inferred 0.0s 10.6s\n", - "*0.244800 : 1 passes : dbow+dmc 0.0s 2.0s\n", - "*0.219600 : 1 passes : dbow+dmc_inferred 0.0s 15.0s\n", - "Completed pass 1 at alpha 0.025000\n", - "*0.349560 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 52.7s 0.6s\n", - "*0.147400 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 20.3s 0.5s\n", - "*0.209200 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 28.3s 0.5s\n", - "*0.140280 : 2 passes : dbow+dmm 0.0s 1.4s\n", - "*0.149360 : 2 passes : dbow+dmc 0.0s 2.2s\n", - "Completed pass 2 at alpha 0.023800\n", - "*0.308760 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 50.4s 0.6s\n", - "*0.126880 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 19.5s 0.5s\n", - "*0.192560 : 3 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 37.8s 
0.7s\n", - "*0.124440 : 3 passes : dbow+dmm 0.0s 1.8s\n", - "*0.126280 : 3 passes : dbow+dmc 0.0s 1.7s\n", - "Completed pass 3 at alpha 0.022600\n", - "*0.277160 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 75.2s 0.7s\n", - "*0.119120 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.0s 2.6s\n", - "*0.177960 : 4 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 48.3s 0.8s\n", - "*0.118000 : 4 passes : dbow+dmm 0.0s 2.2s\n", - "*0.119400 : 4 passes : dbow+dmc 0.0s 2.0s\n", - "Completed pass 4 at alpha 0.021400\n", - "*0.256040 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 75.2s 0.8s\n", - "*0.256800 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 75.2s 9.0s\n", - "*0.115120 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 34.0s 1.6s\n", - "*0.115200 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 34.0s 3.5s\n", - "*0.171840 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 42.5s 0.9s\n", - "*0.202400 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 42.5s 6.2s\n", - "*0.111920 : 5 passes : dbow+dmm 0.0s 2.0s\n", - "*0.118000 : 5 passes : dbow+dmm_inferred 0.0s 11.6s\n", - "*0.113040 : 5 passes : dbow+dmc 0.0s 2.2s\n", - "*0.115600 : 5 passes : dbow+dmc_inferred 0.0s 17.3s\n", - "Completed pass 5 at alpha 0.020200\n", - "*0.236880 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 70.1s 2.0s\n", - "*0.109720 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 32.2s 0.9s\n", - "*0.166320 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.8s 0.9s\n", - "*0.108720 : 6 passes : dbow+dmm 0.0s 2.1s\n", - "*0.108480 : 6 passes : dbow+dmc 0.0s 2.0s\n", - "Completed pass 6 at alpha 0.019000\n", - "*0.221640 : 7 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 84.7s 0.9s\n", - "*0.107120 : 7 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.3s 1.9s\n", - "*0.164000 : 7 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.0s 0.9s\n", - "*0.106160 : 7 passes : dbow+dmm 0.0s 2.0s\n", - "*0.106680 : 7 passes : dbow+dmc 0.0s 
2.0s\n", - "Completed pass 7 at alpha 0.017800\n", - "*0.209360 : 8 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 64.0s 0.8s\n", - "*0.106200 : 8 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.2s 0.8s\n", - "*0.161360 : 8 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.0s 0.9s\n", - "*0.104480 : 8 passes : dbow+dmm 0.0s 3.0s\n", - "*0.105640 : 8 passes : dbow+dmc 0.0s 2.0s\n", - "Completed pass 8 at alpha 0.016600\n", - "*0.203520 : 9 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 66.6s 1.0s\n", - "*0.105120 : 9 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 39.1s 1.1s\n", - "*0.160960 : 9 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.7s 0.7s\n", - " 0.104840 : 9 passes : dbow+dmm 0.0s 2.0s\n", - "*0.104240 : 9 passes : dbow+dmc 0.0s 2.0s\n", - "Completed pass 9 at alpha 0.015400\n", - "*0.195840 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 66.5s 1.7s\n", - "*0.197600 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 66.5s 10.1s\n", - "*0.104280 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.3s 0.8s\n", - " 0.115200 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 31.3s 4.7s\n", - "*0.158800 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.5s 0.9s\n", - "*0.182800 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 44.5s 6.3s\n", - "*0.102760 : 10 passes : dbow+dmm 0.0s 3.1s\n", - "*0.110000 : 10 passes : dbow+dmm_inferred 0.0s 11.3s\n", - "*0.103920 : 10 passes : dbow+dmc 0.0s 2.2s\n", - "*0.109200 : 10 passes : dbow+dmc_inferred 0.0s 16.4s\n", - "Completed pass 10 at alpha 0.014200\n", - "*0.190800 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 71.3s 1.0s\n", - "*0.103840 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 33.8s 0.8s\n", - "*0.157440 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.5s 0.9s\n", - " 0.103240 : 11 passes : dbow+dmm 0.0s 3.0s\n", - " 0.104360 : 11 passes : dbow+dmc 0.0s 2.1s\n", - "Completed pass 11 at alpha 0.013000\n", - "*0.188520 : 12 passes : 
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 65.4s 0.8s\n", - " 0.104600 : 12 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 33.3s 1.0s\n", - "*0.157240 : 12 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 53.5s 1.7s\n", - " 0.103880 : 12 passes : dbow+dmm 0.0s 2.8s\n", - " 0.104640 : 12 passes : dbow+dmc 0.0s 2.6s\n", - "Completed pass 12 at alpha 0.011800\n", - "*0.185760 : 13 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 71.8s 1.7s\n", - " 0.104040 : 13 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.9s 1.0s\n", - "*0.155960 : 13 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 45.7s 0.8s\n", - "*0.102720 : 13 passes : dbow+dmm 0.0s 2.0s\n", - " 0.104120 : 13 passes : dbow+dmc 0.0s 1.9s\n", - "Completed pass 13 at alpha 0.010600\n", - "*0.181960 : 14 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 80.3s 0.8s\n", - "*0.103680 : 14 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 23.1s 0.7s\n", - "*0.155040 : 14 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 31.4s 1.5s\n", - "*0.102440 : 14 passes : dbow+dmm 0.0s 1.6s\n", - "*0.103680 : 14 passes : dbow+dmc 0.0s 1.7s\n", - "Completed pass 14 at alpha 0.009400\n", - "*0.180680 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 48.5s 0.7s\n", - "*0.186000 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 48.5s 12.0s\n", - " 0.104840 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 23.4s 0.7s\n", - "*0.101600 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 23.4s 4.3s\n", - "*0.154000 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 53.2s 2.0s\n", - " 0.191600 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 53.2s 4.8s\n", - " 0.102960 : 15 passes : dbow+dmm 0.0s 3.1s\n", - "*0.108400 : 15 passes : dbow+dmm_inferred 0.0s 11.4s\n", - " 0.104280 : 15 passes : dbow+dmc 0.0s 1.7s\n", - "*0.098400 : 15 passes : dbow+dmc_inferred 0.0s 14.1s\n", - "Completed pass 15 at alpha 0.008200\n", - "*0.180320 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 68.3s 1.0s\n", - "*0.103600 : 16 
passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 28.5s 2.1s\n", - " 0.154640 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.4s 0.7s\n", - " 0.102520 : 16 passes : dbow+dmm 0.0s 1.9s\n", - "*0.102480 : 16 passes : dbow+dmc 0.0s 2.9s\n", - "Completed pass 16 at alpha 0.007000\n", - "*0.178160 : 17 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 63.4s 2.0s\n", - "*0.103360 : 17 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.5s 0.8s\n", - " 0.154160 : 17 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 40.9s 1.0s\n", - "*0.102320 : 17 passes : dbow+dmm 0.0s 3.0s\n", - " 0.102680 : 17 passes : dbow+dmc 0.0s 2.0s\n", - "Completed pass 17 at alpha 0.005800\n", - "*0.177520 : 18 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 55.1s 0.8s\n", - "*0.103120 : 18 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 24.8s 0.7s\n", - "*0.153040 : 18 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 32.9s 0.8s\n", - " 0.102440 : 18 passes : dbow+dmm 0.0s 1.7s\n", - "*0.102480 : 18 passes : dbow+dmc 0.0s 2.6s\n", - "Completed pass 18 at alpha 0.004600\n", - "*0.177240 : 19 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 57.2s 1.5s\n", - "*0.103080 : 19 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 20.6s 1.8s\n", - "*0.152680 : 19 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.8s 0.8s\n", - " 0.102800 : 19 passes : dbow+dmm 0.0s 1.8s\n", - " 0.102600 : 19 passes : dbow+dmc 0.0s 1.7s\n", - "Completed pass 19 at alpha 0.003400\n", - "*0.176080 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 50.2s 0.6s\n", - " 0.188000 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 50.2s 8.5s\n", - " 0.103400 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 19.7s 0.7s\n", - " 0.111600 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 19.7s 4.1s\n", - "*0.152680 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 30.5s 0.6s\n" + "Training Doc2Vec(dbow,d100,n5,mc2,t4)\n", + "CPU times: user 18min 41s, sys: 59.7 s, total: 19min 41s\n", + "Wall time: 6min 49s\n", + "\n", + 
"Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)\n", + "CPU times: user 1.85 s, sys: 226 ms, total: 2.07 s\n", + "Wall time: 673 ms\n", + "\n", + "0.102600 Doc2Vec(dbow,d100,n5,mc2,t4)\n", + "\n", + "Training Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "CPU times: user 28min 21s, sys: 1min 30s, total: 29min 52s\n", + "Wall time: 9min 22s\n", + "\n", + "Evaluating Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "CPU times: user 1.71 s, sys: 175 ms, total: 1.88 s\n", + "Wall time: 605 ms\n", + "\n", + "0.154280 Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "\n", + "Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "CPU times: user 55min 8s, sys: 36.5 s, total: 55min 44s\n", + "Wall time: 14min 43s\n", + "\n", + "Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "CPU times: user 1.47 s, sys: 110 ms, total: 1.58 s\n", + "Wall time: 533 ms\n", + "\n", + "0.225760 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "\n" ] - }, + } + ], + "source": [ + "for model in simple_models: \n", + " print(\"Training %s\" % model)\n", + " %time model.train(doc_list, total_examples=len(doc_list), epochs=model.epochs)\n", + " \n", + " print(\"\\nEvaluating %s\" % model)\n", + " %time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n", + " error_rates[str(model)] = err_rate\n", + " print(\"\\n%f %s\\n\" % (err_rate, model))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - " 0.182800 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 30.5s 4.7s\n", - " 0.102600 : 20 passes : dbow+dmm 0.0s 1.6s\n", - " 0.112800 : 20 passes : dbow+dmm_inferred 0.0s 8.8s\n", - "*0.102440 : 20 passes : dbow+dmc 0.0s 2.1s\n", - " 0.103600 : 20 passes : dbow+dmc_inferred 0.0s 12.4s\n", - "Completed pass 20 at alpha 0.002200\n", - "END 2017-07-08 18:39:42.878219\n" + "\n", + "Evaluating 
Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "CPU times: user 4.13 s, sys: 459 ms, total: 4.59 s\n", + "Wall time: 1.72 s\n", + "\n", + "0.103360 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "\n", + "\n", + "Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "CPU times: user 4.03 s, sys: 351 ms, total: 4.38 s\n", + "Wall time: 1.38 s\n", + "\n", + "0.105080 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "\n" ] } ], "source": [ - "from random import shuffle\n", - "import datetime\n", - "\n", - "alpha, min_alpha, passes = (0.025, 0.001, 20)\n", - "alpha_delta = (alpha - min_alpha) / passes\n", - "\n", - "print(\"START %s\" % datetime.datetime.now())\n", - "\n", - "for epoch in range(passes):\n", - " shuffle(doc_list) # Shuffling gets best results\n", - " \n", - " for name, train_model in models_by_name.items():\n", - " # Train\n", - " duration = 'na'\n", - " train_model.alpha, train_model.min_alpha = alpha, alpha\n", - " with elapsed_timer() as elapsed:\n", - " train_model.train(doc_list, total_examples=len(doc_list), epochs=1)\n", - " duration = '%.1f' % elapsed()\n", - " \n", - " # Evaluate\n", - " eval_duration = ''\n", - " with elapsed_timer() as eval_elapsed:\n", - " err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)\n", - " eval_duration = '%.1f' % eval_elapsed()\n", - " best_indicator = ' '\n", - " if err <= best_error[name]:\n", - " best_error[name] = err\n", - " best_indicator = '*' \n", - " print(\"%s%f : %i passes : %s %ss %ss\" % (best_indicator, err, epoch + 1, name, duration, eval_duration))\n", - "\n", - " if ((epoch + 1) % 5) == 0 or epoch == 0:\n", - " eval_duration = ''\n", - " with elapsed_timer() as eval_elapsed:\n", - " infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)\n", - " eval_duration = '%.1f' % 
eval_elapsed()\n", - " best_indicator = ' '\n", - " if infer_err < best_error[name + '_inferred']:\n", - " best_error[name + '_inferred'] = infer_err\n", - " best_indicator = '*'\n", - " print(\"%s%f : %i passes : %s %ss %ss\" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))\n", - "\n", - " print('Completed pass %i at alpha %f' % (epoch + 1, alpha))\n", - " alpha -= alpha_delta\n", - " \n", - "print(\"END %s\" % str(datetime.datetime.now()))" + "for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]: \n", + " print(\"\\nEvaluating %s\" % model)\n", + " %time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)\n", + " error_rates[str(model)] = err_rate\n", + " print(\"\\n%f %s\\n\" % (err_rate, model))" ] }, { @@ -629,31 +514,26 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Err rate Model\n", - "0.098400 dbow+dmc_inferred\n", - "0.101600 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred\n", - "0.102320 dbow+dmm\n", - "0.102440 dbow+dmc\n", - "0.103080 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)\n", - "0.108400 dbow+dmm_inferred\n", - "0.152680 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)\n", - "0.176080 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)\n", - "0.182800 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred\n", - "0.186000 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred\n" + "Err_rate Model\n", + "0.102600 Doc2Vec(dbow,d100,n5,mc2,t4)\n", + "0.103360 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "0.105080 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "0.154280 Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "0.225760 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n" ] } ], "source": [ - "# Print best error rates achieved\n", - "print(\"Err rate Model\")\n", - "for rate, name in sorted((rate, name) for name, rate in 
best_error.items()):\n", + "# Compare error rates achieved, best-to-worst\n", + "print(\"Err_rate Model\")\n", + "for rate, name in sorted((rate, name) for name, rate in error_rates.items()):\n", " print(\"%f %s\" % (rate, name))" ] }, @@ -661,7 +541,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In our testing, contrary to the results of the paper, PV-DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement over averaging vectors. There best results reproduced are just under 10% error rate, still a long way from the paper's reported 7.42% error rate." + "In our testing, contrary to the results of the paper, on this problem, PV-DBOW alone performs as well as anything else. Concatenating vectors from different models only sometimes offers a tiny predictive improvement – and stays generally close to the best-performing solo model included. \n", + "\n", + "The best results achieved here are right around a 10% error rate, still a long way from the paper's reported 7.42% error rate. \n", + "\n", + "(Other trials not shown, with larger vectors and other changes, also don't come close to the paper's reported value. Others around the net have reported a similar inability to reproduce the paper's best numbers. 
The PV-DM/C mode improves a bit with many more training epochs – but doesn't reach parity with PV-DBOW.)" ] }, { @@ -680,20 +564,34 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "for doc 73872...\n", - "Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4):\n", - " [(73872, 0.7427197694778442), (43744, 0.42404329776763916), (75113, 0.41938722133636475)]\n", - "Doc2Vec(dbow,d100,n5,mc2,s0.001,t4):\n", - " [(73872, 0.9305995106697083), (64147, 0.6267511248588562), (80042, 0.6207213401794434)]\n", - "Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4):\n", - " [(73872, 0.7893393039703369), (67773, 0.7167356014251709), (32802, 0.6937947273254395)]\n" + "for doc 66229...\n", + "Doc2Vec(dbow,d100,n5,mc2,t4):\n", + " [(66229, 0.9756568670272827), (66223, 0.5901858806610107), (81851, 0.5678753852844238)]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4):\n", + " [(66229, 0.9355567097663879), (71883, 0.49743932485580444), (74232, 0.49549904465675354)]\n", + "Doc2Vec(dm/c,d100,n5,w5,mc2,t4):\n", + " [(66229, 0.9248996376991272), (97306, 0.4372865557670593), (99824, 0.40370166301727295)]\n" ] } ], @@ -709,7 +607,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. 
Note the defaults for inference are very abbreviated – just 3 steps starting at a high alpha – and likely need tuning for other applications.)" + "(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. The inference defaults may benefit from tuning for each dataset or set of model parameters.)" ] }, { @@ -721,22 +619,32 @@ }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, + "execution_count": 18, + "metadata": { + "scrolled": false + }, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ - "TARGET (71919): «tweety is perched in his cage on the ledge and sylvester is across the street at the \" bird watching society \" building on about the same level . both are looking through binoculars , and they spot each other . tweety then utters his famous phrase , \" i taught i taw a puddy cat . \" ( thought i saw a pussy cat . ) sylvester scampers over to grab the bird . tweety flies out of his cage and granny comes to the rescue , bashing the cat and driving it away . the rest of the animated short shows a series of attempts by sylvester to grab tweetie - a familiar theme - and how either bad luck or granny thwarts him every time . the cat dons disguises and tries a number of clever schemes . . . all of which are funny and very entertaining . in all , a good cartoon and fun to watch .»\n", + "TARGET (34105): «even a decade after \" frontline \" aired on the abc , near as i can tell , \" current affairs \" programmes are still using the same tricks over and over . 
time after time , \" today tonight \" and \" a current affair \" are seen to be hiding behind the facade of journalistic professionalism , and yet they feed us nothing but tired stories about weight-loss and dodgy tradesmen , shameless network promotions and pointless celebrity puff-pieces . having often been subjected to that entertainment-less void between 'the simpsons' at 6 : 00 pm and 'sale of the century' ( or 'temptation' ) at 7 : 00 pm , i was all too aware of the little tricks that these shows would use to attract ratings . fortunately , four rising comedians – rob sitch , jane kennedy , santo cilauro and tom gleisner – were also all too aware of all this , and they crafted their frustrations into one of the most wickedly-hilarious media satires you'll ever see on television . the four entertainers had already met with comedic success , their previous most memorable television stint being on 'the late show , ' the brilliant saturday night variety show which ran for two seasons from 1992-1993 , and also featured fellow comedians mick molloy , tony martin , jason stephens and judith lucy . \" frontline \" boasts an ensemble of colourful characters , each with their own distinct and quirky personality . the current-affairs show is headed by nicely-groomed mike moore ( rob sitch ) , an ambitious , pretentious , dim-witted narcissist . mike works under the delusion that the show is serving a vital role for society – he is always adamant that they \" maintain their journalistic integrity \" – and his executive producers have excelled into getting him to believe just that . mike is basically a puppet to bring the news to the people ; occasionally he gets the inkling that he is being led along by the nose , but usually this thought is stamped out via appeals to his vanity or promises of a promotion . brooke vandenberg ( jane kennedy ) is the senior female reporter on the show . 
she is constantly concerned about her looks and public profile , and , if the rumours are to be believed , she has had a romantic liaison with just about every male celebrity in existence . another equally amoral reporter , marty di stasio , is portrayed by tiriel mora , who memorably played inept solicitor dennis denuto in the australian comedy classic , 'the castle . ' emma ward ( alison whyte ) is the line producer on the show , and the single shining beacon of morality on the \" frontline \" set . then there's the highly-amusing weatherman , geoffrey salter ( santo cilauro ) , mike's best friend and confidant . geoff makes a living out of always agreeing with mike's opinion , and of laughing uproariously at his jokes before admitting that he doesn't get them . for each of the shows three seasons , we are treated to a different ep , executive producer . brian thompson ( bruno lawrence ) , who unfortunately passed away in 1995 , runs the programme during season 1 . he has a decent set of morals , and is always civil to his employees , and yet is more-than-willing to cast these aside in favour of high ratings . sam murphy ( kevin j . wilson ) arrives on set in season 2 , a hard-nosed , smooth-talking producer who knows exactly how to string mike along ; the last episode of the second season , when mike finally gets the better of him , is a classic moment . graeme \" prowsey \" prowse ( steve bisley ) , ep for the third season , is crude , unpleasant and unashamedly sexist . it's , therefore , remarkable that you eventually come to like him . with its cast of distinctive , exaggerated characters , \" frontline \" has a lot of fun satirising current-affairs programmes and their dubious methods for winning ratings . many of the episodes were shot quickly and cheaply , often implementing many plot ideas from recent real-life situations , but this never really detracts from the show's topicality ten years on . 
celebrity cameos come in abundance , with some of the most memorable appearances including pauline hanson , don burke and jon english . watch out for harry shearer's hilarious appearance in the season 2 episode \" changing the face of current affairs , \" playing larry hadges , an american hired by the network to reform the show . particularly in the third season , i noticed that \" frontline \" boasted an extremely gritty form of black humour , uncharacteristic for such a light-hearted comedy show . genuinely funny moments are born from brooke being surreptitiously bribed into having an abortion , murder by a crazed gunman and mike treacherously betraying his best friend's hopes and dreams , only to be told that he is a good friend . the series' final minute – minus an added-scene during the credits , which was probably added just in case a fourth season was to be produced – was probably the greatest , blackest ending to a comedy series that i've yet seen . below is listed a very tentative list of my top five favourite \" frontline \" episodes , but , make no mistake , every single half-hour is absolutely hilarious and hard-hitting satire . 1 ) \" the siege \" ( season 1 ) 2 ) \" give 'em enough rope \" ( season 2 ) 3 ) \" addicted to fame \" ( season 3 ) 4 ) \" basic instincts \" ( season 2 ) 5 ) \" add sex and stir \" ( season 1 )»\n", "\n", - "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4):\n", + "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dbow,d100,n5,mc2,t4):\n", "\n", - "MOST (30440, 0.752430260181427): «in tweety's s . o . s , sylvester goes from picking garbage cans to being a stowaway on a cruise ship that happens to carry a certain canary bird-and granny , his owner . uh-oh ! once again , tweety and granny provide many obstacles to the cat's attempts to get the bird . sylvester also gets seasick quite a few times , too . 
and the second time the red-nosed feline goes to the place on the ship that has something that cures his ailments , tweety replaces it with nitroglycerin . so now sylvester can blow fire ! i'll stop here and say this is another excellent cartoon directed by friz freling starring the popular cat-and-bird duo . tweety's s . o . s is most highly recommended .»\n", + "MOST (34106, 0.6284705996513367): «the sad thing about frontline is that once you watch three or four episodes of it you really begin to understand that it is not far away from what happens in real life . what is really sad is that it also makes extremely funny . the frontline team in series one consists of brian thompson ( bruno lawrence ) - a man who truly lives and dies merely by the ratings his show gets . occasionally his stunts to achieve these ratings see him run in with his line producer emma thompson ( alison whyte ) ; a woman who hasn't lost all her journalistic integrity and is prepared to defend moral scruples on occasions . the same cannot be said of reporter brooke vandenberg ( jane kennedy ) - a reporter who has had all the substance sucked out of her- so much so that when interviewing ben elton she needs to be instructed to laugh . her reports usually consist of interviewing celebrities ( with whom she has or hasn't 'crossed paths' with before ) or scandalous unethical reports that usually backfire . martin de stasio ( tiriel mora ) is the reporter with whom the team relies on for gravitas and dignity , as he has the smarts of 21 years of journalism behind him . his doesn't have principles so much as a nous of what makes a good journalistic story , though he does draw the occasional line . parading over this chaos ( in name ) is mike moore ( rob sitch ) an egotistical , naive reporter who can't see that he's only a pretty face for the grubby journalism . he often finds his morals being compromised simply because brian appeals to his vanity and allows his stupidity to do the rest . 
frontline is the sort of show that there needs to be more of , because it shows that while in modern times happiness , safety and deep political insight are interesting things ; it's much easier to rate with scandal , fear and tabloid celebrities .»\n", "\n", - "MEDIAN (32141, 0.3800385594367981): «my entire family enjoyed this film , including 2 small children . great values without sex , violence , drugs , nudity , or profanity . also no zillion dollar special effects were added to try to misdirect viewers from a poorly written storyline . a simple little family fun movie . we especially like the songs in the movie . but we only got to hear a portion of the songs . . . mostly during the end credits . . . would love to buy a sound track cd from this movie . this is my 4th bill hillman movie and they all have the same guidelines as mentioned above . with all the movies out there that you don't want your kids to watch , this hillman fella has a no risk rating . we love his movies .»\n", + "MEDIAN (35245, 0.2309201955795288): «\" hell to pay \" bills itself as the rebirth of the classic western . . . it succeeds as a western genre movie that the entire family could see and not unlike the films baby-boomers experienced decades ago . the good guys are good and the bad guys are really bad ! . bo svenson , stella stevens , lee majors , andrew prine ( excellent in this film ) tim thomerson and james drury are all great and it's fun to see them again . james drury really shines in this one , maybe even better than his days as \" the virginian . \" in a way , \" hell to pay \" reminds me of those movies in the 60's where actors you know from so many shows make an appearance . if you're of a certain age , buck taylor , peter brown and denny miller and william smith provide a \" wow \" factor because we seldom get to see these icons these days . \" hell to pay \" features screen legends along with newer names in hollywood . 
most notable in the cast of \" newbies \" is rachel kimsey ( rebekah ) , who i've seen lately on \" the young and the restless \" and kevin kazakoff , who plays the angst-ridden kirby , a war-weary man who's torn between wanting to live and let live or stepping in to \" do the right thing . \" william gregory lee is excellent as chance , kirby's mischievous and womanizing brother . katie keane plays rachel , rebekah's sister , a woman who did what was necessary to stay alive but giving up her pride in the process . in a small but memorable role , jeff davis plays mean joe , a former confederate with a rather nasty mean streak . i think we'll be seeing more of these fine actors in the future . \" hell to pay \" is a fun movie with a great story to tell grab the popcorn , we're headin' west ! .»\n", "\n", - "LEAST (57712, -0.051298510283231735): «in a recent biography of burt lancaster , go tell the spartans is described as the best vietnam war film that nobody ever saw . hopefully with television and video products that will be corrected . i prefer to think of it as a prequel to platoon . this film is set in 1964 when america's participation was limited to advisers by this time raised to about 20 , 000 of them by president kennedy . whether if kennedy had lived and won a second term he would have increased our commitment to a half a million men as lyndon johnson did is open to much historical speculation . major burt lancaster heads such an advisory team with his number two captain marc singer . they get some replacements and a new assignment to build a fortress where the french tried years ago and failed . the replacements are a really mixed bag , a sergeant who lancaster has served with before and respects highly in jonathan goldsmith , a very green and eager second lieutenant in joe unger , a demolitions man who is a draftee and at that time vietnam service was a strictly volunteer thing in craig wasson , and a medic who is also a junkie in dennis howard . 
for one reason or another all of these get sent forward to build that outpost in a place that suddenly has acquired military significance . i said before this could be a prequel to platoon . platoon is set in the time a few years later when the usa was fully militarily committed in vietnam . platoon raises the same issues about the futility of that war , but i think go tell the spartans does a much better job . hard to bring your best effort into the fight since who and what you're fighting and fighting for seems to change weekly . originally this project was for william holden and i'm surprised holden passed on it . maybe for the better because lancaster strikes just the right note as the professional soldier in what was a backwater assignment who politics has passed over for promotion . knowing all that you will understand why lancaster makes the final decision he does . two others of note are evan kim who is the head of the south vietnamese regulars and interpreter who lancaster and company are training . he epitomizes the brutality of the struggle for us in a way that we can't appreciate from the other side because we never meet any of the viet cong by name . dolph sweet plays the general in charge of the american vietnam commitment , a general harnitz . he is closest to a real character because the general in charge their before johnson raised the troop levels and put in william westmoreland was paul harkins . joe unger is who i think gives the best performance as the shavetail lieutenant with all the conventional ideas of war and believes we have got to be with the good guys since we are americans . he learns fast that you issue uniforms for a reason and wars against people who don't have them are the most difficult . i think one could get a deep understanding of just what america faced in 1964 in vietnam by watching go tell the spartans .»\n", + "LEAST (261, -0.09666291624307632): «an unusual film from ringo lam and one that's strangely under-appreciated . 
the mix of fantasy kung-fu with a more realistic depiction of swords and spears being driven thru bodies is startling especially during the first ten minutes . a horseback rider get chopped in two and his waist and legs keep riding the horse . several horses get chopped up . it's very unexpected . the story is very simple , fong and his shaolin brothers are captured by a crazed maniac general and imprisoned in the red lotus temple which seems to be more of a torture chamber then a temple . the general has a similarity to kurtz in apocalypse now as he spouts warped philosophy and makes frightening paintings with human blood . the production is very impressive and the setting is bleak . blood is everywhere . the action is very well done and mostly coherent unlike many hk action scenes from the time . sometimes the movie veers into absurdity or the effects are cheesy but it's never bad enough to ruin the film . find this one , it's one of the best hk kung fu films from the early nineties . just remember it's not child friendly .»\n", "\n" ] } @@ -757,7 +665,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "(Somewhat, in terms of reviewer tone, movie genre, etc... the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST.)" + "Somewhat: in terms of reviewer tone, movie genre, etc., the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST, especially if the MOST has a cosine-similarity > 0.5. Re-run the cell to try another random target document." 
] }, { @@ -769,10 +677,8 @@ }, { "cell_type": "code", - "execution_count": 12, - "metadata": { - "collapsed": true - }, + "execution_count": 13, + "metadata": {}, "outputs": [], "source": [ "word_models = simple_models[:]" @@ -780,83 +686,91 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "most similar words for 'thrilled' (276 occurences)\n" + "most similar words for 'spoilt' (97 occurences)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "data": { "text/html": [ - "
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)
[('pleased', 0.8135600090026855),
\n", - "('excited', 0.7601636648178101),
\n", - "('surprised', 0.7497514486312866),
\n", - "('delighted', 0.740871012210846),
\n", - "('impressed', 0.7300887107849121),
\n", - "('disappointed', 0.715817391872406),
\n", - "('shocked', 0.7109759449958801),
\n", - "('intrigued', 0.7000594139099121),
\n", - "('amazed', 0.6994709968566895),
\n", - "('fascinated', 0.6952326893806458),
\n", - "('saddened', 0.68060702085495),
\n", - "('satisfied', 0.674963116645813),
\n", - "('apprehensive', 0.6572576761245728),
\n", - "('entertained', 0.654381275177002),
\n", - "('disgusted', 0.6502282023429871),
\n", - "('overjoyed', 0.6485082507133484),
\n", - "('stunned', 0.6478738784790039),
\n", - "('entranced', 0.6438385844230652),
\n", - "('amused', 0.6437265872955322),
\n", - "('dissappointed', 0.6427538394927979)]
[(\"ifans'\", 0.44280144572257996),
\n", - "('shay', 0.4335209131240845),
\n", - "('crappers', 0.4007232189178467),
\n", - "('overflow', 0.40028804540634155),
\n", - "('yum', 0.3929170072078705),
\n", - "(\"monkey'\", 0.38661277294158936),
\n", - "('kholi', 0.38401469588279724),
\n", - "('fun-bloodbath', 0.38145124912261963),
\n", - "('breathed', 0.373812735080719),
\n", - "(\"eszterhas'\", 0.3729144334793091),
\n", - "('nob', 0.3723628520965576),
\n", - "(\"meatloaf's\", 0.3720172643661499),
\n", - "('ruegger', 0.3683895468711853),
\n", - "(\"haynes'\", 0.36665791273117065),
\n", - "('feigning', 0.36445197463035583),
\n", - "('torches', 0.35865518450737),
\n", - "('sirens', 0.3581739068031311),
\n", - "('insides', 0.35690629482269287),
\n", - "('swackhamer', 0.35603001713752747),
\n", - "('trolls', 0.3526684641838074)]
[('pleased', 0.7576382160186768),
\n", - "('excited', 0.7351139187812805),
\n", - "('delighted', 0.7220871448516846),
\n", - "('intrigued', 0.6748061180114746),
\n", - "('surprised', 0.6552557945251465),
\n", - "('shocked', 0.6505781412124634),
\n", - "('disappointed', 0.6428648233413696),
\n", - "('impressed', 0.6426182389259338),
\n", - "('overjoyed', 0.6259098052978516),
\n", - "('saddened', 0.6148285865783691),
\n", - "('anxious', 0.6140503883361816),
\n", - "('fascinated', 0.6126223802566528),
\n", - "('skeptical', 0.6025052070617676),
\n", - "('suprised', 0.5986943244934082),
\n", - "('upset', 0.596437931060791),
\n", - "('relieved', 0.593376874923706),
\n", - "('psyched', 0.5923721790313721),
\n", - "('captivated', 0.5753644704818726),
\n", - "('astonished', 0.574415922164917),
\n", - "('horrified', 0.5716636180877686)]
" + "
Doc2Vec(dbow,d100,n5,mc2,t4)Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
[(\"wives'\", 0.4262964725494385),
\n", + "('horrificaly', 0.4177134335041046),
\n", + "(\"snit'\", 0.4037289619445801),
\n", + "('improf', 0.40169233083724976),
\n", + "('humiliatingly', 0.3946930170059204),
\n", + "('heart-pounding', 0.3938479423522949),
\n", + "(\"'jo'\", 0.38460421562194824),
\n", + "('kieron', 0.37991276383399963),
\n", + "('linguistic', 0.3727714419364929),
\n", + "('rothery', 0.3719364404678345),
\n", + "('zellwegger', 0.370682954788208),
\n", + "('never-released', 0.36564797163009644),
\n", + "('coffeeshop', 0.36534833908081055),
\n", + "('slater--these', 0.3643302917480469),
\n", + "('over-plotted', 0.36348140239715576),
\n", + "('synchronism', 0.36320072412490845),
\n", + "('exploitations', 0.3631579875946045),
\n", + "(\"donor's\", 0.36226314306259155),
\n", + "('neend', 0.3619685769081116),
\n", + "('renaud', 0.3611547350883484)]
[('spoiled', 0.6693772077560425),
\n", + "('ruined', 0.5701743960380554),
\n", + "('dominated', 0.554553747177124),
\n", + "('marred', 0.5456377267837524),
\n", + "('undermined', 0.5353708267211914),
\n", + "('unencumbered', 0.5345744490623474),
\n", + "('dwarfed', 0.5331343412399292),
\n", + "('followed', 0.5186703205108643),
\n", + "('entranced', 0.513541042804718),
\n", + "('emboldened', 0.5100494623184204),
\n", + "('shunned', 0.5044804215431213),
\n", + "('disgusted', 0.5000460743904114),
\n", + "('overestimated', 0.49955034255981445),
\n", + "('bolstered', 0.4971669018268585),
\n", + "('replaced', 0.4966174364089966),
\n", + "('bookended', 0.49495506286621094),
\n", + "('blowout', 0.49287083745002747),
\n", + "('overshadowed', 0.48964253067970276),
\n", + "('played', 0.48709338903427124),
\n", + "('accompanied', 0.47834640741348267)]
[('spoiled', 0.6672338247299194),
\n", + "('troubled', 0.520033597946167),
\n", + "('bankrupted', 0.509053647518158),
\n", + "('ruined', 0.4965386986732483),
\n", + "('misguided', 0.4900725483894348),
\n", + "('devoured', 0.48988765478134155),
\n", + "('ravaged', 0.4861036539077759),
\n", + "('frustrated', 0.4841104745864868),
\n", + "('suffocated', 0.4828023314476013),
\n", + "('investigated', 0.47958582639694214),
\n", + "('tormented', 0.4791877865791321),
\n", + "('traumatized', 0.4785040616989136),
\n", + "('shaken', 0.4784379005432129),
\n", + "('persecuted', 0.4774147868156433),
\n", + "('crippled', 0.4771782457828522),
\n", + "('torpedoed', 0.4764551818370819),
\n", + "('plagued', 0.47006863355636597),
\n", + "('drowned', 0.4688340723514557),
\n", + "('prompted', 0.4678872525691986),
\n", + "('abandoned', 0.4652657210826874)]
" ], "text/plain": [ "" ] }, - "execution_count": 13, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -871,7 +785,7 @@ " break\n", "# or uncomment below line, to just pick a word from the relevant domain:\n", "#word = 'comedy/drama'\n", - "similars_per_model = [str(model.most_similar(word, topn=20)).replace('), ','),
\\n') for model in word_models]\n", + "similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),
\\n') for model in word_models]\n", "similar_table = (\"
\" +\n", " \"\".join([str(model) for model in word_models]) + \n", " \"
\" +\n", @@ -885,9 +799,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Do the DBOW words look meaningless? That's because the gensim DBOW model doesn't train word vectors – they remain at their random initialized values – unless you ask with the `dbow_words=1` initialization parameter. Concurrent word-training slows DBOW mode significantly, and offers little improvement (and sometimes a little worsening) of the error rate on this IMDB sentiment-prediction task. \n", + "Do the DBOW words look meaningless? That's because the gensim DBOW model doesn't train word vectors – they remain at their random initialized values – unless you ask with the `dbow_words=1` initialization parameter. Concurrent word-training slows DBOW mode significantly, and offers little improvement (and sometimes a little worsening) of the error rate on this IMDB sentiment-prediction task, but may be appropriate on other tasks, or if you also need word-vectors. \n", "\n", - "Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). (All DM modes inherently involve word vector training concurrent with doc vector training.)" + "Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). 
(All DM modes inherently involve word-vector training concurrent with doc-vector training.)" ] }, { @@ -899,27 +813,67 @@ }, { "cell_type": "code", - "execution_count": 14, - "metadata": { - "collapsed": true - }, - "outputs": [], + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Success, questions-words.txt is available for next steps.\n" + ] + } + ], + "source": [ + "# grab the file if not already local\n", + "questions_filename = 'questions-words.txt'\n", + "if not os.path.isfile(questions_filename):\n", + " # Download the analogy questions file\n", + " print(\"Downloading analogy questions file...\")\n", + " url = u'https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt'\n", + " r = requests.get(url)\n", + " with smart_open(questions_filename, 'wb') as f:\n", + " f.write(r.content)\n", + "assert os.path.isfile(questions_filename), \"questions-words.txt unavailable\"\n", + "print(\"Success, questions-words.txt is available for next steps.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Doc2Vec(dbow,d100,n5,mc2,t4): 0.00% correct (0 of 14657)\n", + "Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4): 17.37% correct (2546 of 14657)\n", + "Doc2Vec(dm/c,d100,n5,w5,mc2,t4): 19.20% correct (2814 of 14657)\n" + ] + } + ], "source": [ - "# Download this file: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt\n", - "# and place it in the local directory\n", - "# Note: this takes many minutes\n", - "if os.path.isfile('questions-words.txt'):\n", - " for model in word_models:\n", - " sections = model.accuracy('questions-words.txt')\n", - " correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n", - " print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))" + "# Note: this analysis takes many minutes\n", + "for model in word_models:\n", + " score, sections = model.wv.evaluate_word_analogies('questions-words.txt')\n", + " correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n", + " print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies – at least for the DM/concat and DM/mean models which actually train word vectors. (The untrained random-initialized words of the DBOW model of course fail miserably.)" + "Even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies – at least for the DM/mean and DM/concat models which actually train word vectors. 
(The untrained random-initialized words of the DBOW model of course fail miserably.)" ] }, { @@ -931,7 +885,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -942,36 +896,119 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To mix the Google dataset (if locally available) into the word tests..." + "### Advanced technique: re-inferring doc-vectors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the bulk-trained vectors had much of their training early, when the model itself was still settling, it is *sometimes* the case that rather than using the bulk-trained vectors, new vectors re-inferred from the final state of the model serve better as the input/test data for downstream tasks. \n", + "\n", + "Our `error_rate_for_model()` function already had a non-default option to re-infer vectors before training/testing the classifier, so here we test that option. (This takes as long as, or longer than, the initial bulk training, as inference is only single-threaded.)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Evaluating Doc2Vec(dbow,d100,n5,mc2,t4) re-inferred\n", + "CPU times: user 7min 9s, sys: 1.55 s, total: 7min 11s\n", + "Wall time: 7min 10s\n", + "\n", + "0.102240 Doc2Vec(dbow,d100,n5,mc2,t4)_reinferred\n", + "\n", + "Evaluating Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4) re-inferred\n", + "CPU times: user 9min 48s, sys: 1.53 s, total: 9min 49s\n", + "Wall time: 9min 48s\n", + "\n", + "0.146200 Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)_reinferred\n", + "\n", + "Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4) re-inferred\n", + "CPU times: user 16min 13s, sys: 1.32 s, total: 16min 14s\n", + "Wall time: 16min 13s\n", + "\n", + "0.218120 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_reinferred\n", + "\n", + "Evaluating 
Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4) re-inferred\n", + "CPU times: user 15min 50s, sys: 1.63 s, total: 15min 52s\n", + "Wall time: 15min 49s\n", + "\n", + "0.102120 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)_reinferred\n", + "\n", + "Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4) re-inferred\n", + "CPU times: user 22min 53s, sys: 1.81 s, total: 22min 55s\n", + "Wall time: 22min 52s\n", + "\n", + "0.104320 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_reinferred\n", + "\n" + ] + } + ], + "source": [ + "for model in simple_models + [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]: \n", + " print(\"Evaluating %s re-inferred\" % str(model))\n", + " pseudomodel_name = str(model)+\"_reinferred\"\n", + " %time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs, reinfer_train=True, reinfer_test=True, infer_subsample=1.0)\n", + " error_rates[pseudomodel_name] = err_rate\n", + " print(\"\\n%f %s\\n\" % (err_rate, pseudomodel_name))" + ] + }, + { + "cell_type": "code", + "execution_count": 25, "metadata": { - "collapsed": true + "scrolled": true }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Err_rate Model\n", + "0.102120 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)_reinferred\n", + "0.102240 Doc2Vec(dbow,d100,n5,mc2,t4)_reinferred\n", + "0.102600 Doc2Vec(dbow,d100,n5,mc2,t4)\n", + "0.103360 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "0.104320 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_reinferred\n", + "0.105080 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n", + "0.146200 Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)_reinferred\n", + "0.154280 Doc2Vec(\"alpha=0.05\",dm/m,d100,n5,w10,mc2,t4)\n", + "0.218120 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_reinferred\n", + 
"0.225760 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)\n" + ] + } + ], + "source": [ + "# Compare error rates achieved, best-to-worst\n", + "print(\"Err_rate Model\")\n", + "for rate, name in sorted((rate, name) for name, rate in error_rates.items()):\n", + " print(\"%f %s\" % (rate, name))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, "source": [ - "from gensim.models import KeyedVectors\n", - "w2v_g100b = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)\n", - "w2v_g100b.compact_name = 'w2v_g100b'\n", - "word_models.append(w2v_g100b)" + "Here, we do *not* see much benefit of re-inference. It's more likely to help if the initial training used fewer epochs (10 is also a common value in the literature for larger datasets), or perhaps in larger datasets. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "To get copious logging output from above steps..." + "### To get copious logging output from above steps..." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "import logging\n", @@ -984,25 +1021,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To auto-reload python code while developing..." + "### To auto-reload python code while developing..." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python [default]", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -1016,7 +1058,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.1" + "version": "3.6.6" } }, "nbformat": 4, diff --git a/docs/notebooks/doc2vec-lee.ipynb b/docs/notebooks/doc2vec-lee.ipynb index 9865096cdc..371f879f15 100644 --- a/docs/notebooks/doc2vec-lee.ipynb +++ b/docs/notebooks/doc2vec-lee.ipynb @@ -190,16 +190,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, we'll instantiate a Doc2Vec model with a vector size with 50 words and iterating over the training corpus 55 times. We set the minimum word count to 2 in order to give higher frequency words more weighting. Model accuracy can be improved by increasing the number of iterations but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes." + "Now, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions and iterate over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in published 'Paragraph Vectors' results, using 10s-of-thousands to millions of docs, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.\n", + "\n", + "However, this is a very very small dataset (300 documents) with shortish documents (a few hundred words). 
Adding training passes can sometimes help with such small datasets." ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)" + "model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)" ] }, { @@ -211,7 +213,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -237,15 +239,15 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CPU times: user 4.5 s, sys: 247 ms, total: 4.75 s\n", - "Wall time: 2.04 s\n" + "CPU times: user 4.61 s, sys: 814 ms, total: 5.43 s\n", + "Wall time: 2.68 s\n" ] } ], @@ -269,26 +271,26 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "array([ 0.03101196, 0.08118944, 0.10724881, -0.16268663, -0.12030419,\n", - " 0.07530276, -0.05967962, 0.01093007, 0.01722554, -0.16849394,\n", - " -0.09248347, 0.00667514, 0.05426382, -0.0725852 , 0.09535281,\n", - " -0.12534387, 0.08636193, -0.1029434 , -0.07632427, -0.24741814,\n", - " -0.1277334 , -0.09834807, -0.12880586, -0.07720284, -0.12248702,\n", - " -0.15788661, 0.17826575, -0.12920539, 0.02845461, -0.12751418,\n", - " 0.06129557, -0.02319777, 0.11814108, -0.08767211, -0.04094559,\n", - " -0.00681656, 0.00937355, 0.02168806, -0.03686712, 0.14234844,\n", - " -0.01192134, 0.06787674, -0.25467244, -0.22923732, -0.03031967,\n", - " -0.2362234 , 0.1105942 , 0.01180398, 0.01921744, -0.07667527],\n", + "array([ 0.24116205, 0.07339828, -0.27019867, -0.19452883, 0.126193 ,\n", + " 0.22654183, 0.26595142, 0.21971616, -0.03823646, -0.14102826,\n", + " 0.30460876, 0.0068176 , -0.1742173 , 0.05304497, 0.16511315,\n", + " -0.15094836, 0.14354771, 0.01259909, -0.17909774, 
0.07656667,\n", + " 0.15878952, -0.18826678, 0.03750297, -0.3339148 , -0.09979844,\n", + " -0.05963492, 0.00099474, -0.18307815, -0.00851006, -0.02054437,\n", + " 0.0683636 , -0.13510053, -0.05586798, -0.07510707, 0.13390398,\n", + " -0.08525871, -0.03863541, 0.03461651, -0.1619014 , 0.12662718,\n", + " 0.23388451, 0.11462782, -0.02873337, 0.16269833, -0.01474206,\n", + " 0.09754166, 0.12638392, -0.09281237, -0.04791372, 0.15747668],\n", " dtype=float32)" ] }, - "execution_count": 12, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -297,6 +299,15 @@ "model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `infer_vector()` does *not* take a string, but rather a list of string tokens, which should have already been tokenized the same way as the `words` property of original training document objects. \n", + "\n", + "Also note that because the underlying training/inference algorithms are an iterative approximation problem that makes use of internal randomization, repeated inferences of the same text will return slightly different vectors." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -313,9 +324,18 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" + ] + } + ], "source": [ "ranks = []\n", "second_ranks = []\n", @@ -337,7 +357,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 12, "metadata": { "scrolled": true }, @@ -345,16 +365,16 @@ { "data": { "text/plain": [ - "Counter({0: 284, 1: 13, 2: 2, 4: 1})" + "Counter({0: 292, 1: 8})" ] }, - "execution_count": 14, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "collections.Counter(ranks) # Results vary due to random seeding and very small corpus" + "collections.Counter(ranks) # Results vary between runs due to random seeding and very small corpus" ] }, { @@ -368,7 +388,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -379,11 +399,13 @@ "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):\n", "\n", - "MOST (299, 0.8637137413024902): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form 
it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»\n", + "MOST (299, 0.93604576587677): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win 
doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»\n", + "\n", + "SECOND-MOST (112, 0.8006965517997742): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat 
australia on our day he said»\n", "\n", - "MEDIAN (178, 0.2800390124320984): «year old middle eastern woman is said to be responding well to treatment after being diagnosed with typhoid in temporary holding centre on remote christmas island it could be hours before tests can confirm whether the disease has spread further two of the woman three children boy aged and year old girl have been quarantined with their mother in the christmas island hospital third child remains at the island sports hall where locals say conditions are crowded and hot all detainees on christmas island are being monitored by health team for signs of fever or abdominal pains the key symptoms of typhoid which is spread by contact with contaminated food or water hygiene measures have also been stepped up the western australian health department is briefing medical staff on infection control procedures but locals have expressed concern the disease could spread to the wider community»\n", + "MEDIAN (119, 0.26439014077186584): «australia is continuing to negotiate with the united states government in an effort to interview the australian david hicks who was captured fighting alongside taliban forces in afghanistan mr hicks is being held by the united states on board ship in the afghanistan region where the australian federal police and australian security intelligence organisation asio officials are trying to gain access foreign affairs minister alexander downer has also confirmed that the australian government is investigating reports that another australian has been fighting for taliban forces in afghanistan we often get reports of people going to different parts of the world and asking us to investigate them he said we always investigate sometimes it is impossible to find out we just don know in this case but it is not to say that we think there are lot of australians in afghanistan the only case we know is hicks mr downer says it is unclear when mr hicks will be back on australian soil but he 
is hopeful the americans will facilitate australian authorities interviewing him»\n", "\n", - "LEAST (11, 0.01867116428911686): «peru has entered two days of official mourning for the more than people killed in fire that destroyed part of downtown lima police say the fire began when fireworks cache exploded in shop just four blocks from peru congress in heritage listed area famed for its spanish colonial era architecture early evening crowds buying traditional fireworks for new year eve celebrations were trapped by the flames as they raced through surrounding markets and four storey apartment buildings local residents blame vendors of illegal fireworks and say the death toll was exacerbated by poor traffic control in the adjoining narrow street where cars themselves engulfed by fire trapped fleeing victims hospitals have urged the public to donate medicine for the hundreds of burns victims peru president alejandro toledo has cut short his beach holiday to oversee an inquiry»\n", + "LEAST (243, -0.12885713577270508): «four afghan factions have reached agreement on an interim cabinet during talks in germany the united nations says the administration which will take over from december will be headed by the royalist anti taliban commander hamed karzai it concludes more than week of negotiations outside bonn and is aimed at restoring peace and stability to the war ravaged country the year old former deputy foreign minister who is currently battling the taliban around the southern city of kandahar is an ally of the exiled afghan king mohammed zahir shah he will serve as chairman of an interim authority that will govern afghanistan for six month period before loya jirga or grand traditional assembly of elders in turn appoints an month transitional government meanwhile united states marines are now reported to have been deployed in eastern afghanistan where opposition forces are closing in on al qaeda soldiers reports from the area say there has been gun battle between the 
opposition and al qaeda close to the tora bora cave complex where osama bin laden is thought to be hiding in the south of the country american marines are taking part in patrols around the air base they have secured near kandahar but are unlikely to take part in any assault on the city however the chairman of the joint chiefs of staff general richard myers says they are prepared for anything they are prepared for engagements they re robust fighting force and they re absolutely ready to engage if that required he said»\n", "\n" ] } @@ -391,7 +413,7 @@ "source": [ "print('Document ({}): «{}»\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\n", "print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\\n' % model)\n", - "for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n", + "for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:\n", " print(u'%s %s: «%s»\\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))" ] }, @@ -399,30 +421,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Notice above that the most similar document is has a similarity score of ~80% (or higher). However, the similarity score for the second ranked documents should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself" + "Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0. However, the similarity score for the second-ranked document should be significantly lower (assuming the documents are in fact different) and the reasoning becomes obvious when we examine the text itself.\n", + "\n", + "We can run the next cell repeatedly to see a sampling of other target-document comparisons. 
" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Train Document (186): «united nationals secretary general kofi annan has accepted the nobel peace prize in the norwegian capital oslo declaring that to save one life is to save humanity itself mr annan told gala audience the world must respect the individual whose fundamental rights he says have been sacrificed too often for the good of the state the year old un chief native of ghana shares this year th nobel peace prize with the united nations as whole his award was for bringing new life to the world body in his fight for human rights and against aids and terrorism»\n", + "Train Document (289): «there is renewed attempt to move the debate over choosing an australian head of state forward after conference in southern new south wales at the weekend in corowa delegates adopted proposal which recommended plebiscite to direct another constitutional convention and referendum on republic and australian head of state committee will meet in about four weeks to work on the next step in the campaign one of the proposal developers historian walter phillips hopes there is vote on an australian head of state in about five years think that in five or six years we should be pretty near if we can get this process going and carried forward now we have to persuade our political leaders that it is something they should take up that going to be one of the problems mr phillips said»\n", "\n", - "Similar Document (207, 0.6752535104751587): «geoff huegill has continued his record breaking ways at the world cup short course swimming in melbourne bettering the australian record in the metres butterfly huegill beat fellow australian michael klim backing up after last night setting world record in the metres butterfly»\n", + "Similar Document (298, 0.7201520204544067): «university of canberra academic proposal for republic will 
be one of five discussed at an historic conference starting in corowa today the conference is part of centenary of federation celebrations and recognises the corowa conference of which began the process towards the federation of australia in university of canberra law lecturer bedeharris is proposing three referenda to determine the republic issue they would decide on whether the monarchy should be replaced the codification powers for head of state and the choice of republic model doctor harris says any constitutional change must involve all australians think it is very important that the people of australia be given the opporunity to choose or be consulted at every stage of the process»\n", "\n" ] } ], "source": [ - "# Pick a random document from the test corpus and infer a vector from the model\n", + "# Pick a random document from the corpus and infer a vector from the model\n", "doc_id = random.randint(0, len(train_corpus) - 1)\n", "\n", - "# Compare and print the most/median/least similar documents from the train corpus\n", + "# Compare and print the second-most-similar document\n", "print('Train Document ({}): «{}»\\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))\n", "sim_id = second_ranks[doc_id]\n", "print('Similar Document {}: «{}»\\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))" @@ -444,24 +468,32 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Test Document (23): «china said sunday it issued new regulations controlling the export of missile technology taking steps to ease concerns about transferring sensitive equipment to middle east countries particularly iran however the new rules apparently do not ban outright the transfer of specific items something washington long has urged beijing to do»\n", + "Test Document (6): «senior members of the saudi royal family paid at least million to osama bin laden terror group and 
the taliban for an agreement his forces would not attack targets in saudi arabia according to court documents the papers filed in us billion billion lawsuit in the us allege the deal was made after two secret meetings between saudi royals and leaders of al qa ida including bin laden the money enabled al qa ida to fund training camps in afghanistan later attended by the september hijackers the disclosures will increase tensions between the us and saudi arabia»\n", "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):\n", "\n", - "MOST (265, 0.42007705569267273): «the federal government is under fire from unions over new departmental report which recommends australia outsource information technology it to india the document says india has low cost skilled workforce the minister for foreign affairs and trade alexander downer has given his support to the document from his department entitled india new economy old economy the report says sectors like it finance and offer attractive direct investment opportunities it also says australian firms could become more competitive by outsourcing to the indian it sector the community and public sector union wendy caird says the government seems to be encouraging local companies to export jobs to india think that quite alarming obviously labour is great deal cheaper in india and that assisted by the indian government removing labour laws and bankruptcy laws ms caird said the union says while the initiative may create jobs in india it will not help australia rising unemployment»\n", + "MOST (261, 0.6407690048217773): «afghan opposition leaders meeting in germany have reached an agreement after seven days of talks on the structure of an interim post taliban government for afghanistan the agreement calls for the immediate assembly of temporary group of multi national peacekeepers in kabul and possibly other areas the four afghan factions have approved plan for member ruling council composed of chairman five 
deputy chairmen and other members the council would govern afghanistan for six months at which time traditional afghan assembly called loya jirga would be convened to decide on more permanent structure the agreement calls for elections within two years»\n", "\n", - "MEDIAN (257, 0.11956833302974701): «hundreds of fans stood vigil today for the immersion of george harrison ashes into the ganges river at the hindu holy city of benares but officials and sect leaders remained tightlipped on when or where last rites for the former beatle long time devotee of the hindu hare krishna sect would take place he was closely attached to benares where devout hindus come to scatter the ashes of their dead relatives in the ganges in ritual symbolising the journey of the soul towards eternal salvation the beatles former lead guitarist died on thursday of cancer aged amid chants and prayers of hare krishna devotees who were at his bedside according to details of the ceremony released by members of the hare krishna movement yesterday harrison widow olivia accompanied by son dhani were to scatter some of the ashes early this morning in discreet ceremony at hinduism holy river some of harrison ashes could also be immersed in the ganges at allahabad another holy spot for devout hindus about kilometres upstream from benares spokesman for the hare krishna group said tomorrow harrison family members were supposed to take part in special prayer meeting in vrindavan the birthplace of lord krishna km north of the indian capital the news brought hundreds of journalists fans and curious onlookers to benares odd ghats platforms or steps from which the ashes are strewn into the river this morning but as the day wore on local administration officials and hare krishna devotees in benares refused to confirm when and where along the ganges the ceremony would take place»\n", + "MEDIAN (103, 0.13398753106594086): «the hih royal commission has heard evidence that there were doubts about the company 
ability to pay all of its creditors three months before its collapse partner for accountancy firm ernst and young john gibbons says he and his colleague kim smith attended meeting with hih on november mr gibbons has told the commission hih chairman ray williams and finance director dominic federa were at that meeting mr gibbons said mr smith noted that if hih was wound up on that date there would be clear shortage of assets to pay creditors he says the directors were told it was highly likely all creditors would not receive per cent returns the commission has also heard that the accountancy firm told the directors that even with hih restructuring plans there was potential for insolvency»\n", "\n", - "LEAST (267, -0.22617124021053314): «israeli prime minister ariel sharon has opened an emergency security cabinet meeting after placing blame for recent suicide attacks squarely on palestinian leader yasser arafat called an urgent meeting of the heads of all the security systems and very shortly the government will hold special session the government will meet in order to make decisions about how to deal further with terrorism he said in national address on public television the government was to discuss its policy on the palestinian authority which mr sharon implied was the enemy of the jewish state and should bear the consequences those who rise up against us to kill us are responsible for their own destruction he said in statement interpreted by palestinian official as call for war arafat has made his strategic choices strategy of terrorism in choosing to try to win political accomplishments through murder and in choosing to allow the ruthless killing of civilians arafat has chosen the path of terrorism mr sharon said the government represents practically the whole of the israel public and we have the paramount goal and need for unity in order to cope with all the brutalities facing us he added tonight we heard declaration of war said chief palestinian negotiator 
saeb erakat on cnn television sharon has chosen the path of darkness even before his address israeli helicopters and warplanes attacked targets in the west bank and gaza strip including arafat offices and police headquarters in jenin and the palestinian leader three helicopters in gaza city the air strikes were launched on palestinian targets in the wake of weekend suicide attacks by the islamic militant group hamas which left israelis dead meanwhile hamas has defied the palestinian state of emergency and called for more suicide attacks against israel at the funeral of gunman who killed settler more than supporters of the hardline group gathered to bury year old muslim al aarage one of two palestinians who shot the settler dead on sunday in the north of the gaza strip before being killed by israeli soldiers the suicide operations will continue as long as the enemy continues its occupation of palestinian lands in the gaza strip and west bank militant from the group told crowd with loudspeaker when sharon kills women and children our people have the right to defend ourselves then they call us terrorists he said every religion and law in the world gives us the right to defend ourselves he said shortly before the air strikes began security services have arrested some militants from hamas and its smaller rival islamic jihad in the crackdown since sunday human rights group amnesty international has condemned deliberate attacks by the palestinian suicide bombers at the weekend these attacks are horrifying and tragic amnesty said in statement we call on armed groups to end immediately the direct targeting of civilians which contravenes the most fundamental principles of humanity the organisation called on the israeli government and the palestinian authority to remember that no abuses of human rights by armed groups can excuse violations of fundamental human rights and humanitarian law»\n", + "LEAST (264, -0.35531237721443176): «widespread damage from yesterday violent 
storms in new south wales has forced the government to declare more areas of the state natural disaster zones up to volunteers and fire fighters are continuing the big mop up state emergency services ses volunteers are still clearing some of thehuge trees that came crashing down on homes in sydney north martin walker was sitting on his back deck when the storm struck it sounded like freight train was about to hit our house you could hear it coming with such ferocity and as it hit all the trees just seemed to bend and there was stuff hitting the back of our house mr walker said pitwater bankstown sutherland hurstville and liverpool in sydney and gunnedah and tamworth in the state north west have been added to the list of natural disaster areas new south wales premier bob carr has inspected one of the worst hit parts wahroonga in sydney north struck by the of this storm damage we ve had storms before but never winds of this force and it was uneven and unpredictable in its impact mr carr said the final damage bill is expected to be more than million»\n", "\n" ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/neuscratch/Dev/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", + " if np.issubdtype(vec.dtype, np.int):\n" + ] } ], "source": [ @@ -485,6 +517,13 @@ "\n", "That's it! Doc2Vec is a great way to explore relationships between documents." 
] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -503,7 +542,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.2" + "version": "3.6.6" } }, "nbformat": 4, diff --git a/gensim/models/doc2vec.py b/gensim/models/doc2vec.py index d73e6e777a..bf1eac8264 100644 --- a/gensim/models/doc2vec.py +++ b/gensim/models/doc2vec.py @@ -743,7 +743,7 @@ def estimated_lookup_memory(self): """ return 60 * len(self.docvecs.offset2doctag) + 140 * len(self.docvecs.doctags) - def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5): + def infer_vector(self, doc_words, alpha=None, min_alpha=None, epochs=None, steps=None): """Infer a vector for given post-bulk training document. Notes @@ -756,12 +756,17 @@ def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5): doc_words : list of str A document for which the vector representation will be inferred. alpha : float, optional - The initial learning rate. + The initial learning rate. If unspecified, value from model initialization will be reused. min_alpha : float, optional - Learning rate will linearly drop to `min_alpha` as training progresses. - steps : int, optional - Number of times to train the new document. A higher value may slow down training, but it will result in more - stable representations. + Learning rate will linearly drop to `min_alpha` over all inference epochs. If unspecified, + value from model initialization will be reused. + epochs : int, optional + Number of times to train the new document. Larger values take more time, but may improve + quality and run-to-run stability of inferred vectors. If unspecified, the `epochs` value + from model initialization will be reused. 
+ steps : int, optional, deprecated + Previous name for `epochs`, still available for now for backward compatibility: if + `epochs` is unspecified but `steps` is, the `steps` value will be used. Returns ------- @@ -769,15 +774,19 @@ def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5): The inferred paragraph vector for the new document. """ + alpha = alpha or self.alpha + min_alpha = min_alpha or self.min_alpha + epochs = epochs or steps or self.epochs + doctag_vectors, doctag_locks = self.trainables.get_doctag_trainables(doc_words, self.docvecs.vector_size) doctag_indexes = [0] work = zeros(self.trainables.layer1_size, dtype=REAL) if not self.sg: neu1 = matutils.zeros_aligned(self.trainables.layer1_size, dtype=REAL) - alpha_delta = (alpha - min_alpha) / (steps - 1) + alpha_delta = (alpha - min_alpha) / max(epochs - 1, 1) - for i in range(steps): + for i in range(epochs): if self.sg: train_document_dbow( self, doc_words, doctag_indexes, alpha, work, diff --git a/gensim/test/test_doc2vec.py b/gensim/test/test_doc2vec.py index 559e166d4f..b80588c55e 100644 --- a/gensim/test/test_doc2vec.py +++ b/gensim/test/test_doc2vec.py @@ -513,8 +513,16 @@ def __init__(self, models): def __getitem__(self, token): return np.concatenate([model[token] for model in self.models]) - def infer_vector(self, document, alpha=0.1, min_alpha=0.0001, steps=5): - return np.concatenate([model.infer_vector(document, alpha, min_alpha, steps) for model in self.models]) + def __str__(self): + """Abbreviated name, built from submodels' names""" + return "+".join([str(model) for model in self.models]) + + @property + def epochs(self): + return self.models[0].epochs + + def infer_vector(self, document, alpha=None, min_alpha=None, epochs=None, steps=None): + return np.concatenate([model.infer_vector(document, alpha, min_alpha, epochs, steps) for model in self.models]) def train(self, *ignore_args, **ignore_kwargs): pass # train subcomponents individually diff --git a/setup.py 
b/setup.py index 132eb925c5..71c0f044f7 100644 --- a/setup.py +++ b/setup.py @@ -308,7 +308,7 @@ def finalize_options(self): 'numpy >= 1.11.3', 'scipy >= 0.18.1', 'six >= 1.5.0', - 'smart_open >= 1.2.1', + 'smart_open >= 1.2.1, < 1.6.0', ], tests_require=linux_testenv, extras_require={