{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 3: Information Retrieval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Reading in data\n",
    "\n",
    "There are three components to this data:\n",
    "- documents with their ids and content – there are $1460$ of those to be precise;\n",
    "- questions / queries with their ids and content – there are $112$ of those;\n",
    "- mapping between the queries and relevant documents.\n",
    "\n",
    "First, let's read in documents from the `CISI.ALL` file and store the result in `documents` data structure – set of tuples of document ids matched with contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1460\n",
      " 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. \n"
     ]
    }
   ],
   "source": [
    "def read_documents():\n",
    "    f = open(\"cisi/CISI.ALL\")\n",
    "    merged = \"\"\n",
    "    \n",
    "    for a_line in f.readlines():\n",
    "        if a_line.startswith(\".\"):\n",
    "            merged += \"\\n\" + a_line.strip()\n",
    "        else:\n",
    "            merged += \" \" + a_line.strip()\n",
    "    \n",
    "    documents = {}\n",
    "\n",
    "    content = \"\"\n",
    "    doc_id = \"\"\n",
    "\n",
    "    for a_line in merged.split(\"\\n\"):\n",
    "        if a_line.startswith(\".I\"):\n",
    "            doc_id = a_line.split(\" \")[1].strip()\n",
    "        elif a_line.startswith(\".X\"):\n",
    "            documents[doc_id] = content\n",
    "            content = \"\"\n",
    "            doc_id = \"\"\n",
    "        else:\n",
    "            content += a_line.strip()[3:] + \" \"\n",
    "    f.close()\n",
    "    return documents\n",
    "\n",
    "documents = read_documents()\n",
    "print(len(documents))\n",
    "print(documents.get(\"1\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, let's read in queries from the `CISI.QRY` file and store the result in `queries` data structure – set of tuples of query ids matched with contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "112\n",
      "What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? \n"
     ]
    }
   ],
   "source": [
    "def read_queries():\n",
    "    f = open(\"cisi/CISI.QRY\")\n",
    "    merged = \"\"\n",
    "    \n",
    "    for a_line in f.readlines():\n",
    "        if a_line.startswith(\".\"):\n",
    "            merged += \"\\n\" + a_line.strip()\n",
    "        else:\n",
    "            merged += \" \" + a_line.strip()\n",
    "    \n",
    "    queries = {}\n",
    "\n",
    "    content = \"\"\n",
    "    qry_id = \"\"\n",
    "\n",
    "    for a_line in merged.split(\"\\n\"):\n",
    "        if a_line.startswith(\".I\"):\n",
    "            if not content==\"\":\n",
    "                queries[qry_id] = content\n",
    "                content = \"\"\n",
    "                qry_id = \"\"\n",
    "            qry_id = a_line.split(\" \")[1].strip()\n",
    "        elif a_line.startswith(\".W\") or a_line.startswith(\".T\"):\n",
    "            content += a_line.strip()[3:] + \" \"\n",
    "    queries[qry_id] = content\n",
    "    f.close()\n",
    "    return queries\n",
    "\n",
    "queries = read_queries()\n",
    "print(len(queries))\n",
    "print(queries.get(\"1\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's read in the mapping between the queries and the documents – we'll keep these in the `mappings` data structure – with tuples where each query index (key) corresponds to the list of one or more document indices (value):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "76\n",
      "dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '37', '39', '41', '42', '43', '44', '45', '46', '49', '50', '52', '54', '55', '56', '57', '58', '61', '62', '65', '66', '67', '69', '71', '76', '79', '81', '82', '84', '90', '92', '95', '96', '97', '98', '99', '100', '101', '102', '104', '109', '111'])\n",
      "['28', '35', '38', '42', '43', '52', '65', '76', '86', '150', '189', '192', '193', '195', '215', '269', '291', '320', '429', '465', '466', '482', '483', '510', '524', '541', '576', '582', '589', '603', '650', '680', '711', '722', '726', '783', '813', '820', '868', '869', '894', '1162', '1164', '1195', '1196', '1281']\n"
     ]
    }
   ],
   "source": [
    "def read_mappings():\n",
    "    f = open(\"cisi/CISI.REL\")\n",
    "    \n",
    "    mappings = {}\n",
    "\n",
    "    for a_line in f.readlines():\n",
    "        voc = a_line.strip().split()\n",
    "        key = voc[0].strip()\n",
    "        current_value = voc[1].strip()\n",
    "        value = []\n",
    "        if key in mappings.keys():\n",
    "            value = mappings.get(key)\n",
    "        value.append(current_value)\n",
    "        mappings[key] = value\n",
    "\n",
    "    f.close()\n",
    "    return mappings\n",
    "\n",
    "mappings = read_mappings()\n",
    "print(len(mappings))\n",
    "print(mappings.keys())\n",
    "print(mappings.get(\"1\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A simple Boolean search algorithm\n",
    "\n",
    "First perform simple preprocessing as in the previous chapter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1460\n",
      "['18', 'editions', 'of', 'the', 'dewey', 'decimal', 'classifications', 'comaromi', ',', 'j.p.', 'the', 'present', 'study', 'is', 'a', 'history', 'of', 'the', 'dewey', 'decimal', 'classification', '.', 'the', 'first', 'edition', 'of', 'the', 'ddc', 'was', 'published', 'in', '1876', ',', 'the', 'eighteenth', 'edition', 'in', '1971', ',', 'and', 'future', 'editions', 'will', 'continue', 'to', 'appear', 'as', 'needed', '.', 'in', 'spite', 'of', 'the', 'ddc', \"'s\", 'long', 'and', 'healthy', 'life', ',', 'however', ',', 'its', 'full', 'story', 'has', 'never', 'been', 'told', '.', 'there', 'have', 'been', 'biographies', 'of', 'dewey', 'that', 'briefly', 'describe', 'his', 'system', ',', 'but', 'this', 'is', 'the', 'first', 'attempt', 'to', 'provide', 'a', 'detailed', 'history', 'of', 'the', 'work', 'that', 'more', 'than', 'any', 'other', 'has', 'spurred', 'the', 'growth', 'of', 'librarianship', 'in', 'this', 'country', 'and', 'abroad', '.']\n",
      "113\n",
      "112\n",
      "['what', 'problems', 'and', 'concerns', 'are', 'there', 'in', 'making', 'up', 'descriptive', 'titles', '?', 'what', 'difficulties', 'are', 'involved', 'in', 'automatically', 'retrieving', 'articles', 'from', 'approximate', 'titles', '?', 'what', 'is', 'the', 'usual', 'relevance', 'of', 'the', 'content', 'of', 'articles', 'to', 'their', 'titles', '?']\n",
      "38\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk import word_tokenize\n",
    "\n",
    "def get_words(text): \n",
    "    word_list = [word for word in word_tokenize(text.lower())]\n",
    "    return word_list\n",
    "\n",
    "doc_words = {}\n",
    "qry_words = {}\n",
    "for doc_id in documents.keys():\n",
    "    doc_words[doc_id] = get_words(documents.get(doc_id))\n",
    "for qry_id in queries.keys():\n",
    "    qry_words[qry_id] = get_words(queries.get(qry_id))\n",
    "\n",
    "print(len(doc_words))\n",
    "print(doc_words.get(\"1\"))\n",
    "print(len(doc_words.get(\"1\")))\n",
    "print(len(qry_words))\n",
    "print(qry_words.get(\"1\"))\n",
    "print(len(qry_words.get(\"1\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And next match in a Boolean way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100']\n",
      "1460\n"
     ]
    }
   ],
   "source": [
    "def retrieve_documents(doc_words, query):\n",
    "    docs = []\n",
    "    for doc_id in doc_words.keys():\n",
    "        found = False\n",
    "        i = 0\n",
    "        while i<len(query) and not found:\n",
    "            word = query[i]\n",
    "            if word in doc_words.get(doc_id):\n",
    "                docs.append(doc_id)\n",
    "                found=True\n",
    "            else:\n",
    "                i+=1\n",
    "    return docs\n",
    "\n",
    "docs = retrieve_documents(doc_words, qry_words.get(\"6\"))\n",
    "print(docs[:100])\n",
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: match the documents to the queries based on occurrence of *all* query words in the document:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[]\n",
      "0\n"
     ]
    }
   ],
   "source": [
    "def retrieve_documents(doc_words, query):\n",
    "    docs = []\n",
    "    for doc_id in doc_words.keys():\n",
    "        #here, you are interested in the documents that contain all words    \n",
    "        found = True    \n",
    "        i = 0\n",
    "        #iterate through words in the query\n",
    "        while i<len(query) and found:    \n",
    "            word = query[i]\n",
    "            if not word in doc_words.get(doc_id):\n",
    "                #if the word is not in document, turn found flag off and stop\n",
    "                found=False    \n",
    "            else:\n",
    "                #otherwise, move on to the next query word\n",
    "                i+=1    \n",
    "        #if all words are found in the document, the last index is len(query)-1\n",
    "        #add the doc_id only in this case\n",
    "        if i==len(query)-1:\n",
    "            docs.append(doc_id)\n",
    "    return docs\n",
    "\n",
    "docs = retrieve_documents(doc_words, qry_words.get(\"112\"))\n",
    "print(docs[:100])\n",
    "print(len(docs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact, it is a very rare case that you may have any single document that contains all the words from the query, therefore, with this approach, you will likely get no relevant documents returned for any queries in this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Preprocessing the data\n",
    "\n",
    "Apply the following steps (some as before):\n",
    "- tokenize the text\n",
    "- put to lowercase\n",
    "- remove stopwords\n",
    "- lemmatize\n",
    "\n",
    "Now apply these steps to both documents and queries. Let's start with stopwords removal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['18', 'editions', 'dewey', 'decimal', 'classifications', 'comaromi', 'j.p.', 'present', 'study', 'history', 'dewey', 'decimal', 'classification', 'first', 'edition', 'ddc', 'published', '1876', 'eighteenth', 'edition', '1971', 'future', 'editions', 'continue', 'appear', 'needed', 'spite', 'ddc', \"'s\", 'long', 'healthy', 'life', 'however', 'full', 'story', 'never', 'told', 'biographies', 'dewey', 'briefly', 'describe', 'system', 'first', 'attempt', 'provide', 'detailed', 'history', 'work', 'spurred', 'growth', 'librarianship', 'country', 'abroad']\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "import string\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "def process(text): \n",
    "    stoplist = set(stopwords.words('english'))\n",
    "    word_list = [word for word in word_tokenize(text.lower())\n",
    "                 if not word in stoplist and not word in string.punctuation]\n",
    "    return word_list\n",
    "\n",
    "word_list = process(documents.get(\"1\"))\n",
    "print(word_list)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lemmatization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['cost', 'analysis', 'simulation', 'procedure', 'evaluation', 'large', 'information', 'system', 'bourne', 'c.p', 'ford', 'd.f', 'computer', 'program', 'write', 'use', 'simulate', 'several-year', 'operation', 'information', 'system', 'compute', 'estimate', 'expected', 'operating', 'cost', 'well', 'amount', 'equipment', 'personnel', 'require', 'time', 'period', 'program', 'use', 'analysis', 'several', 'large', 'system', 'prove', 'useful', 'research', 'tool', 'study', 'system', 'many', 'component', 'interrelated', 'operation', 'equivalent', 'manual', 'analysis', 'would', 'extremely', 'cumbersome', 'time', 'consuming', 'perhaps', 'even', 'impractical', 'paper', 'describe', 'program', 'show', 'example', 'result', 'simulation', 'two', 'several', 'suggested', 'design', 'specific', 'information', 'system']\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "import string\n",
    "from nltk import word_tokenize, WordNetLemmatizer, pos_tag\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "def process(text): \n",
    "    stoplist = set(stopwords.words('english'))\n",
    "    lemmatizer = WordNetLemmatizer()\n",
    "    pos_list = pos_tag(word_tokenize(text.lower()))\n",
    "    word_list = [entry for entry in pos_list\n",
    "                 if not entry[0] in stoplist and not entry[0] in string.punctuation]\n",
    "    lemmatized_wl = []\n",
    "    for entry in word_list:\n",
    "        if entry[1].startswith(\"V\"):\n",
    "            lemmatized_wl.append(lemmatizer.lemmatize(entry[0], \"v\"))\n",
    "        else:\n",
    "            lemmatized_wl.append(lemmatizer.lemmatize(entry[0]))\n",
    "    return lemmatized_wl\n",
    "\n",
    "word_list = process(documents.get(\"27\"))\n",
    "print(word_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocessing with a stemmer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['cost', 'analys', 'sim', 'proc', 'evalu', 'larg', 'inform', 'system', 'bourn', 'c.p', 'ford', 'd.f', 'comput', 'program', 'writ', 'us', 'sim', 'several-year', 'op', 'inform', 'system', 'comput', 'estim', 'expect', 'op', 'cost', 'wel', 'amount', 'equip', 'personnel', 'requir', 'tim', 'period', 'program', 'us', 'analys', 'sev', 'larg', 'system', 'prov', 'us', 'research', 'tool', 'study', 'system', 'many', 'compon', 'interrel', 'op', 'equ', 'man', 'analys', 'would', 'extrem', 'cumbersom', 'tim', 'consum', 'perhap', 'ev', 'impract', 'pap', 'describ', 'program', 'show', 'exampl', 'result', 'sim', 'two', 'sev', 'suggest', 'design', 'spec', 'inform', 'system']\n",
      "['org', 'org', 'org', 'org', 'org', 'org']\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "import string\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.lancaster import LancasterStemmer\n",
    "\n",
    "def process(text): \n",
    "    stoplist = set(stopwords.words('english'))\n",
    "    st = LancasterStemmer()\n",
    "    word_list = [st.stem(word) for word in word_tokenize(text.lower())\n",
    "                 if not word in stoplist and not word in string.punctuation]\n",
    "    return word_list\n",
    "  \n",
    "word_list = process(documents.get(\"27\"))\n",
    "print(word_list)\n",
    "word_list = process(\"organize, organizing, organizational, organ, organic, organizer\")\n",
    "print(word_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Step 3: Term weighing\n",
    "\n",
    "First calculate the term frequency in each document:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1460\n",
      "{'18': 1, 'edit': 4, 'dewey': 3, 'decim': 2, 'class': 2, 'comarom': 1, 'j.p.': 1, 'pres': 1, 'study': 1, 'hist': 2, 'first': 2, 'ddc': 2, 'publ': 1, '1876': 1, 'eighteen': 1, '1971': 1, 'fut': 1, 'continu': 1, 'appear': 1, 'nee': 1, 'spit': 1, \"'s\": 1, 'long': 1, 'healthy': 1, 'lif': 1, 'howev': 1, 'ful': 1, 'story': 1, 'nev': 1, 'told': 1, 'biograph': 1, 'brief': 1, 'describ': 1, 'system': 1, 'attempt': 1, 'provid': 1, 'detail': 1, 'work': 1, 'spur': 1, 'grow': 1, 'libr': 1, 'country': 1, 'abroad': 1}\n",
      "43\n",
      "112\n",
      "{'problem': 1, 'concern': 1, 'mak': 1, 'describ': 1, 'titl': 3, 'difficul': 1, 'involv': 1, 'autom': 1, 'retriev': 1, 'artic': 2, 'approxim': 1, 'us': 1, 'relev': 1, 'cont': 1}\n",
      "14\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "import string\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem.lancaster import LancasterStemmer\n",
    "\n",
    "def get_terms(text): \n",
    "    stoplist = set(stopwords.words('english'))\n",
    "    terms = {}\n",
    "    st = LancasterStemmer()\n",
    "    word_list = [st.stem(word) for word in word_tokenize(text.lower())\n",
    "                 if not word in stoplist and not word in string.punctuation]\n",
    "    for word in word_list:\n",
    "        terms[word] = terms.get(word, 0) + 1\n",
    "    return terms\n",
    "\n",
    "doc_terms = {}\n",
    "qry_terms = {}\n",
    "for doc_id in documents.keys():\n",
    "    doc_terms[doc_id] = get_terms(documents.get(doc_id))\n",
    "for qry_id in queries.keys():\n",
    "    qry_terms[qry_id] = get_terms(queries.get(qry_id))\n",
    "\n",
    "print(len(doc_terms))\n",
    "print(doc_terms.get(\"1\"))\n",
    "print(len(doc_terms.get(\"1\")))\n",
    "print(len(qry_terms))\n",
    "print(qry_terms.get(\"1\"))\n",
    "print(len(qry_terms.get(\"1\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, collect shared vocabulary from all documents and queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7775\n",
      "[\"''\", \"'60\", \"'70\", \"'anyhow\", \"'apparent\", \"'basic\", \"'better\", \"'bibliograph\", \"'bibliometrics\", \"'building\"]\n"
     ]
    }
   ],
   "source": [
    "def collect_vocabulary():\n",
    "    all_terms = []\n",
    "    for doc_id in doc_terms.keys():\n",
    "        for term in doc_terms.get(doc_id).keys():            \n",
    "            all_terms.append(term)\n",
    "    for qry_id in qry_terms.keys():\n",
    "        for term in qry_terms.get(qry_id).keys():\n",
    "            all_terms.append(term)\n",
    "    return sorted(set(all_terms))\n",
    "\n",
    "all_terms = collect_vocabulary()\n",
    "print(len(all_terms))\n",
    "print(all_terms[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Represent each document and query as the counts in the shared space:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1460\n",
      "7775\n",
      "112\n",
      "7775\n"
     ]
    }
   ],
   "source": [
    "def vectorize(input_features, vocabulary):\n",
    "    output = {}\n",
    "    for item_id in input_features.keys():\n",
    "        features = input_features.get(item_id)\n",
    "        output_vector = []\n",
    "        for word in vocabulary:\n",
    "            if word in features.keys():\n",
    "                output_vector.append(int(features.get(word)))\n",
    "            else:\n",
    "                output_vector.append(0)\n",
    "        output[item_id] = output_vector\n",
    "    return output\n",
    "\n",
    "doc_vectors = vectorize(doc_terms, all_terms)\n",
    "qry_vectors = vectorize(qry_terms, all_terms)\n",
    "\n",
    "print(len(doc_vectors))\n",
    "print(len(doc_vectors.get(\"1460\")))\n",
    "#print(doc_vectors.get(\"1460\")[:1000])\n",
    "print(len(qry_vectors))\n",
    "print(len(qry_vectors.get(\"112\")))\n",
    "#print(qry_vectors.get(\"112\")[:1000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7775\n",
      "0.4287539560862571\n"
     ]
    }
   ],
   "source": [
    "import math\n",
    "\n",
    "def calculate_idfs(vocabulary, doc_features):\n",
    "    doc_idfs = {}\n",
    "    for term in vocabulary:\n",
    "        doc_count = 0\n",
    "        for doc_id in doc_features.keys():\n",
    "            terms = doc_features.get(doc_id)\n",
    "            if term in terms.keys():\n",
    "                doc_count += 1\n",
    "        doc_idfs[term] = math.log(float(len(doc_features.keys()))/float(1 + doc_count), 10)\n",
    "    return doc_idfs\n",
    "\n",
    "doc_idfs = calculate_idfs(all_terms, doc_terms)\n",
    "print(len(doc_idfs))\n",
    "print(doc_idfs.get(\"system\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1460\n",
      "7775\n"
     ]
    }
   ],
   "source": [
    "def vectorize_idf(input_terms, input_idfs, vocabulary):\n",
    "    output = {}\n",
    "    for item_id in input_terms.keys():\n",
    "        terms = input_terms.get(item_id)\n",
    "        output_vector = []\n",
    "        for term in vocabulary:\n",
    "            if term in terms.keys():\n",
    "                output_vector.append(input_idfs.get(term)*float(terms.get(term)))\n",
    "            else:\n",
    "                output_vector.append(float(0))\n",
    "        output[item_id] = output_vector\n",
    "    return output\n",
    "\n",
    "doc_vectors = vectorize_idf(doc_terms, doc_idfs, all_terms)\n",
    "\n",
    "print(len(doc_vectors))\n",
    "print(len(doc_vectors.get(\"1460\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Retrieval of the most similar documents\n",
    "\n",
    "Use cosine similarity, as before, on unfiltered texts:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9701425001453319\n"
     ]
    }
   ],
   "source": [
    "import math\n",
    "\n",
    "query = [1, 1]\n",
    "document = [3, 5]\n",
    "\n",
    "def length(vector):\n",
    "    sq_length = 0\n",
    "    for index in range(0, len(vector)):\n",
    "        sq_length += math.pow(vector[index], 2)\n",
    "    return math.sqrt(sq_length)\n",
    "    \n",
    "def dot_product(vector1, vector2):\n",
    "    if len(vector1)==len(vector2):\n",
    "        dot_prod = 0\n",
    "        for index in range(0, len(vector1)):\n",
    "            if not vector1[index]==0 and not vector2[index]==0:\n",
    "                dot_prod += vector1[index]*vector2[index]\n",
    "        return dot_prod\n",
    "    else:\n",
    "        return \"Unmatching dimensionality\"\n",
    "\n",
    "def calculate_cosine(query, document):\n",
    "    cosine =  dot_product(query, document) / (length(query) * length(document)) \n",
    "    return cosine\n",
    "\n",
    "cosine = calculate_cosine(query, document)\n",
    "print (cosine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get cosine similarity for some examples of a particular query and a particular document:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.21799825905375303\n"
     ]
    }
   ],
   "source": [
    "#document = doc_vectors.get(\"27\")\n",
    "#query = qry_vectors.get(\"15\")\n",
    "\n",
    "document = doc_vectors.get(\"60\")\n",
    "query = qry_vectors.get(\"3\")\n",
    "\n",
    "cosine =  dot_product(query, document) / (length(query) * length(document))     \n",
    "print(cosine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply the search algorithm to find relevant documents for a particular query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "469\n",
      "1179\n",
      "1181\n",
      "1142\n",
      "1190\n",
      "1116\n",
      "445\n",
      "85\n",
      "599\n",
      "60\n",
      "540\n",
      "640\n",
      "372\n",
      "1030\n",
      "1095\n",
      "965\n",
      "1161\n",
      "241\n",
      "1191\n",
      "899\n",
      "137\n",
      "535\n",
      "456\n",
      "803\n",
      "95\n",
      "544\n",
      "1077\n",
      "1111\n",
      "1103\n",
      "837\n",
      "560\n",
      "1133\n",
      "602\n",
      "166\n",
      "1080\n",
      "163\n",
      "686\n",
      "839\n",
      "1297\n",
      "1082\n",
      "1428\n",
      "1330\n",
      "1113\n",
      "110\n"
     ]
    }
   ],
   "source": [
    "from operator import itemgetter\n",
    "\n",
    "results = {}\n",
    "\n",
    "for doc_id in doc_vectors.keys():\n",
    "    document = doc_vectors.get(doc_id)\n",
    "    cosine = calculate_cosine(query, document)    \n",
    "    results[doc_id] = cosine\n",
    "\n",
    "for items in sorted(results.items(), key=itemgetter(1), reverse=True)[:44]:\n",
    "    print(items[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Evaluation\n",
    "\n",
    "Prefilter – only keep the documents that contain at least one word from the query – to speed search up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['5', '6', '10', '15', '16', '17', '21', '22', '25', '26', '27', '29', '30', '33', '38', '41', '42', '43', '45', '46', '47', '49', '51', '52', '56', '57', '58', '63', '64', '66', '68', '71', '74', '77', '78', '79', '80', '82', '87', '90', '91', '92', '95', '96', '97', '98', '101', '102', '104', '105', '106', '107', '109', '114', '116', '117', '122', '123', '124', '126', '129', '131', '132', '136', '140', '141', '142', '144', '150', '151', '155', '157', '158', '159', '160', '163', '168', '169', '175', '178', '179', '180', '181', '191', '194', '197', '206', '208', '211', '212', '214', '218', '220', '228', '229', '233', '237', '240', '241', '242']\n",
      "607\n"
     ]
    }
   ],
   "source": [
    "def prefilter(doc_terms, query):\n",
    "    docs = []\n",
    "    for doc_id in doc_terms.keys():\n",
    "        found = False\n",
    "        i = 0\n",
    "        while i<len(query.keys()) and not found:\n",
    "            term = list(query.keys())[i]\n",
    "            if term in doc_terms.get(doc_id).keys():\n",
    "                docs.append(doc_id)\n",
    "                found=True\n",
    "            else:\n",
    "                i+=1\n",
    "    return docs\n",
    "\n",
    "docs = prefilter(doc_terms, qry_terms.get(\"6\"))\n",
    "print(docs[:100])\n",
    "print(len(docs))\n",
    "\n",
    "prefiltered_docs = {}\n",
    "for query_id in mappings.keys():\n",
    "    prefiltered_docs[query_id] = prefilter(doc_terms, qry_terms.get(str(query_id)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Return the top-3 or top-10 results and evaluate in terms of precision:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1: 1.0\n",
      "2: 0.3333333333333333\n",
      "3: 1.0\n",
      "4: 0.0\n",
      "5: 0.0\n",
      "6: 0.0\n",
      "7: 0.0\n",
      "8: 0.0\n",
      "9: 0.3333333333333333\n",
      "10: 0.6666666666666666\n",
      "11: 0.3333333333333333\n",
      "12: 0.0\n",
      "13: 0.3333333333333333\n",
      "14: 0.0\n",
      "15: 0.0\n",
      "16: 0.0\n",
      "17: 0.0\n",
      "18: 0.0\n",
      "19: 0.0\n",
      "20: 0.3333333333333333\n",
      "21: 0.0\n",
      "22: 0.0\n",
      "23: 0.3333333333333333\n",
      "24: 1.0\n",
      "25: 0.0\n",
      "26: 0.6666666666666666\n",
      "27: 0.6666666666666666\n",
      "28: 0.6666666666666666\n",
      "29: 0.6666666666666666\n",
      "30: 1.0\n",
      "31: 0.3333333333333333\n",
      "32: 0.3333333333333333\n",
      "33: 0.0\n",
      "34: 0.6666666666666666\n",
      "35: 0.6666666666666666\n",
      "37: 0.3333333333333333\n",
      "39: 0.3333333333333333\n",
      "41: 0.3333333333333333\n",
      "42: 0.6666666666666666\n",
      "43: 0.0\n",
      "44: 0.3333333333333333\n",
      "45: 0.3333333333333333\n",
      "46: 0.6666666666666666\n",
      "49: 0.3333333333333333\n",
      "50: 0.6666666666666666\n",
      "52: 1.0\n",
      "54: 0.3333333333333333\n",
      "55: 1.0\n",
      "56: 0.6666666666666666\n",
      "57: 0.0\n",
      "58: 1.0\n",
      "61: 0.3333333333333333\n",
      "62: 1.0\n",
      "65: 0.6666666666666666\n",
      "66: 1.0\n",
      "67: 0.0\n",
      "69: 0.3333333333333333\n",
      "71: 0.0\n",
      "76: 1.0\n",
      "79: 0.3333333333333333\n",
      "81: 0.3333333333333333\n",
      "82: 0.3333333333333333\n",
      "84: 0.0\n",
      "90: 0.0\n",
      "92: 0.6666666666666666\n",
      "95: 0.6666666666666666\n",
      "96: 0.0\n",
      "97: 0.6666666666666666\n",
      "98: 1.0\n",
      "99: 0.3333333333333333\n",
      "100: 0.0\n",
      "101: 0.0\n",
      "102: 1.0\n",
      "104: 0.0\n",
      "109: 0.6666666666666666\n",
      "111: 1.0\n",
      "0.4035087719298246\n",
      "0.6578947368421053\n"
     ]
    }
   ],
   "source": [
    "def calculate_precision(model_output, gold_standard):\n",
    "    true_pos = 0\n",
    "    for item in model_output:\n",
    "        if item in gold_standard:\n",
    "            true_pos += 1\n",
    "    return float(true_pos)/float(len(model_output))\n",
    "\n",
    "def calculate_found(model_output, gold_standard):\n",
    "    found = 0\n",
    "    for item in model_output:\n",
    "        if item in gold_standard:\n",
    "            found = 1\n",
    "    return float(found)\n",
    "\n",
    "precision_all = 0.0\n",
    "found_all = 0.0\n",
    "for query_id in mappings.keys():\n",
    "    gold_standard = mappings.get(str(query_id))\n",
    "    query = qry_vectors.get(str(query_id))\n",
    "    results = {}\n",
    "    model_output = []\n",
    "    for doc_id in prefiltered_docs.get(str(query_id)):\n",
    "        document = doc_vectors.get(doc_id)\n",
    "        cosine = calculate_cosine(query, document)    \n",
    "        results[doc_id] = cosine\n",
    "    for items in sorted(results.items(), key=itemgetter(1), \n",
    "                        #reverse=True)[:min(10, len(gold_standard))]:\n",
    "                        #reverse=True)[:min(3, len(gold_standard))]:\n",
    "                        reverse=True)[:3]:\n",
    "        model_output.append(items[0])\n",
    "    precision = calculate_precision(model_output, gold_standard)\n",
    "    found = calculate_found(model_output, gold_standard)\n",
    "    print(f\"{str(query_id)}: {str(precision)}\")\n",
    "    precision_all += precision\n",
    "    found_all += found\n",
    "\n",
    "print(precision_all/float(len(mappings.keys())))\n",
    "print(found_all/float(len(mappings.keys())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1: 1.0\n",
      "2: 0.0\n",
      "3: 1.0\n",
      "4: 0.0\n",
      "5: 0.0\n",
      "6: 0.0\n",
      "7: 0.0\n",
      "8: 0.0\n",
      "9: 0.0\n",
      "10: 1.0\n",
      "11: 1.0\n",
      "12: 0.0\n",
      "13: 1.0\n",
      "14: 0.0\n",
      "15: 0.0\n",
      "16: 0.0\n",
      "17: 0.0\n",
      "18: 0.0\n",
      "19: 0.0\n",
      "20: 0.0\n",
      "21: 0.0\n",
      "22: 0.0\n",
      "23: 0.0\n",
      "24: 1.0\n",
      "25: 0.0\n",
      "26: 1.0\n",
      "27: 1.0\n",
      "28: 1.0\n",
      "29: 1.0\n",
      "30: 1.0\n",
      "31: 0.0\n",
      "32: 0.0\n",
      "33: 0.0\n",
      "34: 1.0\n",
      "35: 0.0\n",
      "37: 1.0\n",
      "39: 1.0\n",
      "41: 1.0\n",
      "42: 1.0\n",
      "43: 0.0\n",
      "44: 0.0\n",
      "45: 1.0\n",
      "46: 0.0\n",
      "49: 0.0\n",
      "50: 1.0\n",
      "52: 1.0\n",
      "54: 0.0\n",
      "55: 1.0\n",
      "56: 1.0\n",
      "57: 0.0\n",
      "58: 1.0\n",
      "61: 0.0\n",
      "62: 1.0\n",
      "65: 1.0\n",
      "66: 1.0\n",
      "67: 0.0\n",
      "69: 1.0\n",
      "71: 0.0\n",
      "76: 1.0\n",
      "79: 1.0\n",
      "81: 1.0\n",
      "82: 0.0\n",
      "84: 0.0\n",
      "90: 0.0\n",
      "92: 1.0\n",
      "95: 0.0\n",
      "96: 0.0\n",
      "97: 1.0\n",
      "98: 1.0\n",
      "99: 0.0\n",
      "100: 0.0\n",
      "101: 0.0\n",
      "102: 1.0\n",
      "104: 0.0\n",
      "109: 0.0\n",
      "111: 1.0\n",
      "0.4473684210526316\n"
     ]
    }
   ],
   "source": [
    "precision_all = 0.0\n",
    "for query_id in mappings.keys():\n",
    "    gold_standard = mappings.get(str(query_id))\n",
    "    query = qry_vectors.get(str(query_id))\n",
    "    result = \"\"\n",
    "    model_output = []\n",
    "    max_sim = 0.0\n",
    "    prefiltered_docs = prefilter(doc_terms, qry_terms.get(str(query_id)))\n",
    "    for doc_id in prefiltered_docs:\n",
    "        document = doc_vectors.get(doc_id)\n",
    "        cosine = calculate_cosine(query, document) \n",
    "        if cosine >= max_sim:\n",
    "            max_sim = cosine\n",
    "            result = doc_id\n",
    "    model_output.append(result)\n",
    "    precision = calculate_precision(model_output, gold_standard)\n",
    "    print(f\"{str(query_id)}: {str(precision)}\")\n",
    "    precision_all += precision\n",
    "\n",
    "print(precision_all/len(mappings.keys()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "MRR – rank of the first relevant entry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1: 1.0\n",
      "2: 0.3333333333333333\n",
      "3: 1.0\n",
      "4: 0.09090909090909091\n",
      "5: 0.14285714285714285\n",
      "6: 0.038461538461538464\n",
      "7: 0.043478260869565216\n",
      "8: 0.02857142857142857\n",
      "9: 0.5\n",
      "10: 1.0\n",
      "11: 1.0\n",
      "12: 0.1\n",
      "13: 1.0\n",
      "14: 0.011494252873563218\n",
      "15: 0.125\n",
      "16: 0.029411764705882353\n",
      "17: 0.25\n",
      "18: 0.25\n",
      "19: 0.25\n",
      "20: 0.5\n",
      "21: 0.05555555555555555\n",
      "22: 0.09090909090909091\n",
      "23: 0.5\n",
      "24: 1.0\n",
      "25: 0.1111111111111111\n",
      "26: 1.0\n",
      "27: 1.0\n",
      "28: 1.0\n",
      "29: 1.0\n",
      "30: 1.0\n",
      "31: 0.5\n",
      "32: 0.3333333333333333\n",
      "33: 0.05555555555555555\n",
      "34: 1.0\n",
      "35: 0.5\n",
      "37: 1.0\n",
      "39: 1.0\n",
      "41: 1.0\n",
      "42: 1.0\n",
      "43: 0.14285714285714285\n",
      "44: 0.5\n",
      "45: 1.0\n",
      "46: 0.5\n",
      "49: 0.3333333333333333\n",
      "50: 1.0\n",
      "52: 1.0\n",
      "54: 0.3333333333333333\n",
      "55: 1.0\n",
      "56: 1.0\n",
      "57: 0.09090909090909091\n",
      "58: 1.0\n",
      "61: 0.3333333333333333\n",
      "62: 1.0\n",
      "65: 1.0\n",
      "66: 1.0\n",
      "67: 0.25\n",
      "69: 1.0\n",
      "71: 0.25\n",
      "76: 1.0\n",
      "79: 1.0\n",
      "81: 1.0\n",
      "82: 0.5\n",
      "84: 0.05\n",
      "90: 0.2\n",
      "92: 1.0\n",
      "95: 0.5\n",
      "96: 0.08333333333333333\n",
      "97: 1.0\n",
      "98: 1.0\n",
      "99: 0.3333333333333333\n",
      "100: 0.1\n",
      "101: 0.020833333333333332\n",
      "102: 1.0\n",
      "104: 0.25\n",
      "109: 0.5\n",
      "111: 1.0\n",
      "0.5804111538527951\n"
     ]
    }
   ],
   "source": [
    "rank_all = 0.0\n",
    "for query_id in mappings.keys():\n",
    "    gold_standard = mappings.get(str(query_id))\n",
    "    query = qry_vectors.get(str(query_id))\n",
    "    results = {}\n",
    "    for doc_id in doc_vectors.keys():\n",
    "        document = doc_vectors.get(doc_id)\n",
    "        cosine = calculate_cosine(query, document)    \n",
    "        results[doc_id] = cosine\n",
    "    sorted_results = sorted(results.items(), key=itemgetter(1), reverse=True)\n",
    "    index = 0\n",
    "    found = False\n",
    "    while found==False:\n",
    "        item = sorted_results[index]\n",
    "        index += 1\n",
    "        if index==len(sorted_results):\n",
    "            found = True\n",
    "        if item[0] in gold_standard:\n",
    "            found = True\n",
    "            print(f\"{str(query_id)}: {str(float(1) / float(index))}\")\n",
    "            rank_all += float(1) / float(index)\n",
    "            \n",
    "            \n",
    "print(rank_all/float(len(mappings.keys())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}