1_Logistic_Regression.ipnyb

{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"1_Logistic_Regression.ipnyb","provenance":[],"collapsed_sections":[],"toc_visible":true,"mount_file_id":"1pqasoXaTck1VqtA5p78DR_tOKTbc5tbc","authorship_tag":"ABX9TyMOGzL/ZIYgadwB05ZqnO/N"},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"5R4mLaolzHm_"},"source":["# Imports"]},{"cell_type":"code","metadata":{"id":"qMiy8hf5y6Oh"},"source":["import pandas as pd\n","import numpy as np\n","\n","from sklearn.feature_extraction.text import CountVectorizer\n","from sklearn.model_selection import train_test_split\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.metrics import confusion_matrix, classification_report\n","from sklearn.preprocessing import LabelEncoder\n","\n","import matplotlib.pyplot as plt"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wsV9LpsB1kDd"},"source":["# Data"]},{"cell_type":"markdown","metadata":{"id":"KTtAL3_U6j0S"},"source":["For this example you need to download twitter data from Kaggle: https://www.kaggle.com/kazanova/sentiment140"]},{"cell_type":"markdown","metadata":{"id":"wjfc1e9d80OA"},"source":["Context \\\\\n","\\\n","This is the sentiment140 dataset. It contains 1,600,000 tweets extracted using the twitter api . The tweets have been annotated (0 = negative, 4 = positive) and they can be used to detect sentiment. \\\\\n","\\\n","Content \\\\\n","\\\n","It contains the following 6 fields:\n","* target: the polarity of the tweet (0 = negative, 4 = positive)\n","\n","* ids: The id of the tweet ( 2087)\n","\n","* date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)\n","\n","* flag: The query (lyx). If there is no query, then this value is NO_QUERY.\n","\n","* user: the user that tweeted (robotickilldozr)\n","\n","* text: the text of the tweet (Lyx is cool)\n","\n","Acknowledgements \\\\\n","The official link regarding the dataset with resources about how it was generated is [here](http://%20http//help.sentiment140.com/for-students/) \\\\\n","The official paper detailing the approach is [here](http://bhttp//cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf)\n","\n","Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12."]},{"cell_type":"code","metadata":{"id":"FzXp3Gua1lxV"},"source":["df = pd.read_csv('/content/drive/My Drive/YouTube/botters_2020-10/twitter_sentiment_course/twitter_data/training.1600000.processed.noemoticon.csv',\n","                 encoding='ISO-8859-1', \n","                 names=[\n","                        'target',\n","                        'id',\n","                        'date',\n","                        'flag',\n","                        'user',\n","                        'text'\n","                        ])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"XhsfgM9o12-L","executionInfo":{"status":"ok","timestamp":1602555856080,"user_tz":-180,"elapsed":9115,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"b6454407-c903-45d9-f00f-9a3293be7cbe","colab":{"base_uri":"https://localhost:8080/","height":204}},"source":["df.head()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>target</th>\n","      <th>id</th>\n","      <th>date</th>\n","      <th>flag</th>\n","      <th>user</th>\n","      <th>text</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0</td>\n","      <td>1467810369</td>\n","      <td>Mon Apr 06 22:19:45 PDT 2009</td>\n","      <td>NO_QUERY</td>\n","      <td>_TheSpecialOne_</td>\n","      <td>@switchfoot http://twitpic.com/2y1zl - Awww, t...</td>\n","    </tr>\n","    <tr>\n","      <th>1</th>\n","      <td>0</td>\n","      <td>1467810672</td>\n","      <td>Mon Apr 06 22:19:49 PDT 2009</td>\n","      <td>NO_QUERY</td>\n","      <td>scotthamilton</td>\n","      <td>is upset that he can't update his Facebook by ...</td>\n","    </tr>\n","    <tr>\n","      <th>2</th>\n","      <td>0</td>\n","      <td>1467810917</td>\n","      <td>Mon Apr 06 22:19:53 PDT 2009</td>\n","      <td>NO_QUERY</td>\n","      <td>mattycus</td>\n","      <td>@Kenichan I dived many times for the ball. Man...</td>\n","    </tr>\n","    <tr>\n","      <th>3</th>\n","      <td>0</td>\n","      <td>1467811184</td>\n","      <td>Mon Apr 06 22:19:57 PDT 2009</td>\n","      <td>NO_QUERY</td>\n","      <td>ElleCTF</td>\n","      <td>my whole body feels itchy and like its on fire</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>0</td>\n","      <td>1467811193</td>\n","      <td>Mon Apr 06 22:19:57 PDT 2009</td>\n","      <td>NO_QUERY</td>\n","      <td>Karoli</td>\n","      <td>@nationwideclass no, it's not behaving at all....</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["   target  ...                                               text\n","0       0  ...  @switchfoot http://twitpic.com/2y1zl - Awww, t...\n","1       0  ...  is upset that he can't update his Facebook by ...\n","2       0  ...  @Kenichan I dived many times for the ball. Man...\n","3       0  ...    my whole body feels itchy and like its on fire \n","4       0  ...  @nationwideclass no, it's not behaving at all....\n","\n","[5 rows x 6 columns]"]},"metadata":{"tags":[]},"execution_count":4}]},{"cell_type":"markdown","metadata":{"id":"gnHNDF2H8qo5"},"source":["We have two classes in the dataset"]},{"cell_type":"code","metadata":{"id":"GrL29pcJ6IWE","executionInfo":{"status":"ok","timestamp":1602555856081,"user_tz":-180,"elapsed":9101,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"2a2de589-523d-4f0b-dd1e-d3da755d3c30","colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["df.target.unique()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([0, 4])"]},"metadata":{"tags":[]},"execution_count":5}]},{"cell_type":"markdown","metadata":{"id":"ZMQPnCUZdLw_"},"source":["Let's check how equally distributed those classes are."]},{"cell_type":"code","metadata":{"id":"xqUoAbO9awuD","executionInfo":{"status":"ok","timestamp":1602555857203,"user_tz":-180,"elapsed":10214,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"ee669403-8e4a-44a1-ddeb-ce936d104260","colab":{"base_uri":"https://localhost:8080/","height":265}},"source":["classes = df.target.unique()\n","counts = []\n","\n","for i in classes:\n","  count = len(df[df.target==i])\n","  counts.append(count)\n","\n","plt.bar(['negative', 'positive'], counts)\n","plt.show()"],"execution_count":null,"outputs":[{"output_type":"display_data","data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAYkAAAD4CAYAAAAZ1BptAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAXaElEQVR4nO3df7Bc5X3f8fcnyNgYByTgRoMlHFFbjYuZGsMdkOtMprEaIUgmYhpMoXakUI1VDziNQzOx3OkMiQkZPMmUhImtRDUqInWMZRoPqissq7JpG88IuNiEnybcgImk4ceNJCAOsR2cb//YR/FyvefeFYa9svR+zezsc77nOec5q1nt554fuydVhSRJg/zIXG+AJOnwZUhIkjoZEpKkToaEJKmTISFJ6jRvrjfglXbKKafUkiVL5nozJOmHyj333PPXVTU2vX7EhcSSJUuYmJiY682QpB8qSZ4YVPdwkySpkyEhSepkSEiSOhkSkqROhoQkqZMhIUnqNFRIJPnVJA8meSDJp5O8LsnpSe5MMpnkM0mObX1f26Yn2/wlfev5SKs/kuT8vvrKVptMsr6vPnAMSdJozBoSSRYB/wEYr6ozgWOAS4GPAddX1VuAA8Datsha4ECrX9/6keSMttzbgJXAJ5Ick+QY4OPABcAZwGWtLzOMIUkagWEPN80DjksyD3g98CTwbuDWNn8zcFFrr2rTtPnLk6TVb6mqb1fV48AkcG57TFbVY1X1HeAWYFVbpmsMSdIIzPqN66ram+R3gb8C/g74InAP8GxVvdi67QEWtfYiYHdb9sUkzwEnt/quvlX3L7N7Wv28tkzXGC+RZB2wDuBNb3rTbC+p05L1/+tlL6sj2zeu+9m53gTA96i6vVrv0WEONy2gtxdwOvBG4Hh6h4sOG1W1sarGq2p8bOz7fnpEkvQyDXO46V8Bj1fVVFX9PfCnwLuA+e3wE8BiYG9r7wVOA2jzTwT29denLdNV3zfDGJKkERgmJP4KWJbk9e08wXLgIeDLwMWtzxrgttbe2qZp879UvRtpbwUubVc/nQ4sBe4C7gaWtiuZjqV3cntrW6ZrDEnSCMwaElV1J72Tx18F7m/LbAQ+DFyVZJLe+YMb2yI3Aie3+lXA+raeB4Et9ALmC8CVVfXdds7hg8B24GFgS+vLDGNIkkZgqJ8Kr6qrgaunlR+jd2XS9L7fAt7TsZ5rgWsH1LcB2wbUB44hSRoNv3EtSepkSEiSOhkSkqROhoQkqZMhIUnqZEhIkjoZEpKkToaEJKmTISFJ6mRISJI6GRKSpE6GhCSpkyEhSepkSEiSOhkSkqROhoQkqdOsIZHkJ5Lc2/d4PsmHkpyUZEeSR9vzgtY/SW5IMpnkviRn961rTev/aJI1ffVzktzflrmh3SaVrjEkSaMxzO1LH6mqs6rqLOAc4AXgc/RuS7qzqpYCO9s0wAX07l+9FFgHbIDeBz69u9udR+9uc1f3fehvAN7ft9zKVu8aQ5I0Aod6uGk58JdV9QSwCtjc6puBi1p7FXBz9ewC5ic5FTgf2FFV+6vqALADWNnmnVBVu6qqgJunrWvQGJKkETjUkLgU+HRrL6yqJ1v7KWBhay8Cdvcts6fVZqrvGVCfaQxJ0ggMHRJJjgV+Hvjs9HltD6Bewe36PjONkWRdkokkE1NTU6/mZkjSUeVQ9iQuAL5aVU+36afboSLa8zOtvhc4rW+5xa02U33xgPpMY7xEVW2sqvGqGh8bGzuElyRJmsmhhMRlfO9QE8BW4OAVSmuA2/rqq9tVTsuA59oho+3AiiQL2gnrFcD2Nu/5JMvaVU2rp61r0BiSpBGYN0ynJMcDPwP8+77ydcCWJGuBJ4BLWn0bcCEwSe9KqMsBqmp/kmuAu1u/j1bV/ta+ArgJOA64vT1mGkOSNAJDhURV/S1w8rTaPnpXO03vW8CVHevZBGwaUJ8AzhxQHziGJGk0/Ma1JKmTISFJ6mRISJI6GRKSpE6GhCSpkyEhSepkSEiSOhkSkqROhoQkqZMhIUnqZEhIkjoZEpKkToaEJKmTISFJ6mRISJI6GRKSpE6GhCSp01AhkWR+kluTfD3Jw0nemeSkJDuSPNqeF7S+SXJDkskk9yU5u289a1r/R5Os6aufk+T+tswN7V7XdI0hSRqNYfckfh/4QlW9FXg78DCwHthZVUuBnW0a4AJgaXusAzZA7wMfuBo4DzgXuLrvQ38D8P6+5Va2etcYkqQRmDUkkpwI/BRwI0BVfaeqngVWAZtbt83ARa29Cri5enYB85OcCpwP7Kiq/VV1ANgBrGzzTqiqXe3+2DdPW9egMSRJIzDMnsTpwBTw35J8LcknkxwPLKyqJ1ufp4CFrb0I2N23/J5Wm6m+Z0CdGcZ4iSTrkkwkmZiamhriJUmShjFMSMwDzgY2VNU7gL9l2mGftgdQr/zmDTdGVW2sqvGqGh8bG3s1N0OSjirDhMQeYE9V3dmmb6UXGk+3Q0W052fa/L3AaX3LL261meqLB9SZYQxJ0gjMGhJV9RSwO8lPtNJy4CFgK3DwCqU1wG2tvRVY3a5yWgY81w4ZbQdWJFnQTlivALa3ec8nWdaualo9bV2DxpAkjcC8Ifv9MvCpJMcCjwGX0wuYLUnWAk8Al7S+24ALgUnghdaXqtqf5Brg7tbvo1W1v7WvAG4CjgNubw+A6zrGkCSNwFAhUVX3AuMDZi0f0LeAKzvWswnYNKA+AZw5oL5v0BiSpNHwG9eSpE6GhCSpkyEhSepkSEiSOhkSkqROhoQkqZMhIUnqZEhIkjoZEpKkToaEJKmTISFJ6mRISJI6GRKSpE6GhCSpkyEhSepkSEiSOhkSkqROQ4VEkm8kuT/JvUkmWu2kJDuSPNqeF7R6ktyQZDLJfUnO7lvPmtb/0SRr+urntPVPtmUz0xiSpNE4lD2Jn66qs6rq4G1M1wM7q2opsLNNA1wALG2PdcAG6H3gA1cD5wHnAlf3fehvAN7ft9zKWcaQJI3AD3K4aRWwubU3Axf11W+unl3A/CSnAucDO6pqf1UdAHYAK9u8E6pqV7s/9s3T1jVoDEnSCAwbEgV8Mck9Sda12sKqerK1nwIWtvYiYHffsntabab6ngH1mcZ4iSTrkkwkmZiamhryJUmSZjNvyH4/WVV7k/wYsCPJ1/tnVlUlqVd+84Ybo6o2AhsBxsfHX9XtkKSjyVB7ElW1tz0/A3yO3jmFp9uhItrzM637XuC0vsUXt9pM9cUD6swwhiRpBGYNiSTHJ/nRg21gBfAAsBU4eIXSGuC21t4KrG5XOS0DnmuHjLYDK5IsaCesVwDb27znkyxrVzWtnrauQWNIkkZgmMNNC4HPtatS5wF/UlVfSHI3sCXJWuAJ4JLWfxtwITAJvABcDlBV+5NcA9zd+n20qva39hXATcBxwO3tAXBdxxiSpBGYNSSq6jHg7QPq+4DlA+oFXNmxrk3ApgH1CeDMYceQJI2G37iWJHUyJCRJnQwJSVInQ0KS1MmQkCR1MiQkSZ0MCUlSJ0NCktTJkJAkdTIkJEmdDAlJUidDQpLUyZCQJHUyJCRJnQwJSVInQ0KS1MmQkCR1GjokkhyT5GtJPt+mT09yZ5LJJJ9Jcmyrv7ZNT7b5S/rW8ZFWfyTJ+X31la02mWR9X33gGJKk0TiUPYlfAR7um/4YcH1VvQU4AKxt9bXAgVa/vvUjyRnApcDbgJXAJ1rwHAN8HLgAOAO4rPWdaQxJ0ggMFRJJFgM/C3yyTQd4N3Br67IZuKi1V7Vp2vzlrf8q4Jaq+nZVPQ5MAue2x2RVPVZV3wFuAVbNMoYkaQSG3ZP4PeDXgX9o0ycDz1bVi216D7CotRcBuwHa/Oda/3+sT1umqz7TGC+RZF2SiSQTU1NTQ74kSdJsZg2JJD8HPFNV94xge16WqtpYVeNVNT42NjbXmyNJR4x5Q/R5F/DzSS4EXgecAPw+MD/JvPaX/mJgb+u/FzgN2JNkHnAisK+vflD/MoPq+2YYQ5I0ArPuSVTVR6pqcVUtoXfi+UtV9V7gy8DFrdsa4LbW3tqmafO/VFXV6pe2q59OB5YCdwF3A0vblUzHtjG2tmW6xpAkjcAP8j2JDwNXJZmkd/7gxla/ETi51a8C1gNU1YPAFuAh4AvAlVX13baX8EFgO72rp7a0vjONIUkagWEON/2jqroDuKO1H6N3ZdL0Pt8C3tOx/LXAtQPq24BtA+oDx5AkjYbfuJYkdTIkJEmdDAlJUidDQpLUyZCQJHUyJCRJnQwJSVInQ0KS1MmQkCR1MiQkSZ0MCUlSJ0NCktTJkJAkdTIkJEmdDAlJUidDQpLUyZCQJHWaNSSSvC7JXUn+PMmDSX6z1U9PcmeSySSfafenpt3D+jOtfmeSJX3r+kirP5Lk/L76ylabTLK+rz5wDEnSaAyzJ/Ft4N1V9XbgLGBlkmXAx4Drq+otwAFgbeu/FjjQ6te3fiQ5A7gUeBuwEvhEkmOSHAN8HLgAOAO4rPVlhjEkSSMwa0hUzzfb5Gvao4B3A7e2+mbgotZe1aZp85cnSavfUlXfrqrHgUl6968+F5isqseq6jvALcCqtkzXGJKkERjqnET7i/9e4BlgB/CXwLNV9WLrsgdY1NqLgN0Abf5zwMn99WnLdNVPnmGM6du3LslEkompqalhXpIkaQhDhURVfbeqzgIW0/vL/62v6lYdoqraWFXjVTU+NjY215sjSUeMQ7q6qaqeBb4MvBOYn2Rem7UY2Nvae4HTANr8E4F9/fVpy3TV980whiRpBIa5umksyfzWPg74GeBhemFxceu2Brittbe2adr8L1VVtfql7eqn04GlwF3A3cDSdiXTsfRObm9ty3SNIUkagXmzd+FUYHO7CulHgC1V9fkkDwG3JPkt4GvAja3/jcAfJ5kE9tP70KeqHkyyBXgIeBG4sqq+C5Dkg8B24BhgU1U92Nb14Y4xJEkjMGtIVNV9wDsG1B+jd35iev1bwHs61nUtcO2A+jZg27BjSJJGw29cS5I6GRKSpE6GhCSpkyEhSepkSEiSOhkSkqROhoQkqZMhIUnqZEhIkjoZEpKkToaEJKmTISFJ6mRISJI6GRKSpE6GhCSpkyEhSepkSEiSOg1zj+vTknw5yUNJHkzyK61+UpIdSR5tzwtaPUluSDKZ5L4kZ/eta03r/2iSNX31c5Lc35a5IUlmGkOSNBrD7Em8CPzHqjoDWAZcmeQMYD2ws6qWAjvbNMAFwNL2WAdsgN4HPnA1cB69W5Je3fehvwF4f99yK1u9awxJ0gjMGhJV9WRVfbW1/wZ4GFgErAI2t26bgYtaexVwc/XsAuYnORU4H9hRVfur6gCwA1jZ5p1QVbuqqoCbp61r0BiSpBE4pHMSSZYA7wDuBBZW1ZNt1lPAwtZeBOzuW2xPq81U3zOgzgxjTN+udUkmkkxMTU0dykuSJM1g6JBI8gbgfwAfqqrn++e1PYB6hbftJWYao6o2VtV4VY2PjY29mpshSUeVoUIiyWvoBcSnqupPW/npdqiI9vxMq+8FTutbfHGrzVRfPKA+0xiSpBEY5uqmADcCD1fVf+mbtRU4eIXSGuC2vvrqdpXTMuC5dshoO7AiyYJ2wnoFsL3Nez7JsjbW6mnrGjSGJGkE5g3R513ALwL3J7m31f4TcB2wJcla4AngkjZvG3AhMAm8AFwOUFX7k1wD3N36fbSq9rf2FcBNwHHA7e3BDGNIkkZg1pCoqj8D0jF7+YD+BVzZsa5NwKYB9QngzAH1fYPGkCSNht+4liR1MiQkSZ0MCUlSJ0NCktTJkJAkdTIkJEmdDAlJUidDQpLUyZCQJHUyJCRJnQwJSVInQ0KS1MmQkCR1MiQkSZ0MCUlSJ0NCktTJkJAkdRrmHtebkjyT5IG+2klJdiR5tD0vaPUkuSHJZJL7kpzdt8ya1v/RJGv66uckub8tc0O7z3XnGJKk0RlmT+ImYOW02npgZ1UtBXa2aYALgKXtsQ7YAL0PfOBq4DzgXODqvg/9DcD7+5ZbOcsYkqQRmTUkqur/AvunlVcBm1t7M3BRX/3m6tkFzE9yKnA+sKOq9lfVAWAHsLLNO6GqdrV7Y988bV2DxpAkjcjLPSexsKqebO2ngIWtvQjY3ddvT6vNVN8zoD7TGN8nybokE0kmpqamXsbLkSQN8gOfuG57APUKbMvLHqOqNlbVeFWNj42NvZqbIklHlZcbEk+3Q0W052dafS9wWl+/xa02U33xgPpMY0iSRuTlhsRW4OAVSmuA2/rqq9tVTsuA59oho+3AiiQL2gnrFcD2Nu/5JMvaVU2rp61r0BiSpBGZN1uHJJ8G/iVwSpI99K5Sug7YkmQt8ARwSeu+DbgQmAReAC4HqKr9Sa4B7m79PlpVB0+GX0HvCqrjgNvbgxnGkCSNyKwhUVWXdcxaPqBvAVd2rGcTsGlAfQI4c0B936AxJEmj4zeuJUmdDAlJUidDQpLUyZCQJHUyJCRJnQwJSVInQ0KS1MmQkCR1MiQkSZ0MCUlSJ0NCktTJkJAkdTIkJEmdDAlJUidDQpLUyZCQJHUyJCRJnQ77kEiyMskjSSaTrJ/r7ZGko8lhHRJJjgE+DlwAnAFcluSMud0qSTp6HNYhAZwLTFbVY1X1HeAWYNUcb5MkHTXmzfUGzGIRsLtveg9w3vROSdYB69rkN5M8MoJtOxqcAvz1XG/E4SAfm+stUAffo80r8B798UHFwz0khlJVG4GNc70dR5okE1U1PtfbIXXxPfrqO9wPN+0FTuubXtxqkqQRONxD4m5gaZLTkxwLXApsneNtkqSjxmF9uKmqXkzyQWA7cAywqaoenOPNOpp4CE+HO9+jr7JU1VxvgyTpMHW4H26SJM0hQ0KS1MmQ0FCSzE9yRd/0G5PcOpfbpKNXkg8kWd3av5TkjX3zPukvM7xyPCehoSRZAny+qs6c402RXiLJHcCvVdXEXG/Lkcg9iSNEkiVJHk7yX5M8mOSLSY5L8uYkX0hyT5L/l+Strf+bk+xKcn+S30ryzVZ/Q5KdSb7a5h38GZTrgDcnuTfJ77TxHmjL7Erytr5tuSPJeJLjk2xKcleSr/WtS0ex9t75epJPtffsrUlen2R5e5/c3943r239r0vyUJL7kvxuq/1Gkl9LcjEwDnyqvTeP63v/fSDJ7/SN+0tJ/qC139fel/cm+aP2O3EapKp8HAEPYAnwInBWm94CvA/YCSxttfOAL7X254HLWvsDwDdbex5wQmufAkwCaet/YNp4D7T2rwK/2dqnAo+09m8D72vt+cBfAMfP9b+Vj8PivVrAu9r0JuA/0/sJnn/aajcDHwJOBh7he0c95rfn36C39wBwBzDet/476AXHGL3ffjtYvx34SeCfAf8TeE2rfwJYPdf/Lofrwz2JI8vjVXVva99D7z/jvwA+m+Re4I/ofYgDvBP4bGv/Sd86Avx2kvuA/03v97MWzjLuFuDi1r4EOHiuYgWwvo19B/A64E2H/Kp0JNpdVV9p7f8OLKf3/v2LVtsM/BTwHPAt4MYk/xp4YdgBqmoKeCzJsiQnA28FvtLGOge4u703lwP/5BV4TUekw/rLdDpk3+5rf5feh/uzVXXWIazjvfT+Ajunqv4+yTfofbh3qqq9SfYl+efAv6G3ZwK9wPmFqvIHFzXd9JOhz9Lba3hpp94Xas+l90F+MfBB4N2HMM4t9P5w+TrwuaqqJAE2V9VHXtaWH2XckziyPQ88nuQ9AOl5e5u3C/iF1r60b5kTgWdaQPw03/tlyL8BfnSGsT4D/DpwYlXd12rbgV9u/ylJ8o4f9AXpiPGmJO9s7X8LTABLkryl1X4R+D9J3kDvPbWN3mHNt3//qmZ8b36O3u0FLqMXGNA7BHtxkh8DSHJSkoG/gCpD4mjwXmBtkj8HHuR79+P4EHBVO6z0Fnq79QCfAsaT3A+spvcXGFW1D/hKkgf6Twb2uZVe2Gzpq10DvAa4L8mDbVqC3nmGK5M8DCwArgcup3do9H7gH4A/pPfh//n2Pv0z4KoB67oJ+MODJ677Z1TVAeBh4Mer6q5We4jeOZAvtvXu4HuHYTWNl8AepZK8Hvi7tvt9Kb2T2F59pFedl1P/cPGcxNHrHOAP2qGgZ4F/N8fbI+kw5J6EJKmT5yQkSZ0MCUlSJ0NCktTJkJAkdTIkJEmd/j8NRZZn3nav5gAAAABJRU5ErkJggg==\n","text/plain":["<Figure size 432x288 with 1 Axes>"]},"metadata":{"tags":[],"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"_Z8YvAWgdQsK"},"source":["Even class distribution helps us a lot in text classifiction. Imagine a situation where 95% of data is in one class and the rest 5% is split among other 5 classes. If we wouldn't do anything about it model would just learn to guess the 95% class all the time and would be correct 95% of the time on the data we would use."]},{"cell_type":"markdown","metadata":{"id":"5wopVEhDOY1F"},"source":["# Vectorize\n","What we need to do now is to split the data into training and testing datasets and vectorize (essentialy turning text into number vectors) the text. "]},{"cell_type":"code","metadata":{"id":"_ppqSW1N7mIt"},"source":["x = df.text.values\n","y = df.target.values\n","x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=32)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"u6mjX21TOoX-"},"source":["Vectorizing"]},{"cell_type":"code","metadata":{"id":"A1QGfFgQ-Rpy"},"source":["vectorizer = CountVectorizer()\n","vectorizer.fit(x_train)\n","\n","X_train = vectorizer.transform(x_train)\n","X_test = vectorizer.transform(x_test)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"LtJV4JBUOsyc"},"source":["Our vectorized dataset will consist of index of each word that is used in training dataset. We can check how it looks by simply printing the first tweet both as it was and after vectorizing it."]},{"cell_type":"code","metadata":{"id":"iUlYC1mkRuMt","executionInfo":{"status":"ok","timestamp":1602555897103,"user_tz":-180,"elapsed":50096,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"20f76126-5142-4b5d-d5b4-50cbd5e2afcf","colab":{"base_uri":"https://localhost:8080/","height":442}},"source":["print(x_train[0], '\\n', X_train[0])"],"execution_count":null,"outputs":[{"output_type":"stream","text":["It was rainy and cloudy in the Windy City today &amp; WF customers had some serious SAD issues! I'm with them, when is summer coming?  \n","   (0, 56126)\t1\n","  (0, 57443)\t1\n","  (0, 126950)\t1\n","  (0, 129399)\t1\n","  (0, 132544)\t1\n","  (0, 142868)\t1\n","  (0, 226851)\t1\n","  (0, 251981)\t1\n","  (0, 256885)\t1\n","  (0, 257700)\t1\n","  (0, 257841)\t1\n","  (0, 433424)\t1\n","  (0, 455284)\t1\n","  (0, 467402)\t1\n","  (0, 486023)\t1\n","  (0, 501487)\t1\n","  (0, 517082)\t1\n","  (0, 519964)\t1\n","  (0, 528707)\t1\n","  (0, 558260)\t1\n","  (0, 561897)\t1\n","  (0, 562752)\t1\n","  (0, 565913)\t1\n","  (0, 566694)\t1\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"gu4yQEw9TnQp"},"source":["Now the sparse matrix you see corresponds to index of the word and count of it in the tweet. Keep in mind it's not in the same order as in the tweet. You can check the corresponding values using vocabulary of the vectorizer."]},{"cell_type":"code","metadata":{"id":"NATpPyWzO7gD","executionInfo":{"status":"ok","timestamp":1602555897104,"user_tz":-180,"elapsed":50088,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"f6d689ab-54b7-420a-b95a-6b61befa62b6","colab":{"base_uri":"https://localhost:8080/","height":425}},"source":["import re\n","\n","# delimiters https://stackoverflow.com/questions/35221535/python-removing-delimiters-from-strings\n","d = \",.!?/&-:;@'...\"\n","\"[\"+\"\\\\\".join(d)+\"]\"\n","\n","# clean up the string\n","s = x_train[0]\n","s = ' '.join(w for w in re.split(\"[\"+\"\\\\\".join(d)+\"]\", s) if w)\n","\n","for i in s.split():\n","  if len(i)>1: print(i, vectorizer.vocabulary_[i.lower()])"],"execution_count":null,"outputs":[{"output_type":"stream","text":["It 257841\n","was 558260\n","rainy 433424\n","and 57443\n","cloudy 129399\n","in 251981\n","the 517082\n","Windy 565913\n","City 126950\n","today 528707\n","amp 56126\n","WF 561897\n","customers 142868\n","had 226851\n","some 486023\n","serious 467402\n","SAD 455284\n","issues 257700\n","with 566694\n","them 519964\n","when 562752\n","is 256885\n","summer 501487\n","coming 132544\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"jiGcqcLLPnCq"},"source":["# Modelling\n","The model that we use here is a simple [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression). \\\\\n","It's the simpliest model (to my knowledge) to start with text classification and use it's accuracy as a base measure."]},{"cell_type":"code","metadata":{"id":"SRc8JH-S-fBl","executionInfo":{"status":"ok","timestamp":1602557332462,"user_tz":-180,"elapsed":615,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"0e988fb1-0cfc-4d28-d648-3233dfd68177","colab":{"base_uri":"https://localhost:8080/","height":170}},"source":["classifier = LogisticRegression(max_iter=1000)\n","classifier.fit(X_train, y_train)\n","\n","score = classifier.score(X_test, y_test)\n","\n","print(\"Accuracy:\", score)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Accuracy: 0.8003125\n"],"name":"stdout"},{"output_type":"stream","text":["/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n","STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n","\n","Increase the number of iterations (max_iter) or scale the data as shown in:\n","    https://scikit-learn.org/stable/modules/preprocessing.html\n","Please also refer to the documentation for alternative solver options:\n","    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n","  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"],"name":"stderr"}]},{"cell_type":"markdown","metadata":{"id":"9OdXHctIZmeI"},"source":["So we got an 80% accuracy measure which is great considering we only used like 5mins of our time to make this."]},{"cell_type":"markdown","metadata":{"id":"61h6y03n-6n5"},"source":["#Confussion matrix\n","Before jumping into other models it's always good to check how our model performs in different classes. For that we can use [confussion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).\n","\n","It essentially shows how many times model predicted which class and how many of those times the class it slected was the right one. Here columns are true values and lines are predicted values."]},{"cell_type":"code","metadata":{"id":"y8bi8AFT-qlU","executionInfo":{"status":"ok","timestamp":1602557332471,"user_tz":-180,"elapsed":128,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"4313d329-77a2-4ef9-8c9b-e55bde380d9e","colab":{"base_uri":"https://localhost:8080/","height":111}},"source":["y_pred = classifier.predict(X_test)\n","cm = confusion_matrix(y_test, y_pred, labels=df.target.unique())\n","df_cm = pd.DataFrame(cm, index=df.target.unique(), columns=df.target.unique())\n","df_cm"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>0</th>\n","      <th>4</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>126492</td>\n","      <td>33829</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>30071</td>\n","      <td>129608</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["        0       4\n","0  126492   33829\n","4   30071  129608"]},"metadata":{"tags":[]},"execution_count":12}]},{"cell_type":"markdown","metadata":{"id":"6Z25Nn0yaOcu"},"source":["For a better understanding you could use percentage expression."]},{"cell_type":"code","metadata":{"id":"kzAdUYAW_F3X","executionInfo":{"status":"ok","timestamp":1602557332480,"user_tz":-180,"elapsed":117,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"53b027c8-1c2f-42ce-c6fc-e5ffdd922098","colab":{"base_uri":"https://localhost:8080/","height":111}},"source":["df_cm_percentage = df_cm.copy()\n","for i in df_cm_percentage:\n","  df_cm_percentage[i]/=df_cm_percentage[i].sum()\n","\n","df_cm_percentage"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>0</th>\n","      <th>4</th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>0</th>\n","      <td>0.80793</td>\n","      <td>0.206985</td>\n","    </tr>\n","    <tr>\n","      <th>4</th>\n","      <td>0.19207</td>\n","      <td>0.793015</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["         0         4\n","0  0.80793  0.206985\n","4  0.19207  0.793015"]},"metadata":{"tags":[]},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"0F3Ay2gNam_x"},"source":["Good, so in the end model learnt to classify both classes about the same. Even though we are using Logistic Regression as a base measure to check if the problem is solvable and what results we might expect some insights can be derived here:\n","\n","1.   Classes are evenly distributed, thus we won't overtrain on one class compared to the other\n","2.   Logistic Regression with simple vectorization of words achieved 80% accuracy, meaning we should be able to get a much better result using language models like BERT\n","3.   Both classes are predicted equally well\n","\n"]},{"cell_type":"markdown","metadata":{"id":"55agZlIF4NLM"},"source":["# Test"]},{"cell_type":"markdown","metadata":{"id":"Gp5usbob4PXa"},"source":["Now we need to check if it actually works. Let's just copy a comment in one of Trump's [tweets](https://twitter.com/realDonaldTrump/status/1315835556081868801). \n","\n","The comment is: \n","PATRIOTIC AMERICANS STAND PROUDLY WITH PRESIDENT TRUMP!!"]},{"cell_type":"code","metadata":{"id":"Toe4vwZdaUYC","executionInfo":{"status":"ok","timestamp":1602560844875,"user_tz":-180,"elapsed":2310,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"38737d95-f3c2-43e1-8b0e-ea0a173103e8","colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["tweet = 'PATRIOTIC AMERICANS STAND PROUDLY WITH PRESIDENT TRUMP!!'\n","vectTweet = vectorizer.transform(np.array([tweet]))  # vectorizes the tweet using our vectorizer\n","\n","prediction = classifier.predict(vectTweet)  # predicts class of the tweet\n","print('Tweet is', 'positive' if prediction[0]==4 else 'negative')"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Tweet is positive\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"2B7dhzZ-FCiE"},"source":["Good, so our model is right here. Let's just make sure our model works okay with very specific sentiments."]},{"cell_type":"code","metadata":{"id":"FI09xnL_5Fd1","executionInfo":{"status":"ok","timestamp":1602560967406,"user_tz":-180,"elapsed":1060,"user":{"displayName":"Adam Ling","photoUrl":"","userId":"04744848912553865948"}},"outputId":"b77dd891-3e2b-461d-cf51-c3e3cbfddc62","colab":{"base_uri":"https://localhost:8080/","height":51}},"source":["tweetList = ['Best tweet ever!', 'Mondays are the worst...']\n","vectTweet = vectorizer.transform(np.array(tweetList))  # vectorizes the tweet using our vectorizer\n","\n","prediction = classifier.predict(vectTweet)  # predicts class of the tweet\n","for enum, i in enumerate(tweetList):\n","  print(i, '| This tweet is', 'positive' if prediction[enum]==4 else 'negative')"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Best tweet ever! | This tweet is positive\n","Mondays are the worst... | This tweet is negative\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"ychK3qv-Eyag"},"source":[""],"execution_count":null,"outputs":[]}]}