{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Web scraping Rate My Professor.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Web_scraping_Rate_My_Professor.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cm1axZ2cHUa5",
        "colab_type": "text"
      },
      "source": [
        "# Web scraping using BeautifulSoup and Requests libraries (Rate My Professors Website)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ML2ZARBRsU0g",
        "colab_type": "text"
      },
      "source": [
        "## Let’s scrape the data from the web using Python."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "42v-u2qzsRS-",
        "colab_type": "text"
      },
      "source": [
        "![alt text](https://raw.githubusercontent.com/pluralsight/guides/master/images/310d6edd-b569-408a-a61d-f6d9a9a9eb61.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fq6AxmBYsXrT",
        "colab_type": "text"
      },
      "source": [
        "Today we will be scraping the “[Rate My Professors](https://www.ratemyprofessors.com/)” website. Rate My Professors is a website that hosts ratings of professors and schools. You can search for any professor or school and check their ratings before enrolling in a course, which is a handy way to learn more about a professor or a university that you want to join. In this tutorial, we will see how to scrape a specific professor’s page and extract that professor’s tags. A word of caution: fetching a page once or twice is unlikely to cause trouble, but mass scraping can get your IP address blocked. Run the request once or twice, and don’t put it inside a loop that hits the site over and over."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "teMB0jnMELj2",
        "colab_type": "text"
      },
      "source": [
        "![alt text](https://i.kym-cdn.com/entries/icons/original/000/026/575/5ac80d20dfc1e.image.jpg)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TESiRp8VE44M",
        "colab_type": "text"
      },
      "source": [
        "**What is Web Scraping?**\n",
        "\n",
        "Web scraping (also called data extraction or data harvesting) is a technique for extracting data from websites. It can be very useful, because it lets us pull the data we are looking for straight from the web. It can also be a bad way to get data, because taking a site’s data without permission is a bit like stealing it. Limit your scraping to once or twice so that you stay out of trouble.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mkHyOX1psqcH",
        "colab_type": "text"
      },
      "source": [
        "## The most useful libraries required for web scraping are:\n",
        "\n",
        "1. [Beautiful Soup.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
        "\n",
        "2. [Requests.](https://2.python-requests.org/en/master/)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kY2HX-VftA4T",
        "colab_type": "text"
      },
      "source": [
        "## These are the steps that we would be following throughout this tutorial:\n",
        "\n",
        "1. Importing the required libraries.\n",
        "\n",
        "2. Getting the URL and storing it in a variable.\n",
        "\n",
        "3. Making a request to the website using the requests library.\n",
        "\n",
        "4. Using the Beautiful Soup library to get the HTML (raw) data from the website.\n",
        "\n",
        "5. Using the soup.findAll method to get the respective tags that we are looking for.\n",
        "\n",
        "6. Removing all the HTML tags and converting the result to plain text."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eAe3tdqstNU0",
        "colab_type": "text"
      },
      "source": [
        "You might be wondering which tags to extract. On the Rate My Professors website, every professor has their own tags (such as hilarious, heavy homework, or study hard or fail). In this tutorial we will extract these tags, as shown below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8Sy58JjEtuG5",
        "colab_type": "text"
      },
      "source": [
        "![alt text](http://i64.tinypic.com/35klron.png)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nxNY9FqDt5pN",
        "colab_type": "text"
      },
      "source": [
        "Before we begin, make sure to scrape the data at a slow pace. You can also use a VPN service to get a different IP address, which helps keep your own IP address from being banned, but I hope you will simply follow the instructions. Here is an [article](https://hackernoon.com/how-to-scrape-a-website-without-getting-blacklisted-271a605a0d94) on how to scrape a website without getting blacklisted. One more thing: I will not explain every single line of code, because most of the Python here is self-explanatory. I will still try to keep things clear and simple, so that anyone can follow along regardless of their programming background. Some parts are a mechanical process that you just have to follow. The entire source code can be found on my GitHub, and if you have any doubts, let me know in the comments section below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8e4KeKhit_TD",
        "colab_type": "text"
      },
      "source": [
        "Let’s get started."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5QMkozcWJi4j",
        "colab_type": "text"
      },
      "source": [
        "**Step 1: Importing the required libraries**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qte8uE9SuFA1",
        "colab_type": "text"
      },
      "source": [
        "Let us import the two libraries we need: Requests and BeautifulSoup."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bQIL3UbWh3II",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import requests\n",
        "from bs4 import BeautifulSoup"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "E6GDKT8Gu-ov",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "81N6kDoxM4ke",
        "colab_type": "text"
      },
      "source": [
        "**Step 2: Getting the URL and storing it in a variable.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "02ugpc6OuK0U",
        "colab_type": "text"
      },
      "source": [
        "Let us store the URL of the professor’s page in a variable named “url”. The page can be found here: “[Rate My Professors](https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1986099)”."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aadHoY69jEWI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "url = 'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1986099'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B6-3kmHIu9nV",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KP2pzOgWNG4u",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "**Step 3: Making a request to the website using the requests library.**\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9fHhlDOQuTB4",
        "colab_type": "text"
      },
      "source": [
        "Here we call requests.get, passing “url” as the argument. Be careful not to run this cell many times. If you get Response [200], the request succeeded. Any other status code means something went wrong, for example the URL is incorrect or the site refused the request."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "r0wPh55bjJrp",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "page = requests.get(url)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yY1A4nnijnfT",
        "colab_type": "code",
        "outputId": "0e696a12-81ab-4f85-ef26-2af358ab444f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "page"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<Response [200]>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        }
      ]
    },
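    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The status check above can also be done in code instead of by eye. The cell below is only a sketch and not part of the original walkthrough: `fetch_page` is a hypothetical helper name, and the 10 second timeout is an assumed value. It uses `raise_for_status`, which raises `requests.HTTPError` for 4xx and 5xx responses. The cell only defines the helper; calling it sends a real request, so call it once at most."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import requests\n",
        "\n",
        "\n",
        "def fetch_page(url, timeout=10):\n",
        "    # Hypothetical helper: fetch the page once and fail loudly on errors.\n",
        "    # timeout=10 is an assumed value; it stops the call from hanging forever.\n",
        "    response = requests.get(url, timeout=timeout)\n",
        "    # Raises requests.HTTPError if the status code is 4xx or 5xx.\n",
        "    response.raise_for_status()\n",
        "    return response\n",
        "\n",
        "\n",
        "# Usage (sends one real request, so it is left commented out):\n",
        "# page = fetch_page(url)"
      ],
      "execution_count": null,
      "outputs": []
    },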
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MXK-919ku8h5",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g9-SmRvFNtcz",
        "colab_type": "text"
      },
      "source": [
        "**Step 4: Using the Beautiful Soup library to get the HTML (raw) data from the website.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sqwmHpL6uYBt",
        "colab_type": "text"
      },
      "source": [
        "Here we create a BeautifulSoup object by passing page.text as the first argument and selecting Python’s built-in HTML parser (“html.parser”). You can try printing the soup, but it will not give you the answer directly; it contains the page’s entire HTML, so I decided not to show it here."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "B32OMv-4j1BJ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "soup = BeautifulSoup(page.text, \"html.parser\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zvuZ0nsSu7ac",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PnS9Q0nGOJAa",
        "colab_type": "text"
      },
      "source": [
        "**Step 5: Using the soup.findAll method to get the respective tags that we are looking for.**\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7o3aamYtucnr",
        "colab_type": "text"
      },
      "source": [
        "This is where you specify the tags that you are looking for. To find the tag name, right-click on the element in the webpage and choose Inspect, or press Ctrl+Shift+I to open the browser’s developer tools. A panel showing the selected element will open on the right-hand side, as shown below:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gwSVFYTHulMS",
        "colab_type": "text"
      },
      "source": [
        "![alt text](http://i64.tinypic.com/2j2w9b8.png)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G5kH-6oVurgW",
        "colab_type": "text"
      },
      "source": [
        "You can then copy the HTML tag name and its class, if it has one, and pass them to the soup.findAll method. In this case, the HTML tag is “span” and the class is “tag-box-choosetags”."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mOZlRX52lyQK",
        "colab_type": "code",
        "outputId": "59d32d6d-9122-43e7-ef0e-5d69da1222ce",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 260
        }
      },
      "source": [
        "proftags = soup.findAll(\"span\", {\"class\": \"tag-box-choosetags\" })\n",
        "proftags"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[<span class=\"tag-box-choosetags\"> Hilarious <b>(11)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Caring <b>(9)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Accessible outside class <b>(8)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Amazing lectures <b>(5)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Clear grading criteria <b>(4)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Inspirational <b>(4)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Group projects <b>(3)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Respected <b>(3)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Gives good feedback <b>(2)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> EXTRA CREDIT <b>(2)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Participation matters <b>(1)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Lecture heavy <b>(1)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> Test heavy <b>(1)</b></span>,\n",
              " <span class=\"tag-box-choosetags\"> So many papers <b>(1)</b></span>]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
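    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "If you want to try the lookup without sending another request, the sketch below runs it on a small hand-written HTML snippet. The snippet is made up to mimic the output above and is not fetched from the site; `sample_html`, `sample_soup` and `sample_tags` are illustrative names. It uses `find_all`, the current Beautiful Soup 4 spelling of `findAll`, with the `class_` keyword in place of the attribute dictionary."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from bs4 import BeautifulSoup\n",
        "\n",
        "# Made-up HTML that mimics the structure seen in the output above (an assumption).\n",
        "sample_html = '''\n",
        "<div>\n",
        "  <span class=\"tag-box-choosetags\"> Hilarious <b>(11)</b></span>\n",
        "  <span class=\"tag-box-choosetags\"> Caring <b>(9)</b></span>\n",
        "  <span class=\"other\"> Not a tag </span>\n",
        "</div>\n",
        "'''\n",
        "\n",
        "sample_soup = BeautifulSoup(sample_html, 'html.parser')\n",
        "\n",
        "# Keep only the <span> elements whose class is tag-box-choosetags.\n",
        "sample_tags = sample_soup.find_all('span', class_='tag-box-choosetags')\n",
        "print(len(sample_tags))"
      ],
      "execution_count": null,
      "outputs": []
    },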
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U2edHSXyu5z1",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9SOauojTRxgh",
        "colab_type": "text"
      },
      "source": [
        "**Step 6: Removing all the HTML tags and converting the result to plain text.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G9RGUKtHuwU0",
        "colab_type": "text"
      },
      "source": [
        "Here we remove all the HTML tags and keep only the text. This is done by calling the get_text method on each element inside a for loop."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "m6xBHR-fmJAD",
        "colab_type": "code",
        "outputId": "6ef8ce58-0113-4100-974f-a1c2bbc313e0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 260
        }
      },
      "source": [
        "for mytag in proftags:\n",
        "  print(mytag.get_text())"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            " Hilarious (11)\n",
            " Caring (9)\n",
            " Accessible outside class (8)\n",
            " Amazing lectures (5)\n",
            " Clear grading criteria (4)\n",
            " Inspirational (4)\n",
            " Group projects (3)\n",
            " Respected (3)\n",
            " Gives good feedback (2)\n",
            " EXTRA CREDIT (2)\n",
            " Participation matters (1)\n",
            " Lecture heavy (1)\n",
            " Test heavy (1)\n",
            " So many papers (1)\n"
          ],
          "name": "stdout"
        }
      ]
    },
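    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The printed lines still mix the tag name with its count. As an optional extra step that is not part of the original walkthrough, the sketch below splits each line into a name and a number using a regular expression. `parse_tag` is a hypothetical helper, and `raw_tags` holds a few strings copied from the output above, so the cell runs without touching the website."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import re\n",
        "\n",
        "# A few strings copied from the printed output above.\n",
        "raw_tags = [' Hilarious (11)', ' Caring (9)', ' EXTRA CREDIT (2)']\n",
        "\n",
        "\n",
        "def parse_tag(text):\n",
        "    # Split ' Label (N)' into ('Label', N); return None if the text does not match.\n",
        "    match = re.fullmatch(r'\\s*(.+?)\\s*\\((\\d+)\\)\\s*', text)\n",
        "    if match is None:\n",
        "        return None\n",
        "    return match.group(1), int(match.group(2))\n",
        "\n",
        "\n",
        "parsed = [parse_tag(t) for t in raw_tags]\n",
        "print(parsed)"
      ],
      "execution_count": null,
      "outputs": []
    },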
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EgJ12-z9Sl9p",
        "colab_type": "text"
      },
      "source": [
        "**We now have the information we were looking for: all of the professor’s tags. This is how to scrape data from the internet using the Requests and Beautiful Soup libraries. To be frank, this is my professor, who teaches “Data Science”. He is one of the best professors in the entire university, and I like his teaching and his style.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v9XLrULOu1aq",
        "colab_type": "text"
      },
      "source": [
        "**Thank you for taking the time to read my tutorial, and stay tuned for more updates. Let me know what you think of it in the comment section below. If you have any doubts about the code, the comment section is all yours. Have a nice day.**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SVYBkYgbu4cm",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    }
  ]
}