
Python Programming for Data Science: Module 17 - Web Scraping and APIs


Author: Asabeneh Yetayeh
Second Edition: July, 2021

<< Module 16 | Module 18 >>


📘 Module 17

Python Web Scraping

What is Web Scraping?

The internet is full of a huge amount of data that can be used for different purposes. To collect this data, we need to know how to scrape data from a website.

Web scraping is the process of extracting and collecting data from websites and storing it on a local machine or in a database.

In this section, we will use the requests and beautifulsoup packages to scrape data. The package version we are using is beautifulsoup 4.

To start scraping websites, you need requests, beautifulsoup4 and a website.

pip install requests
pip install beautifulsoup4

To scrape data from websites, a basic understanding of HTML tags and CSS selectors is needed. We target content on a website using HTML tags, classes and/or ids. Let us import the requests and BeautifulSoup modules:

import requests
from bs4 import BeautifulSoup

Let us declare a url variable for the website which we are going to scrape.

import requests
from bs4 import BeautifulSoup

url = 'https://archive.ics.uci.edu/ml/datasets.php'

# Let's use the requests get method to fetch the data from the url
response = requests.get(url)

# Let's check the status code of the response
status = response.status_code
print(status) # 200 means the fetching was successful

Using BeautifulSoup to parse content from the page

import requests
from bs4 import BeautifulSoup

url = 'https://archive.ics.uci.edu/ml/datasets.php'

response = requests.get(url)
content = response.content # we get all the content from the website
soup = BeautifulSoup(content, 'html.parser') # BeautifulSoup parses the HTML content
print(soup.title)            # <title>UCI Machine Learning Repository: Data Sets</title>
print(soup.title.get_text()) # UCI Machine Learning Repository: Data Sets
print(soup.body)             # gives the whole body of the page
print(response.status_code)  # 200

tables = soup.find_all('table', {'cellpadding': '3'})
# We are targeting the table whose cellpadding attribute has the value 3.
# We can also select using an id, a class or an HTML tag; for more information, check the beautifulsoup docs.
table = tables[0] # find_all returns a list; we take the first table out of it
for td in table.find('tr').find_all('td'):
    print(td.text)

If you run this code, you can see that the extraction is only half done. You can continue it yourself, since it is part of the exercises below. For reference, check the beautifulsoup documentation.
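Several of the exercises below ask you to store scraped data as a json file. Here is a minimal sketch of the general pattern, assuming you have already collected your rows into a list of dictionaries (the scraped_rows name and its fields are placeholders, not real scraped output):

import json

# Hypothetical example data; in practice this comes from your scraping loop
scraped_rows = [
    {'name': 'Abalone', 'data_types': 'Multivariate', 'instances': '4177'},
    {'name': 'Adult', 'data_types': 'Multivariate', 'instances': '48842'},
]

# Write the list of dictionaries to a json file on the local machine
with open('scraped_data.json', 'w', encoding='utf-8') as f:
    json.dump(scraped_rows, f, ensure_ascii=False, indent=4)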

🌕 You are so special, you are progressing every day. You have only eight days left on your way to greatness. Now do some exercises for your brain and muscles.

💻 Web Scraping Exercises:

  1. Scrape the following website and store the data as a json file (url = 'http://www.bu.edu/president/boston-university-facts-stats/').
  2. Extract the table from this url (https://archive.ics.uci.edu/ml/datasets.php) and change it to a json file.
  3. Scrape the presidents table and store the data as json (https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States). The table is not very structured and the scraping may take a long time.

Application Programming Interface(API)

API

API stands for Application Programming Interface. The kind of API we will cover in this section is the Web API. Web APIs are the defined interfaces through which interactions happen between an enterprise and the applications that use its assets. A Web API also serves as a Service Level Agreement (SLA) that specifies the functional provider and exposes the service path or URL for its API users.

In the context of web development, an API is defined as a set of specifications, such as Hypertext Transfer Protocol (HTTP) request messages, along with a definition of the structure of response messages, usually in an XML or a JavaScript Object Notation (JSON) format.

Web APIs have been moving away from Simple Object Access Protocol (SOAP) based web services and service-oriented architecture (SOA) towards more direct representational state transfer (REST) style web resources.

Through web APIs, social media services have allowed web communities to share content and data across communities and different platforms.

Using an API, content that is created in one place can be dynamically posted and updated in multiple locations on the web.

For example, Twitter's REST API allows developers to access core Twitter data and the Search API provides methods for developers to interact with Twitter Search and trends data.

Many applications provide API endpoints. Some examples are the countries API and the cats' breeds API.

In this section, we will cover a RESTful API that uses HTTP request methods to GET, PUT, POST and DELETE data.
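Before building one, let us consume an existing RESTful API with the requests package we already installed. The following is a minimal sketch assuming the public countries API at restcountries.com is reachable; the endpoint and the response fields may change over time:

import requests

# Hypothetical example: a public countries API (availability and response shape may change)
url = 'https://restcountries.com/v3.1/name/finland'

response = requests.get(url)
print(response.status_code) # 200 if the request succeeded

# The API returns JSON, which requests decodes into Python objects
countries = response.json()
print(countries[0]['name']['common']) # e.g. Finland
print(countries[0]['capital'])        # e.g. ['Helsinki']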

Building an API

A RESTful API is an application program interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data. In the previous sections, we learned about Python, Flask and MongoDB. We will use the knowledge we acquired to develop a RESTful API using Python Flask and a MongoDB database. Every application which has CRUD (Create, Read, Update, Delete) operations has an API to create data, get data, update data or delete data from a database.
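As a preview, here is a minimal sketch of such a RESTful API in Flask. To keep it self-contained, it uses an in-memory list instead of a MongoDB database; the route names and student fields are placeholders:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store standing in for a MongoDB collection in this sketch
students = [{'id': 1, 'name': 'Asabeneh', 'country': 'Finland'}]

@app.route('/api/v1/students', methods=['GET'])
def get_students():
    # Read: return all students as JSON
    return jsonify(students)

@app.route('/api/v1/students', methods=['POST'])
def create_student():
    # Create: add a new student from the JSON request body
    student = request.get_json()
    student['id'] = len(students) + 1
    students.append(student)
    return jsonify(student), 201

@app.route('/api/v1/students/<int:student_id>', methods=['PUT'])
def update_student(student_id):
    # Update: merge the request body into an existing student
    for student in students:
        if student['id'] == student_id:
            student.update(request.get_json())
            return jsonify(student)
    return jsonify({'error': 'student not found'}), 404

@app.route('/api/v1/students/<int:student_id>', methods=['DELETE'])
def delete_student(student_id):
    # Delete: remove the student with the given id
    for student in students:
        if student['id'] == student_id:
            students.remove(student)
            return jsonify({'deleted': student_id})
    return jsonify({'error': 'student not found'}), 404

if __name__ == '__main__':
    app.run(debug=True)

If you run this script, Flask serves the API locally at http://localhost:5000 by default; we will call it with the requests package later in this section.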

To build an API, it is good to understand the HTTP protocol and the HTTP request and response cycle.

HTTP(Hypertext Transfer Protocol)

HTTP is an established communication protocol between a client and a server. A client in this case is a browser, and the server is the place where you access data. HTTP is a network protocol used to deliver resources on the World Wide Web, whether they are HTML files, image files, query results, scripts, or other file types.

A browser is an HTTP client because it sends requests to an HTTP server (Web server), which then sends responses back to the client.

Structure of HTTP

HTTP uses a client-server model. An HTTP client opens a connection and sends a request message to an HTTP server; the HTTP server then returns a response message containing the requested resource. When the request-response cycle completes, the server closes the connection.

HTTP request response cycle

The format of the request and response messages is similar. Both kinds of messages have

  • an initial line,
  • zero or more header lines,
  • a blank line (i.e. a CRLF by itself), and
  • an optional message body (e.g. a file, or query data, or query output).

Let us look at an example of request and response messages by navigating to this site: https://thirtydaysofpython-v1-final.herokuapp.com/. This site has been deployed on a Heroku free dyno and may stop working in some months because of high request volume. Support this work to keep the server running all the time.
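You can also inspect both sides of this cycle directly from Python. Here is a minimal sketch using the requests package; the URL is just a stand-in for any reachable site:

import requests

# Any reachable site works here; this URL is a stand-in
response = requests.get('https://example.com')

# The initial response line parts: status code and reason phrase
print(response.status_code)     # e.g. 200
print(response.reason)          # e.g. OK

# Header lines from the request we sent and the response we got back
print(response.request.headers) # request headers
print(response.headers)         # response headers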

Request and Response header

Initial Request Line (Request Line)

The initial request line is different from the response's. A request line has three parts, separated by spaces:

  • the method name (GET, POST, HEAD),
  • the path of the requested resource, and
  • the version of HTTP being used, e.g. GET / HTTP/1.1

GET is the most common HTTP method; it is used to get or read a resource. POST is a common request method used to create a resource.

Initial Response Line (Status Line)

The initial response line, called the status line, also has three parts separated by spaces:

  • the HTTP version,
  • a response status code that gives the result of the request, and
  • a reason phrase describing the status code.

Examples of status lines are: HTTP/1.0 200 OK or HTTP/1.0 404 Not Found.

The most common status codes are:

  • 200 OK: The request succeeded, and the resulting resource (e.g. file or script output) is returned in the message body.
  • 500 Server Error: The server encountered an unexpected error while handling the request.

A complete list of HTTP status codes can be found here. It can also be found here.

Header Fields

As you can see in the example below, header lines provide information about the request or response, or about the object sent in the message body.

GET / HTTP/1.1
Host: thirtydaysofpython-v1-final.herokuapp.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36
Sec-Fetch-User: ?1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: navigate
Referer: https://thirtydaysofpython-v1-final.herokuapp.com/post
Accept-Encoding: gzip, deflate, br
Accept-Language: en-GB,en;q=0.9,fi-FI;q=0.8,fi;q=0.7,en-CA;q=0.6,en-US;q=0.5,fr;q=0.4

The message body

An HTTP message may have a body of data sent after the header lines. In a response, this is where the requested resource is returned to the client (the most common use of the message body), or perhaps explanatory text if there's an error. In a request, this is where user-entered data or uploaded files are sent to the server.

If an HTTP message includes a body, there are usually header lines in the message that describe the body. In particular,

  • The Content-Type: header gives the MIME type of the data in the body (text/html, application/json, text/plain, text/css, image/gif).
  • The Content-Length: header gives the number of bytes in the body.
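Here is a minimal sketch of checking these two headers from Python; again, the URL is a stand-in for any reachable site:

import requests

response = requests.get('https://example.com')

# Headers describing the body we received
print(response.headers.get('Content-Type'))   # e.g. text/html; charset=UTF-8
print(response.headers.get('Content-Length')) # may be None if the server streams the body
print(len(response.content))                  # the actual number of bytes received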

Request Methods

GET, POST, PUT and DELETE are the HTTP request methods we will use to implement an API or a CRUD application. Each method is shown in action in the sketch after this list.

  1. GET: The GET method is used to retrieve information from the given server using a given URI. Requests using GET should only retrieve data and should have no other effect on the data.

  2. POST: A POST request is used to create data and send it to the server, for example when creating a new post or uploading a file using HTML forms.

  3. PUT: Replaces all current representations of the target resource with the uploaded content; we use it to modify or update data.

  4. DELETE: Removes the specified resource.
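Here is the sketch referred to above: a minimal example calling each of the four methods with the requests package, assuming the hypothetical Flask API sketched earlier in this section is running locally at http://localhost:5000:

import requests

BASE_URL = 'http://localhost:5000/api/v1/students' # hypothetical local API

# GET: read all students
print(requests.get(BASE_URL).json())

# POST: create a new student
created = requests.post(BASE_URL, json={'name': 'Pawel', 'country': 'Poland'}).json()
print(created)

# PUT: update the student we just created
updated = requests.put(f"{BASE_URL}/{created['id']}", json={'country': 'Finland'}).json()
print(updated)

# DELETE: remove the student again
print(requests.delete(f"{BASE_URL}/{created['id']}").json())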

💻 API Exercises

  1. Read more about APIs and HTTP.

🎉 CONGRATULATIONS ! 🎉

<< Module 16 | Module 18 >>