Web scraping, web harvesting, or web data extraction is a form of data scraping used to quickly and efficiently extract data from websites. It involves fetching a web page and extracting information from it.
- Websites have a lot of data. If extracted properly, this data can be very useful in the realm of Machine Learning.
- Search engine engineering relies a lot on Web Scraping and other forms of Information Retrieval.
- It can be used to automate mundane search tasks.
- Learning how to web scrape involves exploring many related domains, such as Cybersecurity, Web Development, Web Services, & Natural Language Processing.
We will be using the following Python libraries:
- requests - HTTP for Humans; allows us to navigate the web and access resources.
- beautifulsoup4 - a scraping library; allows us to parse a resource and extract specific information from it.
pip3 install requests beautifulsoup4 pandas # install the libraries
Create the following project layout:
web-scraping
└── main.py
We will build a crawler that surfs across Wikipedia and finds the shortest directed path between two articles.
Import the dependencies
import re
import os
import requests
import argparse
import pandas as pd
from bs4 import BeautifulSoup
from collections import deque
Using the get function in the requests module, you can access a webpage
response = requests.get('https://en.wikipedia.org/wiki/Batman') # pass the webpage url as an argument to the function
print(response.status_code) # 200 OK response if the webpage is present
print(response.headers) # contains the date, size, information about the server and type of file the server is sending to the client
print(response.content) # page content or the html source
There are five classes of response status codes:
- 1xx informational - indicates the request has been received and the server is continuing to process it
- 2xx successful - the request was received, understood and accepted
- 3xx redirection - the client must take additional action to complete the request
- 4xx client error - the request contains bad syntax or refers to a resource that cannot be accessed
- 5xx server error - the server failed to fulfill the request
For more information, you can refer to the List of HTTP status codes article on Wikipedia.
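Before parsing a response, it is a good idea to confirm the request actually succeeded. A minimal check might look like this (raise_for_status() raises an HTTPError for 4xx and 5xx responses):
if response.status_code == 200: # 2xx: the page was fetched successfully
    print('OK, received', len(response.content), 'bytes')
else:
    response.raise_for_status() # raises requests.exceptions.HTTPError for 4xx/5xx codes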
To parse the content or the page source, we will use the BeautifulSoup module. To do so, first create a BeautifulSoup object
soup = BeautifulSoup(response.content, 'html.parser')
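The soup object now represents the whole parsed document. For instance, printing the page title should give something like 'Batman - Wikipedia':
print(soup.title.text) # the contents of the <title> tag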
We can now access specific information directly. To get the first link in the page
link = soup.a
To get all the links on the page
links = soup.find_all('a') # anchor tag contains a link
Let's work on the last link in the page
link = links[-1] # get the last anchor tag
The link is a Tag object corresponding to the anchor tag. Each Tag object contains a few properties
print(link.name) # name of the tag
print(link.attrs) # attributes of the tag
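A tiny standalone snippet makes these two properties concrete (the HTML string below is made up purely for illustration):
demo = BeautifulSoup('<a href="/wiki/Batman" id="hero">Batman</a>', 'html.parser')
tag = demo.a
print(tag.name)  # 'a'
print(tag.attrs) # {'href': '/wiki/Batman', 'id': 'hero'}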
The anchor tag, <a>, has an href attribute which contains the actual link to the page
print(link.attrs['href']) # accessing directly using attrs
print(link['href']) # accessing by treating like a dict
Trying to access an attribute that does not exist raises a KeyError. A better way is to use the get method
print(link['id']) # raises KeyError if the id attribute does not exist
print(link.get('id')) # returns the value if it exists, else None, and does not raise an error
To find only those anchor tags that contain the href attribute
links = soup.find_all('a', href=True) # anchor tag contains a href attribute
You can also specify a regex to match links that fit a requirement. In our case, we only want to keep links to Wikipedia articles. We also want to exclude links that point to documentation (such as the Help or About pages). Notice that the links have a specific format.
Links we need since they point to other Wikipedia pages
/wiki/DC_Thomson
/wiki/Chris_KL-99
/wiki/Wing_(DC_Comics)
Links we don't need since they contain documentation, media or point to other websites
/wiki/Category:American_culture
/wiki/File:Batman_DC_Comics.png
www.wikimediafoundation.org
Hence, we can use the following regex: ^/wiki/[^.:#]*$. This is explained below:
- ^ denotes the start of the string, so ^/wiki/ means the link starts with /wiki/
- [] denotes a character set that is allowed, and [^] denotes a character set that is not allowed; the ^ symbol negates the character set when it appears at the start
- [c]* denotes 0 or more occurrences of a character c, and [^c]* denotes 0 or more occurrences of any character except c
- $ denotes the end of the string
The above regex therefore matches links that start with /wiki/ followed by 0 or more characters that are not ., : or #, up to the end of the string.
links = soup.find_all('a', href=re.compile('^/wiki/[^.:#]*$')) # find all anchor tags that contain the href attribute with the specified regex
The three characters excluded by the character set each signal a link we do not want:
- . indicates a URL for a media file. For example, '/wiki/url/to/file/batman.jpg'.
- : indicates a URL for a Wikipedia meta page. For example, '/wiki/Help:Contents'.
- # indicates a URL for a page with an anchor attached; we chose not to include these.
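To convince yourself the pattern behaves as described, you can test it against the sample links above (a quick check, not part of the crawler itself):
pattern = re.compile('^/wiki/[^.:#]*$')
print(bool(pattern.match('/wiki/DC_Thomson')))                # True: a plain article link
print(bool(pattern.match('/wiki/Category:American_culture'))) # False: contains ':'
print(bool(pattern.match('www.wikimediafoundation.org')))     # False: does not start with /wiki/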
Don't forget, we still have to extract the title from the href since the above returns whole anchor tags
pages = set([link.get('href')[len("/wiki/"):] for link in links])
Let's add a helper function to return the entire url given a title
def wiki(title):
    """Takes a title and wraps it to form a https://en.wikipedia.org URL
    Arguments:
        title {str} -- Title of Wikipedia Article
    Returns:
        {str} -- URL on wikipedia
    """
    return f"https://en.wikipedia.org/wiki/{title}"
Let us organize our code into a function that retrieves a page and returns the set of titles of all articles that appear as links on the page
def get_pages(title):
    """Returns a set of wikipedia articles linked in a wikipedia article
    Arguments:
        title {str} -- Article title
    Returns:
        {set()} -- A set of wikipedia articles
    """
    response = requests.get(wiki(title))
    soup = BeautifulSoup(response.content, 'html.parser')
    links = soup.find_all('a', href=re.compile('^/wiki/[^.:#]*$'))
    pages = set([link.get('href')[len("/wiki/"):] for link in links])
    return pages
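A quick way to test the function is to run it on a single article and inspect the result (the exact numbers will vary as the article is edited):
pages = get_pages('Batman')
print(len(pages))        # number of distinct linked article titles
print(sorted(pages)[:5]) # a few of the titles, alphabetically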
In this section we will learn how to extract information about a celebrity from the infobox of their Wikipedia article; we will use the Bill Gates article as an example.
We will extract the following parameters
header = ['Name', 'Nick Name', 'Born', 'Birth Place', 'Nationality', 'Residence', 'Occupation', 'Parent(s)', 'Website']
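The snippets that follow assume the soup object was built from the article of the celebrity you are interested in; here we assume the Bill Gates article, whose infobox contains the parent entries shown later:
response = requests.get(wiki('Bill_Gates')) # fetch the article whose infobox we will read
soup = BeautifulSoup(response.content, 'html.parser')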
First we extract the table
body = soup.find('table', class_='infobox').tbody
Most of the parameters can be extracted using the class attributes since they are unique
name = body.find(class_='fn').text
nickname = body.find(class_='nickname').text
born = body.find(class_='bday').text
birthplace = body.find(class_='birthplace').text
residence = body.find(class_='label').text
occupation = body.find(class_='role').li.text
url = body.find(class_='url').text
However, the parent names are not under a unique class attribute
<tr>
<th scope="row">Parent(s)</th>
<td>
<div class="plainlist">
<ul>
<li>
<a class="mw-redirect" href="/wiki/William_Henry_Gates_Sr."
title="William Henry Gates Sr.">William Henry Gates Sr.</a>
</li>
<li>
<a href="/wiki/Mary_Maxwell_Gates" title="Mary Maxwell Gates">Mary Maxwell Gates</a>
</li>
</ul>
</div>
</td>
</tr>
We can still extract these by searching for the string Parent(s) and then moving to its parent tag, from which we can access the li tags
parent = body.find('th', string='Parent(s)').parent
father, mother = [li.text for li in parent.find_all('li')]
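Running the same idea against just the fragment shown above illustrates the expected result (a standalone check, separate from the live page):
fragment = BeautifulSoup('<tr><th scope="row">Parent(s)</th><td><ul>'
                         '<li><a href="/wiki/William_Henry_Gates_Sr.">William Henry Gates Sr.</a></li>'
                         '<li><a href="/wiki/Mary_Maxwell_Gates">Mary Maxwell Gates</a></li>'
                         '</ul></td></tr>', 'html.parser')
row = fragment.find('th', string='Parent(s)').parent
print([li.text for li in row.find_all('li')]) # ['William Henry Gates Sr.', 'Mary Maxwell Gates']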
You can now write them into a csv file if you prefer. For this, you will need to import the necessary modules first
import os
import pandas as pd
Create a pandas DataFrame and write it into a csv file
csv = 'path/to/file.csv'
row = [name, nickname, born, birthplace, residence, occupation, father, mother, url] # row of column values
df = pd.DataFrame(row).T # create a pandas DataFrame
if not os.path.isfile(csv): # create a new file if it does not exist
    header = ['Name', 'Nickname', 'Born', 'Birthplace', 'Residence', 'Occupation', 'Father', 'Mother', 'Website'] # column names
    df.to_csv(csv, header=header, index=False)
else: # append to existing csv file
    df.to_csv(csv, mode='a', header=False, index=False)
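To confirm the row was written, you can read the file back with pandas (assuming the same path as above):
print(pd.read_csv(csv)) # prints every row saved so far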
The crawler uses a Breadth-First Search traversal to crawl across the site.
def shortest_path(start, end):
    """
    Finds the shortest path in Wikipedia from the start page to the end page
    Arguments:
        start {str} -- title of the start page (underscores instead of spaces)
        end {str} -- title of the end page (underscores instead of spaces)
    """
    i = 1
    seen = set()
    d = deque([start])
    tree = {start: None}
    level = {start: 1}
    while d:
        # Get element in front
        topic = d.pop()
        seen.add(topic)
        print(f'{i}) Parsed: {topic}, Deque: {len(d)}, Seen: {len(seen)}, Level: {level[topic]}')
        urls = get_pages(topic)
        urls -= seen
        # Update structures with new urls
        seen |= urls
        d.extendleft(urls)
        for child in urls:
            tree[child] = topic
            level[child] = level[topic] + 1
        # Check if page found
        if end in urls:
            topic = end
            break
        i += 1
    # Get path from start to end
    path = []
    while topic in tree:
        path.append(topic)
        topic = tree[topic]
    print(' \u2192 '.join(reversed(path)))
    print(f'Length: {len(path)-1}')
Let us create an interface for our functions
def main():
    """Command line interface for the program
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', '-s', help='Exact name of start page', required=True)
    parser.add_argument('--end', '-e', help='Exact name of end page', required=True)
    args = parser.parse_args()
    start = '_'.join(args.start.split())
    end = '_'.join(args.end.split())
    shortest_path(start, end)

if __name__ == "__main__":
    main()
You can find the complete source code here
To learn how to use the command
python main.py -h
usage: main.py [-h] --start START --end END
optional arguments:
-h, --help show this help message and exit
--start START, -s START
Exact name of start page
--end END, -e END Exact name of end page
Try the program
python main.py -s Web_Scraping -e Hell
We covered:
- Fetching web pages with the requests library and interpreting HTTP status codes
- Parsing HTML and extracting tags, attributes and text with BeautifulSoup
- Filtering links with regular expressions
- Extracting structured information from a Wikipedia infobox and writing it to a CSV file with pandas
- Building a breadth-first crawler that finds the shortest path between two Wikipedia articles
- Wrapping the crawler in a command line interface with argparse