Skip to content

Latest commit

 

History

History
64 lines (52 loc) · 2.2 KB

README.md

File metadata and controls

64 lines (52 loc) · 2.2 KB

Gelato Web Scraper

Web scraper for downloading icecream names, descriptions and dietary requirements into an excel spreadsheet.

I created this web scraper to have data to practice in my PostgreSQL Database Design Exercise.

Output Spreadsheet

Inspection

Here is a screenshot of the inspection for the pagination link. This was a crucial part of the web scraper (to prevent having to hard-code each page into the script).

Navigation Through Pagination (explanation)

I navigated through pagination by sending a request for each "Next" button found:

# scraper.py
# ...
def get_flavours_through_pagination(full_url,add_description=False):
    next=(True, full_url)
    flavours_collected=[]
    while True:
        full_url=next[1]
        response =requests.get(full_url)
        flavours=get_flavours(response,add_description=add_description)
        flavours_collected.extend(flavours)
        next=get_next_pagination_href(full_url)
        print("next is ", next)
        if next[0] == False:
            print("finished!")
            break
    return flavours_collected
# ...

The get_next_pagination_href function will return a tuple with True as first item if a link to the next page was found. When the "Next" button can't be found, the script will not try to send a new request for that section (tuple with False as first item is returned).

# scraper.py
# ...
def get_next_pagination_href(base_url):
    """ Returns a tuple with a boolean and the href if present

    If boolean True -> there is a next page, and the second element is the link

    """
    response =requests.get(base_url)
    soup = bs4.BeautifulSoup(response.content,"lxml")
    ul=soup.find('ul',{"class":"pagination flex items-center"})  
    if ul is not None:
        for element in ul.children:
            try:
                anchor=element.find("a")
                span=anchor.find("span")
                if "Next" in span.text:
                    return (True, BASE_URL+anchor['href']) 
            except:
                pass
    return (False, None)
# ...