
Gelato Web Scraper

Web scraper for downloading ice cream names, descriptions and dietary requirements into an Excel spreadsheet.

I created this web scraper to gather data to practice with in my PostgreSQL Database Design Exercise.

Output Spreadsheet

Inspection

Here is a screenshot of the inspection of the pagination link. This was a crucial part of the web scraper, since it avoids having to hard-code each page URL into the script.
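In essence, the scraper targets a ul with the class pagination flex items-center and, inside it, the anchor whose span text contains "Next". Below is a minimal, illustrative sketch of locating that link with BeautifulSoup; the sample HTML and its URLs are placeholders, not taken from the real site.

# pagination_demo.py -- illustrative only, not part of scraper.py
import bs4

SAMPLE_HTML = """
<ul class="pagination flex items-center">
  <li><a href="/flavours?page=1"><span>Previous</span></a></li>
  <li><a href="/flavours?page=3"><span>Next</span></a></li>
</ul>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "lxml")
ul = soup.find("ul", {"class": "pagination flex items-center"})
for anchor in ul.find_all("a"):
    span = anchor.find("span")
    if span is not None and "Next" in span.text:
        print(anchor["href"])  # prints /flavours?page=3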

Navigation Through Pagination (explanation)

I navigated through the pagination by sending a new request for the URL behind each "Next" button found:

# scraper.py
# ...
def get_flavours_through_pagination(full_url, add_description=False):
    # (found, url) tuple; starts at the first page
    next_page = (True, full_url)
    flavours_collected = []
    while True:
        full_url = next_page[1]
        response = requests.get(full_url)
        # Collect the flavours on the current page
        flavours = get_flavours(response, add_description=add_description)
        flavours_collected.extend(flavours)
        # Look for a "Next" link; stop when there is none
        next_page = get_next_pagination_href(full_url)
        print("next is", next_page)
        if not next_page[0]:
            print("finished!")
            break
    return flavours_collected
# ...

The get_next_pagination_href function returns a tuple whose first item is True when a link to the next page was found, with the link itself as the second item. When the "Next" button can't be found, a tuple with False as the first item is returned and the script stops sending requests for that section.

# scraper.py
# ...
def get_next_pagination_href(base_url):
    """Return a tuple with a boolean and the href, if present.

    If the boolean is True, there is a next page and the second element is its link.
    """
    response = requests.get(base_url)
    soup = bs4.BeautifulSoup(response.content, "lxml")
    ul = soup.find("ul", {"class": "pagination flex items-center"})
    if ul is not None:
        for element in ul.children:
            try:
                anchor = element.find("a")
                span = anchor.find("span")
                if "Next" in span.text:
                    return (True, BASE_URL + anchor["href"])
            except (AttributeError, KeyError):
                # Skip text nodes and list items without a "Next" link
                pass
    return (False, None)
# ...
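The collected flavours are then written to an Excel file with openpyxl. The repository's own writer function isn't reproduced here; the sketch below shows one way the output step could look, assuming each flavour is a dict with "name", "description" and "dietary" keys (the actual structure returned by get_flavours may differ).

# excel_output_sketch.py -- illustrative only, not the repository's actual writer
import openpyxl

def save_flavours_to_excel(flavours, path="flavours.xlsx"):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    # Header row for the three pieces of data the scraper collects
    sheet.append(["Name", "Description", "Dietary Requirements"])
    for flavour in flavours:
        sheet.append([
            flavour.get("name"),
            flavour.get("description"),
            flavour.get("dietary"),
        ])
    workbook.save(path)

# Example usage (BASE_URL and the functions above come from scraper.py):
# flavours = get_flavours_through_pagination(BASE_URL, add_description=True)
# save_flavours_to_excel(flavours)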
