A workshop created by Sami Friedrich for the BioData Club Workshop Series.
The internet is overflowing with data ripe for harvesting. The challenge is that not all of that data is neatly formatted or easily accessible. Enter the web scraping multitool! With the power of web scraping, the contents of virtually any webpage can be transformed into analysis-ready data. During this workshop, you’ll learn how to use Python to:
- Scavenge the contents of an HTML webpage
- Extract only the data you want
- Format the data into a table
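
The three steps above can be sketched in a few lines. This is a minimal illustration, not the workshop's own solution: the HTML snippet, the class names (`beer`, `name`, `abv`), and the values are all made up for the example, and the page is held in a string so that nothing is downloaded.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded webpage. With a live site,
# the text would typically come from requests.get(<url>).text instead.
html = """
<html><body>
  <h1>Tap list</h1>
  <div class="beer"><span class="name">Example Ale</span><span class="abv">5.2%</span></div>
  <div class="beer"><span class="name">Sample Stout</span><span class="abv">7.0%</span></div>
  <div class="ad">Unrelated content</div>
</body></html>
"""

# 1. Scavenge: parse the raw HTML into a searchable tree.
soup = BeautifulSoup(html, "html.parser")

# 2. Extract: keep only the elements we care about (class names are assumed).
rows = []
for item in soup.find_all("div", class_="beer"):
    rows.append({
        "name": item.find("span", class_="name").get_text(strip=True),
        "abv": item.find("span", class_="abv").get_text(strip=True),
    })

# 3. Format: turn the list of dictionaries into a table.
table = pd.DataFrame(rows)
print(table)
```

On a real page, the tag and class names to search for are usually found with the browser's Inspect tool.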
- requests
- BeautifulSoup4
- pandas
- the browser Inspector/Inspect tool
- Google Colab
- Some basic Python knowledge (looping through list elements, passing arguments to functions, writing basic functions) is a prerequisite for this workshop.
- If you are new to Python or want to brush up on these topics before the workshop, check out these free tutorials:
- We will also be working with HTML. No prior experience is necessary, but it will help to have a surface-level understanding of HTML elements: namely, their open/close tag structure and how they nest within each other.
- If you are not familiar with HTML elements or tags, please take a look at this short overview on HTML Basics before beginning.
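
To make the open/close tag structure and nesting concrete, here is a small sketch that uses only Python's standard-library `html.parser`. The `NestingPrinter` class and the HTML snippet are hypothetical and exist only for this illustration; the sketch assumes every element has an explicit closing tag.

```python
from html.parser import HTMLParser


class NestingPrinter(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # An opening tag such as <p> starts a new element at the current depth.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag such as </p> ends the element and moves back up a level.
        self.depth -= 1


# Made-up snippet: a <div> containing two <p> elements, one of which holds a <b>.
snippet = "<div><p>Hello, <b>world</b>!</p><p>Bye</p></div>"

parser = NestingPrinter()
parser.feed(snippet)
parser.close()

# Print the tags indented by nesting depth.
for depth, tag in parser.outline:
    print("  " * depth + f"<{tag}>")
```

The indentation of the printed outline mirrors how the elements sit inside one another, which is the same structure the browser's Inspect tool displays.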
- webscraping_workshop.ipynb is the Jupyter Notebook (without solutions) for the workshop. Follow the badge at the top to open it in Google Colab, or download it and run it locally (just make sure you've already installed the libraries listed above).
- solutions_to_webscraping_workshop.ipynb contains solutions to the Jupyter Notebook exercises.
- taphunter_belmont_station.html is the downloaded .html file for the webpage this workshop is designed to scrape. If you're running things locally, be sure to place this file in the same folder as webscraping_workshop.ipynb.
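
If you are running things locally, a quick check like the following can confirm the setup before you open the notebook. It is only a sketch: the helper names `missing_libraries` and `html_file_present` are invented for this example. Note that the BeautifulSoup4 package is imported under the module name `bs4`.

```python
from importlib.util import find_spec
from pathlib import Path

# Package name (as listed in this README) mapped to its import name.
REQUIRED_MODULES = {
    "requests": "requests",
    "BeautifulSoup4": "bs4",
    "pandas": "pandas",
}


def missing_libraries(required=REQUIRED_MODULES):
    """Return the package names whose modules cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


def html_file_present(folder="."):
    """Check that the workshop's HTML file sits in the given folder."""
    return (Path(folder) / "taphunter_belmont_station.html").is_file()


print("Missing libraries:", missing_libraries() or "none")
print("HTML file found:", html_file_present())
```

Run it from the folder that contains webscraping_workshop.ipynb so that the relative path points at the right place.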
The Google Slides presentation accompanying this workshop can be found here.
Sami Friedrich, PhD candidate at Oregon Health and Science University. Please feel free to reach out with questions or comments!
This project is licensed under the MIT License (see LICENSE).