Schedule and various materials for "Web-scraping and Web-crawling with Python" Workshop, University of Pittsburgh, April 2019
Note: Anaconda takes up 2.17 GB of space! If you are short on disk space, you might want to install Miniconda, which is a lightweight version of Anaconda.
- Go to https://www.anaconda.com/download/ and select the Python 3.6 download (make sure the installer version matches your operating system)
- Click on the downloaded installer and follow the on-screen instructions (on Macs, this is a .pkg file). After it's done installing (this can take 5-10 minutes), double-click the application "Anaconda-Navigator" and make sure it loads properly
- If you would like to work in a Jupyter Notebook (which I recommend), you can open "Anaconda Navigator," click the Launch button on the Jupyter Notebook card (not the JupyterLab card), and then click "New > Python 3" to load a new notebook.
For Windows-specific instructions, visit https://docs.anaconda.com/anaconda/install/windows/
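Once a notebook is open, a first cell along these lines can confirm that the install worked and that the notebook is running Python 3.6 or later. This is only a minimal sketch; the helper name `python_version_ok` is made up for illustration and is not part of Anaconda or Jupyter.

```python
import sys


def python_version_ok(minimum=(3, 6)):
    """Return True if the running interpreter is at least `minimum`.

    `python_version_ok` is a hypothetical helper written for this check.
    """
    return sys.version_info >= minimum


# Show the full version string of the interpreter behind this notebook.
print(sys.version)
print("Python 3.6 or later:", python_version_ok())
```

If the second line prints `False`, the notebook is probably attached to an older Python installation rather than the one Anaconda installed.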
As we go around the room, tell us your name and what you are studying or working on. If you have a specific example, name something you are interested in scraping and why it's potentially a challenge for you. If your interest is more general, say a little about the kinds of sites you would want to scrape and why.
Below is a copy of my slide deck for later reference:
In this part of the workshop, I'll take us through some of the examples I used in my slides, as well as a couple of wildcards. We will look at the source code of several websites and think about how to break down the problem of web-scraping. The example sites are:
- https://mjlavin80.github.io/pseudonyms/
- http://www.oed.com/browsedictionary
- https://memory-alpha.fandom.com/wiki/Starfleet_casualties_(22nd_century)
- https://memory-alpha.fandom.com/wiki/Starfleet_casualties_(23rd_century)
- https://memory-alpha.fandom.com/wiki/Starfleet_casualties_(24th_century) Note: these three wiki pages were used to create a data visualization of Star Trek deaths by shirt color. See http://digg.com/2019/star-trek-shirt-color-death-data-viz and https://www.startrek.com/article/did-redshirts-really-die-more-often-on-tos for some background.
- http://movies2.nytimes.com/learning/general/onthisday/bday/0101.html Note: I scraped the obituaries from the "On This Day" series for a tutorial I wrote for The Programming Historian (forthcoming).
- http://www.nuforc.org/webreports/ndxpost.html Note: A dataset was scraped, geolocated, and time standardized from NUFORC data by Sigmond Axel (see https://github.com/planetsig/ufo-reports). The goal for us is to think about how it would be done.
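One way to start breaking down sites like these is to ask two questions: do the page URLs follow a pattern, and where in the HTML do the links or data sit? The sketch below illustrates both using only the Python standard library, and it makes no network requests. It rests on assumptions: that the "On This Day" pages are named by month and day (inferred only from the `0101.html` ending above), and that an index page lists its reports as ordinary `<a>` tags. `SAMPLE_HTML`, the `report-a.html` style filenames, `LinkCollector`, and `birthday_page_urls` are all invented for illustration and are not taken from the real sites.

```python
from datetime import date, timedelta
from html.parser import HTMLParser
from urllib.parse import urljoin


def birthday_page_urls():
    """Build one URL per calendar day, assuming an MMDD.html naming scheme."""
    base = "http://movies2.nytimes.com/learning/general/onthisday/bday/"
    day = date(2020, 1, 1)  # a leap year, so February 29 is included
    urls = []
    while day.year == 2020:
        urls.append(f"{base}{day:%m%d}.html")
        day += timedelta(days=1)
    return urls


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # set while we are inside an <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up HTML standing in for an index page; the real markup may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="report-a.html">First report</a></li>
  <li><a href="report-b.html">Second report</a></li>
</ul>
"""

collector = LinkCollector("http://www.nuforc.org/webreports/ndxpost.html")
collector.feed(SAMPLE_HTML)
urls = birthday_page_urls()

print(len(urls), urls[0])
for link in collector.links:
    print(link)
```

In practice, the HTML would come from fetching each page, and a library such as Beautiful Soup is a common alternative to `html.parser` for messier markup.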
In this section of the workshop, we will break into groups and focus on some of the web-scraping questions or use cases you have in mind. If participants don't have example sites of their own, we will return to my examples and discuss end-to-end data collection, including modeling, scraping, crawling, and setting up datastores.
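For the datastore step, one lightweight option is SQLite, which ships with Python as the `sqlite3` module. The sketch below is one possible setup, not necessarily what we will use in the workshop: the `pages` table, its columns, the `save_records` helper, and the `example.com` records are all assumptions chosen for illustration.

```python
import sqlite3


def save_records(conn, records):
    """Store (url, title, body) tuples, replacing any row with the same URL.

    The table layout here is a hypothetical data model for scraped pages.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


# An in-memory database keeps the example self-contained;
# pass a filename instead to keep the data between sessions.
conn = sqlite3.connect(":memory:")

# Placeholder records standing in for scraped results.
records = [
    ("https://example.com/a", "Page A", "Text of page A"),
    ("https://example.com/b", "Page B", "Text of page B"),
]

save_records(conn, records)
save_records(conn, records)  # re-running a scrape does not duplicate rows

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)
```

Using the URL as the primary key means a page that is scraped twice is overwritten rather than stored twice, which helps when a crawl is interrupted and restarted.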