WebScraping


StevenMaude edited this page Nov 4, 2014 · 6 revisions

Overview

Python provides a wealth of tools for scraping data off the web. Below are some resources to help get you started.

Modules

HTTP Requests

The first step in scraping is making an HTTP request. Below are some useful libraries for fetching data over the web.

urllib - the traditional, no-frills library for making HTTP requests. It ships with Python.
urllib2 - another built-in library for constructing web requests, including those that require authentication. (Note: in Python 3.x, urllib2 was merged into urllib, mainly as urllib.request and urllib.error.)
httplib2 - "A comprehensive HTTP client library that supports many features left out of other HTTP libraries."
mechanize - a stateful web crawler (similar to the process of stepping through a website with a browser).
requests - a newer library that provides a very clean, intuitive interface for making HTTP requests.
scrapelib - created by the Sunlight Foundation, this library bakes in caching, FTP downloads, and other goodies.
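
As a minimal sketch of the fetching step, the snippet below uses only the standard library's urllib.request (the Python 3 home of urllib2). The function name `fetch`, the `USER_AGENT` string, and the default timeout are illustrative assumptions, not part of any library listed above.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with one that identifies your scraper.
USER_AGENT = "example-scraper/0.1"


def fetch(url, timeout=10):
    """Return the body of `url` decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://example.com/")` should return that page's HTML as a string. The third-party libraries above (requests in particular) wrap the same idea in a friendlier interface.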

HTML/XML Parsing

The second step after downloading your data is parsing it. Below are some libraries that parse HTML and provide an easy API for extracting elements.

BeautifulSoup - A traditional favorite among scrapers for HTML parsing. Not as feature-rich as lxml, but often gets the job done. A good first library to start with.
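html5lib - a pure-Python library that parses HTML the same way modern browsers do, following the WHATWG HTML specification.
lxml - a robust library that supports multiple HTML/XML parser types, and provides advanced features such as extracting page elements using CSS selectors.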
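
To illustrate the parsing step without installing anything, here is a rough sketch that pulls link targets out of a page. It uses the standard library's html.parser module rather than any of the libraries listed above, which offer richer APIs for the same job. The names `LinkCollector` and `extract_links` are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in `html`, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/about">About</a>')` gives `['/about']`. With BeautifulSoup or lxml the same extraction usually takes only a line or two, which is a large part of their appeal.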

Scraping Frameworks

scrapy - "an application framework for crawling web sites and extracting structured data" (packages together the request and scraping bits)

Tutorials

WebScraping101 - a series of simple web scrapes that demonstrate basic Python syntax
ScraperWiki - contains tutorials and sample code, and even lets you ask others to write a scraper for you (though why would we ever do that, right?)
An Introduction to Compassionate Screen Scraping, by Will Larson. This is a very good intro to scraping sites in a responsible way.
Python Recipe: Grab page, scrape table, download file, by Ben Welsh