WebScraping
zstumgoren edited this page Feb 2, 2012 · 6 revisions
Like most programming languages, Python provides a wealth of tools for scraping data off the web. Below are some resources to help get you started.
The first step in scraping is making an HTTP request. Below are some useful libraries for fetching data over the Web.
- urllib - the traditional (no frills) library for making HTTP requests. This library comes pre-packaged with Python.
- urllib2 - another built-in library for constructing web requests, including those that require authentication. (Note: in Python 3.x this library was folded back into urllib, where its functionality lives in urllib.request and urllib.error.)
- httplib2 - "A comprehensive HTTP client library that supports many features left out of other HTTP libraries."
- mechanize - a stateful web crawler (similar to the process of stepping through a website with a browser).
- requests - A newer library that provides a very clean, intuitive interface for making HTTP requests.
- scrapelib - Created by the Sunlight Foundation, this library bakes in caching, ftp downloads, and other goodies.
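As a rough illustration of the fetching step, here is a minimal sketch that uses only the standard library (urllib.request, the Python 3 home of urllib2). The function name `fetch`, the default timeout, and the `example-scraper/0.1` User-Agent string are assumptions made up for this example; the third-party libraries listed above offer friendlier interfaces for the same job.

```python
import urllib.request


def fetch(url, timeout=10.0, user_agent="example-scraper/0.1"):
    """Download `url` and return the response body as text.

    `fetch` and its defaults are hypothetical, chosen for illustration.
    """
    # Identify the scraper with a User-Agent header (placeholder value).
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("https://example.com/")` should return the page's HTML as a string, ready to hand to one of the parsers. A real scraper would likely also want error handling and pauses between requests.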
The second step after downloading your data is parsing it. Below are some libraries that parse HTML and provide an easy API for extracting elements.
- BeautifulSoup - A traditional favorite among scrapers for HTML parsing. Not as feature-rich as lxml, but often gets the job done. A good first library to start with.
- html5lib
- lxml - a robust library that supports multiple HTML/XML parser types, and provides advanced features such as extracting page elements using CSS selectors.
- scrapy - "an application framework for crawling web sites and extracting structured data" (packages together the request and scraping bits)
- WebScraping101 - a series of basic web scrapes that demonstrate basic Python syntax
- ScraperWiki - contains tutorials, sample code, and even lets you ask others to write a scraper for you (though why would we ever do that, right?)
- An Introduction to Compassionate Screen Scraping, by Will Larson. This is a very good intro to scraping sites in a responsible way.
- Python Recipe: Grab page, scrape table, download file, by Ben Welsh
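To make the parsing step described earlier more concrete, here is a small sketch that pulls the links out of an HTML string. It deliberately uses the standard library's html.parser so it runs without installing anything; `LinkExtractor` and `extract_links` are hypothetical names invented for this example. BeautifulSoup or lxml would typically do the same job in fewer lines and cope better with messy markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for each <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; anchors without href are skipped
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/a">First</a>')` returns `[("/a", "First")]`.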