zstumgoren edited this page Feb 2, 2012 · 6 revisions

Overview

Like most programming languages, Python provides a wealth of tools for scraping data off the web. Below are some resources to help get you started.

Modules

HTTP Requests

The first step in scraping is making an HTTP request. Below are some useful libraries for fetching data over the Web.

  • urllib - the traditional (no frills) library for making HTTP requests. This library comes pre-packaged with Python.
  • urllib2 - another built-in library for constructing web requests, including those that require authentication. (Note: in Python 3.x, this library has been folded back into urllib.)
  • httplib2 - "A comprehensive HTTP client library that supports many features left out of other HTTP libraries."
  • mechanize - a stateful web crawler (similar to the process of stepping through a website with a browser).
  • requests - A newer library that provides a very clean, intuitive interface for making HTTP requests.
  • scrapelib - Created by the Sunlight Foundation, this library bakes in caching, ftp downloads, and other goodies.
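
As a rough illustration of the request step, here is a minimal sketch that uses only the built-in urllib, assuming Python 3 (where urllib2's functionality lives in urllib.request). The function name `fetch`, the User-Agent string, and the timeout value are made up for this example; the third-party libraries listed above offer friendlier interfaces for the same task.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download `url` and return the response body as text."""
    # Hypothetical User-Agent string, used here to identify the scraper.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8
        # (an assumption) when none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://example.com/")` would return the page's HTML as a string, ready to hand to a parser.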

HTML/XML Parsing

The second step after downloading your data is parsing it. Below are some libraries that parse HTML and provide an easy API for extracting elements.

  • BeautifulSoup - A traditional favorite among scrapers for HTML parsing. Not as feature-rich as lxml, but often gets the job done. A good first library to start with.
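  • html5lib - a pure-Python parser that follows the HTML5 parsing rules, so it handles messy markup much the way a browser does.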
  • lxml - a robust library that supports multiple HTML/XML parser types, and provides advanced features such as extracting page elements using CSS selectors
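
To show what the parsing step involves, here is a minimal sketch that pulls every link out of a page. It deliberately uses Python 3's built-in html.parser module rather than any of the libraries above, so that it runs without installing anything; BeautifulSoup and lxml typically do the same job in fewer lines. The names `LinkExtractor` and `extract_links` are invented for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Passing the text of a downloaded page to `extract_links` returns a list of URLs that a scraper could then follow or record.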

Scraping Frameworks

  • scrapy - "an application framework for crawling web sites and extracting structured data" (packages together the request and scraping bits)

Tutorials
