
Common Crawl PySpark Examples

This project provides examples of how to process the Common Crawl dataset with Apache Spark and Python (a minimal sketch of the pattern these jobs share follows the list below):

  • count HTML tags in Common Crawl's raw response data (WARC files)
  • count web server names in Common Crawl's metadata (WAT files and/or WARC files)
  • list host names and corresponding IP addresses (WAT files and/or WARC files)
  • word count (term and document frequency) in Common Crawl's extracted text (WET files)
  • extract links from WAT files and construct the (host-level) web graph
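
As a rough illustration of the pattern these jobs share -- read archive records with warcio, map them to key-value pairs, and aggregate with Spark -- here is a minimal, self-contained sketch of a server-name count. It is not the project's own job implementation; the input path is just a placeholder for a local WARC list such as the one written by get-data.sh (see Get Sample Data below):

    from pyspark.sql import SparkSession
    from warcio.archiveiterator import ArchiveIterator

    def server_names(warc_path):
        """Yield the Server HTTP header of every response record in one WARC file."""
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == 'response' and record.http_headers is not None:
                    yield record.http_headers.get_header('Server') or '(no server in HTTP header)'

    if __name__ == '__main__':
        spark = SparkSession.builder.appName('ServerCountSketch').getOrCreate()
        sc = spark.sparkContext
        # one local WARC path per line, e.g. ./input/test_warc.txt
        warc_paths = sc.textFile('input/test_warc.txt') \
                       .map(lambda uri: uri.replace('file://', '', 1))
        counts = warc_paths.flatMap(server_names) \
                           .map(lambda name: (name, 1)) \
                           .reduceByKey(lambda a, b: a + b)
        # print the ten most frequent server names
        for name, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
            print(name, count)
        spark.stop()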

Setup

To develop and test locally, you will need a Spark installation (see Running locally below) and the Python modules from requirements.txt, which you can install with

    pip install -r requirements.txt

Compatibility and Requirements

Tested with Spark 2.1.0 - 2.3.0 in combination with Python 2.7 and/or 3.5.

Get Sample Data

To develop locally, you'll need at least three data files -- one for each format the crawl uses (WARC, WAT, and WET). Running get-data.sh will download this sample data. It also writes input files (illustrated below) containing

  • sample input as file:// URLs
  • all input of one monthly crawl as s3:// URLs
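
For reference, these input files are plain-text lists with one archive location per line. The paths below are illustrative placeholders only; the actual names depend on the crawl and on where the sample data was stored:

    file:///home/user/cc-pyspark/input/sample.warc.gz
    s3://commoncrawl/crawl-data/CC-MAIN-2017-13/segments/.../warc/....warc.gz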

Running locally

First, point the environment variable SPARK_HOME to your Spark installation. Then submit a job via

    $SPARK_HOME/bin/spark-submit ./server_count.py \
        --num_output_partitions 1 --log_level WARN \
        ./input/test_warc.txt servernames

This will count web server names sent in HTTP response headers for the sample WARC input and store the resulting counts in the SparkSQL table "servernames" in your Spark warehouse directory (usually ./spark-warehouse/servernames).

The output table can be accessed via SparkSQL, e.g.,

$SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
... 
Row(key=u'Apache', val=9396)
Row(key=u'nginx', val=4339)
Row(key=u'Microsoft-IIS/7.5', val=3635)
Row(key=u'(no server in HTTP header)', val=3188)
Row(key=u'cloudflare-nginx', val=2743)
Row(key=u'Microsoft-IIS/8.5', val=1459)
Row(key=u'Microsoft-IIS/6.0', val=1324)
Row(key=u'GSE', val=886)
Row(key=u'Apache/2.2.15 (CentOS)', val=827)
Row(key=u'Apache-Coyote/1.1', val=790)
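
In Spark 2.x the pyspark shell also provides a SparkSession bound to the name spark, so the same table can be read without the older sqlContext entry point:

>>> df = spark.read.parquet("spark-warehouse/servernames")
>>> df.sort(df.val.desc()).show(10, truncate=False)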

Running in a Spark cluster over large amounts of data

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it on AWS without incurring any transfer costs. The only cost you incur is that of the machines running your Spark cluster.

  1. Spin up the Spark cluster: AWS EMR provides a ready-to-use Spark installation, but you'll also find multiple descriptions on the web of how to deploy Spark on a cheap cluster of AWS spot instances. See also launching Spark on a cluster.

  2. Choose appropriate cluster-specific settings when submitting jobs, and check the relevant command-line options (e.g., --num_input_partitions or --num_output_partitions) by running

    $SPARK_HOME/bin/spark-submit ./server_count.py --help

  3. Don't forget to deploy all dependencies in the cluster; see advanced dependency management. An example submission is sketched after this list.
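
As a rough sketch only -- assuming a YARN-based cluster such as EMR, with sparkcc.py standing in for whatever shared modules the job imports, and with placeholder partition counts and S3 input list -- a cluster submission might look like:

    $SPARK_HOME/bin/spark-submit \
        --master yarn \
        --py-files sparkcc.py \
        ./server_count.py \
        --num_input_partitions 400 \
        --num_output_partitions 10 \
        ./input/all_warc_s3.txt servernames_all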

Credits

Examples are ported from Stephen Merity's cc-mrjob with a couple of upgrades:

  • based on Apache Spark (instead of mrjob)
  • boto3, supporting multi-part download of data from S3
  • warcio, a Python 2 and Python 3 compatible module to access WARC files

License

MIT License, as per LICENSE
