Home
Olivier Grisel edited this page Jan 6, 2011
Here are some tips on using the pignlproc tools to mine Wikipedia and DBpedia dumps:
- Fetching the data from the official Wikipedia and DBpedia online dumps, or from EBS volumes on Amazon EC2.
- Splitting a Wikipedia XML dump into small chunks with Mahout, which lets Pig work in parallel, for instance on an S3 bucket. It is also useful for testing a script locally on a small chunk before launching it as a job on a Hadoop cluster.
- Running pignlproc scripts on an EC2 Hadoop cluster using Apache Whirr, which makes it possible to set up a Hadoop cluster on Amazon EC2 with a minimal configuration file and to run a cluster of up to 20 nodes quite easily.
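The "minimal configuration file" for Whirr is a Java properties file. A sketch for a 20-node Hadoop cluster is below; the property names follow the Whirr 0.x conventions, but the exact keys, roles, and instance types are assumptions to verify against the Whirr documentation for your version:

```properties
# hadoop.properties -- sketch of a Whirr cluster definition (keys are assumptions)
whirr.cluster-name=pignlproc-cluster
# 1 master node plus 19 workers = 20 nodes total
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,19 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

The cluster would then be started with something like `whirr launch-cluster --config hadoop.properties` and torn down with `whirr destroy-cluster --config hadoop.properties`.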
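For the first item above, the Wikipedia dumps are published at dumps.wikimedia.org. A minimal fetch sketch is shown below; the URL layout is an assumption based on the current site structure, so check the download pages for the exact dump you need:

```shell
# Sketch: fetching a Wikipedia articles dump from the official download site.
# The URL layout below is an assumption -- verify it on dumps.wikimedia.org.
WIKI=enwiki
DUMP_URL="https://dumps.wikimedia.org/${WIKI}/latest/${WIKI}-latest-pages-articles.xml.bz2"
echo "$DUMP_URL"
# Uncomment to actually download (tens of GB for enwiki; try a smaller wiki first):
# wget "$DUMP_URL"
# DBpedia datasets are published separately on the DBpedia downloads site;
# pick the dataset version matching the Wikipedia dump you fetched.
```

For experiments, a small wiki (e.g. `WIKI=simplewiki`) keeps download and processing times manageable.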
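The splitting step is normally done with Mahout's Wikipedia XML splitter on a real dump (the exact flag names are assumptions here; check `mahout wikipediaXMLSplitter --help`). The underlying idea, cutting the XML stream at `<page>` boundaries, can be sketched in plain shell with `csplit` on a toy dump:

```shell
# On a real dump, something like (flag names are assumptions):
#   bin/mahout wikipediaXMLSplitter -d enwiki-latest-pages-articles.xml.bz2 \
#       -o wikipedia/chunks -c 64     # target chunk size in MB
# Toy illustration of the same idea using GNU csplit:
workdir=$(mktemp -d)
cd "$workdir"
cat > toy-dump.xml <<'EOF'
<mediawiki>
<page><title>A</title><text>alpha</text></page>
<page><title>B</title><text>beta</text></page>
<page><title>C</title><text>gamma</text></page>
</mediawiki>
EOF
# Cut the file before each line containing <page>; writes chunk-00, chunk-01, ...
csplit -z -f chunk- toy-dump.xml '/<page>/' '{*}'
ls chunk-*
```

Each chunk can then be fed to a Pig script independently, which is what enables the parallelism mentioned above.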