Hadoop Streaming
Hadoop Streaming is a utility that allows the mapper and reducer of a MapReduce job to be any executable or script. This means you can write your code in your favourite language and have Hadoop run it for computation: Streaming reads STDIN line by line for input and emits output to STDOUT.
Choosing this method limits the use of some framework functionality, but for the purposes of learning Hadoop during a one-day event it's usually the better option, especially if you're not used to programming in Java.
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file ~/users/{team}/mapper.py -mapper ~/users/{team}/mapper.py \
    -file ~/users/{team}/reducer.py -reducer ~/users/{team}/reducer.py \
    -input /datasets/wikipedia/* -output /tmp/{team}/job-output
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
    -file ~/users/{team}/mapper.py -mapper ~/users/{team}/mapper.py \
    -file ~/users/{team}/reducer.py -reducer ~/users/{team}/reducer.py \
    -input /datasets/wikipedia/* -output /tmp/{team}/job-output
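Before submitting to the cluster, you can simulate Streaming's map, shuffle/sort, and reduce stages with plain shell pipes. In this sketch awk stands in for your mapper.py and reducer.py; substitute your own scripts in the same positions:

```shell
# Simulate the Streaming pipeline locally:
#   map   -> emit "word<TAB>1" per word (stand-in for mapper.py)
#   sort  -> group equal keys, as Hadoop's shuffle does
#   reduce-> sum the counts per key (stand-in for reducer.py)
result=$(printf 'the quick the\n' |
  awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' |
  sort |
  awk -F'\t' '{n[$1] += $2} END {for (w in n) print w "\t" n[w]}')
echo "$result"
```

Because the local `sort` plays the role of Hadoop's shuffle (delivering equal keys adjacently to the reducer), a pipeline that works here will usually behave the same under Streaming.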
Don't forget that in a MapReduce job, each map and reduce task runs on one of the cluster's slave nodes, and any scripts used in streaming are subject to the slaves' local install of the programming language you choose. The implication is that if your script depends on a library, that library will need to be installed on ALL of the slave nodes; for example, a Ruby script requiring a Rubygem to be present. If your team needs this done, please see a Hopper event organizer (probably Greg).
- -mapper and -reducer are paths on the LOCAL filesystem of the master node
- -file is a repeated argument: pass it once for each mapper and reducer script you want distributed to the cluster nodes (http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Package+Files+With+Job+Submissions)
- -input and -output are paths on the Hadoop filesystem (HDFS)
- Your mapper and reducer must not share the same filename, even if they live in different folders: the jar file generator will overwrite one with the other, leaving you with either the mapper or the reducer for both tasks.