Hadoop Mapper and Reducer Scripts for Python

This repository contains solutions to common mapper and reducer problems in Hadoop using Python. Most online resources for Hadoop are geared towards Java environments, so this repository aims to provide Python solutions for Hadoop streaming.

Hadoop Installation:

Windows:

  • Watch this video for Hadoop installation on Windows.

Ubuntu:

  • Follow this video for Hadoop installation on Ubuntu.

Basic Hadoop Commands:

  1. Format Namenode (first-time setup only; formatting erases existing HDFS metadata):

    hdfs namenode -format
  2. Start Hadoop Services:

    start-all.sh
  3. Create Input Directory in HDFS:

    hdfs dfs -mkdir /input
  4. Upload Input File to HDFS:

    hdfs dfs -put /path/to/input.txt /input/input.txt
  5. Run Hadoop Streaming (the /output directory must not already exist):

    hadoop jar /path/to/hadoop-streaming.jar \
    -files /path/to/mapper.py,/path/to/reducer.py \
    -input /input/input.txt \
    -output /output \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py"
  6. Copy Output from HDFS to Local File:

    hdfs dfs -text /output/* > /path/to/outputfile.txt
  7. Remove Output and Input Directories from HDFS:

    hdfs dfs -rm -r /output
    hdfs dfs -rm -r /input

Testing Mapper and Reducer Scripts:

You can test the mapper and reducer scripts separately to ensure they work correctly:

  1. Test Mapper Script:

    cat /path/to/input.txt | python3 /path/to/mapper.py
  2. Test Reducer Script (sort the mapper output first, because Hadoop delivers keys to the reducer in sorted order):

    sort /path/to/mapper_output.txt | python3 /path/to/reducer.py
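
The same local check can be scripted in Python. The sketch below simulates the map, sort, and reduce phases in memory. The names `run_local`, `parse_pair`, and `max_per_key` are hypothetical helpers for illustration and are not part of this repository.

```python
from itertools import groupby


def run_local(lines, map_fn, reduce_fn):
    """Simulate map -> sort -> reduce in memory (illustrative helper)."""
    # Map phase: each input line may yield any number of (key, value) pairs.
    pairs = [kv for line in lines for kv in map_fn(line)]
    # Shuffle phase: Hadoop delivers keys to the reducer in sorted order.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: call reduce_fn once per key with all of its values.
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results


def parse_pair(line):
    """Toy mapper: 'key value' -> (key, int(value))."""
    key, value = line.split()
    yield key, int(value)


def max_per_key(key, values):
    """Toy reducer: keep the largest value seen for each key."""
    return key, max(values)


if __name__ == "__main__":
    for key, value in run_local(["a 3", "b 5", "a 7"], parse_pair, max_per_key):
        print(f"{key}\t{value}")
```

Running the pipeline in memory this way catches logic errors quickly, but it does not exercise stdin/stdout handling, so the shell commands above remain the closer match to what Hadoop Streaming does.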

Algorithm Explanations:

Recommendation System:

  • Mapper: Preprocesses user-item ratings.
  • Reducer: Generates recommendations based on similarity measures between users.
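
As a rough illustration of the two roles, the sketch below assumes input lines of the form `user,item,rating` and uses cosine similarity between users. Both the input format and the similarity measure are assumptions made for this example; the scripts in the repository may use a different layout or measure. The function names are hypothetical.

```python
import math
from collections import defaultdict


def parse_ratings(lines):
    """Mapper-style preprocessing: 'user,item,rating' -> {user: {item: rating}}."""
    ratings = defaultdict(dict)
    for line in lines:
        user, item, rating = line.strip().split(",")
        ratings[user][item] = float(rating)
    return ratings


def cosine(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user):
    """Reducer-style step: suggest unseen items from the most similar other user."""
    others = [u for u in ratings if u != user]
    if not others:
        return []
    best = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {i: r for i, r in ratings[best].items() if i not in ratings[user]}
    # Highest-rated unseen items first.
    return sorted(unseen, key=unseen.get, reverse=True)


if __name__ == "__main__":
    sample = [
        "alice,m1,5", "alice,m2,4",
        "bob,m1,5", "bob,m2,4", "bob,m3,5",
        "carol,m1,1", "carol,m4,5",
    ]
    print(recommend(parse_ratings(sample), "alice"))
```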

Page Rank:

  • Mapper: Prepares graph data with nodes and edges.
  • Reducer: Computes PageRank scores to determine node importance in the graph.
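
One PageRank iteration can be sketched as follows. The input format (`node<TAB>rank<TAB>comma-separated neighbors`), the damping factor of 0.85, and all function names are assumptions for illustration rather than the repository's actual conventions. Nodes without outgoing links are not given special treatment in this sketch.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor


def pagerank_map(line):
    """Emit the node's link list plus an equal rank share for each neighbor."""
    node, rank, neighbors = line.rstrip("\n").split("\t")
    targets = [n for n in neighbors.split(",") if n]
    yield node, ("links", targets)
    for target in targets:
        yield target, ("share", float(rank) / len(targets))


def pagerank_reduce(node, values, num_nodes):
    """Sum the incoming shares and apply the damping formula."""
    links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            links = payload
        else:
            total += payload
    rank = (1 - DAMPING) / num_nodes + DAMPING * total
    return node, rank, links


def pagerank_iteration(lines):
    """Run one map/group/reduce pass in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in pagerank_map(line):
            grouped[key].append(value)
    return [pagerank_reduce(k, grouped[k], len(grouped)) for k in sorted(grouped)]


if __name__ == "__main__":
    graph = [f"A\t{1/3}\tB,C", f"B\t{1/3}\tC", f"C\t{1/3}\tA"]
    for node, rank, links in pagerank_iteration(graph):
        print(f"{node}\t{rank:.4f}\t{','.join(links)}")
```

In a real job, the reducer output would be fed back in as the next iteration's input until the ranks stop changing noticeably.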

K-Means:

  • Mapper: Assigns data points to clusters based on centroid proximity.
  • Reducer: Updates centroid positions based on cluster assignments.
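
A single K-Means iteration along these lines might look like the sketch below. It assumes each input line is a comma-separated list of coordinates and that the current centroids are available to the mapper (for example from a side file); these details and the function names are illustrative assumptions.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_map(line, centroids):
    """Mapper: parse a point and emit (cluster index, point)."""
    point = tuple(float(x) for x in line.split(","))
    return nearest_centroid(point, centroids), point


def kmeans_reduce(points):
    """Reducer: the new centroid is the coordinate-wise mean of the cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_step(lines, centroids):
    """Run one assignment/update pass in memory."""
    clusters = {}
    for line in lines:
        idx, point = kmeans_map(line, centroids)
        clusters.setdefault(idx, []).append(point)
    # A centroid that attracted no points is left where it was.
    return [
        kmeans_reduce(clusters[i]) if i in clusters else tuple(centroids[i])
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = ["1,1", "1,3", "9,9", "11,9"]
    print(kmeans_step(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Repeating `kmeans_step` with its own output as the next set of centroids corresponds to chaining streaming jobs until the centroids stabilize.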

Weather Data Analysis:

  • Mapper: Extracts relevant weather data from input records.
  • Reducer: Aggregates weather data and computes statistics like average temperature or precipitation.
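
For example, an average-temperature job could be sketched as below, assuming records of the form `station,date,temperature`. The record layout and function names are assumptions for this illustration; the repository's input files may differ. In a real streaming script, `sys.stdin` would be passed in place of the sample list.

```python
from itertools import groupby


def weather_map(lines):
    """Mapper: emit 'station<TAB>temperature', skipping malformed records."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # wrong number of fields
        station, _date, raw_temp = fields
        try:
            temp = float(raw_temp)
        except ValueError:
            continue  # non-numeric reading
        yield f"{station}\t{temp}"


def weather_reduce(sorted_lines):
    """Reducer: average the temperatures for each station (input sorted by key)."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for station, group in groupby(parsed, key=lambda kv: kv[0]):
        temps = [float(t) for _, t in group]
        yield f"{station}\t{sum(temps) / len(temps):.2f}"


if __name__ == "__main__":
    sample = [
        "S1,2024-01-01,10.0",
        "S2,2024-01-01,20.0",
        "S1,2024-01-02,14.0",
        "bad line",
    ]
    # sorted() stands in for Hadoop's shuffle between the two phases.
    for row in weather_reduce(sorted(weather_map(sample))):
        print(row)
```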

Word Count:

  • Mapper: Splits text into words and emits key-value pairs for each word.
  • Reducer: Counts the occurrences of each word.
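
A minimal sketch of this pair is shown below. Lowercasing and whitespace splitting are simplifying assumptions (punctuation is not stripped), and the function names are hypothetical. In a real streaming script, `sys.stdin` would be passed in place of the sample list.

```python
from itertools import groupby


def wordcount_map(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def wordcount_reduce(sorted_lines):
    """Reducer: sum the counts of consecutive identical keys (input sorted by key)."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    sample = ["the cat", "The dog the"]
    # sorted() stands in for Hadoop's shuffle between the two phases.
    for row in wordcount_reduce(sorted(wordcount_map(sample))):
        print(row)
```

The reducer relies on identical keys arriving consecutively, which is why the mapper output must be sorted before it is piped in during local testing.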

Sample Input and Output:

You can find sample input and output files in the repository to test the scripts.