Parallel-text-processing-using-MapReduce-MR-and-Hadoop-Distributed-File-System-HDFS

This problem was provided by researchers in the Classics department at UB. They supplied two classical texts and a lemmatization file that maps each word form to its standard (normal) form. The texts are processed in several passes.

Pass 1: Lemmatization using the lemmas.csv file.

Pass 2: Identify each word in the texts by its location, in the form <word, <docid, [chapter#, line#]>>, for two documents.

Pass 3: Repeat this for multiple documents.
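
The first two passes can be sketched as a single streaming mapper. This is a minimal sketch under stated assumptions, not the repository's mapper3.py: it assumes each input line has the form `<docid chapter.line>` followed by the text, and that each row of the lemma file is `form,lemma1,lemma2,...`. The function names and the way the lemma file path is passed in are hypothetical.

```python
import csv
import re
import sys

# Assumed (hypothetical) line layout: "<docid chapter.line>\ttext of the line"
LINE_RE = re.compile(
    r"^<(?P<doc>[^>]*?)\s+(?P<chap>\d+)\.(?P<line>\d+)>\s*(?P<text>.*)$"
)


def load_lemmas(rows):
    """Build {form: [lemma, ...]} from rows shaped like [form, lemma1, ...]."""
    lemmas = {}
    for row in rows:
        cells = [cell.strip().lower() for cell in row if cell.strip()]
        if len(cells) >= 2:  # rows without any lemma are ignored
            lemmas[cells[0]] = cells[1:]
    return lemmas


def map_line(line, lemmas):
    """Yield (lemma, location) pairs for one input line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return  # skip lines that carry no location tag
    location = "<{0}, [{1}, {2}]>".format(
        match.group("doc"), match.group("chap"), match.group("line")
    )
    # Assumption: words are runs of ASCII letters after lowercasing.
    for word in re.findall(r"[a-z]+", match.group("text").lower()):
        # Pass 1: normalize the word; keep it unchanged if no lemma is known.
        for lemma in lemmas.get(word, [word]):
            # Pass 2: emit the normalized word with its position.
            yield lemma, location


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: the lemma file path is the first argument.
    with open(sys.argv[1], newline="") as handle:
        table = load_lemmas(csv.reader(handle))
    for raw in sys.stdin:
        for key, value in map_line(raw, table):
            print("{0}\t{1}".format(key, value))
```

Emitting one tab-separated `word<TAB>location` line per occurrence leaves the grouping to the framework, so the same mapper covers Pass 3 unchanged when more documents are added to the input directory.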

Start Hadoop:

start-hadoop.sh

Make the mapper and reducer scripts executable:

chmod +x ./mapper3.py
chmod +x ./reducer3.py

Copy the input files into HDFS:

hdfs dfs -put $HOME/Activity3/act3input act3input

Run the Hadoop Streaming MapReduce job:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar -mapper $HOME/Activity3/mapper3.py -reducer $HOME/Activity3/reducer3.py -input act3input -output act3out
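
By default, Hadoop Streaming passes the mapper output to the reducer on standard input as tab-separated key/value lines sorted by key, so all locations for one word arrive consecutively. A minimal reducer sketch under that contract follows. It is not the repository's reducer3.py; the function names and the output layout (word, tab, comma-joined locations) are assumptions.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line, skipping lines without a tab."""
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if sep:
            yield key, value


def reduce_stream(lines):
    """Yield (word, [locations]) for each run of identical keys.

    Relies on the input being sorted by key, as the shuffle phase provides.
    """
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, [value for _, value in group]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Write one line per word: the word, a tab, then all of its locations."""
    for word, locations in reduce_stream(stdin):
        stdout.write("{0}\t{1}\n".format(word, ", ".join(locations)))

# Call main() at the end of the file when using this as the -reducer script.
```

Because `groupby` only merges adjacent keys, this holds one word's locations in memory at a time rather than the whole index.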

Read output:

hdfs dfs -cat act3out/*

Repository files:

act3input
act3output
README.md
README.txt
mapper3.py
mapper_cooccurence_2.py
mapper_cooccurence_3.py
new_lemmatizer.csv
reducer3.py
reducer_cooccurence_2.py
reducer_cooccurence_3.py


shaduk/Parallel-text-processing-using-Hadoop-MapReduce
