PhraseExtract

The master branch is the prototype. For more details, see the azure_hdinsight branch.
Use the following command to find frequently occurring sentences (generic options such as -files and -D should be placed before the command options):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar \
	-files /home/rav009/PycharmProjects/untitled/PhraseExtract/sentence_mapper.py,/home/rav009/PycharmProjects/untitled/PhraseExtract/sentence_reducer.py \
	-D mapred.map.tasks=7 \
	-D mapred.reduce.tasks=3 \
	-input /input/text.txt \
	-output /sentences/above100/ \
	-mapper "python sentence_mapper.py" \
	-reducer "python sentence_reducer.py -t 100"

python sentence_reducer.py -t 100 outputs every sentence that appears more than 100 times.
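
The thresholding step can be pictured with a minimal sketch. It is not the actual sentence_reducer.py: it assumes the mapper emits one `sentence<TAB>count` line per occurrence and that Hadoop Streaming delivers those lines sorted by key. The names `parse` and `reduce_counts` are hypothetical.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>count' line into (key, int(count))."""
    # Assumed record format; the real mapper output may differ.
    key, _, count = line.rstrip("\n").rpartition("\t")
    return key, int(count)


def reduce_counts(lines, threshold):
    """Sum counts per key (input sorted by key); keep totals above threshold."""
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(count for _, count in group)
        if total > threshold:  # "more than" the threshold, as with -t 100
            yield key, total


# Small in-memory stand-in for the sorted stream a reducer would read.
sample = ["a b c.\t60\n", "a b c.\t50\n", "x y.\t100\n"]
for sentence, total in reduce_counts(sample, 100):
    print(f"{sentence}\t{total}")
```

With this sample only `a b c.` is printed, because its total of 110 exceeds the threshold while `x y.` sits exactly at 100.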

Use the following command to find frequently occurring phrases that contain 2 or 3 words:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar \
	-files /home/rav009/PycharmProjects/untitled/PhraseExtract/phrase_mapper.py,/home/rav009/PycharmProjects/untitled/PhraseExtract/phrase_reducer.py,hdfs://127.0.0.1:9000/sentences/above100/part-00000 \
	-D mapred.map.tasks=4 \
	-D mapred.reduce.tasks=4 \
	-D mapred.text.key.partitioner.options=-k1 \
	-input hdfs://namenode/input.txt \
	-output /phrase/above2000 \
	-mapper "python phrase_mapper.py -l 3" \
	-reducer "python phrase_reducer.py -t 2000 -c" \
	-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

python phrase_mapper.py -l 3 generates phrases that contain at most 3 words.
python phrase_reducer.py -t 2000 -c sets the phrase frequency threshold to 2000 and also outputs the ID number of each passage (assuming the ID and the content are separated by '|').
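
As a rough illustration of the two options, the sketch below collapses the map and reduce steps into one in-memory function. It is not the code in phrase_mapper.py or phrase_reducer.py. It assumes each input line is `ID|content`, that phrases are runs of consecutive whitespace-separated words, that the shortest phrase has 2 words, and that a phrase is kept when its count is at least the threshold (the source does not say whether the comparison is strict). All function names are hypothetical.

```python
from collections import defaultdict


def split_passage(line, sep="|"):
    """Split 'ID|content' into (id, content) at the first separator."""
    passage_id, _, content = line.rstrip("\n").partition(sep)
    return passage_id, content


def phrases(words, max_len, min_len=2):
    """Yield every run of min_len..max_len consecutive words as one phrase."""
    for start in range(len(words)):
        for size in range(min_len, max_len + 1):
            if start + size > len(words):
                break  # longer runs would go past the end of the passage
            yield " ".join(words[start:start + size])


def count_phrases(lines, max_len, threshold, with_ids=False):
    """Count phrases over all passages; keep those with count >= threshold."""
    counts = defaultdict(int)
    ids = defaultdict(set)
    for line in lines:
        passage_id, content = split_passage(line)
        for phrase in phrases(content.split(), max_len):
            counts[phrase] += 1
            ids[phrase].add(passage_id)
    result = []
    for phrase in sorted(counts):
        if counts[phrase] >= threshold:
            if with_ids:  # mirrors the idea behind the -c option
                result.append((phrase, counts[phrase], sorted(ids[phrase])))
            else:
                result.append((phrase, counts[phrase]))
    return result


sample = ["1|the quick fox", "2|the quick dog"]
for row in count_phrases(sample, max_len=3, threshold=2, with_ids=True):
    print(row)
```

On the two sample passages only `the quick` reaches a count of 2, so it is the single row printed, together with the IDs of both passages.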

The zip file (CE_Activities.zip) contains the Kettle ETL project and the SSAS project.
