Mapreduce for inverted index on Wikipedia data

CAREFUL: It's just a homework done in a hurry. Some of the resorts are not elegant

Prerequisites

Download wiki data
Set up hadoop if you wanna run TFIDF.
Add $HADOOP_HOME/bin to PATH in ~/.bash_profile. Or if you're using Windows or Linux, you should export the PATH somehow by modifying the tasks.
Set up elasticsearch if you wanna run that.

Project Structure

I confirm that the arrangement is a mess, but is ok to develop locally. It's a multi-app project but I use a plugin without multi-app support.

|- src/main/com/clqb
 |- app
  |- WordCountAtPage.java
  |- WordFreqAtPage.java
  |- PageCount.java
  |- WordAtPageTFIDF.java
  |- ElasticSearch.java
 |- utils
  |- XmlInputFormat.java
  |- XmlUtils.java
|- build.gradle

To run different applications under app folder, I write different tasks in build.gradle. You may need to change them, e.g. . ~/.bash_profile in tasks. You can resort to any solutions as long as you export the PATH which includes the script hadoop.

Instructions

First make sure your hadoop daemons are running, i.e. you have run

start-hdfs.sh
start-yarn.sh

or something equivalent.

Build index using TFIDF

It takes steps to build the index: WordCountAtPage, WordFreqAtPage, PageCount and WordAtPageTFIDF; They must be run in order. I followed this tutorial. I copied most part, except some names and had some small modifications according to my understanding and needs.

Step 1: Run WordCountAtPage

Uncomment the section in build.gradle:

/* section WordCountAtPage */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordCountAtPage'
ext.hadoopMainClassName = 'com.clqb.app.WordCountAtPage'

Comment all other sections in build.gradle.
Modify the task runTFIDF_Phase1.
- Modify . ~/.bash_profile if needed.
- Modify the parameters passed in hadoop application if needed. Here it's /data/wiki-161m /output-TFIDF-1.
Run the task:

gradle runTFIDF_Phase1

or use GUI in your IDE.

Step 2: Run WordFreqAtPage

Uncomment the section in build.gradle:

/* section WordFreqAtPage */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordFreqAtPage'
ext.hadoopMainClassName = 'com.clqb.app.WordFreqAtPage'

Comment all other sections in build.gradle.
Modify the task runTFIDF_Phase2 accordingly.
Run the task:

gradle runTFIDF_Phase2

or use GUI in your IDE.

Step 2.5: Run PageCountAtPage

Uncomment the section in build.gradle:

/* section PageCount */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'PageCount'
ext.hadoopMainClassName = 'com.clqb.app.PageCount'

Comment all other sections in build.gradle.
Modify the task runTFIDF_Phase2_5 accordingly.
Run the task:

gradle runTFIDF_Phase2

or use GUI in your IDE.

Go to Name node manage and check the result. Take the result in mind, for example 1660576, and this will be used in step 3.

Step 3: Run WordAtPageTFIDF

Replace @link{com.clqb.app.WordAtPageTFIDF#NUM_PAGES} with the value you just obtained.
Uncomment the section in build.gradle:

/* section WordAtPageTFIDF */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordAtPageTFIDF'
ext.hadoopMainClassName = 'com.clqb.app.WordAtPageTFIDF'

Modify the task runTFIDF_Phase_3 accordingly.
Run the task:

gradle runTFIDF_Phase_3

or use GUI in your IDE.

Build index with elasticsearch

First make sure your elasticsearch server(s) are serving. Go to http://your_elasticsearch_host:9200 to see if you get the json response.
Ucomment the section in build.gradle:

/* section ElasticSearch */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'ElasticSearch'
ext.hadoopMainClassName = 'com.clqb.app.ElasticSearch'

Comment all other sections.
Modify the task runElasticSearch accordingly.
Run the task:

gradle runElasticSearch

or use GUI in your IDE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Mapreduce for inverted index on Wikipedia data

Prerequisites

Project Structure

Instructions

Build index using TFIDF

Step 1: Run WordCountAtPage

Step 2: Run WordFreqAtPage

Step 2.5: Run PageCountAtPage

Step 3: Run WordAtPageTFIDF

Build index with elasticsearch

Files

README.md

Latest commit

History

README.md

File metadata and controls

Mapreduce for inverted index on Wikipedia data

Prerequisites

Project Structure

Instructions

Build index using TFIDF

Step 1: Run WordCountAtPage

Step 2: Run WordFreqAtPage

Step 2.5: Run PageCountAtPage

Step 3: Run WordAtPageTFIDF

Build index with elasticsearch