CAREFUL: It's just a homework done in a hurry. Some of the resorts are not elegant
- Download wiki data
- Set up hadoop if you wanna run TFIDF.
- Add
$HADOOP_HOME/bin
toPATH
in~/.bash_profile
. Or if you're using Windows or Linux, you should export the PATH somehow by modifying the tasks. - Set up elasticsearch if you wanna run that.
I confirm that the arrangement is a mess, but is ok to develop locally. It's a multi-app project but I use a plugin without multi-app support.
|- src/main/com/clqb
|- app
|- WordCountAtPage.java
|- WordFreqAtPage.java
|- PageCount.java
|- WordAtPageTFIDF.java
|- ElasticSearch.java
|- utils
|- XmlInputFormat.java
|- XmlUtils.java
|- build.gradle
To run different applications under app
folder, I write different tasks in build.gradle
. You may need to change them, e.g. . ~/.bash_profile
in tasks. You can resort to any solutions as long as you export the PATH which includes the script hadoop
.
First make sure your hadoop daemons are running, i.e. you have run
start-hdfs.sh
start-yarn.sh
or something equivalent.
It takes steps to build the index: WordCountAtPage
, WordFreqAtPage
, PageCount
and WordAtPageTFIDF
; They must be run in order. I followed this tutorial. I copied most part, except some names and had some small modifications according to my understanding and needs.
- Uncomment the section in
build.gradle
:
/* section WordCountAtPage */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordCountAtPage'
ext.hadoopMainClassName = 'com.clqb.app.WordCountAtPage'
- Comment all other sections in
build.gradle
. - Modify the task
runTFIDF_Phase1
.- Modify
. ~/.bash_profile
if needed. - Modify the parameters passed in hadoop application if needed. Here it's
/data/wiki-161m /output-TFIDF-1
.
- Modify
- Run the task:
gradle runTFIDF_Phase1
or use GUI in your IDE.
- Uncomment the section in
build.gradle
:
/* section WordFreqAtPage */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordFreqAtPage'
ext.hadoopMainClassName = 'com.clqb.app.WordFreqAtPage'
- Comment all other sections in
build.gradle
. - Modify the task
runTFIDF_Phase2
accordingly. - Run the task:
gradle runTFIDF_Phase2
or use GUI in your IDE.
- Uncomment the section in
build.gradle
:
/* section PageCount */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'PageCount'
ext.hadoopMainClassName = 'com.clqb.app.PageCount'
- Comment all other sections in
build.gradle
. - Modify the task
runTFIDF_Phase2_5
accordingly. - Run the task:
gradle runTFIDF_Phase2
or use GUI in your IDE.
- Go to
Name node manage
and check the result. Take the result in mind, for example1660576
, and this will be used in step 3.
- Replace @link{com.clqb.app.WordAtPageTFIDF#NUM_PAGES} with the value you just obtained.
- Uncomment the section in
build.gradle
:
/* section WordAtPageTFIDF */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'WordAtPageTFIDF'
ext.hadoopMainClassName = 'com.clqb.app.WordAtPageTFIDF'
- Modify the task
runTFIDF_Phase_3
accordingly. - Run the task:
gradle runTFIDF_Phase_3
or use GUI in your IDE.
- First make sure your elasticsearch server(s) are serving. Go to
http://your_elasticsearch_host:9200
to see if you get the json response. - Ucomment the section in
build.gradle
:
/* section ElasticSearch */
ext.hadoopSubDir = 'hadoop'
ext.hadoopApplicationName = 'ElasticSearch'
ext.hadoopMainClassName = 'com.clqb.app.ElasticSearch'
- Comment all other sections.
- Modify the task
runElasticSearch
accordingly. - Run the task:
gradle runElasticSearch
or use GUI in your IDE.