Create an efficient implementation that builds an inverted index over a large collection of documents. The detailed problem statement can be found here.
The output of the application can be found in the notebook de-challenge.ipynb.
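For orientation, the sketch below shows the core idea of an inverted index build in PySpark, mapping each term to the documents that contain it. The dataset path, the whitespace tokenizer, and the output location are assumptions for illustration and do not mirror the actual logic in app.py.

```python
# Minimal sketch of an inverted index build with the RDD API.
# Paths and the whitespace tokenizer are assumptions, not the app.py logic.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inverted-index-sketch").getOrCreate()
sc = spark.sparkContext

# (file_path, file_contents) pairs, one per document in the collection.
docs = sc.wholeTextFiles("/path/to/dataset")

inverted_index = (
    docs.flatMap(lambda kv: [(term.lower(), kv[0]) for term in kv[1].split()])
        .distinct()      # count each document at most once per term
        .groupByKey()    # term -> iterable of document ids
        .mapValues(sorted)  # deterministic, sorted posting lists
)

inverted_index.saveAsTextFile("/path/to/output/inverted-index")
```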
Requirements:
- Python
- Spark
- Configure runtime settings
  - Modify $PROJECT_ROOT/settings.py and fill in the correct dataset location
  - Modify SparkConf within the same file
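The exact contents of settings.py are not shown here; as a rough, hypothetical sketch (the names DATASET_LOCATION and SPARK_CONF are assumptions, not the real variable names), it could look like:

```python
# Hypothetical sketch of settings.py; the real variable names may differ.
import os

from pyspark import SparkConf

# Location of the document collection (local path or HDFS URI).
DATASET_LOCATION = "/data/documents"

# Spark configuration for the job; falls back to local mode when no
# SPARK_MASTER environment variable is set.
SPARK_CONF = (
    SparkConf()
    .setAppName("inverted-index")
    .setMaster(os.environ.get("SPARK_MASTER", "local[*]"))
    .set("spark.executor.memory", "4g")
)
```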
- Create environment variables for execution
export JAVA_HOME=<JAVA_HOME>
export SPARK_HOME=<SPARK_HOME>
export HADOOP_HOME=<HADOOP_HOME>
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export SPARK_MASTER=<spark_master> # Only if a cluster master is configured; otherwise the default is local
- Submit the job
spark-submit app.py
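If needed, standard spark-submit flags can also set the master and resources from the command line instead of relying solely on SparkConf; the memory values below are placeholders to adjust for your environment:

```bash
spark-submit \
  --master "${SPARK_MASTER:-local[*]}" \
  --driver-memory 2g \
  --executor-memory 4g \
  app.py
```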
- The result file for the dictionary will be <dataset_location>/id_files/dictionary
- The result file for the inverted index will be <dataset_location>/id_files/inverted-index
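To sanity-check the results, and assuming they are stored as text files (an assumption; check app.py for the actual output format), a quick look with PySpark could be:

```python
# Minimal sketch for peeking at the output; replace <dataset_location> with
# the value configured in settings.py. Text output format is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-results").getOrCreate()
sc = spark.sparkContext

dictionary = sc.textFile("<dataset_location>/id_files/dictionary")
inverted_index = sc.textFile("<dataset_location>/id_files/inverted-index")

print(dictionary.take(5))       # first few dictionary entries
print(inverted_index.take(5))   # first few posting lists
```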