Inverted index creation on a large collection of documents

Problem Statement

Build an efficient implementation that creates an inverted index over a large collection of documents. The detailed problem statement can be found here.
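
For readers unfamiliar with the term, the following is a minimal single-machine sketch of what an inverted index is. It is not the Spark implementation in this repository. The function name, the tokenization rule (lowercase alphanumeric runs), and the choice to assign word ids in alphabetical order are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Build a word dictionary and an inverted index from {doc_id: text}.

    Hypothetical helper for illustration only; the real job runs on Spark.
    Returns (dictionary, index) where dictionary maps word -> word id and
    index maps word id -> sorted list of ids of documents containing the word.
    """
    # Assumed tokenization: lowercase the text and keep alphanumeric runs.
    tokens_by_doc = {
        doc_id: set(re.findall(r"[a-z0-9]+", text.lower()))
        for doc_id, text in docs.items()
    }

    # Assign word ids in alphabetical order so the result is deterministic.
    vocabulary = sorted(set().union(*tokens_by_doc.values()))
    dictionary = {word: word_id for word_id, word in enumerate(vocabulary)}

    # Visit documents in id order so every posting list comes out sorted.
    index = defaultdict(list)
    for doc_id in sorted(tokens_by_doc):
        for word in tokens_by_doc[doc_id]:
            index[dictionary[word]].append(doc_id)
    return dictionary, dict(index)


if __name__ == "__main__":
    sample = {0: "the quick brown fox", 1: "the lazy dog", 2: "quick quick dog"}
    dictionary, index = build_inverted_index(sample)
    print(dictionary)
    print(index)
```

In a distributed setting the same two outputs (a dictionary and the posting lists) would typically be produced by grouping (word, document id) pairs by word rather than by looping on one machine.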

Execution and Output

The output of the application is recorded in the notebook de-challenge.ipynb.

Tech Stack used

Python
Spark

Execution

Configure runtime settings
1. Modify $PROJECT_ROOT/settings.py and fill in the correct dataset location.
2. Modify the SparkConf in the same file.
Create environment variables for execution
- export JAVA_HOME=<JAVA_HOME>
- export SPARK_HOME=<SPARK_HOME>
- export HADOOP_HOME=<HADOOP_HOME>
- export PATH=$SPARK_HOME/bin:$PATH
- export PYSPARK_PYTHON=python3
- export SPARK_MASTER=<spark_master> # Only if a master node is configured; otherwise the default is local
Submit the job: spark-submit app.py
Result file for the dictionary will be <dataset_location>/id_files/dictionary
Result file for the inverted index will be <dataset_location>/id_files/inverted-index
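
The conventions described above (a local master unless SPARK_MASTER is set, and result files under id_files in the dataset location) can be sketched as follows. This is only an illustration under stated assumptions: the helper names resolve_master and result_paths are hypothetical and are not taken from settings.py or app.py.

```python
import os


def resolve_master(env=None):
    """Return the Spark master to use (hypothetical helper).

    Falls back to "local" when SPARK_MASTER is unset or empty, mirroring
    the default described in the execution steps.
    """
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER") or "local"


def result_paths(dataset_location):
    """Return the expected result locations for a dataset (hypothetical helper).

    Plain string joining is used so that URI-style locations are not altered.
    """
    base = dataset_location.rstrip("/") + "/id_files"
    return {
        "dictionary": base + "/dictionary",
        "inverted_index": base + "/inverted-index",
    }


if __name__ == "__main__":
    print(resolve_master())
    print(result_paths("/data/docs"))
```

A helper like this can be handy for checking where to look for results after spark-submit finishes.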