Create an efficient implementation that builds an inverted index over a large collection of documents. The detailed problem statement can be found here.
The output of the application can be found in the notebook de-challenge.ipynb.
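For orientation, the sketch below shows the core idea of an inverted index build in PySpark, mapping each term to the documents that contain it. The dataset path, the whitespace tokenizer, and the output location are assumptions for illustration and do not mirror the actual logic in app.py.

```python
# Minimal sketch of an inverted index build with the RDD API.
# Paths and the whitespace tokenizer are assumptions, not the app.py logic.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inverted-index-sketch").getOrCreate()
sc = spark.sparkContext

# (file_path, file_contents) pairs, one per document in the collection.
docs = sc.wholeTextFiles("/path/to/dataset")

inverted_index = (
    docs.flatMap(lambda kv: [(term.lower(), kv[0]) for term in kv[1].split()])
        .distinct()      # count each document at most once per term
        .groupByKey()    # term -> iterable of document ids
        .mapValues(sorted)  # deterministic, sorted posting lists
)

inverted_index.saveAsTextFile("/path/to/output/inverted-index")
```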
Requirements:
- Python
- Spark
- Configure runtime settings
  - Modify $PROJECT_ROOT/settings.py and fill in the correct dataset location
  - Modify SparkConf within the same file
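The exact contents of settings.py are not shown here; as a rough, hypothetical sketch (the names DATASET_LOCATION and SPARK_CONF are assumptions, not the real variable names), it could look like:

```python
# Hypothetical sketch of settings.py; the real variable names may differ.
import os

from pyspark import SparkConf

# Location of the document collection (local path or HDFS URI).
DATASET_LOCATION = "/data/documents"

# Spark configuration for the job; falls back to local mode when no
# SPARK_MASTER environment variable is set.
SPARK_CONF = (
    SparkConf()
    .setAppName("inverted-index")
    .setMaster(os.environ.get("SPARK_MASTER", "local[*]"))
    .set("spark.executor.memory", "4g")
)
```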
- Create environment variables for execution
export JAVA_HOME=<JAVA_HOME>
export SPARK_HOME=<SPARK_HOME>
export HADOOP_HOME=<HADOOP_HOME>
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
export SPARK_MASTER=<spark_master> # Only if a cluster master is configured; otherwise the default is local
- Submit the job
spark-submit app.py
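If needed, standard spark-submit flags can also set the master and resources from the command line instead of relying solely on SparkConf; the memory values below are placeholders to adjust for your environment:

```bash
spark-submit \
  --master "${SPARK_MASTER:-local[*]}" \
  --driver-memory 2g \
  --executor-memory 4g \
  app.py
```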
- The result file for the dictionary will be <dataset_location>/id_files/dictionary
- The result file for the inverted index will be <dataset_location>/id_files/inverted-index
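To sanity-check the results, and assuming they are stored as text files (an assumption; check app.py for the actual output format), a quick look with PySpark could be:

```python
# Minimal sketch for peeking at the output; replace <dataset_location> with
# the value configured in settings.py. Text output format is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-results").getOrCreate()
sc = spark.sparkContext

dictionary = sc.textFile("<dataset_location>/id_files/dictionary")
inverted_index = sc.textFile("<dataset_location>/id_files/inverted-index")

print(dictionary.take(5))       # first few dictionary entries
print(inverted_index.take(5))   # first few posting lists
```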