
Inverted index creation on a large collection of documents

Problem Statement

Create an efficient implementation that builds an inverted index over a large collection of documents. The detailed problem statement can be found here.
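
For readers unfamiliar with the term, an inverted index maps each word to the documents that contain it. The snippet below is a minimal pure-Python sketch of that idea, not the project's Spark implementation. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of document ids containing it.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain whitespace
    split, which is an assumption; the real project may tokenize differently.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Sort words and document ids so the output is deterministic.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


# Hypothetical sample collection, for illustration only.
sample_docs = {
    0: "spark builds the index",
    1: "the index maps words to documents",
}
index = build_inverted_index(sample_docs)
```

Here `index["index"]` lists both documents, while `index["spark"]` lists only document 0. On a large collection the same grouping step would typically be distributed, which is where Spark comes in.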

Execution and Output

The output of the application can be found in the notebook de-challenge.ipynb.

Tech Stack used

Python
Spark

Execution

  1. Configure runtime settings

    1. Edit $PROJECT_ROOT/settings.py and set the correct dataset location.
    2. Adjust the SparkConf in the same file as needed.
  2. Create environment variables for execution

    • export JAVA_HOME=<JAVA_HOME>
    • export SPARK_HOME=<SPARK_HOME>
    • export HADOOP_HOME=<HADOOP_HOME>
    • export PATH=$SPARK_HOME/bin:$PATH
    • export PYSPARK_PYTHON=python3
    • export SPARK_MASTER=<spark_master> # Only needed if a master node is configured; otherwise the default is local
  3. Submit the job: spark-submit app.py

  4. The result file for the dictionary will be <dataset_location>/id_files/dictionary

  5. The result file for the inverted index will be <dataset_location>/id_files/inverted-index
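
Step 2 notes that SPARK_MASTER is optional and that the job falls back to local mode when it is unset. One way such a fallback could be written is sketched below. The helper name `resolve_spark_master` and the fallback value `"local[*]"` are assumptions; the project's settings.py may handle this differently.

```python
import os


def resolve_spark_master(env=None):
    """Return the Spark master URL, falling back to local mode.

    Reads SPARK_MASTER from `env` (defaults to os.environ). The fallback
    "local[*]" is an assumption; the project may use another local master.
    """
    env = os.environ if env is None else env
    # Treat an empty string the same as an unset variable.
    return env.get("SPARK_MASTER") or "local[*]"


master = resolve_spark_master()
# With PySpark installed, this value could then be applied via
# SparkConf().setMaster(master).
```

Passing the environment in as a parameter keeps the helper easy to check without changing real environment variables.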