A problem statement for building an inverted index solution in a distributed environment

snemmani/de-challenge

Inverted index creation on a large collection of documents

Problem Statement

Create an efficient implementation that builds an inverted index over a large collection of documents. The detailed problem statement can be found here.
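To make the goal concrete, here is a minimal single-machine sketch of an inverted index in plain Python. It is not the repository's Spark implementation. The names tokenize and build_index, the tokenization rule, and the layout (a word-to-id dictionary plus an index from word id to sorted document ids) are assumptions for illustration, loosely based on the two result files this README mentions.

```python
import re
from collections import defaultdict

# Assumed tokenization rule: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(documents):
    """Build (dictionary, inverted_index) from a {doc_id: text} mapping.

    dictionary maps each word to an integer word id.
    inverted_index maps each word id to a sorted list of doc ids.
    """
    # "Map" step: emit each distinct (word, doc_id) pair once.
    pairs = {
        (word, doc_id)
        for doc_id, text in documents.items()
        for word in tokenize(text)
    }

    # Assign word ids in alphabetical order so the result is deterministic.
    words = sorted({word for word, _ in pairs})
    dictionary = {word: word_id for word_id, word in enumerate(words)}

    # "Reduce" step: group doc ids by word id. Iterating over the sorted
    # pairs keeps each posting list in ascending doc id order.
    postings = defaultdict(list)
    for word, doc_id in sorted(pairs):
        postings[dictionary[word]].append(doc_id)

    return dictionary, dict(postings)


docs = {
    1: "Spark builds the index",
    2: "The index maps words to documents",
}
dictionary, inverted_index = build_index(docs)
print(inverted_index[dictionary["index"]])
```

In a distributed setting the same two steps (emit pairs, then group by word) would typically be spread across workers, which is what makes the problem a natural fit for Spark.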

Execution and Output

The output of the application can be found in the notebook de-challenge.ipynb.

Tech Stack used

Python
Spark

Execution

  1. Configure runtime settings

    1. Modify $PROJECT_ROOT/settings.py and fill in the correct dataset location.
    2. Modify SparkConf within the same file.
  2. Create environment variables for execution

    • export JAVA_HOME=<JAVA_HOME>
    • export SPARK_HOME=<SPARK_HOME>
    • export HADOOP_HOME=<HADOOP_HOME>
    • export PATH=$SPARK_HOME/bin:$PATH
    • export PYSPARK_PYTHON=python3
    • export SPARK_MASTER=<spark_master> # Only needed if a master node is configured; otherwise the default is local
  3. Submit the job: spark-submit app.py

  4. The result file for the dictionary will be <dataset_location>/id_files/dictionary

  5. The result file for the inverted index will be <dataset_location>/id_files/inverted-index
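The settings described above could be resolved along the following lines. This is a hedged sketch, not the contents of the repository's settings.py: the helper names resolve_master and output_paths and the example dataset path are hypothetical. Only the SPARK_MASTER variable, the local default, and the id_files/dictionary and id_files/inverted-index locations come from the steps above.

```python
import os
import posixpath


def resolve_master(env=None):
    """Return the Spark master URL, falling back to local mode.

    Reads SPARK_MASTER from the given mapping (os.environ by default).
    "local" runs Spark with a single worker thread; "local[*]" would use
    all available cores instead.
    """
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER") or "local"


def output_paths(dataset_location):
    """Return the expected result locations under the dataset directory."""
    base = posixpath.join(dataset_location, "id_files")
    return {
        "dictionary": posixpath.join(base, "dictionary"),
        "inverted_index": posixpath.join(base, "inverted-index"),
    }


# Hypothetical dataset location, for illustration only.
paths = output_paths("/data/docs")
print(resolve_master({}))
print(paths["dictionary"])
print(paths["inverted_index"])
```

The resolved master string is the kind of value that would be passed to SparkConf().setMaster(...) when configuring the job.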
