Yelp insights challenge

An application that finds the businesses reviewed by the most influential users in the Yelp social network.

The Spark job accepts the Yelp dataset in TSV format as input and performs three steps:

runs the PageRank algorithm to find the top 20 influencers
runs a query to find the businesses they reviewed
writes the results to disk
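The steps above can be illustrated with a small plain-Python sketch. It is not the Spark job itself (which uses GraphX/GraphFrames); it only shows the same idea on in-memory data. The function names, the `(user, friend)` edge layout, and the `(user, business)` review layout are assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def businesses_reviewed_by_top_users(friend_edges, reviews, top_n=20):
    """Return (top users by rank, sorted businesses those users reviewed)."""
    rank = pagerank(friend_edges)
    # Sort by descending rank; ties are broken by user id for stable output.
    top_users = sorted(rank, key=lambda user: (-rank[user], user))[:top_n]
    top_set = set(top_users)
    businesses = sorted({biz for user, biz in reviews if user in top_set})
    return top_users, businesses


if __name__ == "__main__":
    # Hypothetical sample data: (user, friend) edges and (user, business) reviews.
    sample_edges = [("u1", "u3"), ("u2", "u3"), ("u4", "u3"), ("u3", "u1")]
    sample_reviews = [("u3", "b1"), ("u1", "b2"), ("u2", "b3"), ("u3", "b2")]
    print(businesses_reviewed_by_top_users(sample_edges, sample_reviews, top_n=2))
```

In the real job the ranking runs distributed on Spark, so this sketch is only useful for reasoning about the expected output on a small sample.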

The original Yelp dataset is in JSON format. To convert it to TSV (tab-delimited), run the script:

$ python scripts/convertJsonToTsv.py yelp_academic_dataset.json # Creates yelp_academic_dataset.tsv
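For readers who want to see what such a conversion involves, here is a minimal sketch using only the standard library. It is not the contents of `scripts/convertJsonToTsv.py`; it assumes the input holds one JSON object per line, and the function name `json_lines_to_tsv` and the sample field names are hypothetical.

```python
import csv
import io
import json


def json_lines_to_tsv(src, dst, columns=None):
    """Convert newline-delimited JSON read from src into TSV written to dst."""
    # All records are held in memory; fine for samples, not for very large files.
    records = [json.loads(line) for line in src if line.strip()]
    if columns is None:
        # Union of keys in first-seen order, so records with missing keys still fit.
        columns = list(dict.fromkeys(key for rec in records for key in rec))
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        row = []
        for col in columns:
            value = rec.get(col, "")
            # Nested lists/objects are kept as compact JSON text in a single cell.
            if isinstance(value, (dict, list)):
                value = json.dumps(value, separators=(",", ":"))
            row.append(value)
        writer.writerow(row)
    return len(records)


if __name__ == "__main__":
    # Hypothetical two-record sample standing in for the real dataset.
    sample = io.StringIO(
        '{"user_id": "u1", "name": "Ann", "friends": ["u2"]}\n'
        '{"user_id": "u2", "name": "Bob"}\n'
    )
    out = io.StringIO()
    json_lines_to_tsv(sample, out)
    print(out.getvalue())
```

Cells that contain tabs, quotes, or newlines are quoted by the `csv` module, so the output should be read back with a TSV-aware parser rather than a plain split on tabs.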

How to build and run the application

How to build the application with Maven

mvn clean verify

How to build a Docker image

docker build -t dgreenshtein/yelp-insights ${PROJECT_HOME}

How to run the application Docker container and start the Spark cluster

docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h insights --name=insights dgreenshtein/yelp-insights /bin/bash

How to run the Spark job

# start Spark Master and Worker
root@insights$ /etc/bootstrap.sh

root@insights$ cd /opt/yelp-insights/

# to run application with test data set
root@insights$ scripts/start-job.sh test-data/business.tsv test-data/reviews.tsv test-data/users.tsv /opt/yelp-insights/results/

Spark Master Web UI: http://localhost:8080

Tools and versions

Spark SQL, GraphX 2.1.1
GraphFrames 0.5.0
Pandas Dataframe
Docker 17.05.0-ce

References

JSON to CSV converter
GraphFrames example
