Application that finds the businesses reviewed by the most influential users in the Yelp social network
A Spark job that accepts the Yelp dataset in TSV format as input and:
- runs the PageRank algorithm to find the top 20 influencers
- runs a query to find the businesses they reviewed
- writes the results to disk
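The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the application's code, which runs on Spark. The edge direction (an edge points at the user who gains influence), the sample data, and the function names `pagerank`, `top_influencers` and `businesses_reviewed` are all assumptions made for this example.

```python
# Illustrative sketch only: a plain-Python PageRank, not the Spark implementation.

def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: rank} for a directed edge list [(src, dst), ...]."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def top_influencers(edges, n=20):
    """Return the n highest-ranked users, ties broken by user id."""
    ranks = pagerank(edges)
    return sorted(ranks, key=lambda user: (-ranks[user], user))[:n]


def businesses_reviewed(reviews, users):
    """Return the sorted, de-duplicated businesses reviewed by the given users."""
    wanted = set(users)
    return sorted({business for user, business in reviews if user in wanted})


# Hypothetical sample data: (follower, followed) pairs and (user, business) reviews.
friend_edges = [("u2", "u1"), ("u3", "u1"), ("u4", "u1"), ("u1", "u2")]
reviews = [("u1", "b1"), ("u2", "b2"), ("u1", "b3"), ("u3", "b4")]

influencers = top_influencers(friend_edges, n=2)      # ['u1', 'u2']
print(businesses_reviewed(reviews, influencers))      # ['b1', 'b2', 'b3']
```

In the real job the same pipeline would be expressed over Spark DataFrames so that it scales beyond a single machine; the sketch only shows how the ranking and the follow-up query fit together.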
The original Yelp dataset is in JSON format. To convert it to TSV (tab-delimited), run the script:
$ python scripts/convertJsonToTsv.py yelp_academic_dataset.json # Creates yelp_academic_dataset.tsv
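For orientation, a conversion of this kind could look roughly like the sketch below. It is not the contents of scripts/convertJsonToTsv.py. It assumes the input holds one JSON object per line, and the field names in the example are hypothetical.

```python
import json


def clean(value):
    """Render a value as one TSV cell: None becomes empty, whitespace runs become one space."""
    if value is None:
        return ""
    # Tabs and newlines inside a value would break the row layout, so collapse them.
    return " ".join(str(value).split())


def json_lines_to_tsv(json_lines, fields):
    """Yield a header row, then one tab-delimited row per non-empty JSON line."""
    yield "\t".join(fields)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield "\t".join(clean(record.get(field)) for field in fields)


# Hypothetical input: field names chosen for illustration only.
sample = [
    '{"user_id": "u1", "name": "Ann\\tLee", "review_count": 3}',
    "",
    '{"user_id": "u2", "name": null}',
]
rows = list(json_lines_to_tsv(sample, ["user_id", "name", "review_count"]))
for row in rows:
    print(row)
```

Missing or null fields become empty cells so that every row keeps the same number of columns, which makes the file easier to load with a fixed schema.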
How to build the application with Maven
mvn clean verify
How to build a Docker image
docker build -t dgreenshtein/yelp-insights ${PROJECT_HOME}
How to run the application Docker container and start the Spark cluster
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h insights --name=insights dgreenshtein/yelp-insights /bin/bash
How to run the Spark job
# start Spark Master and Worker
root@insights$ /etc/bootstrap.sh
root@insights$ cd /opt/yelp-insights/
# run the application with the test data set
root@insights$ scripts/start-job.sh test-data/business.tsv test-data/reviews.tsv test-data/users.tsv /opt/yelp-insights/results/
Spark Master Web UI: http://localhost:8080
- Spark SQL, GraphX 2.1.1
- GraphFrames 0.5.0
- pandas DataFrame
- Docker 17.05.0-ce