small_spark_project

This is a small Spark project where the original .py file was gotten from https://github.com/PacktPublishing/Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python. The original script cleans and transforms a dataset containing 100000 movie titles, their respective ratings, genres and more, and recommends 10 other movies you should watch, if you entered/liked a particular movie. I've went on to make some small minor improvements to the script, adding a new dataframe that contains each movie ID and its respective average rating, and using this dataframe to filter away movies in our final selection with an average rating below 3, because we don't want to be recommending crappy movies do we?

Instructions to use the script

After cloning down the code, look for the movie ID that you would like to receive recommendations for in the u.item file found in the ml-100k folder.
Go to the terminal and enter the following command: spark-submit movie-similarities-dataframe.py
And the output you get will be similar to what is in the capture.png file

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
ml-100k		ml-100k
Capture.PNG		Capture.PNG
README.md		README.md
movie-similarities-dataframe.py		movie-similarities-dataframe.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

small_spark_project

About

Releases

Packages

Languages

BrentTanKian/small_spark_project

Folders and files

Latest commit

History

Repository files navigation

small_spark_project

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages