Here is my take on the dataset constructed by Netflix for the Netflix Prize, which began in October 2006.
Objective: RMSE (Root Mean Square Error) score below 0.9525
Python Version: 3.8
Packages: numpy, pandas, matplotlib, wordcloud, sklearn, surprise, tensorflow
Dataset from Kaggle: https://www.kaggle.com/netflix-inc/netflix-prize-data
Prize Details: http://www.netflixprize.com
The dataset was constructed to support participants in the Netflix Prize; a loading sketch follows the list below.
- Netflix Customers: 480,000 (randomly chosen and anonymous)
- Movie Titles: 17,000
- Data collected from: Oct 1998 to Dec 2005
- Ratings: 1 to 5 stars
- Date of Rating
- Movie ID
- Movie's Year of Release
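On Kaggle the ratings ship as combined_data_1.txt through combined_data_4.txt, where each movie's ID sits on its own line (e.g. `1:`) followed by `customer_id,rating,date` rows. A minimal pandas loading sketch (the path and column names are my own choices):

```python
import pandas as pd

def load_ratings(path="combined_data_1.txt"):
    """Parse one Netflix Prize ratings file into a tidy DataFrame."""
    rows, movie_id = [], None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):        # a line like "1:" starts a new movie block
                movie_id = int(line[:-1])
            elif line:                    # "customer_id,rating,date"
                cust_id, rating, date = line.split(",")
                rows.append((movie_id, int(cust_id), int(rating), date))
    df = pd.DataFrame(rows, columns=["movie_id", "cust_id", "rating", "date"])
    df["date"] = pd.to_datetime(df["date"])
    return df

ratings_df = load_ratings()
```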
Given the nature of the data provided, collaborative filtering was used. The top three models were then used to recommend the target user the next 10 movies to watch.
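The top three models are not named here, so as an illustration here is one common collaborative-filtering baseline from the surprise package, SVD, used to rank a target user's unseen movies and keep the best 10 (the customer ID and the `ratings_df` variable are assumptions carried over from the loading sketch above):

```python
from surprise import Dataset, Reader, SVD

# Build a surprise dataset from the DataFrame loaded earlier
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[["cust_id", "movie_id", "rating"]], reader)
model = SVD()
model.fit(data.build_full_trainset())

def top_10_unseen(model, user_id, ratings_df):
    """Predict ratings for movies the user has not rated and return the 10 best."""
    seen = set(ratings_df.loc[ratings_df["cust_id"] == user_id, "movie_id"])
    candidates = [m for m in ratings_df["movie_id"].unique() if m not in seen]
    scored = [(m, model.predict(user_id, m).est) for m in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:10]

print(top_10_unseen(model, user_id=1488844, ratings_df=ratings_df))  # example customer ID
```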
Due to some constraints, I reduced the data size from 100 million ratings to 5 million. These 5 million ratings come from June to December 2005, since about 50% of the ratings were made in 2005.
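A sketch of that reduction, assuming the `ratings_df` produced by the loading sketch above:

```python
# Keep only ratings made between June and December 2005 (~5 million rows)
mask = ratings_df["date"].between("2005-06-01", "2005-12-31")
ratings_df = ratings_df.loc[mask]
```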
Root Mean Square Error (RMSE) was one of the metrics used to evaluate the project.
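RMSE is sqrt(mean((predicted - actual)^2)) over held-out ratings. A sketch of how it can be computed with surprise's built-ins, reusing the `data` object from the sketch above:

```python
from surprise import SVD, accuracy
from surprise.model_selection import train_test_split

# Hold out 20% of ratings, fit on the rest, and score the predictions
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
model = SVD()
model.fit(trainset)
predictions = model.test(testset)
accuracy.rmse(predictions)  # prints and returns the RMSE
```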
Over a span of 10 days, while working full-time, I was glad to attain good RMSE scores. I believe that with a better machine and more time before the presentation, I could have achieved better results, as most of my time was spent running the different models with different data sizes while trying to figure out how to avoid losing accuracy with less data than the original dataset provided.