---
title: 'HarvardX Data Science Capstone: MovieLens Project'
author: "Medea"
date: "5/3/2021"
output:
  pdf_document:
    toc: yes
    number_sections: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
In the modern era, business success often depends on whether we can recommend the right product to the end user. A good example of this is streaming services like Netflix: to keep users engaged, we must provide them with relevant content.
In this project we pursued two simple objectives:
1. Analyze which key factors contributed to a high or low rating for a specific user, which is the heart of movie recommendation.
2. Achieve RMSE < 0.86490. (We built our movie rating prediction model by expanding on what we learned about the Netflix challenge in Prof. Rafael A. Irizarry's class. We assume the reader is familiar with terms like RMSE and regularization, so we do not redefine them; we highly recommend his "Introduction to Data Science".)
The data we used in the analysis comes from GroupLens, the 10M version, which keeps our work simpler but is sufficient to draw vital conclusions. We performed the data wrangling, split the data into training/testing/validation sets, compared different biases, and worked toward our goals.
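For reference, the RMSE we report throughout is the usual root mean squared error between observed and predicted ratings; a minimal sketch (equivalent to the `caret::RMSE` function used in the code below) looks like this:
```{r eval=FALSE}
# Root mean squared error between true and predicted ratings
rmse <- function(true_ratings, predicted_ratings) {
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
```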
# Analysis
After downloading the data we reviewed the columns of the movielens dataset and decided to extract the year from the rating timestamp as a separate column, since the year may influence the rating: old movies can sometimes seem naive, and special effects in sci-fi can look dated.
For the analysis, the data was split into edx (90%) and validation (10%) sets: the former for training, the latter for the final validation.
```{r include=FALSE}
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(caret)
library(data.table)

# MovieLens 10M dataset:
# https://grouplens.org/datasets/movielens/10m/
# http://files.grouplens.org/datasets/movielens/ml-10m.zip
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")

# if using R 4.0 or later:
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(movieId),
                                           title = as.character(title),
                                           genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
colnames(movielens)

# Extract the year from the rating timestamp since it may influence the rating
movielens <- movielens %>% mutate(year = format(as.POSIXct(timestamp, origin = "1970-01-01"), "%Y"))
dims <- dim(movielens)
```
```{r include=FALSE}
# Validation set will be 10% of MovieLens data
set.seed(1, sample.kind = "Rounding") # if using R 3.5 or earlier, use `set.seed(1)`
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]

# Make sure userId and movieId in the validation set are also in the edx set
validation <- temp %>%
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")

# Add rows removed from the validation set back into the edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
```
```{r include=FALSE}
# Free memory: remove intermediate objects that are no longer needed
rm(dl, ratings, movies, test_index, temp, movielens, removed)
```
We examined the columns and dimensions, and counted the distinct users, movies, genres, and years in the edx/validation datasets:
```{r, echo=TRUE}
colnames(edx)
dim(edx)
dim(validation)
```
```{r, echo=TRUE}
edx %>%
  summarize(n_users = n_distinct(userId),
            n_movies = n_distinct(movieId),
            n_genres = n_distinct(genres),
            n_year = n_distinct(year))
```
As we can see, not every user rated every movie (a quick sparsity check follows the list below). This may be related to genre preferences, movie quality in the user's eyes, or the date the movie was released. We needed to experiment with the data to identify the different biases, so these are the questions we tried to answer:
1. What RMSE do we get with a naive approach, i.e., if we follow our intuition and predict every rating as the average of all ratings?
2. What if we only consider the movie effect?
3. What if we only consider the user effect?
4. What if we only consider the year effect?
5. What if we only consider the genres effect?
6. What if we use different combinations of those?
7. Should we consider using regularization?
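As a quick illustration of that sparsity, the share of filled cells in the full user-by-movie matrix can be computed directly (a minimal sketch, not evaluated in the report):
```{r eval=FALSE}
# Fraction of the user x movie matrix that actually contains a rating;
# for edx this is on the order of 1% of all possible (user, movie) pairs
edx %>%
  summarize(sparsity = n() / (n_distinct(userId) * n_distinct(movieId)))
```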
To create a reliable model we divided the edx dataset into train (90%) and test (10%) sets, making sure that every userId and movieId in the test set also appears in the train set.
```{r include=FALSE}
test_index_edx <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
train_set <- edx[-test_index_edx,]
tmp_test_set <- edx[test_index_edx,]

# Ensure userId and movieId in the test set are also in the train set
test_set <- tmp_test_set %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")

# Add rows removed from the test set back into the train set
removed_test <- anti_join(tmp_test_set, test_set)
train_set <- rbind(train_set, removed_test)
```
First, let's develop a simple model that predicts the same rating for all movies regardless of user. Here all the differences are explained by random variation, and the model looks like this:
$$Y_{u,i} = \mu + \varepsilon_{u, i}$$
which translates into the following code:
```{r echo=TRUE}
#Naive rmse model
mu <- mean(train_set$rating)
naive_rmse <- RMSE(test_set$rating, mu)
naive_rmse
rmse_results <- tibble(method = "Average", RMSE = naive_rmse)
```
Once we have a baseline model we can start to augment it. For example, let's consider the year effect alone. It can be depicted using the following formula:
$$Y_{u,y} = \mu + b_{y} + \varepsilon_{u, y}$$
where the term $b_{y}$ represents the average rating effect for year $y$, so $\hat{b}_y$ is the mean of $Y_{u,y} - \hat{\mu}$.
```{r echo=TRUE}
year_avgs <- train_set %>%
  group_by(year) %>%
  summarize(b_y = mean(rating - mu))

predicted_ratings <- test_set %>%
  left_join(year_avgs, by = 'year') %>%
  mutate(pred = mu + b_y) %>%
  pull(pred)

y_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "Year", RMSE = y_rmse)
y_rmse
```
The result we get is better than the naive one. We proceeded to check the other possible biases as well, but for simplicity we skip the discussion of most scenarios here and focus on three: the Movie/User, Movie/User/Year/Genres, and regularized Movie/User/Year/Genres combinations.
```{r include=FALSE}
# Predict using only the movie effect
movie_avgs <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  mutate(pred = mu + b_i) %>%
  pull(pred)
i_rmse <- RMSE(predicted_ratings, test_set$rating)
i_rmse
rmse_results <- rmse_results %>% add_row(method = "Movie", RMSE = i_rmse)

# Predict using only the user effect
user_avgs <- train_set %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu))
predicted_ratings <- test_set %>%
  left_join(user_avgs, by = 'userId') %>%
  mutate(pred = mu + b_u) %>%
  pull(pred)
u_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "User", RMSE = u_rmse)

# The year-only effect was already computed and recorded in rmse_results
# above, so we do not add it again here (avoiding a duplicate "Year" row).
```
If we take both the user and movie effects into account, the model can be described by the formula
$$Y_{u,i} = \mu + b_{i} + b_{u} + \varepsilon_{u, i}$$
where the term $b_{u}$ is a user-specific effect, so $\hat{b}_u$ is the mean of $Y_{u,i} - \hat{\mu} - \hat{b}_i$.
We can use the code below to calculate the result:
```{r echo=TRUE}
# Predict using the user & movie effects
user_avgs <- train_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  mutate(pred = mu + b_i + b_u) %>%
  pull(pred)
i_u_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "Movie/User", RMSE = i_u_rmse)
i_u_rmse
```
The prediction improves again, but there is still room for improvement.
```{r echo=FALSE}
# Predict using the user & movie & year effects
year_avgs <- train_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  group_by(year) %>%
  summarize(b_y = mean(rating - mu - b_i - b_u))
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  left_join(year_avgs, by = 'year') %>%
  mutate(pred = mu + b_i + b_u + b_y) %>%
  pull(pred)
i_u_y_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "Movie/User/Year", RMSE = i_u_y_rmse)

# Predict using the user & movie & genres effects
# (the year join is omitted here since b_y is not part of this prediction)
genres_avgs <- train_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u))
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  left_join(genres_avgs, by = 'genres') %>%
  mutate(pred = mu + b_i + b_u + b_g) %>%
  pull(pred)
i_u_g_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "Movie/User/Genres", RMSE = i_u_g_rmse)
```
Following logic similar to the "Movie/User" model, the "Movie/User/Year/Genres" combination can be expressed using the following formula:
$$Y_{u,i} = \mu + b_{i} + b_{u} + b_{y} + b_{g} + \varepsilon_{u, i}$$
```{r echo=TRUE}
# Predict using the user & movie & year & genres effects
genres_avgs <- train_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  left_join(year_avgs, by = 'year') %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u - b_y))
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = 'movieId') %>%
  left_join(user_avgs, by = 'userId') %>%
  left_join(year_avgs, by = 'year') %>%
  left_join(genres_avgs, by = 'genres') %>%
  mutate(pred = mu + b_i + b_u + b_y + b_g) %>%
  pull(pred)
i_u_y_g_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- rmse_results %>% add_row(method = "Movie/User/Year/Genres",
                                         RMSE = i_u_y_g_rmse)
i_u_y_g_rmse
```
The result is the best so far, but we have still not reached our second objective, i.e. RMSE < 0.86490. From the edx dataset it is easy to see that some movies receive only a handful of reviews, and their ratings could distort the recommendations. To mitigate this effect we perform regularization and choose a suitable lambda that minimizes the RMSE.
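To see why the penalized estimate $\sum(Y - \hat{\mu})/(n + \lambda)$ helps, consider a toy example (hypothetical numbers, not from the dataset): for a movie with only two ratings the estimated effect shrinks noticeably toward zero, while an effect supported by thousands of ratings would be barely changed.
```{r eval=FALSE}
# Toy illustration of shrinkage: a movie rated only twice, both times
# 2 stars above the overall mean
residuals_rare <- c(2, 2)
lambda <- 3
mean(residuals_rare)                                      # unregularized: 2.0
sum(residuals_rare) / (length(residuals_rare) + lambda)   # regularized: 0.8
```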
```{r, echo=TRUE}
mu <- mean(train_set$rating)
lambdas <- seq(0, 5, 0.25)
rmses <- sapply(lambdas, function(lambda) {
  # Regularized movie and user effects; fit on train_set (not edx),
  # so that test_set remains unseen while tuning lambda
  b_i <- train_set %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu) / (n() + lambda))
  b_u <- train_set %>%
    left_join(b_i, by = 'movieId') %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - mu - b_i) / (n() + lambda))
  b_y <- train_set %>%
    left_join(b_i, by = 'movieId') %>%
    left_join(b_u, by = 'userId') %>%
    group_by(year) %>%
    summarize(b_y = mean(rating - mu - b_i - b_u))
  b_g <- train_set %>%
    left_join(b_i, by = 'movieId') %>%
    left_join(b_u, by = 'userId') %>%
    left_join(b_y, by = 'year') %>%
    group_by(genres) %>%
    summarize(b_g = mean(rating - mu - b_i - b_u - b_y))
  # Compute the predicted ratings on the test_set
  predicted_ratings <- test_set %>%
    left_join(b_i, by = 'movieId') %>%
    left_join(b_u, by = 'userId') %>%
    left_join(b_y, by = 'year') %>%
    left_join(b_g, by = 'genres') %>%
    mutate(pred = mu + b_i + b_u + b_y + b_g) %>%
    pull(pred)
  # Return the RMSE on the test_set
  return(RMSE(predicted_ratings, test_set$rating))
})
min_lambda <- lambdas[which.min(rmses)]
i_u_y_g_cv_rmse <- min(rmses)
rmse_results <- rmse_results %>% add_row(method = "Movie/User/Year/Genres regularized",
                                         RMSE = i_u_y_g_cv_rmse)
```
Let's plot the RMSE against the lambdas to observe the dynamics:
```{r , echo=FALSE}
qplot(lambdas, rmses)
```
# Results
After observing the results from our training set, we conclude that the most valuable bias combination is user/movie/year/genres together with regularization. The least significant effect is Year, but we still decided to keep it.
The lambda value used:
```{r }
min_lambda <- lambdas[which.min(rmses)]
min_lambda
```
The results on the train set:
```{r rmse_results}
rmse_results %>% knitr::kable()
```
Having tried different combinations of biases and created our final model, we tested it against the validation set, training the final model on the full edx set.
```{r , echo=FALSE}
# Recompute the overall mean on the full edx set,
# since the final model is trained on edx
mu <- mean(edx$rating)
b_i <- edx %>%
  group_by(movieId) %>%
  summarize(b_i = sum(rating - mu) / (n() + min_lambda))
b_u <- edx %>%
  left_join(b_i, by = 'movieId') %>%
  group_by(userId) %>%
  summarize(b_u = sum(rating - mu - b_i) / (n() + min_lambda))
b_y <- edx %>%
  left_join(b_i, by = 'movieId') %>%
  left_join(b_u, by = 'userId') %>%
  group_by(year) %>%
  summarize(b_y = mean(rating - mu - b_i - b_u))
b_g <- edx %>%
  left_join(b_i, by = 'movieId') %>%
  left_join(b_u, by = 'userId') %>%
  left_join(b_y, by = 'year') %>%
  group_by(genres) %>%
  summarize(b_g = mean(rating - mu - b_i - b_u - b_y))

# Compute the predicted ratings on the validation set
predicted_ratings <- validation %>%
  left_join(b_i, by = 'movieId') %>%
  left_join(b_u, by = 'userId') %>%
  left_join(b_y, by = 'year') %>%
  left_join(b_g, by = 'genres') %>%
  mutate(pred = mu + b_i + b_u + b_y + b_g) %>%
  pull(pred)

# Report the RMSE on the validation set
finalRMSE <- RMSE(predicted_ratings, validation$rating)
finalRMSE
```
So we achieved our second objective; although the result is slightly worse than what we obtained on the training data, it is still within the range we need.
# Conclusion
No single bias alone would have been sufficient to fulfill our requirements. The Movie, User, and Genres effects play important roles, the year bias is less important, and regularization helped us achieve our ultimate RMSE goal.
The next step in the analysis is to split the "genres" column into separate columns and see whether some genre combinations contribute more to the rating than others, or whether it is more beneficial to treat them as a whole; time effects are also worth a closer look. Also, given the current global pandemic, it would be interesting to see whether the pandemic year affected movie ratings more.
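As a sketch of that next step (assuming the same edx object as above, and `tidyr`, which is already loaded with the tidyverse), `separate_rows()` could expand the pipe-delimited genres column into one row per individual genre:
```{r eval=FALSE}
# Expand "Comedy|Romance"-style strings into one row per genre,
# then look at the average rating per individual genre
edx %>%
  separate_rows(genres, sep = "\\|") %>%
  group_by(genres) %>%
  summarize(n = n(), avg_rating = mean(rating)) %>%
  arrange(desc(avg_rating))
```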