% This version of CVPR template is provided by Ming-Ming Cheng.
% Please leave an issue if you found a bug:
% https://github.com/MCG-NKU/CVPR_Template.
\documentclass[final]{cvpr}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
% Include other packages here, before hyperref.
\usepackage{csquotes}
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,colorlinks,bookmarks=false]{hyperref}
\def\cvprPaperID{****} % *** Enter the CVPR Paper ID here
\def\confYear{2021}
%\setcounter{page}{4321} % For final version only
\newcommand{\q}[1]{\enquote{#1}}
\newcommand{\Set}[1]{\left\{ #1 \right\}}
\newcommand{\R}{\mathbb{R}}
\begin{document}
%%%%%%%%% TITLE
\title{
Assignment 4 - Neural Network Recommendation System \\~\\
\large{Team name: M2 Robo}
}
\author{
Chan Kwan Yin\\
3035466978 \\
Team leader
\and
Lee Chun Yin\\
3035469140\\
\and
Chiu Yu Ying\\
3035477630
}
\maketitle
\clearpage
\section{Background}
Among all collaborative filtering techniques, matrix factorization (MF) has become a standard approach to model-based recommendation, and many research teams have put effort into enhancing it to improve its accuracy.
However, it has several limitations, such as the cold-start problem: how the system provides recommendations to new users, and how it recommends a newly added item that has no feature vector or embedding \cite{FMF}. Also, the use of the dot product tends to recommend popular items to users without specific interests. Because MF relies on the inner product of user and item feature embeddings, it limits the model's ability to capture the complex relations in user--item interactions \cite{he2017neural}.
In this assignment, we aim to implement a newer approach, Neural Collaborative Filtering (NCF), proposed by He et al. in 2017 \cite{he2017neural}. They suggested that their deep neural network design can overcome these limitations of MF.
\section{Neural Collaborative Filtering (NCF) model}
This is the model proposed by He et al. in 2017 \cite{he2017neural}. Their model uses a combination of matrix factorization and multi-layer perceptron techniques, and the hidden layers from the two techniques are combined by a fully connected layer. However, their paper mainly focuses on the prediction of \textit{implicit user feedback}; that is, they predict the values of $y_{ui}$, where $y_{ui}=1$ when the observation (user $u$, item $i$) is observed and $y_{ui}=0$ when it is not.
Thus, at the final stage of their model, the output of the fully connected layer is passed into a sigmoid function, so that the model only outputs values within $[0, 1]$.
The problem of predicting implicit feedback is different from the problem we wish to solve on the Netflix dataset, because in the Netflix problem we wish to predict actual ratings in the range $[1, 5]$.
Nevertheless, we still think that He et al.'s paper provides a good starting point: we modify their model by removing the final sigmoid function and changing the loss function, so that the neural network solves a regression problem instead of the original binary classification problem.
The model proposed by He et al. is split into three parts: Generalized Matrix Factorization, Multi-Layer Perceptron, and Neural Matrix Factorization. The third part is a combination of the previous two. For this part of our project, we follow the same workflow as He et al. and implement the three parts separately.
\begin{figure}[h]
\includegraphics[width=0.5\textwidth]{./NeuCF.PNG}
\caption{Neural Collaborative Filtering Model Design \cite{he2017neural}}
\end{figure}
\subsection{Model - Input Layer}
The inputs to the model are the user ID and the item ID, each transformed into a binarized vector with one-hot encoding.
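For illustration, a minimal PyTorch sketch of this encoding (the ID value shown is a placeholder; in practice the integer ID is passed to the embedding layer directly, which is equivalent):
\begin{verbatim}
import torch
import torch.nn.functional as F

user_id = torch.tensor([42])   # placeholder user index
# binarized vector: all zeros except a 1 at index 42
one_hot = F.one_hot(user_id, num_classes=480189)
\end{verbatim}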
\subsection{Model - Embedding Layer}
For each user and each movie, we train two sets of embedding latent vectors. One set of latent vectors is passed into the elementwise multiplication part of the model (in He et al.'s paper, this part is called Generalized Matrix Factorization (GMF)), and the other set is passed into the neural network part of the model (in He et al.'s paper, this part is called the Multi-Layer Perceptron (MLP)).
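As an illustration of the two sets of embeddings, here is a minimal PyTorch sketch; the dimension names \texttt{k\_gmf} and \texttt{k\_mlp} are our own placeholders, and the IDs are assumed to be remapped to contiguous indices:
\begin{verbatim}
import torch.nn as nn

n_users, n_items = 480189, 17770  # Netflix dataset sizes
k_gmf, k_mlp = 100, 100           # latent dims (placeholders)

# One embedding table per ID space and per branch of the model.
# Looking up an integer ID in nn.Embedding is equivalent to
# multiplying its one-hot vector with the embedding matrix.
user_emb_gmf = nn.Embedding(n_users, k_gmf)
item_emb_gmf = nn.Embedding(n_items, k_gmf)
user_emb_mlp = nn.Embedding(n_users, k_mlp)
item_emb_mlp = nn.Embedding(n_items, k_mlp)
\end{verbatim}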
\subsection{Model - Generalized Matrix Factorization (GMF)}
In this part of the model, the user latent vector and item latent vector are combined via elementwise multiplication, similar to that in the SVD matrix factorization model. The output of this part is another vector with the same size as the latent vectors, and it is passed on to the final output layer of the model.
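A minimal sketch of this branch, assuming the GMF embedding tables from the previous sketch:
\begin{verbatim}
def gmf_forward(user_ids, item_ids):
    # elementwise (Hadamard) product of the GMF latent vectors,
    # analogous to the per-factor products in SVD-style MF
    p_u = user_emb_gmf(user_ids)   # shape (batch, k_gmf)
    q_i = item_emb_gmf(item_ids)   # shape (batch, k_gmf)
    return p_u * q_i               # shape (batch, k_gmf)
\end{verbatim}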
\subsection{Model - Multi-Layer Perceptron (MLP)}
In this part of the model, the user latent vector and item latent vector are first concatenated. The concatenated vector is then passed through several fully connected hidden linear layers with the ReLU activation function. The output of this part is a vector whose size depends on the output size of the last hidden layer, and it is passed on to the final output layer of the model.
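A minimal sketch of this branch, assuming the MLP embedding tables from the embedding-layer sketch; the hidden layer sizes shown are placeholders rather than the ones we tuned:
\begin{verbatim}
import torch
import torch.nn as nn

mlp_layers = nn.Sequential(
    nn.Linear(2 * k_mlp, 128), nn.ReLU(),
    nn.Linear(128, 64),        nn.ReLU(),
    nn.Linear(64, 32),         nn.ReLU(),
)

def mlp_forward(user_ids, item_ids):
    # concatenate the user and item MLP latent vectors, then
    # pass the result through the stack of ReLU FC layers
    z = torch.cat([user_emb_mlp(user_ids),
                   item_emb_mlp(item_ids)], dim=-1)
    return mlp_layers(z)           # shape (batch, 32)
\end{verbatim}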
\subsection{Model - Output Layer}
The final output layer takes in the vectors produced by the GMF and MLP parts of the model and concatenates them. The output of this part, which is the final output of the whole model, is a weighted sum of the vector elements, where the weights are model parameters to be learned. The final output layer produces a predicted score $\hat{y_{ui}}$.
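This fusion can be sketched as a single linear layer over the concatenated GMF and MLP vectors; since we removed the sigmoid, its raw output is the predicted rating (sizes follow the placeholder dimensions used in the previous sketches):
\begin{verbatim}
# one learned weight per element of the concatenated vector
output_layer = nn.Linear(k_gmf + 32, 1, bias=False)

def neumf_forward(user_ids, item_ids):
    fused = torch.cat([gmf_forward(user_ids, item_ids),
                       mlp_forward(user_ids, item_ids)],
                      dim=-1)
    # no sigmoid: the model outputs a real-valued rating
    return output_layer(fused).squeeze(-1)
\end{verbatim}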
\subsection{Model - Formula}
\bigskip
\textbf{The formula of the GMF is as follows:}
$$\hat{y}_{ij} = f(P^T v_i, Q^T v_j \mid P, Q, \Theta_f)$$
With $m$ users and $n$ items, the variables are denoted as:
\begin{itemize}
\item $P \in \R^{m \times K}$: the latent factor matrix for users from the embedding layer.
\item $Q \in \R^{n \times K}$: the latent factor matrix for items from the embedding layer.
\item $f$: the interaction function.
\item $\Theta_{f}$: the model parameters of the interaction function $f$.
\end{itemize}
\bigskip
\textbf{ For MLP $f$, it can be formulated as:}
$$f(P^T v_i, Q^T v_j) = \phi_{output}(\phi_X(...\phi_2(\phi_1(P^T v_i, Q^T v_j))...))$$
The mapping functions are as follows:
\begin{itemize}
\item $\phi_{output}$: the mapping function of the output layer
\item $\phi_x$: the mapping function of the $x$-th hidden fully connected layer, for $x = 1, \dots, X$, where $X$ is the number of hidden layers
\end{itemize}
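Combining the two branches as described in the output layer above, the fused prediction can be written (following He et al.'s NeuMF formulation, with the final sigmoid removed for our regression setting, and writing $P', Q'$ for the second set of embedding matrices) as:
$$\hat{y}_{ij} = h^T \begin{bmatrix} \phi^{GMF}(P^T v_i, Q^T v_j) \\ \phi^{MLP}(P'^T v_i, Q'^T v_j) \end{bmatrix}$$
where $h$ is the learned weight vector of the output layer.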
\section{Technical Details}
The training data set provided by Netflix consists of more than 100 million ratings with 17770 movies and 480189 users.
% Such a huge data set would consume a significant amount of training time and memory
% ($O(m^2 n)$, since a correlation matrix between users is to be constructed),
% which is not possible for our hardware available.
% Therefore, only a subset of data is used for evaluation.
% To be specific, only first $10000$ movies and first $1000$ users that appear in the data set are considered.
\subsection{Data preprocessing}
The ratings are loaded into a list with columns Movie ID, User ID and Rating. The list is then fed into the data loader with the chosen batch size.
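A minimal sketch of this step with a PyTorch \texttt{DataLoader}; the variable \texttt{ratings} and the assumption that IDs have already been remapped to contiguous indices are placeholders of ours:
\begin{verbatim}
import torch
from torch.utils.data import TensorDataset, DataLoader

# ratings: list of (movie_id, user_id, rating) tuples parsed
# from the Netflix files, with IDs remapped to contiguous
# indices (loading code omitted here)
movie_ids, user_ids, values = zip(*ratings)
dataset = TensorDataset(
    torch.tensor(user_ids),
    torch.tensor(movie_ids),
    torch.tensor(values, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=65535, shuffle=True)
\end{verbatim}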
% Approximately $80\%$ data are then reformatted into a rating matrix $R \in \mathbb R^{m \times n}$ for training,
% where $r_{ij}$ is the rating of user $i$ on movie $j$;
% the rest are retained for performance evaluation.
% The missing data are being skipped in our model.
%In general, our algorithm runs through all rows for $\epsilon+1$ times
% where $\epsilon$ is the number of epochs of gradient descent.
% The model involves 5 parameters,
% namely $\mu \in \mathbb R, \mathbf b \in \mathbb R^m, \mathbf c \in \mathbb R^n, P \in \mathbb R^{mk}, Q \in \mathbb R^{nk}$.
% The blocking term is $nk$, so this model is scalable to the full dataset given that $nk$ is not beyond memory limit.
% We have not used the full model in this assignment due to time constraints,
% but we will do so in the final report.
\subsection{Optimization and scheduler}
We select Adadelta as our optimizer.
% To have a better performance on the converge of loss function, we implement a scheduler with step size = 1 and $\gamma$ = 0.7.
To achieve better convergence of the loss function, we implement a learning-rate scheduler and experiment with different combinations of its hyperparameters; a code sketch combining the optimizer and scheduler is given in the next subsection.
\subsection{Hyperparameters selection for scheduler}
There are two hyperparameters, namely
\begin{itemize}
\item step size
\item $\gamma$
\end{itemize}
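A minimal sketch of how the Adadelta optimizer and the scheduler fit together; \texttt{model}, \texttt{loader}, \texttt{n\_epochs} and \texttt{train\_one\_epoch} are placeholders, and the step size and $\gamma$ values shown are just one of the combinations we tried:
\begin{verbatim}
import torch.optim as optim

optimizer = optim.Adadelta(model.parameters(), lr=1.0)
# StepLR multiplies the learning rate by gamma every
# step_size epochs; step size and gamma are the two
# scheduler hyperparameters listed above
scheduler = optim.lr_scheduler.StepLR(optimizer,
                                      step_size=1, gamma=0.7)

for epoch in range(n_epochs):
    train_one_epoch(model, loader, optimizer)  # placeholder
    scheduler.step()
\end{verbatim}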
\subsection{Hyperparameters selection for model}
In this MLP model, there are five hyperparameters to be selected, namely
\begin{itemize}
\item Learning Rate ($\alpha$)
\item Regularization ($\lambda$)
\item Rank of factorization ($k$)
\item Number of epoch ($x$)
\item Batch Size ($N$)
% \item Log Interval ($t$)
\end{itemize}
\subsubsection{Learning Rate ($\alpha$)}
The learning rate indicates the speed at which the model learns. It controls how much the model is changed in response to the estimated error each time the model weights are updated.
A value that is too small may cause a long training time, while a value that is too large may make the model learn too fast, giving less accurate or even meaningless results.
\subsubsection{Regularization ($\lambda$)}
% Regularization plays a crucial role in preventing the overfitting issue when training model. It aims to reduce the complexity of model to achieve its function. When the regularization parameter is large, the decay in weights during SGD update will be increased and therefore the weights of the hidden units will be negligible (close to zero).
% If it is set to be too large, each change to the parameters are cancelled by the regularization term. If it is set to be too small, overfitting may result.
\subsubsection{Rank of factorization ($k$)}
Intuitively, each rank represents some characteristic of a user or a movie; a user and a movie that both express such a characteristic strongly interact strongly on that rank.
A large value of $k$ increases the training time, memory consumption and probably the chance of overfitting,
while a low value of $k$ reduces the representativeness of the model and hence reduces accuracy.
\subsubsection{Number of epoch ($x$)}
The number of epochs defines the number of times that the model works through the whole training dataset.
Since we report the training score for each epoch, the number of epochs is adjusted to a value at which the training score has essentially stopped improving.
% \subsubsection{Log-interval ($t$)}
\subsubsection{Batch size ($N$)}
The batch size defines the number of samples passed through the model before its parameters are updated.
A small batch size can usually give better performance, since the model starts learning from a small subset and has more update cycles to improve.
However, a smaller batch size slows down learning and hence increases training time.
\subsection{Predictive test set score (RMSE)}
The model is evaluated by computing the RMSE between predicted and actual rating values:
$$ \text{RMSE} = \sqrt{\sum_{(i, j) \in E} \frac{{(\hat r_{ij} - r_{ij})}^2}{\left| E \right|}} $$
where $E$ is the set of retained evaluation data.
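A minimal sketch of this evaluation, assuming a loader over the retained evaluation data and a PyTorch module \texttt{model} that maps user and item ID tensors to predicted ratings:
\begin{verbatim}
import torch

def evaluate_rmse(model, loader):
    # accumulate squared errors over the evaluation set E
    squared_error, count = 0.0, 0
    model.eval()
    with torch.no_grad():
        for user_ids, item_ids, ratings in loader:
            preds = model(user_ids, item_ids)
            err = preds - ratings
            squared_error += torch.sum(err ** 2).item()
            count += ratings.numel()
    return (squared_error / count) ** 0.5
\end{verbatim}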
\section{Model performance}
\subsection{Findings}
The following table exhaustively lists our test results. Train RMSE and Test RMSE refer to the RMSE value calculated for the training dataset and testing dataset respectively.
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
No. & $k$ & $\alpha$ & $\gamma$ & $N$ & Epochs & Train RMSE & Test RMSE
\\ \hline
1 & $100$ & $1$ & $0.7$ & $65535$ & $9$ & $1.16384$ & $1.08611$
\\ \hline
2 & $300$ & $1$ & $0.7$ & $65535$ & $9$ & $1.16491$ & $1.08696$
\\ \hline
3 & $100$ & $0.1$ & $0.9$ & $65535$ & $9$ & $1.16536$ & $1.08816$
\\ \hline
4 & $100$ & $1$ & $0.7$ & $4096$ & $9$ & $1.16368$ & $1.08977$
\\ \hline
5 & $100$ & $1$ & $0.7$ & $1048576$ & $8$ & $5.14467$ & $2.23113$
\\ \hline
\end{tabular}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234153.png}
\caption{RMSE graph for model 1}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234346.png}
\caption{RMSE graph for model 2}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234450.png}
\caption{RMSE graph for model 3}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234538.png}
\caption{RMSE graph for model 4}
\end{figure}
\begin{figure}
\includegraphics[width=0.5\textwidth]{screenshot20210415234604.png}
\caption{RMSE graph for model 5}
\end{figure}
\hspace{10em}
\subsection{Discussion}
When comparing the results of tuning the rank from 100 to 300, we can see that the RMSE curves have a similar shape in general. After epoch 0, both RMSE values drop significantly by around 4.5 units and then remain roughly constant. Tuning the learning rate to 0.1 and $\gamma$ to 0.9 makes the curve smoother than the previous results. The curve obtained by decreasing the batch size to 4096 has a similar shape to the first and second ones, but the extent of the drop is much smaller (by about 0.3). By increasing the batch size to 1048576, the RMSE decreases steadily (the curve remains close to a straight line) from $13.17$ to $5.144$.
Overall, the first configuration (rank $= 100$, learning rate $= 1.0$, $\gamma = 0.7$, batch size $= 65535$) has the lowest RMSE ($1.163$) after training. The highest belongs to the configuration (rank $= 100$, learning rate $= 1.0$, $\gamma = 0.7$, batch size $= 1048576$).
{\small
\bibliographystyle{ieee_fullname}
\bibliography{egbib}
}
\end{document}