This repository hosts a sentiment analysis project on IMDb movie reviews using logistic regression, a Bernoulli Naive Bayes classifier and a bidirectional GRU (biGRU) RNN. Its purpose is to showcase and examine the performance differences between these classification methods, using accuracy, precision, recall and F1 score as metrics.
The dataset used is the IMDb movie review sentiment classification dataset. It consists of 25,000 movie reviews from IMDb, labeled by sentiment (positive/negative).
After fetching and transforming the data, we implement the Logistic regression algorithm and a custom Bernoulli Naive Bayes classifier. We also construct a bidirectional GRU cell RNN with 2 layers.
We then compare our custom implementations with the corresponding Scikit-Learn implementations, mainly by plotting learning curves and printing classification reports, to observe their behavior on both training and testing data.
Lastly, we compare the behavior and performance of the bidirectional GRU cell RNN with those of the other two algorithms.
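As a quick illustration of the four metrics named above, here is a minimal sketch that computes them for the positive class from binary labels. The function name `binary_classification_metrics` and the toy labels are assumptions made for this example and are not taken from the notebook, which may rely on Scikit-Learn's reporting utilities instead.

```python
def binary_classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels, only to exercise the function.
metrics = binary_classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

With these toy labels there are two true positives, two true negatives, one false positive and one false negative, so all four metrics come out to 2/3.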
The classification methods featured are:
custom | scikit-learn |
---|---|
NaiveBayesClassifier | BernoulliNB |
CustomLogisticRegression | LogisticRegression, SGDClassifier |
BiGRU_RNN |
This project consists of a Jupyter Notebook with all code cells already run, so that you can easily examine the learning curves, classification reports, comparison heatmaps and other results. A Report summarizing some implementation details and conclusions is also included. It is written in Greek (as the Artificial Intelligence course was offered in Greek).
If you would like to rerun the Jupyter Notebook code cells, please note that some training processes may take a long time to complete, depending on your machine's specifications.
Bernoulli Naive Bayes is implemented as a class (class NaiveBayesClassifier()) where fit and predict are defined as methods.
- fit (x_train_binary, y_train): The fit method computes all probabilities required during prediction and assigns them to the relevant class fields:
  - The prior probabilities $\color{#2c73cc}P(C=1), P(C=0)$ of the positive and negative classes respectively.
  - The conditional probabilities $\color{#a42574} P(feature_i = 1 | C=0), P(feature_i = 1 | C=1)$, which are needed during prediction to compute their product (independence assumption). There is no need to store $P(feature_i = 0 | C=0) = 1 - P(feature_i = 1 | C=0)$ and $P(feature_i = 0 | C=1) = 1 - P(feature_i = 1 | C=1)$, which reduces memory requirements.

  💡 Laplace smoothing is used in the above computation, so that a single zero term in the product cannot zero out the entire classification probability during prediction. Thus, +1 is added to the numerators of the above probabilities, while +2 is added to the denominators (since each feature has two possible values).

  $\boxed{P(C=0 | example) \propto \color{#2c73cc}P(C=0) \color{black}*\color{#a42574} \prod_{i=1}^{m}P (feature_i | C=0)}$

  $\boxed{P(C=1 | example) \propto \color{#2c73cc}P(C=1) \color{black}*\color{#a42574} \prod_{i=1}^{m}P (feature_i | C=1)}$

  The normalizing factor $P(example)$ is the same for both classes, so it can be omitted when the two values are compared.
- predict (x_test_binary): The predict method computes the classification probabilities for the given testing data.
  - Algorithm:
    - ➡️ For each instance of the testing set to be classified:
      - ➡️ For each feature of that instance:
        - ➡️ Compute the following probabilities:

          $P(feature_i | C=0) =\begin{cases} P(feature_i = 1 | C=0), & if\; feature_i = 1,\\ 1 - P(feature_i = 1 | C=0), & if\; feature_i=0 \end{cases}$

          $P(feature_i | C=1) =\begin{cases} P(feature_i = 1 | C=1), & if\; feature_i = 1,\\ 1 - P(feature_i = 1 | C=1), & if\; feature_i=0 \end{cases}$

    ...which are then multiplied together with the class priors $P(C=0)$ and $P(C=1)$ according to the formula, giving $P(C=0 | example)$ and $P(C=1 | example)$. The instance is assigned to the class with the greatest probability.
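The fit and predict steps described above can be sketched in plain Python as follows. This is an illustrative reconstruction, not the notebook's NaiveBayesClassifier: the class name `BernoulliNaiveBayesSketch`, its attribute names, the tie-breaking rule and the toy data are all assumptions made for this example.

```python
class BernoulliNaiveBayesSketch:
    """Illustrative Bernoulli Naive Bayes for 0/1 features and 0/1 labels."""

    def fit(self, x_binary, y):
        n_features = len(x_binary[0])
        self.priors = {}        # P(C=c)
        self.p_feature_on = {}  # P(feature_i = 1 | C=c) for every feature i
        for c in (0, 1):
            rows = [x for x, label in zip(x_binary, y) if label == c]
            self.priors[c] = len(rows) / len(y)
            # Laplace smoothing: +1 in the numerator, +2 in the denominator.
            self.p_feature_on[c] = [
                (sum(row[i] for row in rows) + 1) / (len(rows) + 2)
                for i in range(n_features)
            ]
        return self

    def _score(self, x, c):
        # Unnormalized P(C=c | example): prior times product of per-feature terms.
        score = self.priors[c]
        for value, p_on in zip(x, self.p_feature_on[c]):
            # Only P(feature_i = 1 | C=c) is stored; the complement is derived.
            score *= p_on if value == 1 else 1.0 - p_on
        return score

    def predict(self, x_binary):
        # Assumption: ties are resolved in favor of class 0.
        return [1 if self._score(x, 1) > self._score(x, 0) else 0
                for x in x_binary]


# Hypothetical toy data: feature 0 marks positives, feature 2 marks negatives.
x_train = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_train = [1, 1, 0, 0]
model = BernoulliNaiveBayesSketch().fit(x_train, y_train)
print(model.predict([[1, 0, 0], [0, 0, 1]]))
```

With thousands of features, a direct product of probabilities can underflow to zero in floating point, so summing logarithms of the same terms is a common, more robust alternative that leaves the comparison between the two classes unchanged.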
Logistic Regression is implemented as a class (class CustomLogisticRegression()) where fit and predict are defined as methods, along with an auxiliary sigmoid function (pos_category_sigmoid). Moreover, find_best_regularizator is implemented to estimate the optimal regularization factor.
- fit (x_train_binary, y_train)
  - The data is split into training data and validation data, with 20% held out for validation.
  - All attribute weights are initialized to 0.
  - For a maximum number of iterations/epochs (n_iters), we randomly reorder the instances at the beginning of each iteration (so that the steps towards weight convergence are independent from iteration to iteration), and for each instance we update the weights (and the bias term) according to the formula:

    $\boxed{\vec{w} = (1-2*\lambda*\eta)*\vec{w}+\eta*\sum_{i=1}^m[y^{(i)}-P(C_+|\vec{x}^{(i)})]*\vec{x}^{(i)}}$
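Therefore, since weights are updated based on one example at a time, the sum in the formula reduces to a single term and the procedure amounts to stochastic gradient ascent.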
- predict (x_test_binary)
  - Algorithm:
    - ➡️ For each instance of the testing set to be classified:
      - ➡️ Compute the product of the weight vector (as learned during fitting) and the feature vector of the current example ($\vec{w} * \vec{x}$). The classification is done according to the sign of this product:

        $(\vec{w} * \vec{x})=\begin{cases} pos (+) \rightarrow C=1\\ neg (-)\rightarrow C=0 \end{cases}$
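The per-instance update and the sign-based prediction described above can be sketched as follows. This is a simplified illustration rather than the notebook's CustomLogisticRegression: the class name `LogisticRegressionSketch`, the parameter names, the default values, the fixed random seed and the toy data are assumptions, and the validation split and early stopping are left out.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticRegressionSketch:
    """Illustrative logistic regression trained one example at a time."""

    def __init__(self, learning_rate=0.001, reg_lambda=0.0, n_iters=100, seed=0):
        self.learning_rate = learning_rate  # eta in the update formula
        self.reg_lambda = reg_lambda        # lambda in the update formula
        self.n_iters = n_iters
        self.seed = seed

    def _linear(self, row):
        return sum(w * v for w, v in zip(self.weights, row)) + self.bias

    def fit(self, x, y):
        rng = random.Random(self.seed)
        self.weights = [0.0] * len(x[0])  # all weights start at 0
        self.bias = 0.0
        order = list(range(len(x)))
        shrink = 1.0 - 2.0 * self.reg_lambda * self.learning_rate
        for _ in range(self.n_iters):
            rng.shuffle(order)  # new random order at the start of each epoch
            for i in order:
                # y - P(C+ | x) for the current example only.
                error = y[i] - sigmoid(self._linear(x[i]))
                self.weights = [shrink * w + self.learning_rate * error * v
                                for w, v in zip(self.weights, x[i])]
                self.bias += self.learning_rate * error  # bias is not regularized here
        return self

    def predict(self, x):
        # Positive linear score -> class 1, otherwise class 0.
        return [1 if self._linear(row) > 0 else 0 for row in x]


# Hypothetical toy data in which the first feature decides the label.
x_toy = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_toy = [1, 1, 0, 0]
clf = LogisticRegressionSketch(learning_rate=0.1, n_iters=200).fit(x_toy, y_toy)
print(clf.predict(x_toy))
```

Classifying by the sign of the linear score is equivalent to thresholding the sigmoid output at 0.5, which is why the sigmoid is needed only during training.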
To find the optimal regularization factor (find_best_regularizator function) an iterative process takes place, where for a range of values of λ (from 1e-15 to 0.99 + 1e-15) accuracy score evaluations are performed on validation data and the optimal λ is returned along with other relevant information. If the accuracy score does not improve within 5 consecutive iterations/trials, then the process is terminated.
The optimal λ turned out to be the smallest value of the above range (1e-15), and a trial with λ set to 0 (i.e. without regularization) gave even better results. This is reasonable since, as the diagrams show, there is no high-variance (overfitting) problem: the learning curves for the training data and the testing data converge. Any further improvement would therefore come from reducing λ rather than increasing it.
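The search procedure described above (try a range of λ values, score each on validation data, stop after 5 trials without improvement) can be sketched as follows. This is not the notebook's find_best_regularizator: the function name `search_regularization`, its signature, the step of 0.01 between candidates and the stand-in scoring function are assumptions made for this example.

```python
def search_regularization(candidates, validation_accuracy, patience=5):
    """Return (best_lambda, best_score, history) over the candidate values.

    `validation_accuracy` is any callable mapping a lambda value to a score;
    the search stops after `patience` consecutive trials without improvement.
    """
    best_lambda, best_score = None, float("-inf")
    trials_without_improvement = 0
    history = []
    for lam in candidates:
        score = validation_accuracy(lam)
        history.append((lam, score))
        if score > best_score:
            best_lambda, best_score = lam, score
            trials_without_improvement = 0
        else:
            trials_without_improvement += 1
            if trials_without_improvement >= patience:
                break
    return best_lambda, best_score, history


# Candidate grid from 1e-15 to 0.99 + 1e-15; the 0.01 step is an assumption.
candidates = [1e-15 + 0.01 * k for k in range(100)]


def fake_validation_accuracy(lam):
    # Stand-in scorer for illustration only: it does not train any model and
    # simply decreases as lambda grows. These are not results from the project.
    return 0.9 - lam


best_lambda, best_score, history = search_regularization(
    candidates, fake_validation_accuracy)
print(best_lambda, len(history))
```

In a real run the scoring callable would fit the classifier with the given λ and return its accuracy on the validation split.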
For the learning rate η, we use the typical value 0.001.
The maximum number of iterations/epochs (n_iters) is set to 100, as it was experimentally observed that, with early stopping, fewer than 100 iterations/epochs were generally needed.
The bidirectional GRU cell RNN is implemented as a class where fit and predict are defined as methods.
- Fit: Each time fit is invoked, the neural network is reconstructed (create_bi_GRU_RNN()) and compiled. Then the standard fit (of the Keras model) is called to train it. The purpose is to avoid retaining state from previous calls, i.e. to "clear" the memory of the neural network.
- Predict: Predict invokes the corresponding standard predict (of the Keras model). The predictions are then converted from probabilistic to binary, so that the learning_curves function can be reused for our graphs.
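The wrapper pattern described above (rebuild the model on every fit, turn probabilities into binary labels in predict) can be sketched without any deep learning dependency. Everything here is hypothetical: the class `RebuildOnFitWrapper`, the `build_model` argument, the 0.5 threshold and the `ConstantModel` stand-in are assumptions made for this example, and the sketch assumes that the wrapped model's predict returns a flat sequence of probabilities (a Keras model typically returns an array of shape (n, 1) that would need flattening first).

```python
class RebuildOnFitWrapper:
    """Illustrative wrapper that never reuses a previously trained model."""

    def __init__(self, build_model, threshold=0.5):
        self.build_model = build_model  # callable returning a fresh, compiled model
        self.threshold = threshold      # assumed cut-off between the two classes
        self.model = None

    def fit(self, x, y, **fit_kwargs):
        # A new model per call, so nothing learned earlier is carried over.
        self.model = self.build_model()
        self.model.fit(x, y, **fit_kwargs)
        return self

    def predict(self, x):
        if self.model is None:
            raise RuntimeError("fit must be called before predict")
        probabilities = self.model.predict(x)
        # Convert probabilistic outputs to binary labels.
        return [1 if float(p) > self.threshold else 0 for p in probabilities]


class ConstantModel:
    """Hypothetical stand-in for a compiled model; it learns nothing."""
    instances = 0

    def __init__(self):
        ConstantModel.instances += 1

    def fit(self, x, y):
        pass

    def predict(self, x):
        return [0.9 if sum(row) > 0 else 0.2 for row in x]


wrapper = RebuildOnFitWrapper(ConstantModel)
wrapper.fit([[1, 0]], [1])
wrapper.fit([[0, 0]], [0])  # the second call builds a second model
print(ConstantModel.instances, wrapper.predict([[1, 0], [0, 0]]))
```

Returning plain 0/1 labels keeps the wrapper interchangeable with the other two classifiers wherever a function only needs fit and predict.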
- Alviona Mancho [alvionaM]
- Christos Patrinopoulos [techristosP]