
Comment Analyzer

This is a Sentiment Analysis project: a web application that determines whether a comment given for a product is positive or negative (with the probability). In other words, it tells whether the person who bought the product is satisfied or not. CLICK HERE TO START THE APP (Demo)

Preview | Setup | Details of Implementation | Network | Hyperparameters and Tools | Results | References | Useful Resources

Preview

Demo

      Stack:

          HTML - CSS - JS - Chart.js - Underscore.js - jQuery - Bootstrap - Python - Flask

      Icon Set:

           - Font Awesome

           - Bootstrap

Setup

Ubuntu 20.04 LTS

First, install python3.8 and python3.8-venv on the OS.

Check:

Typing python3.8 and hitting Enter should open the Python 3.8 shell.

Exit the Python shell and run the commands below:

Commands:

 $ git clone https://github.com/amirdy/comment_analyzer.git
 $ cd comment_analyzer/Demo
 $ python3.8 -m venv env 
 $ source ./env/bin/activate
 (env)$ pip install -r requirements.txt
 (env)$ pip install torch==1.8.1+cpu torchvision==0.9.1+cpu  -f https://download.pytorch.org/whl/torch_stable.html
 (env)$ flask run

It will run a Flask development server.
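By default (unless configured otherwise), the Flask development server listens on http://127.0.0.1:5000; open that address in a browser to use the app.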

Details of Implementation

Dataset

This dataset consists of ~4 million Amazon customer reviews (~3.6 million for training and ~0.4 million for testing).

It has 2 labels:

  • Label 1: comments corresponding to 1- and 2-star reviews (Negative).

  • Label 2: comments corresponding to 4- and 5-star reviews (Positive).

Preprocessing

I used only 1,000,000 samples for training/validation and 400,000 samples for the test set.

  • Train set: 780,000 samples (the numbers of negative and positive samples are almost equal).

  • Validation set: 220,000 samples.

  • Test set: 400,000 samples.

  • Each sample is a sequence of sentences.

  • First, each sample is turned into a sequence of vocabs (tokens).

  • The length of each sample is the size of this sequence.

  • The lengths of the samples differ, but Mini-Batch Gradient Descent needs samples of the same length.

So, what is the solution?
We can consider only the first N vocabs (tokens) of each sample.
Then, we append enough <pad> tokens to the end of samples shorter than N so that they reach length N.
But what should the value of N be?
Let's assume that the length of samples in the train set is a random variable X with a Normal Distribution¹.
In the training set, the average length of the samples is 92.70 and the standard deviation is 50.23.

We already know that in a Normal Distribution almost all values (about 99.9%) lie below the mean plus three standard deviations:

P(X ≤ μ + 3σ) ≈ 0.999

Plugging in the average and the standard deviation, we get:

μ + 3σ = 92.70 + 3 × 50.23 ≈ 243.4

Thus, N = 244 is a good choice: almost every sample fits within 244 tokens.

* We only consider the first 244 vocabs (tokens) of each sample (a minimal sketch of this step follows below).

* Vocabs with fewer than 10 repetitions are replaced with <UNK>.¹
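Purely as an illustration of this truncate-and-pad step (not the project's actual code), here is a minimal sketch. The tokenizer, the vocab dictionary, and all names are assumptions.

```python
from nltk.tokenize import word_tokenize  # assumes NLTK and its 'punkt' data are installed

MAX_LEN = 244   # N, derived from mean + 3*std of the training-sample lengths
MIN_FREQ = 10   # tokens seen fewer than 10 times become <UNK>

def encode(text, vocab2index, token_freq):
    """Turn a raw review into a fixed-length list of vocab indices."""
    tokens = word_tokenize(text.lower())[:MAX_LEN]                       # keep the first 244 tokens
    tokens = [t if token_freq.get(t, 0) >= MIN_FREQ else "<UNK>" for t in tokens]
    tokens += ["<pad>"] * (MAX_LEN - len(tokens))                        # pad shorter samples to 244
    return [vocab2index.get(t, vocab2index["<UNK>"]) for t in tokens]
```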

Network

The entire model consists of three networks:

1. Encoder (2-layer bidirectional LSTM)
2. Encoder2Decoder (MLP)
3. Decoder (LSTM with Attention)

1. Encoder

The 244 word indices of each sample are fed to a 2-layer bidirectional LSTM (the Encoder); a minimal sketch follows the figures below.

(Figures: the Left-to-Right and Right-to-Left directions of the bidirectional Encoder)
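A minimal PyTorch sketch of such an encoder, using the embedding size (16), hidden size (16), and 2 bidirectional layers listed under Hyperparameters; class and variable names are assumptions, not the project's code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2-layer bidirectional LSTM over the padded 244-token sequence."""
    def __init__(self, vocab_size, emb_size=16, hidden_size=16, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):                   # token_ids: (batch, 244)
        embedded = self.embedding(token_ids)        # (batch, 244, emb_size)
        outputs, (h_n, c_n) = self.lstm(embedded)   # outputs: (batch, 244, 2*hidden_size)
        return outputs, h_n                         # h_n: (num_layers*2, batch, hidden_size)
```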

2. Encoder2Decoder

The hidden states of the last time step were given to the Encoder2Decoder network (an MLP) to obtain the Decoder's hidden states.

In other words, Encoder2Decoder receives the Encoder's last-time-step hidden states (from both layers and both directions) and generates the Decoder's hidden states; a rough sketch follows the figure below.

Encoder to Decoder Network
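A possible sketch of this mapping, assuming the four final hidden states (2 layers × 2 directions, 16 units each) are concatenated and projected by linear layers to the Decoder's initial hidden and cell states; the exact wiring is an assumption, not the project's code.

```python
import torch
import torch.nn as nn

class Encoder2Decoder(nn.Module):
    """Map the Encoder's final hidden states to the Decoder's initial states."""
    def __init__(self, enc_hidden=16, dec_hidden=16, enc_layers=2):
        super().__init__()
        in_features = enc_layers * 2 * enc_hidden           # 2 layers x 2 directions x 16 = 64
        self.to_hidden = nn.Linear(in_features, dec_hidden)
        self.to_cell = nn.Linear(in_features, dec_hidden)

    def forward(self, h_n):                                 # h_n: (enc_layers*2, batch, enc_hidden)
        batch = h_n.size(1)
        flat = h_n.permute(1, 0, 2).reshape(batch, -1)      # (batch, enc_layers*2*enc_hidden)
        h0 = torch.tanh(self.to_hidden(flat)).unsqueeze(0)  # (1, batch, dec_hidden)
        c0 = torch.tanh(self.to_cell(flat)).unsqueeze(0)
        return h0, c0
```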

3. Decoder

For better results, an Attention mechanism, which is a simple MLP, was used.

Attention

Then, a combination of the Attention outputs and the Encoder hidden states was given to the Decoder; a rough sketch follows the figure below.

Decoder Network
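The text above does not spell out the exact attention wiring, so the following is only a rough sketch under common assumptions: an MLP scores each Encoder output against the Decoder's hidden state, the scores weight the Encoder outputs into a context vector, and a single-step, single-layer LSTM followed by a linear layer produces the positive/negative logits. All names and details are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Single-step LSTM decoder with a simple MLP attention over the Encoder outputs."""
    def __init__(self, enc_hidden=16, dec_hidden=16, num_classes=2):
        super().__init__()
        self.attn = nn.Sequential(                   # MLP attention scorer
            nn.Linear(2 * enc_hidden + dec_hidden, dec_hidden),
            nn.Tanh(),
            nn.Linear(dec_hidden, 1),
        )
        self.lstm = nn.LSTM(2 * enc_hidden, dec_hidden, num_layers=1, batch_first=True)
        self.classifier = nn.Linear(dec_hidden, num_classes)

    def forward(self, enc_outputs, h0, c0):
        # enc_outputs: (batch, 244, 2*enc_hidden); h0, c0: (1, batch, dec_hidden)
        query = h0[-1].unsqueeze(1).expand(-1, enc_outputs.size(1), -1)
        scores = self.attn(torch.cat([enc_outputs, query], dim=-1))   # (batch, 244, 1)
        weights = F.softmax(scores, dim=1)                            # attention weights
        context = (weights * enc_outputs).sum(dim=1, keepdim=True)    # (batch, 1, 2*enc_hidden)
        _, (h_n, _) = self.lstm(context, (h0, c0))                    # one decoding step
        return self.classifier(h_n[-1])                               # (batch, num_classes) logits
```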

Hyperparameters and Tools

  • Batch size:

    • 64
  • Embedding Size:

    • 16
  • Encoder:

    • Sequence length:
      • 244
    • Hidden size:
      • 16
    • Number of layers:
      • 2
    • Bidirectional:
      • True
  • Decoder:

    • Sequence length:
      • 1
    • Hidden size:
      • 16
    • Number of layers:
      • 1
    • Bidirectional:
      • False
  • Optimizer:

    • Adam
  • Learning rate:

    • 0.0005
  • Weight Decay:

    • 0.0001
  • Loss:

    • Cross entropy
  • Train vs Validation Split:

    • Approximately : 0.78 | 0.22
  • Tools:

    • NLTK
    • Python - PyTorch (using Google Colab Pro)
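For reference, a rough sketch of how the optimizer and loss settings above would look in PyTorch; the `model` argument stands for the combined Encoder/Encoder2Decoder/Decoder and is an assumption, not the actual training script.

```python
import torch
import torch.nn as nn

def build_training_objects(model: nn.Module):
    # Adam with the listed learning rate and weight decay, plus cross-entropy loss.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.0001)
    criterion = nn.CrossEntropyLoss()   # 2 classes: negative / positive
    return optimizer, criterion
```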

Results

Training finished in ~19 hours on a Tesla V100-SXM2-16GB (using Google Colab Pro).

  • Best Validation Loss: 0.1355
    • (In Epoch 30 | Accuracy: 94.97%) [This model is selected]
  • Test results for this model:
    • Loss: 0.1365 | Accuracy: ~94.96%


References

[1] Sentiment Analysis on IMDB Dataset - Seyed Naser Razavi

Useful Resources

[1] Text Classification (Sentiment Analysis) - Seyed Naser Razavi