
Comment Analyzer

This is a Sentiment Analysis project: a web application that determines whether a comment given for a product is positive or negative (with the probability). In other words, it tells whether the person who bought the product is satisfied or not. CLICK HERE TO START THE APP (Demo)

Preview | Setup | Details of Implementation | Network | Hyperparameters and Tools | Results | References | Useful Resources

Preview

Demo

      Stack:

          HTML - CSS - JS - Chart.js - Underscore.js - jQuery - Bootstrap - Python - Flask

      Icon Set:

           - Font Awesome

           - Bootstrap

Setup

Ubuntu 20.04 LTS

First, install python3.8 and python3.8-venv on the OS.

Check:

Typing python3.8 and hitting Enter should open the Python 3.8 shell.

Exit the Python shell and run the commands below:

Commands:

 $ git clone https://github.com/amirdy/comment_analyzer.git
 $ cd comment_analyzer/Demo
 $ python3.8 -m venv env 
 $ source ./env/bin/activate
 (env)$ pip install -r requirements.txt
 (env)$ pip install torch==1.8.1+cpu torchvision==0.9.1+cpu  -f https://download.pytorch.org/whl/torch_stable.html
 (env)$ flask run

It will run a Flask development server.
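By default (unless configured otherwise), the Flask development server listens on http://127.0.0.1:5000; open that address in a browser to use the app.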

Details of Implementation

Dataset

This dataset consists of ~4 million Amazon customer reviews (~3.6 million for training and ~0.4 million for testing).

It has 2 labels:

  • Label 1: comments corresponding to 1- and 2-star reviews (Negative).

  • Label 2: comments corresponding to 4- and 5-star reviews (Positive).

Preprocessing

I used only 1,000,000 samples for training/validation and 400,000 samples for the test set.

  • Train set: 780,000 samples (the numbers of negative and positive samples are almost equal).

  • Validation set: 220,000 samples.

  • Test set: 400,000 samples.

  • Each sample is a sequence of sentences.

  • First, each sample is turned into a sequence of vocabs (tokens).

  • The length of each sample is the size of this sequence.

  • The lengths of the samples differ, but Mini-Batch Gradient Descent needs samples of the same length.

So, what is the solution?
We can consider only the first N vocabs (tokens) of each sample.
Then, we append enough <pad> tokens to the end of samples shorter than N so that they reach length N.
But what should the value of N be?
Let's assume that the length of samples in the train set is a random variable X with a Normal Distribution¹.
In the training set, the average length of the samples is 92.70 and the standard deviation is 50.23.

We already know that in a Normal Distribution almost all values (about 99.9%) lie below the mean plus three standard deviations:

P(X ≤ μ + 3σ) ≈ 0.999

Plugging in the average and the standard deviation, we get:

μ + 3σ = 92.70 + 3 × 50.23 ≈ 243.4

Thus, N = 244 is a good choice: almost every sample fits within 244 tokens.

* We only consider the first 244 vocabs (tokens) of each sample (a minimal sketch of this step follows below).

* Vocabs with fewer than 10 repetitions are replaced with <UNK>.¹
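Purely as an illustration of this truncate-and-pad step (not the project's actual code), here is a minimal sketch. The tokenizer, the vocab dictionary, and all names are assumptions.

```python
from nltk.tokenize import word_tokenize  # assumes NLTK and its 'punkt' data are installed

MAX_LEN = 244   # N, derived from mean + 3*std of the training-sample lengths
MIN_FREQ = 10   # tokens seen fewer than 10 times become <UNK>

def encode(text, vocab2index, token_freq):
    """Turn a raw review into a fixed-length list of vocab indices."""
    tokens = word_tokenize(text.lower())[:MAX_LEN]                       # keep the first 244 tokens
    tokens = [t if token_freq.get(t, 0) >= MIN_FREQ else "<UNK>" for t in tokens]
    tokens += ["<pad>"] * (MAX_LEN - len(tokens))                        # pad shorter samples to 244
    return [vocab2index.get(t, vocab2index["<UNK>"]) for t in tokens]
```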

Network

The entire model consists of three networks:

1. Encoder (2-layer bidirectional LSTM)
2. Encoder2Decoder (MLP)
3. Decoder (LSTM with Attention)

1. Encoder

The 244 word indices of each sample are fed to a 2-layer bidirectional LSTM (the Encoder); a minimal sketch follows the figures below.

(Figures: the Left-to-Right and Right-to-Left directions of the bidirectional Encoder)
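A minimal PyTorch sketch of such an encoder, using the embedding size (16), hidden size (16), and 2 bidirectional layers listed under Hyperparameters; class and variable names are assumptions, not the project's code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """2-layer bidirectional LSTM over the padded 244-token sequence."""
    def __init__(self, vocab_size, emb_size=16, hidden_size=16, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):                   # token_ids: (batch, 244)
        embedded = self.embedding(token_ids)        # (batch, 244, emb_size)
        outputs, (h_n, c_n) = self.lstm(embedded)   # outputs: (batch, 244, 2*hidden_size)
        return outputs, h_n                         # h_n: (num_layers*2, batch, hidden_size)
```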

2. Encoder2Decoder

The hidden states of the last time step were given to the Encoder2Decoder network (an MLP) to obtain the Decoder's hidden states.

In other words, Encoder2Decoder receives the Encoder's last-time-step hidden states (from both layers and both directions) and generates the Decoder's hidden states; a rough sketch follows the figure below.

Encoder to Decoder Network
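A possible sketch of this mapping, assuming the four final hidden states (2 layers × 2 directions, 16 units each) are concatenated and projected by linear layers to the Decoder's initial hidden and cell states; the exact wiring is an assumption, not the project's code.

```python
import torch
import torch.nn as nn

class Encoder2Decoder(nn.Module):
    """Map the Encoder's final hidden states to the Decoder's initial states."""
    def __init__(self, enc_hidden=16, dec_hidden=16, enc_layers=2):
        super().__init__()
        in_features = enc_layers * 2 * enc_hidden           # 2 layers x 2 directions x 16 = 64
        self.to_hidden = nn.Linear(in_features, dec_hidden)
        self.to_cell = nn.Linear(in_features, dec_hidden)

    def forward(self, h_n):                                 # h_n: (enc_layers*2, batch, enc_hidden)
        batch = h_n.size(1)
        flat = h_n.permute(1, 0, 2).reshape(batch, -1)      # (batch, enc_layers*2*enc_hidden)
        h0 = torch.tanh(self.to_hidden(flat)).unsqueeze(0)  # (1, batch, dec_hidden)
        c0 = torch.tanh(self.to_cell(flat)).unsqueeze(0)
        return h0, c0
```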

3. Decoder

For better results, an Attention mechanism, which is a simple MLP, was used.

Attention

Then, a combination of the Attention outputs and the Encoder hidden states was given to the Decoder; a rough sketch follows the figure below.

Decoder Network
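The text above does not spell out the exact attention wiring, so the following is only a rough sketch under common assumptions: an MLP scores each Encoder output against the Decoder's hidden state, the scores weight the Encoder outputs into a context vector, and a single-step, single-layer LSTM followed by a linear layer produces the positive/negative logits. All names and details are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Single-step LSTM decoder with a simple MLP attention over the Encoder outputs."""
    def __init__(self, enc_hidden=16, dec_hidden=16, num_classes=2):
        super().__init__()
        self.attn = nn.Sequential(                   # MLP attention scorer
            nn.Linear(2 * enc_hidden + dec_hidden, dec_hidden),
            nn.Tanh(),
            nn.Linear(dec_hidden, 1),
        )
        self.lstm = nn.LSTM(2 * enc_hidden, dec_hidden, num_layers=1, batch_first=True)
        self.classifier = nn.Linear(dec_hidden, num_classes)

    def forward(self, enc_outputs, h0, c0):
        # enc_outputs: (batch, 244, 2*enc_hidden); h0, c0: (1, batch, dec_hidden)
        query = h0[-1].unsqueeze(1).expand(-1, enc_outputs.size(1), -1)
        scores = self.attn(torch.cat([enc_outputs, query], dim=-1))   # (batch, 244, 1)
        weights = F.softmax(scores, dim=1)                            # attention weights
        context = (weights * enc_outputs).sum(dim=1, keepdim=True)    # (batch, 1, 2*enc_hidden)
        _, (h_n, _) = self.lstm(context, (h0, c0))                    # one decoding step
        return self.classifier(h_n[-1])                               # (batch, num_classes) logits
```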

Hyperparameters and Tools

  • Batch size:

    • 64
  • Embedding Size:

    • 16
  • Encoder:

    • Sequence length:
      • 244
    • Hidden size:
      • 16
    • Number of layers:
      • 2
    • Bidirectional:
      • True
  • Decoder:

    • Sequence length:
      • 1
    • Hidden size:
      • 16
    • Number of layers:
      • 1
    • Bidirectional:
      • False
  • Optimizer:

    • Adam
  • Learning rate:

    • 0.0005
  • Weight Decay:

    • 0.0001
  • Loss:

    • Cross entropy
  • Train vs Validation Split:

    • Approximately : 0.78 | 0.22
  • Tools:

    • NLTK
    • Python - PyTorch (using Google Colab Pro)
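For reference, a rough sketch of how the optimizer and loss settings above would look in PyTorch; the `model` argument stands for the combined Encoder/Encoder2Decoder/Decoder and is an assumption, not the actual training script.

```python
import torch
import torch.nn as nn

def build_training_objects(model: nn.Module):
    # Adam with the listed learning rate and weight decay, plus cross-entropy loss.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005, weight_decay=0.0001)
    criterion = nn.CrossEntropyLoss()   # 2 classes: negative / positive
    return optimizer, criterion
```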

Results

Training finished in ~19 hours on a Tesla V100-SXM2-16GB (using Google Colab Pro).

  • Best Validation Loss: 0.1355
    • (In Epoch 30 | Accuracy: 94.97%) [This model is selected]
  • Test results for this model:
    • Loss: 0.1365 | Accuracy: ~94.96%


References

[1] Sentiment Analysis on IMDB Dataset - Seyed Naser Razavi

Useful Resources

[1] Text Classification (Sentiment Analysis) - Seyed Naser Razavi