Sentiment analysis is the task of classifying the polarity of a given text.
The IMDb dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), labeled as positive or negative. The dataset contains an equal number of positive and negative reviews, and only highly polarizing reviews are included: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. Models are evaluated based on accuracy.
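The score-to-label rule above can be stated directly in code; a minimal sketch (the function name is ours, not part of any dataset tooling):

```python
def imdb_label(score):
    """Map a 1-10 review score to its IMDb sentiment label, or None if excluded.

    Only highly polarizing reviews are kept: score <= 4 -> negative,
    score >= 7 -> positive; mid-range scores (5-6) are excluded entirely.
    """
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # mid-range scores are not part of the dataset
```

For example, `imdb_label(8)` returns `"positive"`, while `imdb_label(5)` returns `None` because such reviews never enter the dataset.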
Model | Accuracy | Paper / Source |
---|---|---|
XLNet (Yang et al., 2019) | 96.21 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
BERT_large+ITPT (Sun et al., 2019) | 95.79 | How to Fine-Tune BERT for Text Classification? |
BERT_base+ITPT (Sun et al., 2019) | 95.63 | How to Fine-Tune BERT for Text Classification? |
ULMFiT (Howard and Ruder, 2018) | 95.4 | Universal Language Model Fine-tuning for Text Classification |
Block-sparse LSTM (Gray et al., 2017) | 94.99 | GPU Kernels for Block-Sparse Weights |
oh-LSTM (Johnson and Zhang, 2016) | 94.1 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Virtual adversarial training (Miyato et al., 2016) | 94.1 | Adversarial Training Methods for Semi-Supervised Text Classification |
BCN+Char+CoVe (McCann et al., 2017) | 91.8 | Learned in Translation: Contextualized Word Vectors |
The Stanford Sentiment Treebank contains 215,154 phrases with fine-grained sentiment labels in the parse trees of 11,855 sentences in movie reviews. Models are evaluated either on fine-grained (five-way) or binary classification based on accuracy.
Fine-grained classification (SST-5, 94.2k examples):
Model | Accuracy | Paper / Source |
---|---|---|
BCN+Suffix BiLSTM-Tied+CoVe (Brahma, 2018) | 56.2 | Improved Sentence Modeling using Suffix Bidirectional LSTM |
BCN+ELMo (Peters et al., 2018) | 54.7 | Deep contextualized word representations |
BCN+Char+CoVe (McCann et al., 2017) | 53.7 | Learned in Translation: Contextualized Word Vectors |
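The binary SST-2 setting is commonly derived from the five-way labels by dropping neutral phrases and collapsing the remaining classes; a sketch of that mapping, assuming the usual label numbering 0 = very negative … 4 = very positive (function name ours):

```python
def sst5_to_sst2(label):
    """Collapse an SST-5 label (0-4) to a binary sentiment; neutral is dropped."""
    if label <= 1:   # very negative, negative
        return "negative"
    if label >= 3:   # positive, very positive
        return "positive"
    return None      # neutral (label 2) is excluded from the binary setting
```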
Binary classification (SST-2, 56.4k examples):
The Yelp Review dataset consists of more than 500,000 Yelp reviews. There is both a binary and a fine-grained (five-class) version of the dataset. Models are evaluated based on error (1 - accuracy; lower is better).
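Since the tables below report error rather than accuracy, converting between the two is a single subtraction; small helper functions (ours) for results given as percentages:

```python
def accuracy_to_error(accuracy_pct):
    """Convert accuracy (%) to error (%), the metric reported in the Yelp tables."""
    return round(100.0 - accuracy_pct, 2)

def error_to_accuracy(error_pct):
    """Convert a reported error (%) back to accuracy (%)."""
    return round(100.0 - error_pct, 2)
```

For example, XLNet's binary-classification error of 1.55 corresponds to an accuracy of 98.45%.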
Fine-grained classification:
Model | Error | Paper / Source |
---|---|---|
XLNet (Yang et al., 2019) | 27.80 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
BERT_large+ITPT (Sun et al., 2019) | 28.62 | How to Fine-Tune BERT for Text Classification? |
BERT_base+ITPT (Sun et al., 2019) | 29.42 | How to Fine-Tune BERT for Text Classification? |
ULMFiT (Howard and Ruder, 2018) | 29.98 | Universal Language Model Fine-tuning for Text Classification |
DPCNN (Johnson and Zhang, 2017) | 30.58 | Deep Pyramid Convolutional Neural Networks for Text Categorization |
CNN (Johnson and Zhang, 2016) | 32.39 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Char-level CNN (Zhang et al., 2015) | 37.95 | Character-level Convolutional Networks for Text Classification |
Binary classification:
Model | Error | Paper / Source |
---|---|---|
XLNet (Yang et al., 2019) | 1.55 | XLNet: Generalized Autoregressive Pretraining for Language Understanding |
BERT_large+ITPT (Sun et al., 2019) | 1.81 | How to Fine-Tune BERT for Text Classification? |
BERT_base+ITPT (Sun et al., 2019) | 1.92 | How to Fine-Tune BERT for Text Classification? |
ULMFiT (Howard and Ruder, 2018) | 2.16 | Universal Language Model Fine-tuning for Text Classification |
DPCNN (Johnson and Zhang, 2017) | 2.64 | Deep Pyramid Convolutional Neural Networks for Text Categorization |
CNN (Johnson and Zhang, 2016) | 2.90 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Char-level CNN (Zhang et al., 2015) | 4.88 | Character-level Convolutional Networks for Text Classification |
SemEval (International Workshop on Semantic Evaluation) features a dedicated sentiment analysis task. The overview paper of the most recent edition of this task (Task 4) is available at: http://www.aclweb.org/anthology/S17-2088
SemEval-2017 Task 4 consists of five subtasks, each offered for both Arabic and English:
- Subtask A: Given a tweet, decide whether it expresses POSITIVE, NEGATIVE or NEUTRAL sentiment.
- Subtask B: Given a tweet and a topic, classify the sentiment conveyed towards that topic on a two-point scale: POSITIVE vs. NEGATIVE.
- Subtask C: Given a tweet and a topic, classify the sentiment conveyed in the tweet towards that topic on a five-point scale: STRONGLYPOSITIVE, WEAKLYPOSITIVE, NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.
- Subtask D: Given a set of tweets about a topic, estimate the distribution of tweets across the POSITIVE and NEGATIVE classes.
- Subtask E: Given a set of tweets about a topic, estimate the distribution of tweets across the five classes: STRONGLYPOSITIVE, WEAKLYPOSITIVE, NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.
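A common baseline for the quantification subtasks D and E is "classify and count": run a tweet-level classifier over the set, then report the fraction of tweets assigned to each class. A sketch of the counting step, with the classifier's predictions taken as input:

```python
from collections import Counter

def estimate_distribution(predicted_labels):
    """Return the fraction of tweets assigned to each class (classify-and-count)."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: n / total for label, n in counts.items()}
```

For instance, `estimate_distribution(["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"])` yields a distribution of 0.75 POSITIVE and 0.25 NEGATIVE.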
Subtask A results:
Model | F1-score | Paper / Source |
---|---|---|
LSTMs+CNNs ensemble with multiple conv. ops (Cliche, 2017) | 0.685 | BB twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs |
Deep Bi-LSTM+attention (Baziotis et al., 2017) | 0.677 | DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis |
Sentihood is a dataset for targeted aspect-based sentiment analysis (TABSA), which aims to identify fine-grained polarity towards a specific aspect. The dataset consists of 5,215 sentences, 3,862 of which contain a single target, and the remainder multiple targets.
Dataset mirror: https://github.com/uclmr/jack/tree/master/data/sentihood
Model | Aspect (F1) | Sentiment (acc) | Paper / Source | Code |
---|---|---|---|---|
QACG-BERT (Wu and Ong, 2020) | 89.7 | 93.8 | Context-Guided BERT for Targeted Aspect-Based Sentiment Analysis | Official |
Sun et al. (2019) | 87.9 | 93.6 | Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence | Official |
Liu et al. (2018) | 78.5 | 91.0 | Recurrent Entity Networks with Delayed Memory Update for Targeted Aspect-based Sentiment Analysis | Official |
SenticLSTM (Ma et al., 2018) | 78.2 | 89.3 | Targeted Aspect-Based Sentiment Analysis via Embedding Commonsense Knowledge into an Attentive LSTM | |
LSTM-LOC (Saeidi et al., 2016) | 69.3 | 81.9 | Sentihood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods | |
The SemEval-2014 Task 4 contains two domain-specific datasets for laptops and restaurants, consisting of over 6K sentences with fine-grained aspect-level human annotations.
The task consists of the following subtasks:
- Subtask 1: Aspect term extraction
- Subtask 2: Aspect term polarity
- Subtask 3: Aspect category detection
- Subtask 4: Aspect category polarity
Preprocessed dataset: https://github.com/songyouwei/ABSA-PyTorch/tree/master/datasets/semeval14
https://github.com/howardhsu/BERT-for-RRC-ABSA (with both subtask 1 and subtask 2)
Subtask 1 results (SemEval-2014 Task 4 for Laptop and SemEval-2016 Task 5 for Restaurant):
Model | Laptop (F1) | Restaurant (F1) | Paper / Source | Code |
---|---|---|---|---|
ACE + fine-tune (Wang et al., 2020) | 87.4 | 81.3 | Automated Concatenation of Embeddings for Structured Prediction | Official |
BERT-PT (Xu et al., 2019) | 84.26 | 77.97 | BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis | Official |
DE-CNN (Xu et al., 2018) | 81.59 | 74.37 | Double Embeddings and CNN-based Sequence Labeling for Aspect Extraction | Official |
MIN (Li and Lam, 2017) | 77.58 | 73.44 | Deep Multi-Task Learning for Aspect Term Extraction with Memory Interaction | |
RNCRF (Wang et al., 2016) | 78.42 | 69.74 | Recursive Neural Conditional Random Fields for Aspect-based Sentiment Analysis | Official |
Subtask 2 results:
This is the same sentiment classification task where the given text is a review, but we are additionally given (a) the user who wrote the text and (b) the product the text is written about. There are three widely used datasets, introduced by Tang et al. (2015): IMDB, Yelp 2013, and Yelp 2014. Evaluation uses both accuracy and RMSE, but for brevity we only provide the accuracy here; see the papers for the RMSE values.
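Both metrics are straightforward to compute from gold and predicted ratings: accuracy counts exact matches, while RMSE treats the ratings as numeric values. A minimal sketch (function names ours):

```python
import math

def accuracy(gold, pred):
    """Fraction of predicted ratings that exactly match the gold ratings."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def rmse(gold, pred):
    """Root mean squared error, treating ratings as numeric values."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))
```

RMSE rewards predictions that are close to the true rating even when they are not exact, which is why both metrics are reported in this setting.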
A task related to sentiment analysis is subjectivity analysis, whose goal is to label an opinion as either subjective or objective.
The Subjectivity dataset contains 5,000 subjective and 5,000 objective processed sentences.
Model | Accuracy | Paper / Source |
---|---|---|
AdaSent (Zhao et al., 2015) | 95.50 | Self-Adaptive Hierarchical Sentence Model |
CNN+MCFA (Amplayo et al., 2018) | 94.80 | Translations as Additional Contexts for Sentence Classification |
Byte mLSTM (Radford et al., 2017) | 94.60 | Learning to Generate Reviews and Discovering Sentiment |
USE (Cer et al., 2018) | 93.90 | Universal Sentence Encoder |
Fast Dropout (Wang and Manning, 2013) | 93.60 | Fast Dropout Training |