🔍 A Comparative Study of Random Forest, Naive Bayes, and Transformer-Based Models for Emotion Classification
Authors
Mingshu Liu, Kaibo Zhang, and Alek Bedard
Affiliation: McGill University. This project was carried out under the supervision of Professors Isabeau Prémont-Schwarz and Reihaneh Rabbany as part of the coursework for COMP551 Applied Machine Learning.
This project evaluates traditional machine learning models and transformer-based architectures for emotion classification on the GoEmotions dataset. It explores how model architecture, data preprocessing, and hyperparameter tuning impact performance, particularly in handling rare emotions and imbalanced data.
Key contributions include:
- Analysis of traditional models (Random Forest and Naive Bayes).
- Evaluation of pre-trained and fine-tuned BERT and GPT-2 models.
- Investigation of attention mechanisms in transformer models.
- Recommendations for future improvements using advanced sampling and hybrid architectures.
The GoEmotions dataset consists of 58,000 Reddit comments labeled with 27 emotion categories plus a neutral class. The splits used are:
- Training Samples: 40,000
- Validation Samples: 10,000
- Test Samples: 8,000
Class imbalance is a notable challenge, with the majority class ("neutral") dominating the dataset. Minority classes like "grief" and "pride" require specialized techniques to improve model performance.
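One common way to counter this kind of imbalance is to weight each class inversely to its frequency during training. The sketch below is illustrative only: the function name `inverse_frequency_weights` and the toy label list are assumptions, not part of the project code, and the formula is the usual "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy label list (illustrative only, not the real GoEmotions distribution).
toy_labels = ["neutral"] * 6 + ["joy"] * 3 + ["grief"] * 1
weights = inverse_frequency_weights(toy_labels)

for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label}: {weight:.3f}")
```

Rare classes such as "grief" receive the largest weights, so their errors count for more in a weighted loss.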
Random Forest Baseline
- Leveraged bag-of-words representation for text features.
- Reached 99.61% training accuracy but only 54.12% test accuracy, indicating significant overfitting.
- Struggled with rare emotions due to shallow feature representations.
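A bag-of-words Random Forest baseline of this kind could be sketched with scikit-learn as follows. This is a minimal sketch, not the project's actual code: the toy texts, the two labels, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions chosen for illustration. Comparing training and test accuracy is how the overfitting gap noted above would show up.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical; stands in for the GoEmotions splits).
train_texts = ["i love this", "this is awful", "so happy today",
               "i hate it", "what a great day", "this is terrible"]
train_labels = ["joy", "anger", "joy", "anger", "joy", "anger"]
test_texts = ["love this day", "awful and terrible"]
test_labels = ["joy", "anger"]

# CountVectorizer builds the bag-of-words counts that feed the forest.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_texts, train_labels)

# A large gap between these two numbers signals overfitting.
train_acc = model.score(train_texts, train_labels)
test_acc = model.score(test_texts, test_labels)
print(f"train accuracy: {train_acc:.2%}, test accuracy: {test_acc:.2%}")
```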
Naive Bayes Model
- Tuned the smoothing hyperparameter (`alpha`) for optimal performance.
- Test accuracy: 44.49%, F1 score: 36.89%, and AUC: 0.8199.
- Highlighted limitations of the independence assumption in nuanced text classification.
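To make the role of `alpha` concrete, here is a small from-scratch multinomial Naive Bayes with additive smoothing, followed by a tiny search over candidate values. It is a sketch under assumptions: the helper names `train_nb` and `predict_nb`, the whitespace tokenizer, the toy data, and the candidate grid are all hypothetical and not taken from the project.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes; alpha > 0 is the additive smoothing term."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothing adds alpha to every word count, so unseen words keep nonzero probability.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, text):
    """Return the label with the highest log-posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.lower().split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data (hypothetical).
train_texts = ["i love this", "so happy today", "i hate this", "this is awful"]
train_labels = ["joy", "joy", "anger", "anger"]
val_texts, val_labels = ["love today", "awful hate"], ["joy", "anger"]

def val_accuracy(alpha):
    model = train_nb(train_texts, train_labels, alpha)
    return sum(predict_nb(model, t) == y for t, y in zip(val_texts, val_labels)) / len(val_labels)

# Pick the alpha with the best validation accuracy from a small candidate grid.
candidate_alphas = [0.1, 0.5, 1.0]
best_alpha = max(candidate_alphas, key=val_accuracy)
```

The per-word product in `predict_nb` is the independence assumption in action: each token contributes to the score without regard to its neighbors.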
BERT Pre-training and Fine-tuning
- Pre-trained BERT (without fine-tuning) reached only 3.33% test accuracy, close to the 1/28 ≈ 3.6% expected from random guessing over 28 classes.
- Fine-tuned BERT achieved accuracy: 63.03%, F1 score: 61.33%, and AUC: 0.9390.
- Attention analysis showed that the model captures token-level and contextual relationships well, but it still struggled with rare emotions.
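For readers unfamiliar with what an attention analysis inspects, the following NumPy sketch computes single-head scaled dot-product attention and returns the weight matrix, in which each row is a distribution over the tokens a given token attends to. It is a simplification and not the project's analysis code: the random embeddings are stand-ins, and queries, keys, and values are set equal to the input rather than produced by learned projections as in BERT.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["i", "am", "so", "happy"]
X = rng.normal(size=(len(tokens), 8))  # random stand-in embeddings

# Assumption: Q = K = V = X, skipping the learned projection matrices.
output, weights = scaled_dot_product_attention(X, X, X)

# Row i shows how strongly token i attends to every token in the sentence.
for token, row in zip(tokens, weights):
    print(token, np.round(row, 2))
```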
GPT-2 Evaluation
- Fine-tuned GPT-2 achieved accuracy: 59.59%, F1 score: 58.29%, and AUC: 0.9186.
- Demonstrated improvements over pre-trained performance but exhibited signs of overfitting.
| Model | Test Accuracy | F1 Score | AUC |
|---|---|---|---|
| Random Forest | 54.12% | 46.78% | 0.8426 |
| Naive Bayes | 44.49% | 36.89% | 0.8199 |
| Fine-tuned BERT | 63.03% | 61.33% | 0.9390 |
| Fine-tuned GPT-2 | 59.59% | 58.29% | 0.9186 |
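The table reports F1 below accuracy for every model, which is the pattern one would expect when rare classes are missed. The sketch below shows how that can happen, using plain-Python accuracy and macro-averaged F1 on toy predictions. The averaging scheme is an assumption, since the source does not state which F1 variant was used, and the labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy predictions (hypothetical): the single "grief" sample is misclassified.
y_true = ["neutral", "neutral", "neutral", "joy", "grief"]
y_pred = ["neutral", "neutral", "neutral", "joy", "neutral"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy: {acc:.4f}, macro F1: {f1:.4f}")
```

Missing the one rare-class sample costs a single point of accuracy but zeroes out that class's F1, which pulls the macro average down sharply.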
- Class Imbalance: The skewed label distribution significantly hurts rare-emotion detection. Techniques such as data augmentation are needed to improve it.
- BERT's Attention Mechanism: Offers superior performance in capturing semantic relationships but struggles with ambiguous or polarizing sentences.
- GPT-2 Performance: Shows potential for nuanced emotion detection but is prone to overfitting.
Future Directions:
- Incorporate hybrid architectures (e.g., CNN-Transformer).
- Experiment with advanced sampling techniques.
- Utilize interpretability tools like Grad-CAM to enhance model insights.
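As a starting point for the sampling direction, random oversampling is the simplest member of that family: minority-class samples are duplicated until every class matches the largest one. The sketch below is an assumption-laden illustration, not a proposal from the project itself. The function name `random_oversample` and the toy data are hypothetical, and more advanced schemes would replace the duplication step.

```python
import random
from collections import Counter, defaultdict

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy training split (hypothetical).
texts = ["ok", "fine", "sure", "alright", "so proud of you", "rest in peace"]
labels = ["neutral", "neutral", "neutral", "neutral", "pride", "grief"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that validation and test metrics still reflect the original class distribution.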