In this highly digitalised age, texting has become the preferred mode of communication for the current generation. However, texting has eroded the art of communication by leaving emotion out of the message. Consequently, text messages are often misinterpreted, depending on the perspectives of the sender and the recipient.
The main objective of this project is to apply what we learnt in elementary data science and machine learning to build a simple application based on the following key objectives:
- To predict the emotion of a text message at a reasonable accuracy.
- To provide the predicted probability of each emotion from the given sentence to account for cases where a multitude of emotions are present.
In addition, we identified a few directions in which our simple application could be developed further in the future:
- Sentiment analysis of customer reviews of online products.
- Sentiment analysis of IMDB ratings of movies.
- Fine-tuning of online dating profile-matching algorithms based on the general perception of emotion in a conversation.
- Exploratory Data Analysis on unstructured data (text) using Word Cloud.
- The concepts of Recall, Precision & F1-score.
- Logistic Regression, Linear Support Vector Machine & Naive Bayes Algorithm implementation in Machine Learning.
- Implementation of Cross-Validation Check.
- Implementation of an application's graphical user interface using Streamlit.
- Elementary Object-Oriented Programming during the standardization of functions & classes.
- Introduction to documentation writing.
- Collaboration using GitHub.
https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp by Praveen
| text | emotion |
| --- | --- |
| i didnt feel humiliated | sadness |
| i can go from feeling so hopeless to so damned hopeful just from being around... | sadness |
| im grabbing a minute to post i feel greedy wrong | anger |
| i am ever feeling nostalgic about the fireplace i will know that it is still... | love |

Note: text and emotion are separated by a semicolon ';'.
    i didnt feel humiliated;sadness
    i am feeling grouchy;anger
    ...
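As a rough illustration of reading this format, the sketch below parses semicolon-separated lines into (text, emotion) pairs using only the Python standard library. The function name `load_rows` and the in-memory sample are assumptions made for the example; the notebook may load the data differently.

```python
import csv
import io

# Two sample rows copied from the excerpt above. A real run would open the
# downloaded dataset file instead of this in-memory stand-in.
sample = io.StringIO(
    "i didnt feel humiliated;sadness\n"
    "i am feeling grouchy;anger\n"
)

def load_rows(handle):
    """Return a list of (text, emotion) tuples from a ';'-separated file."""
    # QUOTE_NONE keeps apostrophes and quote marks in messages as plain text.
    reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines that do not have exactly two fields.
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) == 2]

rows = load_rows(sample)
print(rows)
```

Keeping the rows as plain tuples makes it easy to hand them to whichever data-frame or modelling library is used afterwards.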
Lee Juin (Alias: @Neo-Zenith)
- Co-authored Text-Message Sentiment Analyser
- Documentation writing for README & Libraries Information
Kassim bin Mohamad Malaysia (Alias: @kassimmalaysia)
- Co-authored Text-Message Sentiment Analyser
- Presentation slides & scripts writing
Lee Ci Hui (Alias: @perfectsquare123)
- Co-authored Text-Message Sentiment Analyser
- Application design
The following libraries are used throughout the project.
Note: Word Cloud has not received any official support for Python 3.8.x and above. Thus, we used Word Cloud unofficial as our library instead. For Python 3.7.x and below, please refer to Word Cloud. However, do note that our project was run and tested on Python 3.8.x and above.
We have compiled a list of functions and classes that were useful during our project. They are used repeatedly within the project and can be found in Libraries.
Please read Libraries Information for the details of the functions and classes found within our custom library.
There appears to be a widespread, ongoing issue on GitHub in which outputs from Jupyter Notebook files are displayed incorrectly or not at all.
Replicable: Yes
Source of Issue: Most likely GitHub
Fixed: Yes
Comments: Please use an alternative IDE to inspect the main code sections. Visual Studio Code is known to work properly.
In certain scenarios, clicking into our Jupyter Notebook will not render the notebook completely, or the notebook is displayed inside a tiny scrollable box. While it is possible to read the entire notebook this way, it is highly inconvenient and certain visualisations will not be seen in their entirety.
Replicable: Yes
Source of Issue: Most likely due to the large file size of our notebook.
Fixed: No
Comments: Please refresh the notebook if the aforementioned error occurs. Otherwise, please use an alternative IDE to inspect the main code sections. Visual Studio Code is known to work properly.
Our code is divided into 3 main sections:
In this section, we import the necessary libraries as well as our training dataset. We also perform a simple analysis of the dataset to get a brief overview of the kind of data we are dealing with.
Please refer to Text-Message Sentiment Analyser for the details of our source code.
In this section, we perform a more in-depth analysis of our dataset. The analysis showed that the dataset required some cleaning, which we carried out in the following 3 phases:
- Lemmatization of words
- Removal of HTML tags and attributes
- Removal of stopwords
We use the NLTK library as our main dataset-cleaning library.
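The three cleaning phases can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual code: `STOPWORDS`, `TOY_LEMMAS`, `lemmatize` and `clean_text` are hypothetical stand-ins, and the order of the steps is an assumption. With NLTK installed and its corpora downloaded, the stopword set could come from `nltk.corpus.stopwords.words("english")` and the lemmatizer from `nltk.stem.WordNetLemmatizer().lemmatize`.

```python
import re

# Hypothetical stand-ins so the sketch runs without any downloads.
STOPWORDS = {"i", "am", "the", "a", "an", "to", "and", "is", "so"}
TOY_LEMMAS = {"feeling": "feel", "hopes": "hope"}

# Matches an HTML tag together with any attributes inside it.
TAG_RE = re.compile(r"<[^>]+>")

def lemmatize(word):
    """Toy lemmatizer: look the word up, fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

def clean_text(text, stopwords=STOPWORDS, lemmatizer=lemmatize):
    """Strip HTML tags, lemmatize each token, then drop stopwords."""
    text = TAG_RE.sub(" ", text)                    # remove HTML tags and attributes
    tokens = re.findall(r"[a-z']+", text.lower())   # simple lowercase tokeniser
    tokens = [lemmatizer(token) for token in tokens]
    return [token for token in tokens if token not in stopwords]

print(clean_text('i am <a href="x">feeling</a> so grouchy'))  # ['feel', 'grouchy']
```

Passing the stopword set and lemmatizer as parameters keeps the pipeline the same when the toy stand-ins are swapped for NLTK's.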
We use Word Cloud as our main data visualisation library.
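A word cloud is essentially a picture of word frequencies, so a useful first step is to count words per emotion. The sketch below does only the counting, with the standard library; the function name `word_frequencies_by_emotion` and the two sample rows are assumptions for illustration. A mapping like the one it returns for a single emotion is, as far as we know, the kind of input accepted by `WordCloud.generate_from_frequencies` in the `wordcloud` package.

```python
from collections import Counter, defaultdict

# Sample (text, emotion) rows taken from the dataset excerpt.
rows = [
    ("i didnt feel humiliated", "sadness"),
    ("i am feeling grouchy", "anger"),
]

def word_frequencies_by_emotion(rows):
    """Map each emotion to a Counter of the words seen under that emotion."""
    freqs = defaultdict(Counter)
    for text, emotion in rows:
        freqs[emotion].update(text.split())
    return freqs

freqs = word_frequencies_by_emotion(rows)
print(freqs["sadness"].most_common(3))
```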
Please refer to Text-Message Sentiment Analyser under Exploratory Data Analysis for the details of our source code.
In this section, we perform machine learning by using the following 3 models on our training dataset:
- Logistic Regression
- Naive Bayes Algorithm
- Linear Support Vector Machine
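To make the idea of a per-emotion probability concrete, here is a from-scratch sketch of one of the three model families, a multinomial Naive Bayes classifier with add-one smoothing. It is an illustration only and not the implementation used in the notebook; the class name `TinyNaiveBayes` and the last two training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each emotion to its normalised probability."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)            # class prior
            for word in text.split():
                if word in self.vocab:                       # ignore unseen words
                    score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long messages.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)

train = [
    ("i didnt feel humiliated", "sadness"),
    ("i am feeling grouchy", "anger"),
    ("i feel so hopeless", "sadness"),         # made-up example
    ("i am so angry and grouchy", "anger"),    # made-up example
]
model = TinyNaiveBayes().fit([t for t, _ in train], [e for _, e in train])
probs = model.predict_proba("feeling grouchy")
print(model.predict("feeling grouchy"), probs)
```

Returning the full probability dictionary, and not only the top label, is what lets an application report several emotions for one message.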
We then applied our trained models to the validation dataset and obtained their respective Precision, Recall and F1-score.
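For reference, the three metrics can be computed per emotion from true and predicted labels as in the sketch below. The function name `per_class_metrics` and the toy label lists are assumptions for illustration; the notebook may rely on a library routine.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for a multiclass problem."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted this label, but it was wrong
            fn[truth] += 1   # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Toy labels, made up for the example.
y_true = ["sadness", "sadness", "anger", "love"]
y_pred = ["sadness", "anger", "anger", "love"]
report = per_class_metrics(y_true, y_pred)
macro_f1 = sum(m["f1"] for m in report.values()) / len(report)
print(report)
print(macro_f1)
```

In the toy data, "sadness" has perfect precision but a recall of 0.5, because one sad message was predicted as "anger".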
We then ran a repeated k-fold cross-validation check on each model to determine the best of the three.
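Repeated k-fold cross-validation reshuffles the data several times and runs an ordinary k-fold split on each shuffle, so every sample is used for testing once per repeat. The sketch below generates only the index splits; the function name, parameter defaults and sample count are assumptions, and scikit-learn's `RepeatedKFold` offers a library version of the same idea.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # new shuffle for every repeat
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx

splits = list(repeated_kfold_indices(10, n_splits=5, n_repeats=3))
print(len(splits))  # 15 train/test splits: 5 folds x 3 repeats
```

A model would be trained and scored once per split, and the scores averaged across all splits before comparing models.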
Finally, we applied the best model to the test dataset.
Please refer to Text-Message Sentiment Analyser under Machine Learning for the details of our source code.
Special thanks to our Teaching Assistant, Ms. Song Nan, for providing valuable feedback and suggestions throughout the project.
Below are some links that we have used as references throughout the project:
- https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd
- https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c#:~:text=Both%20Naive%20Bayes%20and%20Logistic,was%20generated%20given%20the%20results.
- https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/#:~:text=Repeated%20k-fold%20cross-validation%20provides%20a%20way%20to%20improve,all%20folds%20from%20all%20runs
- https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
- https://towardsdatascience.com/nlp-part-3-exploratory-data-analysis-of-text-data-1caa8ab3f79d