Table of Contents
This project implements a sentiment analyzer with the Spark ML library. It uses tweets and retweets from a Twitter dataset for training and testing the model. The project is divided into several phases that build up the implementation step by step:
- Introduce Data
- Preprocess Data
- Simulate Streaming
- Introduce and Train model
- Use the Model in a Real-Time Scenario
- Further Work
The project uses Spark (SparkSQL) for data exploration, model training, and streaming, so Spark must be installed in order to run the notebooks. Note that the project is implemented entirely in Python, so PySpark must be installed for Spark to run the code.
$ pip install pyspark
The project also needs SparkSQL for its built-in DataFrame objects, which are used for querying the data and training the model. Note that the lower-level RDD API is not used in this project.
$ pip install pyspark[sql]
(optional) Finally, findspark can be installed to set the environment variables that let Jupyter run Spark code. (The environment variables can also be set manually, without findspark.)
$ pip install findspark
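If you prefer to skip findspark, you can export the variables yourself. A minimal sketch, assuming Spark is unpacked under /opt/spark (an example path, not one prescribed by this project):

```shell
export SPARK_HOME=/opt/spark                  # example install location
export PATH="$SPARK_HOME/bin:$PATH"           # make spark-submit etc. available
export PYSPARK_PYTHON=python3                 # interpreter PySpark workers should use
```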
Run all cells in order. Before running the stream_sentiment_analysis.ipynb notebook, please run the simulate_streaming.py script.
The data was gathered by Sentiment140 from the Twitter API as a CSV file (1,600,000 lines) with the emoticons removed. Each record has 6 fields:
- The polarity of the tweet (0 = negative, 4 = positive)
- The id of the tweet (2087)
- The date of the tweet (Sat May 16 23:58:44 UTC 2009)
- The query (lyx). If there is no query, then this value is NO_QUERY.
- The user that tweeted (robotickilldozr)
- The text of the tweet (Lyx is cool)
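To make the layout concrete, here is a minimal, Spark-free sketch of parsing one row in this format with the standard library; the column names and the binary label mapping are our own choices, not part of the dataset itself:

```python
import csv
import io

# Names for the 6 documented fields (labels are our own, not from the dataset).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# A sample row built from the example values in the field list above.
sample = '0,2087,"Sat May 16 23:58:44 UTC 2009",NO_QUERY,robotickilldozr,"Lyx is cool"\n'

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(COLUMNS, row))

# Map the 0/4 polarity codes to a binary label for training.
record["label"] = 1 if record["polarity"] == "4" else 0
print(record["user"], record["label"])  # → robotickilldozr 0
```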
For more details and downloading data, please visit here.
For this phase, the data_exploration.ipynb notebook is implemented and explained in detail. In addition, around 300,000 lines of the raw dataset were randomly sampled for use in the next phase (streaming simulation).
In this phase, the simulate_streaming.py script simulates real-time data generation: every second, a new CSV file is created, representing a newly fetched tweet. This interval is, of course, unrealistically long, so feel free to decrease it as much as possible to bring it closer to real time.
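The core of such a simulator can be sketched in a few lines of standard-library Python; the function name, file-naming scheme, and interval parameter below are illustrative, not the exact contents of simulate_streaming.py:

```python
import csv
import os
import tempfile
import time

def simulate_stream(rows, out_dir, interval=1.0):
    """Write each row as its own CSV file, one file every `interval` seconds."""
    for i, row in enumerate(rows):
        path = os.path.join(out_dir, f"tweet_{i:05d}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(row)
        time.sleep(interval)

# Example: emit two fake tweets with a short interval into a temp directory.
out_dir = tempfile.mkdtemp()
simulate_stream([["0", "hello"], ["4", "great day"]], out_dir, interval=0.01)
print(sorted(os.listdir(out_dir)))  # → ['tweet_00000.csv', 'tweet_00001.csv']
```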
In this phase, Logistic Regression is used to classify the tweets; the Spark ML library provides all of the machine learning algorithms. For implementation details, see the classifier_model.ipynb notebook.
In this phase, the stream_sentiment_analysis.ipynb notebook performs real-time classification of the incoming data every five seconds. This interval is, of course, unrealistically long, so feel free to decrease it as much as possible to bring it closer to real time.
⚠️ warning: Before running the stream_sentiment_analysis.ipynb notebook, please run the simulate_streaming.py script.
This is a small project intended simply as practice with Spark. In the real world, data should be in the range of gigabytes (or more) before Spark is worth using. Logistic Regression may also not be the best choice; feel free to try other models such as Naive Bayes, SVM, or even neural networks.
Also, all configurations are set to run the code locally, so to run it on a cluster you will need to change the configuration and possibly the partitioning strategy!
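For instance, switching from local mode to a cluster typically means changing the session builder along these lines; the master URL and resource values below are placeholders that depend entirely on your cluster, not settings from this project:

```python
from pyspark.sql import SparkSession

# Hypothetical standalone-cluster settings; adjust for your own deployment.
spark = (SparkSession.builder
         .master("spark://master-host:7077")        # instead of local[*]
         .config("spark.executor.memory", "4g")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())
```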