Table of Contents
This project implements a sentiment analyzer with the Spark ML library. It uses tweets and retweets from a Twitter dataset for training and testing the model. The project is divided into several phases that build up the implementation step by step:
- Introduce Data
- Preprocess Data
- Simulate Streaming
- Introduce and Train model
- Use the Model in a Real-Time Scenario
- Further Work
The project uses Spark (SparkSQL) for data exploration, model training, and streaming, so Spark must be installed in order to run the notebooks. Note that the project is implemented entirely in Python, so PySpark must be installed for Spark to run the code.
$ pip install pyspark
The project also needs SparkSQL for its built-in DataFrame objects, which are used for querying the data and training the model. Note that the lower-level RDD API is not used in this project.
$ pip install pyspark[sql]
(optional) Finally, findspark can be installed to set the environment variables that let Jupyter run Spark code. (The environment variables can also be set manually, without findspark.)
$ pip install findspark
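If you prefer to skip findspark, you can export the variables yourself. A minimal sketch, assuming Spark is unpacked under /opt/spark (an example path, not one prescribed by this project):

```shell
export SPARK_HOME=/opt/spark                  # example install location
export PATH="$SPARK_HOME/bin:$PATH"           # make spark-submit etc. available
export PYSPARK_PYTHON=python3                 # interpreter PySpark workers should use
```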
Run all cells in order. Before running the stream_sentiment_analysis.ipynb notebook, please run the simulate_streaming.py script.
The data was gathered by Sentiment140 from the Twitter API as a CSV file (1,600,000 lines) with the emoticons removed. Each record has 6 fields:
- The polarity of the tweet (0 = negative, 4 = positive)
- The id of the tweet (2087)
- The date of the tweet (Sat May 16 23:58:44 UTC 2009)
- The query (lyx). If there is no query, then this value is NO_QUERY.
- The user that tweeted (robotickilldozr)
- The text of the tweet (Lyx is cool)
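To make the layout concrete, here is a minimal, Spark-free sketch of parsing one row in this format with the standard library; the column names and the binary label mapping are our own choices, not part of the dataset itself:

```python
import csv
import io

# Names for the 6 documented fields (labels are our own, not from the dataset).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

# A sample row built from the example values in the field list above.
sample = '0,2087,"Sat May 16 23:58:44 UTC 2009",NO_QUERY,robotickilldozr,"Lyx is cool"\n'

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(COLUMNS, row))

# Map the 0/4 polarity codes to a binary label for training.
record["label"] = 1 if record["polarity"] == "4" else 0
print(record["user"], record["label"])  # → robotickilldozr 0
```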
For more details and downloading data, please visit here.
For this phase, the data_exploration.ipynb notebook is implemented and explained in detail. In addition, around 300,000 lines of the raw dataset were randomly sampled for use in the next phase (streaming simulation).
In this phase, the simulate_streaming.py script simulates real-time data generation: every second, a new CSV file is created, representing a newly fetched tweet. This interval is, of course, unrealistically long, so feel free to decrease it as much as possible to bring it closer to real time.
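The core of such a simulator can be sketched in a few lines of standard-library Python; the function name, file-naming scheme, and interval parameter below are illustrative, not the exact contents of simulate_streaming.py:

```python
import csv
import os
import tempfile
import time

def simulate_stream(rows, out_dir, interval=1.0):
    """Write each row as its own CSV file, one file every `interval` seconds."""
    for i, row in enumerate(rows):
        path = os.path.join(out_dir, f"tweet_{i:05d}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(row)
        time.sleep(interval)

# Example: emit two fake tweets with a short interval into a temp directory.
out_dir = tempfile.mkdtemp()
simulate_stream([["0", "hello"], ["4", "great day"]], out_dir, interval=0.01)
print(sorted(os.listdir(out_dir)))  # → ['tweet_00000.csv', 'tweet_00001.csv']
```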
In this phase, Logistic Regression is used to classify the tweets; the Spark ML library provides all of the machine learning algorithms. For implementation details, see the classifier_model.ipynb notebook.
In this phase, the stream_sentiment_analysis.ipynb notebook performs real-time classification of the incoming data every five seconds. This interval is, of course, unrealistically long, so feel free to decrease it as much as possible to bring it closer to real time.
⚠️ warning: Before running the stream_sentiment_analysis.ipynb notebook, please run the simulate_streaming.py script.
This is a small project intended simply as practice with Spark. In the real world, data should be in the range of gigabytes (or more) before Spark is worth using. Logistic Regression may also not be the best choice; feel free to try other models such as Naive Bayes, SVM, or even neural networks.
Also, all configurations are set to run the code locally, so to run it on a cluster you will need to change the configuration and possibly the partitioning strategy!
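For instance, switching from local mode to a cluster typically means changing the session builder along these lines; the master URL and resource values below are placeholders that depend entirely on your cluster, not settings from this project:

```python
from pyspark.sql import SparkSession

# Hypothetical standalone-cluster settings; adjust for your own deployment.
spark = (SparkSession.builder
         .master("spark://master-host:7077")        # instead of local[*]
         .config("spark.executor.memory", "4g")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())
```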