Spark_Project--TwitterStreamWordCount

Spark Name : Bentic Sebastian

Project Title: To create a Naive Bayes Classifier in Spark to track Twitter Stream for Amazon complaints Question: How can I classify the nature of tweets connected to the company Amazon? Methods: Bernoulli Naive Bayes( 0 or 1) Multinomial Naive Bayes( range of values)

Introduction: I wanted to use Spark to analyse Twitter data. I received access to the Twitter API, and wrote a script(GatheringDataFromTwitter.ipynb) to generate a file with around 5000 tweets(result_Tweets_full). I searched for posts directed to '@amazonHelp', and collected all resulting posts. The tweets were interesting, since I could use a Naive Bayes Classifier to analyse the topics that were most frequently mentioned. This method would be useful for Amazon to monitor which segments of their company need to be improved Once I had the tweets, I wrote a script in PySpark (main.py) that manipulated the Twitter data.

Results: I have been able to collect the frequencies of words, after removing stop words etc. The results are saved in 'wordcount_16thDec.out' The most common words are “sending”, and “return”. Surpringly, “Receipt” also shows up. This may suggest that Amazon should focus on making sure that customers are receiving the correct receipts for their purchases, either electronically or in the packages. Although I wanted to create a complete Naive Bayes Classifier , I was not able to do so.

Future work: I would like to continue to implement this classifier, by calculating the Idf values and use the join command to find the tf-idf values. The Tf-Idf values are useful in vectorizing the data, in order to find the priors and use it for the Bayes Classifier.

Another future work I would like to be involved in is in creating a Sentiment Analysis algorithm, to predict which tweets are positive and negative.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitattributes		.gitattributes
GatheringDataFromTwitter.ipynb		GatheringDataFromTwitter.ipynb
README.docx		README.docx
README.md		README.md
WordCloud		WordCloud
main.py		main.py
result_Tweets_full.csv		result_Tweets_full.csv
stopWords_List.txt		stopWords_List.txt
wordcount_16thDec.out		wordcount_16thDec.out

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark_Project--TwitterStreamWordCount

About

Releases

Packages

Languages

bsebast2/Spark_Project--TwitterStreamWordCount

Folders and files

Latest commit

History

Repository files navigation

Spark_Project--TwitterStreamWordCount

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages