Clustering tweets using the cosine distance metric and the K-means clustering algorithm.
Data redundancy is an important problem on Twitter. Twitter users are likely to generate similar tweets (e.g., using the Retweet function) about popular topics and events.
The result is a huge number of tweets, and Tweetos do not want to lose time reading many tweets about the same topic.
By clustering similar tweets together, we can generate a more concise and organized representation of the raw tweets, which is very useful for busy Tweetos, who can then read only one tweet per cluster.
When a new tweet is added to the corpus, it must be easy to label it without performing the full clustering again.
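One way to label a new tweet without re-clustering is to compare it with the existing cluster centroids and pick the most similar one. The sketch below assumes the centroids and IDF weights were saved from an earlier clustering run; the names `label_new_tweet`, `idf`, and `centroids`, and all the numbers, are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_new_tweet(tokens, idf, centroids):
    """Return the index of the centroid most similar to the new tweet."""
    # Terms that were not seen during clustering are ignored.
    counts = Counter(t for t in tokens if t in idf)
    vector = {t: c * idf[t] for t, c in counts.items()}
    sims = [cosine_similarity(vector, c) for c in centroids]
    return max(range(len(centroids)), key=sims.__getitem__)


# Hypothetical values standing in for the output of an earlier clustering run.
idf = {"football": 1.5, "match": 1.5, "goal": 1.8, "vote": 1.5, "election": 1.5}
centroids = [
    {"football": 0.6, "match": 0.6, "goal": 0.5},  # cluster 0
    {"vote": 0.7, "election": 0.7},                # cluster 1
]

label = label_new_tweet("great goal in the football match".split(), idf, centroids)
```

Each new tweet costs one similarity computation per centroid, so the work per tweet depends on the number of clusters and not on the size of the corpus.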
Text mining / clustering / NLP / tweepy / NLTK / twitter API
- Get a Twitter API key
Try this link:
https://www.youtube.com/watch?v=vlvtqp44xoQ
- Install tweepy
!pip install tweepy
- Install NLTK
!pip install nltk
conda install -c anaconda nltk
- Install the stopwords corpus from the NLTK graphical downloader (run nltk.download())
The K-means algorithm was run with:
- Data representation method: TF-IDF
- Distance metric: cosine similarity
- k: 2 to 6 clusters (5 values)
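The setup above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the smoothed IDF formula, the farthest-first seeding, and the re-normalised centroids (a variant sometimes called spherical K-means) are assumptions, and the sample tweets are made up. Because every vector is scaled to unit length, choosing the most cosine-similar centroid is the same as choosing the nearest one in Euclidean distance.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Turn tokenised documents into unit-length TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: one common variant, assumed here for illustration.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [normalise({t: c * idf[t] for t, c in Counter(doc).items()})
            for doc in docs]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def initial_centroids(vectors, k):
    """Farthest-first seeding (an assumption; random seeding is also common)."""
    chosen = [0]
    while len(chosen) < k:
        rest = [i for i in range(len(vectors)) if i not in chosen]
        chosen.append(min(
            rest,
            key=lambda i: max(cosine(vectors[i], vectors[j]) for j in chosen),
        ))
    return [dict(vectors[i]) for i in chosen]


def kmeans_cosine(vectors, k, max_iter=50):
    """Cluster unit-length vectors; return (labels, centroids)."""
    centroids = initial_centroids(vectors, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: most cosine-similar centroid for each vector.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: sum the members, then rescale to unit length.
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # an empty cluster keeps its previous centroid
                centroids[c] = normalise(total)
    return labels, centroids


# Made-up sample tweets, already lower-cased and stripped of stop words.
tweets = [
    "goal match football team",
    "football team win match",
    "match goal football",
    "election vote president",
    "president vote campaign election",
    "vote election campaign",
]
vectors = tfidf_vectors([t.split() for t in tweets])

# Run K-means once for each k from 2 to 6.
results = {k: kmeans_cosine(vectors, k)[0] for k in range(2, 7)}
```

With the sample tweets, `results[2]` separates the football tweets from the election tweets. On real data the number of tweets must be at least k, and a quality measure is needed to choose among the values of k.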