Skip to content

Using R's tidytext package to inspect sentiment of Reddit comments for Smithsonianmag.com.

Notifications You must be signed in to change notification settings

aleszu/reddit-sentiment-analysis

Repository files navigation

Sentiment analysis of Reddit comments using R's tidytext package

In February 2018, Felippe Rodrigues and I published a story at Smithsonianmag.com about brain-boosting substances that are at once banned in the Olympics and popular in the tech world. To understand the size and tenor of the conversation surrounding these so-called "nootropics" - think the pill from the movie Limitless - we used R's tidytext package to analyze more than 150,000 Reddit comments scraped using Python. RMarkdown file here.

Here's how we did it.

First, load the following packages.

# for data wrangling
library(tidyr)
library(stringr)
library(magrittr)
library(dplyr)
library(lubridate)

# for sentiment analysis
library(tidytext)

# for visualization
library(ggplot2)
library(ggridges)

Load in the data

Next, load in the 164,000 comments we scraped from the subreddits r/Nootropics and r/StackAdvice. By using glimpse() we can take a peek at the tibble in RStudio's console.

All_comments <- read.csv("reddit/all_Noot_Stack_comments.csv", 
                         header=TRUE, stringsAsFactors=FALSE) 
All_comments %>% glimpse()

img1

Next, we's filtered these Reddit comments by substance and then exported each as a CSV for manual exploration in Excel using write.csv().

# Get caffeine comments
All_caffeine_comments <- All_comments %>%
  filter(str_detect(body, "caffeine")) %>% glimpse()
write.csv((All_caffeine_comments), "reddit/substances/all_caffeine_mentions.csv")

# Get theanine comments
All_theanine_comments <- All_comments %>%
  filter(str_detect(body, "theanine")) %>% glimpse()
write.csv((All_theanine_comments), "reddit/substances/all_theanine_mentions.csv")

# Get piracetam comments
All_piracetam_comments <- All_comments %>%
  filter(str_detect(body, "piracetam")) %>% glimpse()
write.csv((All_piracetam_comments), "reddit/substances/all_piracetam_mentions.csv")

# Get noopept comments
All_noopept_comments <- All_comments %>%
  filter(str_detect(body, "noopept")) %>% glimpse()
write.csv((All_noopept_comments), "reddit/substances/all_noopept_mentions.csv")

# Get modafinil comments
All_modafinil_comments <- All_comments %>%
  filter(str_detect(body, "modafinil")) %>% glimpse()
write.csv((All_modafinil_comments), "reddit/substances/all_modafinil_mentions.csv")

# Get phenibut comments
All_phenibut_comments <- All_comments %>%
  filter(str_detect(body, "phenibut")) %>% glimpse()
write.csv((All_phenibut_comments), "reddit/substances/all_phenibut_mentions.csv")

# Get bacopa comments
All_bacopa_comments <- All_comments %>%
  filter(str_detect(body, "bacopa")) %>% glimpse()
write.csv((All_bacopa_comments), "reddit/substances/all_bacopa_mentions.csv")

In Excel, we surveyed each of these substances, adding a corresponding substance column to each csv. We then pasted all of these into one spreadsheet, which we named all_substance_mentions.csv.

This could have easily been done with the dplyr R package, too, FYI.

Visualize substance mentions over time

First, we load in our csv with the all Reddit substance mentions

TotalMentions <- read.csv("reddit/all_substance_mentions.csv", 
                          header=TRUE, stringsAsFactors=FALSE) 
TotalMentions %>% glimpse()

We need to fix the date using lubridate and then breakdown the data by month. This will make a much more presentable plot.

TotalMentions$date <- ymd(TotalMentions$date)
TotalMentions$Month <- as.Date(cut(TotalMentions$date,
                                   breaks = "month"))
TotalMentions %>% 
  arrange(date) %>% glimpse() 

img2

Finally, we'll use the ggplot2 visualization package to plot mentions of each substance over time. Note: facet_grid breaks the plot out into small plots for each substance.

ggplot(TotalMentions, aes(Month)) + geom_histogram(aes(fill=factor(substance)), stat = "count") +
  facet_grid(~substance)

img3

We also wondered if we could stack these substance mentions over time using what some call the "Joy Division" plot. It comes with the ggridges package.

ggplot(TotalMentions, aes(x = Month, y = substance)) + geom_density_ridges()

img4

We also tried visualizing using some other plots.

ggplot(TotalMentions, aes(Month, color=substance)) + geom_histogram(aes(binwidth=0.01), stat = "bin") +
  facet_grid(~substance)
ggplot(TotalMentions, aes(Month)) + geom_histogram(aes(fill=factor(substance)), binwidth="40", stat = "count") + facet_grid(~substance)
ggplot(TotalMentions, aes(TotalMentions$Month, color=substance)) + geom_freqpoly(aes(binwidth=0.01), stat="bin") 

img5

ggplot(TotalMentions, aes(time)) + geom_histogram(aes(fill=factor(substance)), stat = "count")

Sentiment analysis

We employed sentiment analysis using the "tidytext" R package on our CSV file of mentions of all seven substances.

First, load the AFINN sentiment analysis library that comes with "tidytext"

AFINN <- sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)

Tokenize comments and merge with words ranked by sentiment

Next, tokenize comments into one-word rows and cut out stop words. Then join AFINN-scored words with words in comments, if present. Next, return a 114,000-row tibble with X1, substance, word and sentiment score. (Uncomment the write.csv() line to create a CSV of this tibble.)

TotalMentionsComments <- TotalMentions

all_sentiment <- TotalMentionsComments %>%
  select(body, X1, time, substance, date) %>%
  unnest_tokens(word, body) %>%
  anti_join(stop_words) %>%
  inner_join(AFINN, by = "word") %>%
  group_by(X1, substance, word) %>%
  summarize(sentiment = mean(afinn_score))

all_sentiment

# write.csv((all_sentiment), "reddit/all_sentiment.csv")

img6

Visualize sentiment analysis

Let's see what we've got. Using ggplot2 we will plot words by sentiment and frequency, with dot size representing the frequency of words. Add a geom_hline to show the average sentiment.

ggplot(all_sentiment, aes(x=substance, y = all_sentiment$sentiment)) +
  geom_point() +
  geom_count() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) + geom_hline(yintercept = mean(all_sentiment$sentiment), color = "red", lty = 2)

img7

To eventually plot sentiment vs. word frequency, we will need to count word occurences and then merge these two dataframes.

all_sentiment_wordcount <- all_sentiment %>%
  select(X1, substance, word, sentiment) %>%
  group_by(word) %>%
  tally()

Bind_sent_and_word <- all_sentiment %>%
  full_join(all_sentiment_wordcount, by="word")

img8

Ok, now we're ready to plot sentiment of all substances vs. word frequency, using facet_wrap to split up the charts by substance.

ggplot(Bind_sent_and_word, aes(y=n, x=sentiment, color=substance)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1, hjust = 1) +
  geom_hline(yintercept = mean(Bind_sent_and_word$sentiment), color = "red", lty = 2) +
  facet_wrap(~substance)

Since our Smithsonian story is only about modafinil, piracetam and noopept, let's filter by those three substances.

Filtered_sent_vs_word <- Bind_sent_and_word %>%
  filter(substance == "modafinil" | substance == "piracetam" | substance == "noopept") %>% glimpse()

Visualize our publication-ready chart

The graph we included in the story plots word frequency vs. sentiment and colors the points by substance and labels them by word.

ggplot(Filtered_sent_vs_word, aes(y=n, x=sentiment, color=substance)) +
  geom_point() +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.1, hjust = 1.1) 

img9

We exported an SVG, brought it into Adobe Illustrator and designed it up.

That's it!

img10

About

Using R's tidytext package to inspect sentiment of Reddit comments for Smithsonianmag.com.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published