---
title: "subreddit insights"
output: github_document
---
Load data:
```{r}
require(data.table)
setwd('/home/xk/github/xk/reddit_mining')
stats_link = fread(cmd = 'zstdcat subreddit_stats_link.csv.zst')
stats_text = fread(cmd = 'zstdcat subreddit_stats_text.csv.zst')
stats_text
```
I'm not 100% sure what the `""` subreddit is. The records it contains seem to be mostly spam; maybe they are malformed programmatic submissions? Because that data is tiny, I've uploaded the full records to this repo under `null_subreddit_*`:
```{sh eval=FALSE, include=FALSE}
sqlite-utils ~/lb/reddit/submissions.db 'select * from media where playlist_path is null' --csv > ~/github/xk/reddit_mining/null_subreddit_link.csv
sqlite-utils ~/lb/reddit/submissions.db 'select * from reddit_posts where playlist_path is null' --csv > ~/github/xk/reddit_mining/null_subreddit_text.csv
```
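If you want a quick look at those records in R, here's a minimal sketch (it just assumes the two CSVs exported above are sitting in the working directory and previews whatever columns they happen to contain):
```{r eval=FALSE}
# Peek at the "" subreddit records exported above (sketch; the columns are
# whatever the submissions.db tables contain)
null_link = fread('null_subreddit_link.csv')
null_text = fread('null_subreddit_text.csv')
str(null_link)
str(null_text)
```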
It seems like a lot of the data comes from user-profile subreddits, so I will remove those because:
1. they are less interesting
2. they would need to be interpreted differently in any analysis
3. I'm not sure how accurate they are, because my own user subreddit has only one record from four years ago. (Am I shadow banned??? I must be on some list somewhere because I also get 0 likes on bumble. oh well\~\~ The data analysis which [follows](https://youtube.com/watch?v=x8llJbEDA0g) is hopefully more interesting than my well-being)
```{r eval=FALSE}
fwrite(stats_link[subreddit %like% '^u_'], 'user_stats_link.csv')
fwrite(stats_text[subreddit %like% '^u_'], 'user_stats_text.csv')
fwrite(stats_link[!(subreddit %like% '^u_')], 'subreddit_stats_link.csv')
fwrite(stats_text[!(subreddit %like% '^u_')], 'subreddit_stats_text.csv')
```
The first thing we can see is that about half of all subreddits have only one post of either type.
The median subreddit has 3 link posts and 1 text post... although keep in mind that these variables were observed independently: a subreddit has to have at least one link post to show up in the stats_link data, and likewise for text posts and stats_text. It seems like it should be okay to compare the values because they are independent variables, but maybe it's not okay to combine values like that? I'm not sure. Statisticians, please let me know.
```{r}
summary(stats_link)
summary(stats_text)
```
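One way to sanity-check that worry is to combine the two counts per subreddit before taking the median. Here's a sketch, assuming the post-count column is named `COUNT` as in the later chunks, and counting a missing side as 0:
```{r eval=FALSE}
# Combine link and text post counts per subreddit, then look at the median of
# the total rather than the two medians separately.
combined = merge(stats_link[, .(subreddit, COUNT)],
                 stats_text[, .(subreddit, COUNT)],
                 by = 'subreddit', all = TRUE, suffixes = c('.link', '.text'))
combined[is.na(COUNT.link), COUNT.link := 0L]
combined[is.na(COUNT.text), COUNT.text := 0L]
median(combined$COUNT.link + combined$COUNT.text)
```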
You can see the count of subreddits which have a low number of submissions like this:
```{r}
data.table(sort(table(stats_text$count)))[N > 32000]
data.table(sort(table(stats_link$count)))[N > 96000]
```
Though that may only be because some subreddits have only "link" posts enabled or only "text" posts enabled. We will need to create a third dataset to evaluate that:
```{r}
setkey(stats_link, "subreddit")
setkey(stats_text, "subreddit")
stats_both = stats_link[stats_text, nomatch=FALSE]
data.table(sort(table(stats_both$COUNT + stats_both$i.COUNT)))[N > 32000]
```
That explains some of it, but there are still many subreddits with very few posts.
```{r}
full_outer_join = merge(stats_link, stats_text, all=TRUE)
stats_antijoin = full_outer_join[is.na(COUNT.x) | is.na(COUNT.y)]
rm(full_outer_join)
gc()
stats_antijoin
```
So far, we've figured out that there are 1,838,991 subreddits which exist in only one of the 'link' or 'text' datasets, and 1,786,105 subreddits which exist in both. This is not duplicate data per se; it just means that 1,838,991 subreddits have ONLY 'link' posts or ONLY 'text' posts, while 1,786,105 subreddits have both 'link' and 'text' posts.
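Those two figures can be read straight off the tables we just built (a quick sketch):
```{r eval=FALSE}
# Sanity check the figures quoted above
nrow(stats_antijoin)  # subreddits present in only one of the two datasets
nrow(stats_both)      # subreddits present in both
```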
Hypothetically, there are a large number of subreddits which people:
1. Accidentally created
2. Created with plans to promote and encourage activity, but which for whatever reason lost momentum or whose market didn't exist yet
3. Some combination of 1 and 2
Because of this, I feel I have enough reason, for this analysis, to ignore subreddits which have fewer than 600 total posts or fewer than 15 distinct posting users:
```{r}
stats_text = stats_text[COUNT >= 600 & count_authors >= 15]
stats_link = stats_link[COUNT >= 600 & count_authors >= 15]
```
I'll also remove subreddits where there wasn't much upvoting or commenting going on. I'm not sure if there are subreddits where comments are disabled; I imagine that would defeat the point of reddit, so I am assuming that feature does not exist:
```{r}
stats_text = stats_text[max_score >= 60 & max_comments >= 15]
stats_link = stats_link[max_score >= 60 & max_comments >= 15]
```
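A quick check of how much data survives these activity filters (just a sketch):
```{r eval=FALSE}
# How many subreddits remain after the activity filters
nrow(stats_link)
nrow(stats_text)
```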
To keep this analysis simple, I will also interpret the 'link' and 'text' datasets separately rather than trying to combine them using the 'antijoin' or 'both' data tables created in a previous step.
```{r}
rm(stats_antijoin, stats_both)
```
Now we have a much smaller dataset to work with. Let's transform the data types into something meaningful:
```{r}
require(lubridate)
# parse the epoch-second columns as datetimes
datetime_cols = c('oldest', 'avg_age', 'latest')
stats_link[, (datetime_cols) := lapply(.SD, as_datetime), .SDcols = datetime_cols]
stats_text[, (datetime_cols) := lapply(.SD, as_datetime), .SDcols = datetime_cols]
# represent the average edit delay as a lubridate duration
stats_link$avg_time_to_edit = dseconds(stats_link$avg_time_to_edit)
stats_text$avg_time_to_edit = dseconds(stats_text$avg_time_to_edit)
```
Unfortunately, it seems like `avg_time_to_edit` is still wrong even after I re-ran the 8-hour-long SQL query to re-compute it. I'm not sure what's going on there.
Also, I'm not sure what `time_modified` (from reddit_links.parquet, used to calculate `avg_time_to_edit`) means for link submissions. It's my understanding that only text posts can be edited, but maybe Reddit mods can rename the titles of link posts??
```{r}
summary(stats_text$avg_time_to_edit)
summary(stats_link$avg_time_to_edit)
```
```{sql eval=FALSE, include=FALSE}
@sqlite> select time_modified, time_modified - NULLIF(time_created, 0) from (select * from reddit_posts where ROWID in (select rowid from reddit_posts limit 100 offset 100000)) where time_modified > 0;
QUERY PLAN
|--SEARCH reddit_posts USING INTEGER PRIMARY KEY (rowid=?)
`--LIST SUBQUERY 1
`--SCAN reddit_posts
┌───────────────┬─────────────────────────────────────────┐
│ time_modified │ time_modified - NULLIF(time_created, 0) │
├───────────────┼─────────────────────────────────────────┤
│ 1261173996 │ 3600 │
│ 1261174153 │ 3600 │
│ 1261174163 │ 3600 │
│ 1261174184 │ 3600 │
│ 1261174392 │ 3600 │
│ 1261174497 │ 3600 │
│ 1261174650 │ 3600 │
│ 1261175381 │ 3600 │
│ 1261175580 │ 3600 │
│ 1261175747 │ 3600 │
│ 1261175936 │ 3600 │
│ 1261176163 │ 3600 │
│ 1261176268 │ 3600 │
│ 1261178286 │ 3600 │
│ 1261178640 │ 3600 │
│ 1261178731 │ 3600 │
│ 1261179736 │ 3600 │
│ 1261179887 │ 3600 │
│ 1261180343 │ 3600 │
│ 1261181721 │ 3600 │
│ 1261182220 │ 3600 │
│ 1261182299 │ 3600 │
│ 1261182347 │ 3600 │
└───────────────┴─────────────────────────────────────────┘
Run Time: real 0.130 user 0.002494 sys 0.059661
changes: 0 total_changes: 0
@sqlite> select time_modified, time_modified - NULLIF(time_created, 0) from (select * from reddit_posts where ROWID in (select rowid from reddit_posts limit 100 offset 1000000)) where time_modified > 0;
QUERY PLAN
|--SEARCH reddit_posts USING INTEGER PRIMARY KEY (rowid=?)
`--LIST SUBQUERY 1
`--SCAN reddit_posts
Progress 5356
Progress 5357
Progress 5358
Progress 5359
┌───────────────┬─────────────────────────────────────────┐
│ time_modified │ time_modified - NULLIF(time_created, 0) │
├───────────────┼─────────────────────────────────────────┤
│ 1 │ -1296786185 │
│ 1 │ -1296786154 │
│ 1 │ -1296786100 │
│ 1 │ -1296786064 │
│ 1 │ -1296786033 │
│ 1 │ -1296785911 │
│ 1 │ -1296785878 │
│ 1 │ -1296785728 │
│ 1 │ -1296785621 │
│ 1 │ -1296785595 │
│ 1 │ -1296785523 │
└───────────────┴─────────────────────────────────────────┘
Run Time: real 0.604 user 0.025196 sys 0.402265
changes: 0 total_changes: 0
```
Yeah, there's _definitely_ something weird going on and I'm too lazy to figure it out. Still, `avg_time_to_edit` seems like an interesting metric! I pass the baton to you.
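If you do pick it up, a possible starting point (purely a sketch: the file name and per-post columns below are assumptions, not something exported by this repo) is to drop the sentinel values visible in the output above, `time_modified == 1` and the constant 3600-second offset, before averaging:
```{r eval=FALSE}
# Hypothetical sketch: recompute time-to-edit per subreddit from a per-post
# export, ignoring the sentinel values seen in the sqlite output above.
# 'posts_with_edit_times.csv' and its columns are assumptions for illustration.
posts = fread('posts_with_edit_times.csv')
posts[time_modified > 1 & time_created > 0 & (time_modified - time_created) != 3600,
      .(avg_time_to_edit = mean(time_modified - time_created)), by = subreddit]
```
For now, I'll just drop the column: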
```{r}
stats_link[, avg_time_to_edit := NULL]
stats_text[, avg_time_to_edit := NULL]
```
Now for some EDA (exploratory data analysis). Even though the data is now relatively small, I'm going to sample 10,000 random rows to keep things snappy. I always like to develop hypotheses on one extract of the data and then validate them against another extract or against the full data.
```{r}
SL = stats_link[sample(.N, 10000), ]
ST = stats_text[sample(.N, 10000), ]
require(DataExplorer)
plot_qq(SL)
plot_qq(ST)
```
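In that spirit, anything spotted in `SL`/`ST` can be re-checked against a fresh, independent sample (a small sketch):
```{r eval=FALSE}
# Draw a second sample to validate anything noticed in the first one
SL2 = stats_link[sample(.N, 10000), ]
ST2 = stats_text[sample(.N, 10000), ]
plot_qq(SL2)
```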
```{r}
require(ggplot2)
```
Now we create some plots. The first interesting thing I found is that the average link-post title length seems to be trending down very slightly over the decade.
```{r}
ggplot(SL) +
aes(x = oldest, y = avg_length_title) +
geom_line(size = 0.05, colour = "#0C4C8A") +
labs(x = "Subreddit's first link post",
y = "Avg. Title Length") +
theme_minimal()
```
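To put a rough number on "very slightly", a quick linear fit (just a sketch, not a careful time-series model):
```{r eval=FALSE}
# Slope of average title length against the date of the subreddit's first link
# post; as.numeric(oldest) is in seconds, so the slope is in title-length units per second.
fit = lm(avg_length_title ~ as.numeric(oldest), data = SL)
coef(fit)
```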
Subreddits die every day (or they switch to text-post only):
```{r}
ggplot(SL) +
aes(x = latest, y = avg_length_title) +
geom_line(size = 0.05, colour = "#0C4C8A") +
labs(x = "Subreddit's most recent link post",
y = "Avg. Title Length") +
theme_minimal()
```
Subreddits with the highest author counts have been around for a while:
```{r}
ggplot(SL) +
aes(x = oldest, y = count_authors) +
geom_line(size = 0.3, colour = "#0C4C8A") +
labs(x = "Subreddit's first link post", y = "Count of authors") +
theme_minimal()
```
It also seems that newer subreddits are more likely to allow NSFW link posts.
```{r}
ggplot(SL) +
aes(x = avg_age, y = is_over_18_ratio) +
geom_line(size = 0.05, colour = "#112446") +
labs(x = "Average Subreddit Link Posts Age",
y = "Over 18 ratio") +
theme_minimal()
```
I'm bad at stats and graphs. I think it might be more interesting to sort both datasets and create a comprehensive list somehow, then call out some interesting subreddits so more people are aware of them; but I don't have time to go through 100,000 subreddits (unless some rich patron wants to donate a small fortune).
It's obvious that some subreddits are saturated with new posts. AskReddit, for example, has three times as many link posts as the second highest, a subreddit called "TalkativePeople" which I've never heard of?! I consider myself something of a subreddit connoisseur, but I'm surprised by just how many subreddits there actually are.
I think the number of posts should be part of the sorting. Sorting by most recent post will also help weed out recently banned or abandoned subreddits.
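The sorts below bin each metric into coarse tiers with `cut()` before ordering, so subreddits are ranked tier-by-tier on the most important metric first and ties are broken by the next one. A minimal sketch of the idea with made-up data:
```{r eval=FALSE}
# Tiered sorting: bin metric `a` into 2 tiers, order by tier (descending),
# then break ties within a tier by `b`.
x = data.table(a = c(5, 50, 500, 5000), b = c(4, 3, 2, 1))
x[order(-cut(a, 2), b)]
```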
### Out of millions of subreddits, these are the top 50,000 link post subreddits:
```{r}
S = stats_link[order(-cut(count_authors, 300), cut(avg_age, 2), -cut(latest, 2), -cut(max_comments, 30), -cut(COUNT, 200), -cut(avg_comments, 30))]
fwrite(S, file = 'top_link_subreddits.csv')
S
```
### Top 20,000 text post subreddits
```{r}
S = stats_text[order(-cut(count_authors, 300), cut(avg_age, 2), -cut(latest, 2), -cut(avg_comments, 30), -cut(COUNT, 50), -cut(max_comments, 30))]
fwrite(S, file = 'top_text_subreddits.csv')
S
```
### Subreddits with long link post titles
```{r}
S = stats_link[order(-cut(avg_length_title, 300), -cut(avg_length, 30), -cut(COUNT, 20), -cut(avg_comments, 30), -cut(latest, 2), cut(avg_age, 2))]
fwrite(S, file = 'long_link_subreddits.csv')
S
```
### Subreddits with long text posts:
```{r}
S = stats_text[order(-cut(avg_length, 6), -cut(COUNT, 20), -cut(avg_comments, 30), -cut(latest, 2), cut(avg_age, 2))]
fwrite(S, file = 'long_text_subreddits.csv')
S
```
### Top NSFW link subreddits:
```{r}
S = stats_link[order(-cut(latest, 4), -cut(is_over_18_ratio, 2), -cut(max_score, 12), -cut(count_authors, 15), cut(avg_score, 20), -cut(COUNT, 20), -cut(max_comments, 100) )]
fwrite(S, file = 'nsfw_link_subreddits.csv')
S
```
### Top NSFW text subreddits:
```{r}
S = stats_text[order(-cut(latest, 4), -cut(is_over_18_ratio, 2), -cut(max_score, 12), -cut(count_authors, 15), cut(avg_score, 20), -cut(COUNT, 20), -cut(max_comments, 100) )]
fwrite(S, file = 'nsfw_text_subreddits.csv')
S
```
## Downloads
You can download the sorted CSVs here: [https://github.com/chapmanjacobd/reddit_mining/tree/main#downloads](https://github.com/chapmanjacobd/reddit_mining/tree/main#downloads)
```{r include=FALSE}
# These subreddits share the same average post age, so they might be subreddits
# primarily used by bots, or it may just be a statistical coincidence (a math party).
stats_link[avg_age %in% data.table(sort(table(stats_link$avg_age)))[N>5]$V1]
stats_link[oldest %in% data.table(sort(table(stats_link$oldest)))[N>5]$V1]
# Subreddits where people edit their text posts unusually long after creating them
# (note: avg_time_to_edit was dropped above, so this currently returns NULL)
stats_text$avg_time_to_edit
```