---
title: "subreddit insights"
output: github_document
---
Load data:
```{r}
require(data.table)
setwd('/home/xk/github/xk/reddit_mining')
stats_link = fread(cmd = 'zstdcat subreddit_stats_link.csv.zst')
stats_text = fread(cmd = 'zstdcat subreddit_stats_text.csv.zst')
stats_text
```
I'm not 100% sure what the `""` subreddit is. The records it contains seem to be mostly spam; maybe they are malformed programmatic submissions? Because that data is tiny, I've uploaded the full records to this repo under `null_subreddit_*`:
```{sh eval=FALSE, include=FALSE}
sqlite-utils ~/lb/reddit/submissions.db 'select * from media where playlist_path is null' --csv > ~/github/xk/reddit_mining/null_subreddit_link.csv
sqlite-utils ~/lb/reddit/submissions.db 'select * from reddit_posts where playlist_path is null' --csv > ~/github/xk/reddit_mining/null_subreddit_text.csv
```
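If you want a quick look at those records in R, here's a minimal sketch (it just assumes the two CSVs exported above are sitting in the working directory and previews whatever columns they happen to contain):
```{r eval=FALSE}
# Peek at the "" subreddit records exported above (sketch; the columns are
# whatever the submissions.db tables contain)
null_link = fread('null_subreddit_link.csv')
null_text = fread('null_subreddit_text.csv')
str(null_link)
str(null_text)
```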
It seems like a lot of the data comes from user-profile subreddits, so I will remove those because:
1. they are less interesting
2. they would need to be interpreted differently in any analysis
3. I'm not sure how accurate they are, because my own user subreddit has only one record from four years ago. (Am I shadow banned??? I must be on some list somewhere because I also get 0 likes on bumble. oh well\~\~ The data analysis which [follows](https://youtube.com/watch?v=x8llJbEDA0g) is hopefully more interesting than my well-being)
```{r eval=FALSE}
fwrite(stats_link[subreddit %like% '^u_'], 'user_stats_link.csv')
fwrite(stats_text[subreddit %like% '^u_'], 'user_stats_text.csv')
fwrite(stats_link[!(subreddit %like% '^u_')], 'subreddit_stats_link.csv')
fwrite(stats_text[!(subreddit %like% '^u_')], 'subreddit_stats_text.csv')
```
The first thing we can see is that about half of all subreddits have only one post of either type.
The median subreddit has 3 link posts and 1 text post... although keep in mind that these variables were observed independently: a subreddit has to have at least one link post to show up in the stats_link data, and likewise for text posts and stats_text. It seems like it should be okay to compare the values because they are independent variables, but maybe it's not okay to combine values like that? I'm not sure. Statisticians, please let me know.
```{r}
summary(stats_link)
summary(stats_text)
```
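One way to sanity-check that worry is to combine the two counts per subreddit before taking the median. Here's a sketch, assuming the post-count column is named `COUNT` as in the later chunks, and counting a missing side as 0:
```{r eval=FALSE}
# Combine link and text post counts per subreddit, then look at the median of
# the total rather than the two medians separately.
combined = merge(stats_link[, .(subreddit, COUNT)],
                 stats_text[, .(subreddit, COUNT)],
                 by = 'subreddit', all = TRUE, suffixes = c('.link', '.text'))
combined[is.na(COUNT.link), COUNT.link := 0L]
combined[is.na(COUNT.text), COUNT.text := 0L]
median(combined$COUNT.link + combined$COUNT.text)
```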
You can see the count of subreddits which have a low number of submissions like this:
```{r}
data.table(sort(table(stats_text$count)))[N > 32000]
data.table(sort(table(stats_link$count)))[N > 96000]
```
Though that may only be because some subreddits have only "link" posts enabled or only "text" posts enabled. We will need to create a third dataset to evaluate that:
```{r}
setkey(stats_link, "subreddit")
setkey(stats_text, "subreddit")
stats_both = stats_link[stats_text, nomatch=FALSE]
data.table(sort(table(stats_both$COUNT + stats_both$i.COUNT)))[N > 32000]
```
That explains some of it, but there are still many subreddits with very few posts.
```{r}
full_outer_join = merge(stats_link, stats_text, all=TRUE)
stats_antijoin = full_outer_join[is.na(COUNT.x) | is.na(COUNT.y)]
rm(full_outer_join)
gc()
stats_antijoin
```
So far, we've figured out that there are 1,838,991 subreddits which exist in only one of the 'link' or 'text' datasets, and 1,786,105 subreddits which exist in both. This is not duplicate data per se; it just means that 1,838,991 subreddits have ONLY 'link' posts or ONLY 'text' posts, while 1,786,105 subreddits have both 'link' and 'text' posts.
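Those two figures can be read straight off the tables we just built (a quick sketch):
```{r eval=FALSE}
# Sanity check the figures quoted above
nrow(stats_antijoin)  # subreddits present in only one of the two datasets
nrow(stats_both)      # subreddits present in both
```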
Hypothetically, there are a large number of subreddits which people:
1. Accidentally created
2. Created with plans to promote and encourage activity, but which for whatever reason lost momentum or whose market didn't exist yet
3. Some combination of 1 and 2
Because of this, I feel I have enough reason, for this analysis, to ignore subreddits which have fewer than 600 total posts or fewer than 15 distinct posting users:
```{r}
stats_text = stats_text[COUNT >= 600 & count_authors >= 15]
stats_link = stats_link[COUNT >= 600 & count_authors >= 15]
```
I'll also remove subreddits where there wasn't much upvoting or commenting going on. I'm not sure if there are subreddits where comments are disabled; I imagine that would defeat the point of reddit, so I am assuming that feature does not exist:
```{r}
stats_text = stats_text[max_score >= 60 & max_comments >= 15]
stats_link = stats_link[max_score >= 60 & max_comments >= 15]
```
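A quick check of how much data survives these activity filters (just a sketch):
```{r eval=FALSE}
# How many subreddits remain after the activity filters
nrow(stats_link)
nrow(stats_text)
```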
To keep this analysis simple, I will also interpret the 'link' and 'text' datasets separately rather than trying to combine them using the 'antijoin' or 'both' data tables created in a previous step.
```{r}
rm(stats_antijoin, stats_both)
```
Now we have a much smaller dataset to work with. Let's transform the data types into something meaningful:
```{r}
require(lubridate)
# parse the epoch-second columns as datetimes
datetime_cols = c('oldest', 'avg_age', 'latest')
stats_link[, (datetime_cols) := lapply(.SD, as_datetime), .SDcols = datetime_cols]
stats_text[, (datetime_cols) := lapply(.SD, as_datetime), .SDcols = datetime_cols]
# represent the average edit delay as a lubridate duration
stats_link$avg_time_to_edit = dseconds(stats_link$avg_time_to_edit)
stats_text$avg_time_to_edit = dseconds(stats_text$avg_time_to_edit)
```
Unfortunately, it seems like `avg_time_to_edit` is still wrong even after I re-ran the 8-hour-long SQL query to re-compute it. I'm not sure what's going on there.
Also, I'm not sure what `time_modified` (from reddit_links.parquet, used to calculate `avg_time_to_edit`) means for link submissions. It's my understanding that only text posts can be edited, but maybe Reddit mods can rename the titles of link posts??
```{r}
summary(stats_text$avg_time_to_edit)
summary(stats_link$avg_time_to_edit)
```
```{sql eval=FALSE, include=FALSE}
@sqlite> select time_modified, time_modified - NULLIF(time_created, 0) from (select * from reddit_posts where ROWID in (select rowid from reddit_posts limit 100 offset 100000)) where time_modified > 0;
QUERY PLAN
|--SEARCH reddit_posts USING INTEGER PRIMARY KEY (rowid=?)
`--LIST SUBQUERY 1
`--SCAN reddit_posts
┌───────────────┬─────────────────────────────────────────┐
│ time_modified │ time_modified - NULLIF(time_created, 0) │
├───────────────┼─────────────────────────────────────────┤
│ 1261173996 │ 3600 │
│ 1261174153 │ 3600 │
│ 1261174163 │ 3600 │
│ 1261174184 │ 3600 │
│ 1261174392 │ 3600 │
│ 1261174497 │ 3600 │
│ 1261174650 │ 3600 │
│ 1261175381 │ 3600 │
│ 1261175580 │ 3600 │
│ 1261175747 │ 3600 │
│ 1261175936 │ 3600 │
│ 1261176163 │ 3600 │
│ 1261176268 │ 3600 │
│ 1261178286 │ 3600 │
│ 1261178640 │ 3600 │
│ 1261178731 │ 3600 │
│ 1261179736 │ 3600 │
│ 1261179887 │ 3600 │
│ 1261180343 │ 3600 │
│ 1261181721 │ 3600 │
│ 1261182220 │ 3600 │
│ 1261182299 │ 3600 │
│ 1261182347 │ 3600 │
└───────────────┴─────────────────────────────────────────┘
Run Time: real 0.130 user 0.002494 sys 0.059661
changes: 0 total_changes: 0
@sqlite> select time_modified, time_modified - NULLIF(time_created, 0) from (select * from reddit_posts where ROWID in (select rowid from reddit_posts limit 100 offset 1000000)) where time_modified > 0;
QUERY PLAN
|--SEARCH reddit_posts USING INTEGER PRIMARY KEY (rowid=?)
`--LIST SUBQUERY 1
`--SCAN reddit_posts
Progress 5356
Progress 5357
Progress 5358
Progress 5359
┌───────────────┬─────────────────────────────────────────┐
│ time_modified │ time_modified - NULLIF(time_created, 0) │
├───────────────┼─────────────────────────────────────────┤
│ 1 │ -1296786185 │
│ 1 │ -1296786154 │
│ 1 │ -1296786100 │
│ 1 │ -1296786064 │
│ 1 │ -1296786033 │
│ 1 │ -1296785911 │
│ 1 │ -1296785878 │
│ 1 │ -1296785728 │
│ 1 │ -1296785621 │
│ 1 │ -1296785595 │
│ 1 │ -1296785523 │
└───────────────┴─────────────────────────────────────────┘
Run Time: real 0.604 user 0.025196 sys 0.402265
changes: 0 total_changes: 0
```
Yeah, there's _definitely_ something weird going on and I'm too lazy to figure it out. Still, `avg_time_to_edit` seems like an interesting metric! I pass the baton to you.
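If you do pick it up, a possible starting point (purely a sketch: the file name and per-post columns below are assumptions, not something exported by this repo) is to drop the sentinel values visible in the output above, `time_modified == 1` and the constant 3600-second offset, before averaging:
```{r eval=FALSE}
# Hypothetical sketch: recompute time-to-edit per subreddit from a per-post
# export, ignoring the sentinel values seen in the sqlite output above.
# 'posts_with_edit_times.csv' and its columns are assumptions for illustration.
posts = fread('posts_with_edit_times.csv')
posts[time_modified > 1 & time_created > 0 & (time_modified - time_created) != 3600,
      .(avg_time_to_edit = mean(time_modified - time_created)), by = subreddit]
```
For now, I'll just drop the column: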
```{r}
stats_link[, avg_time_to_edit := NULL]
stats_text[, avg_time_to_edit := NULL]
```
Now for some EDA (exploratory data analysis). Even though the data is now relatively small, I'm going to sample 10,000 random rows to keep things snappy. I always like to develop hypotheses on one extract of the data and then validate them against another extract or against the full data.
```{r}
SL = stats_link[sample(.N, 10000), ]
ST = stats_text[sample(.N, 10000), ]
require(DataExplorer)
plot_qq(SL)
plot_qq(ST)
```
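In that spirit, anything spotted in `SL`/`ST` can be re-checked against a fresh, independent sample (a small sketch):
```{r eval=FALSE}
# Draw a second sample to validate anything noticed in the first one
SL2 = stats_link[sample(.N, 10000), ]
ST2 = stats_text[sample(.N, 10000), ]
plot_qq(SL2)
```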
```{r}
require(ggplot2)
```
Now we create some plots. The first interesting thing I found is that the average link-post title length seems to be trending down very slightly over the decade.
```{r}
ggplot(SL) +
aes(x = oldest, y = avg_length_title) +
geom_line(size = 0.05, colour = "#0C4C8A") +
labs(x = "Subreddit's first link post",
y = "Avg. Title Length") +
theme_minimal()
```
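To put a rough number on "very slightly", a quick linear fit (just a sketch, not a careful time-series model):
```{r eval=FALSE}
# Slope of average title length against the date of the subreddit's first link
# post; as.numeric(oldest) is in seconds, so the slope is in title-length units per second.
fit = lm(avg_length_title ~ as.numeric(oldest), data = SL)
coef(fit)
```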
Subreddits die every day (or they switch to text-post only):
```{r}
ggplot(SL) +
aes(x = latest, y = avg_length_title) +
geom_line(size = 0.05, colour = "#0C4C8A") +
labs(x = "Subreddit's most recent link post",
y = "Avg. Title Length") +
theme_minimal()
```
Subreddits with the highest author counts have been around for a while:
```{r}
ggplot(SL) +
aes(x = oldest, y = count_authors) +
geom_line(size = 0.3, colour = "#0C4C8A") +
labs(x = "Subreddit's first link post", y = "Count of authors") +
theme_minimal()
```
It also seems that newer subreddits are more likely to allow NSFW link posts.
```{r}
ggplot(SL) +
aes(x = avg_age, y = is_over_18_ratio) +
geom_line(size = 0.05, colour = "#112446") +
labs(x = "Average Subreddit Link Posts Age",
y = "Over 18 ratio") +
theme_minimal()
```
I'm bad at stats and graphs. I think it might be more interesting to sort both datasets and create a comprehensive list somehow, then call out some interesting subreddits so more people are aware of them; but I don't have time to go through 100,000 subreddits (unless some rich patron wants to donate a small fortune).
It's obvious that some subreddits are saturated with new posts. AskReddit, for example, has three times as many link posts as the second highest, a subreddit called "TalkativePeople" which I've never heard of?! I consider myself something of a subreddit connoisseur, but I'm surprised by just how many subreddits there actually are.
I think the number of posts should be part of the sorting. Sorting by most recent post will also help weed out recently banned or abandoned subreddits.
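The sorts below bin each metric into coarse tiers with `cut()` before ordering, so subreddits are ranked tier-by-tier on the most important metric first and ties are broken by the next one. A minimal sketch of the idea with made-up data:
```{r eval=FALSE}
# Tiered sorting: bin metric `a` into 2 tiers, order by tier (descending),
# then break ties within a tier by `b`.
x = data.table(a = c(5, 50, 500, 5000), b = c(4, 3, 2, 1))
x[order(-cut(a, 2), b)]
```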
### Out of millions of subreddits, these are the top 50,000 link post subreddits:
```{r}
S = stats_link[order(-cut(count_authors, 300), cut(avg_age, 2), -cut(latest, 2), -cut(max_comments, 30), -cut(COUNT, 200), -cut(avg_comments, 30))]
fwrite(S, file = 'top_link_subreddits.csv')
S
```
### Top 20,000 text post subreddits
```{r}
S = stats_text[order(-cut(count_authors, 300), cut(avg_age, 2), -cut(latest, 2), -cut(avg_comments, 30), -cut(COUNT, 50), -cut(max_comments, 30))]
fwrite(S, file = 'top_text_subreddits.csv')
S
```
### Subreddits with long link post titles
```{r}
S = stats_link[order(-cut(avg_length_title, 300), -cut(avg_length, 30), -cut(COUNT, 20), -cut(avg_comments, 30), -cut(latest, 2), cut(avg_age, 2))]
fwrite(S, file = 'long_link_subreddits.csv')
S
```
### Subreddits with long text posts:
```{r}
S = stats_text[order(-cut(avg_length, 6), -cut(COUNT, 20), -cut(avg_comments, 30), -cut(latest, 2), cut(avg_age, 2))]
fwrite(S, file = 'long_text_subreddits.csv')
S
```
### Top NSFW link subreddits:
```{r}
S = stats_link[order(-cut(latest, 4), -cut(is_over_18_ratio, 2), -cut(max_score, 12), -cut(count_authors, 15), cut(avg_score, 20), -cut(COUNT, 20), -cut(max_comments, 100) )]
fwrite(S, file = 'nsfw_link_subreddits.csv')
S
```
### Top NSFW text subreddits:
```{r}
S = stats_text[order(-cut(latest, 4), -cut(is_over_18_ratio, 2), -cut(max_score, 12), -cut(count_authors, 15), cut(avg_score, 20), -cut(COUNT, 20), -cut(max_comments, 100) )]
fwrite(S, file = 'nsfw_text_subreddits.csv')
S
```
## Downloads
You can download the sorted CSVs here: [https://github.com/chapmanjacobd/reddit_mining/tree/main#downloads](https://github.com/chapmanjacobd/reddit_mining/tree/main#downloads)
```{r include=FALSE}
# These subreddits share the same average post age, so they might be subreddits
# primarily used by bots, or it may just be a statistical coincidence (a math party).
stats_link[avg_age %in% data.table(sort(table(stats_link$avg_age)))[N>5]$V1]
stats_link[oldest %in% data.table(sort(table(stats_link$oldest)))[N>5]$V1]
# Subreddits where people edit their text posts unusually long after creating them
# (note: avg_time_to_edit was dropped above, so this currently returns NULL)
stats_text$avg_time_to_edit
```