Digital Humanities sentiment analysis and affective prosody research: data samples and data transformation code
I made a big mistake by lowercasing the URLs in the reddit_links dataset. I recommend you use https://the-eye.eu/redarcs/ instead.
There are over two million subreddits, but I've curated a list of the top ~60,000.
The most interesting files are likely going to be top_link_subreddits.csv and top_text_subreddits.csv.
The files starting with long_* and nsfw_* contain the same data -- they are just sorted differently. Check insights.md for more details.
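To get oriented, a quick look at the header rows of top_link_subreddits.csv and top_text_subreddits.csv shows which columns are available (plain coreutils; nothing is assumed about the column layout):

```
# show the header plus the first few rows of each aggregate, aligned into columns
head -n 5 top_link_subreddits.csv | column -s, -t
head -n 5 top_text_subreddits.csv | column -s, -t
```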
I thought I knew most subreddits but there were a few popular ones that I discovered while writing this:
- /r/lastimages
- /r/invasivespecies
- /r/MomForAMinute
- /r/CrazyDictatorIdeas
- /r/drydockporn
- /r/ancientpics
- /r/coaxedintoasnafu
- /r/actualconspiracies
- /r/3FrameMovies
- /r/thisisntwhoweare
- /r/CorporateMisconduct
- /r/NuclearRevenge
- /r/redditserials
- /r/HobbyDrama
The data aggregates loaded here were created by converting Pushshift RS*.zst data into SQLite format using the pushshift subcommand of the xklb Python package:
```
# mirror the Pushshift submission dumps (*.zst) locally
wget -e robots=off -r -k -A zst https://files.pushshift.io/reddit/submissions/
# install xklb, which provides the `library pushshift` subcommand
pip install xklb
# fish shell: print one decompress-and-convert command per dump, then run four at a time via parallel
for f in psaw/files.pushshift.io/reddit/submissions/*
    echo "unzstd --memory=2048MB --stdout $f | library pushshift (basename $f).db"
end | parallel -j4
# merge the per-dump SQLite databases into a single file
library merge submissions.db psaw/RS*.db
```
This takes several days per step (and several terabytes of free space), but the end result is a 600 GB SQLite file. You can save some disk space by downloading the Parquet files below.
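If you build submissions.db yourself, a cheap sanity check before moving on is to look at what the merge produced. The `reddit_posts` table name below is a guess on my part, not necessarily what xklb creates, so list the tables first and substitute a real name:

```
# list the tables in the merged database
sqlite3 submissions.db ".tables"
# peek at one row to inspect the schema without scanning the whole 600 GB file
# ("reddit_posts" is an assumed table name -- use one reported by .tables)
sqlite3 submissions.db "SELECT * FROM reddit_posts LIMIT 1"
```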
I split up submissions.db into two Parquet files via sqlite2parquet.
Query the Parquet files using octosql. Depending on the query, octosql is usually faster than SQLite, and Parquet compresses very well. You may download the Parquet files here (see the example query after the list):
- reddit_links.parquet [87.7G]
- reddit_posts.parquet [~134G]
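For example, a top-subreddits-by-post-count query can run directly against the Parquet file. This is only a sketch: the `subreddit` column name is an assumption carried over from the Pushshift submission schema, so check the actual columns with the first query before relying on the second.

```
# pull a single row to see which columns the Parquet file actually has
octosql "SELECT * FROM reddit_links.parquet LIMIT 1"
# then aggregate; `subreddit` is an assumed column name
octosql "SELECT subreddit, COUNT(*) AS posts FROM reddit_links.parquet GROUP BY subreddit ORDER BY posts DESC LIMIT 20"
```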
Additionally, for simple analysis you can get by with downloading the sub-100MB pre-aggregated files in this repo. For the sake of speed and the ideal of clearly defined experimental variables, I have bifurcated the aggregations based on the type of post into two types of files (see the sketch after this list):
- 'link' for traditional reddit link posts.
- 'text' for self-posts (aka selftext), which were introduced in 2008.
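As one sketch of what this split makes easy, the two aggregate families can be queried side by side, for example to see which subreddits show up in both top lists. The `subreddit` column name is again an assumption; swap in whatever the CSV headers actually use.

```
# subreddits that appear in both the link and the text top lists
# (`subreddit` is an assumed column name -- check the header rows first)
octosql "SELECT l.subreddit FROM top_link_subreddits.csv l JOIN top_text_subreddits.csv t ON l.subreddit = t.subreddit LIMIT 20"
```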
user_stats_link.csv.zstd was 150MB (over GitHub's 100MB file limit), so I split it up into three files like this:
```
# split the uncompressed CSV at line boundaries into numbered chunks of at most 250MB each
split -d -C 250MB user_stats_link.csv user_stats_link_
# compress each chunk at zstd level 19
zstd -19 user_stats_link_*
```
You can combine them back into one file like this:
```
# zstd names its output *.zst by default, so this glob matches either .zst or .zstd chunks
zstdcat user_stats_link_*.zst* > user_stats_link.csv
```