reddit_mining

Data samples and data-transformation code for Digital Humanities research in sentiment analysis and affective prosody.

I made a big mistake by lowercasing the URLs in the reddit_links dataset. I recommend you use https://the-eye.eu/redarcs/ instead.

List of Subreddits

There are over two million subreddits, but I've curated a list of roughly the top 60,000.

Downloads

The most interesting files are likely going to be top_link_subreddits.csv and top_text_subreddits.csv.

The files starting with long_* and nsfw_* contain the same data -- they are just sorted differently. Check insights.md for more details.

I thought I knew most subreddits but there were a few popular ones that I discovered while writing this:

  • /r/lastimages
  • /r/invasivespecies
  • /r/MomForAMinute
  • /r/CrazyDictatorIdeas
  • /r/drydockporn
  • /r/ancientpics
  • /r/coaxedintoasnafu
  • /r/actualconspiracies
  • /r/3FrameMovies
  • /r/thisisntwhoweare
  • /r/CorporateMisconduct
  • /r/NuclearRevenge
  • /r/redditserials
  • /r/HobbyDrama

How was this made?

The data aggregates in this repo were created by converting the pushshift RS*.zst dumps into SQLite format using the pushshift subcommand of the xklb Python package:

# download the pushshift submission dumps (RS_*.zst files)
wget -e robots=off -r -k -A zst https://files.pushshift.io/reddit/submissions/

# install xklb, which provides the library command used below
pip install xklb

# fish shell: print one conversion command per dump, then let GNU parallel run
# four of them at a time; each dump becomes its own SQLite db
for f in psaw/files.pushshift.io/reddit/submissions/*
    echo "unzstd --memory=2048MB --stdout $f | library pushshift (basename $f).db"
end | parallel -j4

# merge the per-dump databases into a single submissions.db
library merge submissions.db psaw/RS*.db

This takes several days per step (and several terabytes of free space), but the end result is a single 600 GB SQLite file. You can save some disk space by downloading the parquet files below.

I split up submissions.db into two parquet files via sqlite2parquet.
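
If you'd rather not use sqlite2parquet, an equivalent conversion can be sketched with DuckDB's SQLite extension; this is a different tool than the one used here, and the table name submissions below is an assumption, so check the actual table names in submissions.db first:

# rough DuckDB equivalent of the sqlite2parquet step;
# 'submissions' is an assumed table name -- adjust to the real schema
duckdb <<'SQL'
INSTALL sqlite;
LOAD sqlite;
COPY (SELECT * FROM sqlite_scan('submissions.db', 'submissions'))
  TO 'reddit_posts.parquet' (FORMAT parquet);
SQL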

Query the Parquet files using octosql. Depending on the query, octosql is usually faster than SQLite, and Parquet compresses very well. You may download the parquet files here (an example query follows the list):

  1. reddit_links.parquet [87.7G]
  2. reddit_posts.parquet [~134G]
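
For example, once reddit_links.parquet is downloaded, a query along these lines works; the column names subreddit and score are assumptions, so adjust them to the actual schema:

# hypothetical query -- subreddit and score are assumed column names
octosql "SELECT subreddit, COUNT(*) AS posts, AVG(score) AS avg_score
         FROM reddit_links.parquet
         GROUP BY subreddit
         ORDER BY posts DESC
         LIMIT 20"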

Additionally, for simple analysis you can get by with downloading the sub-100MB pre-aggregated files in this repo. For the sake of speed, and to keep the experimental variables clearly defined, I have split the aggregations by post type into two kinds of files (see the example after this list):

  1. 'link' for traditional reddit link posts.
  2. 'text' for self posts (aka selftext), which were introduced in 2008.
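
For a quick look at one of these aggregates, inspect the header and query the CSV directly (octosql reads CSV as well as Parquet); no particular column names are assumed here:

head -1 top_link_subreddits.csv    # show the column names
octosql "SELECT * FROM top_link_subreddits.csv LIMIT 10"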

Misc

user_stats_link.csv

user_stats_link.csv.zstd was 150MB (over GitHub's 100MB file limit), so I split it up into three files like this:

# split the uncompressed CSV into <=250MB chunks: user_stats_link_00, _01, _02
split -d -C 250MB user_stats_link.csv user_stats_link_
# compress each chunk (zstd keeps the originals and writes compressed copies)
zstd -19 user_stats_link_*

You can combine them back into one file like this:

# the glob matches the chunks whether they were saved as .zst or .zstd
zstdcat user_stats_link_*.zst* > user_stats_link.csv
