Identifies highly upvoted removed comments and posts on reddit by aggregating historical data provided by files.pushshift.io/reddit. Results are displayed on subreddit top pages: Reveddit.com/r/<subreddit>/top
To process a full month's worth of comment data you need,
- 2TB HD: 1 TB of disk space to download the data and another 400 GB for intermediate processing files
- 40GB RAM: for the `2-aggregate-monthly.py` step. Splitting monthly files into smaller parts may use less memory (see the sketch below).

Without this hardware, you can still run the code on the included test set in under a minute.
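If memory is tight, one option is to split a monthly dump into line-aligned parts before aggregating. A minimal sketch, assuming the dumps are zstd-compressed newline-delimited JSON; the file name `RC_2020-06.zst` and the 10M-lines-per-part choice are placeholders:

```sh
# Decompress one monthly comment dump and split it into parts of
# 10 million lines each, without breaking lines across parts.
# RC_2020-06.zst is a hypothetical file name; adjust to your data.
zstd -dc data/0-pushshift_raw/RC_2020-06.zst | split -l 10000000 - RC_2020-06.part-
```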
Create a conda virtual environment and activate it,

```sh
conda create --name reveddit --file requirements-conda.txt
conda activate reveddit
```
Optionally, install PostgreSQL and include credentials in a `dbconfig.ini` as shown in `dbconfig-example.ini`.
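The authoritative field names are in `dbconfig-example.ini`; the section and field names below (`host`, `port`, `user`, `password`, `dbname`) are assumptions for illustration only:

```sh
# Hypothetical dbconfig.ini; copy the real field names from
# dbconfig-example.ini rather than from this sketch.
cat > dbconfig.ini << 'EOF'
[database]
host = localhost
port = 5432
user = reveddit
password = changeme
dbname = reveddit
EOF
```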
To process the test dataset included in this repo,

```sh
./processData.sh all test
```

Results appear in `test/3-aggregate_all` and `test/4-add_fields`.
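To spot-check the output (directory names from above; the file names inside are not specified here, so list them first):

```sh
# List the generated files, then peek at the aggregate output.
ls test/3-aggregate_all test/4-add_fields
head test/3-aggregate_all/*   # file names vary; adjust as needed
```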
To load results into a database, prepare database credentials in `dbconfig.ini` (as shown in `dbconfig-example.ini`) and run either,
- `./test.sh`, which runs the above command and loads the results into a local PostgreSQL database, or
- `./test.sh normal`, which loads full results into the database if the files have been downloaded (see below)
To download the subset of Pushshift comment and submission dumps used by this project, run

```sh
./downloadPushshiftDumps.sh
```

The results will be in `data/0-pushshift_raw/`. The script's comments explain why only a subset of the data is used.

Then run `./groupDaily.sh`. This creates monthly files from the daily files and moves the daily files to another directory.
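Roughly, the grouping step amounts to something like the following. This is a sketch only, not the actual contents of `./groupDaily.sh`; the file names and the `daily/` directory are assumptions:

```sh
# Concatenate one month's daily comment dumps into a single monthly file
# (zstd frames can be concatenated), then move the daily files aside.
cat data/0-pushshift_raw/RC_2020-06-*.zst > data/0-pushshift_raw/RC_2020-06.zst
mkdir -p data/0-pushshift_raw/daily
mv data/0-pushshift_raw/RC_2020-06-*.zst data/0-pushshift_raw/daily/
```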
Other Pushshift download scripts:
To process full results,
- Download the Pushshift monthly dumps (an example download is sketched below)
- Store them in `data/0-pushshift_raw/` as specified in `config.ini`
- Run `./processData.sh all normal`
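For example, to fetch a single monthly comment dump by hand (the URL pattern is an assumption based on the files.pushshift.io/reddit layout; check the site for current paths and file extensions):

```sh
# Download one hypothetical monthly comment dump into the raw-data directory.
wget -P data/0-pushshift_raw/ https://files.pushshift.io/reddit/comments/RC_2020-06.zst
```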
I used a DigitalOcean droplet. These are the rough steps,
- Set up ssh keys
- Install Postgres with docker
- Create a database login and password for your script
- Add the top 4 lines of `droplet-config/pg_hba.conf.head` to `/var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf` (see the sketch after this list)
- Run `sudo docker-compose up -d`
- `git clone` this repo
- Put the database login and password into a file called `dbconfig.ini` in the root directory of this repo
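A sketch of the `pg_hba.conf` step, assuming appending at the end is acceptable; rule order matters in `pg_hba.conf`, so verify placement against your setup:

```sh
# Copy the first 4 lines of the template into the container's pg_hba.conf,
# then restart so the change takes effect. Appending is an assumption;
# the rules may need to go at the top of the file instead.
head -n 4 droplet-config/pg_hba.conf.head | \
  sudo tee -a /var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf
sudo docker-compose restart
```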
Then, locally,
- In `prod.sh`, change `ssh.rviewit.com` to the domain name of the droplet
- Run `prod.sh`
- Check the local and remote logs to know when it's done
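For the first step, a one-line substitution works; `example.your-droplet.com` is a placeholder for your droplet's domain:

```sh
# Point prod.sh at your droplet instead of the author's host, then run it.
sed -i 's/ssh\.rviewit\.com/example.your-droplet.com/' prod.sh
./prod.sh
```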