Identifies highly upvoted removed comments and posts on reddit by aggregating historical data provided by files.pushshift.io/reddit. Results are displayed on subreddit top pages: Reveddit.com/r/<subreddit>/top
To process a full month's worth of comment data you need,
- 2TB HD: 1 TB of disk space to download the data and another 400 GB for intermediate processing files
- 40GB RAM: for the `2-aggregate-monthly.py` step. Splitting monthly files into smaller parts may use less memory (see the sketch below).

Without this hardware, you can still run the code on the included test set in under a minute.
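If memory is tight, one option is to split a monthly dump into line-aligned parts before aggregating. A minimal sketch, assuming the dumps are zstd-compressed newline-delimited JSON; the file name `RC_2020-06.zst` and the 10M-lines-per-part choice are placeholders:

```sh
# Decompress one monthly comment dump and split it into parts of
# 10 million lines each, without breaking lines across parts.
# RC_2020-06.zst is a hypothetical file name; adjust to your data.
zstd -dc data/0-pushshift_raw/RC_2020-06.zst | split -l 10000000 - RC_2020-06.part-
```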
Create a conda virtual environment and activate it,

```sh
conda create --name reveddit --file requirements-conda.txt
conda activate reveddit
```
Optionally, install PostgreSQL and include credentials in a `dbconfig.ini` as shown in `dbconfig-example.ini`.
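The authoritative field names are in `dbconfig-example.ini`; the section and field names below (`host`, `port`, `user`, `password`, `dbname`) are assumptions for illustration only:

```sh
# Hypothetical dbconfig.ini; copy the real field names from
# dbconfig-example.ini rather than from this sketch.
cat > dbconfig.ini << 'EOF'
[database]
host = localhost
port = 5432
user = reveddit
password = changeme
dbname = reveddit
EOF
```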
To process the test dataset included in this repo,

```sh
./processData.sh all test
```

Results appear in `test/3-aggregate_all` and `test/4-add_fields`.
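To spot-check the output (directory names from above; the file names inside are not specified here, so list them first):

```sh
# List the generated files, then peek at the aggregate output.
ls test/3-aggregate_all test/4-add_fields
head test/3-aggregate_all/*   # file names vary; adjust as needed
```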
To load results into a database, prepare database credentials in `dbconfig.ini` (as shown in `dbconfig-example.ini`) and run either,
- `./test.sh`, which runs the above command and loads the results into a local PostgreSQL database, or
- `./test.sh normal`, which loads full results into the database if the files have been downloaded (see below)
To download the subset of Pushshift comment and submission dumps used by this project, run

```sh
./downloadPushshiftDumps.sh
```

The results will be in `data/0-pushshift_raw/`. The script's comments explain why only a subset of the data is used.

Then run `./groupDaily.sh`. This creates monthly files from the daily files and moves the daily files to another directory.
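Roughly, the grouping step amounts to something like the following. This is a sketch only, not the actual contents of `./groupDaily.sh`; the file names and the `daily/` directory are assumptions:

```sh
# Concatenate one month's daily comment dumps into a single monthly file
# (zstd frames can be concatenated), then move the daily files aside.
cat data/0-pushshift_raw/RC_2020-06-*.zst > data/0-pushshift_raw/RC_2020-06.zst
mkdir -p data/0-pushshift_raw/daily
mv data/0-pushshift_raw/RC_2020-06-*.zst data/0-pushshift_raw/daily/
```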
Other Pushshift download scripts:
To process full results,
- Download the Pushshift monthly dumps (an example download is sketched below)
- Store them in `data/0-pushshift_raw/` as specified in `config.ini`
- Run `./processData.sh all normal`
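For example, to fetch a single monthly comment dump by hand (the URL pattern is an assumption based on the files.pushshift.io/reddit layout; check the site for current paths and file extensions):

```sh
# Download one hypothetical monthly comment dump into the raw-data directory.
wget -P data/0-pushshift_raw/ https://files.pushshift.io/reddit/comments/RC_2020-06.zst
```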
I used a DigitalOcean droplet. These are the rough steps,
- Set up ssh keys
- Install Postgres with docker
- Create a database login and password for your script
- Add the top 4 lines of `droplet-config/pg_hba.conf.head` to `/var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf` (see the sketch after this list)
- Run `sudo docker-compose up -d`
- `git clone` this repo
- Put the database login and password into a file called `dbconfig.ini` in the root directory of this repo
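A sketch of the `pg_hba.conf` step, assuming appending at the end is acceptable; rule order matters in `pg_hba.conf`, so verify placement against your setup:

```sh
# Copy the first 4 lines of the template into the container's pg_hba.conf,
# then restart so the change takes effect. Appending is an assumption;
# the rules may need to go at the top of the file instead.
head -n 4 droplet-config/pg_hba.conf.head | \
  sudo tee -a /var/lib/docker/volumes/hasura_db_data/_data/pg_hba.conf
sudo docker-compose restart
```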
Then, locally,
- In `prod.sh`, change `ssh.rviewit.com` to the domain name of the droplet
- Run `prod.sh`
- Check the local and remote logs to know when it's done
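For the first step, a one-line substitution works; `example.your-droplet.com` is a placeholder for your droplet's domain:

```sh
# Point prod.sh at your droplet instead of the author's host, then run it.
sed -i 's/ssh\.rviewit\.com/example.your-droplet.com/' prod.sh
./prod.sh
```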