This repository has been archived by the owner on Feb 23, 2023. It is now read-only.

Releases: iterative/aita_dataset

Praw rescrape of entire dataset

20 Feb 22:53

After discovering that pushshift.io returned unrepresentative scores for posts created during several months in 2018-19, I rescraped the entire dataset using praw to get the scores. This uncovered ~30K new data points with scores >= 3.

For more, see issue #1.

Patch for bald spots in data

20 Feb 00:54

The pushshift.io API, which was used to get scores for posts, turned out to report implausibly few posts scoring >= 3 karma for several months. Specifically, those months appear to be October 2018 through January 2019, plus December 2019. It is unclear why pushshift behaves this way for some periods.
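One way to spot such bald spots is to compare, month by month, the share of posts that reach the score threshold. The sketch below is illustrative only and is not the check used for this dataset: the function names, the `(created_utc, score)` input shape, and the "less than half the median share" rule are all assumptions.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import median


def monthly_share_at_least(posts, min_score=3):
    """Return {"YYYY-MM": share of posts with score >= min_score}.

    `posts` is an iterable of (created_utc, score) pairs; this input
    shape is an assumption made for the example.
    """
    total = Counter()
    kept = Counter()
    for created_utc, score in posts:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        total[month] += 1
        if score >= min_score:
            kept[month] += 1
    return {month: kept[month] / total[month] for month in sorted(total)}


def suspicious_months(shares, ratio=0.5):
    """Flag months whose share falls below `ratio` times the median share.

    The cutoff rule is a hypothetical heuristic, not a documented one.
    """
    if not shares:
        return []
    cutoff = ratio * median(shares.values())
    return [month for month, share in shares.items() if share < cutoff]
```

Months flagged this way are only candidates for a rescrape; a genuinely quiet month would look the same, so the result needs a manual check.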

As a patch, praw has been used to rescrape those periods.

As a longer-term fix, I am currently rescraping the dataset, using praw exclusively to report scores.
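A minimal sketch of what refreshing scores through praw could look like, assuming the post IDs are already known from the earlier scrape. The helper names and the batch size are hypothetical, and this is not the script used to build the release. It relies on `Reddit.info(fullnames=...)`, which yields the submissions matching the given `t3_`-prefixed fullnames.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def refresh_scores(reddit, post_ids, batch_size=100):
    """Return {post_id: current score} for the given submission IDs.

    `reddit` is expected to be a praw.Reddit instance (anything exposing
    a compatible `info(fullnames=...)` method works). Posts that Reddit
    does not return are simply absent from the result.
    """
    scores = {}
    for batch in chunked(list(post_ids), batch_size):
        # Submission fullnames are the base-36 ID prefixed with "t3_".
        fullnames = [f"t3_{post_id}" for post_id in batch]
        for submission in reddit.info(fullnames=fullnames):
            scores[submission.id] = submission.score
    return scores


# Usage (requires praw and Reddit API credentials; values are placeholders):
#   import praw
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   scores = refresh_scores(reddit, ["abc123", "def456"])
```

Passing the client in as an argument keeps the function independent of how credentials are configured.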

January 1-31, 2020 scrape

13 Feb 20:53

Added the scrape covering January 1-31, 2020.

v.20.0

13 Feb 20:54

Scrape from the genesis of r/AITA to January 1, 2020