Releases: iterative/aita_dataset
Praw rescrape of entire dataset
In response to the discovery that pushshift.io returned unrepresentative scores for posts created during several months in 2018-19, I have rescraped the entire dataset using praw to get the scores. This uncovered ~30K new data points with scores >= 3!
For more, see issue #1.
Patch for bald spots in data
The pushshift.io API, which was used to get scores for posts, turned out to report several months with implausibly few posts scoring >= 3 karma. Specifically, those months appeared to be October 2018-January 2019 and December 2019. It is unclear why pushshift behaves this way for periods of time.
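A minimal sketch of how such bald spots could be flagged, assuming each post is available as a `(created_utc, score)` pair. The function name `sparse_months` and the `max_share` cutoff are hypothetical and are not taken from this repository; the actual check used here may have been different.

```python
from collections import defaultdict
from datetime import datetime, timezone


def sparse_months(posts, min_score=3, max_share=0.05):
    """Return the months whose share of posts scoring >= min_score is below max_share.

    posts: iterable of (created_utc, score) pairs, created_utc in epoch seconds.
    max_share is an illustrative cutoff, not a value used by the dataset authors.
    """
    totals = defaultdict(int)  # posts seen per "YYYY-MM" month
    hits = defaultdict(int)    # posts per month with score >= min_score
    for created_utc, score in posts:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        totals[month] += 1
        if score >= min_score:
            hits[month] += 1
    # A month is suspicious when almost none of its posts reach min_score.
    return sorted(m for m in totals if hits[m] / totals[m] < max_share)
```

Months returned by a check like this are candidates for rescraping from a second source rather than proof that the scores are wrong.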
As a patch, praw has been used to rescrape those periods.
As a longer-term fix, I am currently rescraping the dataset using praw exclusively to report scores.
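A rough sketch of what refreshing scores through praw might look like, assuming the post IDs are already known (for example, from the earlier pushshift scrape). The helper name `refresh_scores` is hypothetical and this is not the repository's actual scraping code. It relies on praw's `Reddit.info(fullnames=...)` lookup, where submission fullnames carry the `t3_` prefix.

```python
def refresh_scores(reddit, post_ids, min_score=3):
    """Look up current scores for known post IDs and keep those >= min_score.

    reddit: an authenticated praw.Reddit instance (or any object exposing a
            compatible info(fullnames=...) method).
    post_ids: base-36 submission IDs without the "t3_" prefix.
    """
    fullnames = [f"t3_{post_id}" for post_id in post_ids]
    scores = {}
    # Posts that were deleted or cannot be found are simply absent from the result.
    for submission in reddit.info(fullnames=fullnames):
        if submission.score >= min_score:
            scores[submission.id] = submission.score
    return scores
```

In real use, `reddit` would be created with something like `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` using your own API credentials.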
January 1-31, 2020 scrape
Added scrape from January 1-31, 2020
v.20.0
Scrape from the genesis of r/AITA to January 1, 2020