A collection of scripts to use the data dump from Stack Overflow's dearly departed Documentation feature.
Pour one out for Stack Overflow Documentation and then grab the data dump. With a JSON parser in hand, you can use that content wherever your dreams take you. Just be sure to provide proper attribution.
- Install Ruby.
- Execute `gem install bundler` to install Bundler.
- Clone or download this very repository onto your machine.
- In the repository's directory, run `bundle install` to install all the required gems for the scripts.
- To test the scripts and download the Documentation archive, run `bundle exec rake`.
- If the tests all succeed, you are all set to run the scripts from the `examples` directory.
- `so_docs.rb`: Library for loading and manipulating the JSON Documentation archive.
- `wayback-api.rb`: Library to save and verify URLs on the Wayback Machine. (Probably should be a separate project, as it has no particular connection to the Documentation project other than that I want to save pages there.)
- `get-archive.rb`: Downloads the archive and extracts its contents. You only need to do this once.
- `example2html.rb`: Extracts the HTML representation of an example. To see what this looks like, I made a copy of Creating and Initializing Arrays in Java.
- `revision2jekyll.rb`: A Ruby script that prints a revision history item as Markdown text.
- `attribution2wbm.rb`: Submits example or topic attribution to the Wayback Machine.
- `submit2wbm.rb`: A Ruby script that submits all topics to the Internet Archive Wayback Machine. Demonstrates how to use `doctags.json` and `topics.json`. (I ran it on August 16, 2017, after Documentation was put in read-only mode. There's probably no reason to run it again. Also, it doesn't work for C#, as `c%23` isn't allowed in their URLs.)
- Stay tuned for other exciting scripts!*
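
As a rough illustration of working with `doctags.json` and `topics.json`, the sketch below joins topics to their tags using Ruby's standard `json` library. It is not taken from `so_docs.rb`: the helper `topics_by_tag` is hypothetical, and the field names `Id`, `Tag`, `DocTagId`, and `Title` are assumptions about the archive's layout, so check them against the actual files before relying on them.

```ruby
require 'json'

# Hypothetical helper: group topic titles under the tag they belong to.
# Assumes each doctag has "Id" and "Tag" keys and each topic has
# "DocTagId" and "Title" keys; adjust to match the real archive.
def topics_by_tag(doctags_json, topics_json)
  doctags = JSON.parse(doctags_json)
  topics  = JSON.parse(topics_json)

  # Index tag names by id so each topic lookup is a single hash access.
  tag_names = doctags.to_h { |tag| [tag['Id'], tag['Tag']] }

  topics.each_with_object(Hash.new { |h, k| h[k] = [] }) do |topic, grouped|
    tag = tag_names[topic['DocTagId']]
    grouped[tag] << topic['Title'] if tag # skip topics with no known tag
  end
end

# Inline sample data standing in for the contents of the real files.
sample_doctags = '[{"Id": 1, "Tag": "java"}, {"Id": 2, "Tag": "c#"}]'
sample_topics  = '[{"Id": 10, "DocTagId": 1, "Title": "Arrays"},
                   {"Id": 11, "DocTagId": 2, "Title": "Getting started"}]'

grouped = topics_by_tag(sample_doctags, sample_topics)
grouped.each { |tag, titles| puts "#{tag}: #{titles.join(', ')}" }
```

With the downloaded archive, you would pass the contents of the two files (for example via `File.read`) in place of the sample strings.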
I'm working with Ruby, but I'm happy to accept scripts written in other languages as long as I can test them out. I'm also happy to include links to other projects using the Documentation archive in this README. Feel free to submit pull requests and I'll incorporate them as quickly as I can.
If there's something you'd like to see from the archive and can't figure out how to extract the content, feel free to add an issue or ask on Meta Stack Overflow.
- Tests are fragile. Changing the way these scripts work in even minor ways will break the tests. (Fortunately, the tests are also simple, so changing the expected `md5` hash result usually suffices.)
- The test framework also pulls in the entire archive from Archive.org and doesn't clean it up. This might be considered a feature by some.
- Getting users' display names requires a call to the Stack Exchange API, which is subject to rate limiting. The method neither checks whether it has used up the daily quota nor caches results, so it's easy to be throttled if you aren't careful. I've added an application key, since exceeding the quota was a leading cause of failure for the Travis CI tests.
- My code more or less reproduces an RDBMS, poorly. It would probably be smarter to load the JSON files into SQLite or something.
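
One way to soften the rate-limiting problem described above is to cache display names and stop calling the API once the quota runs out. The sketch below is not part of this repository: the class `DisplayNameCache` and its fetcher block are hypothetical, and the fetcher is assumed to return the display name together with the remaining quota (the Stack Exchange API reports this as `quota_remaining` in its response wrapper).

```ruby
# Hypothetical cache around a display-name lookup. The block passed to
# the constructor stands in for the real API call and must return
# [display_name, quota_remaining].
class DisplayNameCache
  class QuotaExhausted < StandardError; end

  attr_reader :quota_remaining

  def initialize(&fetcher)
    @fetcher = fetcher
    @names = {}
    @quota_remaining = nil # unknown until the first request
  end

  def display_name(user_id)
    # Serve repeated lookups from memory without touching the API.
    return @names[user_id] if @names.key?(user_id)

    # Refuse to call the API once it has reported an empty quota.
    raise QuotaExhausted, 'daily quota used up' if @quota_remaining&.zero?

    name, @quota_remaining = @fetcher.call(user_id)
    @names[user_id] = name
  end
end

# Stub fetcher for demonstration; a real one would query the API.
calls = 0
cache = DisplayNameCache.new do |id|
  calls += 1
  ["user#{id}", 2 - calls]
end

cache.display_name(1)
cache.display_name(1) # served from the cache, no second call
puts calls
```

The cache here lives only in memory, so it helps within a single run; persisting it to disk between runs would be a separate design choice.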
Footnote:
* Offer contingent on author's creativity and reader's ability to be excited.