All notable changes to the SOTorrent dataset project will be documented in this file.
- Extract language information from Stack Snippets, other tags (https://meta.stackexchange.com/questions/184108/what-is-syntax-highlighting-and-how-does-it-work/184109#184109, https://meta.stackexchange.com/questions/353983/goodbye-prettify-hello-highlight-js-swapping-out-our-syntax-highlighter), or highlight.js (see previous link). Integrate Stack Snippet table into post block versions or link individual Stack Snippets to their predecessors. See also https://stackoverflow.com/help/formatting
- Update database schema on website
- Add historical user reputation
- Automate import of tables `PostTags` (in particular the tag processing using BigQuery or update the Java application to import tags first into a HashMap and replace names by IDs there) and `PostViews`
- Revise table `PostBlockDiff`
- Explore feasibility of providing SQLite instead of MySQL dumps
- Data type of some columns is integer instead of boolean (e.g., `PostBlockVersion.MostRecentVersion`); `MostRecentVersion` also seems to contain a bug (see, e.g., post 26853961)
- Split scripts and data into two Zenodo archives
- Update to Stack Overflow data dump 2020-12-08
- Update to Stack Overflow data dump 2020-09-08
- Update to Stack Overflow data dump 2020-06-02
- Update escaping of newline characters (related to this issue)
- Now using MySQL dumps, newline characters are no longer escaped in the BigQuery version of the dataset
- This also fixes a bug in the export script (for tables `PostVersionUrl` and `CommentUrl`, column `LinkAnchor` was identical to column `FullMatch`)
- Fix bug in creation of table `Threads` (now using correct dataset version)
- Update to Stack Overflow data dump 2020-03-02
- Update GitHub references to 2020-03-13 (according to BigQuery table info, retrieved 2020-03-15)
- Add table PostTags
- Fix bug in extraction of references from GitHub files causing links to posts published after September 2019 to be missing (affected tables are `PostReferenceGH` and `GHMatches`)
- Add new table `GHCommits` with links to Stack Overflow posts or comments found in GitHub commits (using BigQuery GitHub dataset)
- Update to Stack Overflow data dump 2019-12-02
- Add non-generated comments (`PostHistory.Comment`) to table `PostVersion`
- Add view count history (new table `PostViews`)
- Update to Stack Overflow data dump 2019-09-04
- Automate execution of SQL scripts
- Add column `MostRecentVersion` to table `TitleVersion`
- Add table `StackSnippetVersion`
- Helper table `Threads` is now officially part of SOTorrent
- Update to Stack Overflow data dump 2019-06-03
- Improve matching of very short post blocks (containing only one token)
- Add table `VoteType` (see this issue on GitHub)
- Automate execution of BigQuery scripts
- Update to Stack Overflow data dump 2019-03-04
- Update GitHub references to 2019-03-29 (according to BigQuery table info)
- Improve detection of HTML code blocks
- Improve detection of comment links (links containing a query string, such as `https://stackoverflow.com/questions/28705447/is-there-a-java-method-that-fills-a-list-by-calling-a-function-many-times/28705651?noredirect=1#comment45733057_28705651`, are now correctly handled)
- Update regular expression used to extract Stack Overflow links from GitHub files, correctly handle multiple Stack Overflow links per source code line (previously only the first match in each line was extracted)
- Table `PostReferenceGH` now only contains links pointing to a valid `PostId` or `CommentId`, remove column `PostTypeId` (which was derived from links and was thus sometimes wrong) and previously introduced id 99 for comments
- New column `GHMatches.PostIds` that contains a space-separated list of post ids found in the matched line
- Add new columns `PostVersion.MostRecentVersion` and `PostBlockVersion.MostRecentVersion` that make it easier to analyze only the most recent version of a post/post block
- Update to MySQL 8.0
- Switch to 7z for data compression
- Update to Stack Overflow data dump 2018-12-02
- Changes to table `PostReferenceGH`:
  - Improve Stack Overflow URL extraction from source code files in BigQuery GitHub dataset
  - Stack Overflow links are now normalized to "https" instead of "http"
  - Comment links are now distinguished from question links:
    - Add new post type "Comment" with post type id `99`
    - Add new column `CommentId` (`null` for question and answer links)
    - `SOUrl` now points directly to comments, not to corresponding questions
  - Split column `RepoName` into `RepoOwner` and `RepoName`, keep complete repo name as new column `Repo`
  - Retrieved references on 2018-12-09
- New table `GHMatches` with matched source code lines containing a link to Stack Overflow questions, answers, or comments
- Improve post block predecessor matching
- Add remark to use Archive Utility on macOS to extract the dataset (see README file)
- Update to Stack Overflow data dump 2018-09-05
- Update `PostReferenceGH` (retrieved on 2018-09-23)
- Improve URL extraction (e.g., exclude matches in Markdown inline code, exclude invalid links)
- Add new columns `FragmentIdentifier` and `Query` to tables `PostVersionUrl` and `CommentUrl`
- Add new column `LinkType` to tables `PostVersionUrl` and `CommentUrl` (e.g., inline Markdown link, bare link)
- Add new column `LinkPosition` to tables `PostVersionUrl` and `CommentUrl` (beginning, middle, end of post/comment, or "link only" if a comment/post consists only of a URL)
- Add new column `FullMatch` to tables `PostVersionUrl` and `CommentUrl`
- Update to Stack Overflow data dump 2018-06-05
- Case-insensitive extraction of URL components
- Add new columns `Protocol`, `CompleteDomain`, and `RootDomain` to table `PostVersionUrl`
- Add new columns `LocalId`, `PredLocalId`, and `PredPostHistoryId` to table `PostBlockDiff` (enables retrieval of diffs according to position in post without requiring a join)
- Add new columns `PredLocalId`, `PredPostHistoryId`, `RootLocalId`, and `RootPostHistoryId` to table `PostBlockVersion` (easier detection of position changes and easier retrieval of post block lifespans)
- Rename column `RootPostBlockId` of table `PostBlockVersion` to `RootPostBlockVersionId` and column `PredPostBlockId` to `PredPostBlockVersionId` (reason: consistent naming)
- Remove column `PostVersionId` from table `PostBlockVersion` (reason: the stable `PostHistoryId` should be used instead)
- Add new table `CommentUrl`
- Add new table `TitleVersion`
- Update to Stack Overflow data dump 2018-03-13
- `Comments.UserDisplayName`: `VARCHAR(30)` → `VARCHAR(40)` (unify the type of all display name columns)
- Create indices for all user display name columns
- Add table `PostHistoryType` (see column `Revision` here) and add column `PostHistoryTypeId` to table `PostVersion`
- Add auto-generated primary key `Id` to table `PostReferenceGH`
- All tables from the official Stack Overflow dump are now available in the BigQuery version of the dataset
- Schema files for importing SOTorrent into Google BigQuery (db-scripts)
- Improve filename regex (db-scripts)
  - Prevent matching of directory names starting with "." in table `PostReferenceGH` (for example `.history/17/10db4490e45300171a8a828d7b324fa2`)
- Order post versions according to `CreationDate` instead of `PostHistoryId` (so-posthistory-extractor and db-scripts)
  - In the SOTorrent 2018-01-18 dataset, 283 posts created in 2008/2009 were not ordered chronologically (see "broken_entries" in "analysis_postversion_edit_timespan.R").
  - Thus, we now order post versions according to their `CreationDate` (instead of using the `PostHistoryId`).
  - Updated database schema and class `PostVersion` to include new member variable `CreationDate`.
- Fixed import and export scripts (db-scripts)
- Replaced newline character in GitHub path, which was present in two rows of table PostReferenceGH (db-scripts)
- `UserId`/`OwnerUserId` is `null` in some cases. Then, the `UserDisplayName` has to be employed to identify users. This applies to tables `Comments`, `PostHistory`, and `Posts`. Idea: Find the corresponding Ids using `UserDisplayName` and table `Users`, replace the `null` values, and add foreign key constraints, which is currently not possible. UPDATE 2018-03-13: 533,378 of 5,765,510 `UserDisplayName` values are not unique, thus the approach described above does not work.
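
The uniqueness problem described above can be illustrated with a small query. The following is only a sketch and not part of the SOTorrent scripts: it uses Python's built-in `sqlite3` module with a toy in-memory table instead of the real MySQL dump, the function names and sample rows are made up, and it assumes that table `Users` stores the name in a column called `DisplayName`. It is not meant to reproduce the exact numbers reported above.

```python
import sqlite3


def count_ambiguous_display_names(conn):
    """Return (ambiguous, total): the number of distinct display names shared
    by more than one user, and the number of distinct display names overall."""
    total = conn.execute(
        "SELECT COUNT(DISTINCT DisplayName) FROM Users"
    ).fetchone()[0]
    ambiguous = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT DisplayName
            FROM Users
            GROUP BY DisplayName
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]
    return ambiguous, total


def build_toy_users():
    """Create an in-memory Users table with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT)")
    conn.executemany(
        "INSERT INTO Users (Id, DisplayName) VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "alice"), (4, "carol")],
    )
    return conn


if __name__ == "__main__":
    ambiguous, total = count_ambiguous_display_names(build_toy_users())
    print(f"{ambiguous} of {total} display names are not unique")
```

Any display name that appears in the ambiguous group maps to more than one `Id`, so a `null` user id with that name cannot be resolved to a single user.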
The format of this file is based on Keep a Changelog.