
This page elaborates on what data we use and how we process it.

Overview

Stack Overflow is a question-and-answer website for programming-related questions. Each question has one or more tags associated with it. At the time of writing there are approximately 56,000 tags.
We consider two tags related if they are assigned to the same question; the more questions two tags share, the more closely related they are. Our pipeline is as follows:

  • obtain information about questions and tags from Stack Overflow;
  • keep only the information relevant to our problem (the number of questions shared by any two tags);
  • convert this information so that t-SNE can be applied to it.

Raw data

The Stack Overflow database is fairly accessible to data scientists. There is a .torrent dump of Stack Exchange data (i.e. all sites under the Stack Exchange umbrella, including Stack Overflow), which is updated roughly twice a year. It is a snapshot of a particular database state, and we could use it for our analysis.

Fortunately, there is a GitHub repository called StackLite that provides a minified version of the dump. That version contains mostly question metadata, which is sufficient for our needs, because it includes the relationships between questions and tags.

We download two files from StackLite (questions.csv.gz and question_tags.csv.gz) and store them in the raw folder.
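
For reference, here is a minimal sketch of that download step in Python. The StackLite download URL below is an assumption and may need to be adjusted to the actual repository layout.

# Sketch: fetch the two StackLite files into the raw folder.
# BASE_URL is an assumed location of the dump files, not taken from the project.
import os
import urllib.request

RAW_DIR = "raw"
BASE_URL = "https://github.com/dgrtwo/stacklite/raw/master"  # assumption

os.makedirs(RAW_DIR, exist_ok=True)
for name in ("questions.csv.gz", "question_tags.csv.gz"):
    target = os.path.join(RAW_DIR, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", target)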

Intermediate data

Relevant information

First, we convert the StackLite data into three files: posts.csv (question id and creation date), post_tag.csv (pairs of <question id, tag id>) and tags.csv (tag id and tag name).
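
A minimal sketch of this conversion with pandas is shown below. The column names Id, CreationDate and Tag follow the StackLite CSV headers and are assumptions here; the actual converter in the repository may differ.

# Sketch: convert the StackLite CSVs into posts.csv, post_tag.csv and tags.csv.
# Column names ("Id", "CreationDate", "Tag") are assumptions.
import pandas as pd

questions = pd.read_csv("raw/questions.csv.gz")
question_tags = pd.read_csv("raw/question_tags.csv.gz")

# posts.csv: question id and creation date
questions[["Id", "CreationDate"]].to_csv("posts.csv", index=False)

# tags.csv: assign a numeric id to every distinct tag name
tags = pd.DataFrame({"TagName": sorted(question_tags["Tag"].dropna().unique())})
tags.index.name = "TagId"
tags.to_csv("tags.csv")

# post_tag.csv: pairs of <question id, tag id>
name_to_id = {name: i for i, name in enumerate(tags["TagName"])}
post_tag = pd.DataFrame({
    "PostId": question_tags["Id"],
    "TagId": question_tags["Tag"].map(name_to_id),
})
post_tag.to_csv("post_tag.csv", index=False)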

Adjacency matrix

Then we build an adjacency matrix, adj_matrix.txt, from this data. Each row represents a tag and lists all tags that share at least one question with it. It has the following format:
vertex_from: vertex_to1,w1 vertex_to2,w2 ... vertex_tok,wk
Each row starts with the number of the vertex it describes. It is followed by zero or more space-separated entries, each consisting of a destination vertex, a comma, and w(source, dest), the number of questions shared by the tags with ids source and dest. Every destination vertex number is greater than the source number: we store only the part of the matrix above the main diagonal.
Example:

1: 3517,1
2:
3: 41,1 44,1 54836,1
4: 37043,1 44404,1
6:
7: 33,6 95,1 106,2 228,1 1098,1 1547,4 1906,1 3196,5
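
The following sketch shows how such a matrix could be built from post_tag.csv (the PostId/TagId headers are assumptions carried over from the sketch above).

# Sketch: build adj_matrix.txt from post_tag.csv.
from collections import defaultdict
from itertools import combinations
import csv

# collect the tag ids attached to every question
question_to_tags = defaultdict(set)
all_tag_ids = set()
with open("post_tag.csv", newline="") as f:
    for row in csv.DictReader(f):
        tag_id = int(row["TagId"])
        question_to_tags[row["PostId"]].add(tag_id)
        all_tag_ids.add(tag_id)

# count shared questions for every co-occurring tag pair (smaller id first)
weights = defaultdict(int)
for tag_ids in question_to_tags.values():
    for a, b in combinations(sorted(tag_ids), 2):
        weights[(a, b)] += 1

# group edges by their source vertex
rows = defaultdict(list)
for (a, b), w in weights.items():
    rows[a].append((b, w))

# write one row per tag, keeping only the part above the main diagonal
with open("adj_matrix.txt", "w") as out:
    for v in sorted(all_tag_ids):
        neighbours = " ".join(f"{u},{w}" for u, w in sorted(rows[v]))
        out.write(f"{v}: {neighbours}".rstrip() + "\n")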

Nearest neighbour matrix

In order to use t-SNE (more precisely, the bhtsne implementation), we convert our adjacency matrix into a nearest neighbour matrix. For every tag, we run Dijkstra's algorithm and find its K = 155 nearest neighbours (possibly indirect ones), together with the distance to each of them. Tags are renumbered to 0..number_of_tags - 1 before finding nearest neighbours, so we also store a file called adj_id_to_nn_id.txt, which maps tag ids in the adjacency matrix to tag ids in the nearest neighbour matrix. This mapping is used later in the pipeline to attach additional information to each tag (more precisely, to match tags with their pieces of information).

Example rows:

5: 29,13.6357 4,14.9908 1,16.5168 6,17.3267 265,17.3455 269,20.5443
6: 29,9.00355 4,10.2083 265,11.3719 6282,11.4855 1013,12.8675
9: 8,10.2039 1,12.0656 283,12.1757 74,12.3212 29,12.7001
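
A sketch of this step is given below. The edge-length transform (1/w, so that tags sharing more questions lie closer together), the output filename nn_matrix.txt, and the id scheme used in the output rows are assumptions here; the project's actual implementation may differ.

# Sketch: K-nearest-neighbour search over the tag graph with Dijkstra's algorithm.
# Edge length 1/w and the output filenames are assumptions.
import heapq
from collections import defaultdict

K = 155

# read the upper-triangular adjacency matrix and make it symmetric
graph = defaultdict(list)
tag_ids = []
with open("adj_matrix.txt") as f:
    for line in f:
        head, _, rest = line.partition(":")
        v = int(head)
        tag_ids.append(v)
        for item in rest.split():
            u_str, w_str = item.split(",")
            u, w = int(u_str), int(w_str)
            length = 1.0 / w  # assumed transform: more shared questions => shorter edge
            graph[v].append((u, length))
            graph[u].append((v, length))

# dense renumbering for the nearest neighbour matrix
# (format assumed: "<adjacency id> <nn id>" per line)
adj_to_nn = {v: i for i, v in enumerate(sorted(tag_ids))}
with open("adj_id_to_nn_id.txt", "w") as out:
    for v, i in sorted(adj_to_nn.items()):
        out.write(f"{v} {i}\n")

def k_nearest(source, k):
    """Dijkstra from `source`, stopping once the k nearest vertices are settled."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    found = []
    while heap and len(found) < k:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        if v != source:
            found.append((v, d))
        for u, length in graph[v]:
            nd = d + length
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return found

with open("nn_matrix.txt", "w") as out:
    for v in sorted(tag_ids):
        neighbours = " ".join(f"{adj_to_nn[u]},{d:g}" for u, d in k_nearest(v, K))
        out.write(f"{adj_to_nn[v]}: {neighbours}".rstrip() + "\n")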

Processed data

We put the output of our modified bhtsne version into the processed directory as raw_tsne_output.txt. It contains the regular t-SNE output (the input parameters used, iteration timing, etc.) as well as tab-separated coordinate pairs of the form:

######START TSV
x	y
-4.40369	16.8293
-6.66582	-11.9767
-4.40616	16.8284
-4.40498	16.8306
-5.12879	-9.24884
-4.40527	16.8236
-4.21825	-10.4839
-4.39632	16.8212
...
6.40564	2.99697
-4.1544	16.883
-4.1544	16.883
######END TSV
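
A small helper along these lines can extract the coordinate block; the marker strings follow the example above.

# Sketch: read the (x, y) pairs between the START/END TSV markers.
def read_tsne_coordinates(path="processed/raw_tsne_output.txt"):
    coords, inside = [], False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == "######START TSV":
                inside = True
            elif line == "######END TSV":
                break
            elif inside and not line.startswith("x"):  # skip the "x<TAB>y" header
                x, y = map(float, line.split("\t"))
                coords.append((x, y))
    return coords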

We also store some additional information about every tag in id_to_additional_info.csv, e.g.:

Id,name,PostCount
0,.bash-profile,7
1,cygwin,43
2,visual-studio-2015,303
3,windows,586
4,.htaccess,341
5,amazon-web-services,602

This is sufficient to start generating a map.
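
For illustration, pairing the coordinates with this file could look as follows; it assumes the i-th coordinate row corresponds to Id == i, places the CSV in the processed directory, and reuses read_tsne_coordinates from the sketch above.

# Sketch: join t-SNE coordinates with id_to_additional_info.csv.
# Assumes the row order of the t-SNE output matches the Id column (path assumed).
import csv

coords = read_tsne_coordinates()  # defined in the previous sketch

points = []
with open("processed/id_to_additional_info.csv", newline="") as f:
    for row in csv.DictReader(f):
        x, y = coords[int(row["Id"])]
        points.append({"name": row["name"], "x": x, "y": y,
                       "post_count": int(row["PostCount"])})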

Example data

We provide an example dataset, which is a valid subset of all Stack Overflow tags. It was gathered manually from the Data Explorer using this query.
These are the 376 tags that share at least one question with the logarithm tag. We store each element of the adjacency matrix in logarithm.csv, as well as a mapping from tag ids to tag names in data_stackexchange_tags.csv.
The process of visualizing the example dataset is very similar to the one described above.