Data
This page describes what data we use and how we process it.
Stack Overflow is a question-and-answer website for programming-related questions. Each question has one or more tags associated with it. At the time of writing there are approximately 56,000 tags.
We consider two tags related if they are assigned to the same question. The more questions two tags share, the more related they are. The pipeline is as follows:
- obtain information about questions and tags from Stack Overflow;
- keep only the information relevant to our problem (the number of questions shared by any two tags);
- convert this information so that t-SNE can be applied to it.
The Stack Overflow database is fairly accessible for data scientists. There is a .torrent dump of Stack Exchange data (i.e. all sites under the Stack Exchange umbrella, including Stack Overflow), which is updated roughly twice a year. It is a snapshot of a particular database state, and we could use it for our analysis.
Fortunately, there is a GitHub repository called StackLite that provides a minified version of the dump. That version contains mostly question metadata, which is sufficient for our needs because it includes the relationships between tags and questions.
We download two files from StackLite (`questions.csv.gz` and `question_tags.csv.gz`) and store them in the `raw` folder.
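For reference, fetching the two archives could look something like this minimal Python sketch (the raw-file URL is an assumption about the StackLite repository layout, not something this page specifies):

```python
# Minimal download sketch. BASE_URL is an assumed location of the StackLite
# raw files; adjust it if the repository layout differs.
import os
import urllib.request

RAW_DIR = "raw"
BASE_URL = "https://github.com/dgrtwo/StackLite/raw/master/"  # assumption
FILES = ["questions.csv.gz", "question_tags.csv.gz"]

os.makedirs(RAW_DIR, exist_ok=True)
for name in FILES:
    target = os.path.join(RAW_DIR, name)
    if not os.path.exists(target):
        urllib.request.urlretrieve(BASE_URL + name, target)
```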
First, we convert the StackLite data into three files: `posts.csv` (question id and creation date), `post_tag.csv` (pairs of <question id, tag id>) and `tags.csv` (tag id and tag name).
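A rough sketch of this conversion, assuming the StackLite column names `Id`, `CreationDate` and `Tag` (the actual script in the repository may differ):

```python
# Sketch of the StackLite -> posts.csv / post_tag.csv / tags.csv conversion.
# Input column names (Id, CreationDate, Tag) are assumptions about StackLite.
import pandas as pd

questions = pd.read_csv("raw/questions.csv.gz")
question_tags = pd.read_csv("raw/question_tags.csv.gz")

# posts.csv: question id and creation date
questions[["Id", "CreationDate"]].to_csv("posts.csv", index=False)

# tags.csv: assign a numeric id to every distinct tag name
tag_names = sorted(question_tags["Tag"].dropna().unique())
tags = pd.DataFrame({"TagId": range(1, len(tag_names) + 1), "TagName": tag_names})
tags.to_csv("tags.csv", index=False)

# post_tag.csv: pairs of <question id, tag id>
name_to_id = dict(zip(tags["TagName"], tags["TagId"]))
post_tag = question_tags.dropna(subset=["Tag"]).copy()
post_tag["TagId"] = post_tag["Tag"].map(name_to_id)
post_tag[["Id", "TagId"]].to_csv("post_tag.csv", index=False)
```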
Then we build an adjacency matrix, `adj_matrix.txt`, from this data. Each row represents a tag and contains information about all tags that share at least one question with it. It has the following format:
vertex_from: vertex_to1,w1 vertex_to2,w2 ... vertex_tok,wk
Each row starts with the number of the vertex it describes, followed by zero or more space-separated entries. Each entry consists of a destination vertex, a comma, and `w(source, dest)`, the number of questions shared by the tags with ids `source` and `dest`. Each destination vertex number must be greater than the source number: we store only the part of the matrix above the main diagonal.
Example:
1: 3517,1
2:
3: 41,1 44,1 54836,1
4: 37043,1 44404,1
6:
7: 33,6 95,1 106,2 228,1 1098,1 1547,4 1906,1 3196,5
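A sketch of how such a matrix can be built from `post_tag.csv` (column names follow the conversion sketch above; the repository's actual script may differ):

```python
# Build the upper-triangular adjacency matrix described above:
# edge weight = number of questions shared by a pair of tags.
import csv
from collections import defaultdict
from itertools import combinations

# question id -> set of tag ids
tags_per_question = defaultdict(set)
with open("post_tag.csv", newline="") as f:
    for row in csv.DictReader(f):  # assumes columns Id, TagId
        tags_per_question[row["Id"]].add(int(row["TagId"]))

# (smaller tag id, larger tag id) -> number of shared questions
weights = defaultdict(int)
for tag_ids in tags_per_question.values():
    for a, b in combinations(sorted(tag_ids), 2):
        weights[(a, b)] += 1

# Group edges by source vertex and write "src: dst1,w1 dst2,w2 ..."
rows = defaultdict(list)
for (a, b), w in sorted(weights.items()):
    rows[a].append(f"{b},{w}")

all_tags = sorted(set().union(*tags_per_question.values()))
with open("adj_matrix.txt", "w") as out:
    for tag_id in all_tags:
        entries = " ".join(rows[tag_id])
        out.write(f"{tag_id}: {entries}".rstrip() + "\n")
```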
In order to use t-SNE (more precisely, the `bhtsne` implementation), we convert the adjacency matrix into a nearest-neighbour matrix. For every tag, we run Dijkstra's algorithm and find its K=155 nearest neighbours (possibly indirect), along with the distance to each of them. Tags are renumbered to `0..number_of_tags - 1` before finding nearest neighbours, so we also store a file called `adj_id_to_nn_id.txt`, which maps tag ids in the adjacency matrix to tag ids in the nearest-neighbour matrix. This mapping is used later in the pipeline to attach additional information to each tag.
Example rows:
5: 29,13.6357 4,14.9908 1,16.5168 6,17.3267 265,17.3455 269,20.5443
6: 29,9.00355 4,10.2083 265,11.3719 6282,11.4855 1013,12.8675
9: 8,10.2039 1,12.0656 283,12.1757 74,12.3212 29,12.7001
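A simplified sketch of the nearest-neighbour search; note that the way shared-question counts are turned into edge lengths (here 1/w) is an assumption, as the page does not spell out the actual transformation:

```python
# Dijkstra-based K-nearest-neighbour search on the tag graph.
import heapq
from collections import defaultdict

K = 155

def load_graph(path="adj_matrix.txt"):
    graph = defaultdict(list)  # vertex -> list of (neighbour, edge length)
    with open(path) as f:
        for line in f:
            src, _, rest = line.partition(":")
            src = int(src)
            for item in rest.split():
                dst, w = item.split(",")
                length = 1.0 / int(w)  # assumed weight-to-distance mapping
                graph[src].append((int(dst), length))
                graph[int(dst)].append((src, length))  # matrix stores only the upper part
    return graph

def k_nearest(graph, source, k=K):
    """Return up to k closest vertices (possibly indirect neighbours) with distances."""
    dist = {source: 0.0}
    settled = {}           # vertex -> final distance, in non-decreasing order
    heap = [(0.0, source)]
    while heap and len(settled) < k + 1:
        d, v = heapq.heappop(heap)
        if v in settled:
            continue
        settled[v] = d
        for u, length in graph[v]:
            nd = d + length
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return [(v, d) for v, d in settled.items() if v != source][:k]
```

Running `k_nearest` for every vertex and renumbering the vertices to `0..number_of_tags - 1` then yields rows like the ones above.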
We put the output of our modified `bhtsne` version into the `processed` directory as `raw_tsne_output.txt`. It contains the regular t-SNE output (input parameters, iteration timing, etc.) as well as tab-separated coordinate pairs of the form:
######START TSV
x y
-4.40369 16.8293
-6.66582 -11.9767
-4.40616 16.8284
-4.40498 16.8306
-5.12879 -9.24884
-4.40527 16.8236
-4.21825 -10.4839
-4.39632 16.8212
...
6.40564 2.99697
-4.1544 16.883
-4.1544 16.883
######END TSV
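Downstream steps only need the coordinates, which can be extracted by scanning for the START/END markers (a small sketch):

```python
# Pull the (x, y) pairs out of raw_tsne_output.txt, ignoring the surrounding
# t-SNE log lines and the "x y" header.
def read_tsne_coordinates(path="processed/raw_tsne_output.txt"):
    coords, inside = [], False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("######START TSV"):
                inside = True
            elif line.startswith("######END TSV"):
                break
            elif inside and line and not line.startswith("x"):
                x, y = map(float, line.split())
                coords.append((x, y))
    return coords
```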
We also store some additional information in `id_to_additional_info.csv`, e.g.:
Id,name,PostCount
0,.bash-profile,7
1,cygwin,43
2,visual-studio-2015,303
3,windows,586
4,.htaccess,341
5,amazon-web-services,602
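One way to assemble such a file is to join tag names with per-tag question counts, keyed by the renumbered id from `adj_id_to_nn_id.txt` (a sketch; the file formats and column names here are assumptions carried over from the sketches above):

```python
# Sketch: build id_to_additional_info.csv from the mapping file, tag names
# and question counts. Formats of the intermediate files are assumptions.
import csv
from collections import Counter

# adjacency-matrix tag id -> nearest-neighbour id (assumed whitespace-separated pairs)
adj_to_nn = {}
with open("adj_id_to_nn_id.txt") as f:
    for line in f:
        adj_id, nn_id = map(int, line.split())
        adj_to_nn[adj_id] = nn_id

# tag id -> tag name
names = {}
with open("tags.csv", newline="") as f:
    for row in csv.DictReader(f):
        names[int(row["TagId"])] = row["TagName"]

# tag id -> number of questions carrying that tag
counts = Counter()
with open("post_tag.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[int(row["TagId"])] += 1

with open("id_to_additional_info.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Id", "name", "PostCount"])
    for adj_id, nn_id in sorted(adj_to_nn.items(), key=lambda kv: kv[1]):
        writer.writerow([nn_id, names.get(adj_id, ""), counts[adj_id]])
```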
This is sufficient to start generating a map.
We provide an example dataset, which is a valid subset of all Stack Overflow tags. It was manually gathered from Data Explorer using this query.
These are 376 tags that share at least one question with the `logarithm` tag. We store the elements of the adjacency matrix in `logarithm.csv`, as well as a mapping from tag ids to names in `data_stackexchange_tags.csv`.
The process of visualizing the example dataset is very similar to the one described above.