---
title: "Skillability"
author: "Giovanni Azua Garcia - giovanni.azua@outlook.com"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  bookdown::pdf_book:
    includes:
      in_header: packages.sty
    df_print: kable
    keep_tex: yes
    number_sections: yes
    toc: yes
    toc_depth: 3
  word_document:
    toc: yes
    toc_depth: '3'
bibliography:
- bibliography.bib
description: HarvardX - PH125.9x Data Science Capstone
documentclass: report
fontsize: 11pt
geometry: a4paper,left=2.5cm,right=2.5cm,top=2.5cm,bottom=2.5cm
github-repo: https://github.com/bravegag/HarvardX-Skillability
link-citations: yes
lof: yes
lot: yes
mainfont: Lato
monofont: Hack
monofontoptions: Scale=0.7
colorlinks: yes
site: bookdown::bookdown_site
subtitle: HarvardX - PH125.9x Data Science Capstone
tags:
- data science
- machine learning
- recommender systems
- stochastic gradient descent
- nlp
- collaborative filtering
- stack-overflow
biblio-style: apalike
---
```{r initialization,echo=FALSE,message=FALSE}
##########################################################################################
## GLOBAL Initialization
##########################################################################################
# clean the environment
rm(list = ls())
##########################################################################################
## Install and load required library dependencies
##########################################################################################
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(boot)) install.packages("boot", repos = "http://cran.us.r-project.org")
if(!require(purrr)) install.packages("purrr", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
if(!require(tictoc)) install.packages("tictoc", repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate", repos = "http://cran.us.r-project.org")
if(!require(stringr)) install.packages("stringr", repos = "http://cran.us.r-project.org")
if(!require(doMC)) install.packages("doMC", repos = "http://cran.us.r-project.org")
if(!require(parallel)) install.packages("parallel", repos = "http://cran.us.r-project.org")
if(!require(microbenchmark)) install.packages("microbenchmark", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(ggmap)) install.packages("ggmap", repos = "http://cran.us.r-project.org")
if(!require(ggrepel)) install.packages("ggrepel", repos = "http://cran.us.r-project.org")
if(!require(scales)) install.packages("scales", repos = "http://cran.us.r-project.org")
if(!require(RColorBrewer)) install.packages("RColorBrewer", repos = "http://cran.us.r-project.org")
if(!require(Metrics)) install.packages("Metrics", repos = "http://cran.us.r-project.org")
if(!require(kableExtra)) install.packages("kableExtra", repos = "http://cran.us.r-project.org")
if(!require(here)) install.packages("here", repos = "http://cran.us.r-project.org")
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
##########################################################################################
## Setup initial global values
##########################################################################################
# register cores for parallel processing
ncores <- detectCores()
registerDoMC(ncores)
##########################################################################################
## knitr settings
##########################################################################################
# Trigger line numbering
knitr::opts_chunk$set(
class.source = "numberLines lineAnchors",
class.output = c("numberLines lineAnchors chunkout")
)
# knitr global settings - By default, the final document will not include source code unless
# expressly stated.
knitr::opts_chunk$set(
# Figure position hold
fig.pos = 'H',
# Chunks
eval = TRUE,
cache = TRUE,
echo = FALSE,
message = FALSE,
warning = FALSE,
# filepaths
fig.path = 'build/figure/graphics-',
cache.path = 'build/cache/graphics-',
# Graphics
out.width = "110%",
fig.align = "center",
# fig.height = 3,
# Text size
size = "small"
)
if (knitr::is_html_output()) {
knitr::opts_chunk$set(dev = "png")
} else {
knitr::opts_chunk$set(dev = "pdf")
}
# Modify the size of the code chunks
# https://stackoverflow.com/questions/25646333/code-chunk-font-size-in-rmarkdown-with-knitr-and-latex
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
x <- def.chunk.hook(x, options)
ifelse(options$size != "normalsize", paste0("\n \\", options$size, "\n\n", x, "\n\n \\normalsize"), x)
})
##########################################################################################
## Define important reusable functions e.g. the portable.set.seed(...)
##########################################################################################
# Portable set.seed function (across R versions) implementation
# @param seed the seed number
portable.set.seed <- function(seed) {
if (R.version$minor < "6") {
set.seed(seed)
} else {
set.seed(seed, sample.kind="Rounding")
}
}
# Returns the file path for the given object name.
#
# @param objectName the name of the object e.g. "Users"
# @param prefixDir the prefix directory where all data is stored e.g. "data"
# @param rdsDir the directory where the RDS files are located e.g. "rds"
# @param ext the extension for the RDS files i.e. ".rds"
# @returns the file path for the given dataset name.
filePathForObjectName <- function(objectName, prefixDir="data",
rdsDir="rds", ext=".rds") {
rdsFolder <- file.path(prefixDir, rdsDir)
if (!dir.exists(rdsFolder)) {
dir.create(rdsFolder, recursive = T)
}
fileName <- paste(objectName, ext, sep="")
filePath <- file.path(rdsFolder, fileName)
return(filePath)
}
# Returns the object (dataset or otherwise) by name, it will either load the dataset from an
# RDS file if it exists or download it from GitHub automatically. If downloaded then the file
# will be created in the expected location so that we won't be downloading it again.
#
# @param objectName the name of the dataset e.g. "Users"
# @param prefixDir the prefix directory where all data is stored e.g. "data"
# @param rdsDir the directory where the RDS files are located e.g. "rds"
# @param ext the extension for the RDS files i.e. ".rds"
# @param baseUrl the base GitHub url where the data is located.
# @param userName the GitHub user name e.g. "bravegag"
# @param repoName the GitHub repository name e.g. "HarvardX-Skillability"
# @param branchName the GitHub branch name e.g. "master"
# @returns the dataset by name.
readObjectByName <- function(objectName, prefixDir="data", rdsDir="rds", ext=".rds",
userName="bravegag", repoName="HarvardX-Skillability", branchName="master",
baseUrl="https://github.com/%s/%s/blob/%s/data/rds/%s?raw=true") {
tryCatch({
filePath <- filePathForObjectName(objectName = objectName, prefixDir = prefixDir,
rdsDir = rdsDir, ext = ext)
fileName <- basename(filePath)
if (!file.exists(filePath)) {
# download the file
url <- sprintf(baseUrl, userName, repoName, branchName, fileName)
cat(sprintf("downloading \"%s\"\n", url))
download.file(url, filePath, extra="L")
} else {
cat(sprintf("object \"%s\" exists, skipping download ...\n", filePath))
}
return(readRDS(filePath))
}, warning = function(w) {
cat(sprintf("WARNING - attempting to access or download the %s data:\n%s\n",
objectName, w))
file.remove(filePath)
return(NULL)
}, error = function(e) {
cat(sprintf("ERROR - attempting to access or download the %s data:\n%s\n",
objectName, e))
file.remove(filePath)
return(NULL)
}, finally = {
# nothing to do here
})
}
# Saves the object (dataset or otherwise) by name, the required folders will be
# created if they don't already exist.
#
# @param object the object e.g. tibble or data frame
# @param objectName the name of the object e.g. "Users"
# @param prefixDir the prefix directory where all data is stored e.g. "data"
# @param rdsDir the directory where the RDS files are located e.g. "rds"
# @param ext the extension for the RDS files i.e. ".rds"
saveObjectByName <- function(object, objectName, prefixDir="data",
rdsDir="rds", ext=".rds") {
filePath <- filePathForObjectName(objectName = objectName, prefixDir = prefixDir,
rdsDir = rdsDir, ext = ext)
saveRDS(object=object, file=filePath)
}
# Pretty prints the given tibble using kable
# @param t the tibble to print
# @param latex_options passed to kable
# @param caption the table caption
# @param booktabs the booktabs passed to kable
prettyPrint <- function(t, latex_options = c("striped", "scale_down"),
caption = NULL, booktabs = T) {
return(kable(t, "latex", caption = caption, booktabs = booktabs) %>%
kable_styling(position = "center", latex_options = latex_options) %>%
row_spec(0, bold = T))
}
```
```{css echo=FALSE}
code {
font-family: Hack, monospace;
font-size: 85%;
padding: 0;
padding-top: 0.2em;
padding-bottom: 0.2em;
background-color: rgba(0,0,0,0.04);
border-radius: 3px;
}
code:before,
code:after {
letter-spacing: -0.2em;
content: "\00a0";
}
```
# Licenses, Terms of Service, Privacy Policy, and Disclaimer {-}
**Data license**: The dataset files downloaded, extracted, transformed, assembled or processed in any form as part of this project are derived from Stack Overflow^[https://archive.org/details/stackexchange]'s data dumps and retain their original license: Attribution-Share Alike 4.0 International (CC BY-SA 4.0)^[https://creativecommons.org/licenses/by-sa/4.0/].
\
**Code license**: The code delivered as part of this project is licensed under the GNU Affero General Public License (AGPL v3)^[https://www.gnu.org/licenses/agpl-3.0.en.html].
\
**Terms of Service and Privacy Policy**: In this project we use anonymised user data and the Google Maps API; therefore we're also bound by [Google's Terms of Service](https://www.google.com/intl/en/policies/terms) and [Google's Privacy Policy](https://www.google.com/policies/privacy).
\
**DISCLAIMER Third-Party Trademark Notice**: All third-party trademarks referenced in this report (i.e. tags or skills), whether in logo form, name form or product form, or otherwise, remain the property of their respective owners, and are used here only to refer to the credentials and proficiency of the Stack Overflow users in using the technology, products or supporting solutions. The use of these trademarks in no way indicates any relationship between the author of this report and their respective owners. The description of the capabilities regarding any of the listed trademarks does not imply any relationship, affiliation, sponsorship or endorsement, and reference to them shall be considered nominative fair use under trademark law.
# Introduction {-}
If you have ever programmed, faced a technical question and "Googled it", you have in all likelihood already landed on the Stack Overflow^[https://www.stackoverflow.com] site. Stack Overflow is a platform aimed at programmers of all levels for asking and answering technical questions. The platform was created in 2008^[https://en.wikipedia.org/wiki/Stack_Overflow] and is the most popular site in the Stack Exchange Network^[https://stackexchange.com/sites#]. The platform offers multiple features, the most popular one being the ability to upvote and downvote user posts (either questions or answers), contributing to the posting user's overall reputation. Based on reputation, the platform grants users different privileges such as the ability to vote, comment and edit posts. Furthermore, users are awarded "badges" (i.e. achievements) relating to their overall reputation, answers, questions or even their frequency of use of the site. The author of this work has been an avid user of the platform^[https://stackoverflow.com/users/1142881/] since its inception and has used it more for asking than for answering technical questions. One outcome of the present work is to more strongly consider using the site to answer questions rather than just to ask them.
\
Luckily for us, the Stack Overflow data is available for download^[https://archive.org/details/stackexchange] in different formats, opening the door to truly interesting data analysis with many applications in the recruitment industry, such as: resume (or CV) compilation^[https://www.kickstarter.com/projects/1647975128/one-thousand-words-cv-1kwcv/], job candidate shortlisting, job candidate assessment, staff skills assessment, technology trends analysis, and many more.
## Project goals {-}
We first focus on delivering a fully automated method and R code to download, extract, process and clean the Stack Overflow dataset, directly applicable to any other Stack Exchange Network site data. The dataset we compiled is **anonymised**, i.e. only the artificial integer key `userId` is stored. After a basic exploratory data analysis of the full dataset we move to the following data analysis use-cases. The first two objectives are covered as part of the [Data exploration and visualization] section. The last objective is covered in the [Modeling approach] and [Method implementation] sections.
### The What: skills and technology trends {-}
Questions in Stack Overflow contain one or more tags, e.g. [java](https://stackoverflow.com/questions/tagged/java). For professionals who work in programming these tags are skills, and would be listed as such in a resume (or CV)^[https://www.dropbox.com/s/6t7mq5zcztarah4/1kwcv_prototype_Giovanni.pdf?dl=1]. Therefore, in this analysis step we'd like to find the top skills by frequency of use, establish a proximity measure and discover skill groups. We're also interested in discovering the major technology trends and their importance: in other words, how those skill clusters have evolved over time.
\
To this end, the top tags (or skills) are first selected. The key modeling approach (in NLP terms) is to view questions as "documents" and skills as "words", and to count how many times skills occur pair-wise in the same questions, generating a skills co-occurrence matrix. We then compute Principal Component Analysis (PCA) on the scaled co-occurrence matrix. The first two principal components reveal the skill groups that explain most of the variance in the data, or as we prefer to call them, the main technology trends. We visualize the top and bottom ends of these two principal components. The remaining PCA components reveal other skill groups.
\
This first analysis step empowers us to answer many practical questions, for example:
* As a programmer: what are the main technology trends at the moment and which one shall I invest learning on?
* As a company: we'd like to build a new product, which technology stack should we use? Review the top components for the most popular stacks in the required area.
* As a resume (or CV) compilation service: suggest candidates with skills they may have overlooked to include in their CV and are among the most important e.g. when listing `java`, suggest also `java-ee`.
Note that applying this method in a rolling time window fashion (e.g. every year, compute this analysis over a three-year window ending at the given date) will reveal the industry's changes in technology trends over time.
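The co-occurrence-plus-PCA idea above can be sketched on a toy example as follows. The `questions` tibble here is a made-up stand-in with a pipe-separated `tags` column; this is an illustration of the technique, not the project's implementation:
```{r pca-cooccurrence-sketch, echo=TRUE, eval=FALSE}
library(dplyr)
library(tidyr)

# hypothetical questions with pipe-separated tags
questions <- tibble::tibble(
  questionId = 1:4,
  tags = c("java|java-ee", "r|ggplot2", "java|spring", "r|dplyr|ggplot2"))

# one row per (question, tag) pair
long <- questions %>%
  separate_rows(tags, sep = "\\|") %>%
  rename(tag = tags)

# co-occurrence counts: tag pairs appearing in the same question
pairs <- long %>%
  inner_join(long, by = "questionId", suffix = c(".a", ".b")) %>%
  filter(tag.a != tag.b) %>%
  count(tag.a, tag.b)

# wide co-occurrence matrix, then PCA on the scaled counts
co <- pairs %>%
  pivot_wider(names_from = tag.b, values_from = n, values_fill = 0)
m <- as.matrix(co[, -1])
rownames(m) <- co$tag.a
pca <- prcomp(m, scale. = TRUE)
head(sort(pca$rotation[, 1]))  # one end of the first component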
### The Where: putting it in geographical context {-}
Here we search for all Stack Overflow users in Switzerland, find their top technology skills by looking into the tags linked to their top answers by score (or otherwise their top questions by score), and link these top skills to the top technology trends revealed in the first few PCA components of the previous analysis. We then use Google's [Maps Static API](https://developers.google.com/maps/documentation/maps-static/) and [Geocoding API](https://developers.google.com/maps/documentation/geocoding/) to extract a map of Switzerland and compute the user locations (i.e. longitude and latitude) respectively. Finally, we put the main technology trends revealed by the previous analysis into geographical context. Note that due to the [Google Maps Platform Terms of Service](https://cloud.google.com/maps-platform/terms/#3-license) the geocoding results may not be cached; therefore, to execute and reproduce the results in this section you'd need a valid Google API key (see [Get an API Key](https://developers.google.com/places/web-service/get-api-key)) and make it available in your environment as `GOOGLE_API_KEY`. However, the few API calls needed will easily fit cost-free within a free trial of the Google Maps API.
\
You may wonder, why Switzerland? Because it's where the author lives, and because Switzerland is a relatively small country, which helps keep the geocoding costs low: we need to call Google's Geocoding API for every^[Actually, for every distinct user location, i.e. about four hundred.] user located in Switzerland. This second step enables answering very practical questions, for example:
* As a programmer: which locations should I consider to find jobs that match my main areas of expertise?
* As a company: where do we find relevant partners and support on the technology areas we need?
* As a recruitment company: where should we look for talent?
\
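For illustration, the geocoding and map extraction steps might look like the following sketch using the `ggmap` package. The location strings are made up, and the API calls only run when `ggmap` is installed and a `GOOGLE_API_KEY` is present in the environment, since results may not be cached under Google's terms:
```{r geocode-sketch, echo=TRUE, eval=FALSE}
# free-text locations, as users would provide them (illustrative values)
locations <- c("Zurich, Switzerland", "Geneva, Switzerland")

if (requireNamespace("ggmap", quietly = TRUE) &&
    nzchar(Sys.getenv("GOOGLE_API_KEY"))) {
  library(ggmap)
  library(ggplot2)
  register_google(key = Sys.getenv("GOOGLE_API_KEY"))
  # Geocoding API: free text -> tibble with lon/lat columns
  coords <- geocode(locations)
  # Maps Static API: fetch a map of Switzerland, overlay the locations
  ch_map <- get_googlemap(center = "Switzerland", zoom = 7)
  print(ggmap(ch_map) +
          geom_point(data = coords, aes(x = lon, y = lat), color = "red"))
}
```

The real analysis geocodes each distinct `location` string from the `users` dataset instead of hard-coded values.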
### The How: rating user skills {-}
The author of this work has in the past applied to job listings describing requirements that include one or a few skills for which he had no previous experience. For example, he was recently rejected while applying for a position that required knowledge of Tableau^[See https://www.tableau.com/]. Indeed, he had no previous experience with this particular skill so it wasn't listed in his CV; however, he has extensive experience in data visualization using e.g. `d3.js` and `ggplot2`. His `sql` skill level is well above average too; he therefore intuitively felt that he would nevertheless have been a great match for that position, which is why he applied in the first place.
\
This project reminded the author of another anecdote, about a colleague who was very unsure which graduate program to pursue. Surprisingly, he decided to ask the admissions secretary, who recommended that he would be better off going for a master's in Computational Biology, and so he did. If we would ask someone without specialised scientific knowledge, could we not also trust a machine learning model to recommend what future career to pursue given our skill ratings?
\
In this final step a user skill `ratings` dataset is derived, constructed and modeled to solve the ultimate task covered in this project: predicting how good a candidate would be at a skill for which there is no previous evidence. That is, we present and implement a recommender system to predict user skill ratings using the collaborative filtering (CF) approach. More specifically, we apply the model-based approach using low-rank matrix factorization (LRMF) and two implementation variants of the stochastic gradient descent (SGD) algorithm to fit this model. The first is based on the classic SGD with multiple improvements for faster and better convergence, e.g. laying out the `P` and `Q` matrices so that all matrix operation workloads align with R's column-major default matrix order. The second is an R-based lock-free parallel multi-core SGD variation of the first, featuring a speed-up of roughly 2x with comparably high-quality out-of-sample RMSE results and potential for higher speed-ups. However, we first navigate through a simpler baseline model implementation based on isolating the different biases or effects, where we'll outline some really interesting findings.
\
What used to be just an intuitive feeling was confirmed by the CF model employed in this project, as it predicted the author's rating on `tableau` to be well above average. What's the moral of the story? Hiring personnel could become more efficient by broadening their search through related skills that don't readily match the job requirements. They can do so by consulting a machine learning model like the one we've built in this project.
\
This last analysis enables us to answer many practical questions too, for example:
* As a programmer: given my current skill ratings, in what technologies am I predicted to perform above average?
* As a company: can we reorganize and optimize our skills distribution per department without firing or hiring anyone?
* As a recruitment company: candidate X doesn't explicitly list required skill Y in her resume; however, our model predicted her to be a perfect match for that job.
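To make the LRMF-plus-SGD idea concrete, here is a minimal, unoptimized sketch on synthetic data. It is not the project's implementation (which adds biases, tuning and the parallel variant); note how `P` and `Q` hold one column per user and per skill, so the inner products walk down columns in line with R's column-major layout:
```{r sgd-lrmf-sketch, echo=TRUE, eval=FALSE}
set.seed(1)
n_u <- 5; n_s <- 4; k <- 2                   # users, skills, latent factors
ratings <- expand.grid(u = seq_len(n_u), s = seq_len(n_s))
ratings$r <- runif(nrow(ratings), 1, 5)      # synthetic ratings

# one column per user / per skill: column-major friendly layout
P <- matrix(rnorm(k * n_u, sd = 0.1), nrow = k)
Q <- matrix(rnorm(k * n_s, sd = 0.1), nrow = k)
gamma  <- 0.05   # learning rate
lambda <- 0.01   # regularization

for (epoch in 1:200) {
  for (i in sample(nrow(ratings))) {
    u <- ratings$u[i]; s <- ratings$s[i]
    e <- ratings$r[i] - sum(P[, u] * Q[, s])            # prediction error
    P[, u] <- P[, u] + gamma * (e * Q[, s] - lambda * P[, u])
    Q[, s] <- Q[, s] + gamma * (e * P[, u] - lambda * Q[, s])
  }
}
pred <- colSums(P[, ratings$u] * Q[, ratings$s])        # r-hat = p_u . q_s
rmse <- sqrt(mean((ratings$r - pred)^2))                # training RMSE
```

The parallel variant splits the rating triples across cores and lets each core run these same updates without locks, accepting occasional stale reads of `P` and `Q`.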
# Data analysis
## Data import
The script `create_dataset.r` includes the code implementation to automatically download, extract, parse, clean and construct the complete dataset, which is stored in `rds` format. It will also construct the derived `ratings.rds` dataset. The script is quite long and only the most important points will be covered; however, it's well structured and commented. Running `create_dataset.r` for the first time may take several hours and requires large amounts of free disk space. Furthermore, the script must run in a Unix-like^[Because of the dependency on the package `doMC`, which isn't available for Windows.] environment^[It was tested in Ubuntu 18.04 with 32GB RAM, a 6-core Intel i7-4960X and an SSD drive.] that has the following tools available: `wc`, `split`, `awk`, `7z`, `rename`, `mv`, `grep` and `time`. The script will automatically create and populate the relative folders `data/7z/*`, `data/xml/*` and `data/rds/*`, containing respectively the downloaded files shown in Table \@ref(tab:7z-files-summary), the extracted XML files and the final `rds` dataset files^[Available in the project's GitHub page: https://github.com/bravegag/HarvardX-Skillability]. Note that running `create_dataset.r` **is optional** as the final `rds` files are readily available under the relative folder `data/rds/*` in the project's GitHub repository https://github.com/bravegag/HarvardX-Skillability.
\
```{r 7z-files-summary, fig.pos="H"}
local({
tbl <- data.frame(
Name = c("`stackoverflow.com-Badges.7z`",
"`stackoverflow.com-Posts.7z`",
"`stackoverflow.com-Tags.7z`",
"`stackoverflow.com-Users.7z`"),
Compressed = c("254.5MB",
"15.3GB",
"817.0KB",
"529.3MB"),
Uncompressed = c("4.0GB",
"76.5GB",
"5.1MB",
"3.7GB"),
Description = c("All badge assignments.",
"All the question and answer posts.",
"All the tags.",
"All the users.")
)
kable(tbl, "latex", caption = "Summary of Stack Overflow 7z dataset files", booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(4, width = "6cm") %>%
row_spec(0, bold = T)
})
```
The extracted XML files were up to 76.5GB in size. Several methods were tested to load, parse and extract the data from such huge files; the best solution we found was a combination of the following points^[See functions `downloadExtractAndProcessXml(...)` and `extractDataFromXml2(...)`]:
1. Splitting the huge files into smaller ones (split into as many files as there are cores available), loading and parsing the files in parallel. Note that the split files are temporarily written to the relative `data/xml` directory which may therefore grow in size during the process.
2. Using `readr::read_lines_chunked` to read chunks of XML, keeping the memory footprint low as each core will process a bounded chunk of XML.
3. The trick to turn these smaller XML chunks of rows into a valid XML was to wrap them within `<xml>...</xml>` tags^[Credits given to the answer of https://stackoverflow.com/questions/59329354 for coining the idea.].
4. Finally use the package `xml2` for parsing, extracting and consolidating the data into tibbles which are later stored as `rds` files.
The function `extractDataFromXml2(...)` is generic and can handle any Stack Overflow XML data file. The key was to feature a `mapping` parameter that identifies which XML attributes to read and what column names they should be mapped to in the resulting tibble.
\
Note that the huge XML file containing all posts first needs to be segregated into questions and answers. We do so by grepping for attributes that occur only in one of the two, e.g. only questions contain the attribute `AnswerCount`, so we run `system(command=sprintf("time grep \"AnswerCount\" %s/Posts.xml > %s/Questions.xml", xmlDir, xmlDir))`.
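The wrap-in-`<xml>` trick from point 3 can be illustrated in isolation. The `<row/>` lines and the attribute mapping below are simplified stand-ins for the real Posts data:
```{r xml-wrap-sketch, echo=TRUE, eval=FALSE}
library(xml2)

# two raw attribute-only rows, as a chunked reader would deliver them
chunk <- c('<row Id="1" Score="42" Tags="&lt;java&gt;&lt;java-ee&gt;"/>',
           '<row Id="2" Score="7" Tags="&lt;r&gt;"/>')

# wrapping the chunk in <xml>...</xml> yields a valid, parseable document
doc  <- read_xml(paste0("<xml>", paste(chunk, collapse = ""), "</xml>"))
rows <- xml_find_all(doc, ".//row")

# apply a mapping of XML attribute names to tibble columns
parsed <- tibble::tibble(
  questionId = as.integer(xml_attr(rows, "Id")),
  score      = as.integer(xml_attr(rows, "Score")),
  tags       = xml_attr(rows, "Tags"))
```

In the real `extractDataFromXml2(...)` this mapping is a parameter, which is what makes the function generic over any Stack Overflow XML data file.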
## Data cleaning
The data cleaning steps are also covered as part of the `create_dataset.r` implementation. The cleaning process removes rows with missing important XML attributes, e.g. answer posts with a missing "foreign key" `questionId`. Several data transformations are applied too, e.g. the questions attribute `tags` has HTML entity separators which are transformed into pipe-separated values^[See `create_dataset.r` lines #548 and #549.]. The cleaning outcome is briefly summarized in Table \@ref(tab:data-cleaning-summary).
```{r data-cleaning-summary, fig.pos="H"}
local({
tbl <- data.frame(
Dataset = c("tags",
"users",
"badges",
"questions",
"answers"),
Before = c("56.5k rows",
"~11.37m rows",
"~12.59m rows",
"~18.59m rows",
"~28.25m rows"),
After = c("56.5k rows",
"~200k rows",
"~12.59m rows",
"~5.39m rows",
"~4.83m rows"),
Description = c("Unchanged.",
"Keep only users with reputation greater than 999 or who are located in Switzerland.",
"Unchanged.",
"Keep only questions answered or asked by the selected users, in the latter case with a score greater than 0.",
"Keep only answers written by the selected users, with a score greater than 2 and a valid answerId.")
)
kable(tbl, "latex", caption = "Summary of the data cleaning process", booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(4, width = "7cm") %>%
row_spec(0, bold = T)
})
```
We noticed that only 1.75% of the users are actually active on the platform. The vast majority only create a handful of posts and use the site in "read-only" mode, i.e. not producing any posts. Read-only users are not interesting for the different analyses because, although they contribute to voting, we consider their lack of questions and answers equivalent to `NA`s; they were therefore excluded.
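For illustration, the user selection rule in Table \@ref(tab:data-cleaning-summary) boils down to a filter of the following shape (the `users_raw` tibble here is made up):
```{r user-filter-sketch, echo=TRUE, eval=FALSE}
library(dplyr)
library(stringr)

users_raw <- tibble::tibble(
  userId     = 1:4,
  reputation = c(1500, 10, 2500, 50),
  location   = c("Berlin, Germany", "Zurich, Switzerland", NA, "Paris"))

# keep users with reputation > 999 or located in Switzerland
active <- users_raw %>%
  filter(reputation > 999 |
           str_detect(coalesce(location, ""), "Switzerland"))
```

The real script applies this filter to the full ~11.37m-row users table, which is what shrinks it to roughly 200k rows.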
## Data exploration and visualization
### Description of the dataset
We load the bundled `rds` data files using the following code:
```{r load-data-files,echo=TRUE,message=FALSE}
# load the Users, Questions, Answers, Badges and Tags data files
users <- readObjectByName("Users")
questions <- readObjectByName("Questions")
answers <- readObjectByName("Answers")
badges <- readObjectByName("Badges")
tags <- readObjectByName("Tags")
```
The `users` dataset depicted in Table \@ref(tab:structure-users) includes all the users we have selected for analysis. Each row is uniquely identified by the `userId` key, which is used to link with other tables. The column `creationDate` timestamp represents the time when the user account was created. The column `location` is provided by users as free text, which we use later as input to the Google Geocoding API for generating geographic coordinates. Finally, we will work extensively with the `reputation` column, which represents the accrued user reputation as the weighted sum of all post upvotes minus the downvotes. In the `questions` and `answers` datasets the column `score` is equivalent to a per-post `reputation`:
```{r structure-users, echo = TRUE, message = TRUE, fig.pos="H"}
prettyPrint(head(glimpse(users)),
caption = "Users dataset structure")
```
The `questions` dataset depicted in Table \@ref(tab:structure-questions) contains all the questions we have selected, and each row is uniquely identified by the `questionId` key, which is used to link with other tables. Note that the `tags` column will be used extensively in this work; during the data cleaning phase it was already converted from HTML-encoded entities to pipe-separated values. The `acceptedAnswerId` column identifies each question's accepted answer, which is designated by the asking user:
```{r structure-questions, echo = TRUE, message = TRUE, fig.pos="H"}
prettyPrint(head(glimpse(questions)),
latex_options = c("striped", "scale_down"),
caption = "Questions dataset structure")
```
The `answers` dataset depicted in Table \@ref(tab:structure-answers) contains all the answers, also uniquely identified by the `answerId` key. Note that we can determine the tags, or skills, linked to an answer by joining `answers` with `questions` on the `questionId` column:
```{r structure-answers, echo = TRUE, message = TRUE, fig.pos="H"}
prettyPrint(head(glimpse(answers)),
latex_options = c("striped", "scale_down"),
caption = "Answers dataset structure")
```
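The join described above can be sketched as follows; this chunk is illustrative only (not evaluated) and assumes the column names documented in this section:
```{r answers-skills-join-sketch, echo=TRUE, eval=FALSE}
# illustrative sketch (not evaluated): attach the skills to each answer by
# joining `answers` to `questions` on `questionId` and splitting the
# pipe-separated `tags` column into one row per skill
answers %>%
  inner_join(questions %>% select(questionId, tags), by = "questionId") %>%
  separate_rows(tags, sep = "\\|") %>%
  rename(skill = tags) %>%
  select(answerId, userId, skill)
```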
The `badges` dataset depicted in Table \@ref(tab:structure-badges) contains the user badge assignments (linked via the `userId` foreign key). For example, "Populist"^[See https://stackoverflow.com/help/badges/62/populist] is one of the hardest badges to earn: it requires posting an answer that outscores an already accepted answer which itself scored more than ten, by more than two times the accepted answer's score:
```{r structure-badges, echo = TRUE, message = TRUE, fig.pos="H"}
prettyPrint(head(glimpse(badges)),
latex_options = c("striped"),
caption = "Badges dataset structure")
```
Finally, the `tags` dataset depicted in Table \@ref(tab:structure-tags) contains all the unique tags along with their use counts. We use the names tags and skills interchangeably in this project:
```{r structure-tags, echo = TRUE, message = TRUE, fig.pos="H"}
prettyPrint(head(glimpse(tags)),
latex_options = c("striped"),
caption = "Tags dataset structure")
```
### Quick exploration
Let's explore some interesting facts from the data we have, namely: the top ranking question, answer, user and tags (i.e. skills), plus the ten rarest gold badges. The top ranking answer belongs to the top ranking question, and they relate to `c++`, `performance` and code `optimization`. The top ten gold badges reveal that being awarded a "Great Question"^[See https://stackoverflow.com/help/badges/22/great-question] is harder than a "Great Answer"^[See https://stackoverflow.com/help/badges/25/great-answer], and it's no wonder why, since answers receive on average twice as many upvotes as questions:
```{r exploration-basic,echo=TRUE,message=TRUE}
# what's the question with highest score?
prettyPrint(
questions %>%
top_n(1, score) %>%
select(questionId, acceptedAnswerId, tags, score, answerCount, favoriteCount,
viewCount)
)
# what's the answer with highest score?
prettyPrint(
answers %>%
top_n(1, score) %>%
select(answerId, questionId, score, commentCount, creationDate)
, latex_options = c("striped"))
# what's the top user?
prettyPrint(
users %>%
top_n(1, reputation) %>%
select(userId, reputation, creationDate, location, upvotes, downvotes)
)
# what are the top ten tags / skills?
prettyPrint(
topTenSkills <- tags %>%
top_n(10, count) %>%
arrange(desc(count)) %>%
rename(skill=tag)
, latex_options = c("striped"))
# what are the ten gold badges hardest to get, i.e. awarded to the fewest users?
prettyPrint(
badges %>%
filter(class == "gold") %>%
group_by(badge) %>%
summarise(awarded = n()) %>%
top_n(10, -awarded) %>%
arrange(awarded)
, latex_options = c("striped"))
# compare the average scores i.e. upvotes for answers vs. questions
prettyPrint(
questions %>%
summarise(postType='question',avg_score=mean(score)) %>%
bind_rows(answers %>%
summarise(postType='answer',avg_score=mean(score)))
, latex_options = c("striped"))
```
In the following listing we compute the mean and median statistics for some columns of interest. They reveal the highly skewed nature of the data (the mean is far from the median in most cases):
```{r exploration-statistics,echo=TRUE,message=TRUE}
# what's the average user reputation?
prettyPrint(
users %>%
summarise(median=median(reputation), mean=mean(reputation))
, latex_options = c("striped"))
# what's the average number of questions per user?
prettyPrint(
questions %>%
group_by(userId) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(median=median(n), mean=mean(n))
, latex_options = c("striped"))
# what's the average number of answers per user?
prettyPrint(
answers %>%
group_by(userId) %>%
summarise(n = n()) %>%
ungroup() %>%
summarise(median=median(n), mean=mean(n))
, latex_options = c("striped"))
# what's the average number of answers per question?
prettyPrint(
questions %>%
summarise(median=median(answerCount), mean=mean(answerCount))
, latex_options = c("striped"))
```
In Figure \@ref(fig:histogram-reputation) we apply the `log10` transformation^[We preferred `log10` for reputation because it is an order-preserving transformation and easier to interpret than the natural `log`.] to the users' reputation and plot its histogram; the plot confirms that the user reputation is positively skewed. Recall that our user selection criteria were: users with reputation greater than 999 or located in Switzerland. Some users located in Switzerland therefore fall to the left of $\text{log10}(999) \approx 3$, and we exclude those:
```{r histogram-reputation,echo=TRUE,message=TRUE,fig.cap="Histogram of users reputation",fig.pos="H"}
users %>%
filter(reputation > 999) %>%
mutate(reputation=log10(reputation)) %>%
ggplot(aes(reputation)) +
geom_histogram(bins = 200, colour="#377EB8", fill="#377EB8") +
xlab("log10 reputation") +
theme(plot.title = element_text(hjust = 0.5),
legend.text = element_text(size=12),
axis.text.x = element_text(angle = 45, hjust = 1))
```
Now, if we split the user reputations per badge^[We chose only the badges relevant for our analysis, see https://stackoverflow.com/help/badges] then the histograms look a bit nicer, i.e. no longer so skewed, but still asymmetrical and far from a normal distribution:
```{r histogram-reputation-facet, fig.width=10, fig.height=10, fig.fullwidth=TRUE, echo=TRUE,message=TRUE,fig.cap="Histogram of users reputation per badge",fig.pos="H"}
# histogram of the log10-transformed users reputation per badge,
# excluding users with reputation of 999 or less
users %>%
filter(reputation > 999) %>%
mutate(reputation=log10(reputation)) %>%
inner_join(badges %>%
select(userId, badge) %>%
filter(badge %in% c("Populist", "Great Answer", "Guru", "Great Question",
"Good Answer", "Good Question", "Nice Answer",
"Nice Question")), by="userId") %>%
mutate(badge=factor(badge, levels=c("Populist", "Great Answer", "Guru",
"Great Question", "Good Answer", "Good Question",
"Nice Answer", "Nice Question"))) %>%
ggplot(aes(reputation, group=badge, color=badge, fill=badge)) +
geom_histogram() +
xlab("log10 reputation") +
theme(legend.position="bottom", legend.text=element_text(size=3)) +
theme(plot.title = element_text(hjust = 0.5),
legend.text = element_text(size=12)) +
facet_wrap(~badge)
```
### The What: skills and technology trends
In the following listing we look for the main technology trends and how skills group together. To this end we first select the top 2000 skills by frequency of tagging, compute their pairwise co-occurrence matrix and run PCA on it.
```{r the-what-co-occurrence-pca,echo=TRUE,message=FALSE}
# select the top 2k tags/skills by count
mainSkills <- tags %>%
top_n(2000, count) %>%
rename(skill=tag) %>%
arrange(desc(count))
# what's the proportion to the total?
100*sum(mainSkills$count)/sum(tags$count)
# select a smaller questions subset matching the main tags
# to get the results faster ...
questionSkills <- questions %>%
filter(score > 9 & viewCount > 99 & answerCount > 1)
# takes ~35s
tic(sprintf('separating rows with %d', nrow(questionSkills)))
questionSkills <- questionSkills %>%
select(questionId, tags) %>%
separate_rows(tags, sep="\\|") %>%
rename(skill=tags) %>%
inner_join(mainSkills, by="skill") %>%
arrange(desc(count)) %>%
select(questionId, skill)
toc()
# takes ~15m if TRUE
if (FALSE) {
tic(sprintf('computing co-occurrence matrix with %d question-skill',
nrow(questionSkills)))
X <- crossprod(table(questionSkills[1:2]))
diag(X) <- 0
toc()
saveObjectByName(X, "XCo-occurrence")
}
X <- readObjectByName("XCo-occurrence")
# how sparse is it?
sum(X == 0)/(dim(X)[1]^2)
# compute PCA
pca <- prcomp(X)
```
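To make the `crossprod(table(...))` step above more concrete, here is a toy example on a hypothetical two-question dataset; the resulting matrix counts, for each pair of skills, how many questions carry both tags:
```{r co-occurrence-toy-example, echo=TRUE}
# toy question-skill table: question 1 is tagged r + dplyr,
# question 2 is tagged r + ggplot2 + dplyr
toy <- tibble(questionId = c(1, 1, 2, 2, 2),
              skill      = c("r", "dplyr", "r", "ggplot2", "dplyr"))
Xtoy <- crossprod(table(toy))
# zero the diagonal to drop self-co-occurrence, as in the analysis above
diag(Xtoy) <- 0
Xtoy  # "r" and "dplyr" co-occur in 2 questions, every other pair in 1
```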
Figure \@ref(fig:the-what-pca-variability) depicts the cumulative variability explained up to each principal component. We see that the first four components explain 50% of the variance, and only the first 30 components are required to explain ~95% of the variance:
```{r the-what-pca-variability,echo=TRUE,message=FALSE,fig.cap="Variance explained up to each principal component",fig.pos="H"}
# let's consider the first 50 components only
pc <- 1:50
# plot the variability explained
var_explained <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
qplot(pc, var_explained[pc])
```
```{r the-what-prepare-visualization,echo=FALSE,message=FALSE}
# create tibble containing the first four principal components
pcs <- tibble(skill = rownames(pca$rotation), PC1=pca$rotation[,"PC1"],
PC2=pca$rotation[,"PC2"])
# highlight the top ten tags
pcs <- pcs %>%
mutate(fontface=ifelse(skill %in% (topTenSkills %>% pull(skill)),
'bold', 'plain'))
technologies <- c("Blockchain, Cloud, Build & Data Viz",
"Full Stack",
"Web Frontend & Mobile",
"Microsoft Stack",
"Python & C++",
"Software Engineering",
"Javascript",
"iOS Stack",
"Other")
# choose 9 colors: 2x4 components plus everything else
colorPalette <- RColorBrewer::brewer.pal(name='Set1', n=9)
colorSpec <- colorPalette[1:9]
names(colorSpec) <- technologies
# maximum tags to choose in each direction
M <- 25
highlight <- pcs %>%
arrange(PC1) %>%
slice(1:M) %>%
mutate(Technology=technologies[1], pc=1)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(desc(PC1)) %>%
slice(1:M) %>%
mutate(Technology=technologies[2], pc=1) %>%
bind_rows(highlight)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(PC2) %>%
slice(1:M) %>%
mutate(Technology=technologies[3], pc=2) %>%
bind_rows(highlight)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(desc(PC2)) %>%
slice(1:M) %>%
mutate(Technology=technologies[4], pc=2) %>%
bind_rows(highlight)
nonHighlight <- pcs %>%
anti_join(highlight, by="skill") %>%
mutate(Technology=technologies[9])
# switch to the 3rd and 4th PCA components
pcs <- tibble(skill = rownames(pca$rotation), PC3=pca$rotation[,"PC3"],
PC4=pca$rotation[,"PC4"])
# highlight the top ten tags
pcs <- pcs %>%
mutate(fontface=ifelse(skill %in% (topTenSkills %>% pull(skill)), 'bold', 'plain'))
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(PC3) %>%
slice(1:M) %>%
mutate(Technology=technologies[5], pc=3) %>%
bind_rows(highlight)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(desc(PC3)) %>%
slice(1:M) %>%
mutate(Technology=technologies[6], pc=3) %>%
bind_rows(highlight)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(PC4) %>%
slice(1:M) %>%
mutate(Technology=technologies[7], pc=4) %>%
bind_rows(highlight)
highlight <- pcs %>%
anti_join(highlight, by="skill") %>%
arrange(desc(PC4)) %>%
slice(1:M) %>%
mutate(Technology=technologies[8], pc=4) %>%
bind_rows(highlight)
# plot the components in log scale
highlightLog <- highlight %>%
mutate(PC1=sign(PC1)*log10(abs(PC1)),
PC2=sign(PC2)*log10(abs(PC2)))
nonHighlightLog <- nonHighlight %>%
mutate(PC1=sign(PC1)*log10(abs(PC1)),
PC2=sign(PC2)*log10(abs(PC2)))
```
Figure \@ref(fig:the-what-visualization) reveals the skill groups that explain most of the variance in the data, or, as we prefer to call them, the main technology trends. The top ten skills are highlighted in bold. The top and bottom ends of the first principal component reveal "Blockchain, Cloud, Build and Data Visualization" (in red) and "Full Stack" (in blue) respectively, while the top and bottom ends of the second principal component reveal "Web Frontend & Mobile" (in green) and "Microsoft Stack" (in purple) respectively. Note that the technology trends were named (e.g. "Microsoft Stack") after reviewing all the skills found in those segments and assigning a more general conceptual label, but the resulting groupings are not exact: e.g. `c++` appears in the second component "Microsoft Stack", while the third component groups together mostly "Python & C++" skills, e.g. `python`, `c++11`, `stl`, `boost`, `qt`, `visual-c++`, etc.
\
The skill clusters are indeed very interesting. For example, at the top end of the first principal component, depicted in red, the link between cloud and build tools is clear: most cloud technologies are related to, and require, building and deploying software. It would then seem that blockchain software is linked to deploying software in the cloud. Likewise, there seems to be a link between software deployment, cloud technologies, report generation and data visualization.
```{r the-what-visualization,echo=TRUE,message=FALSE,fig.cap="Top skills in each direction of the first two Principal Components PC1 and PC2",fig.pos="H"}
portable.set.seed(1)
highlightLog %>%
filter(pc %in% c(1, 2)) %>%
ggplot(aes(PC1, PC2, label=skill, colour=Technology)) +
geom_jitter(alpha = 0.4, size = 2) +
theme(legend.position="bottom", plot.title = element_text(hjust = 0.5),
legend.text=element_text(size=6), legend.title = element_blank()) +
guides(fill = guide_legend(nrow=2)) +
xlab(sprintf("sign(PC1) x log10|PC1| - Variance explained %d%%",
round(100*pca$sdev[1]^2 / sum(pca$sdev^2)))) +
ylab(sprintf("sign(PC2) x log10|PC2| - Variance explained %d%%",
round(100*pca$sdev[2]^2 / sum(pca$sdev^2)))) +
geom_text_repel(aes(fontface=fontface), segment.alpha = 0.3, size = 3,
force = 7, nudge_x = 0.1, nudge_y = 0.1, seed = 1) +
scale_colour_manual(values = colorSpec) +
scale_x_continuous(limits=c(-8, 8)) +
scale_y_continuous(limits=c(-8, 8)) +
geom_jitter(data = nonHighlightLog, aes(PC1, PC2), alpha = 0.05, size = 1)
```
### The Where: putting it in geographical context
Running the following listing with a valid Google API key^[See instructions here to get a free trial Google API key https://developers.google.com/maps/documentation/javascript/get-api-key] set in the `GOOGLE_API_KEY` environment variable will match all users located in Switzerland and compute their geographic coordinates using Google's Geocoding API. We note that Switzerland has a large English-speaking expatriate technology community plus four official languages; we therefore filter for lower-case user `location` values containing the Swiss country code `ch`, the country name written in English, `switzerland`, or the country name in any of the four official Swiss languages: German `schweiz`, Italian `svizzera`, French `suisse` and Romansh `svizra`:
```{r the-where-geocoding,echo=TRUE,message=TRUE}
# do this only if the file isn't there to avoid costly Google geomapping calls
if (!file.exists(filePathForObjectName("UsersCH"))) {
# the environment variable GOOGLE_API_KEY is required or simply copy-paste your
# google key instead. To obtain a google key, follow the steps outlined here:
# https://developers.google.com/maps/documentation/javascript/get-api-key
register_google(key=Sys.getenv("GOOGLE_API_KEY"))
# get users whose location is Switzerland only
usersCh <- users %>%
filter(str_detect(tolower(location),
"(\\bch\\b|switzerland|schweiz|svizzera|suisse|svizra)")) %>%
arrange(desc(reputation))
# get the unique locations so that we avoid duplicate calls e.g. "Zurich, Switzerland"
swissLocations <- usersCh %>%
select(location) %>%
unique()
# WARNING! this code paired with a valid GOOGLE_API_KEY may cost money!
swissLocations <- mutate_geocode(swissLocations, location = location)
usersCh <- usersCh %>%
left_join(swissLocations, by="location")
# write the usersCh to disk
saveObjectByName(usersCh, "UsersCH")
}
usersCh <- readObjectByName("UsersCH")
# expected number of users located in Switzerland
stopifnot(nrow(usersCh) == 4258)
```
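As a quick sanity check of the location regex above, note that the `\b` word boundaries around `ch` match the country code without matching it as a substring; the sample locations below are made up for illustration:
```{r swiss-location-regex-check, echo=TRUE}
# made-up sample locations; "Munich" and "Chicago" contain "ch" only as a
# substring, which the \b word boundaries reject
sampleLocations <- c("Zurich, Switzerland", "Bern, CH", "Munich, Germany",
                     "Lugano, Svizzera", "Chicago, IL")
str_detect(tolower(sampleLocations),
           "(\\bch\\b|switzerland|schweiz|svizzera|suisse|svizra)")
# TRUE TRUE FALSE TRUE FALSE
```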
```{r the-where-prepare-visualization,echo=FALSE,message=FALSE}
# get the top answer skills
topAnswerTags <- answers %>%
semi_join(usersCh, by="userId") %>%
group_by(userId) %>%
summarise(score=max(score)) %>%
ungroup() %>%
inner_join(answers %>% select(userId, score, questionId), by=c("userId", "score")) %>%
inner_join(questions %>% select(questionId, tags), by="questionId") %>%
group_by(userId) %>%
summarise(questionId=first(questionId), score=first(score), tags=first(tags)) %>%
ungroup() %>%
mutate(type='answer') %>%
arrange(desc(score))
# otherwise get the top question tags
topQuestionTags <- questions %>%
semi_join(usersCh, by="userId") %>%
anti_join(topAnswerTags, by="userId") %>%
group_by(userId) %>%
summarise(score=max(score)) %>%
ungroup() %>%
inner_join(questions %>% select(userId, score, questionId, tags), by=c("userId", "score")) %>%
group_by(userId) %>%
summarise(questionId=first(questionId), score=first(score), tags=first(tags)) %>%
ungroup() %>%
mutate(type='question') %>%
arrange(desc(score))
# merge the two data sets
usersChTop <- topAnswerTags %>%
bind_rows(topQuestionTags) %>%
mutate(type=as.factor(type)) %>%
left_join(usersCh %>% select(userId, location, lon, lat), by="userId") %>%
separate_rows(tags, sep="\\|") %>%
rename(skill=tags) %>%
inner_join(mainSkills, by="skill") %>%
select(questionId, userId, score, skill, type, location, lon, lat)
# link to the principal component highlights, remove others
usersChTop <- usersChTop %>%
left_join(highlight %>% select(skill, pc, Technology), by=c("skill")) %>%
select(questionId, userId, score, skill, Technology, type, location, lon, lat) %>%
filter(!is.na(Technology) & Technology != technologies[9]) %>%
arrange(desc(score))
# do this only if the file isn't there to avoid costly Google map calls
if (!file.exists(filePathForObjectName("SwissMap"))) {
# the environment variable GOOGLE_API_KEY is required or simply copy-paste your
# google key instead. To obtain a google key, follow the steps outlined here:
# https://developers.google.com/maps/documentation/javascript/get-api-key
register_google(key=Sys.getenv("GOOGLE_API_KEY"))
# get Google map of Switzerland
center <- c(lon = 8.227512, lat = 46.818188)
map <- get_googlemap(center = center, zoom = 7,
color = "bw",
maptype = "terrain",
style = paste("feature:road|visibility:off&style=element:labels|",
"visibility:off&style=feature:administrative|visibility:on|lightness:60",
sep=""))
saveObjectByName(map, "SwissMap")
}
map <- readObjectByName("SwissMap")
```
Figure \@ref(fig:the-where-visualization) depicts the technology trends discovered in the previous analysis, now shown in geographical context for Switzerland. We note that Zurich is becoming a true technology center in Europe, as all the trends are present there. The most prominent data point by score in Switzerland was achieved by a user located in Zurich posting on Full Stack development. We can also observe that the east and south of Switzerland, e.g. the Tessin region, have much lower activity technology-wise, so it wouldn't be wise to look for technology jobs there. The data points appearing in the center of Switzerland correspond to users who did not provide a precise location: they specified "Switzerland", which geocodes to the center of the country, but technology-wise we should not expect to find anything in the middle of the mountains. Geneva was surprisingly less active in both quantity and quality, or maybe users there didn't provide their location precisely enough. The city of Bern shows two high-scoring data points connected to the Microsoft Stack and Python & C++ technology trends respectively. We can also note several users in isolated Swiss regions working on `ios`, i.e. potentially building iPhone applications in remote areas, which would make sense.
```{r the-where-visualization, fig.width=10, fig.height=10, fig.fullwidth=TRUE, echo=TRUE, message=FALSE,fig.cap="Users located in Switzerland and their matching technology trends, weighted by score",fig.pos="H"}
# plot the top technology trends in Geo-context in Switzerland
ggmap(map) +
scale_colour_manual(values = colorSpec) +
geom_point(data=usersChTop, aes(x=lon, y=lat, colour=Technology, size=score),
position = position_jitterdodge(jitter.width=0.01, jitter.height=0.01,
seed=1)) +
theme(plot.title = element_text(hjust = 0.5), legend.text=element_text(size=8),
legend.title = element_blank(), legend.position="bottom")
```
The most prominent posts in Switzerland are revealed in Table \@ref(tab:the-where-top-ten).
```{r the-where-top-ten, echo=TRUE, message=FALSE, fig.pos="H"}
prettyPrint(
usersChTop %>%
top_n(10, score) %>%
arrange(desc(score))
, caption = "Top most prominent posts from users located in Switzerland")
```
### The How: rating user skills
One of the biggest challenges in this project was, without any doubt, to come up with a sound approach to assigning skill ratings to users: first because there is no explicit link between users and skills, and second because there isn't any apparent way to quantify a rating for a given user and skill. From the data exploration we know that questions contain `tags` (i.e. skills) and the posting `userId`. We also know that answers link to the parent question via the `questionId` and to the posting user via the `userId`. Therefore, through questions and answers we can link users and skills; namely via the questions asked by a user: `user -> question -> tags`, and the answers posted by a user: `user -> answer -> question -> tags`.
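These two link chains can be sketched as follows; the chunk is illustrative only (not evaluated) and the `userSkills` name is our own:
```{r user-skill-link-sketch, echo=TRUE, eval=FALSE}
# illustrative sketch (not evaluated): candidate (userId, skill) pairs
# obtained via both link chains
userSkills <- bind_rows(
  # user -> question -> tags
  questions %>% select(userId, tags),
  # user -> answer -> question -> tags
  answers %>%
    inner_join(questions %>% select(questionId, tags), by = "questionId") %>%
    select(userId, tags)) %>%
  separate_rows(tags, sep = "\\|") %>%
  rename(skill = tags) %>%
  distinct(userId, skill)
```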
\
But what about the ratings? This is where the `badges` dataset comes into play. Badges^[For a detailed description of the badge system see https://stackoverflow.com/help/badges.] are awarded to users for different reasons, including how good an answer or question is; this "how good" is backed by a quantity, the answer or question score (i.e. the up- and downvotes), and thus we have a possible solution. The idea for filling in the ratings is to follow the ordering provided by the badge system, which is categorized into three quality class levels: `gold`, `silver` and `bronze`. We'd intuitively assume that, e.g., a user whose answer was awarded `gold` for a question related to certain skills should be rated higher in those skills than a user asking a `silver` question on those same skills. But will the order suggested by the badge system ensure significant differences among users? This is what we're about to find out in the following statistical inference analysis.
\
We'd like to test the hypothesis that the awarded user groups differ significantly with respect to classes and badges. One possible way to do this is to use the user reputation, which is an overall quantity calculated independently of specific questions and answers. We wouldn't want to feed a model with data containing "lucky" users landing very high ratings for a skill. We'd also like to validate the ordering, i.e. is the average reputation of users awarded gold answer badges significantly higher than that of users awarded gold question badges? Especially taking into account the ambiguities we've observed before, e.g. gold "Great Question" badges are harder to earn than gold "Great Answer" badges.
\
```{r badges-significance-prep, fig.height=8, fig.width=8, echo=FALSE, message=FALSE}
# Function to accumulate all badges into a final dataset. Each subsequent call
# should pass the accumulated results so that they can be excluded from
# the selection.
#
# @param aBadge the badge to filter for.
# @param acc the accumulated results (result of previous call to this function).
# @param N the top N values to pick e.g. 1500
#
accumulateBadges <- function(aBadge, acc=NULL, N=1500) {
res <- NULL
# handle this non standard case separately
if (aBadge == "Other Answers") {
res <- users %>%
anti_join(acc %>% select(userId) %>% unique(), by="userId") %>%
semi_join(answers %>% filter(0 <= score & score < 10) %>%
select(userId) %>% unique(), by="userId") %>%
mutate(class="bronze", badge=aBadge) %>%
select(userId, class, badge, reputation) %>%
bind_rows(acc)
} else {
# this is the case for the first time
if (is.null(acc)) {
res <- users %>%
inner_join(badges %>% filter(badge == aBadge) %>%
select(userId, class, badge) %>% unique(), by="userId") %>%
select(userId, class, badge, reputation)
} else {
res <- users %>%
anti_join(acc %>% select(userId) %>% unique(), by="userId") %>%
inner_join(badges %>% filter(badge == aBadge) %>%
select(userId, class, badge) %>% unique(), by="userId") %>%
select(userId, class, badge, reputation) %>%
bind_rows(acc)
}
}
# sort by factor ordering
res$class <- factor(as.character(res$class), levels=c("gold", "silver", "bronze"))
return(res)
}
# let's check whether the badge system provides a qualitatively significant
# segregation of users w.r.t. reputation using answers and questions.
# Compare the users by badge: gold vs silver vs bronze badges.
N <- 1500
badgesOrder <- c("Populist", "Great Answer", "Great Question", "Guru",
"Good Answer", "Nice Answer", "Other Answers",
"Good Question", "Nice Question")
# accumulate all badge selections into a complete set
for (i in 1:length(badgesOrder)) {
if (i == 1) {
comp <- accumulateBadges(badgesOrder[i])
} else {
comp <- accumulateBadges(badgesOrder[i], comp)
}
}
# set the seed again
portable.set.seed(1)
# log10 transform reputation and sort by factor ordering
comp <- comp %>%
filter(reputation > 999) %>%
group_by(class, badge) %>%
sample_n(N) %>%
ungroup() %>%
mutate(reputation=log10(reputation)) %>%
mutate(badge = factor(badge, levels=badgesOrder))
```
Table \@ref(tab:badges-significance-avg) depicts the average reputation of users who have been granted the different badges of interest. The result gives us a rough idea of the order we are after. Recall from Figure \@ref(fig:histogram-reputation) that the user reputations are highly skewed, so we choose the median as our measure of central tendency instead of the mean. The results in Table \@ref(tab:badges-significance-avg) reveal that users granted answer badges have on average higher reputations than those granted question badges. We also note that users granted gold badges have a higher average reputation than users granted silver badges; similarly, users granted silver badges tend to have higher reputations on average than users granted bronze badges.
```{r badges-significance-avg, echo=TRUE, message=FALSE, fig.pos="H"}
# check out the average reputation ordering by badge to get an idea, though
# this is not the exact final ordering used, due to the exclusion system
prettyPrint(
users %>%
inner_join(badges %>% select(userId, class, badge) %>% unique(), by="userId") %>%
filter(badge %in% badgesOrder) %>%
group_by(class, badge) %>%
summarise(avg_reputation=median(reputation)) %>%
arrange(desc(avg_reputation))
, latex_options = c("striped"), caption = "Average user reputation per class \\& badge")
```
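The choice of the median over the mean for skewed data can be illustrated with a small made-up reputation sample, where a single extreme value pulls the mean far away while the median stays put:
```{r median-vs-mean-demo, echo=TRUE}
# made-up reputation sample with one extreme value
repSample <- c(1000, 1200, 1500, 2000, 500000)
c(mean = mean(repSample), median = median(repSample))
# mean = 101140, median = 1500
```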
Note that since the `log` transformation preserves the order of the data, i.e. if $x > y$ then $\text{log}(x) > \text{log}(y)$, and brings it to a nicer scale to work with (e.g. for plotting), we conduct the following inference analysis on a `log10` transformation of the users' reputation.
\
Figure \@ref(fig:badges-significance-answers-plot) reveals the ordering of user reputations between the gold, silver and bronze answer badges. We see that the average within each group matches the expected level of the class, i.e. users granted gold answer badges have a higher average reputation than those granted silver and bronze answer badges. The vertical dashed lines show the median for each class:
```{r badges-significance-answers-plot, echo=TRUE, message=TRUE, fig.cap="Histograms of users reputation per Answer badges",fig.pos="H"}
# create color specification for the different badges
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
selectedBadges <- c("Great Answer", "Good Answer", "Nice Answer")