-
Notifications
You must be signed in to change notification settings - Fork 1
/
analysis.Rmd
339 lines (273 loc) · 11.4 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
---
title: "GitHub Analysis"
author: "Zhian N. Kamvar"
date: "3/30/2020"
output:
md_document:
variant: gfm
params:
build_md: TRUE
open_issue: FALSE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Setup
The source for all the repositories is located at <https://carpentries.github.io/curriculum-feed/carpentries_lessons.json>. To
avoid having to pull from it multiple times I'm going to set up a folder:
```{r packages, message = FALSE}
library("fs") # Filesystem navigation
library("jsonlite") # parsing JSON files
library("purrr") # handling lists (JSON files)
library("dplyr") # handling data frames and magic
library("tidyr") # separating values
library("ggplot2") # visualization
library("forcats") # ordering factors
library("magrittr") # for the %T>% pipe I love so well
library("gh") # accessing GitHub's API
library("polite") # being respectful when downloading files
library("here") # so I can always remember where I started
library("git2r") # downloading github repositories
library("waldo") # comparing objects
# Github packages
# remotes::install_github("carpentries/pegboard")
# remotes::install_github("fmichonneau/carpenter")
library("pegboard") # parsing and analysis of carpentries episodes
library("carpenter") # create an issue
```
```{r data}
raw_data <- here::here("data", "raw")
raw_lessons <- fs::path(raw_data, "carpentries_lessons.json")
if (fs::file_exists(raw_lessons)) {
} else {
fs::dir_create(raw_data)
res <- download.file(
url = "https://carpentries.github.io/curriculum-feed/carpentries_lessons.json",
destfile = raw_lessons
)
if (res == 0L) {
fs::file_chmod(raw_lessons, "a-wx")
}
}
lessons <- read_json(raw_lessons, simplifyVector = FALSE) %>%
purrr::map_dfr(.f = dplyr::as_tibble) %>%
dplyr::mutate_if(is.character,
~dplyr::if_else(. == "NA", as.character(NA), .)) %>%
dplyr::mutate(life_cycle = forcats::fct_relevel(life_cycle,
"pre-alpha", "alpha", "beta", "stable", "on-hold", "deprecated"
)) %>%
dplyr::mutate(life_cycle = forcats::fct_explicit_na(life_cycle)) %>%
print()
```
# First look at available lessons
I'm going to take a look at all of the software carpentry lessons first. There
are currently `r nrow(lessons)` available lessons, but they are in different
stages of completion:
```{r lesson_vis}
ggplot(lessons, aes(y = life_cycle, fill = curriculum)) +
geom_bar(orientation = "y")
```
Let's inspect the ones that are stable and then look at the ones that are lower
down, first we should filter for those that are stable and actually have GitHub
URLs:
```{r stable}
stable <- lessons %>% filter(
life_cycle == "stable",
curriculum != "instructor-training",
!is.na(curriculum), # these are template repositories
!is.na(URL) # no GitHub URL is not very useful for me
)
stable
```
I end up with `r nrow(stable)` repositories to play with:
```{r curriculum_language}
ggplot(stable, aes(y = curriculum, fill = language)) +
geom_bar(orientation = "y")
```
# Inspection of repository features
I will use the {gh} package to inspect the features of each repository:
- tags
- directory structure
- dependencies
I can use the GitHub API to get the contents of the repositories:
<https://developer.github.com/v3/repos/contents/#get-contents>. The {gh} package
allows me to write the responses to disk so that I don't have to query every
time I want to re-run the analysis. If I wanted to do this with fresh data, all
I would need to do is to clear the data folder.
One of the tripping points here is that not all of the repositories will have
`_episodes_rmd/` directories, so I will need to walk over these with
`purrr::possibly()`, a nice little failsafe function.
```{r get_rmd_episodes}
safegh <- purrr::possibly(gh::gh, otherwise = list(NA))
rmd_episodes <- function(owner, repo) {
safegh <- purrr::slowly(
f = purrr::possibly(gh::gh, otherwise = list(NA)),
rate = rate_delay(pause = 2)
)
OR <- glue::glue("{owner}--{repo}")
safegh("/repos/:owner/:repo/contents/_episodes_rmd",
owner = owner,
repo = repo,
.destfile = here::here(fs::path("data", "rmd_JSON", OR, ext = "json"))
)
}
vorhees <- purrr::possibly(jsonlite::read_json, otherwise = list())
lesson_has_rmd <- . %>%
tidyr::separate(URL, into = c(NA, NA, NA, "user", NA), sep = "/", remove = FALSE) %>%
mutate(user = purrr::walk2(.x = user, .y = repository, .f = rmd_episodes)) %>%
mutate(JSON = purrr::map2(.x = user, .y = repository,
.f = ~vorhees(
here::here("data", "rmd_JSON", glue::glue("{.x}--{.y}.json"))
)
)) %>%
filter(repository == "R-ecology-lesson" | lengths(JSON) > 2)
has_rmd <- stable %>% lesson_has_rmd
has_rmd_all <- lessons %>%
lesson_has_rmd %>%
filter(repository != "lesson-example") %>%
arrange(life_cycle)
```
After parsing, we find that we have RMarkdown files for a grand total of
`r nrow(has_rmd_all)` lessons and `r nrow(has_rmd)` stable lessons. From here,
we can grab these lessons to see if we can build them with our docker container.
## Lessons that use R
Below is a table with all the lessons the use R in our curriculum. There are
links within the table that take you further down this document to display a
diff between output from R 3.6 and R 4.0. One issue with these diffs is that
sometimes the ordering of messages shifts the entire diff, but for the most
part, it's clear where they came from.
```{r, ouput = 'asis'}
has_rmd_all %>%
mutate(results = glue::glue("[results](#lesson-{repository})")) %>%
select(URL, curriculum, life_cycle, language, results) %>%
knitr::kable()
```
The command to use is:
```bash
docker run --rm -it -e USER=$(id -u) -e GROUP=$(id -g) -v ${PWD}:/home/rstudio rocker/verse:4.0.0 make -C /home/rstudio lesson-md
```
Let's download these repos:
```{r rmd_episodes_repos}
rmdpath <- path("data", "rmd-repos")
if (!dir_exists(rmdpath)) {
dir_create(path = rmdpath)
}
g <- function(u, r, path = rmdpath) {
URL <- glue::glue("https://github.com/{u}/{r}.git")
path <- glue::glue("{path}/{u}--{r}R{c(3, 4)}")
if (fs::dir_exists(path[[1]])) {
git2r::pull(repo = path[[1]])
git2r::pull(repo = path[[2]])
} else {
git2r::clone(URL, local_path = path[[1]])
git2r::clone(URL, local_path = path[[2]])
}
}
purrr::walk2(has_rmd_all$user, has_rmd_all$repository, g, rmdpath)
standard_rmd <- has_rmd_all %>% filter(repository != "R-ecology-lesson")
```
Now all the repos have been downloaded, we can render the episodes under
different versions of R and see how they go. Note that I have to use the
geospatial R container in order to get things to work properly.
```{r dokken, results = "hide"}
get_path <- function(u, r) {
fs::path_abs(fs::path("data", "rmd-repos", glue::glue("{u}--{r}")))
}
run_docker <- function(the_path, R_VERSION, cmd = '-C /home/rstudio lesson-md') {
make____it <- glue::glue("install2.r checkpoint && make -B -j 4 {cmd}")
docker_run <- "docker run --rm -it -e USER=$(id -u) -e GROUP=$(id -g) -v {the_path}:/home/rstudio"
contai_ner <- "rocker/geospatial:{R_VERSION} /bin/bash -c '{make____it}'"
system(glue::glue("{glue::glue(docker_run)} {glue::glue(contai_ner)}"))
}
dockin <- function(u, r) {
the_path <- get_path(u, r)
# Run with R 3.6.3 ---------------------------
r3_path <- glue::glue("{the_path}R3")
if (params$build_md) run_docker(r3_path, "3.6.3")
r3 <- try(Lesson$new(r3_path, rmd = FALSE))
# Run with R 4.0.0 ---------------------------
r4_path <- glue::glue("{the_path}R4")
if (params$build_md) run_docker(r4_path, "4.0.0")
r4 <- try(Lesson$new(r4_path, rmd = FALSE))
list(r3, r4)
}
res <- purrr::map2(standard_rmd$user, standard_rmd$repository, dockin)
names(res) <- standard_rmd$repository
```
We can iterate and compare:
```{r compare, results = 'asis'}
if (!dir_exists(path("data", "diffs"))) {
dir_create(path("data", "diffs"))
}
cmpr <- function(lesson, name) {
cat(glue::glue("\n## Lesson: {name}\n\n"))
if (!inherits(lesson[[1]], "Lesson")) {
cat("\n\n----ERRORED----\n\n")
return(invisible(NULL))
}
o1 <- map(lesson[[1]]$episodes, ~setNames(xml2::xml_text(.x$output), xml2::xml_attr(.x$output, "sourcepos")))
o2 <- map(lesson[[2]]$episodes, ~setNames(xml2::xml_text(.x$output), xml2::xml_attr(.x$output, "sourcepos")))
for (i in seq_along(lesson[[1]]$episodes)) {
episode <- glue::glue("{name}--{basename(lesson[[1]]$files[i])}")
dpath <- path("data", "diffs", sub("md$", "diff", episode))
diffcmd <- "git diff --no-index -- {lesson[[1]]$files[[i]]} {lesson[[2]]$files[[i]]} > {dpath}"
system(glue::glue(diffcmd))
cat(glue::glue("\n#### Episode: {episode}\n"))
cat(glue::glue("\n\n[Link to full diff]({dpath})\n"))
cat("\n```diff\n")
print(waldo::compare(o2[[i]], o1[[i]], x_arg = "Rv4", y_arg = "Rv3"))
cat("\n```\n")
}
}
purrr::walk2(res, names(res), cmpr)
```
## Lesson: R-ecology-lesson
> Note, this had a different build sequence, so I had to do this manually
> and could not provide output diffs.
```{r rel, results = 'asis', message = FALSE}
REL <- has_rmd_all %>% filter(repository == "R-ecology-lesson")
cmd <- "-C /home/rstudio pages"
ptl <- get_path(REL$user, REL$repository)
R3 <- glue::glue("{ptl}R3")
R4 <- glue::glue("{ptl}R4")
if (params$build_md) {
run_docker(R3, "3.6.3", cmd)
run_docker(R4, "4.0.0", cmd)
}
R3res <- path(R3, "_site", gsub(".Rmd$", ".html", path_file(dir_ls(R3, glob = "*Rmd"))))
R4res <- path(R4, "_site", gsub(".Rmd$", ".html", path_file(dir_ls(R4, glob = "*Rmd"))))
episodes <- glue::glue("R-ecology-lesson--{path_file(R3res)}")
dpath <- path("data", "diffs", sub("html$", "diff", episodes))
pando <- function(i) glue::glue("$(pandoc -t markdown -f html -o {sub('html', 'md', i)} {i} && echo {sub('html', 'md', i)})")
diffcmd <- "git diff --no-index -- {pando(R3res)} {pando(R4res)} > {dpath}"
walk(glue::glue(diffcmd), system)
walk2(episodes, dpath, ~cat(glue::glue("\n#### Episode: {.x}\n\n\n[Link to full diff]({.y})\n\n")))
```
-------------
2020-07-27: I have created a blog post about this
(https://github.com/carpentries/carpentries.org/pull/830)
and will now reach out to the maintainers of these lessons via github issue
using the `{gh}` package. Note, this will only work if you have a github PAT
associated with your profile that has permission to open issues.
```{r ghissues}
title <- "RFC: Migration to R 4.0"
body <- "
During the June Maintainer meeting, we asked for comments and experiences with the migration to R 4.0 so that we could create guidance for maintainers and instructors. We have drafted a short blog post (https://github.com/carpentries/carpentries.org/pull/830) to be released next week (2020-08-03) that describes our recommendations for migration. You can find a [preview of the blog post here](https://deploy-preview-830--stupefied-einstein-5bde90.netlify.app/blog/2020/08/r-4-migration/). **Please look over the blog posts and make comments by 2020-07-30 so that we can incoroporate any changes before the post goes live.**
To help identify the differences between R 3.6 and R 4.0, I have run this lesson in both versions and [posted the results](https://github.com/zkamvar/postmaul/blob/master/analysis.md#{LESSON}) that show the differences in output chunks and entire markdown files.
"
if (params$open_issue) {
walk2(has_rmd_all$user, has_rmd_all$repository,
~create_issue(
owner = .x,
repo = .y,
title = title,
body = glue::glue(body, LESSON = .y),
labels = NULL
)
)
}
```
# Session Information
```{r sessioninfo}
sessioninfo::session_info()
```