forked from hadley/mastering-shiny
-
Notifications
You must be signed in to change notification settings - Fork 0
/
action-tidy.Rmd
422 lines (315 loc) · 17.3 KB
/
action-tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
# Tidy evaluation {#action-tidy}
```{r, include = FALSE}
source("common.R")
```
If you are using the tidyverse from Shiny, you will almost certainly encounter the challenge of programming with tidy evaluation. Tidy evaluation is one the underlying components of the tidyverse that faciliate interactive data exploration, but poses some additional challenges when you want to refer to variables indirectly.
There are two forms of tidy evaluation that you're likely to come across:
* **Data-masking** allows you to refer to variables within a data frame
without repeating the name of the data frame. For example, this code can
use the `x`, `z`, `carat`, and `price` in from the `diamonds` dataset
because `filter()` and `aes()` use data-masking.
```{r, eval = FALSE}
diamonds %>% filter(x == z)
ggplot(diamonds, aes(carat, price)) + geom_hex()
```
* **Tidy-selection** allows you to concisely select variables by their
position, name, or value. The following code selects all variables from
`iris` that start with "sepal", and all numeric variables in the `diamonds`
dataset.
```{r, eval = FALSE}
iris %>% select(starts_with("sepal"))
diamonds %>% select(is.numeric)
```
In this chapter, you'll learn how to use both data-masking and tidy-selection within your Shiny app. To learn more about using tidy evaluation in a function or package, you'll need to turn to other resources like
[_Using ggplot2 in packages_](http://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html) or [_Programming with dplyr_](http://dplyr.tidyverse.org/dev/articles/programming.html). We'll pair Shiny with ggplot2 and dplyr, the two most popular packages that use tidy evaluation.
```{r setup}
library(shiny)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
```
## Data-masking {#tidy-motivation}
Data-masking allows you to use variables in the "current" data frame without any extra syntax. It's used in many dplyr functions like `arrange()`, `filter()`, `group_by()`, `mutate()`, and `summarise()`, and in ggplot2's `aes()`. Data-masking works by blurring the lines between two meanings of "variable":
* Environment variables (env-variables for short) are "programming" variables.
They are usually created with `<-`.
* Data frame variables (data-variables for short) are "statistical" variables
that live inside a data frame. They usually come from data loaded by functions
like `read.csv()`, or a created by manipulating other variables.
To make these terms more concrete, take this piece of code:
```{r}
df <- data.frame(x = runif(3), y = runif(3))
df$x
```
It creates a env-variable, `df`, that contains two data-variables, `x` and `y`. Then it extracts the data-variable `x` out of the `df` using `$`.
Data-masking is useful because it lets you use data-variables without any additional syntax. Take this `filter()` call:
```{r}
filter(diamonds, x == 0 | y == 0)
```
In most (but not all[^subset]) base R functions you need to refer to a data-variable with `$`, leading to code that repeats the name of the data frame many times:
```{r}
diamonds[diamonds$x == 0 | diamonds$y == 0, ]
```
[^subset]: `dplyr::filter()` is inspired by `base::subset()`. `subset()` uses data-masking, but not through tidy evaluation, so unfortunately the techniques discussed in this chapter don't apply to it.
You usually use these verbs purely with data-variables, but they work equally well with env-variables[^fun-scoping]:
```{r}
min_carat <- 1
diamonds %>% filter(carat > min_carat)
```
[^fun-scoping]: Note that an expression like `carat > min_carat` has to look for three things: `carat`, `min_carat`, and `>`. That's because R uses the same rules to look for functions and objects. If `filter()` didn't also look in the environment, it wouldn't be able to find any functions.
### Indirection
The blurring of data-variables and env-variables makes data analysis easier at the cost of making indirection harder. In Shiny[^embracing], indirection happens when you have the name of data-variable stored in an env-variable, and is the key in use data-masking with Shiny apps. Indirection isn't a problem with base R because you can switch from `$` to `[[`:
```{r, results = FALSE}
var <- "carat"
min <- 1
diamonds[diamonds[[var]] > min, ]
```
[^embracing]: There's another form of indirection that happens when you're write functions which is solved using `{{ x }}`, called embracing. You can learn more about that in [_Programming with dplyr_](http://dplyr.tidyverse.org/dev/articles/programming.html).
Data-masking solves the indirection problem in a similar way, by introducing an object, `.data` that you can subset using either `$` or `[[`. To get started we can rewrite our previous `filter()` call to use `.data` to make it clear we're that `carat` is a data-variable[^env]:
```{r, results = FALSE}
diamonds %>% filter(.data$carat > min)
```
[^env]: `.data` is paired with `.env` which is usually less useful, but we'll come back to it later in Section.
This form isn't particularly useful (allow it does allow you to eliminate a pesky `R CMD check` `NOTE`), but because we have some object to index into, we can switch from `$` to `[[`:
```{r, results = FALSE}
diamonds %>% filter(.data[["carat"]] > min)
```
Which in turn allows indirection:
```{r, results = FALSE}
var <- "carat"
diamonds %>% filter(.data[[var]] > min)
```
### In Shiny apps
Let's apply this to a simple Shiny app. The following example tries to let the user to select any variable from `diamonds` and find all rows where that variable is greater than zero:
```{r}
ui <- fluidPage(
selectInput("var", "Variable", choices = names(diamonds)),
tableOutput("output")
)
server <- function(input, output, session) {
data <- reactive(filter(diamonds, input$var > 0))
output$output <- renderTable(head(data()))
}
```
This code doesn't work because `input$var` isn't a data-var: it's an env-var containing the name of a data-var (stored as string). Unfortunately it also fails to work in an uninformative way because `input$var` will be a string like "carat" and:
```{r}
"carat" > 0
```
We can fix the problem, and allow indirection, by using the technique described above:
```{r}
server <- function(input, output, session) {
data <- reactive(filter(diamonds, .data[[input$var]] > 0))
output$output <- renderTable(head(data()))
}
```
### Example: ggplot2
Lets take a look at a more complicated example where we allow the user to create a plot by selecting the variables to appear on the `x` and `y` axes:
```{r}
# NB: needs ggplot >= 3.3.0 to get nice labels
ui <- fluidPage(
selectInput("x", "X variable", choices = names(iris)),
selectInput("y", "Y variable", choices = names(iris)),
plotOutput("plot")
)
server <- function(input, output, session) {
output$plot <- renderPlot({
ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
geom_point(position = ggforce::position_auto())
})
}
```
I've used the elegant `ggforce::position_auto()` to automatically spread the points out when one axis is discrete. Once you've mastered the basics of tidy evaluation you'll quickly find that the challenge becomes making your app general enough to work with many different types of variable.
Alternatively, instead of using `position_auto()`, we could allow the user to pick the geom:
```{r}
ui <- fluidPage(
selectInput("x", "X variable", choices = names(iris)),
selectInput("y", "Y variable", choices = names(iris)),
selectInput("geom", "geom", c("point", "smooth", "jitter")),
plotOutput("plot")
)
server <- function(input, output, session) {
plot_geom <- reactive({
switch(input$geom,
point = geom_point(),
smooth = geom_smooth(se = FALSE),
jitter = geom_jitter()
)
})
output$plot <- renderPlot({
ggplot(iris, aes(.data[[input$x]], .data[[input$y]])) +
plot_geom()
})
}
```
### Example: dplyr
The same technique works for dplyr. The following app extends the previous simple example to allow you to choose a variable to filter, a minimum value to select, and a variable to sort by.
```{r}
ui <- fluidPage(
selectInput("var", "Select variable", choices = names(mtcars)),
sliderInput("min", "Minimum value", 0, min = 0, max = 100),
selectInput("sort", "Sort by", choices = names(mtcars)),
tableOutput("data")
)
server <- function(input, output, session) {
observeEvent(input$var, {
rng <- range(mtcars[[input$var]])
updateSliderInput(session, "min", value = rng[[1]], min = rng[[1]], max = rng[[2]])
})
output$data <- renderTable({
mtcars %>%
filter(.data[[input$var]] > input$min) %>%
arrange(.data[[input$sort]])
})
}
```
Most other problems can be solved by combining `.data` with your existing programming skills. For example, what if you wanted to conditionally sort in either ascending or descending order?
```{r}
ui <- fluidPage(
selectInput("var", "Sort by", choices = names(mtcars)),
checkboxInput("desc", "Descending order?"),
tableOutput("data")
)
server <- function(input, output, session) {
sorted <- reactive({
if (input$desc) {
arrange(mtcars, desc(.data[[input$var]]))
} else {
arrange(mtcars, .data[[input$var]])
}
})
output$data <- renderTable(sorted())
}
```
As you provide more control, you'll find the code gets more and more complicated, and it becomes harder and harder to create a user interface that is both comprehensive _and_ user friendly. This is why I've always focussed on code tools for data analysis: creating good UIs is really really hard!
### User supplied data
There is one additional complication when you're working with user supplied data. Take the following app: it allows the user to upload a tsv file, then select a variable and filter by it. It will work for the vast majority of inputs that you might try it with:
```{r}
ui <- fluidPage(
fileInput("data", "dataset", accept = ".tsv"),
selectInput("var", "var", character()),
numericInput("min", "min", 1, min = 0, step = 1),
tableOutput("output")
)
server <- function(input, output, session) {
data <- reactive({
req(input$data)
vroom::vroom(input$data$datapath)
})
observeEvent(data(), {
updateSelectInput(session, "var", choices = names(data()))
})
observeEvent(input$var, {
val <- data()[[input$var]]
updateNumericInput(session, "min", value = min(val))
})
output$output <- renderTable({
req(input$var)
data() %>%
filter(.data[[input$var]] > input$min) %>%
arrange(.data[[input$var]]) %>%
head(10)
})
}
```
There is a subtle problem with the use of `filter()` here. Let's pull out that bit of the code so we can play around with directly outside the app:
```{r}
df <- data.frame(x = 1, y = 2)
input <- list(var = "x", min = 0)
df %>% filter(.data[[input$var]] > input$min)
```
If you experiment with this code, you'll find that it appears to work just fine for vast majority of data frames. However, there's a subtle issue: what happens if the data frame contains a variable called `input`?
```{r, error = TRUE}
df <- data.frame(x = 1, y = 2, input = 3)
df %>% filter(.data[[input$var]] > input$min)
```
We get an error message because `filter()` is attempting to evaluate `df$input$min`:
```{r, error = TRUE}
df$input$min
```
This problem is again due to the ambiguity of data-variables and env-variables. Tidy evaluation always prefers to use a data-variable if both are available. We can resolve the amibugity by using `.env`[^inception] to tell `filter()` only look for env-variables called min:
```{r}
df %>% filter(.data[[input$var]] > .env$input$min)
```
[^inception]: You might wonder if the same problem applies to variables called `.data` and `.env`. In the unlikely event of having columns with those names you'll need to refer to them with explicitly `.data$.data` and `.data$.env`.
Note that you only need to worry about this probelm when working with user supplied data; when working with your own data, you can ensure the names of the data-variables don't clash with the names of env-variables.
### Why not use base R?
At this point you might wonder if you're better off without `filter()`, and instead convert your code to use the equivalent base R code:
```{r}
df[df[[input$var]] > input$min, ]
```
That's a totally legitimate position, as long as you're aware of the work that `filter()` does for you so you can generate the equivalent base R code. In this case:
* You'll need `drop = FALSE` if `df` contains a single column (otherwise you'll
get a vector instead of a data frame).
* You'll need to use `which()` or similar to drop any missing values.
* You can't do group-wise filtering (e.g. `filter(df, n() == 1)`).
In general, if you're using dplyr for very simple cases, you might find it easier to use base R functions that don't use data-masking. However, in my opinion, one of the advantages of the tidyverse is the careful thought that has been applied to edge cases so that functions work more consistently. I don't want to oversell this, but at the same time, it's easy to forget the quirks of specific base R functions, and write code that works 95% of the time, but fails in unusual ways the other 5% of the time.
## Tidy-selection
Tidy-selection provides a concise way of selecting columns by position, name, or type. It's used in `dplyr::select()` and `dplyr::across()`, and many functions from tidyr like `pivot_longer()`, `pivot_wider()`, `separate()`, `extract()`, and `unite()` functions.
### Indirection
To refer to variables indirectly use `any_of()` or `all_of()`[one-of]: both expect an characater vector env-variable containing data-variable names. The only difference is what happens if you supply a variable named that doesn't exist in the input: `all_of()` will throw an error, while `any_of()` will silently ignore it.
[^one-of]: In older versions of tidyselect and dplyr, you'll need to use `one_of()`. It has the same semantics as `any_of()`, but a less informative name.
The following app lets the user select any number of variables using a multi-select input, along with `all_of()`:
```{r}
ui <- fluidPage(
selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
tableOutput("data")
)
server <- function(input, output, session) {
output$data <- renderTable({
req(input$vars)
mtcars %>% select(all_of(input$vars))
})
}
```
### Tidy-selection and data-masking
Working with multiple variables is trivial when you're working with a function that uses selection semantics: you can just pass a character vector of variable names in to `any_of()`/`all_of()`. Wouldn't it be nice if we could do that in data-masking functions too?
That's the idea of the `across()` function, which appeared in dplyr 1.0.0. It allows you to access tidy-selection inside of the data-masking functions. It is commonly used with either one or two arguments. The first argument selects variables, so is useful in functions like `group_by()` or `distinct()`. For example, the following app allows you to select any number of variables and count their unique values.
```{r}
ui <- fluidPage(
selectInput("vars", "Variables", names(mtcars), multiple = TRUE),
tableOutput("count")
)
server <- function(input, output, session) {
output$count <- renderTable({
req(input$vars)
mtcars %>%
group_by(across(all_of(input$vars))) %>%
summarise(n = n())
})
}
```
The second argument is applied to each select column in turn. That makes it a good fit for `mutate()` and `summarise()` where you typically want to transform each variable in some way:
```{r}
ui <- fluidPage(
selectInput("vars_g", "Group by", names(mtcars), multiple = TRUE),
selectInput("vars_s", "Summarise", names(mtcars), multiple = TRUE),
tableOutput("data")
)
server <- function(input, output, session) {
output$data <- renderTable({
mtcars %>%
group_by(across(all_of(input$vars_g))) %>%
summarise(across(all_of(input$vars_s), mean), n = n())
})
}
```
If you need your code to work with older version of dplyr, you'll need a slightly different approach. In older versions of dplyr, every data-masking function is paired with a tidy-selection variant that has the suffix `_at`. That approach yields the following code for the two server functions above:
```{r}
server <- function(input, output, session) {
output$count <- renderTable({
req(input$vars)
mtcars %>%
group_by_at(input$vars) %>%
summarise(n = n())
})
}
server <- function(input, output, session) {
output$data <- renderTable({
mtcars %>%
group_by_at(input$vars_g) %>%
summarise_at(input$vars_s, mean)
})
}
```
## `parse()`
Before we go, it's worth talking about `paste()` + `parse()` + `eval()`. If you have no idea what this combination it, you can skip this section, but if you have used it in the past I'd recommend switching to the approaches described up.
It's a tempting approach because it requires learning very few new ideas. But it has some major downsides: because you are pasting strings together, it's very easy to accidentally create invalid code, or code that can be abused to do something that you didn't want. This isn't super important if its a Shiny app that only you use, but it's a good habit to get into --- otherwise it's very easy to accidentally create a security hole in an app that you share more widely.
(You shouldn't feel bad if this is the only way you can figure out to solve a problem, but when you have a bit more mental space, I'd recommend spending some time figuring out how to do it without string manipulation. This will help you to become a better R programmer.)