---
output: github_document
editor_options:
  chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/",
  out.width = "70%",
  fig.align = "center"
)
```
# <img src="man/figures/logo.png" align="right" height="138" /> **{lay}**
<!-- badges: start -->
[CRAN status](https://CRAN.R-project.org/package=lay)
[R-CMD-check](https://github.com/courtiol/lay/actions/workflows/R-CMD-check.yaml)
[test-coverage](https://github.com/courtiol/lay/actions/workflows/test-coverage.yaml)
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->
## An R package for simple but efficient rowwise jobs
The function `lay()`, the only function in the package **{lay}**, applies a function to each row of a data frame or tibble, independently, across multiple columns containing values of the same class (e.g. all numeric).
Implementing rowwise operations for tabular data is notoriously awkward in R.
Many options have been proposed, but they tend to be complicated, inefficient, or both.
Instead, `lay()` aims for a sweet spot between simplicity and efficiency.
The function has been specifically designed to be combined with functions from [**{dplyr}**](https://dplyr.tidyverse.org/) and to feel as if it were part of it (but you can use `lay()` without [**{dplyr}**](https://dplyr.tidyverse.org/)).
There is hardly any code behind `lay()` (it can be coded in 3 lines), so this package may just be an interim solution before an established package fulfills the need... Time will tell.
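To give a rough idea of why so little code is needed, here is a minimal base R sketch of a rowwise helper. The name `rowwise_sketch()` and the toy data are made up for illustration, and this is not the actual source of `lay()`; the sketch only handles functions that return a scalar.

``` r
## Hypothetical helper (NOT the code of lay()): apply `fn` to each row of `data`
rowwise_sketch <- function(data, fn, ...) {
  ## visit rows one at a time; unlist() turns a one-row data frame into a vector
  out <- lapply(seq_len(nrow(data)), function(i) fn(unlist(data[i, , drop = FALSE]), ...))
  ## combine the scalar results into a single vector
  unlist(out)
}

## toy data, made up for this example
toy <- data.frame(a = c(TRUE, FALSE, FALSE), b = c(FALSE, FALSE, TRUE))
rowwise_sketch(toy, any) ## one logical value per row
```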
### Installation
You can install the current CRAN version of **{lay}** with:
``` r
install.packages("lay")
```
Alternatively, you can install the development version of **{lay}** using [**{remotes}**](https://remotes.r-lib.org/):
``` r
remotes::install_github("courtiol/lay") ## requires {remotes} to be installed
```
### Motivation
Consider the following dataset, which contains information about the use of pain relievers for non-medical purposes.
```{r motivation}
library(lay) ## requires {lay} to be installed
drugs
```
The dataset is [tidy](https://vita.had.co.nz/papers/tidy-data.pdf): each row represents one individual and each variable forms a column.
Now imagine that you would like to know whether each individual has ever used any of these pain relievers.
How would you proceed?
### Our solution: `lay()`
This is how you would achieve this goal using `lay()`:
```{r lay}
library(dplyr, warn.conflicts = FALSE) ## requires {dplyr} to be installed
drugs_full |>
  mutate(everused = lay(pick(-caseid), any))
```
We used `mutate()` from [**{dplyr}**](https://dplyr.tidyverse.org/) to create a new column called *everused*, and we used `pick()` from the same package to exclude the column *caseid* when laying down each row of the data and applying the function `any()`.
When combining `lay()` and [**{dplyr}**](https://dplyr.tidyverse.org/), you should always use `pick()` or `across()`. The functions `pick()` and `across()` let you pick among many [selection helpers](https://tidyselect.r-lib.org/reference/language.html) from the package [**{tidyselect}**](https://tidyselect.r-lib.org/), which makes it easy to specify which columns to consider.
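As a small illustration of column selection, the sketch below uses the selection helper `starts_with()` inside `pick()`. The toy tibble and its column names are invented for this example; the last line shows the same rowwise job without any [**{dplyr}**](https://dplyr.tidyverse.org/) verb.

``` r
library(lay)                           ## requires {lay} to be installed
library(dplyr, warn.conflicts = FALSE) ## requires {dplyr} to be installed

## toy data, made up for this example
toy <- tibble(id = 1:3, x1 = c(1, 0, 2), x2 = c(0, 0, 5), note = c("a", "b", "c"))

## sum, row by row, over the columns whose names start with "x"
toy_total <- toy |>
  mutate(total = lay(pick(starts_with("x")), sum))
toy_total

## same job without {dplyr} verbs: select the columns yourself
lay(toy[, c("x1", "x2")], sum)
```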
Our function `lay()` is quite flexible! For example, you can pass additional arguments to the function you wish to apply rowwise (here `any()`):
```{r NA}
drugs_with_NA <- drugs ## create a copy of the dataset
drugs_with_NA[1, 2] <- NA ## introduce a missing value
drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), any)) |> ## without additional argument
  slice(1) ## keep first row only
drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), any, na.rm = TRUE)) |> ## with additional argument
  slice(1)
```
Since one of the backbones of `lay()` is [**{rlang}**](https://rlang.r-lib.org), you can use the so-called [*lambda* syntax](https://rlang.r-lib.org/reference/as_function.html) to define anonymous functions on the fly:
```{r lambda}
drugs_with_NA |>
  mutate(everused = lay(pick(-caseid), ~ any(.x, na.rm = TRUE))) ## same as above, different syntax
```
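To see what the lambda syntax does on its own, the sketch below converts a formula into a regular function with `rlang::as_function()`, outside of `lay()`. The name `any_non_missing` is made up for this example.

``` r
## requires {rlang} to be installed
## turn the formula into a function whose first argument is available as `.x`
any_non_missing <- rlang::as_function(~ any(.x, na.rm = TRUE))

any_non_missing(c(NA, TRUE))  ## TRUE: the missing value is ignored
any_non_missing(c(NA, FALSE)) ## FALSE: nothing is TRUE once NA is dropped
```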
We can also apply many functions at once, as exemplified with another dataset:
```{r worldbank}
data("world_bank_pop", package = "tidyr") ## requires {tidyr} to be installed
world_bank_pop |>
  filter(indicator == "SP.POP.TOTL") |>
  mutate(lay(pick(matches("\\d")),
             ~ tibble(min = min(.x), mean = mean(.x), max = max(.x))), .after = indicator)
```
Since the other backbone of `lay()` is [**{vctrs}**](https://vctrs.r-lib.org), the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (*min*, *mean* and *max*) are directly created.
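The difference between the spliced and the named case can be sketched on a toy tibble (the data and the column name `stats` are invented for this example):

``` r
library(lay)                           ## requires {lay} to be installed
library(dplyr, warn.conflicts = FALSE) ## requires {dplyr} to be installed

## toy data, made up for this example
toy <- tibble(id = 1:2, y1 = c(1, 4), y2 = c(3, 8))

## unnamed call: the columns `min` and `max` are spliced into the result
spliced <- toy |>
  mutate(lay(pick(-id), ~ tibble(min = min(.x), max = max(.x))))
names(spliced)

## named call: a single column `stats` holds a tibble with `min` and `max`
packed <- toy |>
  mutate(stats = lay(pick(-id), ~ tibble(min = min(.x), max = max(.x))))
names(packed)
```

In the named case, the individual values remain accessible, e.g. via `packed$stats$min`.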
**Important:** when using `lay()` the function you want to use for the rowwise job must output a scalar (vector of length 1), or a tibble or data frame with a single row.
We can apply a function that returns a vector of length > 1 by turning such a vector into a tibble using `as_tibble_row()` from [**{tibble}**](https://tibble.tidyverse.org/):
```{r worldbank2}
world_bank_pop |>
  filter(indicator == "SP.POP.TOTL") |>
  mutate(lay(pick(matches("\\d")),
             ~ tibble::as_tibble_row(quantile(.x, na.rm = TRUE))), .after = indicator) ## requires {tibble} to be installed
```
### History
<img src="https://github.com/courtiol/lay/raw/main/.github/pics/lay_history.png" alt="lay_history" align="right" width="400">
The first draft of this package was created by **@romainfrancois** as a reply to a tweet I (Alexandre Courtiol) posted under **@rdataberlin** in February 2020.
At the time, I was exploring different ways to perform rowwise jobs in R, and I was experimenting with various ideas on how to exploit the fact that the newly introduced function `across()` from [**{dplyr}**](https://dplyr.tidyverse.org/) creates tibbles on which one can easily apply a function.
Romain came up with `lay()` as a better solution, making good use of [**{rlang}**](https://rlang.r-lib.org/) & [**{vctrs}**](https://vctrs.r-lib.org/).
The verb `lay()` was never integrated into [**{dplyr}**](https://dplyr.tidyverse.org/), but, so far, I still find `lay()` superior to most alternatives, which is why I decided to document and maintain this package.