<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/
Some material is adapted from the Data Carpentry "R for Social Science" lessons, to which is applied the following license:
"All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is
made available under the Creative Commons Attribution license."
The associated link specifies the CC BY 4.0 license.
https://github.com/datacarpentry/r-socialsci/blob/master/LICENSE.md -->
# Tidy data {#tidy-data}
### A longer example
First step: review the structure of the `mpg` data set:
```{r}
library(tidyverse) # loads dplyr, tidyr, and ggplot2 (the source of the mpg data)
mpg
```
Run the chunk below to create the `displ_class_by_cyl` table. The code will:
* group the cars by class and number of cylinders, and
* calculate the mean displacement (engine size) for each group
```{r}
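# summary table - mean displacement by class and number of cylinders
displ_class_by_cyl <- mpg |>
  group_by(class, cyl) |>
  summarise(displ_mean = mean(displ)) |>
  arrange(cyl, class) |>
  # widen, then lengthen again, so every class has a row for every cyl value;
  # combinations that are not in the data get NA
  pivot_wider(names_from = cyl, values_from = displ_mean) |>
  pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

displ_class_by_cyl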
```
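The `pivot_wider()` and `pivot_longer()` steps at the end of the pipeline are what introduce the `NA` values: every class ends up with a row for every cylinder count, including combinations that do not occur in the data. A minimal sketch of the same round trip, using a small made-up table (`toy`, `toy_round_trip`, and their values are hypothetical, for illustration only):

```{r}
library(dplyr)
library(tidyr)

# hypothetical data: class "b" has no 6-cylinder cars
toy <- tribble(
  ~class, ~cyl, ~displ_mean,
  "a",    4,    2.0,
  "a",    6,    3.0,
  "b",    4,    1.8
)

toy_round_trip <- toy |>
  pivot_wider(names_from = cyl, values_from = displ_mean) |>
  pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

# four rows: the missing combination (class "b", cyl "6") is now an explicit NA
toy_round_trip
```

Note that `cyl` comes back as a character column, because `pivot_longer()` builds it from column names.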
Calculate the mean of `displ_mean`:
```{r}
# example
mean(displ_class_by_cyl$displ_mean)
```
The `NA` values get in the way of the calculation: `mean()` returns `NA` if any of the values is missing. If `na.rm = TRUE` is added to the `mean()` function, R removes the `NA` values before calculating the mean.
```{r}
# solution
mean(displ_class_by_cyl$displ_mean, na.rm = TRUE)
```
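To see this behaviour in isolation, here is a small sketch with a made-up vector (`x` is hypothetical and not part of the `mpg` data):

```{r}
# hypothetical values, one of them missing
x <- c(2, 4, NA)

# a single missing value makes the result missing
mean(x)

# na.rm = TRUE drops the NA first, so this is the mean of 2 and 4
mean(x, na.rm = TRUE)
```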
An alternative solution: use `filter()` with `!is.na()` to remove the records with `NA` values:
```{r}
# example
displ_class_by_cyl |>
  summarise(displ_mean_all = mean(displ_mean))

# solution
displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  summarise(displ_mean_all = mean(displ_mean))
```
### Summarize with `group_by()` and `ungroup()`
You'll notice in the example above that when we summarize `displ_class_by_cyl` it gives the mean values by class, even though we didn't use any grouping variable.
This is because when we ran the code to create the `displ_class_by_cyl` table, we grouped by `class` and `cyl`. When the `summarise()` function is applied, it removes one level of the grouping (in this case, `cyl`), so the table stays grouped by `class`:
```{r}
# example
displ_class_by_cyl

displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  summarise(displ_mean_all = mean(displ_mean))
```
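One way to check which grouping variables are still attached to a table is dplyr's `group_vars()`. The sketch below repeats only the grouping and summarising steps on `mpg`; the names `by_class_cyl` and `by_class` are hypothetical and used here for illustration:

```{r}
library(dplyr)

# mpg ships with the ggplot2 package
by_class_cyl <- ggplot2::mpg |>
  group_by(class, cyl)

# grouped by both variables
group_vars(by_class_cyl)

# summarise() drops the last grouping variable, leaving class
# (dplyr prints a message about the remaining grouping)
by_class <- by_class_cyl |>
  summarise(displ_mean = mean(displ))

group_vars(by_class)
```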
If you want the mean of _all_ the values, you have to use `ungroup()` before `summarise()`, to "peel off" `class`.
```{r}
# solution
displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  ungroup() |>
  summarise(displ_mean_all = mean(displ_mean))
```
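An alternative, assuming dplyr 1.0.0 or later, is to drop the grouping when the summary table is first created, using the `.groups` argument of `summarise()`. A minimal sketch (the name `displ_ungrouped` is hypothetical, and the pivot steps are left out for brevity, so this table has no `NA` values):

```{r}
library(dplyr)

# .groups = "drop" makes summarise() return an ungrouped table
displ_ungrouped <- ggplot2::mpg |>
  group_by(class, cyl) |>
  summarise(displ_mean = mean(displ), .groups = "drop")

# no grouping variables remain
group_vars(displ_ungrouped)

# so a later summarise() returns a single overall value
displ_ungrouped |>
  summarise(displ_mean_all = mean(displ_mean))
```

Whether to drop the grouping at creation or call `ungroup()` later depends on whether any later step still needs the table grouped by `class`.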