_211_tidy_data_bonus.Rmd


<!-- This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/  

Some material is adapted from the Data Carpentry "R for Social Science" lessons, to which is applied the following license: 
"All Software Carpentry, Data Carpentry, and Library Carpentry instructional material is
made available under the Creative Commons Attribution license." 
The associated link specifies the CC BY 4.0 license.
https://github.com/datacarpentry/r-socialsci/blob/master/LICENSE.md -->


# Tidy data {#tidy-data}


### A longer example

First step: review the structure of the `mpg` data set:

```{r}
mpg
```


Run the chunk below to create the `displ_class_by_cyl` table:

* group the cars by class and cylinder size, and 

* show the mean displacement (engine size)

```{r}
# summary table - class by cylinder
displ_class_by_cyl <- mpg |>
  group_by(class, cyl) |>
  summarise(displ_mean = mean(displ)) |>
  arrange(cyl, class) |>
  pivot_wider(names_from = cyl, values_from = displ_mean) |>
  pivot_longer(-class, names_to = "cyl", values_to = "displ_mean")

displ_class_by_cyl

```

Calculate the mean of `displ_mean`:

```{r}
# example
mean(displ_class_by_cyl$displ_mean)

```

The "NA" values get in the way of the calculation. If `na.rm = TRUE` is added to the `mean()` function, R will calculate the value for us by removing the "NA" values.

```{r}

# solution
mean(displ_class_by_cyl$displ_mean, na.rm = TRUE)

```


An alternative solution: use a filter with `!na` to remove the records with `NA` values:

```{r}
# example
displ_class_by_cyl |>
  summarise(displ_mean_all = mean(displ_mean))

# solution
displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  summarise(displ_mean_all = mean(displ_mean))

```


### Summarize with `group()` and `ungroup()`


You'll notice in the example above that when we summarize `displ_class_by_cyl` it gives the mean values by class, even though we didn't use any grouping variable.

This is because when we ran the code to create the `displ_class_by_cyl` table, we grouped by `class` and `cyl`. Running the `summarise()` function is applied, it removes one level of the grouping (in this case, `cyl`): 

```{r}

# example
displ_class_by_cyl

displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  summarise(displ_mean_all = mean(displ_mean))


```


If you want the mean of _all_ the values, you have to use `ungroup()` before `summarise()`, to "peel off" `class`.


```{r}

# solution
displ_class_by_cyl |>
  filter(!is.na(displ_mean)) |>
  ungroup() |>
  summarise(displ_mean_all = mean(displ_mean))


```