610_data_wrangling_2.Rmd

<!-- 
This file by Martin Monkman 
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/  
-->


# Data wrangling—continued {#wrangling2}

## Setup

This chunk of R code loads the packages that we will be using.

```{r setup_610, eval=FALSE}

library(tidyverse)
library(gapminder)
```


## Introduction  {#wrangling2-intro}

This chapter revisits the process of data wrangling with some additional techniques.


### `ungroup()` 


When we use a `group_by()` and `summarize()` pairing, we can calculate summary statistics for each group in our data.

In this example, we calculate the difference in a country's life expectancy from the continent's mean life expectancy
 - for example, the difference in 2007 between life expectancy in Canada and the mean life expectancy of countries in the Americas

What we had seen before was the use of `group_by() |> summarize()` to create summary statistics by a group:

```{r}
gapminder |>
  filter(year == 2007) |> 
  group_by(continent) |>
  summarize(lifeExp_mean = mean(lifeExp))
```

But how can we compare the Canada value to the Americas mean shown here? One strategy would be to join this table to the original, using "continent" as the key value.

But there is another solution: `group_by() |> mutate()`

Step 1:

* Filter for 2007

* Then group by continent and 

* _Mutate_ to get the mean continental life expectancy

**Note:** the tibble that results from this retains the "Groups:". It is important to be aware of this, since any future manipulations of this table will be based on that grouping.  

```{r}
gm_life_2007 <- 
gapminder |>
  # select and filter
  select(country, continent, year, lifeExp) |> 
  filter(year == 2007) |> 
  # group_by |> mutate
  group_by(continent) |>
  mutate(lifeExp_con_mean = mean(lifeExp)) 

gm_life_2007
```

To that data frame, we can append another mutate to calculate the difference between the individual country and the continent.

Note that addition of the the `ungroup()` function, which removes the grouping.

In the resulting table, we see the observations for 2007, with all the countries in the Americas. Note that they all have the same value of "lifeExp_con_mean", which is subtracted from "lifeExp" to get the "lifeExp_diff" variable. Countries which have a life expectancy above the continental mean show positive values, while those where the life expectancy is below, have negative values.

```{r}
gm_life_2007 |>
  ungroup() |> 
  # subtract continent mean from individual country
  mutate(lifeExp_diff = lifeExp - lifeExp_con_mean) |> 
  # select and filter for just the Americas
  filter(continent == "Americas")

```

In the version below, it's a single pipe that uses both year and continent as the grouping variables. This gets the same result with the added comparison of the individual country to the world average for that year, for every year in the data:

```{r}
gm_life <- gapminder |>
  # select
  select(country, continent, year, lifeExp) |> 
  # continent life expectancy
  group_by(year, continent) |>
  mutate(lifeExp_con_mean = mean(lifeExp)) |>
  ungroup() |> 
  mutate(lifeExp_con_diff = lifeExp - lifeExp_con_mean) |> 
  # world life expectancy
  group_by(year) |>
  mutate(lifeExp_earth_mean = mean(lifeExp)) |>
  ungroup() |> 
  mutate(lifeExp_earth_diff = lifeExp - lifeExp_earth_mean)

gm_life |> 
  filter(year == 2007 & continent == "Americas")

```

Now we can create a plot to compare Canada to the averages of the continent and the world over time:

```{r}
gm_life |> 
  filter(country == "Canada") |> 
ggplot() +
  geom_line(aes(x = year, y = lifeExp_con_diff), colour = "red") +
  geom_line(aes(x = year, y = lifeExp_earth_diff), colour = "blue")

```

This isn't a story about Canada's life expectancy getting worse—life expectancy in Canada improved from 68.8 in 1952 to 80.7 in 2007.

What the chart shows are the substantial improvements in the life expectancy for people elsewhere around the world (the blue line) and in the Americas (the red line). While life expectancy in Canada improved and remains well above average, for many countries of the world, life expectancy is much, much longer than it was 70 years ago.


#### Moneyball data

A similar approach was used to create the "pay_index" variable in the Moneyball assignment.

First, we read the data in the file:

```{r}
mlb_pay_wl <- read_csv("data/mlb_pay_wl_original.csv")
mlb_pay_wl
```

You will see that the file contains the team payroll, wins and losses, but not the two calculated variables that we used in the linear regression model.

To replicate the file we used in the assignment, we need to add 

1. a win-loss percentage


```{r}
# win-loss percentage is the number of wins, divided by the total number of games played
mlb_pay_wl <- mlb_pay_wl |> 
  mutate(wl_pct = round(w / (w + l), 3))

mlb_pay_wl
```

2. the pay index, comparing the individual team payroll to the average of all the teams that season, where 100 is the league average.

Note that this requires a group/ungroup within the pipe.

```{r}
mlb_pay_wl <- mlb_pay_wl |> 
  group_by(year_num) |> 
  mutate(league_avg_pay = mean(est_payroll)) |> 
  ungroup() |> 
  mutate(pay_pct_league = est_payroll / league_avg_pay * 100)

mlb_pay_wl

```

Let's create a histogram to show the distribution of team payroll, relative to the league average:

```{r}

mlb_pay_wl |> 
  ggplot(aes(x = pay_pct_league)) +
    geom_histogram()

```


### `across()` 

The `across()` function: 

>makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in "data-masking" functions like summarise() and mutate(). See 

These examples are variations of the ones from the [{dplyr} reference "Apply a function (or functions) across multiple columns"](https://dplyr.tidyverse.org/reference/across.html). That resource, and the related vignette  ["colwise"](https://dplyr.tidyverse.org/articles/colwise.html) have more examples and details.

For our examples, we will use the `penguins` data table from the [{palmerpenguins} package](https://allisonhorst.github.io/palmerpenguins/).


```{r}
library(palmerpenguins)

penguins
```

In this example, the rounding function is applied to the two bill measurements: 

```{r}
penguins |>
  mutate(across(c(bill_length_mm, bill_depth_mm), round))
```

Since those measures are of type "double" and the other two numeric measures are integer, we could identify the variables we want to round using the type, via the `is.double` argument:

```{r}
penguins |>
  mutate(across(where(is.double), round))
```

It's also possible to use the `across()` function to apply the same function(s) to multiple variables. Here we will use the `group-by() |> summarize()` functions to calculate the mean of the bill measurements. 

* Note also the `starts_with()` function that is used to identify the columns of interest.

* The `mean()` function has a tilde in front. In R, the "~" is used to indicate a function, so 

`~mean(.x, na.rm = TRUE)`

is a shortcut for

`function(.x){mean(x, na.rm = TRUE)}` where ".x" is a placeholder for every variable we have defined in our `across()` function.

(It might help to think of this like a looping function—for every variable, the `summarise()` loops through the `mean()` function.)

```{r}

penguins |> 
  group_by(species) |> 
  summarise(across(starts_with("bill_"), 
                   ~mean(.x, na.rm = TRUE)))

```


In this example, the `list` argument defines a list of functions to be applied. By putting the name of the function ahead of the function (e.g. `mean = mean`), that term gets appended to the name of the variables created by the `summarise()` function. (You may want to see what happens when you change this to `mean, sd` or `cat = mean, dog = sd`!)

```{r}

penguins |> 
  group_by(species) |> 
  summarise(across(starts_with("bill_"), 
                   list(mean = mean, sd = sd), 
                   na.rm = TRUE))

```


### Calculations across rows 

Another common data manipulation outcome we seek is the average _across_ our variables...so far everything we've done has been _down_ the variables. R is really good at the latter. Here's some techniques to apply a function across a row.

(This is drawn from the {dplyr} vignette ["Row-wise operations"](https://dplyr.tidyverse.org/articles/rowwise.html))

First we create a little tibble:

```{r}
df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df
```
Then we apply the `rowwise()` function:

```{r}
df |> rowwise()
```

It looks just the same, but you will note the "Rowwise:" indicator above the table. This means that any functions that are applied will run across the rows instead of the usual columnwise direction.


```{r}
df |> 
  rowwise() |> 
  mutate(m = mean(c(x, y, z)))
```

Without the `rowwise()` function we get 3.5 for both rows. This the mean of "x" and "y" and "z" (that is, the mean of the integers 1 through 6):

```{r}
df |> 
  mutate(m = mean(c(x, y, z)))
```


In the examples below, we will use a bigger tibble:

```{r}

df2 <- tibble(id = letters[1:6], w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df2
```

To calculate the sum of the columns, we could `mutate()` to get a new column, or `summarize()` for just the total.

By putting the "id" in the `rowwise()` function, it acts as a grouping variable _across_ the rows.

```{r}
# mutate for full table
df2 |> 
  rowwise(id) |> 
  mutate(total = sum(c(w, x, y, z))) 
```

```{r}
# summarize for the total only
df2 |> 
  rowwise(id) |> 
  summarize(total = sum(c(w, x, y, z))) 

```

To streamline the specification of the variables we use, the `c_across()` function can be applied, indicating the range of the variables we want to sum and average:

```{r}

df2 |> 
  rowwise(id) |> 
  summarize(total = sum(c_across(w:z)),
            average = mean(c_across(w:z))) 


```

There are two other functions that streamline this syntax still further: `rowSums` and `rowMeans`.

**Note:** this function has the `rowwise()` built in!

```{r}
df2 |> 
  mutate(total = rowSums(across(where(is.numeric))))
```


### Comparing values between rows

It is common in time series analysis to compare values that are in different rows. For example, in a data table with quarterly revenue data, you might want to report on the change from one quarter to the next. Or with monthly data where there is a strong seasonal trend comparing one month to the next might not be all that useful (think of the difference in retail sales in January compared to December), so a comparison to the same month of the previous year would reveal more insight. The [New Housing Price Index example](#walkthrough) we saw earlier incorporates both month-over-month and year-over-year comparisions.

The {dplyr} package has two functions, `lag()` and `lead()`, that allow us access to the value in a different row.


#### Exercies: 

Using the {gapminder} data, find the country with biggest jump in life expectancy (between any two consecutive records) for each continent.

```{r}
# One of many solutions
gapminder |>
  group_by(country) |>
  mutate(prev_lifeExp = lag(lifeExp),
         jump = lifeExp - prev_lifeExp) |>
  arrange(desc(jump), continent)

# Another solution
gapminder |>
  group_by(country) |>
  mutate(prev_lifeExp = lag(lifeExp),
         jump = lifeExp - prev_lifeExp) |>
  ungroup() |> 
  group_by(continent) |> 
  slice(which.max(jump)) |> 
  arrange(desc(jump))

# Another solution
gapminder |>
  group_by(country) |>
  mutate(prev_lifeExp = lag(lifeExp),
         jump = lifeExp - prev_lifeExp) |>
  ungroup() |> 
  group_by(continent) |> 
  slice_max(jump) |> 
  arrange(desc(jump))

```


## Recode variables with `case_when()`

We sometimes find ourselves in a situation where we want to recode our existing variables into other categories. For example, sometimes you have a handful of categories that make up the bulk of the total, and summarizing the smaller categories into an "all other" makes the table or plot easier to read. Think of the populations of the 13 Canadian provinces and territories: the four most populous provinces (Ontario, Quebec, British Columbia, and Alberta) account for roughly 85% of the country's population...the other 15% are spread across nine provinces and territories. We might want to show a table with only five rows, with the smallest nine provinces and territories grouped into a single row.

We can do this with a function in {dplyr}, `case_when()`.

```{r}
canpop <- read_csv("data/canpop.csv")
canpop
```

In this first solution, we name the largest provinces separately:

```{r}

canpop |> 
  mutate(pt_grp = case_when(
    # evaluation ~ new value
    province_territory == "Ontario" ~ "Ontario",
    province_territory == "Quebec" ~ "Quebec",
    province_territory == "British Columbia" ~ "British Columbia",
    province_territory == "Alberta" ~ "Alberta",
    # all others get recoded as "other"
    TRUE ~ "other"
  ))

```

But that's a lot of typing. A more streamlined approach is to put the name of our provinces in a list, and then recode our new variable with the value from the original variable:

```{r}

canpop |> 
  mutate(pt_grp = case_when(
    province_territory %in% c("Ontario", "Quebec", "British Columbia", "Alberta") ~ province_territory,
    TRUE ~ "other"
  ))

```

But we can do even better, by using a comparison to create a population threshold:

```{r}

canpop <- canpop |> 
  mutate(pt_grp = case_when(
    population > 4000000 ~ province_territory,
    TRUE ~ "other"
  ))

canpop

```

Now we can use that new variable to group a table:

```{r}

canpop |> 
  group_by(pt_grp) |> 
  summarise(pt_pop = sum(population)) |> 
  arrange(desc(pt_pop))

```


In another example, we might want to group a series of dates by decade.

```{r}
date_list <- tribble(
  ~ref_date,
  "2000-02-29",
  "2004-02-29", 
  "2011-01-01",
  "2014-03-31",
  "2019-09-01",
  "2023-03-28"
) |> 
  mutate(ref_date = ymd(ref_date))

date_list
```


To create the decade groupings, the `case_when()` follows a `mutate()` function. Note that we are using `lubridate::year()` to extract the year value for our comparison.

```{r}

date_list <- date_list |> 
  mutate(ref_decade = case_when(
    year(ref_date) < 2010 ~ "aughts",
    year(ref_date) < 2020 ~ "teens",
    TRUE ~ "other"
  ))

date_list

```


Another option for this problem would be to round the year value down to the nearest 10. This uses the `round_any()` function from the {plyr} package. (The `round()` function from base R rounds to the nearest 10, so a year like 2019 gets rounded up to the 2020s. The `floor()` function, related to `round()`, rounds down, but only to the nearest integer.)

```{r}

# to round year value down to first year of the decade
date_list |> 
  mutate(ref_decade = plyr::round_any(year(ref_date), accuracy = 10, f = floor))

```

### Multiple comparisons

In this example from the [`case_when()` reference page:](https://dplyr.tidyverse.org/reference/case_when.html), it uses different combinations of three variables to create a single fourth variable. It uses the built-in data set "starwars", a list of characters from the movies.

```{r}
# case_when is particularly useful inside mutate when you want to
# create a new variable that relies on a complex combination of existing
# variables
starwars

```

For this example, we mutate a new variable based on whether a character is a robot, or whether they are large (defined by height or mass).

```{r}

starwars |>
  select(name:mass, gender, species) |>
  mutate(
    type = case_when(
      species == "Droid"        ~ "robot",
      height > 150 | mass > 200 ~ "large",
      TRUE                      ~ "other"
    )
  )

```


And order matters! The function `case_when()` finds all of the cases that meet the first criteria, and then applies the second criteria to the remaining observations. If we switch the order and move "robot" below "large", we find (as one example) that C-3PO is now in the large category, as they are over 150 cm in height.

```{r}
# create "type" variable
starwars |>
  select(name:mass, species) |>
  mutate(
    type = case_when(
      # large first, then robot
      height > 150 | mass > 200 ~ "large",
      species == "Droid"        ~ "robot",
      TRUE                      ~ "other"
    )
  )

```


-30-