18-missing_values.Rmd

# Missing values

**Learning objectives:**

*  Filling and indicating explicit missing values
*  Making implicit missing values more explicit
*  Displaying empty groups if needed

## Introduction

We have already encountered missing values in previous chapters.

You first saw them in Chapter 1, where they resulted in a warning when making a plot:

```{r}
#| echo: true
#| warning: true
#| fig-height: 8
#| fig-alt: "A scatterplot of penguin's body mass in grams vs flipper length in mm."

ggplot2::ggplot(
  data = palmerpenguins::penguins,
  mapping = ggplot2::aes(
      x = .data[["flipper_length_mm"]], 
      y = .data[["body_mass_g"]]
      )
) + 
ggplot2::geom_point()
```

```{r}
#| echo: true

palmerpenguins::penguins |> 
  dplyr::filter(
    is.na(flipper_length_mm) | is.na(body_mass_g)
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
  )
```

```{r}
#| echo: true

nycflights13::flights |> 
  dplyr::group_by(.data[["month"]]) |> 
  dplyr::summarize(
    avg_delay = mean(.data[["dep_delay"]])
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

You saw them again in Section 3.5.2, where they interfered with computing summary statistics:

```{r}
#| echo: true

nycflights13::flights |> 
  dplyr::group_by(.data[["month"]]) |> 
  dplyr::summarize(
    avg_delay = mean(.data[["dep_delay"]], 
                     na.rm = FALSE),
    avg_delay_corrected = mean(.data[["dep_delay"]], 
                     na.rm = TRUE)
  ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

You learned about their infectious nature, and how to check for their presence, in Section 12.2.2:

```{r}
#| echo: true

NA > 5

10 == NA

NA == NA

is.na(NA)

```

We learn more of the details in this chapter, covering additional tools (besides `is.na()` and the `na.rm` argument) for working with missing values:

*  Explicit missing values
*  Implicit missing values
*  Empty groups

## Explicit missing values

When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward). We can fill in these missing values with [`tidyr::fill()`](https://tidyr.tidyverse.org/reference/fill.html).

```{r}
#| echo: true

treatment <- tibble::tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

print(treatment)
```

```{r}
#| echo: true

treatment |>
  tidyr::fill(
    dplyr::everything(),
    .direction = "down"
)

```
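
Note that `dplyr::everything()` fills every column, so the missing `response` in the third row is also overwritten with the value above it. If that `NA` is a genuinely unknown measurement, a minimal sketch (re-creating the same `treatment` tibble so the chunk is self-contained; `filled_person` is a name chosen for this example) fills only `person`:

```{r}
#| echo: true

treatment <- tibble::tribble(
  ~person,           ~treatment, ~response,
  "Derrick Whitmore", 1,         7,
  NA,                 2,         10,
  NA,                 3,         NA,
  "Katherine Burke",  1,         4
)

# Fill only the person column; response keeps its NA
filled_person <- treatment |>
  tidyr::fill("person", .direction = "down")

filled_person
```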

Missing values may need to be represented with some fixed and known value, most commonly 0. You can use [`dplyr::coalesce()`](https://dplyr.tidyverse.org/reference/coalesce.html) to replace them:

```{r}
#| echo: true

x <- c(1, 4, 5, 7, NA)
dplyr::coalesce(x, 0)

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
dplyr::coalesce(y, z)

```

If we need to replace `NA`s in multiple columns, [`tidyr::replace_na()`](https://tidyr.tidyverse.org/reference/replace_na.html) is more useful.

```{r}
#| echo: true

df <- tibble::tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

df

df |> tidyr::replace_na(list(x = 0, y = "unknown"))

```

Sometimes we hit the opposite problem, where a concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

If possible, handle this when reading in the data, for example by using the `na` argument to [`readr::read_csv()`](https://readr.tidyverse.org/reference/read_delim.html), e.g., `read_csv(path, na = "99")`.
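
As a minimal sketch of this, the chunk below reads a small inline CSV. The literal text and the column names `id` and `score` are made up for illustration. Supplying `na` replaces the default `c("", "NA")`, so the defaults are repeated alongside `"99"` here.

```{r}
#| echo: true

# Hypothetical data: 99 is used as a "missing" code in the score column
csv_text <- "id,score\n1,10\n2,99\n3,NA\n4,7\n"

scores <- readr::read_csv(
  I(csv_text),              # I() marks the string as literal data, not a path
  na = c("", "NA", "99"),   # keep the defaults and add the sentinel
  show_col_types = FALSE
)

scores
```

Because `na` applies to every column, this only suits files where 99 is never a legitimate value.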

If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use [`dplyr::na_if()`](https://dplyr.tidyverse.org/reference/na_if.html):

```{r}
#| echo: true

x <- c(1, 4, 5, 7, -99)
dplyr::na_if(x, -99)

```
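
Each call to `dplyr::na_if()` handles one sentinel value. When the data uses several codes, one simple option is to chain the calls. The vector below, which contains both -99 and -999, is invented for illustration.

```{r}
#| echo: true

# Hypothetical vector with two different "missing" codes
x_codes <- c(1, 4, -999, 7, -99)

x_clean <- x_codes |>
  dplyr::na_if(-99) |>
  dplyr::na_if(-999)

x_clean
```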

R has one special type of missing value called `NaN` (pronounced “nan”), short for **n**ot **a** **n**umber. `NaN` occurs when a mathematical operation has an indeterminate result:

```{r}
#| echo: true

0 / 0

0 * Inf

Inf - Inf

sqrt(-1)
```

`NaN` generally behaves just like `NA`.

```{r}
#| echo: true

x <- c(NA, NaN)

x * 10

x == 1

is.na(x)

```

In the rare case you need to distinguish an `NA` from a `NaN`, you can use `is.nan(x)`.

```{r}
#| echo: true

is.nan(x)

```

## Implicit missing values

### Implicit missing values

Consider a simple dataset that records the price of some stock each quarter:

```{r}
#| echo: true

stocks <- tibble::tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

```

This dataset has two missing observations:

-  The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.

-  The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

If there is a need to make implicit missing values explicit, we can pivot the data using [`tidyr::pivot_wider`](https://tidyr.tidyverse.org/reference/pivot_wider.html).

```{r}
#| echo: true

wide_stocks <-  stocks |>
  tidyr::pivot_wider(
    names_from = "qtr", 
    values_from = "price"
  )

wide_stocks
```

By default, making data longer using [`tidyr::pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html) preserves explicit missing values. We can drop them (make them implicit) by setting `values_drop_na = TRUE`.

```{r}
#| echo: true

wide_stocks |>
  tidyr::pivot_longer(
    cols = -c("year"),
    names_to = "qtr", 
    values_to = "price"
  )

```


```{r}
#| echo: true

wide_stocks |>
  tidyr::pivot_longer(
    cols = -c("year"),
    names_to = "qtr", 
    values_to = "price",
    values_drop_na = TRUE
  )

```

[`tidyr::complete()`](https://tidyr.tidyverse.org/reference/complete.html) turns implicit missing values into explicit missing values by generating every combination of the values found in its input columns.

```{r}
#| echo: true

stocks |>
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 7
  )

```

```{r}
#| echo: true

stocks |>
  tidyr::complete(
    .data[["year"]], 
    .data[["qtr"]]) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 4
  )

```

Sometimes the individual variables are themselves incomplete and there is a need to provide your own data. For example, if we know that the stocks dataset is supposed to run from 2019 to 2021, we could explicitly supply those values for year.

```{r}
#| echo: true

stocks |>
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 7
  )

```

```{r}
#| echo: true

stocks |>
  tidyr::complete(
    `year` = 2019:2021, 
    .data[["qtr"]]) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 4
  )

```

Another way to reveal implicitly missing observations is by using [`dplyr::anti_join`](https://dplyr.tidyverse.org/reference/filter-joins.html). Here, four of the destinations do not have any [airport](https://nycflights13.tidyverse.org/reference/airports.html) metadata information.

```{r}
#| echo: true

# Get unique destination and rename to faa

dest_flights <- nycflights13::flights |> 
  dplyr::distinct(faa = .data[["dest"]])

dest_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

```{r}
#| echo: true

dest_flights |> 
  dplyr::anti_join(
    y = nycflights13::airports,
    by = dplyr::join_by("faa")
 ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

Here, 722 planes do not have any [planes](https://nycflights13.tidyverse.org/reference/planes.html) metadata information.

```{r}
#| echo: true

# Get unique tail numbers

tailnum_flights <- nycflights13::flights |> 
  dplyr::distinct(.data[["tailnum"]])

tailnum_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

```{r}
#| echo: true

tailnum_flights |> 
  dplyr::anti_join(
    y = nycflights13::planes,
    by = dplyr::join_by("tailnum")
 ) |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    defaultPageSize = 5
  )

```

### `dplyr::anti_join` Extra

Use [`dplyr::anti_join`](https://dplyr.tidyverse.org/reference/filter-joins.html) to isolate the rows that cause a [`dplyr::inner_join`](https://dplyr.tidyverse.org/reference/mutate-joins.html) error.

Extra Weight Case:

```{r}
#| echo: true
#| warning: true
#| error: true

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  0,         1500,
  1,         3220,
  2,         4730,
  3,         4000,
  4,         1000,
  5,         1100
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 

```

```{r}
#| echo: true
#| warning: true

weight_extra <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  0,         1500,
  1,         3220,
  2,         4730,
  3,         4000,
  4,         1000,
  5,         1100
)

weight_extra |> 
  dplyr::anti_join(
    y = three_penguins,
    by = dplyr::join_by("samp_id")
 ) 

```

Weight 3 Missing Case:

```{r}
#| echo: true
#| error: true
#| warning: true

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_no_3 <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_no_3,
    by = dplyr::join_by("samp_id"),
    unmatched = "error"
 ) 

```

```{r}
#| echo: true
#| warning: true


three_penguins |> 
  dplyr::anti_join(
    y = weight_no_3,
    by = dplyr::join_by("samp_id")
 ) 


```

Unfortunately, `dplyr::anti_join` cannot reveal multiple matches. Use both the `relationship = "one-to-one"` and `unmatched = "error"` arguments to ensure that each row of `x` matches exactly one row of `y`.

```{r}
#| echo: true
#| warning: true
#| error: true
#| output-location: column

three_penguins <- tibble::tribble(
  ~samp_id, ~species,    ~island,
  1,        "Adelie",    "Torgersen",
  2,        "Gentoo",    "Biscoe",
  3,        "Chinstrap", "Dream"
)

weight_extra_2 <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

three_penguins |> 
  dplyr::inner_join(
    y = weight_extra_2,
    by = dplyr::join_by("samp_id"),
    relationship = "one-to-one",
    unmatched = "error"
 ) 

```
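
To see which keys are responsible for the multiple matches, one option (a sketch, not the only approach) is to count the rows per `samp_id` and keep the keys that occur more than once. The table is re-created here so the chunk stands alone, and `duplicated_keys` is a name chosen for this example.

```{r}
#| echo: true

weight_extra_2 <- tibble::tribble(
  ~samp_id,  ~body_mass_g,
  1,         3220,
  2,         4730,
  2,         4725,
  3,         4000
)

# Keys that appear in more than one row
duplicated_keys <- weight_extra_2 |>
  dplyr::count(samp_id) |>
  dplyr::filter(n > 1)

duplicated_keys
```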

### Exercises

Can you find any relationship between the carrier and the rows that appear to be missing from `planes`?

We first get all distinct combinations of carrier and tail number. We then do a left join with [`nycflights13::airlines`](https://nycflights13.tidyverse.org/reference/airlines.html) so that we know what each carrier abbreviation means.

```{r}
#| echo: true

tailnum_carrier_flights <- nycflights13::flights |> 
  dplyr::distinct(.data[["tailnum"]], .data[["carrier"]]) |> 
  dplyr::arrange(.data[["carrier"]]) |> 
  dplyr::left_join(
    nycflights13::airlines,
    by = dplyr::join_by("carrier")
  )

tailnum_carrier_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    filterable = TRUE,
    defaultPageSize = 5
  )

```

We now use [`dplyr::anti_join`](https://dplyr.tidyverse.org/reference/filter-joins.html) with [`nycflights13::planes`](https://nycflights13.tidyverse.org/reference/planes.html) to identify tail numbers that have no plane information. 

We can see that most of them come from either AA (American Airlines Inc.) or MQ (Envoy Air).

```{r}
#| echo: true

missing_tailnum_carrier_flights <- tailnum_carrier_flights |> 
  dplyr::anti_join(
    y = nycflights13::planes,
    by = dplyr::join_by("tailnum")
 ) 

missing_tailnum_carrier_flights[["carrier"]] |> 
  table()

```

```{r}
#| echo: true

missing_tailnum_carrier_flights |> 
  reactable::reactable(
    theme = reactablefmtr::dark(),
    filterable = TRUE,
    defaultPageSize = 5
  )

```
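
Raw counts favour carriers with large fleets, so it can also help to look at the share of each carrier's tail numbers that are missing from `planes`. The sketch below is one way to compute it. The names `missing_share`, `n_tailnum`, `n_missing` and `prop_missing` are made up for this example, and `.by` assumes dplyr 1.1.0 or later.

```{r}
#| echo: true

missing_share <- nycflights13::flights |>
  dplyr::distinct(tailnum, carrier) |>
  dplyr::summarize(
    # Distinct tail numbers flown by the carrier
    n_tailnum = dplyr::n(),
    # Tail numbers with no row in planes
    n_missing = sum(!tailnum %in% nycflights13::planes$tailnum),
    prop_missing = n_missing / n_tailnum,
    .by = carrier
  ) |>
  dplyr::arrange(dplyr::desc(prop_missing))

missing_share
```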

##  Factors and empty groups

###  Factors and empty groups

A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors.

Here is a dataset that contains some health information about people.

```{r}
#| echo: true

health <- tibble::tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), 
                  levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)

```

We want to count the number of smokers and non-smokers with [`dplyr::count()`](https://dplyr.tidyverse.org/reference/count.html), but it only gives us the number of non-smokers because the group of smokers is empty:

```{r}
#| echo: true

health |> dplyr::count(smoker)

```

We can ask `dplyr::count()` to keep all the groups, even those not seen in the data, by using `.drop = FALSE`:

```{r}
#| echo: true

health |> dplyr::count(smoker,
                       .drop = FALSE)

```

The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying `drop = FALSE` to the appropriate discrete axis:

```{r}
#| echo: true
#| fig-alt: "A bar chart of the number of smokers and non-smokers. No bar is shown for smokers because there are no smokers in the dataset."

ggplot2::ggplot(
  data = health, 
  mapping = ggplot2::aes(
    x = .data[["smoker"]])
  ) +
  ggplot2::geom_bar() +
  ggplot2::scale_x_discrete()

```

```{r}
#| echo: true
#| fig-alt: "A bar chart of the number of smokers and non-smokers. Thanks to the argument drop = FALSE in scale_x_discrete(), the smoker category appears on the axis even though there are no smokers in the dataset."

ggplot2::ggplot(
  data = health, 
  mapping = ggplot2::aes(
    x = .data[["smoker"]])
  ) +
  ggplot2::geom_bar() +
  ggplot2::scale_x_discrete(drop = FALSE)

```

The same problem comes up more generally with [`dplyr::group_by()`](https://dplyr.tidyverse.org/reference/group_by.html). And again you can use `.drop = FALSE` to preserve all factor levels:

```{r}
#| echo: true

health |> 
  dplyr::group_by(
    .data[["smoker"]]
  ) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )

```

```{r}
#| echo: true

health |> 
  dplyr::group_by(
    .data[["smoker"]], 
    .drop = FALSE) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )

```

We get some interesting results here because, when summarizing an empty group, the summary functions are applied to zero-length vectors.

Here we see `mean({zero_vec})` returning `NaN` because 

`mean({zero_vec}) =` `sum({zero_vec})/length({zero_vec})` 

which is 0/0. 

`max()` and `min()` return `-Inf` and `Inf` for empty vectors.
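
These rules can be checked directly on a zero-length vector, independently of dplyr. In this sketch `zero_vec` is just `numeric(0)`, and `suppressWarnings()` hides the warnings that `min()` and `max()` emit for empty input.

```{r}
#| echo: true

zero_vec <- numeric(0)

# Number of elements, as dplyr::n() reports for an empty group
length(zero_vec)

# sum / length is 0 / 0
mean(zero_vec)

suppressWarnings(min(zero_vec))

suppressWarnings(max(zero_vec))

sd(zero_vec)
```

`sd()` returns `NA` because the sample standard deviation needs at least two values.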

```{r}
#| echo: true

health |> 
  dplyr::group_by(
    .data[["smoker"]], 
    .drop = FALSE) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  )

```

Instead of `.drop = FALSE`, we can use [`tidyr::complete()`](https://tidyr.tidyverse.org/reference/complete.html) to make the implicit missing values explicit. The main drawback of this approach is that you get an `NA` for the count, even though you know that it should be zero.

```{r}
#| echo: true

health |> 
  dplyr::group_by(
    .data[["smoker"]]
  ) |> 
  dplyr::summarize(
    n = dplyr::n(),
    mean_age = mean(.data[["age"]]),
    min_age = min(.data[["age"]]),
    max_age = max(.data[["age"]]),
    sd_age = sd(.data[["age"]])
  ) |> 
  tidyr::complete(.data[["smoker"]])

```
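
If the summary is only a count, the `fill` argument of `tidyr::complete()` can supply the zero. A minimal sketch, re-creating `health` so the chunk is self-contained (`smoker_counts` is a name chosen for this example):

```{r}
#| echo: true

health <- tibble::tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"),
                  levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56)
)

smoker_counts <- health |>
  dplyr::count(smoker) |>
  # Add a row for every unused factor level and set its count to 0
  tidyr::complete(smoker, fill = list(n = 0L))

smoker_counts
```

This only helps for columns where a replacement value is known; summaries such as the mean age of an empty group have no sensible fill value.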

### `forcats 1.0.0` Extra

Adapted from [`forcats 1.0.0` blog](https://www.tidyverse.org/blog/2023/01/forcats-1-0-0/)

There are two ways to represent a missing value in a factor:

NA as values:

```{r}
#| echo: true

f1 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NA)

levels(f1)

```

NA as levels:

```{r}
#| echo: true

f2 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NULL)

levels(f2)

```

They behave differently when `is.na()` and `as.integer()` are applied.

NA as values:

`NA`s in the values tend to be best for data analysis.

```{r}
#| echo: true

f1 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NA)

is.na(f1)

as.integer(f1)

```

NA as levels:

`NA`s in the levels are useful if you need to control where missing values are shown in a table or a plot

```{r}
#| echo: true

f2 <- factor(c("x", "y", NA, NA, "x"), 
             exclude = NULL)

is.na(f2)

as.integer(f2)

```

To make it easier to switch between these forms, forcats now comes with [`fct_na_value_to_level()`](https://forcats.tidyverse.org/reference/fct_na_value_to_level.html) and [`fct_na_level_to_value()`](https://forcats.tidyverse.org/reference/fct_na_value_to_level.html).
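
A small sketch of the round trip, using the same values as the factors above. The names `f_values`, `f_levels` and `f_back` are chosen for this example, and `"(Missing)"` is an arbitrary label.

```{r}
#| echo: true

f_values <- factor(c("x", "y", NA, NA, "x"))

# NA values become an explicit level
f_levels <- forcats::fct_na_value_to_level(f_values, level = "(Missing)")

levels(f_levels)

is.na(f_levels)

# Convert that level back into NA values
f_back <- forcats::fct_na_level_to_value(f_levels, extra_levels = "(Missing)")

is.na(f_back)
```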

In the plot below, we use [`fct_infreq()`](https://forcats.tidyverse.org/reference/fct_inorder.html) to reorder the levels of the factor so that the highest frequency levels are at the top of the bar chart. However, because the `NA`s are stored in the values, [`fct_infreq()`](https://forcats.tidyverse.org/reference/fct_inorder.html) has no ability to affect them, so they appear in their "default" position. 

```{r}
#| echo: true
#| output-location: column
#| fig-alt: "A barchart showing the number of hair colour type in the modified starwars hair colour dataset. We can see that the missing group types are not consolidated together."

example <- data.frame(
  hair_color = c(dplyr::starwars$hair_color, 
                 rep("missing", 10), 
                 rep("don't know", 5))
 ) |> 
  dplyr::mutate(
    hair_color = .data[["hair_color"]] |> 
      # Reorder factor by frequency
      forcats::fct_infreq() |> 
      # Group hair colours with less than 2 observations as Other
      forcats::fct_lump_min(2, other_level = "(Other)") |>
      forcats::fct_rev()
  ) 

example |> 
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      y = .data[["hair_color"]]
    )
  ) + 
  ggplot2::geom_bar() + 
  ggplot2::labs(y = "Hair color")


```

To consolidate all missing values:

-  Use [`fct_recode()`](https://forcats.tidyverse.org/reference/fct_recode.html) to recode the level "don't know" as "missing".

-  Use [`fct_na_level_to_value()`](https://forcats.tidyverse.org/reference/fct_na_value_to_level.html) to convert the level "missing" into `NA` values.

-  Use [`fct_na_value_to_level()`](https://forcats.tidyverse.org/reference/fct_na_value_to_level.html) to convert all `NA` values into a single level called "(Missing)".

```{r}
#| echo: true
#| output-location: column
#| code-line-numbers: "|10-15"
#| fig-alt: "A barchart showing the number of hair colour type in the modified starwars hair colour dataset. We can see that the missing group types are consolidated together."

example <- data.frame(
  hair_color = c(dplyr::starwars$hair_color, 
                 rep("missing", 10), 
                 rep("don't know", 5))
 ) |> 
  dplyr::mutate(
    hair_color = .data[["hair_color"]] |> 
      # Reorder factor by frequency
      forcats::fct_infreq() |> 
      forcats::fct_recode(
        missing = "don't know") |> 
      forcats::fct_na_level_to_value(
        extra_levels = "missing") |>
      forcats::fct_na_value_to_level(
        level = "(Missing)") |>
      # Group hair colours with less than 2 observations as Other
      forcats::fct_lump_min(2, other_level = "(Other)") |>
      forcats::fct_rev()
  )

example |> 
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      y = .data[["hair_color"]]
    )
  ) + 
  ggplot2::geom_bar() + 
  ggplot2::labs(y = "Hair color")


```

## Meeting Videos

### Cohort 7

`r knitr::include_url("https://www.youtube.com/embed/H9vdMIyEtfA")`

### Cohort 8

`r knitr::include_url("https://www.youtube.com/embed/oL8fMJqBWVs")`