outliers.qmd

# Detect and correct outliers in signals

This chapter describes functionality for detecting and correcting outliers in
signals in the `detect_outlr()` and `correct_outlr()` functions provided in the
`epiprocess` package. These functions is designed to be modular and extendable,
so that you can define your own outlier detection and correction routines and
apply them to `epi_df` objects. We'll demonstrate this using state-level daily
reported COVID-19 case counts from FL and NJ.

```{r, include=FALSE}
source("_common.R")
```

```{r}
incidence_num_outlier_example
```

```{r, warning=FALSE, message=FALSE}
#| code-fold: true
ggplot(incidence_num_outlier_example, aes(x = time_value, y = cases, color = geo_value)) +
  geom_line() +
  scale_color_manual(values = c(3, 6)) +
  geom_hline(yintercept = 0, linetype = 3) +
  facet_wrap(~ geo_value, scales = "free_y", ncol = 1) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts")
```

There are multiple outliers in these data that a modeler may want to detect and
correct. We'll discuss those two tasks in turn.

## Outlier detection

The `detect_outlr()` function allows us to run multiple outlier detection
methods on a given signal, and then (optionally) combine the results from those
methods. Here, we'll investigate outlier detection results from the following
methods.

1. Detection based on a rolling median, using `detect_outlr_rm()`, which
  computes a rolling median on with a default window size of `n` time points
  centered at the time point under consideration, and then computes thresholds
  based on a multiplier times a rolling IQR computed on the residuals.
2. Detection based on a seasonal-trend decomposition using LOESS (STL), using
  `detect_outlr_stl()`, which is similar to the rolling median method but
  replaces the rolling median with fitted values from STL.
3. Detection based on an STL decomposition, but without seasonality term, which
  amounts to smoothing using LOESS.

The outlier detection methods are specified using a `tibble` that is passed to
`detect_outlr()`, with one row per method, and whose columms specify the
outlier detection function, any input arguments (only nondefault values need to
be supplied), and an abbreviated name for the method used in tracking results.
Abbreviations "rm" and "stl" can be used for the built-in detection functions
`detect_outlr_rm()` and `detect_outlr_stl()`, respectively.

```{r}
detection_methods = bind_rows(
  tibble(method = "rm",
         args = list(list(detect_negatives = TRUE,
                          detection_multiplier = 2.5)),
         abbr = "rm"),
  tibble(method = "stl",
         args = list(list(detect_negatives = TRUE,
                          detection_multiplier = 2.5,
                          seasonal_period = 7)),
         abbr = "stl_seasonal"))

detection_methods
```

Additionally, we'll form combined lower and upper thresholds, calculated as the
median of the lower and upper thresholds from the methods at each time point.
Note that using this combined median threshold is equivalent to using a majority
vote across the base methods to determine whether a value is an outlier.

```{r}
x <- incidence_num_outlier_example %>%
  group_by(geo_value) %>%
  mutate(
    outlier_info  = detect_outlr(
      x = time_value, y = cases,
      methods = detection_methods,
      combiner = "median")
  ) %>%
  unpack(outlier_info) %>%
  ungroup()

x
```

To visualize the results, we define a convenience function for and call it on
each state separately (hidden below the fold).

```{r}
#| code-fold: true
# Plot outlier detection bands and/or points identified as outliers
plot_outlr <- function(
    x, signal, method_abbr, bands = TRUE, points = TRUE,
    facet_vars = vars(geo_value), nrow = NULL, ncol = NULL,
    scales = "fixed") {

  # Convert outlier detection results to long format
  signal <- rlang::enquo(signal)
  x_long <- x %>%
    pivot_longer(
      cols = starts_with(method_abbr),
      names_to = c("method", ".value"),
      names_pattern = "(.+)_(.+)")

  # Start of plot with observed data
  p <- ggplot() +
    geom_line(data = x, mapping = aes(x = time_value, y = !!signal))

  # If requested, add bands
  if (bands)
    p <- p + geom_ribbon(data = x_long,
                         aes(x = time_value, ymin = lower, ymax = upper,
                             color = method), fill = NA)

  # If requested, add points
  if (points) {
    x_detected <- x_long %>% filter((!!signal < lower) | (!!signal > upper))
    p <- p + geom_point(data = x_detected,
                        aes(x = time_value, y = !!signal, color = method,
                            shape = method))
  }

  # If requested, add faceting
  if (!is.null(facet_vars))
    p <- p + facet_wrap(facet_vars, nrow = nrow, ncol = ncol, scales = scales)

  return(p)
}
```

Now we produce plots for each state at a time, faceting by the detection method.

```{r, fig.width = 8, fig.height = 6}
#| code-fold: true
method_abbr <- c(detection_methods$abbr, "combined")

plot_outlr(x %>% filter(geo_value == "fl"), cases, method_abbr,
           facet_vars = vars(method), scales = "free_y", ncol = 2) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts", color  = "Method",
       shape = "Method") +
  scale_color_brewer(palette = "Set1") +
  ggtitle("Florida") +
  theme(legend.position = "bottom")

plot_outlr(x %>% filter(geo_value == "nj"), cases, method_abbr,
           facet_vars = vars(method), scales = "free_y", ncol = 2) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts", color  = "Method",
       shape = "Method") +
  scale_color_brewer(palette = "Set1") +
  ggtitle("New Jersey") +
  theme(legend.position = "bottom")
```

## Outlier correction

Finally, in order to correct outliers, we can use the posited replacement values
returned by each outlier detection method. Below we use the replacement value
from the combined method, which is defined by the median of replacement values
from the base methods at each time point.

```{r, fig.width = 8, fig.height = 7}
y <- x %>%
  mutate(cases_corrected = combined_replacement) %>%
  select(geo_value, time_value, cases, cases_corrected)

y %>% filter(cases != cases_corrected)
```

```{r, fig.height=5}
#| code-fold: true
y %>%
  pivot_longer(starts_with("cases")) %>%
  ggplot(aes(x = time_value)) +
  geom_line(aes(y = value, color = name, linetype = name)) +
  scale_color_brewer(palette = "Set1") +
  scale_linetype_manual(values = c(2, 1)) +
  geom_hline(yintercept = 0) +
  facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts") +
  theme(legend.position = "bottom", legend.title = element_blank())
```

More advanced correction functionality will be coming at some point in the
future.