-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathoutliers.qmd
193 lines (160 loc) · 6.85 KB
/
outliers.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# Detect and correct outliers in signals
This chapter describes functionality for detecting and correcting outliers in
signals in the `detect_outlr()` and `correct_outlr()` functions provided in the
`epiprocess` package. These functions is designed to be modular and extendable,
so that you can define your own outlier detection and correction routines and
apply them to `epi_df` objects. We'll demonstrate this using state-level daily
reported COVID-19 case counts from FL and NJ.
```{r, include=FALSE}
source("_common.R")
```
```{r}
incidence_num_outlier_example
```
```{r, warning=FALSE, message=FALSE}
#| code-fold: true
ggplot(incidence_num_outlier_example, aes(x = time_value, y = cases, color = geo_value)) +
geom_line() +
scale_color_manual(values = c(3, 6)) +
geom_hline(yintercept = 0, linetype = 3) +
facet_wrap(~ geo_value, scales = "free_y", ncol = 1) +
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
labs(x = "Date", y = "Reported COVID-19 counts")
```
There are multiple outliers in these data that a modeler may want to detect and
correct. We'll discuss those two tasks in turn.
## Outlier detection
The `detect_outlr()` function allows us to run multiple outlier detection
methods on a given signal, and then (optionally) combine the results from those
methods. Here, we'll investigate outlier detection results from the following
methods.
1. Detection based on a rolling median, using `detect_outlr_rm()`, which
computes a rolling median on with a default window size of `n` time points
centered at the time point under consideration, and then computes thresholds
based on a multiplier times a rolling IQR computed on the residuals.
2. Detection based on a seasonal-trend decomposition using LOESS (STL), using
`detect_outlr_stl()`, which is similar to the rolling median method but
replaces the rolling median with fitted values from STL.
3. Detection based on an STL decomposition, but without seasonality term, which
amounts to smoothing using LOESS.
The outlier detection methods are specified using a `tibble` that is passed to
`detect_outlr()`, with one row per method, and whose columms specify the
outlier detection function, any input arguments (only nondefault values need to
be supplied), and an abbreviated name for the method used in tracking results.
Abbreviations "rm" and "stl" can be used for the built-in detection functions
`detect_outlr_rm()` and `detect_outlr_stl()`, respectively.
```{r}
detection_methods = bind_rows(
tibble(method = "rm",
args = list(list(detect_negatives = TRUE,
detection_multiplier = 2.5)),
abbr = "rm"),
tibble(method = "stl",
args = list(list(detect_negatives = TRUE,
detection_multiplier = 2.5,
seasonal_period = 7)),
abbr = "stl_seasonal"))
detection_methods
```
Additionally, we'll form combined lower and upper thresholds, calculated as the
median of the lower and upper thresholds from the methods at each time point.
Note that using this combined median threshold is equivalent to using a majority
vote across the base methods to determine whether a value is an outlier.
```{r}
x <- incidence_num_outlier_example %>%
group_by(geo_value) %>%
mutate(
outlier_info = detect_outlr(
x = time_value, y = cases,
methods = detection_methods,
combiner = "median")
) %>%
unpack(outlier_info) %>%
ungroup()
x
```
To visualize the results, we define a convenience function for and call it on
each state separately (hidden below the fold).
```{r}
#| code-fold: true
# Plot outlier detection bands and/or points identified as outliers
plot_outlr <- function(
x, signal, method_abbr, bands = TRUE, points = TRUE,
facet_vars = vars(geo_value), nrow = NULL, ncol = NULL,
scales = "fixed") {
# Convert outlier detection results to long format
signal <- rlang::enquo(signal)
x_long <- x %>%
pivot_longer(
cols = starts_with(method_abbr),
names_to = c("method", ".value"),
names_pattern = "(.+)_(.+)")
# Start of plot with observed data
p <- ggplot() +
geom_line(data = x, mapping = aes(x = time_value, y = !!signal))
# If requested, add bands
if (bands)
p <- p + geom_ribbon(data = x_long,
aes(x = time_value, ymin = lower, ymax = upper,
color = method), fill = NA)
# If requested, add points
if (points) {
x_detected <- x_long %>% filter((!!signal < lower) | (!!signal > upper))
p <- p + geom_point(data = x_detected,
aes(x = time_value, y = !!signal, color = method,
shape = method))
}
# If requested, add faceting
if (!is.null(facet_vars))
p <- p + facet_wrap(facet_vars, nrow = nrow, ncol = ncol, scales = scales)
return(p)
}
```
Now we produce plots for each state at a time, faceting by the detection method.
```{r, fig.width = 8, fig.height = 6}
#| code-fold: true
method_abbr <- c(detection_methods$abbr, "combined")
plot_outlr(x %>% filter(geo_value == "fl"), cases, method_abbr,
facet_vars = vars(method), scales = "free_y", ncol = 2) +
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
labs(x = "Date", y = "Reported COVID-19 counts", color = "Method",
shape = "Method") +
scale_color_brewer(palette = "Set1") +
ggtitle("Florida") +
theme(legend.position = "bottom")
plot_outlr(x %>% filter(geo_value == "nj"), cases, method_abbr,
facet_vars = vars(method), scales = "free_y", ncol = 2) +
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
labs(x = "Date", y = "Reported COVID-19 counts", color = "Method",
shape = "Method") +
scale_color_brewer(palette = "Set1") +
ggtitle("New Jersey") +
theme(legend.position = "bottom")
```
## Outlier correction
Finally, in order to correct outliers, we can use the posited replacement values
returned by each outlier detection method. Below we use the replacement value
from the combined method, which is defined by the median of replacement values
from the base methods at each time point.
```{r, fig.width = 8, fig.height = 7}
y <- x %>%
mutate(cases_corrected = combined_replacement) %>%
select(geo_value, time_value, cases, cases_corrected)
y %>% filter(cases != cases_corrected)
```
```{r, fig.height=5}
#| code-fold: true
y %>%
pivot_longer(starts_with("cases")) %>%
ggplot(aes(x = time_value)) +
geom_line(aes(y = value, color = name, linetype = name)) +
scale_color_brewer(palette = "Set1") +
scale_linetype_manual(values = c(2, 1)) +
geom_hline(yintercept = 0) +
facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
labs(x = "Date", y = "Reported COVID-19 counts") +
theme(legend.position = "bottom", legend.title = element_blank())
```
More advanced correction functionality will be coming at some point in the
future.