feat: add vignette snapshots
dshemetov committed Mar 21, 2024
1 parent e74b7e7 commit 6f1891d
Showing 13 changed files with 4,996 additions and 50 deletions.
824 changes: 824 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/advanced.html

Large diffs are not rendered by default.

825 changes: 825 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/advanced.new.html

613 changes: 613 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/aggregation.html

699 changes: 699 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/archive.html

696 changes: 696 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/archive.new.html

672 changes: 672 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/epiprocess.html

614 changes: 614 additions & 0 deletions tests/testthat/_snaps/vignette-snapshot/slide.html

18 changes: 18 additions & 0 deletions tests/testthat/test-vignette-snapshot.R
@@ -0,0 +1,18 @@
# Vignettes that use epi_archives or epi_dfs.
vignettes <- paste0(here::here("vignettes/"), c(
"advanced.Rmd",
"aggregation.Rmd",
"archive.Rmd",
"epiprocess.Rmd",
"slide.Rmd"
))
for (input_file in vignettes) {
test_that(paste0("snapshot vignette ", basename(input_file)), {
# skip("Skipping snapshot tests by default, as they are slow.")
output_file <- sub("\\.Rmd$", ".html", input_file)
withr::with_file(output_file, {
devtools::build_rmd(input_file)
expect_snapshot_file(output_file)
})
})
}
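
The path handling in the test above can be checked in isolation. The following is a minimal base R sketch, not part of the commit: `project_root` is a hypothetical stand-in for the value returned by `here::here()`, and `file.path()` is used in place of `paste0()` purely for illustration.

```r
# Hypothetical stand-in for the project root; the test itself uses here::here().
project_root <- tempdir()

vignette_names <- c(
  "advanced.Rmd",
  "aggregation.Rmd",
  "archive.Rmd",
  "epiprocess.Rmd",
  "slide.Rmd"
)

# file.path() inserts the separator itself, so no trailing slash is needed.
input_files <- file.path(project_root, "vignettes", vignette_names)

# Same substitution as in the test: swap a trailing ".Rmd" for ".html".
output_files <- sub("\\.Rmd$", ".html", input_files)

basename(output_files)
```

Because the pattern is anchored with `$`, only the file extension is rewritten, even if `.Rmd` happened to appear elsewhere in the path.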
18 changes: 7 additions & 11 deletions vignettes/advanced.Rmd
@@ -9,14 +9,13 @@ vignette: >

In this vignette, we discuss how to use the sliding functionality in the
`epiprocess` package with less common grouping schemes or with computations that
have advanced output structures.
The output of a slide computation should either be an atomic value/vector, or a
data frame. This data frame can have multiple columns, multiple rows, or both.
have advanced output structures. The output of a slide computation should either
be an atomic value/vector, or a data frame. This data frame can have multiple
columns, multiple rows, or both.

During basic usage (e.g., when all optional arguments are set to their defaults):

* `epi_slide(edf, <computation>, .....)`:

* keeps **all** columns of `edf`, adds computed column(s)
* outputs **one row per row in `edf`** (recycling outputs from
computations appropriately if there are multiple time series bundled
@@ -26,9 +25,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
`dplyr::arrange(time_value, .by_group = TRUE)`**
* outputs an **`epi_df`** if the required columns are present, otherwise a
tibble

* `epix_slide(ea, <computation>, .....)`:

* keeps **grouping and `time_value`** columns of `ea`, adds computed
column(s)
* outputs **any number of rows** (computations are allowed to output any
@@ -40,6 +37,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
* outputs a **tibble**
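
To make the "one row per row in `edf`" behavior concrete, here is a rough base R sketch of a 2-day trailing average per geo value. It is only an illustration under stated assumptions and not how `epi_slide` is implemented: `trailing_mean()` is a hypothetical helper, the rows are assumed to be sorted by `time_value` within each `geo_value`, and there are assumed to be no missing dates.

```r
# Toy data: two locations, three consecutive days each.
edf <- data.frame(
  geo_value = rep(c("ca", "fl"), each = 3),
  time_value = rep(
    seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"),
    times = 2
  ),
  x = c(1, 2, 3, 10, 20, 30)
)

# Hypothetical helper: mean over the current value and up to `before`
# preceding values (the window is truncated at the start of the series).
trailing_mean <- function(values, before) {
  vapply(
    seq_along(values),
    function(i) mean(values[max(1, i - before):i]),
    numeric(1)
  )
}

# ave() applies the function within each geo_value and returns a vector
# aligned with the original rows, so the row count is unchanged.
edf$x_2dav <- ave(edf$x, edf$geo_value, FUN = function(v) trailing_mean(v, before = 1))

edf
```

The result keeps every original column and row and adds one computed column, which mirrors the basic `epi_slide` output shape described in the list above.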

These differences in basic behavior make some common slide operations require less boilerplate:

* predictors and targets calculated with `epi_slide` are automatically lined up
with each other and with the signals from which they were calculated; and
* computations for an `epix_slide` can output data frames with any number of
@@ -84,13 +82,14 @@ simple synthetic example.
```{r message = FALSE}
library(epiprocess)
library(dplyr)
set.seed(123)
edf <- tibble(
geo_value = rep(c("ca", "fl", "pa"), each = 3),
time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
x = seq_along(geo_value) + 0.01 * rnorm(length(geo_value)),
) %>%
as_epi_df()
as_epi_df(as_of = as.Date("2024-03-20"))
# 2-day trailing average, per geo value
edf %>%
@@ -338,7 +337,7 @@ library(data.table)
library(ggplot2)
theme_set(theme_bw())
x <- archive_cases_dv_subset_2$DT %>%
x <- archive_cases_dv_subset$DT %>%
filter(geo_value %in% c("ca", "fl")) %>%
as_epi_archive(compactify = FALSE)
```
@@ -525,10 +524,7 @@ separate ARX model on each state. As in the archive vignette, we can see a
difference between version-aware (right column) and -unaware (left column)
forecasting, as well.


## Attribution
The `case_rate_7d_av` data used in this document is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.

The `percent_cli` data is a modified part of the [COVIDcast Epidata API Doctor Visits data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html). This dataset is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/). Copyright Delphi Research Group at Carnegie Mellon University 2020.


6 changes: 3 additions & 3 deletions vignettes/aggregation.Rmd
@@ -34,7 +34,7 @@ x <- pub_covidcast(
) %>%
select(geo_value, time_value, cases = value) %>%
full_join(y, by = "geo_value") %>%
as_epi_df()
as_epi_df(as_of = as.Date("2024-03-20"))
```

The data contains 16,212 rows and 5 columns.
@@ -192,7 +192,7 @@ running `epi_slide()` on the zero-filled data brings these trailing averages

```{r}
xt %>%
as_epi_df() %>%
as_epi_df(as_of = as.Date("2024-03-20")) %>%
group_by(geo_value) %>%
epi_slide(cases_7dav = mean(cases), before = 6) %>%
ungroup() %>%
@@ -203,7 +203,7 @@ xt %>%
print(n = 7)
xt_filled %>%
as_epi_df() %>%
as_epi_df(as_of = as.Date("2024-03-20")) %>%
group_by(geo_value) %>%
epi_slide(cases_7dav = mean(cases), before = 6) %>%
ungroup() %>%
47 changes: 19 additions & 28 deletions vignettes/archive.Rmd
@@ -7,16 +7,16 @@ vignette: >
%\VignetteEncoding{UTF-8}
---

In addition to the `epi_df` data structure, which we have been working with all
along in these vignettes, the `epiprocess` package has a companion structure
called `epi_archive`. In comparison to an `epi_df` object, which can be seen as
storing a single snapshot of a data set with the most up-to-date signal values
as of some given time, an `epi_archive` object stores the full version history
of a data set. Many signals of interest for epidemiological tracking are subject
to revision (some more than others), and paying attention to data revisions can
be important for all sorts of downstream data analysis and modeling tasks.

This vignette walks through working with `epi_archive` objects and demonstrates
In addition to the `epi_df` data structure, the `epiprocess` package has a
companion structure called `epi_archive`. In comparison to an `epi_df` object,
which can be seen as storing a single snapshot of a data set with the most
up-to-date signal values as of some given time, an `epi_archive` object stores
the full version history of a data set. Many signals of interest for
epidemiological tracking are subject to revision (some more than others) and
paying attention to data revisions can be important for all sorts of downstream
data analysis and modeling tasks.

This vignette walks through working with `epi_archive()` objects and demonstrates
some of their key functionality. We'll work with a signal on the percentage of
doctor's visits with CLI (COVID-like illness) computed from medical insurance
claims, available through the [COVIDcast
@@ -55,9 +55,8 @@ library(ggplot2)

## Getting data into `epi_archive` format

An <code><a href="../reference/epi_archive.html">epi_archive</a></code> object
can be constructed from a data frame, data table, or tibble, provided that it
has (at least) the following columns:
An `epi_archive()` object can be constructed from a data frame, data table, or
tibble, provided that it has (at least) the following columns:

* `geo_value`: the geographic value associated with each row of measurements.
* `time_value`: the time value associated with each row of measurements.
@@ -71,7 +70,7 @@ As we can see from the above, the data frame returned by
format, with `issue` playing the role of `version`. We can now use
`as_epi_archive()` to bring it into `epi_archive` format. For removal of
redundant version updates in `as_epi_archive` using compactify, please refer to
the compactify vignette.
the [compactify vignette](articles/compactify.html).

```{r, eval=FALSE}
x <- dv %>%
@@ -91,10 +90,10 @@ class(x)
print(x)
```

An `epi_archive` is a special kind of class called an R6 class. Its primary field
is a data table `DT`, which is of class `data.table` (from the `data.table`
package), and has columns `geo_value`, `time_value`, `version`, as well as any
number of additional columns.
An `epi_archive` consists of a primary field `DT`, which is a data table
(from the `data.table` package) that has the columns `geo_value`, `time_value`,
`version` (and possibly additional ones), and other metadata fields, such as
`geo_type` and `time_type`.

```{r}
class(x$DT)
@@ -112,9 +111,7 @@ key(x$DT)
```

In general, the last version of each observation is carried forward (LOCF) to
fill in data between recorded versions. **A word of caution:** R6 objects,
unlike most other objects in R, have reference semantics. An important
consequence of this is that objects are not copied when modified.
fill in data between recorded versions.

```{r}
original_value <- x$DT$percent_cli[1]
@@ -144,10 +141,7 @@ call (as it did in the case above).

A key method of an `epi_archive` class is `as_of()`, which generates a snapshot
of the archive in `epi_df` format. This represents the most up-to-date values of
the signal variables as of a given version. This can be accessed via `x$as_of()`
for an `epi_archive` object `x`, but the package also provides a simple wrapper
function `epix_as_of()` since this is likely a more familiar interface for users
not familiar with R6 (or object-oriented programming).
the signal variables as of a given version.
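
The snapshot idea can be sketched in base R without `epiprocess`: for each (`geo_value`, `time_value`) pair, keep the row with the latest `version` that does not exceed the requested one. This is an illustrative sketch only; the toy data frame and the helper `snapshot_as_of()` are hypothetical and are not part of the package.

```r
# Toy version history: one location, two reference dates, values revised over time.
archive_dt <- data.frame(
  geo_value = c("ca", "ca", "ca", "ca"),
  time_value = as.Date(c("2021-05-30", "2021-05-30", "2021-05-31", "2021-05-31")),
  version = as.Date(c("2021-05-31", "2021-06-02", "2021-06-01", "2021-06-03")),
  percent_cli = c(2.0, 2.4, 1.8, 2.1)
)

# Hypothetical helper, not an epiprocess function.
snapshot_as_of <- function(dt, max_version) {
  # Drop updates that had not yet been published at max_version.
  known <- dt[dt$version <= max_version, , drop = FALSE]
  # Sort so that the latest version of each observation comes last.
  known <- known[order(known$geo_value, known$time_value, known$version), , drop = FALSE]
  # Keep only the last (most recent) row per geo_value/time_value pair.
  key <- paste(known$geo_value, known$time_value)
  snapshot <- known[!duplicated(key, fromLast = TRUE), c("geo_value", "time_value", "percent_cli")]
  rownames(snapshot) <- NULL
  snapshot
}

snapshot_as_of(archive_dt, as.Date("2021-06-01"))
snapshot_as_of(archive_dt, as.Date("2021-06-02"))
```

In the second call, the value for 2021-05-31 is still the one published on 2021-06-01, since no newer version exists yet as of 2021-06-02; this is the last-observation-carried-forward behavior between recorded versions.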

```{r}
x_snapshot <- epix_as_of(x, max_version = as.Date("2021-06-01"))
@@ -215,9 +209,6 @@ they overestimate it (both states towards the beginning of 2021), though not
quite as dramatically. Modeling the revision process, which is often called
*backfill modeling*, is an important statistical problem in and of itself.

<!-- todo: refer to some project/code? perhaps we should think about writing a
function in `epiprocess` or even a separate package? -->

## Merging `epi_archive` objects

Now we demonstrate how to merge two `epi_archive` objects together, e.g., so
1 change: 1 addition & 0 deletions vignettes/compactify.Rmd
@@ -106,6 +106,7 @@ slide_median <- function(my_ea) {
speeds <- rbind(speeds, speed_test(slide_median, "slide_median"))
```

Here is a detailed performance comparison:

```{r}
13 changes: 5 additions & 8 deletions vignettes/epiprocess.Rmd
@@ -125,7 +125,7 @@ and `time_value` columns, respectively, but inferring the `as_of` field is not
as easy. See the documentation for `as_epi_df()` for more details.

```{r}
x <- as_epi_df(cases) %>%
x <- as_epi_df(cases, as_of = as.Date("2024-03-20")) %>%
select(geo_value, time_value, total_cases = value)
attributes(x)$metadata
@@ -169,7 +169,7 @@ data.frame(
# misnamed
reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
value = seq_along(geo_value) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(geo_value)))
) %>% as_epi_df()
) %>% as_epi_df(as_of = as.Date("2024-03-20"))
```

The columns can be renamed to match `epi_df` format. In the example below, notice there is also an additional key `pol`.
@@ -220,7 +220,7 @@ ex3 <- ex3 %>%
state = rep(tolower("MA"), 6),
pol = rep(c("blue", "swing", "swing"), each = 2)
) %>%
as_epi_df(additional_metadata = list(other_keys = c("state", "pol")))
as_epi_df(additional_metadata = list(other_keys = c("state", "pol")), as_of = as.Date("2024-03-20"))
attr(ex3, "metadata")
```
@@ -256,7 +256,7 @@ cases in Canada in 2003, from the
x <- outbreaks::sars_canada_2003 %>%
mutate(geo_value = "ca") %>%
select(geo_value, time_value = date, starts_with("cases")) %>%
as_epi_df(geo_type = "nation")
as_epi_df(geo_type = "nation", as_of = as.Date("2024-03-20"))
head(x)
@@ -303,7 +303,7 @@ x <- outbreaks::ebola_sierraleone_2014 %>%
filter(cases == 1) %>%
group_by(geo_value, time_value) %>%
summarise(cases = sum(cases)) %>%
as_epi_df(geo_type = "province")
as_epi_df(geo_type = "province", as_of = as.Date("2024-03-20"))
ggplot(x, aes(x = time_value, y = cases)) +
geom_col(aes(fill = geo_value), show.legend = FALSE) +
@@ -312,11 +312,8 @@ ggplot(x, aes(x = time_value, y = cases)) +
labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")
```



## Attribution
This document contains a dataset that is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.

[From the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html):
These signals are taken directly from the JHU CSSE [COVID-19 GitHub repository](https://github.com/CSSEGISandData/COVID-19) without changes.
