feat: add vignette snapshots

cmu-delphi · Mar 21, 2024 · commit 6f1891d
1 parent e74b7e7

Showing 13 changed files with 4,996 additions and 50 deletions.
diff --git a/tests/testthat/_snaps/vignette-snapshot/advanced.html b/tests/testthat/_snaps/vignette-snapshot/advanced.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/advanced.new.html b/tests/testthat/_snaps/vignette-snapshot/advanced.new.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/aggregation.html b/tests/testthat/_snaps/vignette-snapshot/aggregation.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/archive.html b/tests/testthat/_snaps/vignette-snapshot/archive.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/archive.new.html b/tests/testthat/_snaps/vignette-snapshot/archive.new.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/epiprocess.html b/tests/testthat/_snaps/vignette-snapshot/epiprocess.html
diff --git a/tests/testthat/_snaps/vignette-snapshot/slide.html b/tests/testthat/_snaps/vignette-snapshot/slide.html
diff --git a/tests/testthat/test-vignette-snapshot.R b/tests/testthat/test-vignette-snapshot.R
@@ -0,0 +1,18 @@
+# Vignettes that use epi_archives or epi_dfs.
+vignettes <- paste0(here::here("vignettes/"), c(
+  "advanced.Rmd",
+  "aggregation.Rmd",
+  "archive.Rmd",
+  "epiprocess.Rmd",
+  "slide.Rmd"
+))
+for (input_file in vignettes) {
+  test_that(paste0("snapshot vignette ", basename(input_file)), {
+    # skip("Skipping snapshot tests by default, as they are slow.")
+    output_file <- sub("\\.Rmd$", ".html", input_file)
+    withr::with_file(output_file, {
+      devtools::build_rmd(input_file)
+      expect_snapshot_file(output_file)
+    })
+  })
+}
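
Note on the test above: each vignette path is mapped to its rendered `.html` output with a regular-expression substitution, and `expect_snapshot_file()` then compares that file with the stored copy under `tests/testthat/_snaps/vignette-snapshot/`. A minimal base-R sketch of only the path mapping follows; the relative `vignettes` directory is an illustrative stand-in for `here::here("vignettes/")`, not what the test uses.

```r
# Stand-in for the vignette list in the test; the relative directory is illustrative.
vignette_names <- c("advanced.Rmd", "aggregation.Rmd", "slide.Rmd")
input_files <- file.path("vignettes", vignette_names)

# Same substitution as in the test: swap a trailing ".Rmd" for ".html".
output_files <- sub("\\.Rmd$", ".html", input_files)

# By default the snapshot is stored under the output file's base name.
snapshot_names <- basename(output_files)
print(snapshot_names)
```

When a rendered file differs from the stored snapshot, testthat writes a `.new` variant next to it, which may be why `advanced.new.html` and `archive.new.html` appear among the changed files.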
diff --git a/vignettes/advanced.Rmd b/vignettes/advanced.Rmd
@@ -9,14 +9,13 @@ vignette: >
 
 In this vignette, we discuss how to use the sliding functionality in the
 `epiprocess` package with less common grouping schemes or with computations that
-have advanced output structures.
-The output of a slide computation should either be an atomic value/vector, or a
-data frame. This data frame can have multiple columns, multiple rows, or both.
+have advanced output structures. The output of a slide computation should either
+be an atomic value/vector, or a data frame. This data frame can have multiple
+columns, multiple rows, or both.
 
 During basic usage (e.g., when all optional arguments are set to their defaults):
 
 * `epi_slide(edf, <computation>, .....)`:
-
   * keeps **all** columns of `edf`, adds computed column(s)
   * outputs **one row per row in `edf`** (recycling outputs from
     computations appropriately if there are multiple time series bundled
@@ -26,9 +25,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
     `dplyr::arrange(time_value, .by_group = TRUE)`**
   * outputs an **`epi_df`** if the required columns are present, otherwise a
     tibble
-
 * `epix_slide(ea, <computation>, .....)`:
-
   * keeps **grouping and `time_value`** columns of `ea`, adds computed
     column(s)
   * outputs **any number of rows** (computations are allowed to output any
@@ -40,6 +37,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
   * outputs a **tibble**
 
 These differences in basic behavior make some common slide operations require less boilerplate:
+
 * predictors and targets calculated with `epi_slide` are automatically lined up
   with each other and with the signals from which they were calculated; and
 * computations for an `epix_slide` can output data frames with any number of
@@ -84,13 +82,14 @@ simple synthetic example.
 ```{r message = FALSE}
 library(epiprocess)
 library(dplyr)
+set.seed(123)
 
 edf <- tibble(
   geo_value = rep(c("ca", "fl", "pa"), each = 3),
   time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
   x = seq_along(geo_value) + 0.01 * rnorm(length(geo_value)),
 ) %>%
-  as_epi_df()
+  as_epi_df(as_of = as.Date("2024-03-20"))
 
 # 2-day trailing average, per geo value
 edf %>%
@@ -338,7 +337,7 @@ library(data.table)
 library(ggplot2)
 theme_set(theme_bw())
 
-x <- archive_cases_dv_subset_2$DT %>%
+x <- archive_cases_dv_subset$DT %>%
   filter(geo_value %in% c("ca", "fl")) %>%
   as_epi_archive(compactify = FALSE)
 ```
@@ -525,10 +524,7 @@ separate ARX model on each state. As in the archive vignette, we can see a
 difference between version-aware (right column) and -unaware (left column)
 forecasting, as well.
 
-
 ## Attribution
 The `case_rate_7d_av` data used in this document is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science and Engineering. Copyright Johns Hopkins University 2020.
 
 The `percent_cli` data is a modified part of the [COVIDcast Epidata API Doctor Visits data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html). This dataset is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/). Copyright Delphi Research Group at Carnegie Mellon University 2020.
-
-
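
Note on the `set.seed(123)` added to the synthetic example: without a fixed seed, `rnorm()` would give different `x` values on every build, so the rendered HTML could never match a stored snapshot. A small base-R sketch of the effect follows; the helper `make_x()` is hypothetical and only mirrors the shape of the vignette's `x` column.

```r
# Hypothetical helper mirroring `x = seq_along(geo_value) + 0.01 * rnorm(...)`.
make_x <- function(n) seq_len(n) + 0.01 * rnorm(n)

# Two "builds" that each start from the same seed.
set.seed(123)
first_build <- make_x(9)
set.seed(123)
second_build <- make_x(9)

# Identical random draws mean identical rendered output for this chunk.
print(identical(first_build, second_build)) # TRUE
```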
diff --git a/vignettes/aggregation.Rmd b/vignettes/aggregation.Rmd
@@ -34,7 +34,7 @@ x <- pub_covidcast(
 ) %>%
   select(geo_value, time_value, cases = value) %>%
   full_join(y, by = "geo_value") %>%
-  as_epi_df()
+  as_epi_df(as_of = as.Date("2024-03-20"))
 ```
 
 The data contains 16,212 rows and 5 columns.
@@ -192,7 +192,7 @@ running `epi_slide()` on the zero-filled data brings these trailing averages
 
 ```{r}
 xt %>%
-  as_epi_df() %>%
+  as_epi_df(as_of = as.Date("2024-03-20")) %>%
   group_by(geo_value) %>%
   epi_slide(cases_7dav = mean(cases), before = 6) %>%
   ungroup() %>%
@@ -203,7 +203,7 @@ xt %>%
   print(n = 7)
 
 xt_filled %>%
-  as_epi_df() %>%
+  as_epi_df(as_of = as.Date("2024-03-20")) %>%
   group_by(geo_value) %>%
   epi_slide(cases_7dav = mean(cases), before = 6) %>%
   ungroup() %>%

diff --git a/vignettes/archive.Rmd b/vignettes/archive.Rmd
@@ -7,16 +7,16 @@ vignette: >
   %\VignetteEncoding{UTF-8}
 ---
 
-In addition to the `epi_df` data structure, which we have been working with all
-along in these vignettes, the `epiprocess` package has a companion structure
-called `epi_archive`. In comparison to an `epi_df` object, which can be seen as
-storing a single snapshot of a data set with the most up-to-date signal values
-as of some given time, an `epi_archive` object stores the full version history
-of a data set. Many signals of interest for epidemiological tracking are subject
-to revision (some more than others), and paying attention to data revisions can
-be important for all sorts of downstream data analysis and modeling tasks.
-
-This vignette walks through working with `epi_archive` objects and demonstrates
+In addition to the `epi_df` data structure, the `epiprocess` package has a
+companion structure called `epi_archive`. In comparison to an `epi_df` object,
+which can be seen as storing a single snapshot of a data set with the most
+up-to-date signal values as of some given time, an `epi_archive` object stores
+the full version history of a data set. Many signals of interest for
+epidemiological tracking are subject to revision (some more than others) and
+paying attention to data revisions can be important for all sorts of downstream
+data analysis and modeling tasks.
+
+This vignette walks through working with `epi_archive()` objects and demonstrates
 some of their key functionality. We'll work with a signal on the percentage of
 doctor's visits with CLI (COVID-like illness) computed from medical insurance
 claims, available through the [COVIDcast
@@ -55,9 +55,8 @@ library(ggplot2)
 
 ## Getting data into `epi_archive` format
 
-An <code><a href="../reference/epi_archive.html">epi_archive</a></code> object
-can be constructed from a data frame, data table, or tibble, provided that it
-has (at least) the following columns:
+An `epi_archive()` object can be constructed from a data frame, data table, or
+tibble, provided that it has (at least) the following columns:
 
 * `geo_value`: the geographic value associated with each row of measurements.
 * `time_value`: the time value associated with each row of measurements.
@@ -71,7 +70,7 @@ As we can see from the above, the data frame returned by
 format, with `issue` playing the role of `version`. We can now use
 `as_epi_archive()` to bring it into `epi_archive` format. For removal of
 redundant version updates in `as_epi_archive` using compactify, please refer to
-the compactify vignette.
+the [compactify vignette](articles/compactify.html).
 
 ```{r, eval=FALSE}
 x <- dv %>%
@@ -91,10 +90,10 @@ class(x)
 print(x)
 ```
 
-An `epi_archive` is special kind of class called an R6 class. Its primary field
-is a data table `DT`, which is of class `data.table` (from the `data.table`
-package), and has columns `geo_value`, `time_value`, `version`, as well as any
-number of additional columns.
+An `epi_archive` consists of a primary field `DT`, which is a data table
+(from the `data.table` package) that has the columns `geo_value`, `time_value`,
+`version` (and possibly additional ones), and other metadata fields, such as
+`geo_type` and `time_type`.
 
 ```{r}
 class(x$DT)
@@ -112,9 +111,7 @@ key(x$DT)
 ```
 
 In general, the last version of each observation is carried forward (LOCF) to
-fill in data between recorded versions. **A word of caution:** R6 objects,
-unlike most other objects in R, have reference semantics. An important
-consequence of this is that objects are not copied when modified.
+fill in data between recorded versions.
 
 ```{r}
 original_value <- x$DT$percent_cli[1]
@@ -144,10 +141,7 @@ call (as it did in the case above).
 
 A key method of an `epi_archive` class is `as_of()`, which generates a snapshot
 of the archive in `epi_df` format. This represents the most up-to-date values of
-the signal variables as of a given version. This can be accessed via `x$as_of()`
-for an `epi_archive` object `x`, but the package also provides a simple wrapper
-function `epix_as_of()` since this is likely a more familiar interface for users
-not familiar with R6 (or object-oriented programming).
+the signal variables as of a given version.
 
 ```{r}
 x_snapshot <- epix_as_of(x, max_version = as.Date("2021-06-01"))
@@ -215,9 +209,6 @@ they overestimate it (both states towards the beginning of 2021), though not
 quite as dramatically. Modeling the revision process, which is often called
 *backfill modeling*, is an important statistical problem in and of itself.
 
-<!-- todo: refer to some project/code? perhaps we should think about writing a
- function in `epiprocess` or even a separate package? -->
-
 ## Merging `epi_archive` objects
 
 Now we demonstrate how to merge two `epi_archive` objects together, e.g., so
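
Note on the `as_of()` passage in the archive vignette above: a snapshot keeps, for each `geo_value` and `time_value`, the most recent update whose `version` is no later than the requested one. The base-R sketch below illustrates that idea on made-up data. It is not the `epiprocess` implementation, and `snapshot_as_of()` and the values in `updates` are hypothetical.

```r
# Made-up version history: one row per (geo_value, time_value, version) update.
updates <- data.frame(
  geo_value = c("ca", "ca", "ca", "fl"),
  time_value = as.Date(c("2021-05-30", "2021-05-30", "2021-05-31", "2021-05-30")),
  version = as.Date(c("2021-05-31", "2021-06-02", "2021-06-01", "2021-06-01")),
  percent_cli = c(1.0, 1.2, 0.9, 2.0)
)

# Hypothetical helper: latest update per observation, ignoring later versions.
snapshot_as_of <- function(updates, max_version) {
  visible <- updates[updates$version <= max_version, ]
  visible <- visible[order(visible$geo_value, visible$time_value, visible$version), ]
  # After sorting, the last row of each (geo_value, time_value) group is the newest.
  is_latest <- !duplicated(visible[c("geo_value", "time_value")], fromLast = TRUE)
  out <- visible[is_latest, c("geo_value", "time_value", "percent_cli")]
  rownames(out) <- NULL
  out
}

snap_early <- snapshot_as_of(updates, as.Date("2021-06-01"))
snap_late <- snapshot_as_of(updates, as.Date("2021-06-02"))
print(snap_early)
print(snap_late)
```

In the earlier snapshot the `ca` value for 2021-05-30 is still the first report; the later snapshot picks up its revision.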

diff --git a/vignettes/compactify.Rmd b/vignettes/compactify.Rmd
@@ -106,6 +106,7 @@ slide_median <- function(my_ea) {
 
 speeds <- rbind(speeds, speed_test(slide_median, "slide_median"))
 ```
+
 Here is a detailed performance comparison:
 
 ```{r}

diff --git a/vignettes/epiprocess.Rmd b/vignettes/epiprocess.Rmd
@@ -125,7 +125,7 @@ and `time_value` columns, respectively, but inferring the `as_of` field is not
 as easy. See the documentation for `as_epi_df()` for more details.
 
 ```{r}
-x <- as_epi_df(cases) %>%
+x <- as_epi_df(cases, as_of = as.Date("2024-03-20")) %>%
   select(geo_value, time_value, total_cases = value)
 
 attributes(x)$metadata
@@ -169,7 +169,7 @@ data.frame(
   # misnamed
   reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
   value = seq_along(geo_value) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(geo_value)))
-) %>% as_epi_df()
+) %>% as_epi_df(as_of = as.Date("2024-03-20"))
 ```
 
 The columns can be renamed to match `epi_df` format. In the example below, notice there is also an additional key `pol`.
@@ -220,7 +220,7 @@ ex3 <- ex3 %>%
     state = rep(tolower("MA"), 6),
     pol = rep(c("blue", "swing", "swing"), each = 2)
   ) %>%
-  as_epi_df(additional_metadata = list(other_keys = c("state", "pol")))
+  as_epi_df(additional_metadata = list(other_keys = c("state", "pol")), as_of = as.Date("2024-03-20"))
 
 attr(ex3, "metadata")
 ```
@@ -256,7 +256,7 @@ cases in Canada in 2003, from the
 x <- outbreaks::sars_canada_2003 %>%
   mutate(geo_value = "ca") %>%
   select(geo_value, time_value = date, starts_with("cases")) %>%
-  as_epi_df(geo_type = "nation")
+  as_epi_df(geo_type = "nation", as_of = as.Date("2024-03-20"))
 
 head(x)
 
@@ -303,7 +303,7 @@ x <- outbreaks::ebola_sierraleone_2014 %>%
   filter(cases == 1) %>%
   group_by(geo_value, time_value) %>%
   summarise(cases = sum(cases)) %>%
-  as_epi_df(geo_type = "province")
+  as_epi_df(geo_type = "province", as_of = as.Date("2024-03-20"))
 
 ggplot(x, aes(x = time_value, y = cases)) +
   geom_col(aes(fill = geo_value), show.legend = FALSE) +
@@ -312,11 +312,8 @@ ggplot(x, aes(x = time_value, y = cases)) +
   labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")
 ```
 
-
-
 ## Attribution
 This document contains a dataset that is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science and Engineering. Copyright Johns Hopkins University 2020.
 
 [From the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html):
  These signals are taken directly from the JHU CSSE [COVID-19 GitHub repository](https://github.com/CSSEGISandData/COVID-19) without changes.
-
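
Note on the repeated `as_of = as.Date("2024-03-20")` changes: they appear to serve the same purpose as a fixed seed, namely that metadata printed in a vignette should not depend on when the vignette is built. The sketch below is not `epiprocess` code. `make_metadata()` is a hypothetical stand-in, and it assumes the timestamp falls back to the current time when none is supplied.

```r
# Hypothetical stand-in for metadata whose timestamp defaults to "now".
make_metadata <- function(as_of = Sys.time()) {
  list(geo_type = "state", time_type = "day", as_of = as_of)
}

# Unpinned: two builds at different moments record different timestamps.
unpinned_a <- make_metadata()
Sys.sleep(0.05)
unpinned_b <- make_metadata()

# Pinned: the recorded timestamp is the same no matter when the code runs.
pinned_a <- make_metadata(as_of = as.Date("2024-03-20"))
pinned_b <- make_metadata(as_of = as.Date("2024-03-20"))

print(identical(unpinned_a, unpinned_b))
print(identical(pinned_a, pinned_b))
```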