feat: update for epiprocess R6 refactor
* remove references to R6 and mutation
* use epiprocess correctly
* fix the authors section of DESCRIPTION
* upgrade renv
* update all packages in renv
* integrate Rprofile with user Rprofile
dshemetov committed May 1, 2024
1 parent 4c3830c commit 1ac91a2
Showing 6 changed files with 790 additions and 679 deletions.
6 changes: 6 additions & 0 deletions .Rprofile
@@ -1 +1,7 @@
source("renv/activate.R")

# Check if user .Rprofile exists
if (file.exists("~/.Rprofile")) {
  # Source user .Rprofile
  source("~/.Rprofile")
}
11 changes: 6 additions & 5 deletions DESCRIPTION
@@ -2,11 +2,12 @@ Package: delphitoolingbook
Title: Delphi Tooling
Version: 0.0.0.9999
Authors@R: c(
person("Daniel", "McDonald", "J.", "daniel@stat.ubc.ca", role = c("cre", "aut"),
person("Logan", "Brooks", role = c("cre","aut"),
person("Rachel", "Lobay", role = "aut"))
person("Ryan", "Tibshirani", "J.", "ryantibs@berkeley.edu", role = "aut"),
Description:
person("Daniel", "McDonald", "J.", "daniel@stat.ubc.ca", role = c("cre", "aut")),
person("Logan", "Brooks", role = c("cre","aut")),
person("Rachel", "Lobay", role = "aut"),
person("Ryan", "Tibshirani", "J.", "ryantibs@berkeley.edu", role = "aut")
)
Description:
| This book is a longform introduction to analysing and forecasting epidemiological data.
License: MIT + file LICENSE
Imports:
91 changes: 30 additions & 61 deletions archive.qmd
@@ -25,9 +25,8 @@ source("_common.R")

## Getting data into `epi_archive` format

An `epi_archive` object
can be constructed from a data frame, data table, or tibble, provided that it
has (at least) the following columns:
An `epi_archive` object can be constructed from a data frame, data table, or
tibble, provided that it has (at least) the following columns:

* `geo_value`: the geographic value associated with each row of measurements.
* `time_value`: the time value associated with each row of measurements.
@@ -55,10 +54,10 @@ class(x)
print(x)
```

An `epi_archive` is special kind of class called an R6 class. Its primary field
is a data table `DT`, which is of class `data.table` (from the `data.table`
package), and has columns `geo_value`, `time_value`, `version`, as well as any
number of additional columns.
An `epi_archive` is an S3 class. Its primary field is a data table `DT`, which
is of class `data.table` (from the `data.table` package), and has columns
`geo_value`, `time_value`, `version`, as well as any number of additional
columns.
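
The practical effect of moving from R6 to S3 can be sketched with a toy object. This is a minimal illustration, assuming a list-based S3 object that holds a plain `data.frame`; the class name `toy_archive` and its constructor are hypothetical and are not part of `epiprocess`. It does not cover `data.table` operations such as `:=`, which modify a table by reference.

```r
# Hypothetical toy S3 class for illustration; not the real epi_archive.
new_toy_archive <- function(df) {
  structure(list(DT = df), class = "toy_archive")
}

x <- new_toy_archive(
  data.frame(geo_value = c("ca", "fl"), percent_cli = c(1.5, 2.5))
)

y <- x                    # ordinary assignment
y$DT$percent_cli[1] <- 0  # modifying y copies it first (copy-on-modify)

x$DT$percent_cli[1]  # x is unchanged
y$DT$percent_cli[1]  # only y holds the new value
```

With an R6 object the same assignment would have left `x` and `y` pointing at one shared object, which is why the earlier text needed a warning about `clone()`.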

```{r}
class(x$DT)
@@ -70,33 +69,18 @@ for the data table, as well as any other specified in the metadata (described
below). There can only be a single row per unique combination of key variables,
and therefore the key variables are critical for figuring out how to generate a
snapshot of data from the archive, as of a given version (also described below).

```{r, error=TRUE}
key(x$DT)
```

In general, the last version of each observation is carried forward (LOCF) to
fill in data between recorded versions. **A word of caution:** R6 objects,
unlike most other objects in R, have reference semantics. An important
consequence of this is that objects are not copied when modified.

```{r}
original_value <- x$DT$percent_cli[1]
y <- x # This DOES NOT make a copy of x
y$DT$percent_cli[1] = 0
head(y$DT)
head(x$DT)
x$DT$percent_cli[1] <- original_value
```

To make a copy, we can use the `clone()` method for an R6 class, as in `y <-
x$clone()`. You can read more about reference semantics in Hadley Wickham's
[Advanced R](https://adv-r.hadley.nz/r6.html#r6-semantics) book.
In general, the last version of each observation is carried forward (LOCF) to
fill in data between recorded versions.
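
The LOCF rule can be sketched in base R. This is a rough illustration under stated assumptions: the `updates` table and the helper `value_as_of()` are hypothetical, cover a single `geo_value`/`time_value` pair, and are not how `epiprocess` implements snapshots.

```r
# Toy version history for one (geo_value, time_value) pair (made-up values).
updates <- data.frame(
  version = as.Date(c("2021-06-01", "2021-06-04", "2021-06-10")),
  percent_cli = c(2.1, 2.4, 2.3)
)

# Hypothetical helper: the value as of `version` is the most recent update
# recorded at or before that version, carried forward until the next update.
value_as_of <- function(updates, version) {
  seen <- updates[updates$version <= version, ]
  if (nrow(seen) == 0) {
    return(NA_real_)  # nothing had been reported yet
  }
  seen$percent_cli[which.max(as.numeric(seen$version))]
}

# No update was recorded on 2021-06-07, so the 2021-06-04 value is carried forward.
value_as_of(updates, as.Date("2021-06-07"))
```

A version earlier than the first recorded update returns `NA` in this sketch, since there is nothing to carry forward.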

## Some details on metadata

The following pieces of metadata are included as fields in an `epi_archive`
object:
object:

* `geo_type`: the type for the geo values.
* `time_type`: the type for the time values.
@@ -112,10 +96,8 @@ call (as it did in the case above).

A key method of an `epi_archive` class is `as_of()`, which generates a snapshot
of the archive in `epi_df` format. This represents the most up-to-date values of
the signal variables as of a given version. This can be accessed via `x$as_of()`
for an `epi_archive` object `x`, but the package also provides a simple wrapper
function `epix_as_of()` since this is likely a more familiar interface for users
not familiar with R6 (or object-oriented programming).
the signal variables as of a given version. This can be accessed via
`epix_as_of()`.

```{r}
x_snapshot <- epix_as_of(x, max_version = as.Date("2021-06-01"))
@@ -125,7 +107,7 @@ max(x_snapshot$time_value)
attributes(x_snapshot)$metadata$as_of
```

We can see that the max time value in the `epi_df` object `x_snapshot` that was
We can see that the max time value in the `epi_df` object `x_snapshot` that was
generated from the archive is May 29, 2021, even though the specified version
date was June 1, 2021. From this we can infer that the doctor's visits signal
was 2 days latent on June 1. Also, we can see that the metadata in the `epi_df`
@@ -134,7 +116,7 @@ object has the version date recorded in the `as_of` field.
Passing the maximum of the `version` column in the underlying data table of an
`epi_archive` object as `max_version` generates a snapshot of the latest values of signal
variables in the entire archive. The `epix_as_of()` function issues a warning in
this case, since updates to the current version may still come in at a later
this case, since updates to the current version may still come in at a later
point in time, due to various reasons, such as synchronization issues.

```{r}
@@ -143,15 +125,15 @@ x_latest <- epix_as_of(x, max_version = max(x$DT$version))

Below, we pull several snapshots from the archive, spaced one month apart. We
overlay the corresponding signal curves as colored lines, with the version dates
marked by dotted vertical lines, and draw the latest curve in black (from the
marked by dotted vertical lines, and draw the latest curve in black (from the
latest snapshot `x_latest` that the archive can provide).

```{r, fig.width = 8, fig.height = 7}
self_max <- max(x$DT$version)
versions <- seq(as.Date("2020-06-01"), self_max - 1, by = "1 month")
snapshots <- map(
versions,
function(v) {
versions,
function(v) {
epix_as_of(x, max_version = v) %>% mutate(version = v)
}) %>%
list_rbind() %>%
@@ -162,37 +144,35 @@ snapshots <- map(
```{r, fig.height=7}
#| code-fold: true
ggplot(snapshots %>% filter(!latest),
aes(x = time_value, y = percent_cli)) +
geom_line(aes(color = factor(version)), na.rm = TRUE) +
aes(x = time_value, y = percent_cli)) +
geom_line(aes(color = factor(version)), na.rm = TRUE) +
geom_vline(aes(color = factor(version), xintercept = version), lty = 2) +
facet_wrap(~ geo_value, scales = "free_y", ncol = 1) +
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
scale_color_viridis_d(option = "A", end = .9) +
labs(x = "Date", y = "% of doctor's visits with CLI") +
labs(x = "Date", y = "% of doctor's visits with CLI") +
theme(legend.position = "none") +
geom_line(data = snapshots %>% filter(latest),
aes(x = time_value, y = percent_cli),
aes(x = time_value, y = percent_cli),
inherit.aes = FALSE, color = "black", na.rm = TRUE)
```

We can see some interesting and highly nontrivial revision behavior: at some
points in time the provisional data snapshots grossly underestimate the latest
curve (look in particular at Florida close to the end of 2021), and at others
they overestimate it (both states towards the beginning of 2021), though not
they overestimate it (both states towards the beginning of 2021), though not
quite as dramatically. Modeling the revision process, which is often called
*backfill modeling*, is an important statistical problem in and of itself.


## Merging `epi_archive` objects
## Merging `epi_archive` objects

Now we demonstrate how to merge two `epi_archive` objects together, e.g., so
that grabbing data from multiple sources as of a particular version can be
performed with a single `as_of` call. The `epi_archive` class provides a method
`merge()` precisely for this purpose. The wrapper function is called
`epix_merge()`; this wrapper avoids mutating its inputs, while `x$merge` will
mutate `x`. Below we merge the working `epi_archive` of versioned percentage CLI
from outpatient visits to another one of versioned COVID-19 case reporting data,
which we fetch the from the [COVIDcast
performed with a single `as_of` call. The `epiprocess` package provides
`epix_merge()` for this purpose. Below we merge the working `epi_archive` of
versioned percentage CLI from outpatient visits to another one of versioned
COVID-19 case reporting data, which we fetch from the [COVIDcast
API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html/), on the
rate scale (counts per 100,000 people in the population).

@@ -209,7 +189,7 @@ When merging archives, unless the archives have identical data release patterns,
the other).

```{r, message = FALSE, warning = FALSE,eval=FALSE}
# This code is for illustration and doesn't run.
# This code is for illustration and doesn't run.
# The result is saved/loaded in the (hidden) next chunk from `{epidatasets}`
y <- covidcast(
data_source = "jhu-csse",
@@ -224,24 +204,13 @@ y <- covidcast(
select(geo_value, time_value, version = issue, case_rate_7d_av = value) %>%
as_epi_archive(compactify = TRUE)
x$merge(y, sync = "locf", compactify = FALSE)
x <- epix_merge(x, y, sync = "locf", compactify = FALSE)
print(x)
head(x$DT)
```

```{r, echo=FALSE}
x <- archive_cases_dv_subset
print(x)
head(x$DT)
```

Importantly, see that `x$merge` mutated `x` to hold the result of the merge. We
could also have used `xy = epix_merge(x, y)` to avoid mutating `x`. See the
documentation for either for more detailed descriptions of what mutation,
pointer aliasing, and pointer reseating is possible.

## Sliding version-aware computations

::: {.callout-note}
TODO: need a simple example here.
:::
18 changes: 8 additions & 10 deletions epiprocess.qmd
@@ -15,17 +15,17 @@ contains the most up-to-date values of the signals variables, as of a given
time.

By convention, functions in the `epiprocess` package that operate on `epi_df`
objects begin with `epi`. For example:
objects begin with `epi`. For example:

- `epi_slide()`, for iteratively applying a custom computation to a variable in
an `epi_df` object over sliding windows in time;

- `epi_cor()`, for computing lagged correlations between variables in an
`epi_df` object, (allowing for grouping by geo value, time value, or any other
variables).

Functions in the package that operate directly on given variables do not begin
with `epi`. For example:
with `epi`. For example:

- `growth_rate()`, for estimating the growth rate of a given signal at given
time values, using various methodologies;
@@ -35,20 +35,18 @@ Functions in the package that operate directly on given variables do not begin

## `epi_archive`: full version history of a data set

The second main data structure in the package is called
[`epi_archive`]. This is a special class (R6 format)
wrapped around a data table that stores the archive (version history) of some
signal variables of interest.
The second main data structure in the package is called [`epi_archive`]. This is
an S3 class containing a data table that stores the archive (version history) of
some signal variables of interest.

By convention, functions in the `epiprocess` package that operate on
`epi_archive` objects begin with `epix` (the "x" is meant to remind you of
"archive"). These are just wrapper functions around the public methods for the
`epi_archive` R6 class. For example:
"archive"). For example:

- `epix_as_of()`, for generating a snapshot in `epi_df` format from the data
archive, which represents the most up-to-date values of the signal variables,
as of the specified version;

- `epix_fill_through_version()`, for filling in some fake version data following
simple rules, for use when downstream methods expect an archive that is more
up-to-date (e.g., if it is a forecasting deadline date and one of our data