Issue20 #33

Closed
wants to merge 2 commits into from
9 changes: 6 additions & 3 deletions NEWS.md
@@ -1,10 +1,10 @@
## covidData 0.1.3
## covidData 0.1.4

This is the first version of the package with a 0.x release.

### Feature updates
- details on new features will be listed here for future updates
- current key features include `load_jhu_data` and `load_healthdata_data` functions to load versioned counts of cases, deaths, and hospitalizations due to COVID-19
- current key features include `load_data` function to load versioned counts of cases, deaths, and hospitalizations due to COVID-19

### package updates
- details on other changes will be listed here for future updates
@@ -23,4 +23,7 @@ This is the first version of the package with a 0.x release.

### v 0.1.3
- handle errors about SSL certificates expired when pulling data from HealthData.gov
- handle the fact that HealthData.gov posted data with upload date of 12/21 and going through 12/28 (since corrected on their site). We now manually force the issue date (based on the upload date) to be at least as large as the last date in the data file.
- handle the fact that HealthData.gov posted data with upload date of 12/21 and going through 12/28 (since corrected on their site). We now manually force the issue date (based on the upload date) to be at least as large as the last date in the data file.

### v 0.1.4
- create `load_data` function to replace functions `load_jhu_data` and `load_healthdata_data`
280 changes: 257 additions & 23 deletions vignettes/covidData.Rmd
@@ -1,12 +1,15 @@
---
title: "covidData"
author: "Evan Ray"
date: "12/3/2020"
author: "Evan Ray, Ariane Stark"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: html_document
---

<!-- code to run rmarkdown::render(input="./vignettes/covidData.Rmd")
-->

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```


@@ -16,6 +19,16 @@ To use the `covidData` package, you must first do some set-up to install the pac
It is not as straightforward as installing a typical R package.
The latest instructions on how to install the package can be found on the package's [GitHub page](https://github.com/reichlab/covidData/).

## Overview of Package Functionality

This R package provides versioned time series data for COVID-19 cases, deaths, and hospitalizations; the `measure` argument selects which of these to load.

* `issues` is a vector of dates; data that was reported or updated exactly on the specified date(s) is returned
* `as_of` is a vector of dates; the latest data that was reported or updated on or before the specified date(s) is returned
* `spatial_resolution` is the spatial unit: county, state, or national (hospitalization data is not available at the county level)
* `temporal_resolution` is daily or weekly

Examples using these parameters follow.
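
The difference between `issues` and `as_of` can be sketched with a toy table of data versions. This is only an illustration of the selection rules described above, not the package's implementation; the `versions` table and the helper functions `select_issue` and `select_as_of` are hypothetical names introduced here.

```r
# Toy table of data versions; the dates and values are made up for illustration.
versions <- data.frame(
  issue_date = as.Date(c("2020-12-01", "2020-12-03", "2020-12-06")),
  value = c(100, 105, 103)
)

# `issues`-style selection (hypothetical helper): keep only versions
# published exactly on the requested date.
select_issue <- function(versions, issue) {
  versions[versions$issue_date == as.Date(issue), , drop = FALSE]
}

# `as_of`-style selection (hypothetical helper): keep the latest version
# published on or before the requested date.
select_as_of <- function(versions, as_of) {
  eligible <- versions[versions$issue_date <= as.Date(as_of), , drop = FALSE]
  if (nrow(eligible) == 0) {
    return(eligible)
  }
  eligible[which.max(as.numeric(eligible$issue_date)), , drop = FALSE]
}

# Nothing was issued exactly on 2020-12-04, so the exact match is empty,
# while the as-of selection falls back to the 2020-12-03 version.
exact <- select_issue(versions, "2020-12-04")
latest <- select_as_of(versions, "2020-12-04")
```

Under these rules, an `issues` request for a date with no update returns nothing, whereas an `as_of` request for the same date still returns the most recent earlier version.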

## Code to retrieve and plot data

@@ -29,34 +42,64 @@ library(ggplot2)
library(covidData)
```

```{r message=FALSE}
#### Daily incident cases, hospitalizations and deaths at the state and national level

Data shown are the incident cases, deaths, or hospitalizations per day at the state and national levels, as reported on December 2, 2020. A single call to `load_data` retrieves data for one of these measures.

```{r}
# Load incident cases, hospitalizations, and deaths data at state and national level
combined_data <- dplyr::bind_rows(
load_jhu_data(
issue_date = "2020-12-05",
uncombined_cases <- load_data(
issues = "2020-12-02",
spatial_resolution = c("state", "national"),
temporal_resolution = "daily",
measure = "cases"
) %>%
dplyr::mutate(measure = "cases"),
load_jhu_data(
issue_date = "2020-12-05",
)

uncombined_deaths <- load_data(
issues = "2020-12-02",
spatial_resolution = c("state", "national"),
temporal_resolution = "daily",
measure = "deaths"
) %>%
dplyr::mutate(measure = "deaths"),
load_healthdata_data(
issue_date = "2020-12-05",
)

uncombined_hospitalizations <- load_data(
issues = "2020-12-02",
spatial_resolution = c("state", "national"),
temporal_resolution = "daily",
measure = "hospitalizations"
) %>%
)
```

```{r}
# View the separate data frames

tail(uncombined_cases)
tail(uncombined_deaths)
tail(uncombined_hospitalizations)
```


We bind the results of the three separate calls together to create a unified data frame for the plot. In the resulting data frame, the `date` column gives the day the data corresponds to, the `cum` column gives the cumulative count of the measure, and the `inc` column gives the daily or weekly incidence of the measure. The `measure` column states whether the data represents cases, deaths, or hospitalizations. The `location` column in the output from `load_data` represents locations using their FIPS codes, which are alphanumeric codes uniquely identifying locations. More human-readable location names are contained in the `fips_codes` data frame provided by the `covidData` package and are joined with the data set.
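
As a rough sketch of how the `cum` and `inc` columns relate, incidence can be thought of as the change in the cumulative count from one date to the next. The toy values below are made up, and treating `inc` as exactly the difference of `cum` is an assumption for illustration; the package's actual output may differ, for example where the source data was revised.

```r
# Toy cumulative counts for a single location; the values are made up.
toy <- data.frame(
  date = as.Date("2020-12-01") + 0:4,
  cum = c(10, 12, 15, 15, 21)
)

# Incidence as the day-over-day change in the cumulative count.
# The first day has no previous value, so it is left as NA here.
toy$inc <- c(NA, diff(toy$cum))

# Re-accumulating the incidence recovers the cumulative series
# from the second day onward.
rebuilt <- toy$cum[1] + cumsum(toy$inc[-1])
```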

```{r message=FALSE}
# Bind incident cases, hospitalizations, and deaths data at state and national level
combined_data <- dplyr::bind_rows(
uncombined_cases %>%
dplyr::mutate(measure = "cases"),
uncombined_deaths %>%
dplyr::mutate(measure = "deaths"),
uncombined_hospitalizations %>%
dplyr::mutate(measure = "hospitalizations")
)
```

```{r}
head(combined_data)
```



```{r message=FALSE}
# Add more human readable location names,
# set location abbreviation as a factor with US first
combined_data <- combined_data %>%
@@ -69,16 +112,207 @@ combined_data <- combined_data %>%
)
```

```{r}
head(combined_data)
```
The combined results of `load_data` are then passed to a `ggplot` call, producing the graph below, which shows cases, deaths, and hospitalizations nationally and for each state.

```{r message=FALSE, fig.width=10, fig.height=32}
# Plot the data
ggplot(
data = combined_data,
#data = filter(combined_data, abbreviation %in% c("MA", "SD", "TX")),
mapping = aes(x = date, y = inc, color = measure)) +
geom_smooth(se=FALSE, span=.25) +
geom_point(alpha=.2) +
facet_wrap( ~ abbreviation, ncol = 3, scales = "free_y") +
data = combined_data,
# data = filter(combined_data, abbreviation %in% c("MA", "SD", "TX")),
mapping = aes(x = date, y = inc, color = measure)
) +
geom_smooth(se = FALSE, span = .25) +
geom_point(alpha = .2) +
facet_wrap(~abbreviation, ncol = 3, scales = "free_y") +
scale_y_log10() +
# scale_x_date(limits=c(as.Date("2020-07-01"), Sys.Date())) +
theme_bw()
```

#### Daily cumulative deaths for national and select states

Using the same data set as above, we can also look at cumulative deaths at the state and national level. The data set is filtered to include only the entries that correspond to deaths, and then filtered by abbreviation to include only the national data and the state data for Georgia, New York, and Massachusetts. Since we are interested in cumulative deaths, we plot the `cum` column.

```{r message=FALSE}
# Plot cumulative deaths at the state and national level
combined_data %>%
filter(measure == "deaths") %>%
filter(abbreviation %in% c("US", "NY", "GA", "MA")) %>%
ggplot(
mapping = aes(x = date, y = cum)
) +
geom_point(alpha = .2) +
facet_wrap(~abbreviation, ncol = 2, scales = "free_y") +
#scale_y_log10() +
theme_bw()
```

#### County level incident cases and deaths for the 14 counties in Massachusetts

Data shown are the incident cases and deaths per day at the county level, as reported on January 1, 2021. We do not have county-level hospitalization data, so this data set only includes data from JHU CSSE.

Note that JHU does not report data for Nantucket County or Dukes County in Massachusetts, so the plots for these counties will be empty even though they have had some cases.

```{r message=FALSE}
# Load incident cases, and deaths data at county level
county_data <- dplyr::bind_rows(
load_data(
issues = "2021-01-01",
spatial_resolution = "county",
temporal_resolution = "daily",
measure = "cases"
) %>%
dplyr::mutate(measure = "cases"),
load_data(
issues = "2021-01-01",
spatial_resolution = "county",
temporal_resolution = "daily",
measure = "deaths"
) %>%
dplyr::mutate(measure = "deaths")
)
```

```{r message=FALSE}
# Add more human readable location names,
# set location abbreviation as a factor with US first
county_data <- county_data %>%
dplyr::left_join(
covidData::fips_codes,
by = "location"
) %>%
dplyr::mutate(
abbreviation = forcats::fct_relevel(factor(abbreviation), "US")
)
```

All county FIPS codes in Massachusetts consist of the state prefix 25 followed by 3 digits for the county, so the data is filtered to include only counties with FIPS codes between 25000 and 26000.

```{r fig.height=8, fig.width=10, message=FALSE}
# Look at county level data for Massachusetts
county_data %>%
dplyr::filter(location > 25000 & location < 26000) %>% # FIPS codes for MA
ggplot(
mapping = aes(x = date, y = inc, color = measure)
) +
geom_smooth(se = FALSE, span = .25) +
geom_point(alpha = .2) +
facet_wrap(~location_name, ncol = 3, scales = "free_y") +
scale_y_log10() +
theme_bw()
```
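
If the `location` column is stored as character strings (an assumption here, based on FIPS codes being alphanumeric), the numeric range comparison above relies on implicit type coercion. A prefix-based filter is one alternative way to express the same selection; the sketch below uses a made-up `toy_locations` table rather than the package's data.

```r
# Toy location codes stored as character strings; the values are made up
# to include a state code, the national code, and counties in other states.
toy_locations <- data.frame(
  location = c("25001", "25027", "26001", "US", "25", "09001"),
  stringsAsFactors = FALSE
)

# Keep only five-character codes that start with the state prefix "25".
is_ma_county <- nchar(toy_locations$location) == 5 &
  startsWith(toy_locations$location, "25")

ma_counties <- toy_locations[is_ma_county, , drop = FALSE]
```

The length check excludes the two-character state code "25" itself, which shares the prefix but is not a county.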

#### Weekly incident cases, hospitalizations, and deaths for select states

Data shown are the incident cases, deaths, or hospitalizations per week at the state level (national data can be loaded the same way), as reported on December 31, 2020. Using weekly data decreases the noise in the graphs.
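
To see why weekly totals are smoother, daily counts can be summed over 7-day blocks, which averages out day-of-week reporting patterns. The sketch below uses made-up daily values and counts weeks from the first date in the toy data; how `covidData` itself defines week boundaries is not shown in this vignette, so the block assignment here is an assumption.

```r
# Toy daily incidence for 14 days with a made-up reporting dip
# on the first and last day of each 7-day block.
daily <- data.frame(
  date = as.Date("2020-12-06") + 0:13,
  inc = c(5, 20, 22, 21, 19, 23, 8, 6, 24, 25, 22, 26, 27, 9)
)

# Assign each day to a 7-day block counted from the first date
# (an assumed week definition, for illustration only).
daily$week <- as.integer(daily$date - min(daily$date)) %/% 7

# Sum the daily incidence within each block.
weekly <- aggregate(inc ~ week, data = daily, FUN = sum)
```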

```{r message=FALSE}
# Load weekly incident cases, hospitalizations, and deaths data at state level
weekly_data <- dplyr::bind_rows(
load_data(
issues = "2020-12-31",
spatial_resolution = c("state"),
temporal_resolution = "weekly",
measure = "cases"
) %>%
dplyr::mutate(measure = "cases"),
load_data(
issues = "2020-12-31",
spatial_resolution = c("state"),
temporal_resolution = "weekly",
measure = "deaths"
) %>%
dplyr::mutate(measure = "deaths"),
load_data(
issues = "2020-12-31",
spatial_resolution = c("state"),
temporal_resolution = "weekly",
measure = "hospitalizations"
) %>%
dplyr::mutate(measure = "hospitalizations")
)
```


```{r message=FALSE}
# Add more human readable location names,
# set location abbreviation as a factor
weekly_data <- weekly_data %>%
dplyr::left_join(
covidData::fips_codes,
by = "location"
) %>%
dplyr::mutate(
abbreviation = forcats::fct_relevel(factor(abbreviation))
)
```

The weekly data set is filtered to show only the data for Maine, Maryland, Massachusetts, and Michigan in the plot.

```{r message=FALSE, warning=FALSE}
# Plot the data
ggplot(
data = filter(weekly_data, abbreviation %in% c("ME", "MD", "MA", "MI")),
mapping = aes(x = date, y = inc, color = measure)
) +
geom_smooth(se = FALSE, span = .25) +
geom_point(alpha = .2) +
facet_wrap(~location_name, ncol = 2, scales = "free_y") +
scale_y_log10() +
theme_bw()
```

#### View discrepancies in incident deaths reported for New Jersey
On August 2, 2020, New Jersey updated its previously reported incident deaths. We can therefore load the data for different issue dates to see the differing values; the issue dates used are August 1, 2020 and August 2, 2020. The data is loaded via `load_data`, with one data frame per issue date, and filtered to include only the daily deaths for New Jersey. An extra column called `issue_date` is added to each data set, recording the issue date for every entry. The two data sets are then combined into one larger data set used for plotting; the `issue_date` column identifies which issue date each row comes from.

```{r message=FALSE}
# Comparison of incident deaths reported in NJ between
# "2020-08-01" and "2020-08-02"


NJ_issue_day_1 <- load_data(
issues = "2020-08-01",
spatial_resolution = c("state"),
temporal_resolution = "daily",
measure = "deaths"
) %>%
dplyr::left_join(
covidData::fips_codes,
by = "location"
) %>%
dplyr::filter(abbreviation == "NJ") %>%
dplyr::mutate(issue_date = "2020-08-01")

NJ_issue_day_2 <- load_data(
issues = "2020-08-02",
spatial_resolution = c("state"),
temporal_resolution = "daily",
measure = "deaths"
) %>%
dplyr::left_join(
covidData::fips_codes,
by = "location"
) %>%
dplyr::filter(abbreviation == "NJ") %>%
dplyr::mutate(issue_date = "2020-08-02")

NJ_issue_date_comparison <- dplyr::full_join(NJ_issue_day_1, NJ_issue_day_2)
```
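
Besides stacking the two versions for plotting, it can be useful to line them up by date and compute how much each value was revised. The following is a minimal sketch on made-up data; `issue_1`, `issue_2`, and the `revision` column are hypothetical names, not objects from the package.

```r
# Toy versions of the same daily series from two issue dates; values are made up.
issue_1 <- data.frame(
  date = as.Date("2020-07-28") + 0:3,
  inc = c(10, 12, 9, 11)
)
issue_2 <- data.frame(
  date = as.Date("2020-07-28") + 0:4,
  inc = c(10, 15, 9, 14, 8)
)

# Line the two versions up by date, keeping dates present in either version.
side_by_side <- merge(
  issue_1, issue_2,
  by = "date", all = TRUE,
  suffixes = c("_issue_1", "_issue_2")
)

# Revision = later value minus earlier value; NA where a date
# appears in only one version.
side_by_side$revision <- side_by_side$inc_issue_2 - side_by_side$inc_issue_1

# Keep only the dates whose values changed between the two issues.
revised <- side_by_side[
  !is.na(side_by_side$revision) & side_by_side$revision != 0,
]
```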

To illustrate the importance of the issue date, the daily incident deaths in New Jersey are plotted for each of these two consecutive issue dates.


```{r message=FALSE}
# Plot the differences in daily incidence between 2 consecutive issue dates in New Jersey
ggplot(
data = NJ_issue_date_comparison,
mapping = aes(x = date, y = inc, color = issue_date)
) +
geom_smooth(se = FALSE, span = .25) +
geom_point(alpha = .2) +
scale_y_log10() +
#scale_x_date(limits=c(as.Date("2020-07-01"), Sys.Date())) +
theme_bw()
```
485 changes: 453 additions & 32 deletions vignettes/covidData.html

Large diffs are not rendered by default.