Main Epidata API

This endpoint was previously known as COVIDcast.

This is the documentation for accessing Delphi's COVID-19 indicators via the covidcast endpoint of Delphi's epidemiological data API. This API provides data on the spread and impact of the COVID-19 pandemic across the United States, most of which is available at the county level and updated daily. This data powers our public COVIDcast map, which includes testing, case, and death data, as well as unique healthcare and survey data that Delphi acquires through its partners. The API allows users to select specific signals and download data for selected geographical areas, including counties, states, metropolitan statistical areas, and other divisions.

Get updates: Delphi operates a mailing list for users of the COVIDcast API. We will use the list to announce API changes, corrections to data, and new features; API users may also use the mailing list to ask general questions about its use. If you use the API, we strongly encourage you to subscribe.

Licensing

Like all other Delphi Epidata datasets, our COVIDcast data is freely available to the public. However, our COVID-19 indicators include data from many different sources, with data licensing handled separately for each source. For a summary of the licenses used and a list of the indicators each license applies to, see our COVIDcast licensing page. Licensing information is also summarized on each indicator's details page. We encourage academic users to cite the data if they use it in any publications. Our data ingestion code and API server code are open source, and contributions are welcome. Further documentation on Delphi's APIs is available in the API overview.

Accessing the Data

Our COVIDcast site provides an interactive visualization of a select set of the data signals available in the COVIDcast API, and provides a data export feature to download any data range as a CSV file.

Several API clients are available for common programming languages, so you do not need to construct API calls yourself to obtain COVIDcast data. Once you install the appropriate client for your programming language, accessing data takes only a few lines. In R:

library(epidatr)

data <- pub_covidcast('fb-survey', 'smoothed_cli', 'county', 'day', geo_values = '06001',
                     time_values = c(20200401, 20200405:20200414))

or, in Python:

from epidatpy import EpiDataContext, EpiRange

epidata = EpiDataContext()
apicall = epidata.pub_covidcast(
    data_source="fb-survey",
    signals="smoothed_cli",
    geo_type="county",
    time_type="day",
    geo_values="*",
    time_values=EpiRange(20200501, 20200507),
)
data = apicall.df()

The API clients have extensive documentation providing further examples.

Alternatively, to construct URLs and parse responses manually, see the details below.

Data Sources and Signals

The API provides multiple data sources, each with several signals. Each source represents one provider of data, such as a medical testing provider or a symptom survey, and each signal represents one quantity computed from data provided by that source. Our sources provide detailed data about COVID-related topics, including confirmed cases, symptom-related search queries, hospitalizations, and outpatient doctor's visits. Many of these are publicly available only through the COVIDcast API.

Delphi's COVID-19 Surveillance Streams data includes the following data sources. Data from most of these sources is typically updated daily. You can use the covidcast_meta endpoint to get summary information about the ranges of the different attributes for the different data sources.

The API for retrieving data from these sources is described in the COVIDcast endpoint documentation. Changes and corrections to data from this endpoint are listed in the changelog.

To obtain many of these signals and update them daily, Delphi has written extensive software to obtain data from various sources, aggregate the data, calculate statistical estimates, and format the data to be shared through the COVIDcast endpoint of the Delphi Epidata API. This code is open source and available on GitHub, and contributions are welcome.

COVIDcast Dashboard Signals

The following signals are currently displayed on the public COVIDcast dashboard:

| Kind | Name | Source | Signal |
| --- | --- | --- | --- |
| Public Behavior | People Wearing Masks | fb-survey | smoothed_wwearing_mask_7d |
| Public Behavior | Vaccine Acceptance | fb-survey | smoothed_wcovid_vaccinated_appointment_or_accept |
| Public Behavior | COVID Symptom Searches on Google | google-symptoms | sum_anosmia_ageusia_smoothed_search |
| Early Indicators | COVID-Like Symptoms | fb-survey | smoothed_wcli |
| Early Indicators | COVID-Like Symptoms in Community | fb-survey | smoothed_whh_cmnty_cli |
| Early Indicators | COVID-Related Doctor Visits | doctor-visits | smoothed_adj_cli |
| Cases and Testing | COVID Cases | jhu-csse | confirmed_7dav_incidence_prop |
| Late Indicators | COVID Hospital Admissions | hhs | confirmed_admissions_covid_1d_prop_7dav |
| Late Indicators | Deaths | jhu-csse | deaths_7dav_incidence_prop |

All Available Sources and Signals

Beyond the signals available on the COVIDcast dashboard, numerous other signals are available through our data export tool or directly through the API.

Constructing API Queries

The COVIDcast API is based on HTTP GET queries and returns data in JSON form. The base URL is https://api.delphi.cmu.edu/epidata/covidcast/.

See this documentation for details on specifying epiweeks, dates, and lists.

Query Parameters

Required

| Parameter | Description | Type |
| --- | --- | --- |
| data_source | name of upstream data source (e.g., doctor-visits or fb-survey; see full list) | string |
| signal | name of signal derived from upstream data (see notes below) | string |
| time_type | temporal resolution of the signal (e.g., day, week; see date coding details) | string |
| geo_type | spatial resolution of the signal (e.g., county, hrr, msa, dma, state) | string |
| time_values | time unit (e.g., date) over which underlying events happened | list of time values (e.g., 20200401) |
| geo_value | unique code for each location, depending on geo_type (see geographic coding details), or * for all | string |

The current set of signals available for each data source is returned by the covidcast_meta endpoint.
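
As a minimal sketch of how the required parameters combine into a query, the snippet below assembles a URL with Python's standard library. The helper name build_covidcast_url is our own illustration and is not part of any Delphi client; only the base URL and parameter names come from this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.delphi.cmu.edu/epidata/covidcast/"


def build_covidcast_url(data_source, signal, time_type, geo_type, time_values, geo_value):
    """Return a GET URL built from the six required query parameters.

    Illustrative helper only; the name and signature are assumptions.
    """
    query = {
        "data_source": data_source,
        "signal": signal,
        "time_type": time_type,
        "geo_type": geo_type,
        "time_values": time_values,
        "geo_value": geo_value,
    }
    # Leave "*" unescaped so a wildcard geo_value stays readable in the URL.
    return BASE_URL + "?" + urlencode(query, safe="*")


url = build_covidcast_url(
    "fb-survey", "smoothed_cli", "day", "county", "20200406-20200410", "06001"
)
print(url)
```

The resulting string can be passed to any HTTP client; the request itself is left out here so the example runs without network access.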

Alternate Required Parameters

The following parameters help specify multiple source-signal, timetype-timevalue or geotype-geovalue pairs. Use them instead of the usual required parameters.

| Parameter | Replaces | Format | Description | Example |
| --- | --- | --- | --- | --- |
| signal | data_source, signal | signal={source}:{signal1},{signal2} | Specify multiple source-signal pairs, grouped by source | signal=src1:sig1, signal=src1:sig1,sig2, signal=src1:*, signal=src1:sig1;src2:sig3 |
| time | time_type, time_values | time={timetype}:{timevalue1},{timevalue2} | Specify multiple timetype-timevalue pairs, grouped by timetype | time=day:*, time=day:20201201, time=day:20201201,20201202, time=day:20201201-20201204 |
| geo | geo_type, geo_value | geo={geotype}:{geovalue1},{geovalue2} | Specify multiple geotype-geovalue pairs, grouped by geotype | geo=fips:*, geo=fips:04019, geo=fips:04019,19143, geo=fips:04019;msa:40660, geo=fips:*;msa:* |
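
The grouped format above (values joined by commas, groups joined by semicolons) can be sketched as follows. The function join_pairs is a hypothetical helper written for this illustration, not something the API clients provide.

```python
BASE_URL = "https://api.delphi.cmu.edu/epidata/covidcast/"


def join_pairs(groups):
    """Format {"src1": ["sig1", "sig2"], "src2": ["sig3"]} as "src1:sig1,sig2;src2:sig3".

    Hypothetical helper: values within a group are comma-separated and
    groups are semicolon-separated, matching the formats in the table.
    """
    parts = []
    for key, values in groups.items():
        parts.append(key + ":" + ",".join(values))
    return ";".join(parts)


signal_param = join_pairs({"fb-survey": ["smoothed_cli"]})
time_param = join_pairs({"day": ["20200406-20200410"]})
geo_param = join_pairs({"county": ["06001"]})

compound_url = f"{BASE_URL}?signal={signal_param}&time={time_param}&geo={geo_param}"
print(compound_url)
```

The values are inserted without percent-encoding because the characters used here (letters, digits, colon, comma, semicolon, hyphen) are sent literally in the examples on this page; other inputs may need escaping.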

Optional

Estimates for a specific time_value and geo_value are sometimes updated after they are first published. Many of our data sources issue corrections or backfill estimates as data arrives; see the documentation for each source for details.

The default API behavior is to return the most recently issued value for each time_value selected.

We also provide access to previous versions of data using the optional query parameters below.

| Parameter | Description | Type |
| --- | --- | --- |
| as_of | maximum time unit (e.g., date) when the signal data were published (return most recent for each time_value) | time value (e.g., 20200401) |
| issues | time unit (e.g., date) when the signal data were published (return all matching records for each time_value) | list of time values (e.g., 20200401) |
| lag | time delta (e.g. days) between when the underlying events happened and when the data were published | integer |

Use cases:

  • To see the data as it would have appeared had you queried the API on June 1, so that the returned results do not include any updates that became available after June 1, use as_of=20200601.
  • To retrieve only data that was published or updated on June 1, and exclude records whose most recent update occurred earlier than June 1, use issues=20200601.
  • To retrieve all data that was published between May 1 and June 1, and exclude records whose most recent update occurred earlier than May 1, use issues=20200501-20200601. The results will include all matching issues for each time_value, not just the most recent.
  • To retrieve only data that was published or updated exactly 3 days after the underlying events occurred, use lag=3.

You should specify only one of these three parameters in any given query.

Note: Each issue in the versioning system contains only the records added or updated during that time unit; we exclude records whose values remain the same as a previous issue. If you have a research problem that would require knowing when we last confirmed an unchanged value, please get in touch.
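
To make the difference between the default behavior, as_of, issues, and lag concrete, the sketch below applies each selection rule to a small, made-up version history held in memory. It illustrates the semantics described above on the client side and is not the server's implementation; the HISTORY records and all function names are hypothetical.

```python
from datetime import date


def to_date(yyyymmdd):
    """Convert an integer such as 20200601 to a datetime.date."""
    return date(yyyymmdd // 10000, yyyymmdd // 100 % 100, yyyymmdd % 100)


# Hypothetical version history for one location: (time_value, issue, value).
HISTORY = [
    (20200528, 20200529, 1.0),  # first published one day after the event
    (20200528, 20200601, 1.2),  # revised on June 1
    (20200528, 20200603, 1.3),  # revised again on June 3
    (20200530, 20200601, 2.0),
]


def latest(records):
    """Default behavior: keep the most recent issue for each time_value."""
    best = {}
    for time_value, issue, value in records:
        if time_value not in best or issue > best[time_value][1]:
            best[time_value] = (time_value, issue, value)
    return sorted(best.values())


def select_as_of(records, as_of):
    """Most recent issue per time_value, ignoring anything issued after as_of."""
    return latest(r for r in records if r[1] <= as_of)


def select_issues(records, first, last):
    """Every record issued within [first, last], not just the most recent."""
    return [r for r in records if first <= r[1] <= last]


def select_lag(records, lag):
    """Records issued exactly `lag` days after the underlying events."""
    return [r for r in records if (to_date(r[1]) - to_date(r[0])).days == lag]


print(latest(HISTORY))
print(select_as_of(HISTORY, 20200601))
print(select_issues(HISTORY, 20200601, 20200601))
print(select_lag(HISTORY, 1))
```

In this toy history, the June 3 revision appears only in the default view, while as_of=20200601 returns the values as they stood on June 1.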

Response

| Field | Description | Type |
| --- | --- | --- |
| result | result code: 1 = success, 2 = too many results, -2 = no results | integer |
| epidata | list of results, 1 per geo/time pair | array of objects |
| epidata[].source | selected data_source | string |
| epidata[].signal | selected signal | string |
| epidata[].geo_type | selected geo_type | string |
| epidata[].geo_value | location code, depending on geo_type | string |
| epidata[].time_type | selected time_type | string |
| epidata[].time_value | time unit (e.g. date) over which underlying events happened (see date coding details) | integer |
| epidata[].value | value (statistic) derived from the underlying data source | float |
| epidata[].stderr | approximate standard error of the statistic with respect to its sampling distribution, null when not applicable | float |
| epidata[].direction | trend classifier (+1 -> increasing, 0 -> steady or not determined, -1 -> decreasing) | integer |
| epidata[].sample_size | number of "data points" used in computing the statistic, null when not applicable | float |
| epidata[].issue | time unit (e.g. date) when this statistic was published | integer |
| epidata[].lag | time delta (e.g. days) between when the underlying events happened and when this statistic was published | integer |
| epidata[].missing_value | an integer code that is zero when the value field is present and non-zero when the data is missing (see missing codes) | integer |
| epidata[].missing_stderr | an integer code that is zero when the stderr field is present and non-zero when the data is missing (see missing codes) | integer |
| epidata[].missing_sample_size | an integer code that is zero when the sample_size field is present and non-zero when the data is missing (see missing codes) | integer |
| message | success or error message | string |

Note: result code 2, "too many results", means that the number of results you requested was greater than the API's maximum results limit. Results will be returned, but not all of the results you requested. API clients should check the result code and consider breaking up large requests, such as those covering long time intervals, into multiple API calls.
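
One way to act on the result codes is sketched below: a helper that inspects a decoded response, and another that splits a long date range into smaller ranges suitable for separate calls. Both function names are hypothetical, and treating code 2 as an error (rather than accepting the truncated rows) is a design choice made for this example.

```python
from datetime import date, timedelta


def split_time_range(start, end, days_per_chunk):
    """Split an inclusive date range into "YYYYMMDD-YYYYMMDD" strings."""
    chunks = []
    step = timedelta(days=days_per_chunk)
    one_day = timedelta(days=1)
    while start <= end:
        stop = min(start + step - one_day, end)
        chunks.append(f"{start:%Y%m%d}-{stop:%Y%m%d}")
        start = stop + one_day
    return chunks


def rows_or_raise(payload):
    """Return the epidata rows, raising if the result code signals a problem."""
    result = payload.get("result")
    if result == 1:
        return payload["epidata"]
    if result == -2:
        return []  # no results for this query
    if result == 2:
        # Rows were returned but truncated; refuse them so the caller retries
        # with smaller requests instead of silently using partial data.
        raise RuntimeError("too many results: split the request into smaller queries")
    raise RuntimeError(f"unexpected result code {result}: {payload.get('message')}")


print(split_time_range(date(2020, 4, 1), date(2020, 4, 10), 4))
```

Each string returned by split_time_range can be used as the time_values parameter of one request.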

Alternative Response Formats

In addition to the default EpiData Response format, users can customize the response format using the format= parameter.

JSON List Response

With format=json, the API returns a plain list of the epidata response objects without the result and message wrapper. The status of the query is returned via HTTP status codes. For example, a status code of 200 means the query succeeded, while 400 indicates that the query has a missing, misspelled, or otherwise invalid parameter. For all status codes other than 200, the returned JSON includes details about what part of the query couldn't be interpreted.

CSV File Response

With format=csv, the API returns a CSV file with the same columns as the response objects. HTTP status codes are used to communicate success or failure, as with format=json.
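
A CSV response can be read with Python's csv module, as in the rough sketch below. The sample body is invented for illustration and assumes the file begins with a header row naming the columns; check an actual response before relying on that.

```python
import csv
import io

# Hypothetical CSV body with a subset of columns; a header row is assumed.
csv_text = "geo_value,time_value,value\n06001,20200407,1.129\n"

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# The csv module yields strings, so numeric columns need explicit conversion.
csv_values = [float(row["value"]) for row in csv_rows]
print(csv_rows[0]["geo_value"], csv_values)
```

Keeping geo_value as a string preserves leading zeros in codes such as 06001.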

JSON New Lines Response

With format=jsonl, the API returns each row as a separate JSON object, with rows separated by a single newline character \n. This format is useful for incremental streaming of the results. As with the JSON list response, HTTP status codes are used to communicate the status of the query.
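
Because each line is an independent JSON object, rows can be decoded one at a time as they arrive. The sketch below parses a small invented body; iter_jsonl is a hypothetical helper, and in practice the lines would come from a streamed HTTP response rather than a string.

```python
import json


def iter_jsonl(lines):
    """Yield one decoded row per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-row body in the newline-delimited format.
jsonl_body = '{"geo_value": "06001", "value": 1.1}\n{"geo_value": "06003", "value": 0.9}\n'

jsonl_rows = list(iter_jsonl(jsonl_body.splitlines()))
print(jsonl_rows)
```

Since iter_jsonl is a generator, a caller can process each row and discard it without holding the full result set in memory.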

Limit Returned Fields

The fields parameter can be used to limit which fields are included in each returned row. This is useful in web applications to reduce the amount of data transmitted. The fields parameter supports two syntaxes: allow and deny. Using allowlist syntax, only the listed fields will be returned. For example, fields=geo_value,value will drop all fields from the returned data except for geo_value and value. To use denylist syntax instead, prefix each field name with a dash (-) to exclude it from the results. For example, fields=-direction will include all fields in the returned data except for the direction field.
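
The two syntaxes can be sketched as a small helper that builds the fields value, plus a client-side emulation showing the effect on a single row. Both functions are hypothetical, and refusing to mix allowlist and denylist entries is a conservative assumption rather than documented API behavior.

```python
def fields_param(include=(), exclude=()):
    """Build a value for the fields parameter from an allowlist or a denylist."""
    if include and exclude:
        # Assumption: this page does not describe mixing the two syntaxes.
        raise ValueError("use either include or exclude, not both")
    if include:
        return ",".join(include)
    return ",".join(f"-{name}" for name in exclude)


def apply_fields(row, spec):
    """Emulate locally what a fields spec does to one returned row."""
    names = spec.split(",")
    if names[0].startswith("-"):
        dropped = {name[1:] for name in names}
        return {key: val for key, val in row.items() if key not in dropped}
    return {key: row[key] for key in names if key in row}


sample_row = {"geo_value": "06001", "value": 1.0, "direction": None}
print(fields_param(include=["geo_value", "value"]), apply_fields(sample_row, "geo_value,value"))
print(fields_param(exclude=["direction"]), apply_fields(sample_row, "-direction"))
```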

Example URLs

Facebook Survey CLI on 2020-04-06 to 2020-04-10 (county 06001)

https://api.delphi.cmu.edu/epidata/covidcast/?data_source=fb-survey&signal=smoothed_cli&time_type=day&geo_type=county&time_values=20200406-20200410&geo_value=06001

or

https://api.delphi.cmu.edu/epidata/covidcast/?signal=fb-survey:smoothed_cli&time=day:20200406-20200410&geo=county:06001

These two URLs are equivalent and return the following result:

{
  "result": 1,
  "epidata": [
    {
      "geo_value": "06001",
      "time_value": 20200407,
      "direction": null,
      "value": 1.1293550689064,
      "stderr": 0.53185454111042,
      "sample_size": 281.0245
    },
    ...
  ],
  "message": "success"
}

Facebook Survey CLI on 2020-04-06 (all counties)

https://api.delphi.cmu.edu/epidata/covidcast/?data_source=fb-survey&signal=smoothed_cli&time_type=day&geo_type=county&time_values=20200406&geo_value=*

{
  "result": 1,
  "epidata": [
    {
      "geo_value": "01000",
      "time_value": 20200406,
      "direction": null,
      "value": 1.1693378,
      "stderr": 0.1909232,
      "sample_size": 1451.0327
    },
    ...
  ],
  "message": "success"
}
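
A response in the default format can be consumed as in the sketch below, which decodes a payload containing the single record shown in the all-counties example above and converts its integer time_value into a calendar date. The variable names are our own, and the payload is embedded as a string so the example runs without a network request.

```python
import json
from datetime import datetime

# The one record shown in the example above, wrapped in the default response format.
response_text = """
{
  "result": 1,
  "epidata": [
    {"geo_value": "01000", "time_value": 20200406, "direction": null,
     "value": 1.1693378, "stderr": 0.1909232, "sample_size": 1451.0327}
  ],
  "message": "success"
}
"""

payload = json.loads(response_text)

parsed = []
if payload["result"] == 1:
    for row in payload["epidata"]:
        # time_value is an integer in YYYYMMDD form for daily signals.
        day = datetime.strptime(str(row["time_value"]), "%Y%m%d").date()
        parsed.append((row["geo_value"], day, row["value"]))

print(parsed)
```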