CHC Signal #320

Merged (7 commits) on Oct 21, 2020
77 changes: 77 additions & 0 deletions changehc/README.md
@@ -0,0 +1,77 @@
# Change Healthcare Indicator

COVID-19 indicator using outpatient visits from Change Healthcare claims data.
Reads claims data into a pandas dataframe, makes appropriate date shifts,
adjusts for backfill, smooths the estimates, and writes the results to CSVs.
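Schematically, the per-location processing looks something like this toy example (values are made up, the backfill adjustment is omitted, and a plain rolling mean stands in for the module's Gaussian smoother; the real logic lives in `sensor.py` and `update_sensor.py`):

```
import pandas as pd

# Toy daily claims counts for a single county (hypothetical values).
dates = pd.date_range("2020-02-01", periods=10)
total_visits = pd.Series([30, 40, 45, 50, 40, 55, 60, 50, 70, 65], index=dates)
covid_visits = pd.Series([1, 0, 3, 4, 2, 5, 6, 3, 8, 7], index=dates)

rate = 100 * covid_visits / total_visits  # percent of visits that are COVID
shifted = rate.shift(1)                   # shift dates forward for labeling
smoothed = shifted.rolling(7, min_periods=1).mean()  # stand-in smoother
print(smoothed.round(2))
```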


## Running the Indicator

The indicator is run by directly executing the Python module contained in this
directory. The safest way to do this is to create a virtual environment,
install the common DELPHI tools, and then install the module and its
dependencies. To do this, run the following code from this directory:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
```

*Note*: you may need to install BLAS; on Ubuntu, run
```
sudo apt-get install libatlas-base-dev gfortran
```

All of the user-changeable parameters are stored in `params.json`. To execute
the module and produce the output datasets (by default, in `receiving`), run
the following:

```
env/bin/python -m delphi_changehc
```

Once you are finished with the code, you can deactivate the virtual environment
and (optionally) remove the environment itself.

```
deactivate
rm -r env
```

## Testing the code

To do a static test of the code style, it is recommended to run **pylint** on
the module. To do this, run the following from the main module directory:

```
env/bin/pylint delphi_changehc
```

The most aggressive checks are turned off; only relatively important issues
should be raised and they should be manually checked (or better, fixed).

Unit tests are also included in the module. To execute these, run the following
command from this directory:

```
(cd tests && ../env/bin/pytest --cov=delphi_changehc --cov-report=term-missing)
```

The output will show the number of unit tests that passed and failed, along
with the percentage of code covered by the tests. None of the tests should
fail, and the lines not covered by unit tests should be few and should not
include critical sub-routines.
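
As an illustration, a small unit test might look like the following sketch (the file name is hypothetical; the assertions simply restate relationships defined in `config.py`):

```
# tests/test_config_sketch.py -- hypothetical example test
from delphi_changehc.config import Config


def test_filter_columns():
    """The filter columns are exactly the id columns plus the count columns."""
    assert Config.FILT_COLS == Config.ID_COLS + Config.COUNT_COLS
    assert Config.DENOM_COL in Config.COUNT_COLS
```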

## Code tour

- `update_sensor.py`: `CHCSensorUpdator`: reads the data, makes transformations, writes results to file
- `sensor.py`: `CHCSensor`: methods for transforming data, including backfill and smoothing
- `smooth.py`: implements the local linear left Gaussian filter (see the sketch after this list)
- `load_data.py`: methods for loading denominator and covid data
- `config.py`: `Config`: constants for reading data and transformations; `Constants`: constants for sanity checks
- `constants.py`: constants for signal names
- `weekday.py`: `Weekday`: adjusts for the weekday effect
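
To make the smoothing step concrete, below is a minimal sketch of one way a local linear left Gaussian filter can be written. This is our illustration, not the module's `smooth.py`: at each day it fits a weighted linear regression over all observations up to that day, with Gaussian weights that decay with distance into the past (`Config.SMOOTHER_BANDWIDTH = 100` plays the role of `bandwidth`).

```
import numpy as np

def left_gauss_linear_sketch(y, bandwidth=100):
    """Illustrative one-sided (left) Gaussian local linear smoother.

    For each day t, fit a linear regression on observations 0..t with
    weights exp(-(t - i)**2 / bandwidth) and evaluate the fit at t.
    """
    n = len(y)
    out = np.full(n, np.nan)
    X = np.column_stack([np.ones(n), np.arange(n)])
    for t in range(1, n):  # one burn-in day, as in Config.BURN_IN_PERIOD
        w = np.exp(-((t - np.arange(t + 1)) ** 2) / bandwidth)
        XwX = (X[: t + 1].T * w) @ X[: t + 1]
        Xwy = (X[: t + 1].T * w) @ y[: t + 1]
        out[t] = X[t] @ np.linalg.solve(XwX, Xwy)
    return out

print(left_gauss_linear_sketch(np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 8.0])))
```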

39 changes: 39 additions & 0 deletions changehc/REVIEW.md
@@ -0,0 +1,39 @@
## Code Review (Python)

A code review of this module should include a careful look at the code and the
output. To assist in the process, but certainly not in place of it, please
check the following items.

**Documentation**

- [ ] the README.md file template is filled out and currently accurate; it is
possible to load and test the code using only the instructions given
- [ ] minimal docstrings (one line describing what the function does) are
included for all functions; full docstrings describing the inputs and expected
outputs should be given for non-trivial functions

**Structure**

- [ ] code should use 4 spaces for indentation; other style decisions are
flexible, but be consistent within a module
- [ ] any required metadata files are checked into the repository and placed
within the directory `static`
- [ ] any intermediate files that are created and stored by the module should
be placed in the directory `cache`
- [ ] final expected output files to be uploaded to the API are placed in the
`receiving` directory; output files should not be committed to the repository
- [ ] all options and API keys are passed through the file `params.json`
- [ ] template parameter file (`params.json.template`) is checked into the
code; no personal (e.g., usernames) or private (e.g., API keys) information is
included in this template file (a hypothetical sketch follows this list)
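
For illustration only, a minimal template might have roughly the shape below; every field name here is hypothetical, since this module's real parameters are defined by `params.json.template` itself.

```
import json

# Hypothetical shape of params.json; all field names are invented.
params = {
    "export_dir": "./receiving",  # where the output CSVs are written
    "input_denom_file": "",       # path to the denominator .dat file
    "input_covid_file": "",       # path to the covid .dat file
    "drop_date": "",              # data drop date, filled in at runtime
}
print(json.dumps(params, indent=4))
```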

**Testing**

- [ ] module can be installed in a new virtual environment
- [ ] pylint with the default `.pylint` settings run over the module produces
minimal warnings; warnings that do exist have been confirmed as false positives
- [ ] reasonably high level of unit test coverage covering all of the main logic
of the code (e.g., missing coverage for raised errors that do not currently seem
possible to reach is okay; missing coverage for options that will be needed is
not)
- [ ] all unit tests run without errors
Empty file added changehc/cache/.gitignore
Empty file.
19 changes: 19 additions & 0 deletions changehc/delphi_changehc/__init__.py
@@ -0,0 +1,19 @@
# -*- coding: utf-8 -*-
"""Module to pull and clean indicators from the CHC source.

This file defines the functions that are made public by the module. As the
module is intended to be executed through the main method, these are primarily
for testing.
"""

from __future__ import absolute_import

from . import config
from . import load_data
from . import run
from . import sensor
from . import smooth
from . import update_sensor
from . import weekday

__version__ = "0.0.0"
11 changes: 11 additions & 0 deletions changehc/delphi_changehc/__main__.py
@@ -0,0 +1,11 @@
# -*- coding: utf-8 -*-
"""Call the function run_module when executed.

This file indicates that calling the module (`python -m MODULE_NAME`) will
call the function `run_module` found within the run.py file. There should be
no need to change this template.
"""

from .run import run_module # pragma: no cover

run_module() # pragma: no cover
61 changes: 61 additions & 0 deletions changehc/delphi_changehc/config.py
@@ -0,0 +1,61 @@
"""
This file contains configuration variables used to generate the CHC signal.

Author: Aaron Rumack
Created: 2020-10-14
"""

from datetime import datetime, timedelta
import numpy as np


class Config:
    """Static configuration variables."""

    ## dates
    FIRST_DATA_DATE = datetime(2020, 1, 1)

    # number of days training needs to produce estimate
    # (one day needed for smoother to produce values)
    BURN_IN_PERIOD = timedelta(days=1)

    # shift dates forward for labeling purposes
    DAY_SHIFT = timedelta(days=1)

    ## data columns
    COVID_COL = "COVID"
    DENOM_COL = "Denominator"
    COUNT_COLS = ["COVID"] + ["Denominator"]
    DATE_COL = "date"
    GEO_COL = "fips"
    ID_COLS = [DATE_COL] + [GEO_COL]
    FILT_COLS = ID_COLS + COUNT_COLS
    DENOM_COLS = [GEO_COL, DATE_COL, DENOM_COL]
    COVID_COLS = [GEO_COL, DATE_COL, COVID_COL]
    DENOM_DTYPES = {"date": str, "Denominator": str, "fips": str}
    COVID_DTYPES = {"date": str, "COVID": str, "fips": str}

    SMOOTHER_BANDWIDTH = 100  # bandwidth for the linear left Gaussian filter
    MIN_DEN = 100  # number of total visits needed to produce a sensor
    MAX_BACKFILL_WINDOW = (
        7  # maximum number of days used to average a backfill correction
    )
    MIN_CUM_VISITS = 500  # need to observe at least 500 counts before averaging


class Constants:
    """
    Contains the maximum number of geo units for each geo type.
    Used for sanity checks.
    """
    # number of counties in usa, including megacounties

> **Contributor:** can we add a docstring here on what these constants are for, especially since there's also a constants.py file?
>
> **@chinandrew (Oct 20, 2020):** this + linter and should be good to go. EDIT: there's also a few more instances of single vs double quoting, should be a relatively quick search and replace.

    NUM_COUNTIES = 3141 + 52
    NUM_HRRS = 308
    NUM_MSAS = 392 + 52  # MSA + States
    NUM_STATES = 52  # including DC and PR

    MAX_GEO = {"county": NUM_COUNTIES,
               "hrr": NUM_HRRS,
               "msa": NUM_MSAS,
               "state": NUM_STATES}
7 changes: 7 additions & 0 deletions changehc/delphi_changehc/constants.py
@@ -0,0 +1,7 @@
"""Registry for signal names and geo types"""
SMOOTHED = "smoothed_chc"
SMOOTHED_ADJ = "smoothed_adj_chc"
SIGNALS = [SMOOTHED, SMOOTHED_ADJ]
NA = "NA"
HRR = "hrr"
FIPS = "fips"
147 changes: 147 additions & 0 deletions changehc/delphi_changehc/load_data.py
@@ -0,0 +1,147 @@
"""
Load CHC data.

Author: Aaron Rumack
Created: 2020-10-14
"""

# third party
import pandas as pd

# first party
from .config import Config


def load_denom_data(denom_filepath, dropdate, base_geo):
    """Load in and set up denominator data.

    Args:
        denom_filepath: path to the aggregated denominator data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        cleaned denominator dataframe
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    denom_suffix = denom_filepath.split("/")[-1].split(".")[0][9:]
    assert denom_suffix == "All_Outpatients_By_County"
    denom_filetype = denom_filepath.split("/")[-1].split(".")[1]
    assert denom_filetype == "dat"

    denom_data = pd.read_csv(
        denom_filepath,
        sep="|",
        header=None,
        names=Config.DENOM_COLS,
        dtype=Config.DENOM_DTYPES,
    )

    denom_data[Config.DATE_COL] = \
        pd.to_datetime(denom_data[Config.DATE_COL], errors="coerce")

    # restrict to start and end date
    denom_data = denom_data[
        (denom_data[Config.DATE_COL] >= Config.FIRST_DATA_DATE) &
        (denom_data[Config.DATE_COL] < dropdate)
    ]

    # counts between 1 and 3 are coded as "3 or less"; we convert to 1
    # (use .loc rather than chained indexing, which may silently fail to write)
    denom_data.loc[
        denom_data[Config.DENOM_COL] == "3 or less", Config.DENOM_COL
    ] = "1"
    denom_data[Config.DENOM_COL] = denom_data[Config.DENOM_COL].astype(int)

    assert (
        (denom_data[Config.DENOM_COL] >= 0).all()
    ), "Denominator counts must be nonnegative"

    # aggregate age groups (so data is unique by date and base geography)
    denom_data = denom_data.groupby([base_geo, Config.DATE_COL]).sum()
    denom_data.dropna(inplace=True)  # drop rows with any missing entries

    return denom_data

def load_covid_data(covid_filepath, dropdate, base_geo):
    """Load in and set up COVID data.

    Args:
        covid_filepath: path to the aggregated covid data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        cleaned covid dataframe
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    covid_suffix = covid_filepath.split("/")[-1].split(".")[0][9:]
    assert covid_suffix == "Covid_Outpatients_By_County"
    covid_filetype = covid_filepath.split("/")[-1].split(".")[1]
    assert covid_filetype == "dat"

    covid_data = pd.read_csv(
        covid_filepath,
        sep="|",
        header=None,
        names=Config.COVID_COLS,
        dtype=Config.COVID_DTYPES,  # dates are parsed explicitly below
    )

    covid_data[Config.DATE_COL] = \
        pd.to_datetime(covid_data[Config.DATE_COL], errors="coerce")

    # restrict to start and end date
    covid_data = covid_data[
        (covid_data[Config.DATE_COL] >= Config.FIRST_DATA_DATE) &
        (covid_data[Config.DATE_COL] < dropdate)
    ]

    # counts between 1 and 3 are coded as "3 or less"; we convert to 1
    # (use .loc rather than chained indexing, which may silently fail to write)
    covid_data.loc[
        covid_data[Config.COVID_COL] == "3 or less", Config.COVID_COL
    ] = "1"
    covid_data[Config.COVID_COL] = covid_data[Config.COVID_COL].astype(int)

    assert (
        (covid_data[Config.COVID_COL] >= 0).all()
    ), "COVID counts must be nonnegative"

    # aggregate age groups (so data is unique by date and base geography)
    covid_data = covid_data.groupby([base_geo, Config.DATE_COL]).sum()
    covid_data.dropna(inplace=True)  # drop rows with any missing entries

    return covid_data


def load_combined_data(denom_filepath, covid_filepath, dropdate, base_geo):
    """Load in denominator and covid data, and combine them.

    Args:
        denom_filepath: path to the aggregated denominator data
        covid_filepath: path to the aggregated covid data
        dropdate: data drop date (datetime object)
        base_geo: base geographic unit before aggregation ('fips')

    Returns:
        combined multiindexed dataframe, with base_geo at index level 0 and
        date at index level 1
    """
    assert base_geo == "fips", "base unit must be 'fips'"

    # load each data stream
    denom_data = load_denom_data(denom_filepath, dropdate, base_geo)
    covid_data = load_covid_data(covid_filepath, dropdate, base_geo)

    # merge data
    data = denom_data.merge(covid_data, how="outer", left_index=True, right_index=True)
    assert data.isna().all(axis=1).sum() == 0, "entire row is NA after merge"

    # calculate combined numerator and denominator
    data.fillna(0, inplace=True)
    data["num"] = data[Config.COVID_COL]
    data["den"] = data[Config.DENOM_COL]
    data = data[["num", "den"]]

    return data
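
As a usage sketch (the paths below are hypothetical, chosen only to satisfy the filename assertions above, including the nine-character prefix that gets sliced off):

```
from datetime import datetime

from delphi_changehc.load_data import load_combined_data

# Hypothetical inputs; "20201014_" is the 9-character prefix that the
# suffix assertions slice off before comparing.
denom_path = "input/20201014_All_Outpatients_By_County.dat"
covid_path = "input/20201014_Covid_Outpatients_By_County.dat"

data = load_combined_data(denom_path, covid_path,
                          dropdate=datetime(2020, 10, 15), base_geo="fips")
# data is indexed by (fips, date) with "num" and "den" columns
print(data.head())
```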