Add `percent_of_expected_deaths` signal and dry-run mode to NCHS mortality data pipeline #233

jingjtang · 2020-08-20T21:36:15Z

Closes #119

Add the `percent_of_expected_deaths` signal:
- Values are around 1.
- There is no second metric level (num or prop) for this signal; the raw values are reported as-is.

Add a dry-run mode so that the test cases do not depend on the token.

I noticed that NCHS actually updates the dataset every weekday (the `data_as_of` column is always today's date). Maybe we want to run the pipeline every day in order to capture the backfill information; each run takes less than 10 seconds.
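
To illustrate the dry-run idea, here is a minimal sketch, not the code in this PR. The function name `pull_nchs_data`, the `fetch_remote` callable, and the `test_file` parameter are all hypothetical; the only point is that a local fixture lets tests run without any token.

```python
import io

import pandas as pd


def pull_nchs_data(token, fetch_remote, test_file=None):
    """Return the raw mortality table (illustrative sketch only).

    token        -- API token; unused in dry-run mode
    fetch_remote -- hypothetical callable that takes the token and returns records
    test_file    -- path or file-like object holding a local CSV; enables dry-run
    """
    if test_file is not None:
        # Dry-run: read the local fixture and never touch the token.
        return pd.read_csv(test_file)
    return pd.DataFrame.from_records(fetch_remote(token))


def _fail_if_called(token):
    # Guard used below to show that dry-run never reaches the remote source.
    raise AssertionError("remote fetch must not run in dry-run mode")


fixture = io.StringIO(
    "state,timestamp,percent_of_expected_deaths\nPA,2020-08-01,1.02\n"
)
df = pull_nchs_data(token=None, fetch_remote=_fail_if_called, test_file=fixture)
```

With this shape, a test can pass `token=None` and a fixture file, and it will fail loudly if the remote path is ever taken.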

krivard · 2020-08-25T18:27:59Z

Issue to the API weekly, but track daily updates in S3 using the diff-based archive utility (weekly updates would be tracked in S3 anyway). This will require extending the utility to handle this weird case.

krivard · 2020-09-09T19:34:46Z

Separate diff tracking into daily diffs and weekly diffs.
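
A rough sketch of the schedule discussed above: every run archives a daily diff, while only the weekly issue day also archives a weekly diff and exports to the API. The function `plan_run` and its return keys are hypothetical, and treating Monday as the issue day is an assumption (suggested only by the "mon vs non-mon" commit title in this PR).

```python
import datetime as dt

# Assumption: Monday is the weekly issue day (date.weekday() == 0).
ISSUE_WEEKDAY = 0


def plan_run(today):
    """Return which actions a pipeline run on `today` would perform."""
    is_issue_day = today.weekday() == ISSUE_WEEKDAY
    return {
        "archive_daily_diff": True,           # tracked in S3 on every run
        "archive_weekly_diff": is_issue_day,  # tracked in S3 once a week
        "export_to_api": is_issue_day,        # issued to the API once a week
    }


monday_plan = plan_run(dt.date(2020, 10, 12))   # a Monday
tuesday_plan = plan_run(dt.date(2020, 10, 13))  # a Tuesday
```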

dshemetov · 2020-10-07T18:50:02Z

Checking code coverage in the tests:

----------- coverage: platform linux, python 3.8.5-final-0 -----------
Name                                                                                       Stmts   Miss  Cover   Missing
------------------------------------------------------------------------------------------------------------------------
/home/dskel/Code/covidcast-indicators-1/nchs_mortality/delphi_nchs_mortality/pull.py          38      9    76%   46-48, 54-55, 76, 81, 90-91
/home/dskel/Code/covidcast-indicators-1/nchs_mortality/delphi_nchs_mortality/run.py           72      6    92%   37-39, 123, 141-142, 146
------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                        127     16    87%

There are a few `ValueError` checks in pull.py that haven't been tested: lines 54-55, 76, 81, 90-91. It may be worth adding a simple test case for each of them.
The untested run.py lines are mostly archiver cases, which I think we can assume have been tested as part of the archiver util.
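
As a sketch of what such a test could look like: `check_required_columns` below is a hypothetical stand-in for one of the checks in pull.py (the real function names and messages are not shown in this thread), and the test uses `pytest.raises` to confirm the error path.

```python
import pandas as pd
import pytest


def check_required_columns(df, required):
    """Hypothetical stand-in for a sanity check that raises ValueError."""
    missing = sorted(set(required) - set(df.columns))
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


def test_missing_columns_raise():
    # A frame without the 'timestamp' column should trigger the error path.
    df = pd.DataFrame({"state": ["PA"]})
    with pytest.raises(ValueError):
        check_required_columns(df, ["state", "timestamp"])
```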

dshemetov · 2020-10-07T18:56:34Z

Not part of this set of commits, so maybe this belongs to another issue, but I'm wondering if the lines 63-67 in pull.py are necessary. Doesn't groupby automatically group the unique values and make a multi-index?

    state_list = df["state"].unique()
    date_list = df["timestamp"].unique()
    index_df = pd.MultiIndex.from_product(
        [state_list, date_list], names=['state', 'timestamp']
    )
    df = df.groupby(
            ["state", "timestamp"]).sum().reindex(index_df).reset_index()

jingjtang · 2020-10-07T19:00:05Z

> Not part of this set of commits, so maybe this belongs to another issue, but I'm wondering if the lines 63-67 in pull.py are necessary. Doesn't groupby automatically group the unique values and make a multi-index?

This is added in case the values for some states are missing for certain dates.

dshemetov · 2020-10-07T19:03:38Z

> This is added in case the values for some states are missing for certain dates.

I see, so this is to make sure that every state has the same number of reported dates?

jingjtang · 2020-10-07T19:04:31Z

> > This is added in case the values for some states are missing for certain dates.
>
> I see, so this is to make sure that every state has the same number of reported dates?

Yes.
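
A small illustration of the difference, using made-up data (the column `covid_deaths` and the values are for demonstration only): `groupby` alone keeps only the (state, timestamp) pairs that occur in the data, while reindexing on the full product adds the missing pairs as NaN rows.

```python
import pandas as pd

# Toy data: NY has no row for 2020-08-08.
df = pd.DataFrame({
    "state": ["PA", "PA", "NY"],
    "timestamp": ["2020-08-01", "2020-08-08", "2020-08-01"],
    "covid_deaths": [10, 12, 7],
})

# groupby alone: only the three observed (state, timestamp) pairs.
plain = df.groupby(["state", "timestamp"]).sum()

# Full product of states and dates: four pairs, including NY / 2020-08-08.
index_df = pd.MultiIndex.from_product(
    [df["state"].unique(), df["timestamp"].unique()],
    names=["state", "timestamp"],
)
full = df.groupby(["state", "timestamp"]).sum().reindex(index_df).reset_index()

# The pair that was absent from the input is now present with a NaN value.
missing_row = full[(full["state"] == "NY") & (full["timestamp"] == "2020-08-08")]
```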

dshemetov · 2020-10-07T19:05:14Z

Should FIPS here be replaced by state? Line 74 in pull.py

    # each FIPS has same number of rows
    if (len(days_by_states) > 1) or (days_by_states[0] != len(unique_days)):
        raise ValueError("Differing number of days by fips")
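
For context, a sketch of how this check might behave with the wording changed to "state". How `days_by_states` is computed is not shown in this thread, so the `groupby(...).nunique()` line below is an assumption, as is the function name `check_days_per_state`.

```python
import pandas as pd


def check_days_per_state(df):
    """Raise ValueError unless every state covers every unique day."""
    unique_days = df["timestamp"].unique()
    # Assumption: the distinct per-state day counts, one entry per count.
    days_by_states = df.groupby("state")["timestamp"].nunique().unique()
    # each state has same number of rows
    if (len(days_by_states) > 1) or (days_by_states[0] != len(unique_days)):
        raise ValueError("Differing number of days by state")


balanced = pd.DataFrame({
    "state": ["PA", "PA", "NY", "NY"],
    "timestamp": ["2020-08-01", "2020-08-08", "2020-08-01", "2020-08-08"],
})
unbalanced = balanced.iloc[:3]  # NY is missing 2020-08-08

check_days_per_state(balanced)  # passes silently

try:
    check_days_per_state(unbalanced)
    raised = False
except ValueError:
    raised = True
```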

dshemetov

Mostly small linter fixes here.

My only other comment is that the weekly vs daily updates change was tough to understand reading this thread or the code. I think adding some elaboration to the documentation for that would help people coming to this codebase later.

Other than that, all tests pass, so after the linter fixes, I think this PR is good to go.

dshemetov · 2020-10-09T22:34:20Z

nchs_mortality/tests/test_pull.py

+                'pneumonia_deaths', 'pneumonia_and_covid_deaths',
+                'influenza_deaths', 'pneumonia_influenza_or_covid_19_deaths',
+                "timestamp", "geo_id", "population"]).all()


The linter asks for there to be a new line at the end of this file.

dshemetov · 2020-10-09T22:34:36Z

nchs_mortality/tests/test_run.py

@@ -1,49 +1,67 @@
 import pytest

-from os import listdir
+import datetime as dt
+from os import listdir, remove


The linter says the `remove` import is no longer needed.

dshemetov · 2020-10-09T22:34:55Z

nchs_mortality/tests/test_run.py

+            ]
+            metrics = [
+                    'covid_deaths', 'total_deaths', 'pneumonia_deaths',
+                    'pneumonia_and_covid_deaths', 'influenza_deaths', 


Trailing space here.

krivard · 2020-10-12T20:21:37Z

@dshemetov Agree on documentation -- @jingjtang can you add a section to DETAILS.md describing the details of:

- how often we check for updates from NCHS
- how those changes are tracked
- how often we upload to the API
- how that changeset is generated

dshemetov · 2020-10-13T18:56:27Z

@jingjtang See my first comment for code coverage concerns. I think we should try to make sure we have tests that cover all the code lines, so we don't have surprises down the road. In particular:

run.py
- lines 37-39
pull.py
- lines 76, 81, 90-91

If you're absolutely sure those work, I can relent on this issue. Just trying to enforce some testing consistency.

jingjtang · 2020-10-13T19:30:05Z

> @jingjtang See my first comment for code coverage concerns. I think we should try to make sure we have tests that cover all the code lines, so we don't have surprises down the road. In particular:
>
> run.py
> - lines 37-39
>
> pull.py
> - lines 76, 81, 90-91
>
> If you're absolutely sure those work, I can relent on this issue. Just trying to enforce some testing consistency.

I added a new test case for missing columns. As for the others:

- run.py, lines 37-39: these were added for automation usage; I will check with @korlaxxalrok later.
- pull.py: these are just final sanity checks on the preceding code.

dshemetov · 2020-10-23T00:10:34Z

LGTM!

Jingjing Tang and others added 3 commits August 20, 2020 17:30

add dry-run mode and signal

1f462e9

Update DETAILS.md

69bfab2

Update DETAILS.md

6a7acf0

jingjtang mentioned this pull request Aug 21, 2020

Consider adding vintage NCHS mortality data from the CDC #119

Closed

Jingjing Tang added 2 commits August 31, 2020 11:06

Merge branch 'main' into nchs_mortality

e778f62

"Get up-to-date utils"

added code for uploading diffs

aaa04ac

eujing added 3 commits September 16, 2020 12:48

Two archiving, updated tests

6c02d0f

Testing for mon vs non-mon behavior

eafba6b

Added used folders

5e3796b

krivard requested a review from dshemetov September 30, 2020 19:25

jingjtang and others added 2 commits September 30, 2020 16:00

Fixed a spelling error

5ffbaa7

Added missing test directory

ade3feb

dshemetov reviewed Oct 12, 2020

Jingjing Tang added 3 commits October 12, 2020 23:16

added unit tests for bad test files

d78ee3a

fixed linter errors

c164a6d

added description for data versioning

b7ec16d

krivard requested a review from dshemetov October 13, 2020 13:48

Jingjing Tang and others added 2 commits October 13, 2020 11:31

added test data for bad cases

3a2adcb

Clarified data versioning documentation

19ecdbc

updated code for adding a test case where some necessary cols are missing

4dddf0f

Jingjing Tang added 2 commits October 13, 2020 15:21

Merge branch 'nchs_mortality' of https://github.com/cmu-delphi/covidcast-indicators into nchs_mortality

e25d4c4

added data for the new test case

4999ced

dshemetov approved these changes Oct 23, 2020

krivard merged commit 7f07c41 into main Oct 26, 2020

krivard mentioned this pull request Oct 26, 2020

Release NCHS Mortality #367

Closed

9 tasks
