Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add percent_of_expected_deaths signal and dry-run mode to NCHS mortality data pipeline #233

Merged
merged 18 commits into from
Oct 26, 2020

Conversation

jingjtang
Copy link
Contributor

@jingjtang jingjtang commented Aug 20, 2020

Closes #119

  • Add percent_of_expected_deaths signal
    • values are around 1
    • no metric level 2 (num or prop) for this signal, just report the raw values
  • Add a dry-run mode, so that the test cases are independent from the token

I noticed that they actually update the dataset every weekday. (The column data_as_of is always the date of today) Maybe we want to run the pipeline everyday in order to get the backfill info.(The pipeline will run for only less than 10 seconds every time)

@krivard
Copy link
Contributor

krivard commented Aug 25, 2020

Issue to the API weekly, but track daily updates in S3 using the diff-based archive utility (weekly updates would be tracked in S3 anyway). This will require extending the utility to handle this weird case.

@krivard
Copy link
Contributor

krivard commented Sep 9, 2020

Separate diff tracking into daily diffs and weekly diffs.

@krivard krivard requested a review from dshemetov September 30, 2020 19:25
@dshemetov
Copy link
Contributor

Checking code coverage in the tests:

----------- coverage: platform linux, python 3.8.5-final-0 -----------
Name                                                                                       Stmts   Miss  Cover   Missing
------------------------------------------------------------------------------------------------------------------------
/home/dskel/Code/covidcast-indicators-1/nchs_mortality/delphi_nchs_mortality/pull.py          38      9    76%   46-48, 54-55, 76, 81, 90-91
/home/dskel/Code/covidcast-indicators-1/nchs_mortality/delphi_nchs_mortality/run.py           72      6    92%   37-39, 123, 141-142, 146
------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                        127     16    87%
  • There are a few ValueExceptions in pull.py that haven't been tested: 54-55, 76, 81, 90-91. It may be worth it to add a simple test case for each of those.
  • The run.py lines are mostly archiver cases, which I think we can assume to have been tested as part of the archiver util.

@dshemetov
Copy link
Contributor

Not part of this set of commits, so maybe this belongs to another issue, but I'm wondering if the lines 63-67 in pull.py are necessary. Doesn't groupby automatically group the unique values and make a multi-index?

    state_list = df["state"].unique()
    date_list = df["timestamp"].unique()
    index_df = pd.MultiIndex.from_product(
        [state_list, date_list], names=['state', 'timestamp']
    )
    df = df.groupby(
            ["state", "timestamp"]).sum().reindex(index_df).reset_index()

@jingjtang
Copy link
Contributor Author

Not part of this set of commits, so maybe this belongs to another issue, but I'm wondering if the lines 63-67 in pull.py are necessary. Doesn't groupby automatically group the unique values and make a multi-index?

    state_list = df["state"].unique()
    date_list = df["timestamp"].unique()
    index_df = pd.MultiIndex.from_product(
        [state_list, date_list], names=['state', 'timestamp']
    )
    df = df.groupby(
            ["state", "timestamp"]).sum().reindex(index_df).reset_index()

This is added in case the values for some states are missing for certain dates.

@dshemetov
Copy link
Contributor

This is added in case the values for some states are missing for certain dates.

I see, so this is to make sure that every state has the same number of reported dates?

@jingjtang
Copy link
Contributor Author

This is added in case the values for some states are missing for certain dates.

I see, so this is to make sure that every state has the same number of reported dates?

Yes.

@dshemetov
Copy link
Contributor

dshemetov commented Oct 7, 2020

Should FIPS here be replaced by state? Line 74 in pull.py

    # each FIPS has same number of rows
    if (len(days_by_states) > 1) or (days_by_states[0] != len(unique_days)):
        raise ValueError("Differing number of days by fips")

Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly small linter fixes here.

My only other comment is that the weekly vs daily updates change was tough to understand reading this thread or the code. I think adding some elaboration to the documentation for that would help people coming to this codebase later.

Other than that, all tests pass, so after the linter fixes, I think this PR is good to go.

'pneumonia_deaths', 'pneumonia_and_covid_deaths',
'influenza_deaths', 'pneumonia_influenza_or_covid_19_deaths',
"timestamp", "geo_id", "population"]).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linter asks for there to be a new line at the end of this file.

@@ -1,49 +1,67 @@
import pytest

from os import listdir
import datetime as dt
from os import listdir, remove
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linter says remove is no longer needed.

]
metrics = [
'covid_deaths', 'total_deaths', 'pneumonia_deaths',
'pneumonia_and_covid_deaths', 'influenza_deaths',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space here.

@krivard
Copy link
Contributor

krivard commented Oct 12, 2020

@dshemetov Agree on documentation -- @jingjtang can you add a section to DETAILS.md describing the details of:

  • how often we check for updates from NCHS
  • how those changes are tracked
  • how often we upload to the API
  • how that changeset is generated

@krivard krivard requested a review from dshemetov October 13, 2020 13:48
@dshemetov
Copy link
Contributor

dshemetov commented Oct 13, 2020

@jingjtang See my first comment for code coverage concerns. I think we should try to make sure we have tests that cover all the code lines, so we don't have surprises down the road. In particular:

  • run.py
    • lines 37-39
  • pull.py
    • lines 76, 81, 90-91

If you're absolutely sure those work, I can relent on this issue. Just trying to enforce some testing consistency.

@jingjtang
Copy link
Contributor Author

@jingjtang See my first comment for code coverage concerns. I think we should try to make sure we have tests that cover all the code lines, so we don't have surprises down the road. In particular:

  • run.py

    • lines 37-39
  • pull.py

    • lines 76, 81, 90-91

If you're absolutely sure those work, I can relent on this issue. Just trying to enforce some testing consistency.

Added a new test case for missing cols. As for others:

  • run.py 37-39 is added for the automation usage, will check with @korlaxxalrok later
  • pull.py They are just final sanity checks for the previous code.

@dshemetov
Copy link
Contributor

LGTM!

@krivard krivard merged commit 7f07c41 into main Oct 26, 2020
@krivard krivard mentioned this pull request Oct 26, 2020
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding vintage NCHS mortality data from the CDC
4 participants