JHU changed Puerto Rico death reporting, causing us to fail to report it #179
Comments
See also CSSEGISandData/COVID-19#2889 |
I'm looking into tackling this with #215. I'm thinking of splitting the deaths across the FIPS codes in proportion to population, so that we don't mix state-level data with FIPS-level data. Would this cause issues down the pipeline? It would help make the geocoding consistent. We can then reaggregate the deaths to the commonwealth level and serve only that through the API. |
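To make the proposal above concrete, here is a minimal sketch of a population-proportional split. It is not the code from #215: the function name `split_by_population` and the population figures are hypothetical, and the largest-remainder rounding is just one reasonable way to keep the county counts summing back to the commonwealth total.

```python
def split_by_population(total, populations):
    """Split a non-negative integer count across counties by population share.

    `populations` maps FIPS code -> population. Largest-remainder rounding
    guarantees that the county counts sum back to `total`.
    """
    pop_sum = sum(populations.values())
    if pop_sum <= 0:
        raise ValueError("total population must be positive")
    # Exact (fractional) share for each county.
    shares = {fips: total * pop / pop_sum for fips, pop in populations.items()}
    # Start from the floor of each share...
    counts = {fips: int(share) for fips, share in shares.items()}
    # ...then hand the leftover units to the largest fractional parts.
    leftover = total - sum(counts.values())
    by_remainder = sorted(shares, key=lambda f: shares[f] - counts[f], reverse=True)
    for fips in by_remainder[:leftover]:
        counts[fips] += 1
    return counts


# Hypothetical populations for three Puerto Rico FIPS codes (not real figures).
populations = {"72001": 20_000, "72003": 30_000, "72005": 50_000}
county_deaths = split_by_population(7, populations)

# Reaggregating the county counts recovers the commonwealth-level total.
commonwealth_total = sum(county_deaths.values())
```

Because the split is an allocation rather than an observation, the county-level numbers would only be an internal convenience for consistent geocoding; the reaggregated total is the figure that reflects what JHU reported.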
@krivard , @capnrefsmmat asked me to ping you. The issue with "Unassigned" counts for JHU data seems to be affecting states like Wyoming and Rhode Island (and presumably all other states as well). As a result, it is affecting our state-level forecasts for all states. (This also seems related to this closed issue.) |
In theory, a behavior where we map unassigned cases/deaths to a megacounty that then gets aggregated into the state figures was added in commit 5ff04c0. This commit is present in the commit log for the version of the JHU indicator in production, so if it's not doing the right thing now, then it's likely this was later overridden when we switched to the new geo aggregator. @dshemetov, can you investigate? It may be worth boosting the priority on merging #215; what would we need to get that done in the next two weeks? In the meantime, @ajgreen93 iirc we added USAFacts to avoid this exact issue, so you might try switching indicators. |
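For readers unfamiliar with the megacounty behavior described above, the following is a rough sketch of the idea, not the code from commit 5ff04c0. It assumes the megacounty code is the two-digit state FIPS followed by `000` (as with 72000 for Puerto Rico); the helper names and all counts are made up for illustration.

```python
from collections import defaultdict


def to_megacounty(state_fips):
    """Megacounty code: two-digit state FIPS followed by 000 (e.g. 72000)."""
    return f"{state_fips}000"


def assign_fips(rows):
    """Yield (fips, value), sending rows with no county FIPS to the megacounty."""
    for state_fips, county_fips, value in rows:
        yield (county_fips or to_megacounty(state_fips), value)


def aggregate_to_state(county_rows):
    """Sum county-level values (megacounties included) by state FIPS prefix."""
    totals = defaultdict(int)
    for fips, value in county_rows:
        totals[fips[:2]] += value
    return dict(totals)


# (state FIPS, county FIPS or None for "Unassigned", deaths); values are made up.
rows = [
    ("72", "72001", 0),
    ("72", "72003", 0),
    ("72", None, 12),   # "Unassigned, Puerto Rico"
    ("56", "56001", 3),
    ("56", None, 2),    # "Unassigned, Wyoming"
]
county_level = list(assign_fips(rows))
state_level = aggregate_to_state(county_level)
```

If the unassigned rows were dropped instead of being routed to the megacounty, the Puerto Rico state total in this example would come out as 0, which is the symptom reported in this issue.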
Just looked: yes, in |
Just looked into the Puerto Rico issues. As capnrefsmmat reports, we still don't have Puerto Rico case data in the API after July 17th, despite the data being present in the JHU time_series we pull. Here is the strange thing: the Puerto Rico cases after July 17th are showing up in the receiving folder on the
The Puerto Rico deaths issue has been fixed with the megaFIPS fix. |
We are successfully ingesting data from Puerto Rico for the following combinations: (state, county) X (cases, deaths) X (incidence, cumulative) X (num):
For (state) X (cases, deaths) X (incidence, cumulative) X (prop), the
(This is probably related to #227.)
Since this would cause the ingestion mechanism to reject the whole file, while we're waiting for fixes on #227 and #254 I have a cron job running to pick up the JHU files from receiving on the server and strip out lines with illegal values and geo identifiers before they reach ingestion. This will probably wreak havoc with the diff-based archive utility once we have fixes in place, but it was a better option than having state cases and deaths ratios be completely unavailable.
For (county) X (cases, deaths) X (incidence, cumulative) X (prop), the |
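As an illustration of the kind of filtering the cron job performs, here is a hedged sketch for county-level files; it is not the actual patch job. It assumes the CSV columns are `geo_id,val,se,sample_size`, treats any geo identifier that is not five digits (such as `.0000`) or that starts with `8` as illegal, and treats non-finite or unparseable values as illegal. The function names and sample rows are hypothetical.

```python
import csv
import io
import math
import re

COUNTY_RE = re.compile(r"^\d{5}$")  # a county FIPS code is exactly five digits


def is_valid_row(row):
    """Keep rows with a plausible county FIPS and a finite numeric value."""
    geo_id, val = row["geo_id"], row["val"]
    if not COUNTY_RE.match(geo_id):
        return False  # e.g. ".0000"
    if geo_id.startswith("8"):
        return False  # the erroneous 8xxxx codes
    try:
        return math.isfinite(float(val))  # rejects inf and nan
    except ValueError:
        return False  # not a number at all


def strip_invalid(text):
    """Return the CSV text with invalid rows removed, header preserved."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if is_valid_row(row):
            writer.writerow(row)
    return out.getvalue()


# Made-up file contents: only the first data row should survive.
sample = (
    "geo_id,val,se,sample_size\n"
    "72001,1.5,NA,NA\n"
    ".0000,3.0,NA,NA\n"
    "80001,2.0,NA,NA\n"
    "72003,inf,NA,NA\n"
)
cleaned = strip_invalid(sample)
```

Note that silently dropping rows like this changes what a later diff against the archive will see, which is the concern raised above about the diff-based archive utility.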
@krivard can you check the same thing but for
It is likely a population divide by zero issue, but not sure why it's day-dependent? |
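If the cause is indeed a zero or missing population, the proportion signals are where it would surface: with NumPy or pandas floats, dividing by a zero population produces `inf` rather than raising an error. The sketch below shows one way to guard the per-100,000 calculation. The function name `per_100k` is hypothetical and this is not the indicator's actual code.

```python
def per_100k(count, population):
    """Return count per 100,000 people, or None when population is unusable."""
    if population is None or population <= 0:
        return None  # report the value as missing instead of emitting inf
    return count * 100_000 / population


rate_ok = per_100k(50, 200_000)
rate_missing = per_100k(50, 0)
```

Returning a missing value keeps a single bad population entry from producing an illegal value that causes the whole file to be rejected at ingestion.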
There are no mentions of `72XXX` counties before 31 August in the success files. We only keep backups of the most recently submitted csv for each day, so
any day with PR for an earlier issue (before the inf bug) will have been
overwritten by a more recent issue (after the inf bug) which has had all
the invalid PR lines filtered out.
Recall however that the issue definition for JHU has changed multiple
times, including:
1. all days going back to 2 February
2. only the last 7 days of raw data and the last 1 day of 7dav data
3. only the new or updated lines in any new or updated file going back to 2
February
so any appearance of day-dependence of a zero population effect may instead
be related to which days of data were in the issue when the population data
first went awry.
On Fri, Sep 11, 2020 at 12:33 PM Dmitry Shemetov wrote:
@krivard can you check the same thing but for 20200826? For some reason, our API provides Puerto Rico cases since 20200827, but not before.
>from datetime import date
>import covidcast
>cc_df = covidcast.signal("jhu-csse", "confirmed_incidence_num",
date(2020, 7, 14), date(2020, 9, 11),
geo_type="county")
>cc_df[cc_df["geo_value"].isin([str(x) for x in range(72001, 72999)])]["time_value"].min()
Timestamp('2020-08-27 00:00:00')
It is likely a population divide by zero issue, but I'm not sure why we
have a day-dependent bug.
|
Here's a plot of the number of days in each issue from 2 July to 10 Sept where county data for 72000 (the PR megacounty) or 72001 are available:
So it looks like we supported PR megacounties through mid-July, then picked up individual county data on 27 August. In theory the diff-based issue generator should reissue PR county data back to 2 Feb as soon as it becomes available and valid; in practice we may have to babysit it a bit. |
I'm having trouble grokking this. My guess is it's because I don't know what issue means. Is that the date when the data was released? And is this definition something set by us or by JHU or both? |
Ah sorry, that's data versioning terminology. "Issue" as in a magazine issue: a collection of data that was uploaded to receiving and published together. For daily signals, the issue date is a day. For weekly signals, the issue date is an epidemiological week ("epiweek"). More info in the API docs here <https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#optional> or the onboarding documentation for Engineering here <https://docs.google.com/document/d/17WMyQQ-zGtVtB8GLaACxLOkbbUPscMyfweqc1FxW-74/edit?usp=sharing>.
Eventually we want all indicators to abide by a diff-based issue definition that includes only the rows that changed during the time period covered by the issue. Rows that stayed the same are not explicitly confirmed. Rows that were removed are not currently distinguished from rows that stayed the same; this will be addressed in a missingness encoding scheme TBD.
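As a small illustration of the diff-based issue definition described above, here is a sketch that keeps only new or changed rows. It is not the production differ; the function name `diff_issue`, the key layout, and the values are assumptions made for the example.

```python
def diff_issue(previous, current):
    """Rows in `current` that are new or whose value changed since `previous`.

    Both arguments map (geo_id, time_value) -> value. Rows that were removed
    from `current` are not reported, so they look the same as unchanged rows.
    """
    return {
        key: val
        for key, val in current.items()
        if key not in previous or previous[key] != val
    }


# Made-up snapshots of the same day as seen in two consecutive uploads.
previous = {
    ("72001", "20200826"): 3,
    ("72003", "20200826"): 5,
    ("72007", "20200826"): 2,   # absent from the next upload
}
current = {
    ("72001", "20200826"): 3,   # unchanged: left out of the issue
    ("72003", "20200826"): 6,   # changed: included
    ("72005", "20200826"): 1,   # new: included
}
issue_rows = diff_issue(previous, current)
```

The row for 72007 does not appear in `issue_rows` at all, which is the ambiguity between removed and unchanged rows that the missingness encoding is meant to resolve.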
|
The deploy-jhu branch generates Puerto Rico (PR) data correctly, which means there are at least four possibilities: (1) the differ has a bug, (2) the differ is working correctly but the AWS cache is dirty and causing it to fail, (3) the patch job I added to drop the erroneous .00000 and 8xxxx counties has a bug, or (4) something is causing the July-August files to fail validity checks. Dmitry checked (1). We are looking for:
|
Successful July/August files that mention PR regions: As expected, no results, except for 31 August -- fair enough. We really just wanted to confirm the gap.
Failed July/August files that mention PR regions: No state files, but county files for all dates (weird?)
These were uploaded on August 28, and include the invalid ".0000" region from #254. |
It seems Dmitry checked that his output matches the production cache, but not whether the production cache was dirty. The cache contains:
The presence of the invalid county codes suggests a dirty cache. There are a couple of ways forward from here:
|
Ah I saw the .0000 counties, but didn't realize they weren't supposed to be there! Feels good to have it narrowed down! |
@eujing @korlaxxalrok do you have thoughts on the best way to reset the S3 cache as above? |
I feel like the cleanest way would be to regenerate the entire jhu cache.
|
No Puerto Rico case or death data has been available in the API since July 17:
The JHU time series of deaths seems to support this, showing 0 deaths for all time in every county in Puerto Rico -- but that's because the deaths are listed under "Unassigned, Puerto Rico". We should be ingesting these deaths.
Meanwhile, their time series of confirmed cases shows plenty of cases, but for some reason we are not reporting them.
This is preventing forecasting from issuing death forecasts for Puerto Rico, and will block case forecasts as well.