JHU changed Puerto Rico death reporting, causing us to fail to report it #179

Status: Closed
capnrefsmmat opened this issue Aug 3, 2020 · 19 comments
Labels: bug, data quality, Triage

@capnrefsmmat (Contributor)

No Puerto Rico cases or deaths data has been available in the API since July 17:

> covidcast_signal("jhu-csse", "deaths_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:deaths_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3     2     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2     1     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1     5     NA          NA

> covidcast_signal("jhu-csse", "confirmed_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:confirmed_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3   256     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2   195     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1   546     NA          NA

The JHU time series of deaths seems to support this, showing 0 deaths for all time in every county in Puerto Rico -- but that's because the deaths are listed under "Unassigned, Puerto Rico". We should be ingesting these deaths.

Meanwhile, their time series of confirmed cases shows plenty of cases, but for some reason we are not reporting them.

This is preventing the forecasting team from issuing death forecasts for Puerto Rico, and it will block case forecasts as well.
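
For anyone verifying this against the raw JHU data, here is a minimal pandas sketch. The URL and column layout are JHU's current ones and may change; treat it as illustrative.

import pandas as pd

# Minimal check against the raw JHU deaths time series. The URL and the
# Admin2/Province_State column layout may change; illustrative only.
URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_deaths_US.csv")
df = pd.read_csv(URL)
pr = df[df["Province_State"] == "Puerto Rico"]
latest = pr.columns[-1]  # last column is the most recent date
# Every named municipio shows 0; the deaths sit in the "Unassigned" row.
print(pr[["Admin2", latest]].sort_values(latest, ascending=False).head())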

capnrefsmmat added the Triage label on Aug 3, 2020
@capnrefsmmat (Contributor, Author)

See also CSSEGISandData/COVID-19#2889

jsharpna self-assigned this on Aug 3, 2020
@dshemetov (Contributor)

I'm looking into tackling this with #215. I'm thinking of splitting the deaths across the FIPS codes in proportion to population, so that we don't mix state-level data with FIPS-level data. Would this cause issues down the pipeline? It would help make the geocoding consistent. We can then reaggregate the deaths back to the commonwealth level and serve only that through the API.
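
To make that concrete, a minimal sketch of the population-proportional split. The populations here are made up, and this is not the actual #215 implementation.

import pandas as pd

# Sketch of splitting "Unassigned" commonwealth-level deaths across county
# FIPS codes in proportion to population. Hypothetical numbers; not the
# actual #215 code.
def split_unassigned(unassigned_deaths, county_pop):
    weights = county_pop / county_pop.sum()
    return unassigned_deaths * weights

county_pop = pd.Series({"72001": 24000, "72003": 36000, "72005": 52000})
print(split_unassigned(10.0, county_pop))
# Summing the result recovers the commonwealth total exactly, so
# reaggregation back to the commonwealth level is lossless.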

dshemetov self-assigned this on Aug 24, 2020
@ajgreen93

@krivard, @capnrefsmmat asked me to ping you. The issue with "Unassigned" counts in the JHU data seems to be affecting states like Wyoming and Rhode Island (and presumably all other states as well). As a result, it is affecting our state-level forecasts for all states.

(This also seems related to this closed issue.)

@krivard (Contributor) commented Sep 8, 2020

In theory, commit 5ff04c0 added behavior that maps unassigned cases/deaths to a megacounty, which then gets aggregated into the state figures. That commit is in the log for the version of the JHU indicator in production, so if it's not doing the right thing now, it was likely overridden when we switched to the new geo aggregator. @dshemetov, can you investigate? It may be worth boosting the priority on merging #215; what would we need to get that done in the next two weeks?

In the meantime, @ajgreen93 iirc we added USAFacts to avoid this exact issue, so you might try switching indicators.

@dshemetov (Contributor)

Just looked: yes, in deploy-jhu UIDs 840900XX are being mapped to 900XX without being converted to XX000 (the mega-county fix). This is already handled in #217.
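
For reference, my reading of the intended conversion (a sketch; the actual #217 geo aggregator may differ):

# Sketch of the intended UID handling: unassigned UIDs 840900XX should map
# to the state megacounty XX000, not to the bogus FIPS 900XX. My reading of
# the fix; the actual #217 code may differ.
def uid_to_fips(uid):
    if uid.startswith("84090"):   # 840900XX = "Unassigned, state XX"
        return uid[-2:] + "000"   # megacounty, e.g. 72000 for Puerto Rico
    if uid.startswith("840"):     # ordinary county: 840 + 5-digit FIPS
        return uid[3:]
    raise ValueError("unhandled UID: " + uid)

assert uid_to_fips("84090072") == "72000"  # not 90072
assert uid_to_fips("84072001") == "72001"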

@dshemetov (Contributor)

@krivard Merging #215 within two weeks is likely doable. What's left are the HHS, National, and DMA level geocodes; DMA may take some work to track down the crosswalks.

krivard added the bug label on Sep 9, 2020
@dshemetov (Contributor) commented Sep 10, 2020

Just looked into the Puerto Rico issues. As capnrefsmmat reports, we still don't have Puerto Rico case information in the API after July 17th, despite the data being present in the JHU time_series files we pull. Here is the strange thing: the Puerto Rico cases after July 17th do show up in the receiving folder on the deploy-jhu branch. So the issue must be in the ingestion step after we pull.

The Puerto Rico deaths issue has been fixed with the megaFIPS fix.

@krivard (Contributor) commented Sep 10, 2020

We are successfully ingesting data from Puerto Rico for the following combinations: (state, county) X (cases, deaths) X (incidence, cumulative) X (num):

$ zgrep -il -e "^72" -e "^pr" /common/covidcast/archive/successful/jhu-csse/20200908*
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_incidence_num.csv.gz

For (state) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline is generating PR data, but it puts inf in the value column, which is not permitted:

$ grep -i "^pr" bad-jhu/20200908*
bad-jhu/20200908_state_confirmed_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_7dav_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_incidence_prop.csv:PR,inf,NA,NA

(this is probably related to #227)
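
The inf is what a per-100,000 computation produces when the population denominator is zero or missing; a minimal sketch (not the indicator's actual code):

import numpy as np

# Sketch of how a *_prop signal becomes inf when the population entry is
# zero or missing; not the indicator's actual code.
counts = np.array([546.0])    # e.g. PR confirmed_incidence_num
population = np.array([0.0])  # PR population missing, treated as zero
with np.errstate(divide="ignore"):
    prop = counts / population * 100_000
print(prop)  # [inf] -- ingestion requires finite values and rejects the file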

Since a single inf would cause the ingestion mechanism to reject the whole file, while we wait for fixes on #227 and #254 I have a cron job running that picks up the JHU files from receiving on the server and strips out lines with illegal values and geo identifiers before they reach ingestion. This will probably wreak havoc with the diff-based archive utility once the fixes are in place, but it's better than having the state case and death ratios be completely unavailable.
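
Roughly what the patch job does, assuming the receiving CSV layout geo_id,val,se,sample_size (the actual cron script may differ):

import csv, math, re

# Sketch of the stopgap filter: drop rows with non-finite values or
# malformed geo ids before ingestion. Assumes the receiving CSV layout
# geo_id,val,se,sample_size; the real cron script may differ.
GEO_OK = re.compile(r"^(?:[0-7][0-9]{4}|[A-Za-z]{2})$")  # county FIPS or state

def strip_bad_rows(src, dst):
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader))  # header
        for geo_id, val, se, n in reader:
            try:
                finite = math.isfinite(float(val))
            except ValueError:
                finite = False
            if finite and GEO_OK.match(geo_id):  # drops inf, ".0000", 8xxxx/9xxxx
                writer.writerow([geo_id, val, se, n])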

For (county) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline does not appear to be generating any entries for Puerto Rico counties at all.

@dshemetov (Contributor) commented Sep 11, 2020

@krivard can you check the same thing but for 20200826? For some reason, our API provides Puerto Rico cases since 20200827, but not before.

import covidcast
from datetime import date

cc_df = covidcast.signal("jhu-csse", "confirmed_incidence_num",
                         date(2020, 7, 14), date(2020, 9, 11),
                         geo_type="county")
pr = [str(x) for x in range(72001, 72999)]
print(cc_df[cc_df["geo_value"].isin(pr)]["time_value"].min())
# Timestamp('2020-08-27 00:00:00')

It is likely a population divide-by-zero issue, but I'm not sure why it would be day-dependent.

@krivard (Contributor) commented Sep 11, 2020 via email

@krivard (Contributor) commented Sep 11, 2020

Here's a plot of the number of days in each issue from 2 July to 10 Sept where county data for 72000 (the PR megacounty) or 72001 are available:

library(covidcast); library(dplyr); library(ggplot2)

df <- suppressMessages(covidcast_signal("jhu-csse", "confirmed_incidence_num",
                                        "2020-07-01", "2020-09-09", "county",
                                        c("72000", "72001"), issues = c("2020-07-02", "2020-09-10")))
dfn <- group_by(df, geo_value, issue) %>% summarise(n = n())
ggplot(dfn, aes(x = issue, y = n, group = geo_value, color = geo_value)) +
  geom_line() + ggtitle("number of dates of available data")

[plot: number of dates of available data per issue, one line each for 72000 and 72001]

So it looks like we supported PR megacounties through mid-July, then picked up individual county data on 27 August.

In theory the diff-based issue generator should reissue PR county data back to 2 Feb as soon as it becomes available and valid; in practice we may have to babysit it a bit.
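
For anyone following along, a schematic of how the diff-based issue generator decides what to (re)issue (a sketch, not the actual archive utility code):

import pandas as pd

# Schematic of diff-based issue generation: deliver only rows that are new
# or changed relative to the cached previous issue. Not the actual archive
# utility code. Note that if a delivered file later fails validation, the
# cache still records its rows as sent, so they never get re-delivered.
def diff_export(cached, new):
    merged = new.merge(cached, how="left", on="geo_id",
                       suffixes=("", "_cached"), indicator=True)
    changed = (merged["_merge"] == "left_only") | (merged["val"] != merged["val_cached"])
    return merged.loc[changed, new.columns]

cached = pd.DataFrame({"geo_id": ["72001"], "val": [119.0]})
new = pd.DataFrame({"geo_id": ["72001", "72003"], "val": [121.0, 5.0]})
print(diff_export(cached, new))  # one changed row, one new row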

@dshemetov (Contributor)

I'm having trouble grokking this. My guess is it's because I don't know what "issue" means. Is that the date when the data was released? And is this definition set by us, by JHU, or both?

@krivard (Contributor) commented Sep 12, 2020 via email

@krivard (Contributor) commented Sep 17, 2020

The deploy-jhu branch generates Puerto Rico (PR) data correctly, which means there are at least four possibilities: (1) the differ has a bug; (2) the differ is working correctly, but the AWS cache is dirty and causing it to fail; (3) the patch job I added to drop the erroneous .0000 and 8xxxx counties has a bug; (4) something is causing the July-August files to fail validity checks.

Dmitry checked (1) and (2), and all seems well there. That leaves going through the successful/failed archive on the server to see what was actually ingested, and cross-referencing with the ingestion log files to see which of those files were overwritten and when.

We are looking for:

  • Successful July/August files that mention PR regions. Since PR doesn’t show up in the latest-issue API results, and we don’t have a deletion mechanism in the API, we do not expect to find any such files.
  • Failed July/August files that mention PR regions, especially after the differ was activated. A failed file is not loaded into the API, but the differ doesn’t know that, so those changes would be erroneously removed from future deliveries to receiving.
  • Failed files that may have overwritten July/August files that could have mentioned PR regions. Same deal, but it requires a cross-reference with the Automation overwrite log first to get the list.

@krivard (Contributor) commented Sep 17, 2020

Successful July/August files that mention PR regions: As expected, no results, except for 31 August -- fair enough. We really just wanted to confirm the gap.

$ find archive/successful/jhu-csse/ -name "20200[78]*state*" -exec zgrep "^pr" {} + | grep -v "_wip_"
$ find archive/successful/jhu-csse/ -name "20200[78]*county*" -exec zgrep -m1 "^72" {} + | grep -v "_wip_"
archive/successful/jhu-csse/20200831_county_confirmed_7dav_cumulative_num.csv.gz:72001,119.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_7dav_incidence_num.csv.gz:72001,0.7142857142857143,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_cumulative_num.csv.gz:72001,121.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_incidence_num.csv.gz:72001,0.0,NA,NA

Failed July/August files that mention PR regions: No state files, but county files for all dates (weird?)

$ find archive/failed/jhu-csse/ -name "20200[78]*state*" -exec grep -m1 "^pr" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
$ find archive/failed/jhu-csse/ -name "20200[78]*county*" -exec grep -m1 "^72" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
archive/failed/jhu-csse/20200701
archive/failed/jhu-csse/20200702
archive/failed/jhu-csse/20200703
archive/failed/jhu-csse/20200704
archive/failed/jhu-csse/20200705
archive/failed/jhu-csse/20200706
archive/failed/jhu-csse/20200707
archive/failed/jhu-csse/20200708
archive/failed/jhu-csse/20200709
archive/failed/jhu-csse/20200710
archive/failed/jhu-csse/20200711
archive/failed/jhu-csse/20200712
archive/failed/jhu-csse/20200713
archive/failed/jhu-csse/20200714
archive/failed/jhu-csse/20200715
archive/failed/jhu-csse/20200716
archive/failed/jhu-csse/20200717
archive/failed/jhu-csse/20200718
archive/failed/jhu-csse/20200719
archive/failed/jhu-csse/20200720
archive/failed/jhu-csse/20200721
archive/failed/jhu-csse/20200722
archive/failed/jhu-csse/20200723
archive/failed/jhu-csse/20200724
archive/failed/jhu-csse/20200725
archive/failed/jhu-csse/20200726
archive/failed/jhu-csse/20200727
archive/failed/jhu-csse/20200728
archive/failed/jhu-csse/20200729
archive/failed/jhu-csse/20200730
archive/failed/jhu-csse/20200731
archive/failed/jhu-csse/20200801
archive/failed/jhu-csse/20200802
archive/failed/jhu-csse/20200803
archive/failed/jhu-csse/20200804
archive/failed/jhu-csse/20200805
archive/failed/jhu-csse/20200806
archive/failed/jhu-csse/20200807
archive/failed/jhu-csse/20200808
archive/failed/jhu-csse/20200809
archive/failed/jhu-csse/20200810
archive/failed/jhu-csse/20200811
archive/failed/jhu-csse/20200812
archive/failed/jhu-csse/20200813
archive/failed/jhu-csse/20200814
archive/failed/jhu-csse/20200815
archive/failed/jhu-csse/20200816
archive/failed/jhu-csse/20200817
archive/failed/jhu-csse/20200818
archive/failed/jhu-csse/20200819
archive/failed/jhu-csse/20200820
archive/failed/jhu-csse/20200821
archive/failed/jhu-csse/20200822
archive/failed/jhu-csse/20200823
archive/failed/jhu-csse/20200824
archive/failed/jhu-csse/20200825
archive/failed/jhu-csse/20200826
archive/failed/jhu-csse/20200827
archive/failed/jhu-csse/20200828
archive/failed/jhu-csse/20200829

These were uploaded on August 28, and include the invalid ".0000" region from #254.

@krivard (Contributor) commented Sep 17, 2020

It seems Dmitry checked that his output matches the production cache, but not whether the production cache was dirty.

The cache contains:

  • ✅ 72xxx records in county files
  • ✅ PR records in state files (though upper case, which is suboptimal)
  • ❌ invalid counties (.0000, 8xxx, 9xxx) in the county files

The presence of the invalid county codes suggests a dirty cache. There are a few ways forward from here:

  • drop and regenerate the entire cache from API calls
  • surgically alter the cache to remove the PR records and activate the differ for tomorrow's run (sketched below)
  • drop only invalid cache files
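
A sketch of the surgical option (the bucket and key names here are hypothetical; assumes pandas and boto3):

import io
import boto3
import pandas as pd

# Sketch of surgically removing PR rows from a cached county file so the
# differ re-emits clean PR data on the next run. Bucket/key names are
# hypothetical.
BUCKET = "covidcast-cache"                                    # hypothetical
KEY = "jhu-csse/20200908_county_confirmed_incidence_num.csv"  # hypothetical

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
df = pd.read_csv(io.BytesIO(body), dtype={"geo_id": str})
df = df[~df["geo_id"].str.startswith("72")]  # drop PR so the differ resends it
s3.put_object(Bucket=BUCKET, Key=KEY, Body=df.to_csv(index=False).encode())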

@dshemetov (Contributor)

Ah I saw the .0000 counties, but didn't realize they weren't supposed to be there! Feels good to have it narrowed down!

@krivard (Contributor) commented Sep 18, 2020

@eujing @korlaxxalrok do you have thoughts on the best way to reset the S3 cache as above?

@eujing (Contributor) commented Sep 18, 2020

I feel like the cleanest way would be to regenerate the entire jhu cache: delete the jhu S3 cache and manually run the indicator today, before tomorrow's scheduled run, so that it uploads its complete output to S3 (a sketch of the reset follows the list). Two problems with this:

  1. The upload will take a while, since files are uploaded serially.
  2. We might have to note down this event somewhere if we ever want to reconstruct anything from the S3 object versioning history.
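
Something like the following (bucket and prefix names are hypothetical):

import boto3

# Sketch of clearing the jhu S3 cache before a manual full run. Bucket and
# prefix names are hypothetical. With bucket versioning enabled this only
# adds delete markers, so the object history needed for reconstruction
# (problem 2 above) is preserved.
BUCKET = "covidcast-cache"  # hypothetical
PREFIX = "jhu-csse/"        # hypothetical

bucket = boto3.resource("s3").Bucket(BUCKET)
bucket.objects.filter(Prefix=PREFIX).delete()  # batched DeleteObjects calls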
