JHU changed Puerto Rico death reporting, causing us to fail to report it #179

Status: Closed
capnrefsmmat opened this issue Aug 3, 2020 · 19 comments
Labels: bug, data quality, Triage

@capnrefsmmat (Contributor)

No Puerto Rico cases or deaths data has been available in the API since July 17:

> covidcast_signal("jhu-csse", "deaths_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:deaths_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3     2     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2     1     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1     5     NA          NA

> covidcast_signal("jhu-csse", "confirmed_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:confirmed_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3   256     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2   195     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1   546     NA          NA

The JHU time series of deaths seems to support this, showing 0 deaths for all time in every county in Puerto Rico -- but that's because the deaths are listed under "Unassigned, Puerto Rico". We should be ingesting these deaths.

Meanwhile, their time series of confirmed cases shows plenty of cases, but for some reason we are not reporting them.

This is preventing the forecasting team from issuing death forecasts for Puerto Rico, and it will block case forecasts as well.
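
For anyone verifying this against the raw JHU data, here is a minimal pandas sketch. The URL and column layout are JHU's current ones and may change; treat it as illustrative.

import pandas as pd

# Minimal check against the raw JHU deaths time series. The URL and the
# Admin2/Province_State column layout may change; illustrative only.
URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_deaths_US.csv")
df = pd.read_csv(URL)
pr = df[df["Province_State"] == "Puerto Rico"]
latest = pr.columns[-1]  # last column is the most recent date
# Every named municipio shows 0; the deaths sit in the "Unassigned" row.
print(pr[["Admin2", latest]].sort_values(latest, ascending=False).head())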

capnrefsmmat added the Triage label on Aug 3, 2020
@capnrefsmmat (Contributor, Author)

See also CSSEGISandData/COVID-19#2889

jsharpna self-assigned this on Aug 3, 2020
@dshemetov (Contributor)

I'm looking into tackling this with #215. I'm thinking of splitting the deaths across the FIPS codes in proportion to population, so that we don't mix state-level data with FIPS-level data. Would this cause issues down the pipeline? It would help make the geocoding consistent. We can then reaggregate the deaths back to the commonwealth level and serve only that through the API.
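
To make that concrete, a minimal sketch of the population-proportional split. The populations here are made up, and this is not the actual #215 implementation.

import pandas as pd

# Sketch of splitting "Unassigned" commonwealth-level deaths across county
# FIPS codes in proportion to population. Hypothetical numbers; not the
# actual #215 code.
def split_unassigned(unassigned_deaths, county_pop):
    weights = county_pop / county_pop.sum()
    return unassigned_deaths * weights

county_pop = pd.Series({"72001": 24000, "72003": 36000, "72005": 52000})
print(split_unassigned(10.0, county_pop))
# Summing the result recovers the commonwealth total exactly, so
# reaggregation back to the commonwealth level is lossless.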

dshemetov self-assigned this on Aug 24, 2020
@ajgreen93

@krivard, @capnrefsmmat asked me to ping you. The issue with "Unassigned" counts in the JHU data seems to be affecting states like Wyoming and Rhode Island (and presumably all other states as well). As a result, it is affecting our state-level forecasts for all states.

(This also seems related to this closed issue.)

@krivard (Contributor) commented Sep 8, 2020

In theory, commit 5ff04c0 added behavior that maps unassigned cases/deaths to a megacounty, which then gets aggregated into the state figures. That commit is in the log for the version of the JHU indicator in production, so if it's not doing the right thing now, it was likely overridden when we switched to the new geo aggregator. @dshemetov, can you investigate? It may be worth boosting the priority on merging #215; what would we need to get that done in the next two weeks?

In the meantime, @ajgreen93 iirc we added USAFacts to avoid this exact issue, so you might try switching indicators.

@dshemetov (Contributor)

Just looked: yes, in deploy-jhu UIDs 840900XX are being mapped to 900XX without being converted to XX000 (the mega-county fix). This is already handled in #217.
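
For reference, my reading of the intended conversion (a sketch; the actual #217 geo aggregator may differ):

# Sketch of the intended UID handling: unassigned UIDs 840900XX should map
# to the state megacounty XX000, not to the bogus FIPS 900XX. My reading of
# the fix; the actual #217 code may differ.
def uid_to_fips(uid):
    if uid.startswith("84090"):   # 840900XX = "Unassigned, state XX"
        return uid[-2:] + "000"   # megacounty, e.g. 72000 for Puerto Rico
    if uid.startswith("840"):     # ordinary county: 840 + 5-digit FIPS
        return uid[3:]
    raise ValueError("unhandled UID: " + uid)

assert uid_to_fips("84090072") == "72000"  # not 90072
assert uid_to_fips("84072001") == "72001"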

@dshemetov (Contributor)

@krivard Merging #215 within two weeks is likely doable. What's left are the HHS, National, and DMA level geocodes; DMA may take some work to track down the crosswalks.

krivard added the bug label on Sep 9, 2020
@dshemetov (Contributor) commented Sep 10, 2020

Just looked into the Puerto Rico issues. As capnrefsmmat reports, we still don't have Puerto Rico case information in the API after July 17th, despite the data being present in the JHU time_series files we pull. Here is the strange thing: the Puerto Rico cases after July 17th do show up in the receiving folder on the deploy-jhu branch. So the issue must be in the ingestion step after we pull.

The Puerto Rico deaths issue has been fixed with the megaFIPS fix.

@krivard (Contributor) commented Sep 10, 2020

We are successfully ingesting data from Puerto Rico for the following combinations: (state, county) X (cases, deaths) X (incidence, cumulative) X (num):

$ zgrep -il -e "^72" -e "^pr" /common/covidcast/archive/successful/jhu-csse/20200908*
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_incidence_num.csv.gz

For (state) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline is generating PR data, but it puts inf in the value column, which is not permitted:

$ grep -i "^pr" bad-jhu/20200908*
bad-jhu/20200908_state_confirmed_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_7dav_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_incidence_prop.csv:PR,inf,NA,NA

(this is probably related to #227)
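
The inf is what a per-100,000 computation produces when the population denominator is zero or missing; a minimal sketch (not the indicator's actual code):

import numpy as np

# Sketch of how a *_prop signal becomes inf when the population entry is
# zero or missing; not the indicator's actual code.
counts = np.array([546.0])    # e.g. PR confirmed_incidence_num
population = np.array([0.0])  # PR population missing, treated as zero
with np.errstate(divide="ignore"):
    prop = counts / population * 100_000
print(prop)  # [inf] -- ingestion requires finite values and rejects the file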

Since a single inf would cause the ingestion mechanism to reject the whole file, while we wait for fixes on #227 and #254 I have a cron job running that picks up the JHU files from receiving on the server and strips out lines with illegal values and geo identifiers before they reach ingestion. This will probably wreak havoc with the diff-based archive utility once the fixes are in place, but it's better than having the state case and death ratios be completely unavailable.
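
Roughly what the patch job does, assuming the receiving CSV layout geo_id,val,se,sample_size (the actual cron script may differ):

import csv, math, re

# Sketch of the stopgap filter: drop rows with non-finite values or
# malformed geo ids before ingestion. Assumes the receiving CSV layout
# geo_id,val,se,sample_size; the real cron script may differ.
GEO_OK = re.compile(r"^(?:[0-7][0-9]{4}|[A-Za-z]{2})$")  # county FIPS or state

def strip_bad_rows(src, dst):
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader, writer = csv.reader(fin), csv.writer(fout)
        writer.writerow(next(reader))  # header
        for geo_id, val, se, n in reader:
            try:
                finite = math.isfinite(float(val))
            except ValueError:
                finite = False
            if finite and GEO_OK.match(geo_id):  # drops inf, ".0000", 8xxxx/9xxxx
                writer.writerow([geo_id, val, se, n])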

For (county) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline does not appear to be generating any entries for Puerto Rico counties at all.

@dshemetov (Contributor) commented Sep 11, 2020

@krivard can you check the same thing but for 20200826? For some reason, our API provides Puerto Rico cases since 20200827, but not before.

import covidcast
from datetime import date

cc_df = covidcast.signal("jhu-csse", "confirmed_incidence_num",
                         date(2020, 7, 14), date(2020, 9, 11),
                         geo_type="county")
pr = [str(x) for x in range(72001, 72999)]
print(cc_df[cc_df["geo_value"].isin(pr)]["time_value"].min())
# Timestamp('2020-08-27 00:00:00')

It is likely a population divide-by-zero issue, but I'm not sure why it would be day-dependent.

@krivard (Contributor) commented Sep 11, 2020 via email

@krivard (Contributor) commented Sep 11, 2020

Here's a plot of the number of days in each issue from 2 July to 10 Sept where county data for 72000 (the PR megacounty) or 72001 are available:

library(covidcast); library(dplyr); library(ggplot2)

df <- suppressMessages(covidcast_signal("jhu-csse", "confirmed_incidence_num",
                                        "2020-07-01", "2020-09-09", "county",
                                        c("72000", "72001"), issues = c("2020-07-02", "2020-09-10")))
dfn <- group_by(df, geo_value, issue) %>% summarise(n = n())
ggplot(dfn, aes(x = issue, y = n, group = geo_value, color = geo_value)) +
  geom_line() + ggtitle("number of dates of available data")

[plot: number of dates of available data per issue, one line each for 72000 and 72001]

So it looks like we supported PR megacounties through mid-July, then picked up individual county data on 27 August.

In theory the diff-based issue generator should reissue PR county data back to 2 Feb as soon as it becomes available and valid; in practice we may have to babysit it a bit.
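
For anyone following along, a schematic of how the diff-based issue generator decides what to (re)issue (a sketch, not the actual archive utility code):

import pandas as pd

# Schematic of diff-based issue generation: deliver only rows that are new
# or changed relative to the cached previous issue. Not the actual archive
# utility code. Note that if a delivered file later fails validation, the
# cache still records its rows as sent, so they never get re-delivered.
def diff_export(cached, new):
    merged = new.merge(cached, how="left", on="geo_id",
                       suffixes=("", "_cached"), indicator=True)
    changed = (merged["_merge"] == "left_only") | (merged["val"] != merged["val_cached"])
    return merged.loc[changed, new.columns]

cached = pd.DataFrame({"geo_id": ["72001"], "val": [119.0]})
new = pd.DataFrame({"geo_id": ["72001", "72003"], "val": [121.0, 5.0]})
print(diff_export(cached, new))  # one changed row, one new row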

@dshemetov (Contributor)

I'm having trouble grokking this. My guess is it's because I don't know what "issue" means. Is that the date when the data was released? And is this definition set by us, by JHU, or both?

@krivard (Contributor) commented Sep 12, 2020 via email

@krivard (Contributor) commented Sep 17, 2020

The deploy-jhu branch generates Puerto Rico (PR) data correctly, which means there are at least four possibilities: (1) the differ has a bug; (2) the differ is working correctly, but the AWS cache is dirty and causing it to fail; (3) the patch job I added to drop the erroneous .0000 and 8xxxx counties has a bug; (4) something is causing the July-August files to fail validity checks.

Dmitry checked (1) and (2), and all seems well there. That leaves going through the successful/failed archive on the server to see what was actually ingested, and cross-referencing with the ingestion log files to see which of those files were overwritten and when.

We are looking for:

  • Successful July/August files that mention PR regions. Since PR doesn’t show up in the latest-issue API results, and we don’t have a deletion mechanism in the API, we do not expect to find any such files.
  • Failed July/August files that mention PR regions, especially after the differ was activated. A failed file is not loaded into the API, but the differ doesn’t know that, so those changes would be erroneously removed from future deliveries to receiving.
  • Failed files that may have overwritten July/August files that could have mentioned PR regions. Same deal, but it requires a cross-reference with the Automation overwrite log first to get the list.

@krivard (Contributor) commented Sep 17, 2020

Successful July/August files that mention PR regions: As expected, no results, except for 31 August -- fair enough. We really just wanted to confirm the gap.

$ find archive/successful/jhu-csse/ -name "20200[78]*state*" -exec zgrep "^pr" {} + | grep -v "_wip_"
$ find archive/successful/jhu-csse/ -name "20200[78]*county*" -exec zgrep -m1 "^72" {} + | grep -v "_wip_"
archive/successful/jhu-csse/20200831_county_confirmed_7dav_cumulative_num.csv.gz:72001,119.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_7dav_incidence_num.csv.gz:72001,0.7142857142857143,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_cumulative_num.csv.gz:72001,121.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_incidence_num.csv.gz:72001,0.0,NA,NA

Failed July/August files that mention PR regions: No state files, but county files for all dates (weird?)

$ find archive/failed/jhu-csse/ -name "20200[78]*state*" -exec grep -m1 "^pr" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
$ find archive/failed/jhu-csse/ -name "20200[78]*county*" -exec grep -m1 "^72" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
archive/failed/jhu-csse/20200701
archive/failed/jhu-csse/20200702
archive/failed/jhu-csse/20200703
archive/failed/jhu-csse/20200704
archive/failed/jhu-csse/20200705
archive/failed/jhu-csse/20200706
archive/failed/jhu-csse/20200707
archive/failed/jhu-csse/20200708
archive/failed/jhu-csse/20200709
archive/failed/jhu-csse/20200710
archive/failed/jhu-csse/20200711
archive/failed/jhu-csse/20200712
archive/failed/jhu-csse/20200713
archive/failed/jhu-csse/20200714
archive/failed/jhu-csse/20200715
archive/failed/jhu-csse/20200716
archive/failed/jhu-csse/20200717
archive/failed/jhu-csse/20200718
archive/failed/jhu-csse/20200719
archive/failed/jhu-csse/20200720
archive/failed/jhu-csse/20200721
archive/failed/jhu-csse/20200722
archive/failed/jhu-csse/20200723
archive/failed/jhu-csse/20200724
archive/failed/jhu-csse/20200725
archive/failed/jhu-csse/20200726
archive/failed/jhu-csse/20200727
archive/failed/jhu-csse/20200728
archive/failed/jhu-csse/20200729
archive/failed/jhu-csse/20200730
archive/failed/jhu-csse/20200731
archive/failed/jhu-csse/20200801
archive/failed/jhu-csse/20200802
archive/failed/jhu-csse/20200803
archive/failed/jhu-csse/20200804
archive/failed/jhu-csse/20200805
archive/failed/jhu-csse/20200806
archive/failed/jhu-csse/20200807
archive/failed/jhu-csse/20200808
archive/failed/jhu-csse/20200809
archive/failed/jhu-csse/20200810
archive/failed/jhu-csse/20200811
archive/failed/jhu-csse/20200812
archive/failed/jhu-csse/20200813
archive/failed/jhu-csse/20200814
archive/failed/jhu-csse/20200815
archive/failed/jhu-csse/20200816
archive/failed/jhu-csse/20200817
archive/failed/jhu-csse/20200818
archive/failed/jhu-csse/20200819
archive/failed/jhu-csse/20200820
archive/failed/jhu-csse/20200821
archive/failed/jhu-csse/20200822
archive/failed/jhu-csse/20200823
archive/failed/jhu-csse/20200824
archive/failed/jhu-csse/20200825
archive/failed/jhu-csse/20200826
archive/failed/jhu-csse/20200827
archive/failed/jhu-csse/20200828
archive/failed/jhu-csse/20200829

These were uploaded on August 28, and include the invalid ".0000" region from #254.

@krivard (Contributor) commented Sep 17, 2020

It seems Dmitry checked that his output matches the production cache, but not whether the production cache was dirty.

The cache contains:

  • ✅ 72xxx records in county files
  • ✅ PR records in state files (though upper case, which is suboptimal)
  • ❌ invalid counties (.0000, 8xxx, 9xxx) in the county files

The presence of the invalid county codes suggests a dirty cache. There are a few ways forward from here:

  • drop and regenerate the entire cache from API calls
  • surgically alter the cache to remove the PR records and activate the differ for tomorrow's run (sketched below)
  • drop only invalid cache files
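
A sketch of the surgical option (the bucket and key names here are hypothetical; assumes pandas and boto3):

import io
import boto3
import pandas as pd

# Sketch of surgically removing PR rows from a cached county file so the
# differ re-emits clean PR data on the next run. Bucket/key names are
# hypothetical.
BUCKET = "covidcast-cache"                                    # hypothetical
KEY = "jhu-csse/20200908_county_confirmed_incidence_num.csv"  # hypothetical

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
df = pd.read_csv(io.BytesIO(body), dtype={"geo_id": str})
df = df[~df["geo_id"].str.startswith("72")]  # drop PR so the differ resends it
s3.put_object(Bucket=BUCKET, Key=KEY, Body=df.to_csv(index=False).encode())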

@dshemetov (Contributor)

Ah I saw the .0000 counties, but didn't realize they weren't supposed to be there! Feels good to have it narrowed down!

@krivard (Contributor) commented Sep 18, 2020

@eujing @korlaxxalrok do you have thoughts on the best way to reset the S3 cache as above?

@eujing (Contributor) commented Sep 18, 2020

I feel like the cleanest way would be to regenerate the entire jhu cache: delete the jhu S3 cache and manually run the indicator today, before tomorrow's scheduled run, so that it uploads its complete output to S3 (a sketch of the reset follows the list). Two problems with this:

  1. The upload will take a while, since files are uploaded serially.
  2. We might have to note down this event somewhere if we ever want to reconstruct anything from the S3 object versioning history.
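
Something like the following (bucket and prefix names are hypothetical):

import boto3

# Sketch of clearing the jhu S3 cache before a manual full run. Bucket and
# prefix names are hypothetical. With bucket versioning enabled this only
# adds delete markers, so the object history needed for reconstruction
# (problem 2 above) is preserved.
BUCKET = "covidcast-cache"  # hypothetical
PREFIX = "jhu-csse/"        # hypothetical

bucket = boto3.resource("s3").Bucket(BUCKET)
bucket.objects.filter(Prefix=PREFIX).delete()  # batched DeleteObjects calls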
