Fix JHU prop signal #328
Conversation
Is there an opportunity for unit tests to prevent this from regressing again?
That's really worth reflecting on. What sort of tests could have caught this sort of issue? My first thought is the indicator validation and anomaly detection projects, which are aimed at related problems. My second is the qualitative indicator validation notebook I built here. One challenge in designing tests here is the lack of ground truth: since all reference data may be faulty, we will need to accept sensors with non-zero false positive rates. These sensors would ideally draw our attention to possible flaws for further analysis, but would not block merging. Ideally it would be easy to go from a sensor alert to an interactive REPL/notebook session exploring the suspect data. Another thought is to add correlation tests against a trusted set of indicators. I would like to learn more about established best practices in this area. @nmdefries @jsharpna, do you have thoughts on this? I'd love to compare notes on your related projects.
In a test case, you'd provide known input -- for example, input that says there are 1,000,000 cases -- and compare the result to what you'd expect. The ideal way to facilitate this would be to factor the proportion calculation out into its own function that takes counts and populations as arguments; then you could feed it known counts and populations and ensure it returns the right proportions.
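To make that concrete, here is a minimal sketch of what such a factored-out function and its test might look like (the function and test names are hypothetical, not the indicator's actual code):

```python
import numpy as np
import pandas as pd

def compute_prop(counts: pd.Series, populations: pd.Series) -> pd.Series:
    """Hypothetical helper: counts per 100,000 population, aligned on FIPS index."""
    return counts / populations * 100_000

def test_compute_prop_known_values():
    counts = pd.Series([1_000_000, 50], index=["06037", "06001"])
    populations = pd.Series([10_000_000, 100_000], index=["06037", "06001"])
    props = compute_prop(counts, populations)
    assert props.loc["06037"] == 10_000.0
    assert props.loc["06001"] == 50.0
    # A zero population (e.g. an unpopulated megafips row) would fail this check.
    assert np.isfinite(props).all()
```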
Ah, looks like we did have a prop test with synthetic population and values, but it didn't have any population for the state FIPS code.
```diff
@@ -391,6 +391,10 @@ def create_fips_population_table():
     df_pr = df_pr.groupby("fips").sum().reset_index()
     df_pr = df_pr[~df_pr["fips"].isin(census_pop["fips"])]
     census_pop_pr = pd.concat([census_pop, df_pr])
+
+    # Zero out the populations for the state FIPS codes XX000 to avoid double counting
+    megafips_codes = [str(x).zfill(2) + "000" for x in range(1, 73)]
```
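The diff is truncated here; the zeroing step it introduces would presumably look something like the following sketch (the population column name `pop` is an assumption, not the actual added line):

```python
# Sketch only -- the actual remaining added lines are cut off in the diff above.
# Zero out the megafips rows so they don't inflate state-level denominators.
census_pop_pr.loc[census_pop_pr["fips"].isin(megafips_codes), "pop"] = 0
```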
Won't this produce inf values for the megafips prop signals?
I don't think so. The denominator in JHU is obtained by merging a FIPS population dataframe onto the FIPS-level data, which is then aggregated to the state level. I verified this in this updated notebook, where I compare the new JHU state prop values with USAFacts. Rhode Island is the largest outlier because we are now using values from Unassigned that we weren't before.
Let me double check how other indicators like Safegraph do this calculation, though.
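To illustrate the denominator point with a toy example (invented numbers, not the indicator's actual code): with the megafips population zeroed, the megafips counts still flow into the state numerator, while the state denominator is just the sum of the county populations, so the state prop stays finite:

```python
import pandas as pd

# Two counties plus a zeroed-out megafips row for state "06" (toy numbers).
df = pd.DataFrame({
    "fips": ["06001", "06003", "06000"],
    "val":  [50.0, 30.0, 5.0],      # megafips carries Unassigned/Out of State counts
    "pop":  [100_000, 60_000, 0],   # megafips population zeroed
})
state = df.groupby(df["fips"].str[:2])[["val", "pop"]].sum()
state["prop"] = state["val"] / state["pop"] * 100_000
print(state)  # denominator is 160,000, so prop = 85 / 160,000 * 1e5 = 53.125, finite
```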
I meant county files -- don't we include Unassigned as a megafips for county?
We do, in fact:
```
/common/covidcast/archive/successful/jhu-csse $ zgrep -m1 "^[0-9][0-9]000" *_county_* | head
20200220_county_confirmed_7dav_cumulative_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_7dav_incidence_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_7dav_incidence_prop.csv.gz:01000,0.0,NA,NA
20200220_county_confirmed_7day_avg_cumulative_num.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_cumulative_prop.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_incidence_num.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_incidence_prop.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_cumulative_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_incidence_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_incidence_prop.csv.gz:01000,0.0,NA,NA
```
I have identified why the state FIPS codes started being added into the population totals now. It is because the GeoMapper updated the JHU UID -> FIPS mapping to keep the state FIPS codes (for the purposes of aggregating Unassigned and Out of State data). Previously, most of them were filtered out prior to joining the FIPS -> population file. So while the state FIPS codes are important to keep in the JHU UID -> FIPS mapping file and in the FIPS -> state mapping file (so that the Unassigned and Out of State values don't get filtered out), we need to set the state FIPS population to zero to avoid double counting when aggregating up to state.
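A toy illustration of the double counting (numbers invented): if the megafips row carries a full state-level population when it joins the population file, aggregating to state doubles the denominator and roughly halves the prop values:

```python
import pandas as pd

# Toy reproduction of the bug: the 06000 megafips row duplicates the state population.
pop = pd.DataFrame({
    "fips": ["06001", "06003", "06000"],
    "pop":  [100_000, 60_000, 160_000],  # 06000 repeats the county total
})
state_pop = pop.groupby(pop["fips"].str[:2])["pop"].sum()
print(state_pop)  # 320,000 instead of 160,000 -> state props come out halved
```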
Right, so Safegraph performs the aggregation to state similarly: it truncates the county FIPS code to its first two digits and then aggregates (the same grouping shown in the sketch above). It also appears that Safegraph provides Census block group mappings, which should probably move to the geocoding util. That's a separate issue, though.
I've confirmed that the state prop signals are now on par with usa-facts, but we are now getting inf values for megafips in the county files.
Definitely a sign that we need a few unit tests for the different combinations: state, county, megacounty, etc.
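For instance, something along these lines (a hypothetical sketch: `prop_signal` stands in for the factored-out calculation, and mapping zero-population megafips rows to NA rather than inf is one possible policy, not necessarily the one adopted):

```python
import numpy as np
import pandas as pd
import pytest

# Synthetic fixture: two counties plus a zero-population megafips row per state.
SYNTHETIC = pd.DataFrame({
    "fips": ["06001", "06003", "06000", "25001", "25000"],
    "val":  [50.0, 30.0, 5.0, 10.0, 2.0],
    "pop":  [100_000, 60_000, 0, 40_000, 0],
})

def prop_signal(df: pd.DataFrame, geo: str) -> pd.Series:
    """Hypothetical stand-in for the factored-out prop calculation."""
    if geo == "state":
        df = df.groupby(df["fips"].str[:2], as_index=False)[["val", "pop"]].sum()
    # Map zero populations to NA so division yields NA rather than inf.
    return df["val"] / df["pop"].replace(0, np.nan) * 100_000

@pytest.mark.parametrize("geo", ["county", "state"])
def test_no_infinite_props(geo):
    props = prop_signal(SYNTHETIC, geo)
    assert not np.isinf(props).any(), f"infinite prop values at {geo} level"
```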
@krivard I see what you mean. I wonder if we should filter those state FIPS codes out of the county .csvs, since for JHU specifically they currently represent a combination of Unassigned and Out of State signals only, which is not something a user would expect. On the other hand, it doesn't make sense to duplicate state-resolution values in the county files under state FIPS codes.
@dshemetov We don't have a clear picture of what users truly expect from a megafips cases signal. We started publishing Unassigned cases under the megafips some months ago because we wanted API users to have a way to retrieve those values. Merging in the Out of State counts might be counterintuitive in some sense, but it maintains access through the API to counts with less-certain locations, and it maintains the presence of megafips in the county signals. Two options:
I have a slight preference for the latter, but would yield to argument.
@krivard I see your point about not knowing what users expect there. I have not been able to get a clear picture from the JHU documentation of what actually goes in the Out of State bucket; Unassigned appears to contain a combination of counts that could not be localized to a particular county FIPS and probable deaths from state estimates. I can see the value in releasing the Unassigned category to users, but am less sure about the Out of State category. I'm also open to splitting Unassigned and Out of State into separate megaFIPS codes distinct from XX000, if we can find reasonable codes for them (e.g. XX888 and XX999 appear to be unclaimed for all FIPS codes; Puerto Rico's out-of-state and unassigned cases are in 72888 and 72999, respectively). If we do not remove the megafips codes XX000, the GeoMapper changes wouldn't be too bad.
That said, what do you think, @krivard? I could either keep the XX000 codes as described above or do the splitting into XX888/XX999 now (it should be a small change in the mapping files, but would require updating the API mailing list and the docs).
To fill the XX888/XX999 idea out some more: we would drop the XX000 county codes from the JHU -> FIPS map. The population file would stay the same and keep the state FIPS populations as well. This would be fine for JHU, since it doesn't normally store any data in XX000. Other indicators:
@dajmcdon What is most useful for forecasting in the handling of Unassigned and Out-of-State cases/deaths in the JHU county-level signals? We currently put Unassigned into the megafips/megacounty, but it's not clear (see above) whether Out-of-State should go in there too or be moved to a separate, unused FIPS code for that state.
This table may be helpful context for the whole JHU -> FIPS custom mapping.
Our goal is to use JHU as the "truth". The short answer, I think, if @dshemetov has the bandwidth, would be to check the script here to try to figure out what the Reich Lab is doing with them. I'm also curious how the 000 codes relate to the roll-up from the county level (do they include the megacounty or out-of-state?). I'm also going to tag @jsharpna so that he can tell you when everything I say is wrong.
The Reich Lab looks to be pulling their geocodes from this file, which contains 5-digit county FIPS codes and 2-digit state FIPS codes. The 2-digit state codes are transformed into the XX -> XX000 pattern here. What is mapped to those 2-digit state codes? Well, they cleverly use the `Province_State` column here to aggregate up to the state level, which lets them avoid a good chunk of our custom tables. The result is that it pools counties, Out of State, Unassigned, and another state category all into XX000. They do not appear to provide a separate category for Out of State and Unassigned. As a personal aside, I like
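In pandas terms, the Province_State roll-up described above would look roughly like this sketch (not the Reich Lab's actual script; the 11 leading metadata columns are an assumption about the public JHU US time-series CSV layout):

```python
import pandas as pd

# The raw JHU CSSE US confirmed-cases time series: one row per county, Unassigned,
# or "Out of [State]" unit, with a Province_State column identifying the state.
url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_US.csv")
raw = pd.read_csv(url)
date_cols = list(raw.columns[11:])  # date columns follow the metadata columns (assumed layout)
# Pooling by Province_State folds counties, Unassigned, and Out of [State] together,
# which is what ends up under the Reich Lab's XX000 codes.
state_totals = raw.groupby("Province_State")[date_cols].sum()
```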
It does have a nice energy/optimism. So if we want to predict the truth: do we target the XX000 signal, or do we need to try and grab the "Province_State" sums? Or do these magically equal each other?
Right, so it's the latter. Just to be clear, XX000 is a code pattern of our own devising that currently holds the Out of State and Unassigned categories from JHU. The Unassigned bin typically contains state governments' estimates of probable cases/deaths, though sometimes it contains confirmed values that just couldn't be localized to a county, as in Rhode Island. Out of State is another bin like that, but less well documented. Do you foresee having any need for these categories separate from their aggregation into states, e.g. in the confirmed cases time series?
* unnecessary int type casting of population
* add dropna flag and default it to a left merge
* drop XX000 FIPS when aggregating to state
* refactor pull.py - encapsulate into functions, clarify the diffing code with pandas built-ins, use geomapper for population
* remove unused static_file_dir param
* improve test all around, add a subset of real JHU data as test file
* tests: check for infinites, check to make sure the prop signals denominator matches the county sum total
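The denominator check in the last bullet can be expressed by backing the implied population out of the num and prop signals; roughly like this (illustrative frame names, and assuming prop = num / pop * 100,000):

```python
import numpy as np
import pandas as pd

# Illustrative inputs: county population table plus state-level num and prop signals.
county_pop = pd.DataFrame({"fips": ["06001", "06003"], "pop": [100_000, 60_000]})
state_num = pd.Series({"06": 85.0})                       # cases
state_prop = pd.Series({"06": 85.0 / 160_000 * 100_000})  # cases per 100k

# Back the implied denominator out of num / prop and compare to the county sum.
implied_pop = state_num / state_prop * 100_000
expected_pop = county_pop.groupby(county_pop["fips"].str[:2])["pop"].sum()
assert np.allclose(implied_pop.sort_index(), expected_pop.sort_index())
```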
Ok, so I'm still not sure what the right answer is, then. We want to predict the "truth". If the truth is XX000, then great. But if XX000 = JHU's Province_State and we don't have JHU's Province_State signal, then we need a fix. I don't believe that we actually need the bins separately (@bnaras or @capolitsch can be more definitive). Short term (like before Friday), we need something reasonable (whatever was there before is fine). But long term, wrapped up in some of the anomaly detection or data checking work, we should make sure that our truth and their truth are the same (unless Delphi doesn't believe their truth).
Excellent! Thank you both. For now, let's keep the XX000 codes and their previous definition of Unassigned + Out of State. Add the logic to drop XX000 population when aggregating up to state. I've created an issue to track a future effort to figure out whether we're doing the right thing long-term.
japprove!
Drop reference to absent list of exceptions
The current solution zeros out the state FIPS population in the new population source file. I don't quite understand yet why that occurred. Fixes #325.