Fix JHU prop signal #328
Conversation
Is there an opportunity for unit tests to prevent this from regressing again?
That's really worth reflecting on. What sort of tests could have caught this sort of issue? My first thought is the indicator validation and anomaly detection projects, which are aimed at related problems. My second is the qualitative indicator validation notebook I built here. One challenge in designing tests here is the lack of ground truth: since all reference data may be faulty, we will need to accept sensors with non-zero false positive rates. These sensors would ideally draw our attention to possible flaws for further analysis, but would not block merging. Ideally it would be easy to go from a sensor alert to an interactive REPL/notebook session exploring the suspect data. Another thought is to add correlation tests against a trusted set of indicators. I would like to learn more about established best practices in this area. @nmdefries @jsharpna, do you have thoughts on this? I'd love to compare notes on your related projects.
In a test case, you'd provide known input -- for example, input that says there are 1,000,000 cases -- and compare the result to what you'd expect. The ideal way to facilitate this would be to factor the proportion calculation out into its own function that takes counts and populations as arguments; then you could feed it known counts and populations and ensure it returns the right proportions.
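To make that concrete, here is a minimal sketch of what such a factored-out function and its test might look like (the function and test names are hypothetical, not the indicator's actual code):

```python
import numpy as np
import pandas as pd

def compute_prop(counts: pd.Series, populations: pd.Series) -> pd.Series:
    """Hypothetical helper: counts per 100,000 population, aligned on FIPS index."""
    return counts / populations * 100_000

def test_compute_prop_known_values():
    counts = pd.Series([1_000_000, 50], index=["06037", "06001"])
    populations = pd.Series([10_000_000, 100_000], index=["06037", "06001"])
    props = compute_prop(counts, populations)
    assert props.loc["06037"] == 10_000.0
    assert props.loc["06001"] == 50.0
    # A zero population (e.g. an unpopulated megafips row) would fail this check.
    assert np.isfinite(props).all()
```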
Ah, looks like we did have a prop test with synthetic population and values, but it didn't have any population for the state FIPS code.
```diff
@@ -391,6 +391,10 @@ def create_fips_population_table():
     df_pr = df_pr.groupby("fips").sum().reset_index()
     df_pr = df_pr[~df_pr["fips"].isin(census_pop["fips"])]
     census_pop_pr = pd.concat([census_pop, df_pr])
+
+    # Zero out the populations for the state FIPS codes XX000 to avoid double counting
+    megafips_codes = [str(x).zfill(2) + "000" for x in range(1, 73)]
```
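The diff is truncated here; the zeroing step it introduces would presumably look something like the following sketch (the population column name `pop` is an assumption, not the actual added line):

```python
# Sketch only -- the actual remaining added lines are cut off in the diff above.
# Zero out the megafips rows so they don't inflate state-level denominators.
census_pop_pr.loc[census_pop_pr["fips"].isin(megafips_codes), "pop"] = 0
```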
Won't this produce inf values for the megafips prop signals?
I don't think so. The denominator in JHU is obtained by merging a FIPS population dataframe onto the FIPS-level data, which is then aggregated to the state level. I verified this in this updated notebook, where I compare the new JHU state prop values with USAFacts. Rhode Island is the largest outlier because we are now using values from Unassigned that we weren't before.
Let me double check how other indicators like Safegraph do this calculation, though.
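To illustrate the denominator point with a toy example (invented numbers, not the indicator's actual code): with the megafips population zeroed, the megafips counts still flow into the state numerator, while the state denominator is just the sum of the county populations, so the state prop stays finite:

```python
import pandas as pd

# Two counties plus a zeroed-out megafips row for state "06" (toy numbers).
df = pd.DataFrame({
    "fips": ["06001", "06003", "06000"],
    "val":  [50.0, 30.0, 5.0],      # megafips carries Unassigned/Out of State counts
    "pop":  [100_000, 60_000, 0],   # megafips population zeroed
})
state = df.groupby(df["fips"].str[:2])[["val", "pop"]].sum()
state["prop"] = state["val"] / state["pop"] * 100_000
print(state)  # denominator is 160,000, so prop = 85 / 160,000 * 1e5 = 53.125, finite
```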
I meant county files -- don't we include Unassigned as a megafips for county?
We do, in fact:
```
/common/covidcast/archive/successful/jhu-csse $ zgrep -m1 "^[0-9][0-9]000" *_county_* | head
20200220_county_confirmed_7dav_cumulative_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_7dav_incidence_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_7dav_incidence_prop.csv.gz:01000,0.0,NA,NA
20200220_county_confirmed_7day_avg_cumulative_num.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_cumulative_prop.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_incidence_num.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_7day_avg_incidence_prop.csv.gz:72000,0.0,NA,NA
20200220_county_confirmed_cumulative_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_incidence_num.csv.gz:60000,0.0,NA,NA
20200220_county_confirmed_incidence_prop.csv.gz:01000,0.0,NA,NA
```
I have identified why the state FIPS codes started being added into the population totals now. It is because the GeoMapper updated the JHU UID -> FIPS mapping to keep the state FIPS codes (for the purposes of aggregating Unassigned and Out of State data). Previously, most of them were filtered out prior to joining the FIPS -> population file. So while the state FIPS codes are important to keep in the JHU UID -> FIPS mapping file and in the FIPS -> state mapping file (so that the Unassigned and Out of State values don't get filtered out), we need to set the state FIPS population to zero to avoid double counting when aggregating up to state.
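A toy illustration of the double counting (numbers invented): if the megafips row carries a full state-level population when it joins the population file, aggregating to state doubles the denominator and roughly halves the prop values:

```python
import pandas as pd

# Toy reproduction of the bug: the 06000 megafips row duplicates the state population.
pop = pd.DataFrame({
    "fips": ["06001", "06003", "06000"],
    "pop":  [100_000, 60_000, 160_000],  # 06000 repeats the county total
})
state_pop = pop.groupby(pop["fips"].str[:2])["pop"].sum()
print(state_pop)  # 320,000 instead of 160,000 -> state props come out halved
```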
Right, so Safegraph performs the aggregation to state similarly: it truncates the county FIPS code to its first two digits and then aggregates (the same grouping shown in the sketch above). It also appears that Safegraph provides Census block group mappings, which should probably move to the geocoding util. That's a separate issue, though.
I've confirmed that the state prop signals are now on par with usa-facts, but we are now getting inf values for megafips in the county files.
Definitely a sign that we need a few unit tests for the different combinations: state, county, megacounty, etc.
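For instance, something along these lines (a hypothetical sketch: `prop_signal` stands in for the factored-out calculation, and mapping zero-population megafips rows to NA rather than inf is one possible policy, not necessarily the one adopted):

```python
import numpy as np
import pandas as pd
import pytest

# Synthetic fixture: two counties plus a zero-population megafips row per state.
SYNTHETIC = pd.DataFrame({
    "fips": ["06001", "06003", "06000", "25001", "25000"],
    "val":  [50.0, 30.0, 5.0, 10.0, 2.0],
    "pop":  [100_000, 60_000, 0, 40_000, 0],
})

def prop_signal(df: pd.DataFrame, geo: str) -> pd.Series:
    """Hypothetical stand-in for the factored-out prop calculation."""
    if geo == "state":
        df = df.groupby(df["fips"].str[:2], as_index=False)[["val", "pop"]].sum()
    # Map zero populations to NA so division yields NA rather than inf.
    return df["val"] / df["pop"].replace(0, np.nan) * 100_000

@pytest.mark.parametrize("geo", ["county", "state"])
def test_no_infinite_props(geo):
    props = prop_signal(SYNTHETIC, geo)
    assert not np.isinf(props).any(), f"infinite prop values at {geo} level"
```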
@krivard I see what you mean. I wonder if we should filter those state FIPS codes out of the county .csvs, since for JHU specifically they currently represent a combination of Unassigned and Out of State signals only, which is not something a user would expect. On the other hand, it doesn't make sense to duplicate state-resolution values in the county files under state FIPS codes.
@dshemetov We don't have a clear picture of what users truly expect from a megafips cases signal. We started publishing Unassigned cases under the megafips some months ago because we wanted API users to have a way to retrieve those values. Merging in the Out of State counts might be counterintuitive in some sense, but it maintains access through the API to counts with less-certain locations, and it maintains the presence of megafips in the county signals. Two options:
I have a slight preference for the latter, but would yield to argument.
@krivard I see your point about not knowing what users expect there. I have not been able to get a clear picture from the JHU documentation of what actually goes in the Out of State bucket; Unassigned appears to contain a combination of counts that could not be localized to a particular county FIPS and probable deaths from state estimates. I can see the value in releasing the Unassigned category to users, but am less sure about the Out of State category. I'm also open to splitting Unassigned and Out of State into separate megaFIPS codes distinct from XX000, if we can find reasonable codes for them (e.g. XX888 and XX999 appear to be unclaimed for all FIPS codes; Puerto Rico's out-of-state and unassigned cases are in 72888 and 72999, respectively). If we do not remove the megafips codes XX000, the GeoMapper changes wouldn't be too bad.
That said, what do you think, @krivard? I could either keep the XX000 codes as described above or do the splitting into XX888/XX999 now (it should be a small change in the mapping files, but would require updating the API mailing list and the docs).
To fill the XX888/XX999 idea out some more: we would drop the XX000 county codes from the JHU -> FIPS map. The population file would stay the same and keep the state FIPS populations as well. This would be fine for JHU, since it doesn't normally store any data in XX000. Other indicators:
@dajmcdon What is most useful for forecasting in the handling of Unassigned and Out-of-State cases/deaths in the JHU county-level signals? We currently put Unassigned into the megafips/megacounty, but it's not clear (see above) whether Out-of-State should go in there too or be moved to a separate, unused FIPS code for that state.
This table may be helpful context for the whole JHU -> FIPS custom mapping.
Our goal is to use JHU as the "truth". The short answer, I think, if @dshemetov has the bandwidth, would be to check the script here to try to figure out what the Reich Lab is doing with them. I'm also curious how the 000 codes relate to the roll-up from the county level (do they include the megacounty or out-of-state?). I'm also going to tag @jsharpna so that he can tell you when everything I say is wrong.
The Reich Lab looks to be pulling their geocodes from this file, which contains 5-digit county FIPS codes and 2-digit state FIPS codes. The 2-digit state codes are transformed into the XX -> XX000 pattern here. What is mapped to those 2-digit state codes? Well, they cleverly use the `Province_State` column here to aggregate up to the state level, which lets them avoid a good chunk of our custom tables. The result is that it pools counties, Out of State, Unassigned, and another state category all into XX000. They do not appear to provide a separate category for Out of State and Unassigned. As a personal aside, I like
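In pandas terms, the Province_State roll-up described above would look roughly like this sketch (not the Reich Lab's actual script; the 11 leading metadata columns are an assumption about the public JHU US time-series CSV layout):

```python
import pandas as pd

# The raw JHU CSSE US confirmed-cases time series: one row per county, Unassigned,
# or "Out of [State]" unit, with a Province_State column identifying the state.
url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_US.csv")
raw = pd.read_csv(url)
date_cols = list(raw.columns[11:])  # date columns follow the metadata columns (assumed layout)
# Pooling by Province_State folds counties, Unassigned, and Out of [State] together,
# which is what ends up under the Reich Lab's XX000 codes.
state_totals = raw.groupby("Province_State")[date_cols].sum()
```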
It does have a nice energy/optimism. So if we want to predict the truth: do we target the XX000 signal, or do we need to try and grab the "Province_State" sums? Or do these magically equal each other?
Right, so it's the latter. Just to be clear, XX000 is a code pattern of our own devising that currently holds the Out of State and Unassigned categories from JHU. The Unassigned bin typically contains state governments' estimates of probable cases/deaths, though sometimes it contains confirmed values that just couldn't be localized to a county, as in Rhode Island. Out of State is another bin like that, but less well documented. Do you foresee having any need for these categories separate from their aggregation into states, e.g. in the confirmed cases time series?
* unnecessary int type casting of population
* add dropna flag and default it to a left merge
* drop XX000 FIPS when aggregating to state
* refactor pull.py - encapsulate into functions, clarify the diffing code with pandas built-ins, use geomapper for population
* remove unused static_file_dir param
* improve test all around, add a subset of real JHU data as test file
* tests: check for infinites, check to make sure the prop signals denominator matches the county sum total
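The denominator check in the last bullet can be expressed by backing the implied population out of the num and prop signals; roughly like this (illustrative frame names, and assuming prop = num / pop * 100,000):

```python
import numpy as np
import pandas as pd

# Illustrative inputs: county population table plus state-level num and prop signals.
county_pop = pd.DataFrame({"fips": ["06001", "06003"], "pop": [100_000, 60_000]})
state_num = pd.Series({"06": 85.0})                       # cases
state_prop = pd.Series({"06": 85.0 / 160_000 * 100_000})  # cases per 100k

# Back the implied denominator out of num / prop and compare to the county sum.
implied_pop = state_num / state_prop * 100_000
expected_pop = county_pop.groupby(county_pop["fips"].str[:2])["pop"].sum()
assert np.allclose(implied_pop.sort_index(), expected_pop.sort_index())
```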
Ok, so I'm still not sure what the right answer is, then. We want to predict the "truth". If the truth is XX000, then great. But if XX000 = JHU's Province_State and we don't have JHU's Province_State signal, then we need a fix. I don't believe that we actually need the bins separately (@bnaras or @capolitsch can be more definitive). Short term (like before Friday), we need something reasonable (whatever was there before is fine). But long term, wrapped up in some of the anomaly detection or data checking work, we should make sure that our truth and their truth are the same (unless Delphi doesn't believe their truth).
Excellent! Thank you both. For now, let's keep the XX000 codes and their previous definition of Unassigned + Out of State. Add the logic to drop XX000 population when aggregating up to state. I've created an issue to track a future effort to figure out whether we're doing the right thing long-term.
japprove!
Drop reference to absent list of exceptions
The current solution zeros out the state FIPS population in the new population source file. I don't quite understand yet why that occurred. Fixes #325.