Rescaling linguistic isolation #1750
Changes from 7 commits
The first hunk adds a new `drop_tracts` parameter to `_add_percentiles_to_df`:

```diff
@@ -264,6 +264,7 @@ def _add_percentiles_to_df(
         df: pd.DataFrame,
         input_column_name: str,
         output_column_name_root: str,
+        drop_tracts: list = None,
         ascending: bool = True,
     ) -> pd.DataFrame:
         """Creates percentiles.
```
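To make the new signature concrete, here is a minimal standalone sketch of the pattern this parameter enables. All names here (`add_percentiles`, `tract_id`) are hypothetical, not the project's actual identifiers; it only illustrates how an optional `drop_tracts` list switches between a plain percentile and an exclusion-aware one:

```python
from typing import Optional

import numpy as np
import pandas as pd


def add_percentiles(
    df: pd.DataFrame,
    input_column: str,
    output_column: str,
    drop_tracts: Optional[list] = None,  # mirrors the new parameter
    tract_id_column: str = "tract_id",   # hypothetical ID column name
    ascending: bool = True,
) -> pd.DataFrame:
    """Add a percentile column, optionally excluding some tracts entirely."""
    if not drop_tracts:
        # The "basic" percentile over every row.
        df[output_column] = df[input_column].rank(ascending=ascending, pct=True)
    else:
        # Rank within the kept and dropped groups separately, then blank out
        # the dropped group so excluded tracts carry no percentile at all.
        mask = df[tract_id_column].isin(drop_tracts)
        df[output_column] = df.groupby(mask)[input_column].rank(
            ascending=ascending, pct=True
        )
        df.loc[mask, output_column] = np.nan
    return df
```

With `drop_tracts` empty the behavior is unchanged; with a non-empty list, the excluded tracts no longer influence anyone else's percentile and end up as `NaN` themselves.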
|
@@ -298,10 +299,15 @@ def _add_percentiles_to_df( | |
something like "3rd grade reading proficiency" and `output_column_name_root` | ||
may be something like "Low 3rd grade reading proficiency". | ||
""" | ||
if ( | ||
output_column_name_root | ||
!= field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD | ||
): | ||
|
||
# We have two potential options for assessing how to calculate percentiles. | ||
# For the vast majority of columns, we will simply calculate percentiles overall. | ||
# However, for Linguistic Isolation and Agricultural Value Loss, there exist conditions | ||
# for which we drop out tracts from consideration in the percentile. More details on those | ||
# are below, for them, we provide a list of tracts to not include. | ||
# Because of the fancy transformations below, I have removed the urban / rural percentiles, | ||
# which are now deprecated. | ||
if not drop_tracts: | ||
# Create the "basic" percentile. | ||
df[ | ||
f"{output_column_name_root}" | ||
|
```diff
@@ -310,61 +316,21 @@ def _add_percentiles_to_df(
         else:
             # For agricultural loss, we are using whether there is value at all to determine percentile and then
-            # filling places where the value is False with 0
+            # filling places where the value is False with np.NaN (since it doesn't apply)
+            # For linguistic isolation, we drop PR and then recalculate. We fill PR with np.NaN
             df[
                 f"{output_column_name_root}"
                 f"{field_names.PERCENTILE_FIELD_SUFFIX}"
-            ] = (
-                df.where(
-                    df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
-                    == 1.0
-                )[input_column_name]
-                .rank(ascending=ascending, pct=True)
-                .fillna(
-                    df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
-                )
-            )
-
-            # Create the urban/rural percentiles.
-            urban_rural_percentile_fields_to_combine = []
-            for (urban_or_rural_string, urban_heuristic_bool) in [
-                ("urban", True),
-                ("rural", False),
-            ]:
-                # Create a field with only those values
-                this_category_only_value_field = (
-                    f"{input_column_name} (value {urban_or_rural_string} only)"
-                )
-                df[this_category_only_value_field] = np.where(
-                    df[field_names.URBAN_HEURISTIC_FIELD] == urban_heuristic_bool,
-                    df[input_column_name],
-                    None,
-                )
-
-                # Calculate the percentile for only this category
-                this_category_only_percentile_field = (
-                    f"{output_column_name_root} "
-                    f"(percentile {urban_or_rural_string} only)"
-                )
-                df[this_category_only_percentile_field] = df[
-                    this_category_only_value_field
-                ].rank(
-                    pct=True,
-                    # Set ascending to the parameter value.
-                    ascending=ascending,
-                )
-
-                # Add the field name to this list. Later, we'll combine this list.
-                urban_rural_percentile_fields_to_combine.append(
-                    this_category_only_percentile_field
-                )
-
-            # Combine both urban and rural into one field:
-            df[
-                f"{output_column_name_root}{field_names.PERCENTILE_URBAN_RURAL_FIELD_SUFFIX}"
-            ] = df.groupby(df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts))[
-                urban_rural_percentile_fields_to_combine
-            ].mean(axis=1, skipna=True)
+            ] = df.groupby(
+                df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts)
+            )[input_column_name].rank(ascending=ascending, pct=True)
+
+            df.loc[
+                df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts),
+                f"{output_column_name_root}"
+                f"{field_names.PERCENTILE_FIELD_SUFFIX}",
+            ] = np.NaN

         return df
```

On the new `groupby` expression, a reviewer commented:

> Can you add a comment explaining this code? Maybe I'm just not fully awake this morning, but I honestly have no idea what's going on in this code. Here's my cro-magnon attempt to understand what's happening: […]
>
> I'm honestly just very confused. It feels like maybe you're using the groupby as a "shortcut" to do something that's not traditionally an aggregation. But this code feels kind of error-prone because you're packing so much into a one-liner.

The author replied:

> Sure. So here's what is going on: For […]
>
> In an ideal world, we could do something with […]

The reviewer answered "Going with option 1," and the author confirmed: "Updated!"

On the `df.loc[...] = np.NaN` block, the reviewer also asked:

> Can you add some comments about this as well? I'm confused about what's going on. Why are you running the percentile before setting the values to […]

The author replied: "See comment above!"
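The reviewer's confusion is understandable: the `groupby` over a boolean mask is not an aggregation, it makes `rank(pct=True)` operate *within* the kept and dropped groups. A small sketch with toy data (column names made up, not the project's fields) shows the one-liner is equivalent to the more explicit "rank only the kept rows" approach:

```python
import numpy as np
import pandas as pd

# Toy data: three stateside tracts plus two tracts to exclude.
df = pd.DataFrame(
    {
        "tract_id": ["01", "02", "03", "72001", "72002"],
        "value": [10.0, 20.0, 30.0, 5.0, 50.0],
    }
)
drop_tracts = ["72001", "72002"]
mask = df["tract_id"].isin(drop_tracts)

# The pattern under review: grouping by the boolean mask makes
# rank(pct=True) compute percentiles separately inside each group, so
# the kept tracts' percentiles ignore the dropped tracts entirely...
grouped = df.groupby(mask)["value"].rank(ascending=True, pct=True)
grouped[mask] = np.nan  # ...and the dropped group is then blanked out.

# Equivalent, more explicit version: rank only the kept rows, then
# reindex, which leaves the dropped rows as NaN automatically.
explicit = (
    df.loc[~mask, "value"].rank(ascending=True, pct=True).reindex(df.index)
)

assert grouped.equals(explicit)
```

The explicit version arguably answers the "packing so much into a one-liner" concern, at the cost of an extra `reindex`; both yield percentiles of 1/3, 2/3, and 1.0 for the kept tracts and `NaN` for the dropped ones.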
The last hunk, in `_prepare_initial_df`, builds the `drop_tracts` lists for the two special cases:

```diff
@@ -523,13 +489,47 @@ def _prepare_initial_df(self) -> pd.DataFrame:
         df_copy[numeric_columns] = df_copy[numeric_columns].apply(pd.to_numeric)

-        # Convert all columns to numeric and do math
+        # Note that we have a few special conditions here, that we handle explicitly.
+        #
+        # For *Linguistic Isolation*, we do NOT want to include Puerto Rico in the percentile
+        # calculation. This is because linguistic isolation as a category doesn't make much sense
+        # in Puerto Rico, where Spanish is a recognized language. Thus, we construct a list
+        # of tracts to drop from the percentile calculation.
+        #
+        # For *Expected Agricultural Loss*, we only want to include in the percentile tracts
+        # in which there is some agricultural value. This helps us adjust the data such that we have
+        # the ability to discern which tracts truly are at the 90th percentile, since many tracts have 0 value.
+
+        for numeric_column in numeric_columns:
+            drop_tracts = []
+            if (
+                numeric_column
+                == field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD
+            ):
+                drop_tracts = df_copy[
+                    ~df_copy[field_names.AGRICULTURAL_VALUE_BOOL_FIELD]
+                    .astype(bool)
+                    .fillna(False)
+                ][field_names.GEOID_TRACT_FIELD].to_list()
+                logger.info(
+                    f"Dropping {len(drop_tracts)} tracts from Agricultural Value Loss"
+                )
+
+            elif numeric_column == field_names.LINGUISTIC_ISO_FIELD:
+                drop_tracts = df_copy[
+                    # 72 is the FIPS code for Puerto Rico
+                    df_copy[field_names.GEOID_TRACT_FIELD].str.startswith("72")
+                ][field_names.GEOID_TRACT_FIELD].to_list()
+                logger.info(
+                    f"Dropping {len(drop_tracts)} tracts from Linguistic Isolation"
+                )
+
+            df_copy = self._add_percentiles_to_df(
+                df=df_copy,
+                input_column_name=numeric_column,
+                # For this use case, the input name and output name root are the same.
+                output_column_name_root=numeric_column,
+                ascending=True,
+                drop_tracts=drop_tracts,
+            )

         # Min-max normalization:
```

On the new comment block, the reviewer wrote "Great comments!" and the author replied: "lol ty."
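The Puerto Rico branch of the loop can be sketched in isolation. The column name below is made up; the relevant fact (which the diff's own comment states) is that the first two characters of a tract GEOID are the state FIPS code, and 72 is Puerto Rico:

```python
import pandas as pd

# Toy tract table with GEOID-style IDs and a hypothetical column name.
df = pd.DataFrame(
    {
        "geoid_tract": ["01001020100", "06037101110", "72001956300"],
        "linguistic_isolation": [0.1, 0.3, 0.5],
    }
)

# Collect Puerto Rico tract IDs (FIPS prefix "72") so the percentile
# step can exclude them, as the PR does for Linguistic Isolation.
drop_tracts = df[
    df["geoid_tract"].str.startswith("72")
]["geoid_tract"].to_list()
# drop_tracts is now ["72001956300"]
```

The Agricultural Value Loss branch works the same way, except the filter is the negated boolean "has agricultural value" field rather than a GEOID prefix.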
A final exchange concerned the placement of the comment block:

> Is this comment now misplaced because it refers to agricultural loss?

The author replied:

> No, because agricultural loss is one of the two cases covered here. Does that make sense to you? I can adjust the comment to make it clearer.

The reviewer answered:

> Yeah, it seems like some adjustment could be useful. It also may not be necessary to mention the specific use cases inside this method: because you've abstracted this all to `drop_tracts`, I don't think you need to mention the use cases anymore.

The author confirmed: "Updated!"