Rescaling linguistic isolation #1750
Merged: emma-nechamkin merged 8 commits into emma-nechamkin/release/score-narwhal from emma-nechamkin/1748-linguistic-isolation-rescale on Aug 2, 2022
Commits
3ef5192 updating (emma-nechamkin)
f6e161f Merge branch 'emma-nechamkin/release/score-narwhal' of github.com:usd… (emma-nechamkin)
4da09e9 merge conflict resolved (emma-nechamkin)
04b0882 Merge branch 'emma-nechamkin/release/score-narwhal' of github.com:usd… (emma-nechamkin)
7566ac7 first pass (emma-nechamkin)
00137dd first clean draft (emma-nechamkin)
0f6e2d1 working currently (emma-nechamkin)
f0eae65 updated with version 1 (emma-nechamkin)
@@ -264,6 +264,7 @@ def _add_percentiles_to_df(
         df: pd.DataFrame,
         input_column_name: str,
         output_column_name_root: str,
+        drop_tracts: list = None,
         ascending: bool = True,
     ) -> pd.DataFrame:
         """Creates percentiles.
@@ -298,73 +299,39 @@ def _add_percentiles_to_df(
         something like "3rd grade reading proficiency" and `output_column_name_root`
         may be something like "Low 3rd grade reading proficiency".
         """
-        if (
-            output_column_name_root
-            != field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD
-        ):
+
+        # We have two potential options for assessing how to calculate percentiles.
+        # For the vast majority of columns, we will simply calculate percentiles overall.
+        # However, for Linguistic Isolation and Agricultural Value Loss, there exist conditions
+        # for which we drop out tracts from consideration in the percentile. More details on those
+        # are below, for them, we provide a list of tracts to not include.
+        # Because of the fancy transformations below, I have removed the urban / rural percentiles,
+        # which are now deprecated.
+        if not drop_tracts:
             # Create the "basic" percentile.
             df[
                 f"{output_column_name_root}"
                 f"{field_names.PERCENTILE_FIELD_SUFFIX}"
             ] = df[input_column_name].rank(pct=True, ascending=ascending)
 
         else:
-            # For agricultural loss, we are using whether there is value at all to determine percentile and then
-            # filling places where the value is False with 0
+            tmp_series = df[input_column_name].where(
+                ~df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts),
+                np.nan,
+            )
+            logger.info(
+                f"Creating special case column for percentiles from {input_column_name}"
+            )
             df[
                 f"{output_column_name_root}"
                 f"{field_names.PERCENTILE_FIELD_SUFFIX}"
-            ] = (
-                df.where(
-                    df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
-                    == 1.0
-                )[input_column_name]
-                .rank(ascending=ascending, pct=True)
-                .fillna(
-                    df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
-                )
-            )
-
-        # Create the urban/rural percentiles.
-        urban_rural_percentile_fields_to_combine = []
-        for (urban_or_rural_string, urban_heuristic_bool) in [
-            ("urban", True),
-            ("rural", False),
-        ]:
-            # Create a field with only those values
-            this_category_only_value_field = (
-                f"{input_column_name} (value {urban_or_rural_string} only)"
-            )
-            df[this_category_only_value_field] = np.where(
-                df[field_names.URBAN_HEURISTIC_FIELD] == urban_heuristic_bool,
-                df[input_column_name],
-                None,
-            )
+            ] = tmp_series.rank(ascending=ascending, pct=True)
 
-            # Calculate the percentile for only this category
-            this_category_only_percentile_field = (
-                f"{output_column_name_root} "
-                f"(percentile {urban_or_rural_string} only)"
-            )
-            df[this_category_only_percentile_field] = df[
-                this_category_only_value_field
-            ].rank(
-                pct=True,
-                # Set ascending to the parameter value.
-                ascending=ascending,
-            )
-
-            # Add the field name to this list. Later, we'll combine this list.
-            urban_rural_percentile_fields_to_combine.append(
-                this_category_only_percentile_field
-            )
-
-        # Combine both urban and rural into one field:
-        df[
-            f"{output_column_name_root}{field_names.PERCENTILE_URBAN_RURAL_FIELD_SUFFIX}"
-        ] = df[urban_rural_percentile_fields_to_combine].mean(
-            axis=1, skipna=True
-        )
+            # Check that "drop tracts" were dropped (quicker than creating a fixture?)
+            assert df[df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts)][
+                f"{output_column_name_root}"
+                f"{field_names.PERCENTILE_FIELD_SUFFIX}"
+            ].isna().sum() == len(drop_tracts), "Not all tracts were dropped"
 
         return df
 
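To illustrate what the new `else` branch does, here is a minimal sketch on a toy frame. The column names and tract IDs are made up for illustration (the PR reads them from `field_names`), and the frame is not real score data. It compares the "basic" percentile with the one computed after masking the dropped tracts.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the PR uses constants from `field_names`.
TRACT_COL = "GEOID10_TRACT"
VALUE_COL = "Linguistic isolation (percent)"

# Toy data: four made-up tracts, one of which starts with "72" (Puerto Rico).
df = pd.DataFrame(
    {
        TRACT_COL: ["01001020100", "06037101110", "72001956300", "36061000100"],
        VALUE_COL: [0.10, 0.30, 0.90, 0.20],
    }
)
drop_tracts = ["72001956300"]

# "Basic" percentile: every tract takes part in the ranking.
basic = df[VALUE_COL].rank(pct=True, ascending=True)

# Special case: mask the dropped tracts to NaN, then rank.
# `rank` leaves NaN as NaN and divides by the number of non-NaN values,
# so the remaining tracts are ranked only against each other.
masked = df[VALUE_COL].where(~df[TRACT_COL].isin(drop_tracts), np.nan)
rescaled = masked.rank(pct=True, ascending=True)

# Same style of sanity check as the assert added in the PR.
assert rescaled[df[TRACT_COL].isin(drop_tracts)].isna().sum() == len(drop_tracts)

print(pd.DataFrame({"basic": basic, "rescaled": rescaled}))
```

In this toy case the second tract moves from the 75th percentile to the 100th once the Puerto Rico tract is excluded, which is the rescaling the PR title refers to.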
@@ -523,13 +490,47 @@ def _prepare_initial_df(self) -> pd.DataFrame:
         df_copy[numeric_columns] = df_copy[numeric_columns].apply(pd.to_numeric)
 
         # Convert all columns to numeric and do math
+        # Note that we have a few special conditions here, that we handle explicitly.
+        # For *Linguistic Isolation*, we do NOT want to include Puerto Rico in the percentile
+        # calculation. This is because linguistic isolation as a category doesn't make much sense
+        # in Puerto Rico, where Spanish is a recognized language. Thus, we construct a list
+        # of tracts to drop from the percentile calculation.
+        #
+        # For *Expected Agricultural Loss*, we only want to include in the percentile tracts

Great comments!
lol ty

+        # in which there is some agricultural value. This helps us adjust the data such that we have
+        # the ability to discern which tracts truly are at the 90th percentile, since many tracts have 0 value.
+
         for numeric_column in numeric_columns:
+            drop_tracts = []
+            if (
+                numeric_column
+                == field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD
+            ):
+                drop_tracts = df_copy[
+                    ~df_copy[field_names.AGRICULTURAL_VALUE_BOOL_FIELD]
+                    .astype(bool)
+                    .fillna(False)
+                ][field_names.GEOID_TRACT_FIELD].to_list()
+                logger.info(
+                    f"Dropping {len(drop_tracts)} tracts from Agricultural Value Loss"
+                )
+
+            elif numeric_column == field_names.LINGUISTIC_ISO_FIELD:
+                drop_tracts = df_copy[
+                    # 72 is the FIPS code for Puerto Rico
+                    df_copy[field_names.GEOID_TRACT_FIELD].str.startswith("72")
+                ][field_names.GEOID_TRACT_FIELD].to_list()
+                logger.info(
+                    f"Dropping {len(drop_tracts)} tracts from Linguistic Isolation"
+                )
+
             df_copy = self._add_percentiles_to_df(
                 df=df_copy,
                 input_column_name=numeric_column,
                 # For this use case, the input name and output name root are the same.
                 output_column_name_root=numeric_column,
                 ascending=True,
+                drop_tracts=drop_tracts,
             )
 
         # Min-max normalization:
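The two drop lists built in this hunk can be sketched on a toy frame as follows. The column names, tract IDs, and flag values are hypothetical stand-ins for the `field_names` constants and the real data. The sketch also assumes the agricultural flag may be stored as floats with missing values, which the diff does not confirm. Under that assumption, the order of `astype(bool)` and the fill step matters, because pandas converts a float NaN to `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical column names standing in for the `field_names` constants.
TRACT_COL = "GEOID10_TRACT"
AG_FLAG_COL = "Contains agricultural value"

# Toy data. Tract IDs are strings so leading zeros and prefixes survive.
df = pd.DataFrame(
    {
        TRACT_COL: ["01001020100", "72001956300", "72003959700", "06037101110"],
        AG_FLAG_COL: [1.0, 0.0, np.nan, 0.0],  # assumption: float flag with a gap
    }
)

# Linguistic isolation: drop every tract whose state FIPS prefix is "72".
puerto_rico_tracts = df[df[TRACT_COL].str.startswith("72")][TRACT_COL].to_list()

# Agricultural loss: drop tracts with no agricultural value.
# Filling missing flags *before* the bool cast treats "unknown" as "no value".
has_ag_value = df[AG_FLAG_COL].fillna(0).astype(bool)
no_ag_value_tracts = df[~has_ag_value][TRACT_COL].to_list()

# For comparison: casting first turns NaN into True, so a later fillna
# has nothing left to fill and the tract with the missing flag is kept.
cast_first = df[AG_FLAG_COL].astype(bool).to_list()

print(puerto_rico_tracts)
print(no_ag_value_tracts)
print(cast_first)
```

Whether this ordering matters for the merged code depends on the actual dtype of the flag column. If it is already a plain boolean column with no missing values, both orderings give the same list.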
Added this instead of a formal test: this sort of test doesn't seem to fit nicely within our fixture testing framework, and these sorts of assert statements are compatible with, and match, the rest of the codebase.
Makes sense! Appreciate the spirit of testing even without a formal testing framework.
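For reference, a standalone version of that check might look roughly like the sketch below. The helper `percentile_excluding` and the test name are hypothetical and not part of the codebase. It uses plain asserts, so it could run under pytest or as a script without the fixture framework.

```python
import numpy as np
import pandas as pd


def percentile_excluding(
    values: pd.Series, tract_ids: pd.Series, drop_tracts: list
) -> pd.Series:
    """Rank `values` as percentiles, leaving tracts in `drop_tracts` as NaN.

    Hypothetical helper that mirrors the masking-then-ranking step in the PR.
    """
    masked = values.where(~tract_ids.isin(drop_tracts), np.nan)
    return masked.rank(pct=True, ascending=True)


def test_dropped_tracts_have_no_percentile():
    # Made-up tract IDs and values, for illustration only.
    tract_ids = pd.Series(["01001020100", "72001956300", "06037101110"])
    values = pd.Series([0.2, 0.9, 0.4])
    drop_tracts = ["72001956300"]

    result = percentile_excluding(values, tract_ids, drop_tracts)

    # Dropped tracts get no percentile ...
    assert result[tract_ids.isin(drop_tracts)].isna().all()
    # ... and the remaining tracts are ranked among themselves only.
    assert result[~tract_ids.isin(drop_tracts)].to_list() == [0.5, 1.0]


test_dropped_tracts_have_no_percentile()
```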