
Rescaling linguistic isolation #1750

Merged
Changes from 7 commits
108 changes: 54 additions & 54 deletions data/data-pipeline/data_pipeline/etl/score/etl_score.py
@@ -264,6 +264,7 @@ def _add_percentiles_to_df(
df: pd.DataFrame,
input_column_name: str,
output_column_name_root: str,
drop_tracts: list = None,
ascending: bool = True,
) -> pd.DataFrame:
"""Creates percentiles.
@@ -298,10 +299,15 @@ def _add_percentiles_to_df(
something like "3rd grade reading proficiency" and `output_column_name_root`
may be something like "Low 3rd grade reading proficiency".
"""
if (
output_column_name_root
!= field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD
):

# There are two ways we can calculate percentiles.
# For the vast majority of columns, we simply calculate percentiles over all tracts.
# However, for Linguistic Isolation and Agricultural Value Loss, there are conditions
# under which we drop tracts from consideration in the percentile. For those columns,
# the caller provides a list of tracts to exclude; details are below.
# Because of these transformations, the urban / rural percentiles have been removed
# and are now deprecated.
if not drop_tracts:
# Create the "basic" percentile.
df[
f"{output_column_name_root}"
Expand All @@ -310,61 +316,21 @@ def _add_percentiles_to_df(

else:
# For agricultural loss, we are using whether there is value at all to determine percentile and then
Contributor:
Is this comment now misplaced because it refers to agricultural loss?

Contributor Author:
No, because agricultural loss is one of the two cases covered here. Does that make sense to you? I can adjust the comment to make it clearer.

Contributor:
Yeah, it seems like some adjustment could be useful. It also may not be necessary to mention the specific use cases inside this method. Because you've abstracted this all to drop_tracts, I don't think you need to mention the use cases anymore.

Contributor Author:
Updated!

# filling places where the value is False with 0
# filling places where the value is False with np.NaN (since it doesn't apply)
# For linguistic isolation, we drop PR and then recalculate. We fill PR with np.NaN
df[
f"{output_column_name_root}"
f"{field_names.PERCENTILE_FIELD_SUFFIX}"
] = (
df.where(
df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
== 1.0
)[input_column_name]
.rank(ascending=ascending, pct=True)
.fillna(
df[field_names.AGRICULTURAL_VALUE_BOOL_FIELD].astype(float)
)
)

# Create the urban/rural percentiles.
urban_rural_percentile_fields_to_combine = []
for (urban_or_rural_string, urban_heuristic_bool) in [
("urban", True),
("rural", False),
]:
# Create a field with only those values
this_category_only_value_field = (
f"{input_column_name} (value {urban_or_rural_string} only)"
)
df[this_category_only_value_field] = np.where(
df[field_names.URBAN_HEURISTIC_FIELD] == urban_heuristic_bool,
df[input_column_name],
None,
)

# Calculate the percentile for only this category
this_category_only_percentile_field = (
f"{output_column_name_root} "
f"(percentile {urban_or_rural_string} only)"
)
df[this_category_only_percentile_field] = df[
this_category_only_value_field
] = df.groupby(df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts))[
Contributor:
Can you add a comment explaining this code? Maybe I'm just not fully awake this morning, but I honestly have no idea what's going on in this code.

Here's my cro-magnon attempt to understand what's happening:

  1. You're limiting the data frame to only include tracts that you're intentionally trying to drop? Why? Wouldn't it be the other way?

  2. Then you're running a groupby -- I guess grouping by each of the tracts that are in drop_tracts? Why?

  3. Then you're only selecting one column (the input), makes sense.

  4. Running a percentile on this.

  5. Setting the percentile to the output column name / percentile field.

I'm honestly just very confused. It feels like maybe you're using the groupby as a "shortcut" to do something that's not traditionally an aggregation. This code also feels error prone because you're packing so much into a one-liner.

Contributor Author:
Sure. Here's what is going on:

For drop_tracts, we want to keep the raw data (e.g., the % of people in linguistic isolation in PR) but not have it included in the percentile. I strongly think we need to keep the raw data because our output should include all of the raw data possible. There are two good ways we could do this:

  1. We could make a new column that has NaN for all drop_tracts tract IDs, and then build the percentile on that. I opted not to do that because we have so many columns: it's more verbose, it's slower, and (personally) I think it's less clear, though that's subjective. I am happy to do this, though, if you have a strong preference.

  2. We can group by membership in drop_tracts and then calculate percentiles for the variables. This lets us calculate percentiles separately within drop_tracts and within the tracts that are not in drop_tracts, without the headache of making a new column (but what's super clear to me, a person who spends my entire life constructing groupby statements, may be less clear to everyone else!). Then, because we want the drop_tracts to have no percentiles, we null those percentiles out.

In an ideal world, we could do something with where and rank and have it be crisp and clean, but rank is quite finicky and the code to do this ends up being kind of a disaster.

Contributor Author:
Going with option 1.

Contributor Author:
Updated!
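[Editor's note: to make the two options concrete, here is a minimal, self-contained sketch of option 1 (mask the drop_tracts values to NaN, then rank). The tract IDs and column names are made up for illustration and are not the pipeline's real field names.]

```python
import pandas as pd

# Hypothetical miniature frame: the third tract is in Puerto Rico and
# should keep its raw value but receive no percentile.
df = pd.DataFrame(
    {
        "GEOID10_TRACT": ["01001020100", "06037206020", "72001950100"],
        "Linguistic isolation (percent)": [0.05, 0.20, 0.60],
    }
)
drop_tracts = ["72001950100"]

# Option 1: null out the raw values for drop_tracts in a temporary series,
# then rank. rank() skips NaN inputs and yields NaN for them, so the
# dropped tract ends up with no percentile automatically, and the raw
# column is left untouched.
masked = df["Linguistic isolation (percent)"].where(
    ~df["GEOID10_TRACT"].isin(drop_tracts)
)
df["Linguistic isolation (percentile)"] = masked.rank(
    ascending=True, pct=True
)
```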

input_column_name
].rank(
pct=True,
# Set ascending to the parameter value.
ascending=ascending,
ascending=ascending, pct=True
)

# Add the field name to this list. Later, we'll combine this list.
urban_rural_percentile_fields_to_combine.append(
this_category_only_percentile_field
)

# Combine both urban and rural into one field:
df[
f"{output_column_name_root}{field_names.PERCENTILE_URBAN_RURAL_FIELD_SUFFIX}"
] = df[urban_rural_percentile_fields_to_combine].mean(
axis=1, skipna=True
)
df.loc[
df[field_names.GEOID_TRACT_FIELD].isin(drop_tracts),
Contributor:
Can you add some comments about this as well? I'm confused about what's going on. Why are you running the percentile before setting the values to None, and then afterwards setting it to None for those drop_tracts?

Contributor Author:
See comment above!
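[Editor's note: for reference, here is a minimal, self-contained sketch of the groupby approach (option 2) that this thread is asking about, again with hypothetical names. It shows why the rank runs first and the drop_tracts rows are only nulled out afterwards.]

```python
import numpy as np
import pandas as pd

# Hypothetical data: the Puerto Rico tract should get no percentile.
df = pd.DataFrame(
    {
        "GEOID10_TRACT": ["01001020100", "06037206020", "72001950100"],
        "Linguistic isolation (percent)": [0.05, 0.20, 0.60],
    }
)
drop_tracts = ["72001950100"]

# Group by membership in drop_tracts: rank() then computes percentiles
# separately within the kept group and within the dropped group.
in_drop = df["GEOID10_TRACT"].isin(drop_tracts)
df["Linguistic isolation (percentile)"] = df.groupby(in_drop)[
    "Linguistic isolation (percent)"
].rank(ascending=True, pct=True)

# The dropped group received within-group percentiles that are meaningless,
# which is why the code ranks first and only then nulls those rows out.
df.loc[in_drop, "Linguistic isolation (percentile)"] = np.nan
```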

f"{output_column_name_root}"
f"{field_names.PERCENTILE_FIELD_SUFFIX}",
] = np.NaN

return df

@@ -523,13 +489,47 @@ def _prepare_initial_df(self) -> pd.DataFrame:
df_copy[numeric_columns] = df_copy[numeric_columns].apply(pd.to_numeric)

# Convert all columns to numeric and do math
# Note that we have a few special conditions here that we handle explicitly.
# For *Linguistic Isolation*, we do NOT want to include Puerto Rico in the percentile
# calculation. This is because linguistic isolation as a category doesn't make much sense
# in Puerto Rico, where Spanish is a recognized language. Thus, we construct a list
# of tracts to drop from the percentile calculation.
#
# For *Expected Agricultural Loss*, we only want to include in the percentile tracts
Contributor:
Great comments!

Contributor Author:
lol ty

# in which there is some agricultural value. This helps us adjust the data such that we have
# the ability to discern which tracts truly are at the 90th percentile, since many tracts have 0 value.

for numeric_column in numeric_columns:
drop_tracts = []
if (
numeric_column
== field_names.EXPECTED_AGRICULTURE_LOSS_RATE_FIELD
):
drop_tracts = df_copy[
~df_copy[field_names.AGRICULTURAL_VALUE_BOOL_FIELD]
.astype(bool)
.fillna(False)
][field_names.GEOID_TRACT_FIELD].to_list()
logger.info(
f"Dropping {len(drop_tracts)} tracts from Agricultural Value Loss"
)

elif numeric_column == field_names.LINGUISTIC_ISO_FIELD:
drop_tracts = df_copy[
# 72 is the FIPS code for Puerto Rico
df_copy[field_names.GEOID_TRACT_FIELD].str.startswith("72")
][field_names.GEOID_TRACT_FIELD].to_list()
logger.info(
f"Dropping {len(drop_tracts)} tracts from Linguistic Isolation"
)

df_copy = self._add_percentiles_to_df(
df=df_copy,
input_column_name=numeric_column,
# For this use case, the input name and output name root are the same.
output_column_name_root=numeric_column,
ascending=True,
drop_tracts=drop_tracts,
)

# Min-max normalization:
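[Editor's note: as a supplement to the Expected Agricultural Loss comment above, here is a small hypothetical illustration of why zero-inflated data makes "the 90th percentile" hard to interpret until the zero-value tracts are dropped. The numbers are invented.]

```python
import numpy as np
import pandas as pd

# 90 tracts with no agricultural value and 10 with real loss rates.
values = pd.Series([0.0] * 90 + list(np.linspace(0.01, 0.10, 10)))

# With the zero-value tracts included, every tract with any loss at all
# lands above the 90th percentile, so the threshold cannot discriminate
# among the tracts that actually have agricultural value.
with_zeros = values.rank(pct=True)
print((with_zeros > 0.90).sum())  # 10: all non-zero tracts exceed 0.90

# Dropping the zero-value tracts first spreads the remaining tracts
# across the full 0-1 range, so the 90th percentile is meaningful again.
without_zeros = values[values > 0].rank(pct=True)
print((without_zeros > 0.90).sum())  # 1: only the highest-loss tract
```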