
Make tribal overlap set score N #2004

Merged
6 changes: 6 additions & 0 deletions data/data-pipeline/data_pipeline/content/config/csv.yml
@@ -59,9 +59,15 @@ fields:
- score_name: Definition N (communities) (based on adjacency index and low income alone)
label: Identified as disadvantaged based on neighbors and relaxed low income threshold only
format: bool
- score_name: Identified as disadvantaged due to tribal overlap
label: Identified as disadvantaged due to tribal overlap
format: bool
- score_name: Definition N community, including adjacency index tracts
label: Identified as disadvantaged
format: bool
- score_name: Percentage of tract that is disadvantaged
Contributor: can we make clear this is area-based?

Contributor Author: @emma-nechamkin I asked Beth in the ticket; it's good by me, but I leave the actual words to Beth and others

label: Percentage of tract that is disadvantaged by area
format: percentage
- score_name: Definition N (communities) (average of neighbors)
label: Share of neighbors that are identified as disadvantaged
format: percentage
6 changes: 6 additions & 0 deletions data/data-pipeline/data_pipeline/content/config/excel.yml
@@ -63,9 +63,15 @@ sheets:
- score_name: Definition N (communities) (based on adjacency index and low income alone)
label: Identified as disadvantaged based on neighbors and relaxed low income threshold only
format: bool
- score_name: Identified as disadvantaged due to tribal overlap
label: Identified as disadvantaged due to tribal overlap
format: bool
- score_name: Definition N community, including adjacency index tracts
label: Identified as disadvantaged
format: bool
- score_name: Percentage of tract that is disadvantaged
label: Percentage of tract that is disadvantaged by area
format: percentage
- score_name: Definition N (communities) (average of neighbors)
label: Share of neighbors that are identified as disadvantaged
format: percentage
3 changes: 3 additions & 0 deletions data/data-pipeline/data_pipeline/etl/score/constants.py
@@ -279,6 +279,8 @@
field_names.SCORE_N_COMMUNITIES
+ field_names.ADJACENT_MEAN_SUFFIX: "SN_DON",
field_names.SCORE_N_COMMUNITIES: "SN_NO_DON",
field_names.IS_TRIBAL_DAC: "SN_T",
field_names.PERCENT_OF_TRACT_IS_DAC: "SN_PERC",
field_names.EXPECTED_POPULATION_LOSS_RATE_LOW_INCOME_FIELD: "EPLRLI",
field_names.EXPECTED_AGRICULTURE_LOSS_RATE_LOW_INCOME_FIELD: "EALRLI",
field_names.EXPECTED_BUILDING_LOSS_RATE_LOW_INCOME_FIELD: "EBLRLI",
@@ -472,4 +474,5 @@
field_names.AML_BOOLEAN,
field_names.HISTORIC_REDLINING_SCORE_EXCEEDED,
field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT,
field_names.PERCENT_OF_TRACT_IS_DAC,
]
1 change: 1 addition & 0 deletions data/data-pipeline/data_pipeline/etl/score/etl_score.py
@@ -505,6 +505,7 @@ def _prepare_initial_df(self) -> pd.DataFrame:
field_names.IMPUTED_INCOME_FLAG_FIELD_NAME,
field_names.ELIGIBLE_FUDS_BINARY_FIELD_NAME,
field_names.HISTORIC_REDLINING_SCORE_EXCEEDED,
field_names.IS_TRIBAL_DAC,
]

# For some columns, high values are "good", so we want to reverse the percentile

Large diffs are not rendered by default.

Binary file not shown.

12 changes: 10 additions & 2 deletions data/data-pipeline/data_pipeline/etl/sources/tribal_overlap/etl.py
@@ -48,6 +48,7 @@ class TribalOverlapETL(ExtractTransformLoad):
ANNETTE_ISLAND_TRIBAL_NAME = "Annette Island LAR"

CRS_INTEGER = 3857
TRIBAL_OVERLAP_CUTOFF = 0.995 # Percentage of overlap that rounds to 100%

# Define these for easy code completion
def __init__(self):
@@ -58,6 +59,7 @@ def __init__(self):
field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT,
field_names.NAMES_OF_TRIBAL_AREAS_IN_TRACT,
field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT_DISPLAY,
field_names.IS_TRIBAL_DAC,
]

self.OVERALL_TRIBAL_COUNT = "OVERALL_TRIBAL_COUNT"
@@ -72,16 +74,17 @@ def _create_string_from_list(series: pd.Series) -> str:
str_list = sorted(str_list)
return ", ".join(str_list)

@staticmethod
@classmethod
def _adjust_percentage_for_frontend(
cls,
percentage_float: float,
) -> Optional[float]:
"""Round numbers very close to 0 to 0 and very close to 1 to 1 for display"""
if percentage_float is None:
return None
if percentage_float < 0.01:
Contributor: should we have the same cutoff type here? like 0.005 in the vein of 0.995? There's a little bit of asymmetry?

Contributor Author: The decisions were made at different times, so good catch. @BethMattern thoughts on changing the cutoff for "very close to 0%" smaller (.5% instead of 1%)?

Contributor Author: If we change this, it'll have both a backend and frontend change, and it's not obvious to me we should, so I've pulled it into its own ticket to explicitly make the decision: #2006
return 0.0
if percentage_float > 0.9995:
if percentage_float > cls.TRIBAL_OVERLAP_CUTOFF:
return 1.0

return percentage_float
@@ -246,6 +249,11 @@ def transform(self) -> None:
field_names.COUNT_OF_TRIBAL_AREAS_IN_TRACT_CONUS
] = None

merged_output_df[field_names.IS_TRIBAL_DAC] = (
merged_output_df[field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT]
> self.TRIBAL_OVERLAP_CUTOFF
)

# The very final thing we want to do is produce a string for the front end to show
# We do this here so that all of the logic is included
merged_output_df[
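The display-rounding helper and the new boolean flag in this file can be sketched standalone. This is an illustrative sketch, not the pipeline's actual code: the `TRIBAL_OVERLAP_CUTOFF` value and the `0.01` lower bound come from the diff above, but the function name, sample data, and column handling are simplified assumptions.

```python
from typing import Optional

import pandas as pd

TRIBAL_OVERLAP_CUTOFF = 0.995  # overlap fraction that rounds to 100% for display


def adjust_percentage_for_frontend(percentage_float: Optional[float]) -> Optional[float]:
    """Round values very close to 0 down to 0, and very close to 1 up to 1, for display."""
    if percentage_float is None:
        return None
    # Lower bound is 0.01 while the upper bound is 1 - 0.995 = 0.005 away from 1;
    # this is the asymmetry flagged in the review thread above.
    if percentage_float < 0.01:
        return 0.0
    if percentage_float > TRIBAL_OVERLAP_CUTOFF:
        return 1.0
    return percentage_float


# The boolean DAC flag is a plain threshold on the same overlap column;
# a missing value (NaN) compares False, so tracts with no tribal overlap
# data are not flagged.
overlap = pd.Series([0.003, 0.5, 0.996, None], dtype="float64")
is_tribal_dac = overlap > TRIBAL_OVERLAP_CUTOFF
```

Note that the flag is computed from the raw overlap fraction, not the display-rounded one, so a tract at 0.996 is flagged even though both values render as 100%.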
5 changes: 2 additions & 3 deletions data/data-pipeline/data_pipeline/score/field_names.py
@@ -11,9 +11,6 @@
COUNTY_FIELD = "County Name"

# Definition Narwhal fields
FINAL_SCORE_N_BOOLEAN = (
"Definition M community, including adjacency index tracts"
)
SCORE_N_COMMUNITIES = "Definition N (communities)"
N_CLIMATE = "Climate Factor (Definition N)"
N_ENERGY = "Energy Factor (Definition N)"
@@ -368,6 +365,8 @@
PERCENT_OF_TRIBAL_AREA_IN_TRACT_DISPLAY = (
"Percent of the Census tract that is within Tribal areas, for display"
)
IS_TRIBAL_DAC = "Identified as disadvantaged due to tribal overlap"
PERCENT_OF_TRACT_IS_DAC = "Percentage of tract that is disadvantaged"

#####
# Names for individual factors being exceeded
32 changes: 32 additions & 0 deletions data/data-pipeline/data_pipeline/score/score_narwhal.py
@@ -997,6 +997,33 @@ def _mark_donut_hole_tracts(self) -> pd.DataFrame:
]
)

def _mark_tribal_dacs(self) -> None:
"""Per the October 7th compromise (#1988),
tracts that are approx 100% tribal are Score N communities.
"""
self.df[field_names.SCORE_N_COMMUNITIES] = np.where(
Contributor: might also be easier / make a bit more sense to just take the MAX of those two columns? and why overwrite SCORE_N_COMMUNITIES?

Contributor Author: It's funny, doing MAX there makes less sense to me, but that's probably a programmer vs data scientist thing. I wanted to overwrite SCORE_N_COMMUNITIES because I wanted the tribes to be taken into account in calculating donut holes slightly later. BUT, if that's a logical mistake lemme know and I'll just be glad you caught it.

Contributor: no not a logical mistake at all! this makes sense to me! i am willy nilly with types

self.df[field_names.IS_TRIBAL_DAC],
True,
self.df[field_names.SCORE_N_COMMUNITIES],
)

def _get_percent_of_tract_that_is_dac(self) -> np.ndarray:
"""Per the October 7th compromise (#1988),
tracts can be partially DACs if some portion of the tract is tribal land.

Rules are as follows:
If a tract is a SCORE_N_COMMUNITY, it is 100% a DAC
If a tract is not, but contains tribal land, the percent that is tribal land is a DAC.
"""
tribal_percent = self.df[
field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT
].fillna(0.0)
return np.where(
Contributor: Why not just...

return self.df[[field_names.PERCENT_OF_TRIBAL_AREA_IN_TRACT, field_names.FINAL_SCORE_N_BOOLEAN]].max(axis=1)

Contributor Author: Same as above --- where is more obvious to me than max here, since in my mind a bool is not a number. But I do like that line.

Contributor: totally valid!
self.df[field_names.FINAL_SCORE_N_BOOLEAN],
1.0,
tribal_percent,
)

def add_columns(self) -> pd.DataFrame:
logger.info("Adding Score Narwhal")
self.df[field_names.THRESHOLD_COUNT] = 0
@@ -1031,10 +1058,15 @@ def add_columns(self) -> pd.DataFrame:
]
self.df[field_names.CATEGORY_COUNT] = self.df[factors].sum(axis=1)
self.df[field_names.SCORE_N_COMMUNITIES] = self.df[factors].any(axis=1)
self._mark_tribal_dacs()
Contributor: this is nice and clean

self.df[
field_names.SCORE_N_COMMUNITIES
+ field_names.PERCENTILE_FIELD_SUFFIX
] = self.df[field_names.SCORE_N_COMMUNITIES].astype(int)

self._mark_donut_hole_tracts()
self.df[
Contributor: a quick style comment -- why not modify df in place here as per the mark functions?

Contributor: this is a question NOT a suggestion to be clear

Contributor Author: I think because I noticed a pattern of marking tracts by mutating in place (e.g., with the donut hole calculation) but getting scores and things like that and assigning them (e.g., self.df[field_names.N_CLIMATE] = self._climate_factor() etc) so I was following that in this code.

Contributor: that makes sense to me!

Contributor Author: Not gonna lie, it's a pretty weak signal, so it could go either way reasonably.

field_names.PERCENT_OF_TRACT_IS_DAC
] = self._get_percent_of_tract_that_is_dac()

return self.df