Recast Total threshold criteria exceeded to int (#1848)
In writing tests to verify that the output of the tiles CSV matches the final
score CSV, I noticed that TC/Total threshold criteria exceeded was getting
cast from int64 to float64 in the process of PostScoreETL. I
tracked it down to the line where we merge the score dataframe with
constants.DATA_CENSUS_CSV_FILE_PATH. There were more than 100 tracts in the
national census CSV that don't exist in the score, so those ended up
with a Total threshold count of np.nan, which is a float, and that
cast the whole column to float. For the moment I just cast it back.
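
The upcast described above can be reproduced in isolation. The sketch below is only an illustration under stated assumptions: the frames, the column names `GEOID10_TRACT` and `threshold_count`, and the left merge are hypothetical stand-ins, not the pipeline's actual data or merge call.

```python
import pandas as pd

# Hypothetical stand-in for the score frame: every tract has an integer count.
score_df = pd.DataFrame(
    {
        "GEOID10_TRACT": ["01001020100", "01001020200"],
        "threshold_count": pd.Series([3, 0], dtype="int64"),
    }
)

# Hypothetical stand-in for the national census frame: one extra tract
# that has no row in the score.
census_df = pd.DataFrame(
    {"GEOID10_TRACT": ["01001020100", "01001020200", "01001020300"]}
)

# The unmatched tract gets NaN for threshold_count. NaN is a float, so
# pandas upcasts the whole column from int64 to float64.
merged = census_df.merge(score_df, on="GEOID10_TRACT", how="left")
merged_dtype = str(merged["threshold_count"].dtype)

# Once the unmatched rows are dropped, no NaN remains and the column
# can be cast back to an integer dtype.
dropped = merged.dropna(subset=["threshold_count"])
recast = dropped.assign(threshold_count=dropped["threshold_count"].astype(int))

print(merged_dtype)
print(recast["threshold_count"].dtype)
```

Note that `.astype(int)` raises if any NaN is still present, so a cast like this only works after the rows without a score have been removed.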
mattbowen-usds committed Aug 29, 2022
1 parent 864fddf commit 6eee8e3
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion data/data-pipeline/data_pipeline/etl/score/etl_score_post.py
@@ -92,7 +92,9 @@ def _extract_states(self, state_path: Path) -> pd.DataFrame:
     def _extract_score(self, score_path: Path) -> pd.DataFrame:
         logger.info("Reading Score CSV")
         df = pd.read_csv(
-            score_path, dtype={self.GEOID_TRACT_FIELD_NAME: "string"}
+            score_path,
+            dtype={self.GEOID_TRACT_FIELD_NAME: "string"},
+            low_memory=False,
         )

         # Convert total population to an int
@@ -237,6 +239,11 @@ def _create_score_data(
             subset=[DISADVANTAGED_COMMUNITIES_FIELD]
         )

+        # recast threshold count to integer
+        de_duplicated_df[field_names.THRESHOLD_COUNT] = de_duplicated_df[
+            field_names.THRESHOLD_COUNT
+        ].astype(int)
+
         # set the score to the new df
         return de_duplicated_df