Refactor DOE Energy Burden and COI to use YAML (#1796)
* added tribalId for Supplemental dataset (#1804)

* Setting zoom levels for tribal map (#1810)

* NRI dataset and initial score YAML configuration (#1534)

* update be staging gha

* NRI dataset and initial score YAML configuration

* checkpoint

* adding data checks for release branch

* passing tests

* adding INPUT_EXTRACTED_FILE_NAME to base class

* lint

* columns to keep and tests

* update be staging gha

* checkpoint

* update be staging gha

* PR Review

* removing source url

* tests

* stop execution of ETL if there's a YAML schema issue

* update be staging gha

* adding source url as class var again

* clean up

* force cache bust

* gha cache bust

* dynamically set score vars from YAML

* docstrings

* removing last updated year - optional reverse percentile

* passing tests

* sort order

* column ordering

* PR review

* class level vars

* Updating DatasetsConfig

* fix pylint errors

* moving metadata hint back to code

Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>

* Correct copy typo (#1809)

* Add basic test suite for COI (#1518)

* Update COI to use new yaml (#1518)

* Add tests for DOE energy burden (#1518)

* Add dataset config for energy burden (#1518)

* Refactor ETL to use datasets.yml (#1518)

* Add fake GEOIDs to COI tests (#1518)

* Refactor _setup_etl_instance_and_run_extract to base (#1518)

For the three classes we've done so far, a generic
_setup_etl_instance_and_run_extract works fine. For the moment we can
reuse the same setup method until we decide future classes need more
flexibility; they can also always subclass it.
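
A rough sketch (not part of this commit's diff) of what such a shared helper can look like on a common test base class; the class and attribute names here are hypothetical:

    # Hypothetical sketch; each ETL test suite subclasses TestETL and sets _ETL_CLASS.
    class TestETL:
        _ETL_CLASS = None  # set by each concrete test class

        def _setup_etl_instance_and_run_extract(self):
            """Instantiate the ETL under test and run its generic extract step."""
            etl = self._ETL_CLASS()
            etl.extract()  # the generic extract is enough for the three classes so far
            return etl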

* Add output-path tests (#1518)

* Update YAML to match constant (#1518)

* Don't blindly set float format (#1518)

* Add defaults for extract (#1518)

* Run YAML load on all subclasses (#1518)

* Update description fields (#1518)

* Update YAML per final format (#1518)

* Update fixture tract IDs (#1518)

* Update base class refactor (#1518)

Now that NRI is final I needed to make a small number of updates to my
refactored code.

* Remove old comment (#1518)

* Fix type signature and return (#1518)

* Update per code review (#1518)

Co-authored-by: Jorge Escobar <83969469+esfoobar-usds@users.noreply.github.com>
Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
Co-authored-by: Vim <86254807+vim-usds@users.noreply.github.com>
4 people authored Aug 10, 2022
1 parent ed9b717 commit 9635ef5
Showing 44 changed files with 698 additions and 3,640 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/data-checks.yml
@@ -2,7 +2,9 @@
name: Data Checks
on:
  pull_request:
-    branches: [main] # runs on any PR against main
+    branches:
+      - main
+      - "**/release/**"
    paths:
      - "data/**"
jobs:
@@ -16,7 +18,7 @@ jobs:
        # checks all of the versions allowed in pyproject.toml
        python-version: [3.8, 3.9]
    steps:
-      # installs python
+      # installs Python
      # one execution of the tests per version listed above
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
2 changes: 1 addition & 1 deletion .github/workflows/deploy_be_staging.yml
@@ -62,7 +62,7 @@ jobs:
      - name: Update PR with deployed Score URLs
        uses: mshick/add-pr-comment@v1
        with:
-          # Deploy to S3 for the staging URL
+          # Deploy to S3 for the Staging URL
          message: |
            ** Score Deployed! **
            Find it here:
@@ -519,7 +519,7 @@ exports[`rendering of the AreaDetail checks if indicators for NATION is present
<div>
  Expected building loss rate
  <div>
-    Economic loss rate to agricultural value resulting from natural hazards each year
+    Economic loss rate to building value resulting from natural hazards each year
  </div>
</div>
<div>
2 changes: 1 addition & 1 deletion client/src/data/copy/explore.tsx
@@ -590,7 +590,7 @@ export const SIDE_PANEL_INDICATOR_DESCRIPTION = defineMessages({
  },
  EXP_BLD_LOSS: {
    id: 'explore.map.page.side.panel.indicator.description.exp.bld.loss',
-    defaultMessage: 'Economic loss rate to agricultural value resulting from natural hazards each year',
+    defaultMessage: 'Economic loss rate to building value resulting from natural hazards each year',
    description: `Navigate to the explore the map page. When the map is in view, click on the map. The side
      panel will show an indicator description of Economic loss rate to buildings resulting from natural hazards`,
  },
2 changes: 1 addition & 1 deletion client/src/intl/en.json
@@ -496,7 +496,7 @@
"description": "Navigate to the explore the map page. When the map is in view, click on the map. The side panel will show an indicator description of Economic loss rate to agriculture resulting from natural hazards\n "
},
"explore.map.page.side.panel.indicator.description.exp.bld.loss": {
"defaultMessage": "Economic loss rate to agricultural value resulting from natural hazards each year",
"defaultMessage": "Economic loss rate to building value resulting from natural hazards each year",
"description": "Navigate to the explore the map page. When the map is in view, click on the map. The side \n panel will show an indicator description of Economic loss rate to buildings resulting from natural hazards"
},
"explore.map.page.side.panel.indicator.description.exp.pop.loss": {
2 changes: 1 addition & 1 deletion client/src/intl/es.json
@@ -125,7 +125,7 @@
"explore.map.page.side.panel.indicator.description.dieselPartMatter": "Descarga de gas de motor diésel en el aire",
"explore.map.page.side.panel.indicator.description.energyBurden": "Costo promedio anual de la energía dividido por el ingreso familiar",
"explore.map.page.side.panel.indicator.description.exp.ag.loss": "Tasa de pérdida económica en relación con el valor agrícola resultante de peligros naturales cada año",
"explore.map.page.side.panel.indicator.description.exp.bld.loss": "Tasa de pérdida económica en relación con el valor agrícola resultante de peligros naturales cada año",
"explore.map.page.side.panel.indicator.description.exp.bld.loss": "Tasa de pérdida económica en relación con el valor construcción resultante de peligros naturales cada año",
"explore.map.page.side.panel.indicator.description.exp.pop.loss": "Tasa de muertes y lesiones resultantes de peligros naturales cada año",
"explore.map.page.side.panel.indicator.description.heartDisease": "Personas con 18 años cumplidos a quienes se les ha diagnosticado una cardiopatía",
"explore.map.page.side.panel.indicator.description.high.ed": "Porcentaje de la población con 15 años cumplidos del grupo de bloques del censo que no está inscrita en la universidad, escuela superior o escuela de posgrado",
99 changes: 81 additions & 18 deletions data/data-pipeline/data_pipeline/etl/base.py
@@ -1,12 +1,15 @@
import enum
import pathlib
import sys
import typing
from typing import Optional

import pandas as pd

from data_pipeline.config import settings
from data_pipeline.etl.score.schemas.datasets import DatasetsConfig
from data_pipeline.utils import (
    load_yaml_dict_from_file,
    unzip_file_from_url,
    remove_all_from_dir,
    get_module_logger,
@@ -30,6 +33,9 @@ class ExtractTransformLoad:
    Attributes:
        DATA_PATH (pathlib.Path): Local path where all data will be stored
        TMP_PATH (pathlib.Path): Local path where temporary data will be stored
        TODO: Fill missing attrs here
        GEOID_FIELD_NAME (str): The common column name for a Census Block Group identifier
        GEOID_TRACT_FIELD_NAME (str): The common column name for a Census Tract identifier
    """
@@ -40,6 +46,8 @@ class ExtractTransformLoad:
    DATA_PATH: pathlib.Path = APP_ROOT / "data"
    TMP_PATH: pathlib.Path = DATA_PATH / "tmp"
    CONTENT_CONFIG: pathlib.Path = APP_ROOT / "content" / "config"
    DATASET_CONFIG_PATH: pathlib.Path = APP_ROOT / "etl" / "score" / "config"
    DATASET_CONFIG: Optional[dict] = None

    # Parameters
    GEOID_FIELD_NAME: str = "GEOID10"
@@ -55,6 +63,9 @@ class ExtractTransformLoad:
    # SOURCE_URL is used to extract source data in extract().
    SOURCE_URL: str = None

    # INPUT_EXTRACTED_FILE_NAME is the name of the file after extract().
    INPUT_EXTRACTED_FILE_NAME: str = None

    # GEO_LEVEL is used to identify whether output data is at the unit of the tract or
    # census block group.
    # TODO: add tests that enforce seeing the expected geographic identifier field
@@ -64,6 +75,13 @@ class ExtractTransformLoad:
    # COLUMNS_TO_KEEP is used to identify which columns to keep in the output df.
    COLUMNS_TO_KEEP: typing.List[str] = None

    # INPUT_GEOID_TRACT_FIELD_NAME is the field name that identifies the Census Tract ID
    # on the input file
    INPUT_GEOID_TRACT_FIELD_NAME: str = None

    # NULL_REPRESENTATION is how nulls are represented on the input field
    NULL_REPRESENTATION: str = None

    # Thirteen digits in a census block group ID.
    EXPECTED_CENSUS_BLOCK_GROUPS_CHARACTER_LENGTH: int = 13
    # TODO: investigate. Census says there are only 217,740 CBGs in the US. This might
@@ -77,8 +95,56 @@ class ExtractTransformLoad:
    # periods. https://github.com/usds/justice40-tool/issues/964
    EXPECTED_MAX_CENSUS_TRACTS: int = 74160

    # We use output_df as the final dataframe that gets written to the CSV.
    # It is used on the "load" base class method.
    output_df: pd.DataFrame = None

    def __init_subclass__(cls) -> None:
        cls.DATASET_CONFIG = cls.yaml_config_load()

    @classmethod
    def yaml_config_load(cls) -> Optional[dict]:
        """Generate config dictionary and set instance variables from YAML dataset."""
        if cls.NAME is not None:
            # check if the class instance has score YAML definitions
            datasets_config = load_yaml_dict_from_file(
                cls.DATASET_CONFIG_PATH / "datasets.yml",
                DatasetsConfig,
            )

            # get the config for this dataset
            try:
                dataset_config = next(
                    item
                    for item in datasets_config.get("datasets")
                    if item["module_name"] == cls.NAME
                )
            except StopIteration:
                logger.error(
                    f"Exception encountered while extracting dataset config for dataset {cls.NAME}"
                )
                sys.exit()

            # set some of the basic fields
            cls.INPUT_GEOID_TRACT_FIELD_NAME = dataset_config[
                "input_geoid_tract_field_name"
            ]

            # get the columns to write on the CSV and set the constants
            cls.COLUMNS_TO_KEEP = [
                cls.GEOID_TRACT_FIELD_NAME,  # always index with geoid tract id
            ]
            for field in dataset_config["load_fields"]:
                cls.COLUMNS_TO_KEEP.append(field["long_name"])
                # set the constants for the class
                setattr(cls, field["df_field_name"], field["long_name"])
            return dataset_config
        return None

    # This is a classmethod so it can be used by `get_data_frame` without
    # needing to create an instance of the class. This is a use case in `etl_score`.
    @classmethod
@@ -87,16 +153,10 @@ def _get_output_file_path(cls) -> pathlib.Path:
        if cls.NAME is None:
            raise NotImplementedError(
                f"Child ETL class needs to specify `cls.NAME` (currently "
-                f"{cls.NAME}) and `cls.LAST_UPDATED_YEAR` (currently "
-                f"{cls.LAST_UPDATED_YEAR})."
+                f"{cls.NAME})."
            )

-        output_file_path = (
-            cls.DATA_PATH
-            / "dataset"
-            / f"{cls.NAME}_{cls.LAST_UPDATED_YEAR}"
-            / "usa.csv"
-        )
+        output_file_path = cls.DATA_PATH / "dataset" / f"{cls.NAME}" / "usa.csv"
        return output_file_path

    def get_tmp_path(self) -> pathlib.Path:
@@ -120,14 +180,18 @@ def extract(
        to get the file from a source url, unzips it and stores it on an
        extract_path."""

-        # this can be accessed via super().extract()
-        if source_url and extract_path:
-            unzip_file_from_url(
-                file_url=source_url,
-                download_path=self.get_tmp_path(),
-                unzipped_file_path=extract_path,
-                verify=verify,
-            )
+        if source_url is None:
+            source_url = self.SOURCE_URL
+
+        if extract_path is None:
+            extract_path = self.get_tmp_path()
+
+        unzip_file_from_url(
+            file_url=source_url,
+            download_path=self.get_tmp_path(),
+            unzipped_file_path=extract_path,
+            verify=verify,
+        )

    def transform(self) -> None:
        """Transform the data extracted into a format that can be consumed by the
@@ -229,8 +293,7 @@ def load(self, float_format=None) -> None:
        Data is written in the specified local data folder or remote AWS S3 bucket.
-        Uses the directory from `self.OUTPUT_DIR` and the file name from
-        `self._get_output_file_path`.
+        Uses the directory and the file name from `self._get_output_file_path`.
        """
        logger.info(f"Saving `{self.NAME}` CSV")

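To illustrate the intended use of these hooks, here is a minimal hypothetical child ETL driven by the datasets.yml shown below. The class name, source URL, extracted file name, and transform body are illustrative sketches, not taken from this commit:

    # Hypothetical sketch of a YAML-driven child ETL. NAME must match a
    # module_name entry in datasets.yml; __init_subclass__ then fills in
    # INPUT_GEOID_TRACT_FIELD_NAME, COLUMNS_TO_KEEP, and one class constant
    # per load field (df_field_name -> long_name).
    class DOEEnergyBurdenETL(ExtractTransformLoad):
        NAME = "doe_energy_burden"
        SOURCE_URL = "https://example.com/lead-data.zip"  # placeholder URL
        INPUT_EXTRACTED_FILE_NAME = "lead.csv"            # placeholder file name

        def transform(self) -> None:
            raw_df = pd.read_csv(
                self.get_tmp_path() / self.INPUT_EXTRACTED_FILE_NAME,
                dtype={self.INPUT_GEOID_TRACT_FIELD_NAME: "string"},
            )
            raw_df = raw_df.rename(
                columns={
                    self.INPUT_GEOID_TRACT_FIELD_NAME: self.GEOID_TRACT_FIELD_NAME,
                    "BURDEN": self.REVISED_ENERGY_BURDEN_FIELD_NAME,  # input column name is a guess
                }
            )
            # load() writes output_df restricted to COLUMNS_TO_KEEP
            self.output_df = raw_df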
Empty file.
131 changes: 131 additions & 0 deletions data/data-pipeline/data_pipeline/etl/score/config/datasets.yml
@@ -0,0 +1,131 @@
---
datasets:
  - long_name: "FEMA National Risk Index"
    short_name: "nri"
    module_name: national_risk_index
    input_geoid_tract_field_name: "TRACTFIPS"
    load_fields:
      - short_name: "ex_loss"
        df_field_name: "RISK_INDEX_EXPECTED_ANNUAL_LOSS_SCORE_FIELD_NAME"
        long_name: "FEMA Risk Index Expected Annual Loss Score"
        field_type: float
        number_of_decimals_in_output: 6

      - short_name: "ex_pop_loss"
        df_field_name: "EXPECTED_POPULATION_LOSS_RATE_FIELD_NAME"
        long_name: "Expected population loss rate (Natural Hazards Risk Index)"
        description_short:
          "Rate of fatalities and injuries resulting from natural hazards each year"
        description_long:
          "Rate relative to the population of fatalities and injuries due to fourteen
          types of natural hazards each year that have some link to climate change:
          avalanche, coastal flooding, cold wave, drought, hail, heat wave, hurricane,
          ice storm, landslide, riverine flooding, strong wind, tornado, wildfire, and
          winter weather. Population loss is defined as the Spatial Hazard Events and
          Losses and National Centers for Environmental Information’s (NCEI) reported
          number of fatalities and injuries caused by the hazard occurrence. To combine
          fatalities and injuries for the computation of population loss value, an
          injury is counted as one-tenth (1/10) of a fatality. The NCEI Storm Events
          Database classifies injuries and fatalities as direct or indirect. Both direct
          and indirect injuries and fatalities are counted as population loss. This
          total number of injuries and fatalities is then divided by the population in
          the census tract to get a per-capita rate of population risk."
        field_type: float
        number_of_decimals_in_output: 6
        include_in_tiles: true
        include_in_downloadable_files: true
        create_percentile: true

      - short_name: "ex_ag_loss"
        df_field_name: "EXPECTED_AGRICULTURE_LOSS_RATE_FIELD_NAME"
        long_name: "Expected agricultural loss rate (Natural Hazards Risk Index)"
        description_short:
          "Economic loss rate to agricultural value resulting from natural hazards each
          year"
        description_long:
          "Percent of agricultural value at risk from losses due to fourteen types of
          natural hazards that have some link to climate change: avalanche, coastal
          flooding, cold wave, drought, hail, heat wave, hurricane, ice storm,
          landslide, riverine flooding, strong wind, tornado, wildfire, and winter
          weather. Rate calculated by dividing the agricultural value at risk in a
          census tract by the total agricultural value in that census tract."
        field_type: float
        number_of_decimals_in_output: 6
        include_in_tiles: true
        include_in_downloadable_files: true
        create_percentile: true

      - short_name: "ex_bldg_loss"
        df_field_name: "EXPECTED_BUILDING_LOSS_RATE_FIELD_NAME"
        long_name: "Expected building loss rate (Natural Hazards Risk Index)"
        description_short:
          "Economic loss rate to building value resulting from natural hazards each year"
        description_long:
          "Percent of building value at risk from losses due to fourteen types of
          natural hazards that have some link to climate change: avalanche, coastal
          flooding, cold wave, drought, hail, heat wave, hurricane, ice storm,
          landslide, riverine flooding, strong wind, tornado, wildfire, and winter
          weather. Rate calculated by dividing the building value at risk in a census
          tract by the total building value in that census tract."
        field_type: float
        number_of_decimals_in_output: 6
        include_in_tiles: true
        include_in_downloadable_files: true
        create_percentile: true

      - short_name: "has_ag_val"
        df_field_name: "CONTAINS_AGRIVALUE"
        long_name: "Contains agricultural value"
        field_type: bool

  - long_name: "Child Opportunity Index 2.0 database"
    short_name: "coi"
    module_name: "child_opportunity_index"
    input_geoid_tract_field_name: "geoid"
    load_fields:
      - short_name: "he_heat"
        df_field_name: "EXTREME_HEAT_FIELD"
        long_name: "Summer days above 90F"
        field_type: float
        include_in_downloadable_files: true
        include_in_tiles: true
      - short_name: "he_food"
        long_name: "Percent low access to healthy food"
        df_field_name: "HEALTHY_FOOD_FIELD"
        field_type: float
        include_in_downloadable_files: true
        include_in_tiles: true
      - short_name: "he_green"
        long_name: "Percent impenetrable surface areas"
        df_field_name: "IMPENETRABLE_SURFACES_FIELD"
        field_type: float
        include_in_downloadable_files: true
        include_in_tiles: true
      - short_name: "ed_reading"
        df_field_name: "READING_FIELD"
        long_name: "Third grade reading proficiency"
        field_type: float
        include_in_downloadable_files: true
        include_in_tiles: true

  - long_name: "Low-Income Energy Affordability Data"
    short_name: "LEAD"
    module_name: "doe_energy_burden"
    input_geoid_tract_field_name: "FIP"
    load_fields:
      - short_name: "EBP_PFS"
        df_field_name: "REVISED_ENERGY_BURDEN_FIELD_NAME"
        long_name: "Energy burden"
        field_type: float
        include_in_downloadable_files: true
        include_in_tiles: true

  - long_name: "Example ETL"
    short_name: "Example"
    module_name: "example_dataset"
    input_geoid_tract_field_name: "GEOID10_TRACT"
    load_fields:
      - short_name: "EXAMPLE_FIELD"
        df_field_name: "Input Field 1"
        long_name: "Example Field 1"
        field_type: float
        include_in_tiles: true
        include_in_downloadable_files: true
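
Read together with base.py above, each entry resolves to class state roughly as follows. This is a sketch for the LEAD entry, reusing the hypothetical DOEEnergyBurdenETL class from earlier; the tract ID column name "GEOID10_TRACT" is assumed:

    # Sketch: what yaml_config_load() derives for module_name "doe_energy_burden".
    config = DOEEnergyBurdenETL.DATASET_CONFIG
    assert config["input_geoid_tract_field_name"] == "FIP"
    # One class constant per load field, named by df_field_name:
    assert DOEEnergyBurdenETL.REVISED_ENERGY_BURDEN_FIELD_NAME == "Energy burden"
    # COLUMNS_TO_KEEP is the tract ID plus each field's long_name:
    assert DOEEnergyBurdenETL.COLUMNS_TO_KEEP == ["GEOID10_TRACT", "Energy burden"]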

1 change: 1 addition & 0 deletions data/data-pipeline/data_pipeline/etl/score/etl_score.py
@@ -442,6 +442,7 @@ def _prepare_initial_df(self) -> pd.DataFrame:
        # for instance, 3rd grade reading level : Low 3rd grade reading level.
        # This low field will not exist yet, it is only calculated for the
        # percentile.
+        # TODO: This will come from the YAML dataset config
        ReversePercentile(
            field_name=field_names.READING_FIELD,
            low_field_name=field_names.LOW_READING_FIELD,
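For context, a reverse percentile ranks a field in the opposite direction, so that low raw values (for example, low reading proficiency) receive high percentiles. Schematically, a sketch of the idea, not the pipeline's exact implementation:

    # Sketch of the reverse-percentile idea using pandas rank():
    df[field_names.LOW_READING_FIELD] = df[field_names.READING_FIELD].rank(
        pct=True, ascending=False
    )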
Empty file.
