Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix build error for epacamd_eia_test #1940

Merged
merged 3 commits into from
Sep 22, 2022
Merged

Conversation

aesharpe
Copy link
Member

@aesharpe aesharpe commented Sep 19, 2022

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_raw] - Attribut...
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_annual] - Attri...
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_monthly] - Attr...
= 12 failed, 469 passed, 120 skipped, 15 xfailed, 2 xpassed, 29 warnings in 9422.54s (2:37:02) =

@aesharpe aesharpe added the rmi label Sep 19, 2022
@aesharpe aesharpe self-assigned this Sep 19, 2022
@codecov
Copy link

codecov bot commented Sep 19, 2022

Codecov Report

Base: 82.7% // Head: 83.8% // Increases project coverage by +1.0% 🎉

Coverage data is based on head (8dd15ed) compared to base (29a651a).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #1940     +/-   ##
=======================================
+ Coverage   82.7%   83.8%   +1.0%     
=======================================
  Files         65      65             
  Lines       7398    8096    +698     
=======================================
+ Hits        6123    6785    +662     
- Misses      1275    1311     +36     
Impacted Files Coverage Δ
src/pudl/transform/ferc714.py 87.2% <0.0%> (-1.1%) ⬇️
src/pudl/analysis/plant_parts_eia.py 96.3% <0.0%> (-0.3%) ⬇️
src/pudl/metadata/enums.py 100.0% <0.0%> (ø)
src/pudl/metadata/fields.py 100.0% <0.0%> (ø)
src/pudl/metadata/resources/eia.py 100.0% <0.0%> (ø)
src/pudl/output/ferc714.py 93.5% <0.0%> (+0.6%) ⬆️
src/pudl/helpers.py 89.1% <0.0%> (+1.4%) ⬆️
src/pudl/metadata/classes.py 85.5% <0.0%> (+3.3%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans
Copy link
Member

Re: the FK errors. I think if the harvesting sees (plant_id_eia, boiler_id) in the columns, then it will try and harvest the values it sees in those columns for inclusion in the boiler entity table, and it will also assume that there should be an FK relationship between those columns and the boiler table, which is correct. I'd think it would ignore NA values in both of those cases, but maybe it doesn't? It seems like this would have come up in other tables too.

If necessary you can tell it to ignore some of the automatically generated FK relationships in the resource (table) definition.

@zaneselvans zaneselvans added the epacems Integration and analysis of the EPA CEMS dataset. label Sep 20, 2022
@aesharpe aesharpe changed the title Fix validation test for epacamd_eia_test Fix build error for epacamd_eia_test Sep 20, 2022
@aesharpe
Copy link
Member Author

Fixed the validation test for test/validate/epacamd_eia_test.py

@aesharpe
Copy link
Member Author

aesharpe commented Sep 20, 2022

Now the issue is the min-max row tests.

I looked at the annual plants_eia860 table which was one of the ones that failed. It appears that the table resulting from the failed build has two less rows than the table resulting from the successful build (Sept 15th).

The rows that show up in the working build table (but not in the failed build table) are:

  • 2022-01-01 | 62844 | Cranberry Point Energy Storage
  • 2022-01-01 | 64835 | Silver Queen Wind Farm

I did some more digging and found out that beyond missing rows, there are 3974 rows that differ between the two tables. I looked at an example (2022-01-01 | 195) and see that the only difference is in the plant_name_eia field. The working build table says Narrows (AR) and the failed build table says Narrows.

Because the failed build actually has fewer rows than the working build, I don't think this is a harvesting issue. I'm also pretty sure that the harvesting happens before the glue tables are ever processed.

@aesharpe
Copy link
Member Author

Re: the FK errors. I think if the harvesting sees (plant_id_eia, boiler_id) in the columns, then it will try and harvest the values it sees in those columns for inclusion in the boiler entity table, and it will also assume that there should be an FK relationship between those columns and the boiler table, which is correct. I'd think it would ignore NA values in both of those cases, but maybe it doesn't? It seems like this would have come up in other tables too.

If necessary you can tell it to ignore some of the automatically generated FK relationships in the resource (table) definition.

Foreign key problems were related to my local errors

@aesharpe
Copy link
Member Author

aesharpe commented Sep 20, 2022

The only thing done to dev between the 15th and the fail was some documentation updates and, and the re-addition of the epacamd_eia analysis module used by Greg. There is no output table attached to the analysis module, it's just some useful functions to restructure the crosswalk and calculate sub-plants.

Screen Shot 2022-09-20 at 5 47 09 PM

@zaneselvans
Copy link
Member

The ~8000 rows that are different maybe that could just be a small number of plants for which the most common plant name changed due to values that were harvested from the crosswalk?

Also plant_name_eia should be in the plants_entity_eia table, not the plants_eia860 table. What differences do you see between those two tables (not using the output functions)

@aesharpe
Copy link
Member Author

The ~8000 rows that are different maybe that could just be a small number of plants for which the most common plant name changed due to values that were harvested from the crosswalk?

If I'm not mistaken, the harvesting process is outlined in the eia.py transform module which gets called in the _etl_eia() function in the etl.py module which happens before (and therefore seperate from) _etl_glue() which processes the crosswalk table.

from the etl() function in etl.py

    if datasets.get("ferc1", False):
        sqlite_dfs.update(_etl_ferc1(datasets["ferc1"], pudl_settings))
    if datasets.get("eia", False):
        sqlite_dfs.update(_etl_eia(datasets["eia"], ds_kwargs))
    if datasets.get("glue", False):
        sqlite_dfs.update(
            _etl_glue(datasets["glue"], ds_kwargs, sqlite_dfs, datasets["eia"])
        )

@zaneselvans
Copy link
Member

The build outputs from Sept. 15th would have reflected anything that got merged into dev on September 14th (that was successful).

If the first failed build was the one that happened the night of Sept. 15/16th, that would have been from anything merged into dev on Sept. 15th. So what happened on the 15th?

@aesharpe
Copy link
Member Author

If the first failed build was the one that happened the night of Sept. 15/16th, that would have been from anything merged into dev on Sept. 15th. So what happened on the 15th?

I thought the last successful build was from the 15th? Does that mean morning of the 15ths?

@zaneselvans
Copy link
Member

The last successful build outputs have a date of September 15th. The builds start at 1am and typically run until 5-6am, Mountain time, or 12-1pm UTC, which is what the GCP dates are in:

 843649024  2022-09-15T12:27:02Z  gs://intake.catalyst.coop/dev/censusdp1tract.sqlite
 761257984  2022-09-15T12:27:04Z  gs://intake.catalyst.coop/dev/ferc1.sqlite
5110330869  2022-09-15T12:28:59Z  gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet
 702459904  2022-09-15T12:27:01Z  gs://intake.catalyst.coop/dev/pudl.sqlite

@aesharpe
Copy link
Member Author

aesharpe commented Sep 20, 2022

Also plant_name_eia should be in the plants_entity_eia table, not the plants_eia860 table. What differences do you see between those two tables (not using the output functions)

Ok I tried the same thing but with the the plants_eia860 table from the db (not the output tables) and the tables are the same....

UPDATE: this was an error, I accidentally used the generators_eia860 table. Corrected below.

@zaneselvans
Copy link
Member

zaneselvans commented Sep 20, 2022

Ok I tried the same thing but with the the plants_eia860 table from the db (not the output tables) and the tables are the same....

You said what was different was plant_name_eia and that field is in plants_entity_eia so that is the table to compare between the two DBs if you're looking for plant name discrepancies

This command (executed with dev checked out) will show all the commits that came into dev on 2022-09-15 -- it's just the crosswalk PR.

git log --first-parent --since 2022-09-14 --until 2022-09-15

commit f13904e4b43938d540066b37cfb3ff12e4c11579
Merge: 6413c0fa 80e560b4
Author: Austen Sharpe <49878195+aesharpe@users.noreply.github.com>
Date:   Thu Sep 15 11:00:03 2022 -0600

    Merge pull request #1692 from catalyst-cooperative/add-epacems-crosswalk-to-etl
    
    Add epacamd-eia crosswalk to etl

@aesharpe
Copy link
Member Author

So what happened on the 15th?

I updated the screenshot above to show the commits from the 15th

@aesharpe
Copy link
Member Author

--since 2022-09-14 --until 2022-09-15

Wouldn't that be the 14th?

@zaneselvans
Copy link
Member

Nope. I just ran it.

@aesharpe
Copy link
Member Author

You said what was different was plant_name_eia and that field is in plants_entity_eia so that is the table to compare between the two DBs if you're looking for plant name discrepancies

Those tables are also the same.

@zaneselvans
Copy link
Member

Okay so if you compare pre & post failure pudl_out.plants_eia860() dataframes you find differences in plant_name_eia

But if you compare pre and post failure plants_entity_eia tables straight from the database, you find that they are identical?

@aesharpe
Copy link
Member Author

aesharpe commented Sep 21, 2022

I goofed a little, the entity tables are the same from both successful and failed builds. But the plants_eia860 tables from the db differ in length by two rows (the same rows as above). Excluding those rows, there are 8 more rows that are different between the failed and working builds for the db version of the plants_eia860 table. None of the plant_id_eia values from the rows that differ between builds are in the crosswalk.

@aesharpe aesharpe closed this Sep 21, 2022
@aesharpe aesharpe reopened this Sep 21, 2022
@aesharpe
Copy link
Member Author

aesharpe commented Sep 21, 2022

Neither of the missing plant_id_eia rows 64835 , 62844 are in the crosswalk either.

@zaneselvans
Copy link
Member

So the only thing that you're concerned about now is understanding why the addition of the epacamd_eia crosswalk table seems to have affected the contents of the plants_eia860 table?

@aesharpe
Copy link
Member Author

So the only thing that you're concerned about now is understanding why the addition of the epacamd_eia crosswalk table seems to have affected the contents of the plants_eia860 table?

That was just one example, there were other EIA tables that were affected (failed)

@zaneselvans
Copy link
Member

Okay so in general the concern is that adding the crosswalk table seems to have affected other EIA tables, and that seems weird / wrong?

@zaneselvans
Copy link
Member

Can you diff all of the tables between these two databases you have and identify the differences?

@zaneselvans
Copy link
Member

#1692 did modify more than 30 files so... it does seem like there could be some impacts in there. I will look over the changes.

@aesharpe
Copy link
Member Author

aesharpe commented Sep 21, 2022

Here's a record of the differences for the plants_eia860 table from the db:

  • plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_cogen_status goes from FALSE to None
  • plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_exempt_wholesale_generator goes from FALSE to None
  • plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_small_power_producer_docket_no goes from 86-501-000 to None
  • plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_small_power_producer goes from TRUE to None
  • plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_cogen_status goes from FALSE to None
  • plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_exempt_wholesale_generator goes from FALSE to None
  • plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_small_power_producer goes from FALSE to None

@zaneselvans
Copy link
Member

Looking back over the PR, the only thing that seems like it might have effects all over the EIA data is the re-write of the leading-zero fixer function. It does look like the new function should be doing the same thing as the old function, but it's notoriously easy to mess up regular expressions.

What is the extent of the differences between the two builds? If it's just a couple of extra rows, or a half dozen individual fields for a couple of random plants 20 years ago, then that doesn't seem scary and we should probably just update the expected number of records.

If there are thousands of rows in multiple tables where the contents have changed as you said above, that seems more concerning. Is that really happening?

@aesharpe
Copy link
Member Author

aesharpe commented Sep 21, 2022

I adjusted the row count to account for the -2 rows in each of the pu_eia860, plants_eia860, and utils_eia860 tables. I still don't know why this is happening.

In case we want to revisit this later (and maybe add some regression analysis), I took stock of the types of columns that are changing in the plants_eia860 output table.

There are 2 fewer rows in the current table (than in the version before adding the epacamd_eia crosswalk from PR #1692). There are also 3974 rows (out of 185071 or 185069 rows--depending on whether you compare the old or new version) that differ between the two tables.

The following table shows, for the plants_eia860 output table, what column data has changed. It compares the 3974 rows that differ between the tables to see which columns (represented as the index here) are the culprits.

  • count shows the total number of rows
  • unique shows the column contains disperate data (1=no, 2=yes)
  • freq shows the number of times that column values from miss-matched rows are the same
  • pct shows the percent of times this column is responsible for the discrepancy between rows in the two versions of the output table

As you can see, the culprit columns are plant_name_eia, longitude, timezone, ferc_cogen_status, ferc_exempt_wholesale_generator, ferc_small_power_producer_docket_no, and ferc_small_power_producer

count unique freq pct
plant_name_eia 3974 2 2981 24.9874
city 3974 1 3974 0
county 3974 1 3974 0
latitude 3974 1 3974 0
longitude 3974 2 2975 25.1384
state 3974 1 3974 0
street_address 3974 1 3974 0
zip_code 3974 1 3974 0
timezone 3974 2 3879 2.39054
ash_impoundment 3974 1 3974 0
ash_impoundment_lined 3974 1 3974 0
ash_impoundment_status 3974 1 3974 0
balancing_authority_code_eia 3974 1 3974 0
balancing_authority_name_eia 3974 1 3974 0
datum 3974 1 3974 0
energy_storage 3974 1 3974 0
ferc_cogen_docket_no 3974 1 3974 0
ferc_cogen_status 3974 2 3970 0.100654
ferc_exempt_wholesale_generator_docket_no 3974 1 3974 0
ferc_exempt_wholesale_generator 3974 2 3970 0.100654
ferc_small_power_producer_docket_no 3974 2 3972 0.0503271
ferc_small_power_producer 3974 2 3970 0.100654
ferc_qualifying_facility_docket_no 3974 1 3974 0
grid_voltage_1_kv 3974 1 3974 0
grid_voltage_2_kv 3974 1 3974 0
grid_voltage_3_kv 3974 1 3974 0
iso_rto_code 3974 1 3974 0
liquefied_natural_gas_storage 3974 1 3974 0
natural_gas_local_distribution_company 3974 1 3974 0
natural_gas_storage 3974 1 3974 0
natural_gas_pipeline_name_1 3974 1 3974 0
natural_gas_pipeline_name_2 3974 1 3974 0
natural_gas_pipeline_name_3 3974 1 3974 0
nerc_region 3974 1 3974 0
net_metering 3974 1 3974 0
pipeline_notes 3974 1 3974 0
primary_purpose_id_naics 3974 1 3974 0
regulatory_status_code 3974 1 3974 0
reporting_frequency_code 3974 1 3974 0
sector_id_eia 3974 1 3974 0
sector_name_eia 3974 1 3974 0
service_area 3974 1 3974 0
transmission_distribution_owner_id 3974 1 3974 0
transmission_distribution_owner_name 3974 1 3974 0
transmission_distribution_owner_state 3974 1 3974 0
utility_id_eia 3974 1 3974 0
water_source 3974 1 3974 0
data_maturity 3974 1 3974 0
plant_id_pudl 3974 1 3974 0
utility_name_eia 3974 1 3974 0
utility_id_pudl 3974 1 3974 0
balancing_authority_code_eia_consistent_rate 3974 1 3974 0

@aesharpe aesharpe marked this pull request as ready for review September 21, 2022 21:30
It used to be a unique id test for plant_id_eia and emissions_unit_id_epa, but the crosswalk has duplicates of those ids due to the m:m nature of the data. Instead of an id test I implemented a min max row test.
@aesharpe aesharpe merged commit e46494e into dev Sep 22, 2022
@aesharpe aesharpe deleted the fix-crosswalk-induced-build-fails branch September 22, 2022 04:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epacems Integration and analysis of the EPA CEMS dataset. rmi
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants