Fix build error for `epacamd_eia_test` #1940

aesharpe · 2022-09-19T21:50:47Z

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_raw] - Attribut...
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_annual] - Attri...
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-185071-185071-185071]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-184260-184260-184260]
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-119158-119158-119158]
FAILED test/validate/epacamd_eia_test.py::test_unique_ids[eia_monthly] - Attr...
= 12 failed, 469 passed, 120 skipped, 15 xfailed, 2 xpassed, 29 warnings in 9422.54s (2:37:02) =

codecov · 2022-09-19T22:28:28Z

Codecov Report

Base: 82.7% // Head: 83.8% // Increases project coverage by +1.0% 🎉

Coverage data is based on head (8dd15ed) compared to base (29a651a).
Patch has no changes to coverable lines.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #1940     +/-   ##
=======================================
+ Coverage   82.7%   83.8%   +1.0%     
=======================================
  Files         65      65             
  Lines       7398    8096    +698     
=======================================
+ Hits        6123    6785    +662     
- Misses      1275    1311     +36

Impacted Files	Coverage Δ
src/pudl/transform/ferc714.py	`87.2% <0.0%> (-1.1%)`	⬇️
src/pudl/analysis/plant_parts_eia.py	`96.3% <0.0%> (-0.3%)`	⬇️
src/pudl/metadata/enums.py	`100.0% <0.0%> (ø)`
src/pudl/metadata/fields.py	`100.0% <0.0%> (ø)`
src/pudl/metadata/resources/eia.py	`100.0% <0.0%> (ø)`
src/pudl/output/ferc714.py	`93.5% <0.0%> (+0.6%)`	⬆️
src/pudl/helpers.py	`89.1% <0.0%> (+1.4%)`	⬆️
src/pudl/metadata/classes.py	`85.5% <0.0%> (+3.3%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

zaneselvans · 2022-09-20T03:18:44Z

Re: the FK errors. I think if the harvesting sees (plant_id_eia, boiler_id) in the columns, then it will try and harvest the values it sees in those columns for inclusion in the boiler entity table, and it will also assume that there should be an FK relationship between those columns and the boiler table, which is correct. I'd think it would ignore NA values in both of those cases, but maybe it doesn't? It seems like this would have come up in other tables too.

If necessary you can tell it to ignore some of the automatically generated FK relationships in the resource (table) definition.

aesharpe · 2022-09-20T23:12:01Z

Fixed the validation test for test/validate/epacamd_eia_test.py

aesharpe · 2022-09-20T23:18:17Z

Now the issue is the min-max row tests.

I looked at the annual plants_eia860 table which was one of the ones that failed. It appears that the table resulting from the failed build has two less rows than the table resulting from the successful build (Sept 15th).

The rows that show up in the working build table (but not in the failed build table) are:

2022-01-01 | 62844 | Cranberry Point Energy Storage
2022-01-01 | 64835 | Silver Queen Wind Farm

I did some more digging and found out that beyond missing rows, there are 3974 rows that differ between the two tables. I looked at an example (2022-01-01 | 195) and see that the only difference is in the plant_name_eia field. The working build table says Narrows (AR) and the failed build table says Narrows.

Because the failed build actually has fewer rows than the working build, I don't think this is a harvesting issue. I'm also pretty sure that the harvesting happens before the glue tables are ever processed.

aesharpe · 2022-09-20T23:21:05Z

Re: the FK errors. I think if the harvesting sees (plant_id_eia, boiler_id) in the columns, then it will try and harvest the values it sees in those columns for inclusion in the boiler entity table, and it will also assume that there should be an FK relationship between those columns and the boiler table, which is correct. I'd think it would ignore NA values in both of those cases, but maybe it doesn't? It seems like this would have come up in other tables too.

If necessary you can tell it to ignore some of the automatically generated FK relationships in the resource (table) definition.

Foreign key problems were related to my local errors

aesharpe · 2022-09-20T23:30:38Z

The only thing done to dev between the 15th and the fail was some documentation updates and, and the re-addition of the epacamd_eia analysis module used by Greg. There is no output table attached to the analysis module, it's just some useful functions to restructure the crosswalk and calculate sub-plants.

zaneselvans · 2022-09-20T23:31:08Z

The ~8000 rows that are different maybe that could just be a small number of plants for which the most common plant name changed due to values that were harvested from the crosswalk?

Also plant_name_eia should be in the plants_entity_eia table, not the plants_eia860 table. What differences do you see between those two tables (not using the output functions)

aesharpe · 2022-09-20T23:35:45Z

The ~8000 rows that are different maybe that could just be a small number of plants for which the most common plant name changed due to values that were harvested from the crosswalk?

If I'm not mistaken, the harvesting process is outlined in the eia.py transform module which gets called in the _etl_eia() function in the etl.py module which happens before (and therefore seperate from) _etl_glue() which processes the crosswalk table.

from the etl() function in etl.py

    if datasets.get("ferc1", False):
        sqlite_dfs.update(_etl_ferc1(datasets["ferc1"], pudl_settings))
    if datasets.get("eia", False):
        sqlite_dfs.update(_etl_eia(datasets["eia"], ds_kwargs))
    if datasets.get("glue", False):
        sqlite_dfs.update(
            _etl_glue(datasets["glue"], ds_kwargs, sqlite_dfs, datasets["eia"])
        )

zaneselvans · 2022-09-20T23:36:04Z

The build outputs from Sept. 15th would have reflected anything that got merged into dev on September 14th (that was successful).

If the first failed build was the one that happened the night of Sept. 15/16th, that would have been from anything merged into dev on Sept. 15th. So what happened on the 15th?

aesharpe · 2022-09-20T23:37:03Z

If the first failed build was the one that happened the night of Sept. 15/16th, that would have been from anything merged into dev on Sept. 15th. So what happened on the 15th?

I thought the last successful build was from the 15th? Does that mean morning of the 15ths?

zaneselvans · 2022-09-20T23:40:10Z

The last successful build outputs have a date of September 15th. The builds start at 1am and typically run until 5-6am, Mountain time, or 12-1pm UTC, which is what the GCP dates are in:

 843649024  2022-09-15T12:27:02Z  gs://intake.catalyst.coop/dev/censusdp1tract.sqlite
 761257984  2022-09-15T12:27:04Z  gs://intake.catalyst.coop/dev/ferc1.sqlite
5110330869  2022-09-15T12:28:59Z  gs://intake.catalyst.coop/dev/hourly_emissions_epacems.parquet
 702459904  2022-09-15T12:27:01Z  gs://intake.catalyst.coop/dev/pudl.sqlite

aesharpe · 2022-09-20T23:45:01Z

Also plant_name_eia should be in the plants_entity_eia table, not the plants_eia860 table. What differences do you see between those two tables (not using the output functions)

Ok I tried the same thing but with the the plants_eia860 table from the db (not the output tables) and the tables are the same....

UPDATE: this was an error, I accidentally used the generators_eia860 table. Corrected below.

zaneselvans · 2022-09-20T23:49:16Z

Ok I tried the same thing but with the the plants_eia860 table from the db (not the output tables) and the tables are the same....

You said what was different was plant_name_eia and that field is in plants_entity_eia so that is the table to compare between the two DBs if you're looking for plant name discrepancies

This command (executed with dev checked out) will show all the commits that came into dev on 2022-09-15 -- it's just the crosswalk PR.

git log --first-parent --since 2022-09-14 --until 2022-09-15

commit f13904e4b43938d540066b37cfb3ff12e4c11579
Merge: 6413c0fa 80e560b4
Author: Austen Sharpe <49878195+aesharpe@users.noreply.github.com>
Date:   Thu Sep 15 11:00:03 2022 -0600

    Merge pull request #1692 from catalyst-cooperative/add-epacems-crosswalk-to-etl
    
    Add epacamd-eia crosswalk to etl

aesharpe · 2022-09-20T23:49:29Z

So what happened on the 15th?

I updated the screenshot above to show the commits from the 15th

aesharpe · 2022-09-20T23:50:45Z

--since 2022-09-14 --until 2022-09-15

Wouldn't that be the 14th?

zaneselvans · 2022-09-20T23:51:22Z

Nope. I just ran it.

aesharpe · 2022-09-20T23:57:38Z

You said what was different was plant_name_eia and that field is in plants_entity_eia so that is the table to compare between the two DBs if you're looking for plant name discrepancies

Those tables are also the same.

zaneselvans · 2022-09-21T00:01:49Z

Okay so if you compare pre & post failure pudl_out.plants_eia860() dataframes you find differences in plant_name_eia

But if you compare pre and post failure plants_entity_eia tables straight from the database, you find that they are identical?

aesharpe · 2022-09-21T00:02:36Z

I goofed a little, the entity tables are the same from both successful and failed builds. But the plants_eia860 tables from the db differ in length by two rows (the same rows as above). Excluding those rows, there are 8 more rows that are different between the failed and working builds for the db version of the plants_eia860 table. None of the plant_id_eia values from the rows that differ between builds are in the crosswalk.

aesharpe · 2022-09-21T00:11:39Z

Neither of the missing plant_id_eia rows 64835 , 62844 are in the crosswalk either.

zaneselvans · 2022-09-21T00:12:20Z

So the only thing that you're concerned about now is understanding why the addition of the epacamd_eia crosswalk table seems to have affected the contents of the plants_eia860 table?

aesharpe · 2022-09-21T00:13:16Z

So the only thing that you're concerned about now is understanding why the addition of the epacamd_eia crosswalk table seems to have affected the contents of the plants_eia860 table?

That was just one example, there were other EIA tables that were affected (failed)

zaneselvans · 2022-09-21T00:15:13Z

Okay so in general the concern is that adding the crosswalk table seems to have affected other EIA tables, and that seems weird / wrong?

zaneselvans · 2022-09-21T00:15:35Z

Can you diff all of the tables between these two databases you have and identify the differences?

zaneselvans · 2022-09-21T00:16:36Z

#1692 did modify more than 30 files so... it does seem like there could be some impacts in there. I will look over the changes.

aesharpe · 2022-09-21T00:26:19Z

Here's a record of the differences for the plants_eia860 table from the db:

plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_cogen_status goes from FALSE to None
plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_exempt_wholesale_generator goes from FALSE to None
plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_small_power_producer_docket_no goes from 86-501-000 to None
plant_id_eia: 10012 | report_year: 2002 & 2001--> ferc_small_power_producer goes from TRUE to None
plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_cogen_status goes from FALSE to None
plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_exempt_wholesale_generator goes from FALSE to None
plant_id_eia: 7549 | report_year: 2002 & 2001--> ferc_small_power_producer goes from FALSE to None

zaneselvans · 2022-09-21T00:47:39Z

Looking back over the PR, the only thing that seems like it might have effects all over the EIA data is the re-write of the leading-zero fixer function. It does look like the new function should be doing the same thing as the old function, but it's notoriously easy to mess up regular expressions.

What is the extent of the differences between the two builds? If it's just a couple of extra rows, or a half dozen individual fields for a couple of random plants 20 years ago, then that doesn't seem scary and we should probably just update the expected number of records.

If there are thousands of rows in multiple tables where the contents have changed as you said above, that seems more concerning. Is that really happening?

…ables

aesharpe · 2022-09-21T21:16:43Z

I adjusted the row count to account for the -2 rows in each of the pu_eia860, plants_eia860, and utils_eia860 tables. I still don't know why this is happening.

In case we want to revisit this later (and maybe add some regression analysis), I took stock of the types of columns that are changing in the plants_eia860 output table.

There are 2 fewer rows in the current table (than in the version before adding the epacamd_eia crosswalk from PR #1692). There are also 3974 rows (out of 185071 or 185069 rows--depending on whether you compare the old or new version) that differ between the two tables.

The following table shows, for the plants_eia860 output table, what column data has changed. It compares the 3974 rows that differ between the tables to see which columns (represented as the index here) are the culprits.

count shows the total number of rows
unique shows the column contains disperate data (1=no, 2=yes)
freq shows the number of times that column values from miss-matched rows are the same
pct shows the percent of times this column is responsible for the discrepancy between rows in the two versions of the output table

As you can see, the culprit columns are plant_name_eia, longitude, timezone, ferc_cogen_status, ferc_exempt_wholesale_generator, ferc_small_power_producer_docket_no, and ferc_small_power_producer

	count	unique	freq	pct
plant_name_eia	3974	2	2981	24.9874
city	3974	1	3974	0
county	3974	1	3974	0
latitude	3974	1	3974	0
longitude	3974	2	2975	25.1384
state	3974	1	3974	0
street_address	3974	1	3974	0
zip_code	3974	1	3974	0
timezone	3974	2	3879	2.39054
ash_impoundment	3974	1	3974	0
ash_impoundment_lined	3974	1	3974	0
ash_impoundment_status	3974	1	3974	0
balancing_authority_code_eia	3974	1	3974	0
balancing_authority_name_eia	3974	1	3974	0
datum	3974	1	3974	0
energy_storage	3974	1	3974	0
ferc_cogen_docket_no	3974	1	3974	0
ferc_cogen_status	3974	2	3970	0.100654
ferc_exempt_wholesale_generator_docket_no	3974	1	3974	0
ferc_exempt_wholesale_generator	3974	2	3970	0.100654
ferc_small_power_producer_docket_no	3974	2	3972	0.0503271
ferc_small_power_producer	3974	2	3970	0.100654
ferc_qualifying_facility_docket_no	3974	1	3974	0
grid_voltage_1_kv	3974	1	3974	0
grid_voltage_2_kv	3974	1	3974	0
grid_voltage_3_kv	3974	1	3974	0
iso_rto_code	3974	1	3974	0
liquefied_natural_gas_storage	3974	1	3974	0
natural_gas_local_distribution_company	3974	1	3974	0
natural_gas_storage	3974	1	3974	0
natural_gas_pipeline_name_1	3974	1	3974	0
natural_gas_pipeline_name_2	3974	1	3974	0
natural_gas_pipeline_name_3	3974	1	3974	0
nerc_region	3974	1	3974	0
net_metering	3974	1	3974	0
pipeline_notes	3974	1	3974	0
primary_purpose_id_naics	3974	1	3974	0
regulatory_status_code	3974	1	3974	0
reporting_frequency_code	3974	1	3974	0
sector_id_eia	3974	1	3974	0
sector_name_eia	3974	1	3974	0
service_area	3974	1	3974	0
transmission_distribution_owner_id	3974	1	3974	0
transmission_distribution_owner_name	3974	1	3974	0
transmission_distribution_owner_state	3974	1	3974	0
utility_id_eia	3974	1	3974	0
water_source	3974	1	3974	0
data_maturity	3974	1	3974	0
plant_id_pudl	3974	1	3974	0
utility_name_eia	3974	1	3974	0
utility_id_pudl	3974	1	3974	0
balancing_authority_code_eia_consistent_rate	3974	1	3974	0

It used to be a unique id test for plant_id_eia and emissions_unit_id_epa, but the crosswalk has duplicates of those ids due to the m:m nature of the data. Instead of an id test I implemented a min max row test.

Fix validation test for epacamd_eia_test

8e97b0b

aesharpe added the rmi label Sep 19, 2022

aesharpe requested a review from zaneselvans September 19, 2022 21:50

aesharpe self-assigned this Sep 19, 2022

zaneselvans approved these changes Sep 20, 2022

View reviewed changes

zaneselvans added the epacems Integration and analysis of the EPA CEMS dataset. label Sep 20, 2022

aesharpe changed the title ~~Fix validation test for epacamd_eia_test~~ Fix build error for epacamd_eia_test Sep 20, 2022

aesharpe closed this Sep 21, 2022

aesharpe reopened this Sep 21, 2022

Adjust row count to reflect current length of pu, utils, and plants t…

751e5dc

…ables

aesharpe marked this pull request as ready for review September 21, 2022 21:30

Change epacamd_eia validation test

8dd15ed

It used to be a unique id test for plant_id_eia and emissions_unit_id_epa, but the crosswalk has duplicates of those ids due to the m:m nature of the data. Instead of an id test I implemented a min max row test.

aesharpe merged commit e46494e into dev Sep 22, 2022

aesharpe deleted the fix-crosswalk-induced-build-fails branch September 22, 2022 04:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix build error for `epacamd_eia_test` #1940

Fix build error for `epacamd_eia_test` #1940

aesharpe commented Sep 19, 2022 •

edited

Loading

codecov bot commented Sep 19, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022 •

edited

Loading

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022 •

edited

Loading

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 •

edited

Loading

aesharpe commented Sep 21, 2022 •

edited

Loading

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 •

edited

Loading

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 •

edited

Loading

Fix build error for epacamd_eia_test #1940

Fix build error for epacamd_eia_test #1940

Conversation

aesharpe commented Sep 19, 2022 • edited Loading

codecov bot commented Sep 19, 2022 • edited Loading

Codecov Report

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022 • edited Loading

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022 • edited Loading

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022 • edited Loading

zaneselvans commented Sep 20, 2022 • edited Loading

aesharpe commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 20, 2022

aesharpe commented Sep 20, 2022

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 • edited Loading

aesharpe commented Sep 21, 2022 • edited Loading

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 • edited Loading

zaneselvans commented Sep 21, 2022

aesharpe commented Sep 21, 2022 • edited Loading

Fix build error for `epacamd_eia_test` #1940

Fix build error for `epacamd_eia_test` #1940

aesharpe commented Sep 19, 2022 •

edited

Loading

codecov bot commented Sep 19, 2022 •

edited

Loading

aesharpe commented Sep 20, 2022 •

edited

Loading

aesharpe commented Sep 20, 2022 •

edited

Loading

aesharpe commented Sep 20, 2022 •

edited

Loading

zaneselvans commented Sep 20, 2022 •

edited

Loading

aesharpe commented Sep 21, 2022 •

edited

Loading

aesharpe commented Sep 21, 2022 •

edited

Loading

aesharpe commented Sep 21, 2022 •

edited

Loading

aesharpe commented Sep 21, 2022 •

edited

Loading