Fix build error for epacamd_eia_test
#1940
aesharpe commented on Sep 19, 2022 (edited)
Codecov Report
Base: 82.7% // Head: 83.8% // Increases project coverage.

Coverage Diff (dev vs. #1940)
              dev    #1940     +/-
Coverage    82.7%    83.8%   +1.0%
Files          65       65
Lines        7398     8096    +698
Hits         6123     6785    +662
Misses       1275     1311     +36
Re: the FK errors. I think if the harvesting sees [...] If necessary, you can tell it to ignore some of the automatically generated FK relationships in the resource (table) definition.
epacamd_eia_test
Fixed the validation test for [...]
Now the issue is the min-max row tests. I looked at the annual [...] The rows that show up in the working build table (but not in the failed build table) are: [...]
I did some more digging and found out that, beyond the missing rows, there are 3974 rows that differ between the two tables. I looked at an example ([...]). Because the failed build actually has fewer rows than the working build, I don't think this is a harvesting issue. I'm also pretty sure that the harvesting happens before the glue tables are ever processed.
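One way to separate "missing rows" from "rows whose values changed" is to index both versions of a table by primary key and compare them. The sketch below uses only the standard library and made-up records; the column names, the `compare_tables` helper, and the choice of `(plant_id_eia, report_date)` as the key are assumptions for illustration, not the actual schema of the tables discussed here.

```python
def compare_tables(old_rows, new_rows, key_cols):
    """Compare two versions of a table, each given as a list of dicts.

    Assumes key_cols uniquely identify a row within each version.
    Returns (keys only in old, keys only in new, keys whose values differ).
    """
    def index(rows):
        return {tuple(row[c] for c in key_cols): row for row in rows}

    old, new = index(old_rows), index(new_rows)
    only_old = sorted(old.keys() - new.keys())
    only_new = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return only_old, only_new, changed


# Invented records standing in for the "working" and "failed" build tables.
working = [
    {"plant_id_eia": 1, "report_date": "2020-01-01", "plant_name_eia": "Alpha"},
    {"plant_id_eia": 2, "report_date": "2020-01-01", "plant_name_eia": "Beta"},
    {"plant_id_eia": 3, "report_date": "2020-01-01", "plant_name_eia": "Gamma"},
]
failed = [
    {"plant_id_eia": 1, "report_date": "2020-01-01", "plant_name_eia": "Alpha"},
    {"plant_id_eia": 2, "report_date": "2020-01-01", "plant_name_eia": "Beta Station"},
]

only_working, only_failed, changed = compare_tables(
    working, failed, ["plant_id_eia", "report_date"]
)
print("only in working build:", only_working)
print("only in failed build:", only_failed)
print("same key, different values:", changed)
```

Keeping the three groups apart makes it easier to tell whether a row-count failure comes from dropped records or from values that shifted within existing records.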
Foreign key problems were related to my local errors.
The only thing done to [...]
Maybe the ~8000 rows that are different could just be a small number of plants for which the most common plant name changed due to values that were harvested from the crosswalk? Also [...]
If I'm not mistaken, the harvesting process is outlined in the [...] from the [...]
The build outputs from Sept. 15th would have reflected anything that got merged into [...] If the first failed build was the one that happened the night of Sept. 15/16th, that would have been from anything merged into [...]
I thought the last successful build was from the 15th? Does that mean the morning of the 15th?
The last successful build outputs have a date of September 15th. The builds start at 1am and typically run until 5-6am, Mountain time, or 12-1pm UTC, which is what the GCP dates are in:
Ok, I tried the same thing but with the plants_eia860 table from the db (not the output tables) and the tables are the same. UPDATE: this was an error, I accidentally used the generators_eia860 table. Corrected below.
You said what was different was [...] This command (executed with [...]
I updated the screenshot above to show the commits from the 15th.
Wouldn't that be the 14th?
Nope. I just ran it.
Those tables are also the same.
Okay, so if you compare pre and post failure [...] But if you compare pre and post failure [...]
I goofed a little: the entity tables are the same from both successful and failed builds. But the [...]
Neither of the missing [...]
So the only thing that you're concerned about now is understanding why the addition of the [...]
That was just one example; there were other EIA tables that were affected (failed).
Okay, so in general the concern is that adding the crosswalk table seems to have affected other EIA tables, and that seems weird / wrong?
Can you diff all of the tables between these two databases you have and identify the differences?
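A database-wide diff like the one requested here could be sketched as a loop over every table the two builds share. The example assumes both builds are available as SQLite databases with identical column order per table; `table_names`, `diff_databases`, and the toy `plants` table are hypothetical helpers and data for illustration, not part of the project.

```python
import sqlite3


def table_names(conn):
    """Return the set of user table names in a SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
    return {name for (name,) in rows}


def diff_databases(old_conn, new_conn):
    """Summarize row-level differences for each table present in both databases."""
    summary = {}
    for table in sorted(table_names(old_conn) & table_names(new_conn)):
        # Table names come from sqlite_master rather than user input, so
        # interpolating them into the query is acceptable in this sketch.
        old_rows = old_conn.execute(f'SELECT * FROM "{table}"').fetchall()
        new_rows = new_conn.execute(f'SELECT * FROM "{table}"').fetchall()
        old_set, new_set = set(old_rows), set(new_rows)
        summary[table] = {
            "rows_old": len(old_rows),
            "rows_new": len(new_rows),
            "only_in_old": len(old_set - new_set),
            "only_in_new": len(new_set - old_set),
        }
    return summary


def make_db(plants):
    """Build a throwaway in-memory database with one toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plants (plant_id INTEGER, plant_name TEXT)")
    conn.executemany("INSERT INTO plants VALUES (?, ?)", plants)
    return conn


old_db = make_db([(1, "Alpha"), (2, "Beta"), (3, "Gamma")])
new_db = make_db([(1, "Alpha"), (2, "Beta Station")])
report = diff_databases(old_db, new_db)
for table, stats in report.items():
    print(table, stats)
```

A changed row shows up once in `only_in_old` and once in `only_in_new`, so tables where both counts are nonzero are the ones worth inspecting column by column.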
#1692 did modify more than 30 files so... it does seem like there could be some impacts in there. I will look over the changes.
Here's a record of the differences for the [...]
Looking back over the PR, the only thing that seems like it might have effects all over the EIA data is the re-write of the leading-zero fixer function. It does look like the new function should be doing the same thing as the old function, but it's notoriously easy to mess up regular expressions. What is the extent of the differences between the two builds? If it's just a couple of extra rows, or a half dozen individual fields for a couple of random plants 20 years ago, then that doesn't seem scary and we should probably just update the expected number of records. If there are thousands of rows in multiple tables where the contents have changed as you said above, that seems more concerning. Is that really happening?
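To illustrate how a small regex change can alter ID cleaning, here is a hypothetical pair of leading-zero strippers run over the same inputs. Neither function is the project's actual fixer; both patterns and the sample IDs are assumptions chosen only to show the kind of regression check that would catch a behavioral difference between an old and a new implementation.

```python
import re

# Matches strings made up entirely of digits that start with one or more zeros.
NUMERIC_WITH_LEADING_ZEROS = re.compile(r"^0+(\d+)$")


def strip_leading_zeros(value: str) -> str:
    """Drop leading zeros from all-digit strings and leave other strings alone."""
    return NUMERIC_WITH_LEADING_ZEROS.sub(r"\1", value)


def strip_leading_zeros_loose(value: str) -> str:
    """A deliberately looser variant that strips leading zeros from any string."""
    return re.sub(r"^0+", "", value)


samples = ["001", "000", "10", "0A1", "GT01"]
disagreements = [
    s for s in samples if strip_leading_zeros(s) != strip_leading_zeros_loose(s)
]
for s in disagreements:
    print(repr(s), "->", repr(strip_leading_zeros(s)), "vs", repr(strip_leading_zeros_loose(s)))
```

Running both versions over the distinct ID values from a real table and listing the disagreements would show directly whether a rewrite changed any outputs.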
I adjusted the row count to account for the -2 rows in each of the [...] In case we want to revisit this later (and maybe add some regression analysis), I took stock of the types of columns that are changing in the [...] There are 2 fewer rows in the current table (than in the version before adding the epacamd_eia crosswalk from PR #1692). There are also 3974 rows (out of 185071 or 185069 rows, depending on whether you compare the old or new version) that differ between the two tables. The following table shows, for the [...]
As you can see, the culprit columns are [...]
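Taking stock of which columns change between two versions of a table can be done by tallying, per column, how many key-matched rows hold a different value. This is a minimal sketch with invented data; `changed_column_counts`, the `city` column, and the single-column key are assumptions, not the columns or method used in the comparison above.

```python
from collections import Counter


def changed_column_counts(old_rows, new_rows, key_cols):
    """Count, per column, the key-matched rows whose value differs between versions."""
    old = {tuple(row[c] for c in key_cols): row for row in old_rows}
    counts = Counter()
    for row in new_rows:
        key = tuple(row[c] for c in key_cols)
        if key not in old:
            continue  # rows missing from the old version are not "changed"
        for col, value in row.items():
            if old[key].get(col) != value:
                counts[col] += 1
    return counts


old_version = [
    {"plant_id_eia": 1, "plant_name_eia": "Alpha", "city": "Boise"},
    {"plant_id_eia": 2, "plant_name_eia": "Beta", "city": "Reno"},
    {"plant_id_eia": 3, "plant_name_eia": "Gamma", "city": "Provo"},
]
new_version = [
    {"plant_id_eia": 1, "plant_name_eia": "Alpha Plant", "city": "Boise"},
    {"plant_id_eia": 2, "plant_name_eia": "Beta Station", "city": "Sparks"},
    {"plant_id_eia": 4, "plant_name_eia": "Delta", "city": "Elko"},
]

column_counts = changed_column_counts(old_version, new_version, ["plant_id_eia"])
for column, n in column_counts.most_common():
    print(column, n)
```

Note that plain `!=` treats two float NaN values as different, so null handling would need attention before applying this to real data.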
It used to be a unique ID test for plant_id_eia and emissions_unit_id_epa, but the crosswalk has duplicates of those IDs due to the many-to-many (m:m) nature of the data. Instead of an ID test, I implemented a min-max row test.
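The trade-off described here can be sketched in a few lines: a uniqueness check on `(plant_id_eia, emissions_unit_id_epa)` fails on many-to-many data, while a row-count bounds check still guards against large unexpected changes. The helper names, the `generator_id` column, the sample rows, and the bounds are all hypothetical; this is not the project's validation code.

```python
def has_duplicate_ids(rows, id_cols):
    """Return True if any combination of id_cols appears more than once."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in id_cols)
        if key in seen:
            return True
        seen.add(key)
    return False


def check_row_count(rows, min_rows, max_rows):
    """Raise ValueError unless the number of rows lies within [min_rows, max_rows]."""
    n = len(rows)
    if not min_rows <= n <= max_rows:
        raise ValueError(f"Expected {min_rows} to {max_rows} rows, found {n}")
    return n


# Invented crosswalk rows: one emissions unit maps to two generators (m:m data).
crosswalk = [
    {"plant_id_eia": 3, "emissions_unit_id_epa": "1", "generator_id": "1"},
    {"plant_id_eia": 3, "emissions_unit_id_epa": "1", "generator_id": "2"},
    {"plant_id_eia": 7, "emissions_unit_id_epa": "CT1", "generator_id": "GT1"},
]

duplicated = has_duplicate_ids(crosswalk, ["plant_id_eia", "emissions_unit_id_epa"])
n_rows = check_row_count(crosswalk, min_rows=2, max_rows=5)
print("duplicate IDs present:", duplicated)
print("row count within bounds:", n_rows)
```

A bounds check is weaker than a uniqueness check because it says nothing about which rows are present, so it works best alongside a content comparison when the counts shift.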