Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate XBRL taxonomy metadata into plant_in_service transform #2058

Merged
merged 15 commits into from
Nov 15, 2022

Conversation

zaneselvans
Copy link
Member

No description provided.

@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition labels Nov 11, 2022
@zaneselvans zaneselvans linked an issue Nov 11, 2022 that may be closed by this pull request
14 tasks
@codecov
Copy link

codecov bot commented Nov 11, 2022

Codecov Report

Base: 84.6% // Head: 85.1% // Increases project coverage by +0.5% 🎉

Coverage data is based on head (f708893) compared to base (e5c530b).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@                Coverage Diff                 @@
##           xbrl_integration   #2058     +/-   ##
==================================================
+ Coverage              84.6%   85.1%   +0.5%     
==================================================
  Files                    72      72             
  Lines                  8114    8119      +5     
==================================================
+ Hits                   6865    6910     +45     
+ Misses                 1249    1209     -40     
Impacted Files Coverage Δ
src/pudl/glue/ferc1_eia.py 96.0% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/transform/classes.py 93.9% <ø> (ø)
src/pudl/etl.py 89.8% <100.0%> (+<0.1%) ⬆️
src/pudl/extract/ferc1.py 87.6% <100.0%> (+0.2%) ⬆️
src/pudl/transform/ferc1.py 95.6% <100.0%> (+8.7%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans zaneselvans changed the title Fix bad multi-index construction that was scrambling XBRL columns. Integrated XBRL taxonomy metadata into plant_in_service transform Nov 12, 2022
@zaneselvans zaneselvans changed the title Integrated XBRL taxonomy metadata into plant_in_service transform Integrate XBRL taxonomy metadata into plant_in_service transform Nov 12, 2022
@zaneselvans
Copy link
Member Author

Welp, made it most of the way through, but then ran into an error with the plant mapping it looks like, in this test:

test_for_fk_validation_and_unmapped_ids[missing_utility_id_pudl_in_utilities_ferc1]

I would assume this is something from upstream since I haven't touched any of the plant mapping... but it looks like the xbrl_integration branch PR #1665 is passing? Ugh.

I added this unit test while debugging a bunch of glue test failures,
which were somehow being triggered by the addition of the XBRL taxonomy
metadata. Whatever it's reading is from the XBRL metadata in the glue
tests... it's unable to parse as JSON, resulting in a
NotImplementedError.

This problem does *not* arise with the new XBRL metadata integration
test, so it seems there's something weird going on with the glue test
environment in particular.

Another strange thing I noticed is that even if you run the tests with
--live-dbs the glue tests run the ETL anyway! Not sure why this would be
happening. Running

```
pytest --live-dbs test/integration/etl_test.py::test_ferc1_etl
```

also doesn't exhibit this behavior, and it finishes in just a few
seconds.

I split the FERC 1 ETL test into two tests, one for DBF and one for XBRL,
and renamed them:

  * test_ferc1_dbf2sqlite()
  * test_ferc1_xbrl2sqlite()

Rather than "etl" since they're only doing the conversions of the raw data. This lets
you just test one or the other too, which is more convenient when the problem you're
working on is just on one side.
@zaneselvans
Copy link
Member Author

@cmgosnell I think something odd is going on with the glue tests you recently set up. They're running the ETL independently even when the tests are run with --live-dbs, and they're failing on the XBRL metadata extraction step, even though that step works fine in the main ETL tests. You can run the following to see the difference:

# Tries to run the FERC1 XBRL ETL and fails:
pytest --live-dbs test/integration/glue_test.py::test_unmapped_utils_eia

# Also tries to run the FERC 1 XBRL ETL (as expected) and fails:
pytest test/integration/glue_test.py::test_unmapped_utils_eia

# Runs the FERC 1 XBRL extraction, and the JSON normalization in isolation and succeeds:
pytest test/integration/etl_test.py::test_ferc1_xbrl2sqlite

# Runs the whole PUDL ETL including extracting FERC 1 DBF + XBRL and succeeds:
pytest test/integration/etl_test.py::test_pudl_engine

My guess is that in this unexpected (to me) separate ETL process that the glue tests are running, either the input parameters or the output location is different than in the main ETL, and it's reading something unexpected in (or maybe nothing) when it tries to ingest the XBRL taxonomy metadata, and that's why the call to pd.json_normalize() is failing.

I forgot that the glue tests have their own abbreviated version of the
transformers, and they don't need any XBRL metadata. This commit allows
a table transformer to be instantiated without any metadata without
having any issues.
@zaneselvans
Copy link
Member Author

Ah @cmgosnell I figured it out. I forgot that the glue tests have their own abbreviated XBRL transformers that just to the pre-processing so they can grab the plant / utility IDs. I fixed it so that the XBRL taxonomy metadata inputs are optional.

Really it should only be fed into the transformers that need it, like plant_in_service. Or alternatively I guess it could be a class attribute that's stored in the Ferc1AbstractTableTransformer class -- then the same metadata wouldn't end up getting stored in lots of different classes. But then that abstract class would have to know how to read it in. Maybe that would be okay / better?

Initial draft of a process that allows the XBRL taxonomy metadata to be accessed in the
table transforms, and also allows the metadata to be transformed as appropriate for the
individual tables (which is necessary when we are doing reshaping, renaming, etc.

Currently the only way that the metadata is being utilized is in changing the sign
convention of reported values so they can be aggregated using sum() directly.

Some outstanding questions:

 * Should the metadata be stored in a separate normalized table, and combined with
   the data only in output methods / DB views?
 * What metadata columns do we actually want to retain?
 * Where should the bulk normalized metadata that pertains to all of the FERC 1 tables
   be stored while during the transforms? Really it seems like it pertains to the
   Ferc1AbstractTableTransformer() since it's relevant to *all* the tables. But that
   would mean making it a class attribute, and somehow making that class capable of
   reading in the metadata independently. Not sure that's a good idea.
 * Should we keep the calculated values in the table until we're able to aggregate
   and reproduce them? If so then they need to be flagged as calculated so they can
   removed from other calculations.

Known Issues:
 * It looks like FERC accounts that have to get renamed (many in the general plant
   categories) aren't getting renamed everywhere. Their account numbers aren't showing
   up in the metadata we merge in...
 * Actually it looks like the *data* isn't coming through. Need to dig into this wtf.
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't love that you are using apply_sign_conventions to both apply the sign and merge the metadata into the table. can you add a merge it method and then an apply sign method

your also storing xbrl_metadata_normalized in a few places and editing it and that feels a little bad

if we make the table-specific metadata, we can pass in and cache just the table-specific metadata into the transformer class. Then you could have a merge-in-the-metadata method that grabs the cached metadata, normalizes it and then merges it.

src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
src/pudl/transform/ferc1.py Outdated Show resolved Hide resolved
Changes in response to comments from @cmgosnell on PR #2058

* Get rid of confusing code that turned empty calculations into NA
  values.
* Use a single consistent name for xbrl_metadata_json throughout.
* Separate metadata merging and application of sign conventions.
* Also moved the translation of "credit" and "debit" into numerical
  weights into the `apply_sign_conventions` method so it's clearer
  where those numbers are coming from (and since that translation
  isn't really metadata normalization...)
@zaneselvans zaneselvans merged commit 14bd607 into xbrl_integration Nov 15, 2022
@zaneselvans zaneselvans deleted the agg-pis-xbrl branch November 15, 2022 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition
Projects
No open projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

Transform plant_in_srvce xbrl + dbf
2 participants