Allow concatenation of tall DBF with wide XBRL data #2014

zaneselvans · 2022-10-24T22:44:07Z

Tasks

Create a CSV file format that allows us to identify which combinations of (report_year, row_number) should be associated with what XBRL column name for a given table. This file will need to store all unique combinations of (sched_table_name, report_year, row_number, row_literal).
Manually fill in a column in that CSV file indicating the stem of the corresponding XBRL column names (without the suffixes indicating _additions, _retirements etc.) for the f1_plant_in_srvce table. Across all the FERC 1 DBF data, there are 4260 unique combinations. The f1_plant_in_srvce table has 198 combos. So this is tedious but very doable, even for all tables.
Migrate old rename dictionary from the DBF data into the new metadata.
Write a plant_in_service_ferc1 version of merge_instant_and_duration_tables_xbrl() that converts the multiple years of data into starting/ending balance columns (see Reconcile multiple years of data in XBRL instant tables #2021 for notes).
Write a plant_in_service_ferc1 version of process_xbrl() that reshapes from wide to tidy format for concatenation with the DBF data.
Write a plant_in_service_ferc1 version of process_dbf() that aligns report_year and row_number with the account IDs that are used in the XBRL data based on our manual mapping.
Figure out a way to apply the correct signs to the debit vs. credit values such that aggregations can be simple sums. This may need to be done differently for XBRL and DBF data, in which case it should be done prior to concatenation. Otherwise it could be done in transform_main() but potentially referring to the same parameters/mapping/metadata as was used for the reshaping and alignment.
Don't warn about duplicate record_id values when they're found in a reshaped table.

Notes from CSV Compilation

The plant_in_service DBF table has 198 different combinations of year, row number, and row literal. The XBRL table has 96 columns. This DBF table changed once in 2003 and again in 2006.
The XBRL columns for the plant_in_service table include no totals which makes me suspect that there's some kind of structural metadata in the XBRL that says which values should be added up to generate the previously calculated subtotals / totals. This could be really useful if we can use it to programmatically group the old DBF values too.
The XBRL does include columns for the header rows, so I wonder if that's where the totals are stored, or if the metadata indicating which columns should be aggregated to get that number are associated with the header?
There appear to be some new values reported as of 2021, mostly related to energy storage, which makes sense. The one old category where energy storage was previously mentioned also seems to have changed its name as of the XBRL transition... the old "(363) Storage Battery Equipment" seems to have become energy_storage_equipment_distribution_plant. Having the FERC Account numbers would remove a lot of potential ambiguity.
Compiling these mappings without having the FERC Account numbers is more annoying than it needs to be. Getting that metadata programmatically accessible would let us start each table with a 90% filled in draft I think, and just need to fill in the weird gaps.
Within the plant_in_service table there are 2 levels of headers, but only Production Plant is subdivided (by type of generation -- hydro, steam, nuclear, etc.) while all of the other numbers are reported either at the top level (not under any header, for FERC Account 102 and 103) or the first level (a big utility plant category like distribution_plant).
There are some accounts which show up across several different kinds of production plant. Would it be useful to be able to add them up across the different production plant categories? (e.g. all of the structures and improvements, no matter whether they're associated with hydro or steam plants).
The only column that requires negation is FERC Account 102 (it is both electric_plant_purchased and electric_plant_sold). The plant sold is a credit rather than a debit to the plant balance, so all values in that category need to be negated. This is important because in the normal categories the individual columns also have different credit/debit categorizations -- additions, transfers, and adjustments are booked as "debits" and retirements are booked as "credits". It'll be much better if we can
After some further discussion of the XBRL taxonomy with Zach (see comments below) it seems like 100% of the XBRL columns can be mapped to their corresponding DBF rows across all years of DBF data!

The text was updated successfully, but these errors were encountered:

zaneselvans · 2022-10-26T14:28:55Z

@zschira in compiling this column mapping I noticed that the XBRL data has no totals, and I know XBRL lets you specify what collections of values should be added up to create totals and subtotals, so I suspect that they've encoded the totals and subtotals that way. Did you see anything like that in the XBRL metadata / taxonomies when you were poking around? I think they might be called "calculation arcs?"

Also, compiling these maps will definitely be much easier when we can get programmatic access to the FERC Account numbers associated with each of the XBRL columns. I think it'll cover more than 90% of the columns in many tables.

zaneselvans · 2022-10-26T21:27:56Z

Notes from chat with @zschira

Looking at some details of the XBRL taxonomy for the Plant in Service table it appears that the values stored in the instant table are the start-of-year and end-of-year balances.
It could be that the values are "end of last year" and "start of this year" which would explain why they're always exactly the same. But also that would be a strange combination of values to report, so @zschira is going to look into whether those really are the values in the original XBRL data or if there might be an extraction issue. The property Name fields are exactly the same, even though they refer to different quantities which is odd.
There are totals columns in the duration tables, but they aren't labeled as totals -- they're just the category name (e.g. intangible_plant is the total intangible plant at the beginning / end of year, depending on the date associated with it and intangible_plant_additions is the total of all the additions of the sub-categories of intangible plant. Similarly electric_plant_in_service is the total balance at the beginning/end of year and electric_plant_in_service_additions is the total of all the additions across all of the electric_plant_in_service subcategories.
There's clearly structural information in the taxonomy that indicates which categories each of these columns belongs to for aggregation purposes, and that structure can be extracted, but FERC isn't using the calculation arcs that are available in XBRL, so all we can do is calculate our own totals and compare to the reported totals to see if we got the calculation right.
The "sign" of the columns is also reported in the XBRL metadata in a property named Balance which can be either debit or credit. We knew that we would need this kind of information for some weird accounts like (102) Electric Plant Purchased/Sold (purchased being a debit, sold being a credit). BUT it turns out that this property is also applied to other columns. E.g. the _additions columns are debits, but the _retirements columns are credits. So we probably really need to get at this information programmatically.

@zschira

Progress on #2012 #2014 * Fixed a bug in how the DBF row numbers that need to be mapped are identified. Now it looks for any time the row_literal associated with a row number has changed from one year to the next, rather than selecting the first instance of each distinct combination of row_literal and row_number. * Also discovered that there's an obscure row_status field that differentiates between annual (A) and quarterly (Q) row literals, and is part of the f1_row_lit_tbl primary key, but it only shows up in association with the f1_schedules_list table. I integrated it but... maybe that table should just be excluded from the row mapping template? * Added some (janky) helper functions to pudl.transform.ferc1 to manage the generation of the row maps. This location is temporary. They should probably become methods of a Ferc1 abstract transformer class for reshaped tables, or maybe end up in a different module. Not sure how they'll end up getting used yet though. * Updated the dbf_to_xbrl.csv file to include all of the possible rows that could need mapping (4270 in total). * Removed the XBRL specific metadata fields from the dbf_to_xbrl.csv file, since they should (hopefully) be available programmatically from the metadata @zschira is extracting from the XBRL taxonomies, and can be joined to this table based on the xbrl_column_stem. * Updated the plant_in_service transform to use the new row map. Need to test on all of the years.

zaneselvans · 2022-11-02T03:14:10Z

I found and fixed a bug in how the DBF row numbers that need to be mapped are identified. Now it looks for any time the row_literal associated with a row number has changed from one year to the next, rather than selecting the first instance of each distinct combination of row_literal and row_number.
Also discovered that there's an obscure row_status field that differentiates between annual (A) and quarterly (Q) row literals, and is part of the f1_row_lit_tbl primary key, but it only shows up in association with the f1_schedules_list table. I integrated it but... maybe that table should just be excluded from the row mapping template?
I Removed the XBRL specific metadata fields from the dbf_to_xbrl.csv file, since they should (hopefully) be available programmatically from the metadata @zschira is extracting from the XBRL taxonomies, and can be joined to this table based on the xbrl_column_stem.

zaneselvans added ferc1 Anything having to do with FERC Form 1 rmi xbrl Related to the FERC XBRL transition labels Oct 24, 2022

zaneselvans self-assigned this Oct 24, 2022

This was referenced Oct 24, 2022

Compile FERC account attributes for DBF/XBRL rows/cols #2015

Open

Create a general row-to-column table transformer #2012

Closed

Integrate FERC XBRL data into PUDL #1574

Open

zaneselvans linked a pull request Oct 26, 2022 that will close this issue

First draft of plant_in_service_ferc1 DBF + XBRL transform #2025

Merged

zaneselvans mentioned this issue Oct 28, 2022

Reconcile multiple years of data in XBRL instant tables #2021

Closed

zschira mentioned this issue Nov 2, 2022

Extract additional metadata from XBRL taxonomies catalyst-cooperative/ferc-xbrl-extractor#30

Closed

zaneselvans removed a link to a pull request Nov 4, 2022

First draft of plant_in_service_ferc1 DBF + XBRL transform #2025

Merged

cmgosnell closed this as completed Dec 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow concatenation of tall DBF with wide XBRL data #2014

Allow concatenation of tall DBF with wide XBRL data #2014

zaneselvans commented Oct 24, 2022 •

edited

Loading

zaneselvans commented Oct 26, 2022 •

edited

Loading

zaneselvans commented Oct 26, 2022

zaneselvans commented Nov 2, 2022

Allow concatenation of tall DBF with wide XBRL data #2014

Allow concatenation of tall DBF with wide XBRL data #2014

Comments

zaneselvans commented Oct 24, 2022 • edited Loading

Tasks

Notes from CSV Compilation

zaneselvans commented Oct 26, 2022 • edited Loading

zaneselvans commented Oct 26, 2022

Notes from chat with @zschira

zaneselvans commented Nov 2, 2022

zaneselvans commented Oct 24, 2022 •

edited

Loading

zaneselvans commented Oct 26, 2022 •

edited

Loading