Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile multiple years of data in XBRL instant tables #2021

Closed
Tracked by #2012
zaneselvans opened this issue Oct 25, 2022 · 4 comments
Closed
Tracked by #2012

Reconcile multiple years of data in XBRL instant tables #2021

zaneselvans opened this issue Oct 25, 2022 · 4 comments
Assignees
Labels
ferc1 Anything having to do with FERC Form 1 rmi xbrl Related to the FERC XBRL transition

Comments

@zaneselvans
Copy link
Member

In some of the XBRL tables representing instant facts, there's more than one value for date associated with a single report_year. For example:

pis_dbf = pd.read_sql("f1_plant_in_srvce", ferc1_engine)
pis_xbrl_duration = pd.read_sql("electric_plant_in_service_204_duration", ferc1_xbrl_engine)
pis_xbrl_instant = pd.read_sql("electric_plant_in_service_204_instant", ferc1_xbrl_engine)
pis_xbrl_instant.date.value_counts()
2021-12-31    208
2020-12-31    206
Name: date, dtype: int64

How should we deal with this? What does it mean?

In the DBF tables there are sometimes "this year" and "last year" data reported next to each other for comparison. Is that what's going on here? If so, then we only need to keep the current year of data. Need to identify which tables report this way in DBF and compare them to the tables with more than 1 year of data in the XBRL to see if they correspond.

  • If they do correspond then we should probably explicitly drop the "previous year" data when we are merging the instant and duration data in the XBRL transformers, rather than implicitly dropping those records in the merge.
  • Also check that this issue isn't coming up in the duration tables with start_date and end_date.
  • Log when we drop this data in the pre-concatenation XBRL processing.
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition labels Oct 25, 2022
@zaneselvans
Copy link
Member Author

zaneselvans commented Oct 26, 2022

At least in the plant_in_service table the two years of data that are reported are completely identical in both the instant table. Though strangely there are 206 records in one year and 208 in the other. Only a single year of data appears in the duration table -- but it gets doubled up in the merge as we're performing it now when we bring these two tables together.

ferc1_settings = pudl.settings.Ferc1Settings(
    tables=["plant_in_service_ferc1"],
    years=[2020, 2021],
)

raw_dbf = pudl.extract.ferc1.extract_dbf(
    ferc1_settings,
    pudl_settings,
)

raw_xbrl = pudl.extract.ferc1.extract_xbrl(
    ferc1_settings,
    pudl_settings,
)

# 414 records, 206 from 2021-12-31 & 208 from 2020-12-31
pis_xbrl_instant = raw_xbrl["plant_in_service_ferc1"]["instant"]
assert (pis_xbrl_instant.groupby(["entity_id", "date"]).nunique() > 1).sum().sum() == 0

# Only has 202 records, which aren't actually duplicated before the tables get merged.
pis_xbrl_duration = raw_xbrl["plant_in_service_ferc1"]["duration"]
assert (pis_xbrl_duration.groupby(["entity_id", "start_date"]).nunique() > 1).sum().sum() == 0

Update:

I am a dumbass and was grouping-by unique rows so of course everything was the same. Doing this correctly I find that 394 of the 414 records experienced some year-to-year change in the overall aggrete value electric_plant_in_service so this all seems clear now -- the values in the instant table are end-of-year for the year they pertain to, which lines up with the dates that are actually being used, and avoids the dumb start of year / end of year duplication. So, yay FERC?!

pis_xbrl_instant["changed"] = (pis_xbrl_instant.groupby(["entity_id"])["electric_plant_in_service"].transform("nunique") > 1)
pis_xbrl_instant[pis_xbrl_instant.changed]

@zaneselvans
Copy link
Member Author

  • If we want to allow aggregation within a given report year, or we want the XBRL & DBF data to be directly comparable, then we need to translate the end-of-last-year and end-of-this-year data which is currently showing up as single columns with different date values in the instant table into separate columns that correspond to beginning of this year, and end of this year.
  • If we were to keep both of those columns, they would be duplicative, since we should be able to calculate the end-of-year balance based on the starting balance and accumulated adjustments across the year (additions, retirements, etc.).
  • If we want to keep the database normalized (within a given year) then we should store just the beginning balance and the deltas (there's another kind of aggregation happening elsewhere, within accounting categories, but here we're just concerned with the temporal aggregation across a year).
  • For development & validation purposes it'll be useful to keep all these start/end of year and categorical aggregated values so we can check that our aggregations reflect the reported numbers. So don't drop them yet.

@zaneselvans
Copy link
Member Author

  • Not all of the data in the duration and instant tables line up with each other perfectly... There are a few entity_ids that show up ONLY in the instant table -- i.e. they report starting and/or ending balances, but no additions, retirements, etc.
  • Maybe this is what we would expect for a utility that has plant in service but didn't modify it at all during the year.
  • There are also a few utilities that don't report a starting balance, but do report changes and an ending balance. In that case it's hypothetically possible to calculate what the starting balance should have been.

@zaneselvans
Copy link
Member Author

zaneselvans commented Oct 28, 2022

Weird missing data issues aside, I think it's clear what the meaning of the multiple years of data is: end of year balances, for this and last year, which get reported in the same year, but have different instantaneous timestamps / dates associated with them.

We're not currently adopting the instant vs. duration model of time. Maybe we should think about that at some point, but in our current data model we want to be able to easily aggregate data within or potentially across report_year values. In the other tables we're already dropping the start/end dates and keeping report_year. To get starting/ending balances into a single report year I think want to:

  • convert end-of-last-year rows into starting_balance columns
  • convert end-of-this-year rows into ending_balance columns
  • concatenate the instant and duration tables along the column axis based on shared (entity_id, report_year) index values.

This arrangement will allow us to look at a given report_year or series of report_years, and apply the many individual additions / retirements etc to the starting balance and calculate an expected ending balance that we can compare to the reported value (hopefully getting the same value), once we've done the reshaping of the concatenated XBRL table into a tidy format.

It seems like working this into the TableTransformer classes will require a pretty different path in the process_xbrl() method, but hopefully this general path can be applied to many of the tables that require reshaping.

This was mostly a research and exploration issue. I'm going to close it and move on to implementation in #2014.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 rmi xbrl Related to the FERC XBRL transition
Projects
No open projects
Status: No status
Development

No branches or pull requests

1 participant