Remove COLUMN_DTYPES and switch to field metadata dictionary #1408
Conversation
Rather than storing two separate sources of column/field type information, derive all of the pandas data types from the now complete(-ish?) Pydantic metadata. Mostly this is implemented by a small helper function `pudl.metadata.fields.get_pandas_dtypes()`, which refers to the canonical PUDL field metadata (and data group overrides) by default, but which can also use independent field type dictionaries, so it can be used to enforce types on analysis tables that mix canonical PUDL fields with other columns pertaining to that analytical context.

One (temporary) change that was required to make this work is that all of the `_year` fields are now being typed as integers rather than years, to avoid having them converted into nanoseconds after Jan 1st, 1970. We'll want to undo this when we handle year, month, and day reporting resolutions in the harvesting process and merge dataframes of different frequencies together, but for now it maintains the existing behavior, doesn't break anything, and was a small change.

Closes #1270
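A minimal sketch of the intended usage, assuming `get_pandas_dtypes()` returns a `{field_name: pandas_dtype}` mapping for the canonical PUDL fields (the exact signature may differ, and the column names here are purely illustrative):

```python
import pandas as pd
from pudl.metadata.fields import get_pandas_dtypes

# An ad-hoc analysis dataframe mixing canonical PUDL fields with
# analysis-specific columns:
df = pd.DataFrame({
    "plant_id_eia": [1, 2],          # canonical PUDL field
    "my_custom_metric": [0.3, 0.7],  # hypothetical analysis-specific column
})
# Apply the canonical dtypes only to the columns PUDL knows about,
# leaving analysis-specific columns untouched:
dtypes = get_pandas_dtypes()
df = df.astype({col: dtypes[col] for col in df.columns if col in dtypes})
```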
Codecov Report
```diff
@@            Coverage Diff             @@
##              dev    #1408      +/-   ##
==========================================
+ Coverage   83.49%   83.53%   +0.04%
==========================================
  Files          64       64
  Lines        6910     6921      +11
==========================================
+ Hits         5769     5781      +12
+ Misses       1141     1140       -1
```
Continue to review full report at Codecov.
If we want to use `Resource.format_df()` to correct dtypes for arbitrary dataframes whose fields are specified in the metadata, we could add a classmethod along these lines:

```python
class Resource(Base):
    ...
    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> "Resource":
        """Create a Resource for a dataframe whose fields exist in the metadata system."""
        fields = []
        for field in df.columns:
            # This will fail if `field` is not specified in FIELD_METADATA
            fields.append(Field.dict_from_id(field))
        schema = {"fields": fields}
        # Add any additional required params for Resource schemas
        return cls(schema=schema)
```

Use:

```python
# An arbitrary dataframe with fields that exist in FIELD_METADATA
gen_df = pd.DataFrame({"fuel_type": ["gas", "coal"], "generator_id": ["1A", "2B"]})
# Create the schema
gen_schema = Resource.from_dataframe(gen_df)
# Validate and format the dataframe
gen_df = gen_schema.format_df(gen_df)
```
Left a few comments and the proposal above.
Tracked down all the instances in which we were using `pudl.helpers.get_pudl_dtypes()` and replaced them with the new function call. This is an incremental step toward standardizing the way we enforce data types on dataframes, even when they aren't fully specified by our metadata / database schema information. This affected the way `pudl.helpers.clean_merge_asof()` is called as well. Removed `pudl.helpers.get_pudl_dtype()` (singular), since it was duplicative given the `pudl.metadata.fields.get_pandas_dtypes()` function.
It turns out that in most if not all of the cases where we are applying the canonical PUDL dtypes, what we're actually doing is forcing all non-float columns to use nullable pandas dtypes... and the built-in `df.convert_dtypes()` method does exactly this. I've hunted down and replaced all usage of `get_pudl_dtypes()` with `convert_dtypes()`, which should avoid the need for our own typing infrastructure in almost all cases. Excluding the conversion of floating point values is necessary since the pandas nullable floats don't work with all of the NumPy functions. There may be other instances (e.g. boolean columns with NA values that we use as masks) where this doesn't work either... but I haven't come across any just yet.
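A minimal sketch of that pattern; the column names are just illustrative, and this relies on the `convert_floating` flag added to pandas in 1.2:

```python
import pandas as pd

df = pd.DataFrame({
    "plant_id_eia": [1, 2, None],         # read in as float64 with NaN
    "capacity_mw": [100.5, 200.1, None],  # genuinely fractional floats
    "fuel_type": ["gas", "coal", None],   # object
})
# Convert everything to nullable pandas dtypes (Int64, string, boolean),
# but leave floats as numpy float64, since the nullable Float64 dtype
# doesn't work with all NumPy functions.
df = df.convert_dtypes(convert_floating=False)
# plant_id_eia -> Int64, capacity_mw -> float64, fuel_type -> string
```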
Consolidated the PUDL data type application functions into `pudl.metadata.fields`:

* `get_pudl_dtypes()` gets an exhaustive mapping of all defined fields to their types.
* `apply_pudl_dtypes(df)` applies the relevant dtypes to the columns in a dataframe.

Swapped in `apply_pudl_dtypes()` in the places where I had just switched to using `df.convert_dtypes()`, since that had a fair number of bad side effects, particularly with respect to datetime columns. A sketch of the intended usage follows.
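A hedged sketch of how these two helpers fit together; the exact signatures (e.g. any data-group argument) may differ from what's shown:

```python
import pandas as pd
from pudl.metadata.fields import apply_pudl_dtypes, get_pudl_dtypes

df = pd.DataFrame({
    "report_date": ["2020-01-01", "2020-02-01"],  # canonical PUDL field
    "my_analysis_col": [1.5, 2.5],                # hypothetical, not in the metadata
})
# apply_pudl_dtypes() retypes only the columns that appear in the PUDL
# field metadata, passing everything else through unchanged.
df = apply_pudl_dtypes(df)

# The underlying {field_name: dtype} mapping can also be inspected directly:
print(get_pudl_dtypes()["report_date"])  # e.g. datetime64[ns]
```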
In almost all cases, simply applying PUDL data types to dataframes is enough to make them behave well in merges etc. The exceptions are currently:

* `pudl.transform.eia861_compare_totals()` (which has issues w/ bools & NA values), and
* The end of the main eia, eia861, and ferc1 transform steps, where we really are cleaning up bad values in addition to applying canonical data types. In these cases it's being called indirectly by `convert_dfs_dict_dtypes()`.

There are also some calls to `convert_cols_dtypes()` in test/unit/analysis/allocate_net_gen.py, but I'm not sure if those can/should be switched.
Removed some additional unnecessary calls to `convert_cols_dtypes()`, and replaced calls to `convert_dfs_dict_dtypes()` with dictionary comprehensions using `convert_cols_dtypes()` (see the sketch after this comment). Replaced the use of `convert_cols_dtypes()` that was being applied to the entity tables post harvesting with `apply_pudl_dtypes()`, since all of the input values should have been cleaned up prior to harvesting. The remaining uses of `convert_cols_dtypes()` are:

* At the end of the EIA transformations, in the ETL module.
* At the end of the EIA-861 transformations (analogous to the above).
* At the end of the FERC-1 transformations.
* One use in the midst of the EIA-861 transformations, to deal with bools/NAs.

Did a little bit of debugging of #1371 but ultimately left it for @cmgosnell. Did rename the allocate_net_gen test module so pytest will pick it up, but marked the tests xfail.

Ran `tox -e nuke` overnight to see if any of the previous changes had affected the data validations. The only changes were a few additional rows showing up in the generation_fuel_eia923 and generation_fuel_nuclear_eia923 tables. I updated the expected row counts appropriately... but I'm not 100% sure what the changes there would be. Should compare with a previous version of the DB.
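A hedged sketch of the dictionary-comprehension replacement, assuming `convert_cols_dtypes()` accepts the dataframe plus a `data_source` argument (check the actual signature in `pudl.helpers`):

```python
import pandas as pd
import pudl

dfs = {"generators": pd.DataFrame({"plant_id_eia": [1, 2]})}
# Instead of convert_dfs_dict_dtypes(dfs, "eia"), convert each dataframe
# in the dict individually:
dfs = {
    name: pudl.helpers.convert_cols_dtypes(df, data_source="eia")
    for name, df in dfs.items()
}
```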
Also looked into why there are a few more records showing up in the generation_fuel and generation_fuel_nuclear tables now... It's not clear, but they look like valid records, and they are all from 2001. It seems that this change happened after we released PUDL v0.5.0 and before I started messing around with the data types though -- the "additional" records were present in a copy of the PUDL DB that I had from 2021-11-28, but were not there on 2021-11-11.
I think we'll also want to make it easier to inject types for ad-hoc dataframes which don't have Resource definitions, but @cmgosnell said that in the depreciation cases where she was doing this, just using ...
Fixed a variety of small bugs in the EIA-861 metadata.
Removed the duplicative `pudl.constants.COLUMN_DTYPES` dictionary, and am now referring exclusively to the data types which come from `pudl.metadata.fields`.
At some level this is circumventing the Field/Resource/Package API by referring directly to the field metadata dictionaries, which isn't great. The functionality that we're going for here is being able to impose data types on a random dataframe which contains some PUDL columns, even if it's not one of the canonical database tables, or contains a mix of fields from several different tables / data sources.
However, the more legitimate alternative -- defining full Resource definitions for these other tables -- feels like a lot of overhead, given how frequently we're apparently re-applying types in analyses and dataframe manipulations.
Is there a more robust way to implement this that's still low overhead for ad-hoc use?
See the individual commit messages for additional details on what was changed.