
Remove COLUMN_DTYPES and switch to field metadata dictionary #1408

Merged: 11 commits into dev on Jan 21, 2022

Conversation

@zaneselvans (Member) commented Jan 18, 2022

Fixed a variety of small bugs in the EIA-861 metadata.

Removed the duplicative pudl.constants.COLUMN_DTYPES dictionary, and am now referring exclusively to the data types which come from pudl.metadata.fields.

At some level this is circumventing the Field/Resource/Package API by referring directly to the field metadata dictionaries, which isn't great. The functionality that we're going for here is being able to impose data types on a random dataframe which contains some PUDL columns, even if it's not one of the canonical database tables, or contains a mix of fields from several different tables / data sources.

However, the more legitimate alternative -- of defining full Resource definitions for these other tables -- feels like a lot of overhead, when we are apparently frequently re-applying types in analyses and dataframe manipulations.

Is there a more robust way to implement this that's still low overhead for ad-hoc use?

See the individual commit messages for additional details on what was changed.

Rather than storing two separate sources of column/field type information, derive all of
the pandas data types from the now complete(-ish?) Pydantic metadata. Mostly this is
implemented by a small helper function pudl.metadata.fields.get_pandas_dtypes() which
refers to the canonical PUDL field metadata (and data group overrides) by default, but
which can also use independent field type dictionaries, so it can be used to enforce
types on analysis tables with an ad-hoc mix of canonical PUDL fields and other columns
that pertain to that analytical context.
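
As a rough sketch of the idea (the structure of FIELD_METADATA and the dtype mapping here are simplified assumptions, not the actual implementation):

from typing import Dict, Optional

from pudl.metadata.fields import FIELD_METADATA  # canonical field metadata

# Assumed, simplified mapping from metadata field types to pandas dtypes.
FIELD_TYPE_TO_PANDAS = {
    "integer": "Int64",
    "number": "float64",
    "boolean": "boolean",
    "string": "string",
    "date": "datetime64[ns]",
}

def get_pandas_dtypes(field_meta: Optional[dict] = None) -> Dict[str, str]:
    """Map every known field name to a pandas dtype.

    Uses the canonical PUDL field metadata by default, but an independent
    field type dictionary can be passed in for ad-hoc analysis tables.
    """
    field_meta = FIELD_METADATA if field_meta is None else field_meta
    return {
        name: FIELD_TYPE_TO_PANDAS.get(meta["type"], "object")
        for name, meta in field_meta.items()
    }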

One (temporary) change that was required to make this work is that all of the _year
fields are now being typed as integers, rather than years, to avoid having them be
converted into nanoseconds after Jan 1st, 1970.  When we try to handle the year, month,
and day resolutions of reporting in the harvesting, and merging dataframes of different
frequencies together, we'll want to undo this, but for now it maintains the existing
behavior and doesn't break anything, and it was a small change.
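
For a concrete illustration of the nanosecond problem (plain pandas behavior, not PUDL code): an integer year fed into a datetime conversion is interpreted as an offset in nanoseconds from the Unix epoch rather than as a calendar year.

import pandas as pd

# 2022 is read as 2022 nanoseconds after 1970-01-01, not as the year 2022.
print(pd.to_datetime(2022))  # Timestamp('1970-01-01 00:00:00.000002022')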

Closes #1270
@zaneselvans added labels on Jan 18, 2022: data-types (Dtype conversions, standardization and implications of data types), eia861 (Anything having to do with EIA Form 861), metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.)

codecov bot commented Jan 18, 2022

Codecov Report

Merging #1408 (fbe7163) into dev (0e41b1f) will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##              dev    #1408      +/-   ##
==========================================
+ Coverage   83.49%   83.53%   +0.04%     
==========================================
  Files          64       64              
  Lines        6910     6921      +11     
==========================================
+ Hits         5769     5781      +12     
+ Misses       1141     1140       -1     
Impacted Files Coverage Δ
src/pudl/extract/eia861.py 96.15% <ø> (-0.14%) ⬇️
src/pudl/metadata/enums.py 100.00% <ø> (ø)
src/pudl/metadata/resources/eia861.py 100.00% <ø> (ø)
src/pudl/analysis/allocate_net_gen.py 97.71% <100.00%> (+0.02%) ⬆️
src/pudl/analysis/mcoe.py 92.73% <100.00%> (+0.13%) ⬆️
src/pudl/constants.py 100.00% <100.00%> (ø)
src/pudl/etl.py 92.79% <100.00%> (+0.13%) ⬆️
src/pudl/extract/epacems.py 97.67% <100.00%> (-0.15%) ⬇️
src/pudl/glue/eia_epacems.py 100.00% <100.00%> (ø)
src/pudl/helpers.py 85.04% <100.00%> (-0.42%) ⬇️
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bendnorman (Member) commented:

If we want to use the Resource.format_df() method to correct dtypes for arbitrary dataframes that have fields specified in FIELD_METADATA, we could add a from_dataframe() method to the Resource class. Something like this:

class Resource(Base):
    ...

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> "Resource":
        """Create a Resource for a dataframe whose fields exist in the metadata system."""
        fields = []
        for field in df.columns:
            # This will fail if `field` is not specified in FIELD_METADATA
            fields.append(Field.dict_from_id(field))
        schema = {"fields": fields}
        # Add any additional required params for Resource schemas (e.g. a name)
        return cls(schema=schema)

Use:

# An arbitrary dataframe with fields that exist in FIELD_METADATA
gen_df = pd.DataFrame({"fuel_type": ["gas", "coal"], "generator_id": ["1A", "2B"]})

# create the schema
gen_schema = Resource.from_dataframe(gen_df)

# validate and format the dataframe
gen_df = gen_schema.format_df(gen_df)

@bendnorman (Member) left a comment:

Left a few comments and the proposal above.

Review threads:
src/pudl/helpers.py (outdated, resolved)
src/pudl/metadata/resources/eia861.py (resolved)
src/pudl/transform/eia861.py (outdated, resolved)
src/pudl/transform/eia861.py (resolved)
@zaneselvans (Member, Author) commented:

Re: using Resource.format_df(): in a lot of cases some but not all of the columns are specified in FIELD_METADATA. Is there a straightforward way to make this apply only to the fields which do match, and ignore the rest? Is that really the desired behavior @cmgosnell?

@bendnorman (Member) commented:

Resource.from_dataframe() could create Field() objects for each field not in FIELD_METADATA using the field's dtype.
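
A minimal sketch of what that fallback could look like (hypothetical helper and dtype mapping, not part of the actual proposal):

import pandas as pd

# Assumed mapping from pandas dtype kinds to metadata field types.
DTYPE_KIND_TO_FIELD_TYPE = {
    "i": "integer",
    "u": "integer",
    "f": "number",
    "b": "boolean",
    "M": "datetime",
    "O": "string",
}

def field_dict_from_series(name: str, s: pd.Series) -> dict:
    """Build a minimal Field dict from a column's own pandas dtype."""
    return {"name": name, "type": DTYPE_KIND_TO_FIELD_TYPE.get(s.dtype.kind, "string")}

from_dataframe() could then try Field.dict_from_id(col) first, and fall back to
field_dict_from_series(col, df[col]) for columns with no canonical metadata.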

@zaneselvans (Member, Author) commented:

Ways that pudl.helpers.get_pudl_dtype() and pudl.helpers.get_pudl_dtypes() are currently being used:

  • To create dictionaries for use with df.astype()
  • Often only applying types to a small subset of the fields in a dataframe.
  • Currently the types are always coming from a single data source (eia)

Tracked down all the instances in which we were using `pudl.helpers.get_pudl_dtypes()`
and replaced them with the new function call. This is an incremental step toward
standardizing the way we enforce data types on dataframes even when they aren't fully
specified by our metadata / database schema information. This affected the way
`pudl.helpers.clean_merge_asof()` is called as well.

Removed `pudl.helpers.get_pudl_dtype()` (singular) as it was duplicative given the
`pudl.metadata.fields.get_pandas_dtypes()` function.

It turns out that in most if not all of the cases where we are applying the canonical
PUDL dtypes, what we're actually doing is forcing all non-floats to use nullable Pandas
dtypes... and the built-in df.convert_dtypes() method does this.

I've hunted down and replaced all usage of get_pudl_dtypes() with convert_dtypes(), which
should avoid the need for our own typing infrastructure in almost all cases.

Excluding the conversion of floating point values is necessary since the pandas nullable
floats don't work with all of the Numpy functions.

There may be other instances (e.g. boolean columns with NA values that we use as masks)
where this doesn't work either... but haven't come across any just yet.
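
A minimal illustration of that pattern (plain pandas; convert_floating=False is the flag that leaves float columns as numpy floats):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "utility_name_eia": ["Acme Power", None],  # object -> nullable string
    "report_year": [2020, 2021],               # int64  -> nullable Int64
    "capacity_mw": [100.5, np.nan],            # stays float64
})

# Convert to nullable pandas dtypes, but skip the nullable Float64 dtype so
# numpy functions that don't understand it keep working.
df = df.convert_dtypes(convert_floating=False)
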
Consolidated the PUDL data type application functions into pudl.metadata.fields:
* get_pudl_dtypes() gets an exhaustive list of all defined fields and their types.
* apply_pudl_dtypes(df) applies the relevant dtypes to the columns in a dataframe (see the sketch below).
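
A rough sketch of what the apply function does conceptually (simplified: the real dtype lookup comes from the field metadata rather than a hard-coded dict):

import pandas as pd

# Assumed stand-in for the {field_name: pandas_dtype} mapping that
# get_pudl_dtypes() would return.
PUDL_DTYPES = {"plant_id_eia": "Int64", "utility_name_eia": "string"}

def apply_pudl_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply canonical PUDL dtypes to whichever known columns appear in df."""
    return df.astype({col: dt for col, dt in PUDL_DTYPES.items() if col in df.columns})

# Columns the metadata doesn't know about (here "custom_ratio") are left untouched.
df = pd.DataFrame({"plant_id_eia": [1.0, 2.0], "custom_ratio": [0.5, 0.6]})
df = apply_pudl_dtypes(df)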

Swapped in apply_pudl_dtypes() in the places where I had just switched to using
df.convert_dtypes() since that had a fair number of bad side effects, particularly with
respect to datetime columns.

In almost all cases simply applying PUDL data types to dataframes is enough to make them
behave well in merges etc. The exceptions are currently:

* `pudl.transform.eia861_compare_totals()` (which has issues w/ bools & NA values) and
* The end of the main eia, eia861, and ferc1 transform steps where we really are
  cleaning up bad values in addition to applying canonical data types. In these
  cases it's being called indirectly by `convert_dfs_dict_dtypes()`

There are also some calls to convert_cols_dtypes() in
test/unit/analysis/allocate_net_gen.py but I'm not sure if those can/should be switched.

Removed some additional unnecessary calls to convert_cols_dtypes, and replaced calls to
convert_dfs_dict_dtypes with dictionary comprehensions using convert_cols_dtypes.
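
Roughly the shape of that replacement (hypothetical table names and dataframes; the data_source keyword is an assumption about the convert_cols_dtypes signature):

import pandas as pd
import pudl

# Hypothetical dict of transformed tables keyed by table name.
eia861_dfs = {
    "sales_eia861": pd.DataFrame({"report_date": ["2020-01-01"], "sales_mwh": [1.0]}),
}

# One dict comprehension replaces a single convert_dfs_dict_dtypes() call,
# typing each dataframe individually.
eia861_dfs = {
    name: pudl.helpers.convert_cols_dtypes(df, data_source="eia")
    for name, df in eia861_dfs.items()
}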

Replaced the use of convert_cols_dtypes that was being applied to the entity tables post
harvesting with apply_pudl_dtypes() since all of the input values should have been
cleaned up prior to harvesting.

The remaining uses of convert_cols_dtypes are:
  * At the end of the EIA transformations, in the ETL module.
  * At the end of the EIA-861 transformations (analogous to the above)
  * At the end of the FERC-1 transformations
  * One use in the midst of the EIA-861 transformations to deal with bools/NAs.

Did a little bit of debugging of #1371 but ultimately left it for @cmgosnell. Did rename
the allocate_net_gen test module so pytest will pick it up, but marked the tests xfail.

Ran `tox -e nuke` overnight to see if any of the previous changes had affected the data
validations. The only changes were a few additional rows showing up in the
generation_fuel_eia923 and generation_fuel_nuclear_eia923 tables. I updated the expected
row counts appropriately... but not 100% sure what the changes there would be. Should
compare with a previous version of the DB.
Also looked into why there are a few more records showing up in the generation_fuel and
generation_fuel_nuclear tables now... It's not clear, but they look like valid records,
and they are all from 2001.

It seems that this change happened after we released PUDL v0.5.0 and before I started
messing around with the data types though -- the "additional" records were present in a
copy of the PUDL DB that I had from 2021-11-28, but were not there on 2021-11-11.
@zaneselvans (Member, Author) commented:

I think building an ad-hoc Resource using from_dataframe() is a good approach. I'm not totally familiar with what format_df does right now though, and I'm going to be messing around with that stuff in the harvesting, so I kind of want to put it off for the moment and go with the current solution.

I think we'll also want to make it easier to inject types for ad-hoc dataframes which don't have Resource definitions, but @cmgosnell said that in the depreciation cases where she was doing this, just using df.convert_dtypes() did the right thing, so it wasn't necessary yet.

@zaneselvans merged commit 22a2f07 into dev on Jan 21, 2022.
@zaneselvans deleted the unify-dtypes branch on January 21, 2022 at 19:31.
@zaneselvans mentioned this pull request on Mar 10, 2022 and Mar 11, 2022.
Development

Successfully merging this pull request may close these issues.

Replace COLUMNS_DTYPES dictionary with Pydantic metadata
Compile eia861 metadata using new system